TY - GEN
T1 - A Weight Ternary Ensemble Vision Transformer Toward Memory Size Reduction
AU - Kayanoma, Ryota
AU - Nakahara, Hiroki
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - Vision Transformer (ViT) is an image recognition model with high accuracy; however, its parameter size is large. A ternary representation that introduces zero into the binary representation {-1, +1} can be applied to reduce the parameter size. We define this model as a weight ternary ViT. Although the weight ternary ViT reduces the parameter size, its recognition accuracy decreases compared to the float32-precision ViT. Therefore, we introduce an ensemble method that computes multiple weight ternary ViTs in parallel; recognition accuracy is improved by majority voting over the ensemble. In this paper, we propose a method for adding a normalization layer to advance the training of a weight ternary ViT, together with a training algorithm. We then describe the training method for an ensemble of weight ternary ViTs. We evaluate the ensemble weight ternary ViT on CIFAR-10, an image classification benchmark. As a result, using ten weight ternary ViTs with 80% zero representation applied, we achieve recognition accuracy equivalent to the original float32-precision ViT while reducing the parameter size by 87.5%.
AB - Vision Transformer (ViT) is an image recognition model with high accuracy; however, its parameter size is large. A ternary representation that introduces zero into the binary representation {-1, +1} can be applied to reduce the parameter size. We define this model as a weight ternary ViT. Although the weight ternary ViT reduces the parameter size, its recognition accuracy decreases compared to the float32-precision ViT. Therefore, we introduce an ensemble method that computes multiple weight ternary ViTs in parallel; recognition accuracy is improved by majority voting over the ensemble. In this paper, we propose a method for adding a normalization layer to advance the training of a weight ternary ViT, together with a training algorithm. We then describe the training method for an ensemble of weight ternary ViTs. We evaluate the ensemble weight ternary ViT on CIFAR-10, an image classification benchmark. As a result, using ten weight ternary ViTs with 80% zero representation applied, we achieve recognition accuracy equivalent to the original float32-precision ViT while reducing the parameter size by 87.5%.
KW - Ensemble Model
KW - Multiple-Valued Logic
KW - Vision Transformer
UR - http://www.scopus.com/inward/record.url?scp=85203154029&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85203154029&partnerID=8YFLogxK
U2 - 10.1109/ISMVL60454.2024.00038
DO - 10.1109/ISMVL60454.2024.00038
M3 - Conference contribution
AN - SCOPUS:85203154029
T3 - Proceedings of The International Symposium on Multiple-Valued Logic
SP - 155
EP - 160
BT - Proceedings - 2024 IEEE 54th International Symposium on Multiple-Valued Logic, ISMVL 2024
PB - IEEE Computer Society
T2 - 54th IEEE International Symposium on Multiple-Valued Logic, ISMVL 2024
Y2 - 28 May 2024 through 30 May 2024
ER -