(Full PyTorch Code Reference: https://daebaq27.tistory.com/112)
Transformer (Attention) Concept
Attention is All You Need
(https://frost-crate-a82.notion.site/Attention-is-All-You-Need-e5b8aca9d98c4056a75b8301256cd47e?pvs=4)
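For quick reference, here is a minimal PyTorch sketch of the scaled dot-product attention from the paper above, Attention(Q, K, V) = softmax(QKᵀ / √d_k) V. The tensor shapes (batch 1, 4 tokens, d_k = 8) are arbitrary values chosen only for illustration:

```python
import math
import torch

# Minimal scaled dot-product attention sketch.
# Shapes are illustrative assumptions: batch 1, 4 tokens, d_k = 8.
q = torch.randn(1, 4, 8)
k = torch.randn(1, 4, 8)
v = torch.randn(1, 4, 8)

scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))  # (1, 4, 4)
weights = scores.softmax(dim=-1)  # each row sums to 1
out = weights @ v                 # (1, 4, 8)
print(out.shape)
```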
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
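The title above captures the core ViT idea: split the image into fixed-size patches and treat each patch as a token ("word"). A minimal sketch of this patch embedding, assuming ViT-Base-style hyperparameters (224×224 RGB input, 16×16 patches, embedding dim 768):

```python
import torch
import torch.nn as nn

# Patch embedding sketch (assumed hyperparameters: 224x224 RGB input,
# 16x16 patches, embedding dim 768). A Conv2d with kernel_size = stride
# = patch_size slices the image into non-overlapping patches and applies
# a shared linear projection to each one.
patch_size, embed_dim = 16, 768
to_patches = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)

x = torch.randn(1, 3, 224, 224)              # dummy image batch
patches = to_patches(x)                      # (1, 768, 14, 14)
tokens = patches.flatten(2).transpose(1, 2)  # (1, 196, 768): 196 "words"
print(tokens.shape)
```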


Pros of ViT
- Uses the standard Transformer architecture almost unchanged → scalable
- Transformers are proven to perform well in large-scale training → ViT inherits this benefit
- Transfer learning → requires fewer computational resources than CNNs
Cons of ViT
- Requires more data than CNNs due to its lack of inductive bias
(inductive bias: the set of assumptions a model uses to predict outputs for inputs it has not seen)
(ex. inductive biases of CNNs → translation equivariance, locality; see the sketch below)
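A short sketch of the translation equivariance mentioned above: shifting a convolution's input shifts its output by the same amount, a property self-attention does not have built in. The layer and shapes here are arbitrary choices for illustration:

```python
import torch
import torch.nn as nn

# Translation equivariance of convolution: shifting the input shifts
# the output by the same amount (up to padding effects at the border).
conv = nn.Conv2d(1, 1, kernel_size=3, padding=1, bias=False)

x = torch.zeros(1, 1, 8, 8)
x[0, 0, 2, 2] = 1.0                                  # a single bright pixel
shifted = torch.roll(x, shifts=(1, 1), dims=(2, 3))  # move it down-right

y = conv(x)
y_shifted = conv(shifted)
# The response to the shifted input equals the shifted response
# (compared on interior pixels; borders can differ due to zero padding).
print(torch.allclose(torch.roll(y, (1, 1), (2, 3))[..., 1:-1, 1:-1],
                     y_shifted[..., 1:-1, 1:-1], atol=1e-6))
```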