(Full PyTorch Code Reference: https://daebaq27.tistory.com/112)
Transformer (Attention) Concept
Attention is All You Need
(https://frost-crate-a82.notion.site/Attention-is-All-You-Need-e5b8aca9d98c4056a75b8301256cd47e?pvs=4)
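For quick reference, here is a minimal PyTorch sketch of the scaled dot-product attention from the paper above, Attention(Q, K, V) = softmax(QKᵀ / √d_k) V. The tensor shapes (batch 1, 4 tokens, d_k = 8) are arbitrary values chosen only for illustration:

```python
import math
import torch

# Minimal scaled dot-product attention sketch.
# Shapes are illustrative assumptions: batch 1, 4 tokens, d_k = 8.
q = torch.randn(1, 4, 8)
k = torch.randn(1, 4, 8)
v = torch.randn(1, 4, 8)

scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))  # (1, 4, 4)
weights = scores.softmax(dim=-1)  # each row sums to 1
out = weights @ v                 # (1, 4, 8)
print(out.shape)
```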
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
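The title above captures the core ViT idea: split the image into fixed-size patches and treat each patch as a token ("word"). A minimal sketch of this patch embedding, assuming ViT-Base-style hyperparameters (224×224 RGB input, 16×16 patches, embedding dim 768):

```python
import torch
import torch.nn as nn

# Patch embedding sketch (assumed hyperparameters: 224x224 RGB input,
# 16x16 patches, embedding dim 768). A Conv2d with kernel_size = stride
# = patch_size slices the image into non-overlapping patches and applies
# a shared linear projection to each one.
patch_size, embed_dim = 16, 768
to_patches = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)

x = torch.randn(1, 3, 224, 224)              # dummy image batch
patches = to_patches(x)                      # (1, 768, 14, 14)
tokens = patches.flatten(2).transpose(1, 2)  # (1, 196, 768): 196 "words"
print(tokens.shape)
```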


Pros of ViT
- Uses the standard Transformer architecture almost unchanged → scalable
- Transformers are proven to perform well in large-scale training → ViT inherits this benefit
- Transfer learning → requires fewer computational resources than CNNs
Cons of ViT
- Requires more data than CNNs due to its lack of inductive bias
(inductive bias: the set of assumptions a model uses to predict outputs for inputs it has not seen)
(ex. inductive biases of CNNs → translation equivariance, locality; see the sketch below)
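A short sketch of the translation equivariance mentioned above: shifting a convolution's input shifts its output by the same amount, a property self-attention does not have built in. The layer and shapes here are arbitrary choices for illustration:

```python
import torch
import torch.nn as nn

# Translation equivariance of convolution: shifting the input shifts
# the output by the same amount (up to padding effects at the border).
conv = nn.Conv2d(1, 1, kernel_size=3, padding=1, bias=False)

x = torch.zeros(1, 1, 8, 8)
x[0, 0, 2, 2] = 1.0                                  # a single bright pixel
shifted = torch.roll(x, shifts=(1, 1), dims=(2, 3))  # move it down-right

y = conv(x)
y_shifted = conv(shifted)
# The response to the shifted input equals the shifted response
# (compared on interior pixels; borders can differ due to zero padding).
print(torch.allclose(torch.roll(y, (1, 1), (2, 3))[..., 1:-1, 1:-1],
                     y_shifted[..., 1:-1, 1:-1], atol=1e-6))
```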