Tokenizing → Word Tokens
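
A minimal sketch of this step, assuming a simple regex-based word tokenizer. The function name `tokenize` and the lowercasing rule are illustrative choices, not something the notes specify; real pipelines often use subword tokenizers instead.

```python
import re


def tokenize(sentence: str) -> list[str]:
    """Lowercase the sentence and split it into word and punctuation tokens."""
    # \w+ matches runs of word characters; [^\w\s] matches single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", sentence.lower())


tokens = tokenize("The cat sat on the mat.")
print(tokens)
```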

Embedding → One Dense Vector per Token (Original Data?)
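
As a rough illustration, embedding can be sketched as a table lookup that maps each token to a dense vector. The table here is filled with random numbers purely for demonstration; `EMBED_DIM`, `build_embedding_table`, and `embed` are hypothetical names, and in practice the table would be learned or loaded from pretrained vectors.

```python
import random

EMBED_DIM = 4  # hypothetical embedding size, chosen only for the example


def build_embedding_table(vocab, dim=EMBED_DIM, seed=0):
    """Assign each vocabulary word a random dense vector (stand-in for learned weights)."""
    rng = random.Random(seed)
    return {word: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for word in vocab}


def embed(tokens, table):
    """Look up one dense vector per token, preserving token order."""
    return [table[token] for token in tokens]


tokens = ["the", "cat", "sat"]
table = build_embedding_table(sorted(set(tokens)))
vectors = embed(tokens, table)
print(len(vectors), len(vectors[0]))
```

Each token is embedded independently of its neighbours here, so the same word always gets the same vector regardless of context.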

Encoding → Sentence Matrix
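
The encoding step turns the independent token vectors into a sentence matrix whose rows reflect surrounding context. The sketch below is a deliberately simple stand-in, assuming a neighbour-averaging window in place of a real encoder such as a BiLSTM or Transformer layer; `encode` and `window` are hypothetical names.

```python
def encode(vectors, window=1):
    """Return a sentence matrix with one context-aware row per token.

    Each row is the mean of the token's own vector and its neighbours
    within `window` positions (a toy substitute for a learned encoder).
    """
    n = len(vectors)
    dim = len(vectors[0])
    matrix = []
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        neighbours = vectors[lo:hi]
        row = [sum(v[d] for v in neighbours) / len(neighbours) for d in range(dim)]
        matrix.append(row)
    return matrix


vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
sentence_matrix = encode(vectors)
print(sentence_matrix)
```

The output keeps the input shape (one row per token), but each row now depends on more than one token.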

Attention (between the Sentence Matrix and a Context Vector, or the Sentence Matrix itself) → Single Vector
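
One way to picture this reduction, assuming plain dot-product scoring followed by a softmax: every row of the sentence matrix is scored against the context vector, and the rows are then averaged using those scores as weights. The name `attend` and the scoring function are assumptions for illustration; the notes do not say which attention variant is meant.

```python
import math


def attend(sentence_matrix, context):
    """Collapse a sentence matrix into a single vector using dot-product attention."""
    # Score each row against the context vector.
    scores = [sum(r * c for r, c in zip(row, context)) for row in sentence_matrix]
    # Softmax, shifted by the max score for numerical stability.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the rows gives the single summary vector.
    dim = len(sentence_matrix[0])
    summary = [
        sum(w * row[d] for w, row in zip(weights, sentence_matrix))
        for d in range(dim)
    ]
    return summary, weights


sentence_matrix = [[1.0, 0.0], [0.0, 1.0]]
summary, weights = attend(sentence_matrix, context=[0.0, 0.0])
print(summary, weights)
```

With a zero context vector all rows score equally, so the summary is the plain average of the rows. Passing a context vector derived from the sentence matrix itself gives the self-referential variant.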

Prediction
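
The final step can be sketched as a linear layer plus softmax over the single vector from the attention step. The weights, bias, and label names below are made-up values for illustration only; `predict` is a hypothetical helper, not a specific library API.

```python
import math


def predict(summary, weights, bias, labels):
    """Map a summary vector to class probabilities and return the most likely label."""
    # One logit per class: dot product of the class weights with the summary, plus bias.
    logits = [
        sum(w * x for w, x in zip(row, summary)) + b
        for row, b in zip(weights, bias)
    ]
    # Softmax, shifted by the max logit for numerical stability.
    max_logit = max(logits)
    exps = [math.exp(z - max_logit) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best], probs


label, probs = predict(
    summary=[0.5, 0.5],
    weights=[[1.0, 0.0], [0.0, -1.0]],  # illustrative, untrained weights
    bias=[0.0, 0.0],
    labels=["positive", "negative"],
)
print(label, probs)
```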

ViT & CLIP-1.jpg

ViT & CLIP-2.jpg

ViT & CLIP-3.jpg

ViT & CLIP-4.jpg