(Full PyTorch Code Reference: https://github.com/csm-kr/swin_transformer_pytorch)
(Code Explanation Reference: https://csm-kr.tistory.com/86)
Purpose
Previous Problems:
- The existing ViT model was proposed for solving image classification problems
- Unlike text, an image has its own characteristics (e.g., locality, varying scale), which ViT does not reflect
- As the number of tokens increases, the amount of computation increases quadratically
Purpose:
- A model that can be used as a backbone for a variety of vision tasks
- A method to reflect the characteristics of an image in the Transformer architecture
- A method to achieve less computation than the existing ViT model
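The computation argument can be illustrated with a rough cost model (a sketch, not the paper's exact FLOP formula): global self-attention scales quadratically in the token count N, while attention restricted to fixed-size M x M local windows scales linearly in N. The function names and the simplified counting below are my own assumptions for illustration.

```python
def global_attention_cost(N, C):
    # Global self-attention over all N tokens of dimension C:
    # Q @ K^T and the attention-weighted sum over V each cost
    # roughly N * N * C multiply-adds -> quadratic in N.
    return 2 * N * N * C

def window_attention_cost(N, M, C):
    # Attention restricted to non-overlapping M x M windows:
    # n = N / M^2 windows, each costing ~2 * (M^2)^2 * C,
    # so the total is 2 * N * M^2 * C -> linear in N for fixed M.
    n = N // (M * M)
    return n * 2 * (M * M) ** 2 * C

# Doubling N quadruples the global cost but only doubles the windowed cost.
print(global_attention_cost(3136, 96))      # 56x56 = 3136 tokens, C = 96
print(window_attention_cost(3136, 7, 96))   # M = 7, as in Swin-T
```

With N = 3136 and M = 7 the windowed cost is smaller by a factor of (N / M^2) = 64, which is what makes the model usable as a backbone at high resolutions.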
Notation
- M: size of local window
- n: number of local windows
- patch: a token obtained by dividing the image into small regions
- P_h, P_w: size of a patch
- N(=N_h x N_w): number of patches (tokens)
- B: size of batch
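The notation above can be tied together with a window-partition sketch: a (B, N_h, N_w, C) grid of patch tokens is split into n non-overlapping M x M windows. NumPy is used here to keep the example self-contained; the PyTorch version in the referenced repo does the same thing with `view`/`permute`. The function name is my own.

```python
import numpy as np

def window_partition(x, M):
    # x: (B, N_h, N_w, C) grid of N = N_h * N_w patch tokens.
    # Splits the grid into non-overlapping M x M local windows and
    # flattens each window into M*M tokens, giving (B * n, M*M, C)
    # where n = (N_h // M) * (N_w // M) is the number of windows.
    B, N_h, N_w, C = x.shape
    x = x.reshape(B, N_h // M, M, N_w // M, M, C)
    x = x.transpose(0, 1, 3, 2, 4, 5)  # group the two window axes together
    return x.reshape(-1, M * M, C)

x = np.random.randn(2, 56, 56, 96)  # B=2, N=56x56 patches, C=96
w = window_partition(x, M=7)
print(w.shape)  # (128, 49, 96): n = 64 windows per image, M*M = 49 tokens each
```

Attention is then computed independently inside each of the B * n windows, which is where the linear-in-N cost comes from.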