(Full PyTorch Code Reference: https://github.com/csm-kr/swin_transformer_pytorch)
(Code Explanation Reference: https://csm-kr.tistory.com/86)
Purpose
Previous Problems:
- The existing ViT model was proposed for solving image classification problems
- Unlike text, an image has its own characteristics (e.g., locality, varying scale), which ViT does not reflect
- As the number of tokens increases, the amount of computation increases quadratically
Purpose:
- A model that can be used as a backbone for a variety of vision tasks
- A method to reflect the characteristics of an image in the Transformer architecture
- A method to achieve less computation than the existing ViT model
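The computation argument can be illustrated with a rough cost model (a sketch, not the paper's exact FLOP formula): global self-attention scales quadratically in the token count N, while attention restricted to fixed-size M x M local windows scales linearly in N. The function names and the simplified counting below are my own assumptions for illustration.

```python
def global_attention_cost(N, C):
    # Global self-attention over all N tokens of dimension C:
    # Q @ K^T and the attention-weighted sum over V each cost
    # roughly N * N * C multiply-adds -> quadratic in N.
    return 2 * N * N * C

def window_attention_cost(N, M, C):
    # Attention restricted to non-overlapping M x M windows:
    # n = N / M^2 windows, each costing ~2 * (M^2)^2 * C,
    # so the total is 2 * N * M^2 * C -> linear in N for fixed M.
    n = N // (M * M)
    return n * 2 * (M * M) ** 2 * C

# Doubling N quadruples the global cost but only doubles the windowed cost.
print(global_attention_cost(3136, 96))      # 56x56 = 3136 tokens, C = 96
print(window_attention_cost(3136, 7, 96))   # M = 7, as in Swin-T
```

With N = 3136 and M = 7 the windowed cost is smaller by a factor of (N / M^2) = 64, which is what makes the model usable as a backbone at high resolutions.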
Notation
- M: size of local window
- n: number of local windows
- patch: a token obtained by dividing the image into small regions
- P_h, P_w: size of a patch
- N(=N_h x N_w): number of patches (tokens)
- B: size of batch
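The notation above can be tied together with a window-partition sketch: a (B, N_h, N_w, C) grid of patch tokens is split into n non-overlapping M x M windows. NumPy is used here to keep the example self-contained; the PyTorch version in the referenced repo does the same thing with `view`/`permute`. The function name is my own.

```python
import numpy as np

def window_partition(x, M):
    # x: (B, N_h, N_w, C) grid of N = N_h * N_w patch tokens.
    # Splits the grid into non-overlapping M x M local windows and
    # flattens each window into M*M tokens, giving (B * n, M*M, C)
    # where n = (N_h // M) * (N_w // M) is the number of windows.
    B, N_h, N_w, C = x.shape
    x = x.reshape(B, N_h // M, M, N_w // M, M, C)
    x = x.transpose(0, 1, 3, 2, 4, 5)  # group the two window axes together
    return x.reshape(-1, M * M, C)

x = np.random.randn(2, 56, 56, 96)  # B=2, N=56x56 patches, C=96
w = window_partition(x, M=7)
print(w.shape)  # (128, 49, 96): n = 64 windows per image, M*M = 49 tokens each
```

Attention is then computed independently inside each of the B * n windows, which is where the linear-in-N cost comes from.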