(Full PyTorch Code Reference: https://github.com/hanouticelina/deformable-DETR)
Background
DETR
https://frost-crate-a82.notion.site/End-to-End-Object-Detection-with-Transformers-513a58b0b55f4b5bafdb596f862f5601?pvs=4
Deformable Convolution
(A normal convolution samples at fixed grid locations determined by the kernel size.)
Deformable convolution first predicts sampling offsets by passing the feature map through a dedicated layer, then performs the convolution at those shifted sampling points.

(Figure: visualization of deformable convolution and its sampled points)
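The idea above can be sketched in numpy: bilinear sampling lets the kernel read the feature map at fractional, offset positions. This is a minimal single-point, single-channel sketch, not the actual implementation; `deformable_conv_point`, its 3x3 grid, and the way offsets are passed in are all illustrative assumptions (in practice the offsets come from a separate conv layer, e.g. torchvision's `deform_conv2d`).

```python
import numpy as np

def bilinear_sample(feat, y, x):
    """Bilinearly sample a 2-D feature map at fractional location (y, x)."""
    H, W = feat.shape
    y = np.clip(y, 0, H - 1); x = np.clip(x, 0, W - 1)
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, H - 1), min(x0 + 1, W - 1)
    wy, wx = y - y0, x - x0
    return (feat[y0, x0] * (1 - wy) * (1 - wx) + feat[y0, x1] * (1 - wy) * wx
            + feat[y1, x0] * wy * (1 - wx) + feat[y1, x1] * wy * wx)

def deformable_conv_point(feat, weights, offsets, py, px):
    """One output location of a 3x3 deformable convolution (sketch).
    offsets: (9, 2) predicted (dy, dx) per kernel tap; in a real model
    these are produced by a separate offset-predicting conv layer."""
    grid = [(gy, gx) for gy in (-1, 0, 1) for gx in (-1, 0, 1)]
    out = 0.0
    for k, (gy, gx) in enumerate(grid):
        dy, dx = offsets[k]
        # Sample at the regular grid position plus the learned offset.
        out += weights[k] * bilinear_sample(feat, py + gy + dy, px + gx + dx)
    return out
```

With all offsets set to zero this reduces to an ordinary 3x3 convolution tap; nonzero offsets move each tap to a learned, content-dependent location.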
Architecture
- Multi-scale: uses feature maps at multiple scales, feeding features from every scale into the attention module.
(DETR performs global attention over a single-scale feature map only.)
- Deformable Attention: instead of treating every pixel as a key, attention is computed only over a small set of sampling points predicted by a dedicated layer (analogous to deformable convolution).
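To feed multiple scales into one attention module, the feature maps are typically flattened per level and concatenated along the token axis, with the per-level shapes kept so sampling locations can be mapped back to each level. A small sketch under assumed shapes (the channel count, level sizes, and variable names are illustrative, not taken from the repo):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical backbone outputs at three scales (C channels each),
# e.g. strides 8 / 16 / 32 of the input image.
C = 4
feats = [rng.random((C, 32, 32)), rng.random((C, 16, 16)), rng.random((C, 8, 8))]

# Flatten each level to (H*W, C) tokens and concatenate across levels.
tokens = np.concatenate([f.reshape(C, -1).T for f in feats], axis=0)

# Bookkeeping so a sampling point (level, y, x) maps back into `tokens`.
level_shapes = [(f.shape[1], f.shape[2]) for f in feats]
level_start = np.cumsum([0] + [h * w for h, w in level_shapes])[:-1]
```

Attention then runs over this single token sequence, while the deformable sampling step uses `level_shapes` and `level_start` to pick points from each scale.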

Deformable Attention
- Instead of attending over every pixel, the query is fed through independent linear layers to obtain (1) sampling offsets and (2) attention weights, and attention is computed from these.
- The attention weights come directly from a linear layer, not from a query-key inner product.
- The attention output is the aggregation of those weights with the features at the sampled points.
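The three bullets above can be sketched end to end: a linear layer on the query yields K sampling offsets, another yields K attention weights (softmax-normalized, with no dot product against keys), and the output aggregates the bilinearly sampled features with those weights. This is a single-head, single-query numpy sketch; the function name, K=4, and the weight matrices are assumptions for illustration, not the repo's API.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def bilinear(feat, y, x):
    """Sample a (C, H, W) feature map bilinearly, clamped to the border."""
    C, H, W = feat.shape
    y = np.clip(y, 0, H - 1); x = np.clip(x, 0, W - 1)
    y0, x0 = int(y), int(x)
    y1, x1 = min(y0 + 1, H - 1), min(x0 + 1, W - 1)
    wy, wx = y - y0, x - x0
    return (feat[:, y0, x0] * (1 - wy) * (1 - wx) + feat[:, y0, x1] * (1 - wy) * wx
            + feat[:, y1, x0] * wy * (1 - wx) + feat[:, y1, x1] * wy * wx)

def deformable_attention(query, feat, ref_y, ref_x, W_off, W_attn, K=4):
    """Single-head deformable attention for one query (sketch).
    Offsets and weights come from linear layers on the query alone."""
    offsets = (query @ W_off).reshape(K, 2)   # (1) sampling offsets
    attn = softmax(query @ W_attn)            # (2) attention weights
    # Sample the value feature map at reference point + each offset.
    sampled = np.stack([bilinear(feat, ref_y + dy, ref_x + dx)
                        for dy, dx in offsets])          # (K, C)
    return attn @ sampled                     # weighted aggregation -> (C,)

C, d = 8, 16
feat = rng.standard_normal((C, 20, 20))
query = rng.standard_normal(d)
W_off = rng.standard_normal((d, 4 * 2)) * 0.1   # offset head
W_attn = rng.standard_normal((d, 4))            # attention-weight head
out = deformable_attention(query, feat, 10.0, 10.0, W_off, W_attn)
```

Note the cost: each query touches only K sampled points instead of all H*W keys, which is what makes attention over large, multi-scale feature maps tractable.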