(Full PyTorch Code Reference: https://github.com/hanouticelina/deformable-DETR)
Background
DETR
https://frost-crate-a82.notion.site/End-to-End-Object-Detection-with-Transformers-513a58b0b55f4b5bafdb596f862f5601?pvs=4
Deformable Convolution
(A normal convolution samples at fixed grid locations determined by the kernel size.)
Deformable convolution first predicts sampling offsets by passing the feature map through a dedicated layer, then performs the convolution at those shifted sampling points.

(Figure: visualization of deformable convolution and its sampled points)
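The idea above can be sketched in numpy: bilinear sampling lets the kernel read the feature map at fractional, offset positions. This is a minimal single-point, single-channel sketch, not the actual implementation; `deformable_conv_point`, its 3x3 grid, and the way offsets are passed in are all illustrative assumptions (in practice the offsets come from a separate conv layer, e.g. torchvision's `deform_conv2d`).

```python
import numpy as np

def bilinear_sample(feat, y, x):
    """Bilinearly sample a 2-D feature map at fractional location (y, x)."""
    H, W = feat.shape
    y = np.clip(y, 0, H - 1); x = np.clip(x, 0, W - 1)
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, H - 1), min(x0 + 1, W - 1)
    wy, wx = y - y0, x - x0
    return (feat[y0, x0] * (1 - wy) * (1 - wx) + feat[y0, x1] * (1 - wy) * wx
            + feat[y1, x0] * wy * (1 - wx) + feat[y1, x1] * wy * wx)

def deformable_conv_point(feat, weights, offsets, py, px):
    """One output location of a 3x3 deformable convolution (sketch).
    offsets: (9, 2) predicted (dy, dx) per kernel tap; in a real model
    these are produced by a separate offset-predicting conv layer."""
    grid = [(gy, gx) for gy in (-1, 0, 1) for gx in (-1, 0, 1)]
    out = 0.0
    for k, (gy, gx) in enumerate(grid):
        dy, dx = offsets[k]
        # Sample at the regular grid position plus the learned offset.
        out += weights[k] * bilinear_sample(feat, py + gy + dy, px + gx + dx)
    return out
```

With all offsets set to zero this reduces to an ordinary 3x3 convolution tap; nonzero offsets move each tap to a learned, content-dependent location.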
Architecture
- Multi-scale: uses feature maps at multiple scales, feeding features from every scale into the attention module.
(DETR performs global attention over a single-scale feature map only.)
- Deformable Attention: instead of treating every pixel as a key, attention is computed only over a small set of sampling points predicted by a dedicated layer (analogous to deformable convolution).
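To feed multiple scales into one attention module, the feature maps are typically flattened per level and concatenated along the token axis, with the per-level shapes kept so sampling locations can be mapped back to each level. A small sketch under assumed shapes (the channel count, level sizes, and variable names are illustrative, not taken from the repo):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical backbone outputs at three scales (C channels each),
# e.g. strides 8 / 16 / 32 of the input image.
C = 4
feats = [rng.random((C, 32, 32)), rng.random((C, 16, 16)), rng.random((C, 8, 8))]

# Flatten each level to (H*W, C) tokens and concatenate across levels.
tokens = np.concatenate([f.reshape(C, -1).T for f in feats], axis=0)

# Bookkeeping so a sampling point (level, y, x) maps back into `tokens`.
level_shapes = [(f.shape[1], f.shape[2]) for f in feats]
level_start = np.cumsum([0] + [h * w for h, w in level_shapes])[:-1]
```

Attention then runs over this single token sequence, while the deformable sampling step uses `level_shapes` and `level_start` to pick points from each scale.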

Deformable Attention
- Instead of attending over every pixel, the query is fed through independent linear layers to obtain (1) sampling offsets and (2) attention weights, and attention is computed from these.
- The attention weights come directly from a linear layer, not from a query-key inner product.
- The attention output is the aggregation of those weights with the features at the sampled points.
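The three bullets above can be sketched end to end: a linear layer on the query yields K sampling offsets, another yields K attention weights (softmax-normalized, with no dot product against keys), and the output aggregates the bilinearly sampled features with those weights. This is a single-head, single-query numpy sketch; the function name, K=4, and the weight matrices are assumptions for illustration, not the repo's API.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def bilinear(feat, y, x):
    """Sample a (C, H, W) feature map bilinearly, clamped to the border."""
    C, H, W = feat.shape
    y = np.clip(y, 0, H - 1); x = np.clip(x, 0, W - 1)
    y0, x0 = int(y), int(x)
    y1, x1 = min(y0 + 1, H - 1), min(x0 + 1, W - 1)
    wy, wx = y - y0, x - x0
    return (feat[:, y0, x0] * (1 - wy) * (1 - wx) + feat[:, y0, x1] * (1 - wy) * wx
            + feat[:, y1, x0] * wy * (1 - wx) + feat[:, y1, x1] * wy * wx)

def deformable_attention(query, feat, ref_y, ref_x, W_off, W_attn, K=4):
    """Single-head deformable attention for one query (sketch).
    Offsets and weights come from linear layers on the query alone."""
    offsets = (query @ W_off).reshape(K, 2)   # (1) sampling offsets
    attn = softmax(query @ W_attn)            # (2) attention weights
    # Sample the value feature map at reference point + each offset.
    sampled = np.stack([bilinear(feat, ref_y + dy, ref_x + dx)
                        for dy, dx in offsets])          # (K, C)
    return attn @ sampled                     # weighted aggregation -> (C,)

C, d = 8, 16
feat = rng.standard_normal((C, 20, 20))
query = rng.standard_normal(d)
W_off = rng.standard_normal((d, 4 * 2)) * 0.1   # offset head
W_attn = rng.standard_normal((d, 4))            # attention-weight head
out = deformable_attention(query, feat, 10.0, 10.0, W_off, W_attn)
```

Note the cost: each query touches only K sampled points instead of all H*W keys, which is what makes attention over large, multi-scale feature maps tractable.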