(Full PyTorch Code Reference: https://github.com/facebookresearch/detr)
Background
DETR (DEtection TRansformer): a new detection architecture based on a transformer and bipartite matching
- Approaches object detection as direct set prediction (the model predicts a set of bounding boxes and category labels for the objects in an image) in an end-to-end model → requires no geometric priors or hand-crafted components such as RPN and NMS
- It is structurally simple, yet highly extensible to other tasks (e.g., panoptic segmentation)
- It exploits global information through the attention mechanism → achieves higher performance than Faster R-CNN on large-object detection
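The bipartite matching mentioned above can be sketched with the Hungarian algorithm, which finds the one-to-one assignment between predictions and ground-truth objects that minimizes total matching cost. The cost values below are made-up toy numbers; in DETR the cost combines class probability with L1 and GIoU box terms:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Toy cost matrix (hypothetical values): rows = predictions, cols = ground-truth objects.
cost = np.array([
    [0.9, 0.2, 0.7],
    [0.1, 0.8, 0.6],
    [0.5, 0.4, 0.05],
])

# Hungarian algorithm: one-to-one assignment minimizing the total cost
pred_idx, gt_idx = linear_sum_assignment(cost)
print(list(zip(pred_idx, gt_idx)))  # [(0, 1), (1, 0), (2, 2)] -> total cost 0.35
```

DETR's reference implementation also relies on `scipy.optimize.linear_sum_assignment` for this matching step.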

Architecture


1. Input to Encoder
- Get a feature map from the backbone CNN (ResNet in the paper)
(Unlike ViT, which directly takes an image split into patches as input)
- Reduce the feature map channels to the preset token embedding dimension (d=256) with a 1x1 convolution layer
- Flatten to d×HW → obtain a sequence that can be fed to the transformer
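The steps above can be sketched in a few lines of PyTorch. The shapes here are illustrative assumptions (a ResNet backbone typically outputs 2048 channels at stride 32, so an 800×1088 input would yield roughly a 25×34 map):

```python
import torch
import torch.nn as nn

C, d, H, W = 2048, 256, 25, 34          # assumed backbone output and target dims
feat = torch.randn(1, C, H, W)          # backbone feature map [B, C, H, W]

proj = nn.Conv2d(C, d, kernel_size=1)   # 1x1 conv: reduce channels to d=256
x = proj(feat)                          # [B, d, H, W]
seq = x.flatten(2).permute(2, 0, 1)     # [HW, B, d]: token sequence for the transformer
print(seq.shape)                        # torch.Size([850, 1, 256])
```

Each of the HW = 850 spatial positions becomes one token of dimension d=256, which is what the encoder consumes.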

2. Encoder