(Full PyTorch Code Reference: https://github.com/facebookresearch/detr)
Background
DETR (DEtection TRansformer): a new detection architecture based on a transformer and bipartite matching
- Approaches object detection as direct set prediction (the model predicts a set of bounding boxes and category labels for the objects in an image) in an end-to-end model → requires no geometric priors or hand-crafted components such as RPN and NMS
- It is structurally simple, yet highly extensible to other tasks (e.g., panoptic segmentation)
- It exploits global information through the attention mechanism → achieves higher performance than Faster R-CNN on large-object detection
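The bipartite matching mentioned above can be sketched with the Hungarian algorithm, which finds the one-to-one assignment between predictions and ground-truth objects that minimizes total matching cost. The cost values below are made-up toy numbers; in DETR the cost combines class probability with L1 and GIoU box terms:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Toy cost matrix (hypothetical values): rows = predictions, cols = ground-truth objects.
cost = np.array([
    [0.9, 0.2, 0.7],
    [0.1, 0.8, 0.6],
    [0.5, 0.4, 0.05],
])

# Hungarian algorithm: one-to-one assignment minimizing the total cost
pred_idx, gt_idx = linear_sum_assignment(cost)
print(list(zip(pred_idx, gt_idx)))  # [(0, 1), (1, 0), (2, 2)] -> total cost 0.35
```

DETR's reference implementation also relies on `scipy.optimize.linear_sum_assignment` for this matching step.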

Architecture


1. Input to Encoder
- Get a feature map from the backbone CNN (ResNet in the paper)
(Unlike ViT, which directly takes an image split into patches as input)
- Reduce the feature map channels to the preset token embedding dimension (d=256) with a 1x1 convolution layer
- Flatten to d×HW → obtain a sequence that can be fed to the transformer
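The steps above can be sketched in a few lines of PyTorch. The shapes here are illustrative assumptions (a ResNet backbone typically outputs 2048 channels at stride 32, so an 800×1088 input would yield roughly a 25×34 map):

```python
import torch
import torch.nn as nn

C, d, H, W = 2048, 256, 25, 34          # assumed backbone output and target dims
feat = torch.randn(1, C, H, W)          # backbone feature map [B, C, H, W]

proj = nn.Conv2d(C, d, kernel_size=1)   # 1x1 conv: reduce channels to d=256
x = proj(feat)                          # [B, d, H, W]
seq = x.flatten(2).permute(2, 0, 1)     # [HW, B, d]: token sequence for the transformer
print(seq.shape)                        # torch.Size([850, 1, 256])
```

Each of the HW = 850 spatial positions becomes one token of dimension d=256, which is what the encoder consumes.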

2. Encoder