(Full PyTorch Code Reference: https://github.com/facebookresearch/detr)

Background

DETR (DEtection TRansformer): a new detection architecture based on a transformer and bipartite matching

Architecture

1. Input to Encoder

  1. Get a feature map from the backbone CNN (ResNet in the paper). This differs from ViT, which uses image patches directly as input.
  2. Reduce the channel dimension of the feature map to the preset token embedding dimension (d=256) with a 1x1 convolution layer
  3. Flatten the spatial dimensions to d×HW → get a sequence of HW tokens, each of dimension d, that can be used as input to the transformer
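
The three steps above can be sketched in PyTorch as follows. This is a minimal shape-level illustration, not the reference implementation: a random tensor stands in for the backbone output (assuming 2048 channels, as in the last stage of ResNet-50), the spatial size and the names `feature_map`, `input_proj`, and `sequence` are chosen here for illustration, and the final permute assumes the sequence-first layout that `nn.Transformer` expects by default.

```python
import torch
from torch import nn

# Stand-in for the backbone feature map (assumption: 2048 channels, as in the
# last stage of ResNet-50; batch size and spatial size H x W are arbitrary).
batch, backbone_channels, H, W = 2, 2048, 25, 34
feature_map = torch.randn(batch, backbone_channels, H, W)

# Step 2: a 1x1 convolution reduces the channel dimension to d = 256.
d_model = 256
input_proj = nn.Conv2d(backbone_channels, d_model, kernel_size=1)
projected = input_proj(feature_map)      # (batch, d, H, W)

# Step 3: flatten the spatial dimensions so each location becomes one token.
tokens = projected.flatten(2)            # (batch, d, H*W)

# Reorder to (sequence length, batch, d), the default nn.Transformer layout.
sequence = tokens.permute(2, 0, 1)       # (H*W, batch, d)

print(projected.shape, tokens.shape, sequence.shape)
```

Each of the H*W spatial locations becomes one token of dimension d, so the sequence length grows with the resolution of the feature map rather than with a fixed patch grid.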

2. Encoder