Module 2 / Week 1 — DETR from scratch

Goal: build a Transformer-based object detector from scratch and train it end-to-end, without anchor boxes or non-maximum suppression.

This week’s reading:

  • Carion et al., End-to-End Object Detection with Transformers (2020): https://arxiv.org/abs/2005.12872
  • Reference implementation: facebookresearch/detr (read models/detr.py, models/matcher.py, models/transformer.py)

What you should be able to do by the end of the week:

  1. Explain, in plain English, why DETR doesn’t need NMS or anchor boxes.
  2. Describe what an object query is and how N queries → N predictions works.
  3. Explain bipartite matching with the Hungarian algorithm and why it lets set prediction be trained with an ordinary differentiable loss, even though the matching step itself is not differentiable.
  4. Implement, end-to-end, a small DETR that detects objects on a toy dataset (single GPU or CPU).
  5. Visualize the decoder’s cross-attention to see which image regions each query attends to.

Time budget: ~1 week. Read paper day 1, build days 2–5, train + analyze days 6–7.

Why DETR — the conceptual pivot

Pre-DETR detectors (Faster R-CNN, RetinaNet, YOLO) all share a basic recipe:

  1. Generate anchor boxes at many scales/ratios at every spatial location.
  2. For each anchor, predict whether it contains an object + a refinement of the box.
  3. Apply non-maximum suppression (NMS) to remove redundant detections.

DETR throws all of this out. Instead:

  1. Encode the image with a CNN + Transformer encoder → spatial feature tokens.
  2. Use N learned object queries (e.g. 100) as decoder inputs.
  3. Each query, via cross-attention, attends to the image and produces one prediction (class + box).
  4. Bipartite matching between the N predictions and the M ground-truth objects assigns each prediction a target. Unmatched predictions are trained to predict the special class “no object”.

No anchors. No NMS. Set prediction, end-to-end differentiable.

Q. Before you read the paper, write what you think will be hard about this. Where might it fail?

Architecture in one diagram

  image [B, 3, H, W]
         │
         ▼
   ┌─────────────┐
   │  Backbone   │  (ResNet-50, frozen or fine-tuned)
   └─────────────┘
         │
  features [B, C, H/32, W/32]
         │
  1×1 conv → [B, d_model, H/32, W/32]
         │
  flatten + 2D positional encoding
         │
  tokens [B, HW/1024, d_model]
         │
         ▼
   ┌─────────────┐
   │  Encoder    │  (6 layers self-attention)
   └─────────────┘
         │
  memory [B, HW/1024, d_model]
         │
         ▼
   ┌──────────────┐    ◄──  N learned object queries [N, d_model]
   │   Decoder    │
   │  (6 layers:  │
   │   self-attn  │
   │   cross-attn │
   │   FFN)       │
   └──────────────┘
         │
  decoder_out [B, N, d_model]
         │
    ┌────┴────┐
    ▼         ▼
  class_head bbox_head
  [B, N, K+1] [B, N, 4]

You’ll build each block in the cells below. Read the corresponding section of the paper before coding each block.

Symbols you’ll see throughout:

  • B = batch size
  • d_model = transformer hidden dim, e.g. 256
  • N = number of object queries, e.g. 100
  • K = number of classes (not counting “no object”). COCO has 80 categories, but the reference implementation sets K=91 because COCO category IDs run up to 90.
  • H, W = input image height/width
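As a sanity check on the diagram, the shapes can be traced by hand. The following is a minimal sketch, assuming H and W are divisible by 32; `detr_shapes` is a hypothetical helper name (it is not part of the reference implementation), and the default K=3 simply mirrors the toy setup used later in this notebook.

```python
def detr_shapes(B: int, H: int, W: int, d_model: int = 256, N: int = 100, K: int = 3) -> dict:
    """Trace tensor shapes through the architecture diagram (hypothetical helper)."""
    assert H % 32 == 0 and W % 32 == 0, "this sketch assumes H and W are divisible by 32"
    h, w = H // 32, W // 32
    L = h * w  # number of spatial tokens, equal to HW/1024
    return {
        "projected_features": (B, d_model, h, w),
        "tokens": (B, L, d_model),
        "memory": (B, L, d_model),
        "decoder_out": (B, N, d_model),
        "class_logits": (B, N, K + 1),
        "boxes": (B, N, 4),
    }


shapes = detr_shapes(B=2, H=128, W=128)
print(shapes["tokens"])  # (2, 16, 256)
```

Note how few tokens the encoder sees: a 128×128 toy image yields only 4×4 = 16 spatial tokens, which is worth keeping in mind when choosing object sizes for the synthetic dataset.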

Setup

import math
from typing import List, Tuple, Dict, Optional

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch import Tensor

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f'device: {device}')
device: cpu

Dataset choice

Full COCO is overkill for a from-scratch learning exercise. Three options:

  1. COCO val2017 (~5k images, 80 classes) — realistic but training is slow.
  2. Pascal VOC (~16k images, 20 classes) — classic, faster.
  3. Toy synthetic dataset — generate colored shapes on a black background. Trains in minutes, lets you isolate “does my architecture work at all.”

Recommendation: start with option 3 to debug the implementation, then move to option 2 once it works.
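To make option 3 concrete, here is one possible way to generate a single synthetic sample. This is a sketch under stated assumptions, not the intended solution to the TODO below: `make_toy_sample` is a hypothetical helper, shapes are axis-aligned rectangles, the class label picks the color channel (so it assumes num_classes ≤ 3), and overlapping rectangles are allowed to blend.

```python
from typing import Optional

import torch


def make_toy_sample(image_size: int = 128, max_objects: int = 5, num_classes: int = 3,
                    generator: Optional[torch.Generator] = None):
    """Draw random rectangles on a black image (hypothetical helper).

    Returns (image [3, H, W], {'boxes': [M, 4] cxcywh in [0, 1], 'labels': [M] in [1, K]}).
    """
    g = generator
    image = torch.zeros(3, image_size, image_size)
    m = int(torch.randint(1, max_objects + 1, (1,), generator=g))
    boxes, labels = [], []
    for _ in range(m):
        w, h = (torch.rand(2, generator=g) * 0.3 + 0.1).tolist()   # sizes in [0.1, 0.4]
        cx = float(torch.rand(1, generator=g)) * (1 - w) + w / 2   # keep the box inside the image
        cy = float(torch.rand(1, generator=g)) * (1 - h) + h / 2
        label = int(torch.randint(1, num_classes + 1, (1,), generator=g))  # 0 is 'no object'
        x0, x1 = int((cx - w / 2) * image_size), int((cx + w / 2) * image_size)
        y0, y1 = int((cy - h / 2) * image_size), int((cy + h / 2) * image_size)
        image[(label - 1) % 3, y0:y1, x0:x1] = 1.0                 # class decides the channel
        boxes.append([cx, cy, w, h])
        labels.append(label)
    return image, {"boxes": torch.tensor(boxes), "labels": torch.tensor(labels)}


g = torch.Generator().manual_seed(0)
image, target = make_toy_sample(generator=g)
print(image.shape, target["boxes"].shape, target["labels"])
```

Tying the class to the color keeps the classification task trivially learnable, so any failure is more likely to point at the matcher or the box head than at the data.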

Q. Why is debugging on a tiny synthetic dataset useful? What can a synthetic-data success not tell you?

# TODO — build a tiny synthetic detection dataset
# Generate images with up to max_objects random colored shapes (rectangles) on a black background.
# Return: (image [3, H, W], targets {'boxes': [M, 4] in cxcywh normalized [0,1], 'labels': [M] in [1, K]})
# Hint: class 0 is reserved for 'no object' — keep your real classes in [1, K].

class ToyDetectionDataset(torch.utils.data.Dataset):
    def __init__(self, num_samples: int = 1000, image_size: int = 128, max_objects: int = 5, num_classes: int = 3):
        ...
    def __len__(self):
        ...
    def __getitem__(self, idx: int):
        ...

Step 1 — Backbone

The backbone is a standard CNN (ResNet-50 in the paper). It produces a feature map at ~1/32 spatial resolution: [B, 2048, H/32, W/32]. A 1×1 conv projects 2048 → d_model (=256).

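For a from-scratch implementation, you have two choices:

  • Use a pretrained torchvision ResNet-50 and strip the last layers (the average pool and the fc classifier). In torchvision ≥ 0.13 the call is torchvision.models.resnet50(weights=ResNet50_Weights.DEFAULT); the older pretrained=True argument is deprecated.
  • Use a smaller backbone (ResNet-18 or even just a few conv layers) for the toy dataset.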
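For the "just a few conv layers" route, the sketch below shows the two ingredients the rest of the model relies on: a total stride of 32 and a 1×1 projection to d_model. `TinyConvBackbone` and its channel widths are assumptions made for illustration; it is a stand-in, not a truncated ResNet.

```python
import torch
import torch.nn as nn
from torch import Tensor


class TinyConvBackbone(nn.Module):
    """Five stride-2 conv blocks (total stride 32) plus a 1x1 projection.

    Hypothetical stand-in for a truncated ResNet; channel widths are arbitrary.
    """

    def __init__(self, d_model: int = 256, widths=(32, 64, 128, 256, 512)):
        super().__init__()
        layers, in_ch = [], 3
        for out_ch in widths:
            layers += [
                nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1, bias=False),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True),
            ]
            in_ch = out_ch
        self.body = nn.Sequential(*layers)
        self.proj = nn.Conv2d(in_ch, d_model, kernel_size=1)  # the 1x1 conv: C -> d_model

    def forward(self, x: Tensor) -> Tensor:
        # [B, 3, H, W] -> [B, d_model, H/32, W/32]
        return self.proj(self.body(x))


backbone = TinyConvBackbone().eval()
with torch.no_grad():
    feats = backbone(torch.zeros(2, 3, 128, 128))
print(feats.shape)  # torch.Size([2, 256, 4, 4])
```

Swapping in a torchvision ResNet later only changes `self.body` and the input channel count of `self.proj`; the output contract stays the same.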

Q. Why is the backbone usually pretrained (ImageNet) rather than trained from scratch? What if your dataset is OOD for ImageNet?

# TODO — Backbone
# 1. Wrap a torchvision ResNet, return the C5 feature map.
# 2. Apply a 1x1 conv to project from 2048 (resnet50) or 512 (resnet18) to d_model.
# 3. Forward should accept [B, 3, H, W] and return [B, d_model, H', W'].

class Backbone(nn.Module):
    def __init__(self, d_model: int = 256, name: str = 'resnet18'):
        super().__init__()
        # TODO: build the truncated backbone + projection
        ...

    def forward(self, x: Tensor) -> Tensor:
        # TODO
        ...

Step 2 — 2D positional encoding

Self-attention is permutation-equivariant: shuffle the input tokens and the outputs are shuffled the same way, so on its own the transformer has no idea where each spatial token sits in the image. We add a 2D sinusoidal positional encoding that gives every spatial location a unique signature.

This differs from the 1D PE in the original Transformer paper. The 2D version splits the channels in half: one half encodes the row (y), the other encodes the column (x), each with sinusoidal frequencies.
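The channel split can be sketched as a standalone function. This is a simplified illustration, assuming raw integer grid indices as positions and sine/cosine halves concatenated per axis; the reference implementation differs in details (for example, how sine and cosine channels are ordered and whether coordinates are normalized). `sine_pos_2d` is a hypothetical name.

```python
import torch
from torch import Tensor


def sine_pos_2d(h: int, w: int, d_model: int = 256, temperature: float = 10000.0) -> Tensor:
    """Return a [d_model, h, w] encoding: first half of the channels for y, second half for x."""
    assert d_model % 4 == 0, "need sin+cos for each of the two axes"
    d_axis = d_model // 2                                     # channels per axis
    i = torch.arange(d_axis // 2, dtype=torch.float32)        # frequency index
    div = temperature ** (2 * i / d_axis)                     # [d_axis/2] wavelength divisors
    y = torch.arange(h, dtype=torch.float32)[:, None] / div   # [h, d_axis/2]
    x = torch.arange(w, dtype=torch.float32)[:, None] / div   # [w, d_axis/2]
    pe_y = torch.cat([y.sin(), y.cos()], dim=1)               # [h, d_axis]
    pe_x = torch.cat([x.sin(), x.cos()], dim=1)               # [w, d_axis]
    pe = torch.cat(
        [
            pe_y[:, None, :].expand(h, w, d_axis),            # y part: constant within a row
            pe_x[None, :, :].expand(h, w, d_axis),            # x part: constant within a column
        ],
        dim=2,
    )                                                         # [h, w, d_model]
    return pe.permute(2, 0, 1)                                # [d_model, h, w]


pe = sine_pos_2d(4, 5)
print(pe.shape)  # torch.Size([256, 4, 5])
```

To feed this to the encoder, broadcast it over the batch and flatten it the same way as the feature map, so token k and its encoding refer to the same grid cell.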

Q. Why sinusoidal and not learned positional embeddings here? (Hint: input image sizes vary; sinusoidal generalizes to unseen sizes.)

# TODO — 2D sinusoidal positional encoding
# Given a feature map [B, C, H, W], produce a positional encoding [B, C, H, W] (or [B, H*W, C] after flatten).
# Standard recipe:
#   - half the channels encode y position with sin/cos at log-spaced frequencies
#   - the other half encode x position similarly
#   - frequencies typically: 10000^(2i/d) for i in [0, d/2)

class PositionalEncoding2D(nn.Module):
    def __init__(self, d_model: int = 256, temperature: float = 10000.0):
        super().__init__()
        assert d_model % 4 == 0, 'd_model must be divisible by 4 (half for x, half for y, each with sin+cos)'
        # TODO: store config
        ...

    def forward(self, x: Tensor) -> Tensor:
        # x: [B, C, H, W] (just used for shape, contents ignored)
        # returns: [B, C, H, W] positional encoding
        # TODO
        ...

Step 3 — Transformer encoder

Standard ViT-style encoder: a stack of n_layers identical layers (6 in the diagram above), each consisting of multi-head self-attention + MLP, with layer norm and residual connections.

DETR-specific: the positional encoding is added to the queries and keys at every layer, not just at the input. (The original DETR adds PE to Q and K but not V — slightly nonstandard.)
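The Q/K-only convention is easy to get wrong, so here is a minimal sketch of just the attention sub-block using nn.MultiheadAttention. The tensors are random stand-ins, and the post-norm residual shown is one common arrangement rather than a claim about what your final layer must look like.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
B, L, d_model = 2, 16, 256
attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
norm = nn.LayerNorm(d_model)

x = torch.randn(B, L, d_model)      # tokens
pos = torch.randn(B, L, d_model)    # stand-in for the 2D positional encoding

q = k = x + pos                     # position information shapes the attention weights
attn_out, attn_weights = attn(q, k, value=x)   # the values are the raw tokens, without PE
x = norm(x + attn_out)              # residual + LayerNorm
print(x.shape, attn_weights.shape)  # torch.Size([2, 16, 256]) torch.Size([2, 16, 16])
```

Because `pos` is passed into every layer's forward, each layer re-injects position into the attention pattern while the content stream `x` stays free of it.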

Q. Why add the PE at every layer instead of just at the input? What does this give you?

# TODO — Transformer encoder layer + stack
# Reuse nn.MultiheadAttention from PyTorch. Implement one encoder layer:
#   - self-attention(Q=K=tokens+pe, V=tokens)
#   - residual + LN
#   - FFN (Linear -> ReLU/GELU -> Linear)
#   - residual + LN
# Then stack n_layers of these.

class TransformerEncoderLayer(nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 8, dim_ff: int = 2048, dropout: float = 0.1):
        super().__init__()
        # TODO
        ...

    def forward(self, x: Tensor, pos: Tensor) -> Tensor:
        # x: [B, L, d_model], pos: [B, L, d_model]
        # TODO
        ...


class TransformerEncoder(nn.Module):
    def __init__(self, n_layers: int = 6, **layer_kwargs):
        super().__init__()
        self.layers = nn.ModuleList([TransformerEncoderLayer(**layer_kwargs) for _ in range(n_layers)])

    def forward(self, x: Tensor, pos: Tensor) -> Tensor:
        for layer in self.layers:
            x = layer(x, pos)
        return x

Step 4 — Transformer decoder + object queries

This is the conceptually weirdest part of DETR. Read paper §3.2 and the original Transformer §3.2 before coding.

Object queries are N learned vectors (e.g. 100 vectors of dim 256). They are positional embeddings without content — each query is a slot that will, after the decoder, contain the embedding of one detected object (or “no object”).

Each decoder layer does three operations:

  1. Self-attention over queries: queries can communicate with each other, letting them coordinate which one detects which object.
  2. Cross-attention to the encoder’s output: each query reads features from the image, gathering evidence for its prediction.
  3. FFN: process the gathered evidence.
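The three operations can be sketched as a single simplified layer. This is one plausible arrangement, assuming post-norm residuals and no dropout; `DecoderLayerSketch` and its small dim_ff are illustrative choices, and the full TODO below still asks you to write your own version.

```python
import torch
import torch.nn as nn
from torch import Tensor


class DecoderLayerSketch(nn.Module):
    """Post-norm decoder layer without dropout (hypothetical, simplified)."""

    def __init__(self, d_model: int = 256, n_heads: int = 8, dim_ff: int = 512):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, dim_ff), nn.ReLU(), nn.Linear(dim_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, tgt: Tensor, memory: Tensor, pos: Tensor, query_pos: Tensor) -> Tensor:
        # 1. Self-attention over the queries (query embedding added to Q and K only).
        q = k = tgt + query_pos
        tgt = self.norm1(tgt + self.self_attn(q, k, value=tgt)[0])
        # 2. Cross-attention: queries read from the encoder memory.
        out = self.cross_attn(query=tgt + query_pos, key=memory + pos, value=memory)[0]
        tgt = self.norm2(tgt + out)
        # 3. Position-wise feed-forward network.
        return self.norm3(tgt + self.ffn(tgt))


B, N, L, d = 2, 100, 16, 256
layer = DecoderLayerSketch(d)
query_embed = nn.Embedding(N, d)                                # the learned object queries
query_pos = query_embed.weight.unsqueeze(0).expand(B, -1, -1)   # [B, N, d]
tgt = torch.zeros(B, N, d)                                      # decoder input starts at zero
out = layer(tgt, torch.randn(B, L, d), torch.randn(B, L, d), query_pos)
print(out.shape)  # torch.Size([2, 100, 256])
```

Since `tgt` starts at zero, the only thing distinguishing one query from another in the first layer is `query_pos`, which is why the object queries behave like learned positional embeddings.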

Q1. What does it mean to say object queries are “positional embeddings without content”? How does this differ from autoregressive decoding in NMT?

Q2. The decoder takes 100 queries in and produces 100 output vectors. What if your image has only 3 objects? What happens to the other 97 queries?

Q3. Why is self-attention over the queries important? What would break if you removed it?

# TODO — Decoder layer + stack + learned object queries
# Each layer:
#   - self-attn over queries (Q=K=queries+query_embed, V=queries)
#   - cross-attn (Q=queries+query_embed, K=memory+pos, V=memory)
#   - FFN
# query_embed is a learned [N, d_model] parameter — this IS the object queries.

class TransformerDecoderLayer(nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 8, dim_ff: int = 2048, dropout: float = 0.1):
        super().__init__()
        # TODO
        ...

    def forward(self, tgt: Tensor, memory: Tensor, pos: Tensor, query_pos: Tensor) -> Tensor:
        # tgt: [B, N, d_model] -- decoder queries (start at zero, get updated by each layer)
        # memory: [B, L, d_model] -- encoder output
        # pos: [B, L, d_model] -- spatial PE for memory
        # query_pos: [B, N, d_model] -- the learned object queries (positional)
        # TODO
        ...


class TransformerDecoder(nn.Module):
    def __init__(self, n_layers: int = 6, **layer_kwargs):
        super().__init__()
        self.layers = nn.ModuleList([TransformerDecoderLayer(**layer_kwargs) for _ in range(n_layers)])

    def forward(self, tgt: Tensor, memory: Tensor, pos: Tensor, query_pos: Tensor) -> List[Tensor]:
        # Return the output AFTER EACH LAYER (for auxiliary losses) -- list of length n_layers.
        outs = []
        for layer in self.layers:
            tgt = layer(tgt, memory, pos, query_pos)
            outs.append(tgt)
        return outs

Step 5 — Prediction heads

Two heads on top of each decoder output:

  • Class head: a single Linear layer → K+1 logits. The extra class is the special “no object” class.
  • Bbox head: 3-layer MLP → 4 values (cx, cy, w, h), sigmoid’d to [0, 1] (image-normalized).
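In code the two heads are only a few lines. The sketch below assumes a hidden width equal to d_model for the box MLP and follows this notebook's convention that index 0 is “no object”; both are choices made for illustration.

```python
import torch
import torch.nn as nn

d_model, K = 256, 3
class_head = nn.Linear(d_model, K + 1)           # K real classes + 'no object' (index 0 here)
bbox_head = nn.Sequential(                       # 3 linear layers with ReLU in between
    nn.Linear(d_model, d_model), nn.ReLU(),
    nn.Linear(d_model, d_model), nn.ReLU(),
    nn.Linear(d_model, 4),
)

decoder_out = torch.randn(2, 100, d_model)       # [B, N, d_model]
logits = class_head(decoder_out)                 # [B, N, K+1], raw logits (no softmax yet)
boxes = bbox_head(decoder_out).sigmoid()         # [B, N, 4], cxcywh with each value in (0, 1)
print(logits.shape, boxes.shape)  # torch.Size([2, 100, 4]) torch.Size([2, 100, 4])
```

The same pair of heads is applied to every decoder layer's output when auxiliary losses are used, so it is convenient to keep them stateless with respect to which layer they are called on.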

Q. Why predict (cx, cy, w, h) in normalized coordinates rather than (x_min, y_min, x_max, y_max) in pixels? List two reasons.

# TODO — Class head and bbox head
# Class head: single Linear(d_model -> num_classes + 1)
# Bbox head: small MLP (3 linear layers, ReLU between) + sigmoid on output

class DetrPredictionHeads(nn.Module):
    def __init__(self, d_model: int = 256, num_classes: int = 3):
        super().__init__()
        # TODO
        ...

    def forward(self, x: Tensor) -> Tuple[Tensor, Tensor]:
        # x: [B, N, d_model] -> (logits [B, N, K+1], boxes [B, N, 4] in cxcywh normalized)
        # TODO
        ...

Step 6 — Hungarian matcher

This is the conceptual centerpiece of DETR. Without it, set prediction wouldn’t work.

The model produces N predictions, the image has M ground-truth boxes (M ≤ N). We need to assign each ground-truth to exactly one prediction before computing the loss. The assignment that minimizes total cost is found by bipartite matching (Hungarian algorithm).

Cost between prediction i and ground-truth j:

cost[i, j] = -p_i[c_j] + λ_L1 · ||b_i - b_j||_1 + λ_giou · (1 - GIoU(b_i, b_j))

where p_i[c_j] is prediction i’s probability for ground-truth j’s class.
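The cost formula can be tried out on a hand-made example before writing the batched matcher. The sketch below handles a single image, implements pairwise GIoU by hand to stay self-contained (torchvision.ops.generalized_box_iou is the usual shortcut), and uses the λ values 5 and 2 that appear as defaults later in this notebook. `match_one_image` and the other helper names are hypothetical.

```python
import torch
from scipy.optimize import linear_sum_assignment


def cxcywh_to_xyxy(b):
    cx, cy, w, h = b.unbind(-1)
    return torch.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], dim=-1)


def pairwise_giou(a, b):
    """GIoU between every box in a [N, 4] and every box in b [M, 4], both in xyxy format."""
    area_a = (a[:, 2] - a[:, 0]) * (a[:, 3] - a[:, 1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    inter = (torch.min(a[:, None, 2:], b[None, :, 2:])
             - torch.max(a[:, None, :2], b[None, :, :2])).clamp(min=0).prod(-1)
    union = area_a[:, None] + area_b[None, :] - inter
    hull = (torch.max(a[:, None, 2:], b[None, :, 2:])
            - torch.min(a[:, None, :2], b[None, :, :2])).clamp(min=0).prod(-1)
    return inter / union - (hull - union) / hull


def match_one_image(logits, boxes, gt_labels, gt_boxes, w_l1=5.0, w_giou=2.0):
    prob = logits.softmax(-1)                          # [N, K+1]
    cost_class = -prob[:, gt_labels]                   # [N, M]: -p_i[c_j]
    cost_l1 = torch.cdist(boxes, gt_boxes, p=1)        # [N, M]: ||b_i - b_j||_1
    cost_giou = 1 - pairwise_giou(cxcywh_to_xyxy(boxes), cxcywh_to_xyxy(gt_boxes))
    cost = cost_class + w_l1 * cost_l1 + w_giou * cost_giou
    pred_idx, gt_idx = linear_sum_assignment(cost.detach().cpu().numpy())
    return torch.as_tensor(pred_idx), torch.as_tensor(gt_idx), cost


# 4 predictions, 2 ground-truth objects. Prediction 0 coincides with gt 1, prediction 2 with gt 0.
gt_labels = torch.tensor([1, 2])
gt_boxes = torch.tensor([[0.25, 0.25, 0.2, 0.2], [0.7, 0.7, 0.3, 0.3]])
boxes = torch.tensor([[0.7, 0.7, 0.3, 0.3], [0.5, 0.5, 0.9, 0.9],
                      [0.25, 0.25, 0.2, 0.2], [0.1, 0.9, 0.1, 0.1]])
logits = torch.zeros(4, 4)
logits[0, 2] = 5.0
logits[2, 1] = 5.0

pred_idx, gt_idx, cost = match_one_image(logits, boxes, gt_labels, gt_boxes)
print(pred_idx.tolist(), gt_idx.tolist())  # [0, 2] [1, 0]
```

Predictions 1 and 3 are left unmatched, which is exactly the set that the loss will later push toward “no object”.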

Q1. Why is argmax matching (greedy: each gt → highest-scoring prediction) bad here? What goes wrong?

Q2. The Hungarian algorithm finds the optimal assignment in O(n³) time. Why does that not bottleneck training? (Hint: how big is N?)

Q3. The matching is computed with torch.no_grad() — the gradient does NOT flow through the matching. Why is that OK?

# TODO — Hungarian matcher
# Use scipy.optimize.linear_sum_assignment on the cost matrix.
# Inputs:
#   outputs = {'pred_logits': [B, N, K+1], 'pred_boxes': [B, N, 4]}
#   targets = [{'labels': [M_b], 'boxes': [M_b, 4]} for b in range(B)]  # variable M per image
# Returns, for each image in the batch: a tuple (pred_indices [M_b], target_indices [M_b]) of matched pairs.

from scipy.optimize import linear_sum_assignment

class HungarianMatcher(nn.Module):
    def __init__(self, cost_class: float = 1.0, cost_bbox: float = 5.0, cost_giou: float = 2.0):
        super().__init__()
        # TODO
        ...

    @torch.no_grad()
    def forward(self, outputs: Dict[str, Tensor], targets: List[Dict[str, Tensor]]) -> List[Tuple[Tensor, Tensor]]:
        # TODO
        # 1. Build cost matrix per image
        # 2. Run linear_sum_assignment
        # 3. Return list of (pred_idx, tgt_idx) tensors, one per image
        ...

Step 7 — Loss

Once you have the matching, the loss is three terms:

  1. Classification loss: cross-entropy over all N predictions. Matched predictions are trained toward their assigned gt class; unmatched ones are trained toward “no object”.
  2. L1 box loss: ||b_pred - b_gt||_1 over matched pairs only.
  3. GIoU loss: 1 - GIoU(b_pred, b_gt) over matched pairs only.

Plus auxiliary losses: apply this entire loss at the output of every decoder layer (deep supervision). This is just additive — sum the per-layer losses.
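How the matching turns into targets is the part that tends to cause indexing bugs, so here is a single-image sketch of the classification and L1 terms. Assumptions: the matcher output is hard-coded, class 0 is “no object” as elsewhere in this notebook, the GIoU term and the auxiliary per-layer sum are omitted for brevity, and the weights 1.0 and 5.0 mirror the defaults used in the stub below.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
N, K = 6, 3
logits = torch.randn(N, K + 1, requires_grad=True)     # one image, N queries
pred_boxes = torch.rand(N, 4, requires_grad=True)
gt_labels = torch.tensor([2, 1])
gt_boxes = torch.tensor([[0.3, 0.3, 0.2, 0.2], [0.7, 0.6, 0.4, 0.3]])
pred_idx, gt_idx = torch.tensor([4, 1]), torch.tensor([0, 1])   # pretend matcher output

# Classification targets: every query is 'no object' (class 0) unless it was matched.
target_classes = torch.zeros(N, dtype=torch.long)
target_classes[pred_idx] = gt_labels[gt_idx]

class_weights = torch.ones(K + 1)
class_weights[0] = 0.1                                  # down-weight 'no object'
loss_class = F.cross_entropy(logits, target_classes, weight=class_weights)

# Box loss on matched pairs only, averaged over the number of ground-truth boxes.
loss_bbox = F.l1_loss(pred_boxes[pred_idx], gt_boxes[gt_idx], reduction="sum") / len(gt_idx)

total = 1.0 * loss_class + 5.0 * loss_bbox              # GIoU term omitted in this sketch
total.backward()
print(target_classes.tolist())  # [0, 1, 0, 0, 2, 0]
```

After `backward()`, the rows of `pred_boxes.grad` belonging to unmatched queries are zero: unmatched queries receive a gradient only through the classification term.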

Q1. Why L1 and GIoU? Why not just L1?

Q2. What does the auxiliary loss do? Why does it help training?

Q3. The “no object” class will be by far the most common target (in every image, most of the 100 queries have no match). What problem does this create, and how does DETR fix it? (Hint: see paper §3.1.)

# TODO — Loss
# Implement:
#   loss_class: cross-entropy with class-weighting (down-weight 'no object', e.g. weight=0.1)
#   loss_bbox: L1 on matched pairs
#   loss_giou: 1 - generalized_box_iou on matched pairs
# Helpers you may want:
#   from torchvision.ops import generalized_box_iou, box_convert

class DetrLoss(nn.Module):
    def __init__(self, num_classes: int = 3, matcher: Optional[HungarianMatcher] = None,
                 weight_class: float = 1.0, weight_bbox: float = 5.0, weight_giou: float = 2.0,
                 noobj_weight: float = 0.1):
        super().__init__()
        # TODO
        ...

    def forward(self, outputs: Dict[str, Tensor], targets: List[Dict[str, Tensor]]) -> Dict[str, Tensor]:
        # outputs: {'pred_logits', 'pred_boxes', 'aux_outputs': [per-layer dicts]}
        # Return a dict of loss components and a 'total' key.
        # TODO
        ...

Step 8 — Assemble the full DETR model

# TODO — Glue everything together
# Build the full DETR by composing: Backbone -> 1x1 conv -> + PE -> Encoder -> Decoder w/ object queries -> heads.
# The model should output:
#   {'pred_logits': [B, N, K+1], 'pred_boxes': [B, N, 4], 'aux_outputs': [per-layer dicts]}

class DETR(nn.Module):
    def __init__(self, num_classes: int = 3, num_queries: int = 100, d_model: int = 256,
                 n_heads: int = 8, n_enc_layers: int = 6, n_dec_layers: int = 6, dim_ff: int = 2048,
                 backbone_name: str = 'resnet18'):
        super().__init__()
        # TODO: backbone, pe, encoder, decoder, query_embed (nn.Embedding), heads
        ...

    def forward(self, images: Tensor) -> Dict[str, Tensor]:
        # TODO
        ...

Step 9 — Training

Standard PyTorch training loop. Things to watch:

  • Learning rate: paper uses 1e-4 for transformer, 1e-5 for backbone (lower for pretrained backbone).
  • Schedule: original DETR trains for 500 epochs on COCO. For toy/synthetic data, a few hundred steps should suffice.
  • Watch the loss components separately — if loss_class drops fast but loss_bbox is flat, your box head or matching is off.
  • Watch the “no object” probability — early in training every query predicts “no object”. Check that the matched queries gradually transition to real classes.
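The split learning rate from the first bullet maps onto AdamW parameter groups. A minimal sketch, assuming the backbone parameters can be identified by a `backbone` name prefix; the stand-in model and the weight_decay value are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Stand-in model: any module whose backbone parameters live under a 'backbone' attribute.
model = nn.ModuleDict({
    "backbone": nn.Conv2d(3, 8, kernel_size=3),
    "transformer": nn.Linear(8, 8),
})

backbone_params = [p for n, p in model.named_parameters()
                   if n.startswith("backbone") and p.requires_grad]
other_params = [p for n, p in model.named_parameters()
                if not n.startswith("backbone") and p.requires_grad]

optimizer = torch.optim.AdamW(
    [
        {"params": other_params, "lr": 1e-4},      # transformer + heads
        {"params": backbone_params, "lr": 1e-5},   # pretrained backbone: smaller steps
    ],
    weight_decay=1e-4,                             # assumed value for this sketch
)
print([group["lr"] for group in optimizer.param_groups])  # [0.0001, 1e-05]
```

If the backbone is trained from scratch on the toy data rather than loaded pretrained, a single learning rate for everything is a reasonable starting point.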

# TODO — training loop
# 1. Instantiate dataset, model, matcher, loss, optimizer (AdamW with split LR).
# 2. For each batch: forward -> loss -> backward -> step.
# 3. Print loss components every few steps (e.g. a log_every argument; K is already taken by the class count).
# 4. Optionally: eval on a held-out set every E epochs.

def train_detr(num_steps: int = 1000, batch_size: int = 16):
    ...

Step 10 — Inference + attention visualization

Inference is trivial in DETR — there’s no NMS. Just:

  1. Run the model.
  2. For each of the N queries, take the argmax over K+1 classes.
  3. Keep predictions whose argmax is not “no object” (and optionally above a confidence threshold).
  4. Return the boxes for the kept predictions.
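The four steps above can be sketched for a single image as follows. `postprocess` is a hypothetical helper; it assumes class 0 is “no object” (this notebook's convention) and converts the kept boxes from normalized cxcywh to pixel xyxy for plotting.

```python
import torch


def postprocess(logits, boxes, image_hw, score_threshold=0.7):
    """logits [N, K+1] (class 0 = 'no object'), boxes [N, 4] in normalized cxcywh."""
    probs = logits.softmax(-1)
    scores, labels = probs.max(-1)                          # argmax class and its probability
    keep = (labels != 0) & (scores > score_threshold)       # drop 'no object' and weak predictions
    h, w = image_hw
    cx, cy, bw, bh = boxes[keep].unbind(-1)
    xyxy = torch.stack([cx - bw / 2, cy - bh / 2, cx + bw / 2, cy + bh / 2], dim=-1)
    xyxy = xyxy * torch.tensor([w, h, w, h], dtype=xyxy.dtype)   # normalized -> pixels
    return labels[keep], scores[keep], xyxy


logits = torch.tensor([[4.0, 0.0, 0.0, 0.0],    # confident 'no object'
                       [0.0, 5.0, 0.0, 0.0],    # confident class 1
                       [0.0, 0.0, 0.5, 0.0]])   # class 2, but low confidence
boxes = torch.tensor([[0.5, 0.5, 0.2, 0.2],
                      [0.5, 0.5, 0.5, 0.25],
                      [0.2, 0.2, 0.1, 0.1]])
labels, scores, xyxy = postprocess(logits, boxes, image_hw=(128, 128))
print(labels.tolist(), xyxy.tolist())  # [1] [[32.0, 48.0, 96.0, 80.0]]
```

Note that nothing here compares boxes against each other: duplicates are suppressed by training (matching plus query self-attention), not by a post-hoc filter.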

Visualizing cross-attention is the most interesting analysis:

  • For each query, the decoder’s final-layer cross-attention is a heatmap over image patches.
  • The attention weights (averaged over heads) have shape [B, N, H/32 · W/32]. Plotting attention[b, query_i, :] reshaped to [H/32, W/32] shows which part of the image query i is looking at.
  • You’ll see that each query specializes: some queries always look at the top-left, some at small objects, etc.
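Extracting the heatmaps comes down to asking nn.MultiheadAttention for its weights and reshaping the last axis back onto the feature grid. The sketch below uses a standalone attention module with random inputs as a stand-in for the last decoder layer's cross-attention; the sizes are arbitrary, and it assumes the memory tokens were flattened row by row from an [h, w] grid.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
B, N, d_model, h, w = 1, 5, 64, 4, 6
cross_attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)

queries = torch.randn(B, N, d_model)        # stand-in for decoder queries
memory = torch.randn(B, h * w, d_model)     # stand-in for encoder output, row-major tokens

with torch.no_grad():
    # need_weights=True returns attention weights (averaged over heads by default): [B, N, h*w]
    _, weights = cross_attn(queries, memory, memory, need_weights=True)

heatmaps = weights.reshape(B, N, h, w)      # one [h, w] map per query
print(heatmaps.shape)  # torch.Size([1, 5, 4, 6])
```

Each `heatmaps[0, i]` can then be passed to an image-plotting call and upsampled to the input resolution for an overlay. In your own model you will need a forward hook, or a decoder layer that returns the weights, to get at this tensor.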

Q. What does it tell you if many queries’ attention heatmaps look identical? What if they’re all uniform?

# TODO — inference + attention viz
# 1. forward an image, get pred_logits and pred_boxes
# 2. softmax over classes, drop 'no object', threshold by score
# 3. plot image + remaining boxes
# 4. for one image, also extract cross-attention from the last decoder layer
#    (you'll need to register a hook or modify MHA to return attention weights)
# 5. plot the attention heatmap for each kept query, overlaid on the image

@torch.no_grad()
def predict_and_visualize(model: DETR, image: Tensor, score_threshold: float = 0.7):
    ...

Exercises / extensions

Once your DETR is training on the toy dataset, try one or two:

  1. Ablate object queries. Set num_queries = 1 and re-train. Can it detect anything? What about num_queries = 10 when the image has 5 objects?
  2. Ablate decoder self-attention. Remove the self-attention block (keep only cross-attn + FFN). What breaks? (Hint: duplicate detections.)
  3. Ablate auxiliary loss. Train with loss only on the final layer. Compare convergence speed.
  4. Query specialization plot. After training, for each query, average the centers of its top predictions across the validation set. Plot the 100 query centers on a 2D plane — you should see specialization (some queries always predict in the top-left, etc.).
  5. Move to Pascal VOC. Once toy data works, try a real (smaller) detection benchmark.

End-of-week recap

Write a ≤200-word summary you could send to a colleague who hasn’t read the DETR paper. Cover:

  1. The conceptual pivot from anchor-based detectors to set prediction.
  2. What object queries are and how N → N predictions works.
  3. Why bipartite matching is needed and what it gives you.
  4. The biggest weakness of vanilla DETR (slow training, ~500 epochs).
  5. The one thing you’d want to follow up on.


Next week (Module 2 / Week 2): LW-DETR and RF-DETR — building on top of vanilla DETR. Key things to look for: deformable attention (Deformable-DETR), two-stage variants, denoising-based training (DN-DETR / DINO-DETR / RF-DETR), and why these recipes cut training from 500 epochs to ~50.