Module 2 / Week 2 — Building on top of DETR

Goal: understand the lineage of DETR variants that fixed vanilla DETR’s slow convergence and weak small-object detection. Read each paper, identify the one trick that defines it, and extend your vanilla_detr.py with the single most impactful trick (denoising queries).

This week’s reading (in chronological order; each builds on the previous):

  • Zhu et al., Deformable DETR (2020): https://arxiv.org/abs/2010.04159
  • Meng et al., Conditional DETR (2021): https://arxiv.org/abs/2108.06152 (skim)
  • Liu et al., DAB-DETR (2022): https://arxiv.org/abs/2201.12329 (skim)
  • Li et al., DN-DETR (2022): https://arxiv.org/abs/2203.01305
  • Zhang et al., DINO (2022; the DETR variant, not the SSL DINO): https://arxiv.org/abs/2203.03605
  • Chen et al., LW-DETR (2024): https://arxiv.org/abs/2406.03459
  • Roboflow, RF-DETR (2025): https://github.com/roboflow/rf-detr (blog/repo)

What you should be able to do by the end of the week:

  1. State, in one sentence each, the central convergence fix of Deformable DETR, DN-DETR, DINO-DETR, LW-DETR, and RF-DETR.
  2. Explain deformable attention in your own words, including why it’s faster and better than dense cross-attention.
  3. Implement denoising queries on top of your vanilla_detr.py, train, and observe the convergence speedup.
  4. Fill in the comparison table (epochs to converge, COCO AP, key trick) for each variant.
  5. Argue when you’d reach for LW-DETR vs RF-DETR vs DINO-DETR for a real project.

Recap — the vanilla DETR convergence problem

You felt this firsthand in Week 1. Three concrete symptoms:

  1. Sparse supervision per gt. Bipartite matching assigns exactly one positive query per gt. That’s ~5 positives per image vs the ~150 that anchor-based methods provide, i.e. ~30× fewer gradient signals per object per step.
  2. Matching instability early in training. Which query owns which gt flickers between steps — queries can’t specialize.
  3. Global attention is slow and weak on small objects. Encoder self-attention is O(L²) over all spatial tokens (decoder cross-attention is O(N·L) for N queries), and everything runs at a single feature scale (/32). Tiny objects get blurred away.

Result: vanilla DETR needs ~500 epochs on COCO to reach Faster R-CNN-level accuracy, and still loses to anchor-based methods on small objects.
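To put rough numbers on the symptoms above, here is a small back-of-the-envelope sketch. The 800×1066 input size is a hypothetical example and `detr_attention_pairs` is an illustrative helper, not part of vanilla_detr.py; the 5 vs 150 positives are the approximate figures quoted in symptom 1.

```python
import math

def detr_attention_pairs(img_h, img_w, stride=32, num_queries=100):
    """Count query-key pairs for dense attention at a single feature scale."""
    h, w = math.ceil(img_h / stride), math.ceil(img_w / stride)
    num_tokens = h * w                           # L = H' * W'
    return {
        "L": num_tokens,
        "encoder_self_attn": num_tokens ** 2,    # every token attends to every token
        "decoder_cross_attn": num_queries * num_tokens,
    }

# Hypothetical input size; /32 is vanilla DETR's single feature scale.
pairs_32 = detr_attention_pairs(800, 1066, stride=32)
# The same dense attention on a /8 map, which small objects would need.
pairs_8 = detr_attention_pairs(800, 1066, stride=8)
print(pairs_32)
print("cost growth /32 -> /8:", pairs_8["encoder_self_attn"] / pairs_32["encoder_self_attn"])

# Supervision density, using the approximate counts from symptom 1.
positives_set_prediction, positives_anchor_based = 5, 150
print("positives ratio:", positives_anchor_based / positives_set_prediction)
```

The quadratic term is why simply feeding a higher-resolution map to dense attention is not a practical fix for small objects.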

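Every variant this week tackles one or more of those problems. The map below regroups them into four rows: symptom 3 splits into attention cost and multi-scale features, and query design gets its own row. Matching instability (symptom 2) has no row of its own; denoising queries address it together with sparse supervision. Keep this map in mind as you read: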

Problem               | Vanilla DETR | Variant that fixes it
----------------------|--------------|-----------------------------
1. Sparse supervision | ~1 per gt    | DN-DETR, DINO-DETR
2. Slow attention     | O(L²) full   | Deformable DETR
3. Multi-scale        | only /32     | Deformable DETR
4. Query design       | learned PE   | Conditional / DAB / DINO

Genealogy at a glance

Vanilla DETR (2020, 500 epochs)
    │
    ├── Deformable DETR (2020, 50 epochs)
    │     └── deformable attention + multi-scale
    │
    ├── Conditional DETR (2021)
    │     └── content-dependent query positional encoding
    │
    ├── DAB-DETR (2022)
    │     └── object queries as explicit 4D anchor boxes
    │
    ├── DN-DETR (2022, 50 epochs)
    │     └── denoising queries: noisy GTs as extra training queries
    │
    └── DINO-DETR (2022, 12 epochs)
          └── contrastive denoising + mixed query selection + look-forward-twice
                │
                ├── LW-DETR (2024)
                │     └── DETR + ViT backbone + multi-scale efficient training
                │
                └── RF-DETR (2025)
                      └── real-time variant: Roboflow's open-source production-ready DETR

Two things to notice as you read:

  • Each variant claims a faster training schedule than the previous (500 → 50 → 12 epochs).
  • LW-DETR and RF-DETR are combinations of the earlier tricks, not new tricks themselves. Understanding the intermediate methods is mandatory to understand them.

1. Deformable DETR — deformable attention + multi-scale

The two contributions:

  1. Deformable attention: each query attends to a small, learnable set of keys (~4 per head per scale), not all L spatial tokens.
  2. Multi-scale features: features at /8, /16, /32 scales (a feature pyramid), not just /32.
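A minimal, framework-free sketch of the sampling idea for one query, one head, and one scale. It assumes a single-channel feature map and takes the offsets and attention logits as given inputs; in the real model both are predicted from the query by linear layers, and an optimized kernel does the sampling. All function names here are illustrative.

```python
import math

def bilinear_sample(feat, x, y):
    """Bilinearly interpolate a 2D grid feat[row][col] at continuous (x, y); outside is zero."""
    h, w = len(feat), len(feat[0])
    x0, y0 = math.floor(x), math.floor(y)
    value = 0.0
    for yi in (y0, y0 + 1):
        for xi in (x0, x0 + 1):
            if 0 <= xi < w and 0 <= yi < h:
                weight = (1 - abs(x - xi)) * (1 - abs(y - yi))
                value += weight * feat[yi][xi]
    return value

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def deformable_attend(feat, ref_xy, offsets, weight_logits):
    """Weighted sum over a few sampled points around a reference point."""
    weights = softmax(weight_logits)           # attention weights over the sampled points only
    samples = [bilinear_sample(feat, ref_xy[0] + dx, ref_xy[1] + dy) for dx, dy in offsets]
    return sum(wt * s for wt, s in zip(weights, samples))

# Toy 4x4 feature map whose value at (row r, col c) is 4r + c.
feat = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
offsets = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (-1.0, -1.0)]   # 4 sampling points
out = deformable_attend(feat, ref_xy=(1.0, 1.0), offsets=offsets, weight_logits=[0.0] * 4)
print(out)
```

The query touches 4 locations regardless of how large the feature map is, which is where the cost saving comes from.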

Q1. Vanilla DETR’s encoder self-attention is O(L²) and its decoder cross-attention is O(N·L), where L = H' × W' and N is the number of queries. What’s the complexity of deformable attention, and why?

Q2. Why does restricting each query to ~4 keys help convergence, not just inference speed? (Hint: think about the matcher’s stability — fewer, more focused attention targets means…)

Q3. Why does multi-scale help with small objects? Compare to vanilla DETR’s single /32 feature map.

Q4. What’s a deformation? Each query predicts both where to look (offset from a reference point) and how much to weight each look (attention weight). Sketch this with a small example.

2. Conditional DETR — content-dependent query PE (skim)

Central observation: in vanilla DETR, the cross-attention’s query positional component is fixed (the learned query_embed doesn’t depend on the image content). This means queries can only weakly localize.

The fix: make the query’s positional component content-dependent by deriving it from the decoder’s current content stream (tgt).

Not load-bearing for you to implement, but two things to remember when reading later DETR variants:

Q1. Why does content-dependent query PE matter for cross-attention specifically (vs self-attention)?

Q2. This idea reappears in DAB-DETR (which makes the query a 4D box) and DINO-DETR. What’s the through-line?

3. DAB-DETR — queries as 4D anchor boxes (skim)

Central observation: the learned query_embed in vanilla DETR is mysterious — it’s a 256-d vector with no human-interpretable meaning. DAB-DETR replaces it with a 4D anchor box (cx, cy, w, h) per query, which is initialized and refined through the decoder layers.

Each query carries its anchor box as state. The cross-attention uses the box’s position as a spatial prior on where to look in the encoder memory.
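One common way to refine a box layer by layer is to add a predicted delta in logit space and squash back with a sigmoid, so the box stays normalized. The sketch below assumes that scheme; the per-layer deltas are made-up numbers standing in for what a decoder layer's box head would predict.

```python
import math

def inverse_sigmoid(p, eps=1e-5):
    """Map a value in (0, 1) back to logit space, clamped for numerical safety."""
    p = min(max(p, eps), 1 - eps)
    return math.log(p / (1 - p))

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def refine_anchor(anchor, delta):
    """Update a normalized (cx, cy, w, h) anchor in logit space so it stays inside (0, 1)."""
    return tuple(sigmoid(inverse_sigmoid(a) + d) for a, d in zip(anchor, delta))

anchor = (0.5, 0.5, 0.2, 0.2)   # initial anchor carried by one query
# Hypothetical deltas, one per decoder layer.
per_layer_deltas = [(0.4, -0.2, 0.3, 0.1), (0.1, 0.0, 0.1, 0.0), (0.0, 0.0, 0.0, 0.0)]
for layer_idx, delta in enumerate(per_layer_deltas):
    anchor = refine_anchor(anchor, delta)
    print(layer_idx, [round(v, 3) for v in anchor])
```

The refined anchor from one layer becomes the spatial prior for the next layer's cross-attention.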

Q. This makes DETR look more like a classical anchor-based detector — but in a learnable, queryable form. What’s preserved from set prediction (vs anchor methods)?

4. DN-DETR — denoising queries ⭐ THE KEY METHOD

This is the most important method in this week’s reading. You already spent time understanding it in our Week 1 conversations — now read the paper to lock it in.

Central trick: at training time, in addition to the N object queries, inject extra queries built from noisy versions of the ground-truth boxes. Each noisy query is pre-assigned to its corresponding clean gt — bypassing the Hungarian matcher entirely for these queries. The model is trained to denoise them (recover the clean gt from the noisy input).

The geometry:

  • For each gt: make K noisy copies (jitter box coords + sometimes flip class).
  • These K queries are all trained as positives toward the same clean gt.
  • Effectively gives each gt ~K stable positives per step instead of 1 unstable positive.
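The noising step can be sketched in plain Python as follows. This is a simplified illustration, not the paper's code: it uses a single `box_noise_scale` for both the center shift and the size jitter (the paper parameterizes them separately), and `make_noisy_copies` is a hypothetical helper name.

```python
import random

def make_noisy_copies(gt_box, gt_label, num_classes, num_groups=5,
                      box_noise_scale=0.4, label_noise_ratio=0.5, rng=None):
    """Return num_groups noisy (label, box) copies of one gt; box is normalized (cx, cy, w, h)."""
    rng = rng or random.Random()
    cx, cy, w, h = gt_box
    copies = []
    for _ in range(num_groups):
        # Center shift: at most scale * half the box size, so for scale <= 1
        # the noisy center stays inside the gt box.
        ncx = cx + rng.uniform(-1, 1) * box_noise_scale * w / 2
        ncy = cy + rng.uniform(-1, 1) * box_noise_scale * h / 2
        # Size jitter: multiply width and height by a factor in [1 - scale, 1 + scale].
        nw = w * (1 + rng.uniform(-1, 1) * box_noise_scale)
        nh = h * (1 + rng.uniform(-1, 1) * box_noise_scale)
        box = tuple(min(max(v, 0.0), 1.0) for v in (ncx, ncy, nw, nh))
        # Label flip: with some probability, replace the class with a random one
        # (which may happen to equal the original class).
        label = gt_label
        if rng.random() < label_noise_ratio:
            label = rng.randrange(num_classes)
        copies.append((label, box))
    return copies

for label, box in make_noisy_copies((0.5, 0.5, 0.2, 0.4), gt_label=3, num_classes=10,
                                    rng=random.Random(0)):
    print(label, [round(v, 3) for v in box])
```

Every copy keeps a pointer to the same clean gt as its training target; only the query input is noised.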

The architectural detail: attention mask prevents denoising queries from leaking gt info to regular object queries.

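The result: a 50-epoch schedule, vs vanilla DETR’s 500. Note that the baselines DN-DETR builds on already train at that schedule; the paper’s own claim is that denoising matches its baseline’s accuracy in about half the training epochs.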

Q1. State the core mechanism in your own words (4 sentences max).

Q2. Why is the attention mask necessary? What would happen without it?

Q3. Denoising queries are training-only. Why? What changes between training and inference?

Q4. Compare denoising to anchor-based detection’s “many positive anchors per gt”. What does each provide, and what’s the same / different?

Q5. In §3.2, the paper says noise is parameterized by two scalars: λ₁ (center) and λ₂ (scale). What happens at λ₁=0, λ₂=0? At very high values?

5. DINO-DETR — contrastive denoising + mixed query selection

Note on naming: this is not the SSL DINO (Caron et al.) — it’s a confusingly-named DETR variant from Zhang et al. 2022. Often just called “DINO” in detection papers.

Three improvements over DN-DETR:

  1. Contrastive denoising — in addition to positive denoising queries (small noise → predict the gt), add negative denoising queries (larger noise → predict “no object”). Teaches the model when to abstain.
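  2. Mixed query selection: initialize only the positional part of each object query (its anchor box) from the encoder’s top-confidence features, while the content part stays a learned embedding as before. That split is the “mixed” in the name.
  3. Look-forward-twice: the refined box passed from one decoder layer to the next is not detached during training, so each layer’s box-refinement parameters receive gradients from both its own loss and the next layer’s loss.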
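The positive/negative split in contrastive denoising can be sketched as two noise bands around the gt. This is a simplified illustration that perturbs only the box center; the thresholds `pos_scale` and `neg_scale` and the helper name are assumptions made for the example.

```python
import random

def contrastive_pair(gt_box, pos_scale=0.4, neg_scale=0.8, rng=None):
    """Return (positive_box, negative_box): small center noise vs larger center noise."""
    rng = rng or random.Random()
    cx, cy, w, h = gt_box

    def shift(lo, hi):
        # Noise magnitude in [lo, hi] with a random sign, relative to half the box size.
        sx = rng.choice((-1, 1)) * rng.uniform(lo, hi) * w / 2
        sy = rng.choice((-1, 1)) * rng.uniform(lo, hi) * h / 2
        return (cx + sx, cy + sy, w, h)

    positive = shift(0.0, pos_scale)        # training target: the clean gt box and class
    negative = shift(pos_scale, neg_scale)  # training target: "no object"
    return positive, negative

pos, neg = contrastive_pair((0.5, 0.5, 0.2, 0.2), rng=random.Random(0))
print("positive:", [round(v, 3) for v in pos])
print("negative:", [round(v, 3) for v in neg])
```

The negative sits close to a real object but is labeled background, which is exactly the near-miss case a detector tends to duplicate.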

The result: 50 epochs (DN-DETR) → 12 epochs.

Q1. Why are negative denoising queries useful? What’s the failure mode they prevent?

Q2. How does mixed query selection differ from DAB-DETR’s anchor queries? Which is more flexible?

Q3. Why is DINO-DETR the reference baseline that almost every recent DETR variant is benchmarked against?

6. LW-DETR — lightweight, real-time DETR

The premise: DINO-DETR is accurate but heavy. LW-DETR aims to be faster than YOLO at similar accuracy while keeping DETR’s set-prediction simplicity.

Headline tricks:

  1. Pre-trained ViT backbone — replace CNN with a lightweight ViT (e.g., ViT-S/16). Cheaper than the ResNet stacks used in earlier DETR variants for the same accuracy.
  2. Multi-scale feature extraction — but cleverly aggregated to avoid the cost of a full FPN.
  3. IoU-aware classification loss — classification confidence is calibrated against IoU (so highly-confident-but-poorly-localized predictions are penalized).
  4. Heavy DETR-side compression — fewer queries, smaller decoder, distilled to a small model.
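The idea behind item 3 can be illustrated with the simplest IoU-aware target: use the IoU between the predicted box and its gt as a soft label for binary cross-entropy instead of a hard 1. This is a generic sketch of the concept and an assumption made for illustration; LW-DETR’s exact loss is defined in the paper and differs in detail.

```python
import math

def box_iou_cxcywh(a, b):
    """IoU of two boxes given as (cx, cy, w, h)."""
    ax0, ay0, ax1, ay1 = a[0] - a[2] / 2, a[1] - a[3] / 2, a[0] + a[2] / 2, a[1] + a[3] / 2
    bx0, by0, bx1, by1 = b[0] - b[2] / 2, b[1] - b[3] / 2, b[0] + b[2] / 2, b[1] + b[3] / 2
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union

def iou_aware_bce(logit, pred_box, gt_box, eps=1e-7):
    """BCE for one matched prediction, with IoU as the soft target instead of 1.0."""
    target = box_iou_cxcywh(pred_box, gt_box)
    p = 1 / (1 + math.exp(-logit))
    return -(target * math.log(p + eps) + (1 - target) * math.log(1 - p + eps))

gt = (0.5, 0.5, 0.2, 0.2)
well_localized = (0.5, 0.5, 0.2, 0.2)
poorly_localized = (0.6, 0.5, 0.2, 0.2)   # shifted by half a box width
print("confident + good box:", iou_aware_bce(3.0, well_localized, gt))
print("confident + poor box:", iou_aware_bce(3.0, poorly_localized, gt))
```

With a hard target of 1.0 both predictions would receive the same small loss; with the IoU target the confident but poorly localized one is penalized.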

Q1. Why does using a ViT backbone work well here, given that vanilla DETR used ResNet?

Q2. IoU-aware classification — what’s the problem with vanilla DETR’s classification confidence?

Q3. The paper reports real-time speeds (>60 FPS) on consumer hardware. Where does that speedup come from — backbone, decoder, both, neither?

Q4. Is LW-DETR a fundamentally new method or an engineering recombination of earlier tricks?

7. RF-DETR — Roboflow’s real-time, open-source DETR

The premise: practical, ready-to-deploy DETR with Roboflow’s polish on top of LW-DETR-style ideas. Open source, with checkpoints and a clean training/inference API.

Not a single research paper — more of a release that combines proven tricks. Read the README and blog post rather than chasing a PDF.

What it inherits from earlier variants:

  • DINO-DETR-style denoising and contrastive denoising.
  • Deformable attention.
  • LW-DETR-style pre-trained ViT backbone.
  • Multi-scale features.

Q1. What gap in the open-source DETR ecosystem did RF-DETR fill?

Q2. From a practitioner’s perspective, when would you reach for RF-DETR vs DINO-DETR (research codebase) vs a YOLO variant?

Q3. The repo provides pretrained models. Run inference on a few of your toy synthetic images using their pretrained checkpoint and compare the predictions to your vanilla DETR’s. What’s the gap?

Comparison table — fill this in as you read

Method           | Year | Epochs (COCO) | COCO mAP | Key trick                                   | Speed (FPS)
-----------------|------|---------------|----------|---------------------------------------------|------------
Vanilla DETR     | 2020 | 500           |          | bipartite matching, set prediction          |
Deformable DETR  | 2020 | 50            |          | deformable attention + multi-scale          |
Conditional DETR | 2021 | 50            |          | content-dependent query PE                  |
DAB-DETR         | 2022 | 50            |          | queries as 4D anchor boxes                  |
DN-DETR          | 2022 | 50            |          | denoising queries (pre-matched positives)   |
DINO-DETR        | 2022 | 12–24         |          | contrastive denoising + mixed query select  |
LW-DETR          | 2024 |               |          | ViT backbone + efficient multi-scale        | 60+
RF-DETR          | 2025 |               |          | open-source LW-DETR + denoising polish      |

Fill numbers from each paper’s main results table. mAP is on COCO val2017 unless noted. Speed in FPS on a single A100 or V100 — note which GPU each paper used.

Implementation exercise — extend vanilla_detr.py with denoising queries

Goal: add the single most impactful trick (DN-DETR’s denoising) on top of your Week 1 implementation and observe the convergence speedup.

From scratch, this is a moderate-sized change (~100 lines). Step-by-step plan:

Step 1 — build the noisy queries

Given a batch of targets, build noisy versions of the gt boxes and labels:

def build_denoising_queries(targets, num_classes, num_groups=5,
                            box_noise_scale=0.4, label_noise_ratio=0.5):
    """For each image and each of num_groups noise groups, make a noisy copy of every gt.

    Returns:
        dn_queries:  [B, K, d_model]  — content stream (zero-initialized)
        dn_query_pos:[B, K, d_model]  — positional embedding built from noisy box + class
        dn_targets:  list of (clean labels, clean boxes) per image, repeated num_groups times
        dn_attn_mask:[K+N, K+N]       — attention mask preventing leakage to regular queries
    """
    ...

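Where K = num_groups × M_max, with M_max the largest gt count among the images in the batch (so K varies per batch). Images with fewer gts are padded up to K so the tensors stay rectangular.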

Q. What’s the encoding from a (noisy class, noisy box) tuple into a d_model vector? (Hint: class embed + box positional embed, concatenated or summed.)
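One possible encoding, sketched without a framework: give each of the four box coordinates a sinusoidal embedding of d_model / 4 dimensions, concatenate them, and sum with a class embedding. The random `class_table` stands in for nn.Embedding(num_classes + 1, d_model), and the function names and temperature are assumptions for this example.

```python
import math
import random

def sine_embed(value, num_feats, temperature=10000.0):
    """Sinusoidal embedding of one scalar in [0, 1] into num_feats dims (sin/cos interleaved)."""
    out = []
    for i in range(num_feats // 2):
        freq = temperature ** (2 * i / num_feats)
        angle = 2 * math.pi * value / freq
        out.extend((math.sin(angle), math.cos(angle)))
    return out

def encode_noisy_query(label, box, class_table, d_model=256):
    """Sum of a class embedding and a box embedding; d_model must be divisible by 8."""
    box_emb = [x for coord in box for x in sine_embed(coord, d_model // 4)]
    return [c + b for c, b in zip(class_table[label], box_emb)]

rng = random.Random(0)
num_classes, d_model = 3, 16
# Stand-in for a learned embedding table: one random row per class (+1 spare index).
class_table = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(num_classes + 1)]
query = encode_noisy_query(1, (0.5, 0.5, 0.2, 0.4), class_table, d_model)
print(len(query))
```

Summing keeps the vector at d_model; concatenating would need either half-size embeddings or a projection back to d_model.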

# TODO — build_denoising_queries(targets, ...)
# 1. For each image's gts, create num_groups noisy copies (jitter box + sometimes flip class)
# 2. Encode each noisy (class, box) as a d_model vector:
#    - class_emb: nn.Embedding(num_classes + 1, d_model)
#    - box positional: sinusoidal PE on (cx, cy, w, h)
#    - sum or concat to get the query positional embedding
# 3. Build the attention mask (K+N x K+N) such that:
#    - DN queries within the same group can see each other
#    - DN queries from different groups cannot see each other
#    - Regular object queries cannot see DN queries
#    - DN queries cannot see regular object queries
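The four mask rules above can be sketched as a boolean matrix, shown here with nested lists so it runs without PyTorch. It assumes True means "blocked" (the convention for boolean masks in torch.nn.MultiheadAttention), that the K denoising queries come first, and that they are laid out group by group; `build_dn_attn_mask` is a hypothetical helper.

```python
def build_dn_attn_mask(num_gt, num_groups, num_queries):
    """Boolean [K+N, K+N] mask, True = blocked, with the K denoising queries placed first."""
    k = num_gt * num_groups
    total = k + num_queries
    mask = [[False] * total for _ in range(total)]
    for q in range(total):            # row: the query doing the attending
        for kv in range(total):       # column: the query being attended to
            q_is_dn, kv_is_dn = q < k, kv < k
            if q_is_dn and kv_is_dn:
                # DN queries see only their own noise group.
                mask[q][kv] = (q // num_gt) != (kv // num_gt)
            else:
                # Block every DN <-> regular pair; regular <-> regular stays open.
                mask[q][kv] = q_is_dn != kv_is_dn
    return mask

mask = build_dn_attn_mask(num_gt=2, num_groups=2, num_queries=3)
for row in mask:
    print("".join("x" if blocked else "." for blocked in row))
```

Only the "regular queries cannot see DN queries" rule is needed to stop gt leakage; blocking the reverse direction as well is stricter than necessary, since regular queries carry no gt information.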

Step 2 — modify DETR.forward to inject denoising queries

The decoder receives two streams of queries concatenated: the DN queries (size K) + the regular object queries (size N). Total decoder sequence length: K + N.

After the decoder, split the output back:

  • First K tokens → DN predictions (used for denoising loss).
  • Last N tokens → regular predictions (used for bipartite-matched loss + inference).

Skeleton:

def forward(self, images, targets=None):
    # ... backbone, PE, encoder as before (defines B, memory, pos_flat) ...

    obj_pos = self.query_pos_emb.weight.unsqueeze(0).expand(B, -1, -1)   # [B, N, d]
    obj_tgt = memory.new_zeros(B, self.num_queries, self.d_model)        # [B, N, d]

    if self.training and targets is not None:
        # Unpack in the order given by the build_denoising_queries docstring;
        # dn_meta holds the clean per-image targets (dn_targets in the docstring).
        dn_tgt, dn_pos, dn_meta, dn_attn_mask = build_denoising_queries(targets, ...)
        K = dn_pos.shape[1]

        # Build full query stack: [dn_queries, object_queries]
        query_pos = torch.cat([dn_pos, obj_pos], dim=1)   # [B, K+N, d]
        tgt = torch.cat([dn_tgt, obj_tgt], dim=1)         # [B, K+N, d]
    else:
        # Inference (or no targets): no DN queries, no mask, nothing to report.
        K, dn_meta, dn_attn_mask = 0, None, None
        query_pos, tgt = obj_pos, obj_tgt

    dec_outs = self.decoder(tgt, memory, pos_flat, query_pos, attn_mask=dn_attn_mask)
    # ... apply heads: all_logits and all_boxes are [num_layers, B, K+N, ...] ...

    # Split predictions back along the query axis
    dn_logits, obj_logits = all_logits[:, :, :K, :], all_logits[:, :, K:, :]
    dn_boxes, obj_boxes = all_boxes[:, :, :K, :], all_boxes[:, :, K:, :]

    out = {
        'pred_logits': obj_logits[-1], 'pred_boxes': obj_boxes[-1],
        'aux_outputs': [...],
    }
    if K > 0:
        # Only present during training; inference returns the vanilla DETR dict.
        out['dn_outputs'] = {'pred_logits': dn_logits[-1], 'pred_boxes': dn_boxes[-1],
                             'dn_meta': dn_meta, 'aux': [...]}
    return out

Q. Why does forward need to receive targets during training? What does this break compared to vanilla DETR?

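Q. Modify TransformerDecoderLayer to accept an attn_mask and pass it to the self-attention call only. Why is the cross-attention left unmasked? (Hint: the mask is [K+N, K+N], while cross-attention keys are the encoder memory, which carries no gt information.)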

# TODO — modify DETR.forward to inject DN queries at training time
# Also modify TransformerDecoderLayer to accept an attn_mask

Step 3 — add the denoising loss

DN queries don’t go through the matcher. Each DN query has a known target (the clean gt it was noised from). Loss:

  • Classification: cross-entropy with the clean class as target (regardless of noisy input class).
  • Box L1 + GIoU: against the clean box.

Same loss components as the main loss, just with explicit indices instead of matched ones.
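For a single pre-paired DN query, the loss could look like the sketch below. It is written in plain Python for clarity; the weights 1 / 5 / 2 are assumed to mirror a typical DETR configuration and should match whatever your DetrLoss uses, and `dn_loss_single` is a hypothetical name.

```python
import math

def to_xyxy(box):
    cx, cy, w, h = box
    return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2

def giou(a, b):
    """Generalized IoU of two (cx, cy, w, h) boxes."""
    ax0, ay0, ax1, ay1 = to_xyxy(a)
    bx0, by0, bx1, by1 = to_xyxy(b)
    inter = max(0.0, min(ax1, bx1) - max(ax0, bx0)) * max(0.0, min(ay1, by1) - max(ay0, by0))
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    hull = (max(ax1, bx1) - min(ax0, bx0)) * (max(ay1, by1) - min(ay0, by0))
    return inter / union - (hull - union) / hull

def dn_loss_single(logits, pred_box, clean_label, clean_box,
                   w_cls=1.0, w_l1=5.0, w_giou=2.0):
    """Loss for one DN query against its known clean gt; no matcher involved."""
    # Cross-entropy against the clean class, via a numerically stable log-sum-exp.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    ce = log_z - logits[clean_label]
    l1 = sum(abs(p - g) for p, g in zip(pred_box, clean_box))
    return w_cls * ce + w_l1 * l1 + w_giou * (1 - giou(pred_box, clean_box))

clean_box = (0.5, 0.5, 0.2, 0.2)
print(dn_loss_single([2.0, 0.1, -1.0], (0.52, 0.5, 0.2, 0.2), 0, clean_box))
```

In the batched version, the explicit (query index, gt index) pairs from dn_meta take the place of the indices the Hungarian matcher would normally return.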

Q. Why don’t DN queries get a “no object” class option in the loss? (Hint: they’re always pre-paired to a real gt.)

# TODO — add denoising loss component
# Refactor DetrLoss to also compute losses on outputs['dn_outputs'] using the metadata.
# Total loss = main + aux + dn (and dn_aux for each decoder layer)

Step 4 — train and compare convergence

Train your DN-DETR with the same hyperparameters as your vanilla DETR run. Compare the class loss curve.

Expected: class loss should drop faster (and to a lower floor) than vanilla DETR. Box losses should also drop faster as queries specialize more quickly.

Plot to make: loss curves of vanilla vs DN-DETR side by side, log-scaled y-axis. Identify the point where DN-DETR’s class loss crosses vanilla’s eventual floor.
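To locate the crossing point numerically rather than by eye, a small helper like the one below may be useful. The two loss curves here are synthetic exponentials made up purely to exercise the function; they are not training results. Swap in your logged loss lists.

```python
def moving_average(values, window=20):
    """Trailing moving average; early steps average over however many values exist."""
    out, running = [], 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            running -= values[i - window]
        out.append(running / min(i + 1, window))
    return out

def first_crossing(dn_losses, vanilla_losses, window=20, tail=50):
    """First step where the smoothed DN loss <= vanilla's floor (mean of its last `tail` steps)."""
    tail_values = vanilla_losses[-tail:]
    floor = sum(tail_values) / len(tail_values)
    for step, v in enumerate(moving_average(dn_losses, window)):
        if v <= floor:
            return step, floor
    return None, floor

# Synthetic placeholder curves (NOT real results): replace with your logged losses.
vanilla_curve = [2.0 * 0.99 ** t + 0.3 for t in range(500)]
dn_curve = [2.0 * 0.97 ** t + 0.2 for t in range(500)]
step, floor = first_crossing(dn_curve, vanilla_curve)
print("vanilla floor:", round(floor, 4), "DN crosses it at step:", step)
```

Smoothing matters here: raw per-step losses are noisy enough that a single lucky batch would otherwise register as the crossing.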

# TODO — train DN-DETR, plot loss curves
# 1. Train vanilla_detr (already done in Week 1) — log losses to a list
# 2. Train dn_detr — log losses to a list
# 3. plt.plot both, log-y

Exercises

Pick one or two:

  1. Off-the-shelf comparison. Install RF-DETR via pip install rfdetr (or follow their repo). Run inference on your toy synthetic images and compare predictions visually + numerically to your vanilla DETR output.

  2. Negative denoising queries (DINO-DETR slice). Extend your DN-DETR implementation to also include negative denoising queries (larger noise, target = “no object”). Does the model become better at rejecting near-misses?

  3. Query specialization plot. For your trained DN-DETR, plot a 2D scatter of (avg cx, avg cy) per query across the validation set. You should see queries specializing on different image regions — a phenomenon that emerges much earlier with denoising than without.

  4. Deformable attention prototype. Write a from-scratch DeformableAttention module (just the formula — no need for the optimized CUDA kernel). Replace one of your decoder’s cross-attention sublayers with it and compare convergence.

Synthesis — connecting the dots

Q1. Of the four problems with vanilla DETR (sparse supervision, slow attention, no multi-scale, weak query design), which is the most fundamental? Which method most directly attacks it?

Q2. DN-DETR is sometimes described as “putting anchors back into DETR.” Argue the case for and against this framing.

Q3. Why hasn’t DETR-family killed YOLO yet, despite SOTA accuracy on COCO since DINO-DETR? Speed, ecosystem, ease-of-deployment, or something else?

End-of-week recap

Write a ≤250-word summary you could send to a colleague who knows vanilla DETR but hasn’t read these papers. Cover:

  1. The three core problems with vanilla DETR (recap).
  2. The two highest-impact tricks (one for attention, one for supervision) that made DETR practical.
  3. Why DN-DETR / DINO-DETR are the architectural defaults today.
  4. What LW-DETR / RF-DETR add on top of DINO-DETR.
  5. The one paper you’d recommend if your colleague only had time for one.


Next week (Module 2 / Week 3): segmentation and pose — MaskDINO (segmentation), DETR-Pose (pose estimation). Same set-prediction framing, different prediction targets.