Module 2 / Week 2 — Building on top of DETR

Goal: understand the lineage of DETR variants that fixed vanilla DETR’s slow convergence and weak small-object detection. Read each paper, identify the one trick that defines it, and extend your vanilla_detr.py with the single most impactful trick (denoising queries).

This week’s reading (in chronological order; each builds on the previous):
- Zhu et al., Deformable DETR (2020): https://arxiv.org/abs/2010.04159
- Meng et al., Conditional DETR (2021): https://arxiv.org/abs/2108.06152 (skim)
- Liu et al., DAB-DETR (2022): https://arxiv.org/abs/2201.12329 (skim)
- Li et al., DN-DETR (2022): https://arxiv.org/abs/2203.01305
- Zhang et al., DINO (the DETR variant, not the SSL DINO) (2022): https://arxiv.org/abs/2203.03605
- Chen et al., LW-DETR (2024): https://arxiv.org/abs/2406.03459
- Roboflow, RF-DETR (2025): https://github.com/roboflow/rf-detr (blog/repo)

What you should be able to do by the end of the week:
1. State, in one sentence each, the central convergence fix of Deformable DETR, DN-DETR, DINO-DETR, LW-DETR, and RF-DETR.
2. Explain deformable attention in your own words, including why it’s faster and better than dense cross-attention.
3. Implement denoising queries on top of your vanilla_detr.py, train, and observe the convergence speedup.
4. Fill in the comparison table (epochs to converge, COCO AP, key trick) for each variant.
5. Argue when you’d reach for LW-DETR vs RF-DETR vs DINO-DETR for a real project.

Recap — the vanilla DETR convergence problem

You felt this firsthand in Week 1. Three concrete symptoms:

Sparse supervision per gt. Bipartite matching assigns exactly one positive query per gt. That’s ~5 positives per image vs the ~150 that anchor-based methods provide, i.e. ~30× fewer gradient signals per object per step.
Matching instability early in training. Which query owns which gt flickers between steps — queries can’t specialize.
Global attention is slow and weak on small objects. Encoder self-attention is O(L²) over all L spatial tokens at a single feature scale (/32), so tiny objects get blurred away.

Result: vanilla DETR needs ~500 epochs on COCO to reach Faster R-CNN-level accuracy, and still loses to anchor-based methods on small objects.

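Every variant this week tackles one or more of those problems. The map below splits the third symptom into its two halves (attention cost and missing multi-scale features) and adds a fourth axis, query design. Keep it in mind as you read: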

Problem               | Vanilla DETR | Variant that fixes it
----------------------|--------------|-----------------------------
1. Sparse supervision | ~1 per gt    | DN-DETR, DINO-DETR
2. Slow attention     | O(L²) full   | Deformable DETR
3. Multi-scale        | only /32     | Deformable DETR
4. Query design       | learned PE   | Conditional / DAB / DINO

Genealogy at a glance

Vanilla DETR (2020, 500 epochs)
    │
    ├── Deformable DETR (2020, 50 epochs)
    │     └── deformable attention + multi-scale
    │
    ├── Conditional DETR (2021)
    │     └── content-dependent query positional encoding
    │
    ├── DAB-DETR (2022)
    │     └── object queries as explicit 4D anchor boxes
    │
    ├── DN-DETR (2022, 50 epochs)
    │     └── denoising queries: noisy GTs as extra training queries
    │
    └── DINO-DETR (2022, 12 epochs)
          └── contrastive denoising + mixed query selection + look-forward-twice
                │
                ├── LW-DETR (2024)
                │     └── DETR + ViT backbone + multi-scale efficient training
                │
                └── RF-DETR (2025)
                      └── real-time variant: Roboflow's open-source production-ready DETR

Two things to notice as you read:
- Each variant claims a faster training schedule than the previous (500 → 50 → 12 epochs).
- LW-DETR and RF-DETR are combinations of the earlier tricks, not new tricks themselves. Understanding the intermediate methods is mandatory to understand them.

1. Deformable DETR — deformable attention + multi-scale

The two contributions:
1. Deformable attention: each query attends to a small, learnable set of keys (~4 per head per scale), not all L spatial tokens.
2. Multi-scale features: features at /8, /16, /32 scales (a feature pyramid), not just /32.
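To make the first contribution concrete, here is a minimal single-scale sketch of deformable attention in PyTorch. It is a teaching version under several assumptions, not the paper's implementation: the class name `ToyDeformableAttention` is made up, offsets are predicted in pixel units of one feature map, there is no multi-scale sampling, no special offset initialization, and no optimized CUDA kernel (bilinear sampling is done with `F.grid_sample`).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyDeformableAttention(nn.Module):
    """Single-scale deformable attention (hypothetical teaching version)."""

    def __init__(self, d_model=32, n_heads=4, n_points=4):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.n_points = n_heads, n_points
        self.head_dim = d_model // n_heads
        # Each query predicts WHERE to look (2D offsets) and HOW MUCH to weight each look.
        self.offsets = nn.Linear(d_model, n_heads * n_points * 2)
        self.weights = nn.Linear(d_model, n_heads * n_points)
        self.value_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, query, ref_points, feat):
        # query:      [B, Nq, d]    query content
        # ref_points: [B, Nq, 2]    (x, y) reference points, normalized to [0, 1]
        # feat:       [B, d, H, W]  one feature map
        B, Nq, _ = query.shape
        _, _, H, W = feat.shape
        h, p = self.n_heads, self.n_points

        # Project values, then split channels into heads: [B*h, head_dim, H, W]
        v = self.value_proj(feat.flatten(2).transpose(1, 2))       # [B, HW, d]
        v = v.transpose(1, 2).reshape(B * h, self.head_dim, H, W)

        # Sampling locations = reference point + predicted offset (pixels -> normalized)
        off = self.offsets(query).view(B, Nq, h, p, 2)
        scale = torch.tensor([W, H], dtype=query.dtype, device=query.device)
        loc = ref_points[:, :, None, None, :] + off / scale        # [B, Nq, h, p, 2]

        # grid_sample expects (x, y) coordinates in [-1, 1]
        grid = (2 * loc - 1).permute(0, 2, 1, 3, 4).reshape(B * h, Nq, p, 2)
        sampled = F.grid_sample(v, grid, mode="bilinear",
                                padding_mode="zeros", align_corners=False)
        # sampled: [B*h, head_dim, Nq, p], i.e. only p keys per query per head

        # Attention weights come from the query alone (softmax over the p points)
        attn = self.weights(query).view(B, Nq, h, p).softmax(-1)
        attn = attn.permute(0, 2, 1, 3).reshape(B * h, 1, Nq, p)

        out = (sampled * attn).sum(-1)                             # [B*h, head_dim, Nq]
        out = out.view(B, h * self.head_dim, Nq).transpose(1, 2)   # [B, Nq, d]
        return self.out_proj(out)
```

Note that the work per query depends on `n_heads * n_points`, not on H × W: a larger feature map changes where samples can land, not how many are taken.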

Q1. Vanilla DETR’s encoder self-attention is O(L²) where L = H' × W'. What’s the complexity of deformable attention, and why?

In DETR, the CNN outputs a feature map of shape [1, 256, 4, 4], which we simply reshape into the “token view” [1, 16, 256]. In the transformer encoder, both the query and key elements are pixels of the feature map, and every pixel attends to every other pixel. The computational complexity of encoder self-attention is therefore O(H²W²C), which grows quadratically with the spatial size. In a real DETR (unlike our toy example), the CNN features would be around [1, 256, 25, 19], i.e. 475 tokens. In the decoder, self-attention among the N object queries costs O(2NC² + N²C), and cross-attention from the N queries to the feature map costs O(HWC² + NHWC), so the decoder grows quadratically with the number of queries but only linearly with the spatial size. DETR struggles with small objects; larger feature maps would help, but they would increase the encoder cost quadratically. Yet the attention maps DETR ends up learning are very sparse, focusing mostly on object extremities.

In Deformable DETR: “we present a deformable attention module”. Inspired by deformable convolution, the deformable attention module attends only to a small set of key sampling points around a reference point, regardless of the spatial size of the feature maps. (They achieve this by assigning only a small, fixed number of keys K to each query.)
The complexity of the deformable attention module is O(2N_q·C² + min(HW·C², N_q·K·C²)), where N_q is the number of queries. In the encoder N_q = HW, so the cost is linear in the spatial size; in the decoder’s cross-attention N_q = N, so the cost O(N·K·C²) does not depend on HW at all.
For the encoder, we go from:

  Vanilla:    O(2 HW·C² + (HW)²·C)                  →  dominant: O(H²W²C)
  Deformable: O(2 HW·C² + min(HW·C², HW·K·C²))      →  dominant: O(HW·C²)
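To get a feel for the two encoder formulas, the sketch below plugs in numbers. It is plain arithmetic on the big-O expressions with constants dropped, not a FLOP count. The function names are made up for this note, and C = 256, K = 4 are assumed values.

```python
def dense_attention_cost(h, w, c):
    """2·HW·C² + (HW)²·C: dense encoder self-attention over an h×w feature map."""
    hw = h * w
    return 2 * hw * c * c + hw * hw * c


def deformable_attention_cost(h, w, c, k, n_q=None):
    """2·N_q·C² + min(HW·C², N_q·K·C²); N_q defaults to HW (the encoder case)."""
    hw = h * w
    n_q = hw if n_q is None else n_q
    return 2 * n_q * c * c + min(hw * c * c, n_q * k * c * c)


if __name__ == "__main__":
    C, K = 256, 4  # assumed channel width and sampling points per query
    for h, w in [(25, 19), (50, 38), (100, 76)]:
        dense = dense_attention_cost(h, w, C)
        deform = deformable_attention_cost(h, w, C, K)
        print(f"{h}x{w}: dense={dense:,}  deformable={deform:,}  "
              f"ratio={dense / deform:.1f}")
```

Doubling H and W multiplies the deformable expression by exactly 4 (linear in HW), while the (HW)² term of the dense expression grows by 16. That is why moving to /16 or /8 feature maps is affordable only with deformable attention.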

Q2. Why does restricting each query to ~4 keys help convergence, not just inference speed? (Hint: think about the matcher’s stability — fewer, more focused attention targets means…)

…

Q3. Why does multi-scale help with small objects? Compare to vanilla DETR’s single /32 feature map.

…

Q4. What’s a deformation? Each query predicts both where to look (offset from a reference point) and how much to weight each look (attention weight). Sketch this with a small example.

…

2. Conditional DETR — content-dependent query PE (skim)

Central observation: in vanilla DETR, the cross-attention’s query positional component is fixed (the learned query_embed doesn’t depend on the image content). This means queries can only weakly localize.

The fix: make the query’s positional component content-dependent by deriving it from the decoder’s current content stream (tgt).

Not load-bearing for you to implement, but two things to remember when reading later DETR variants:

Q1. Why does content-dependent query PE matter for cross-attention specifically (vs self-attention)?

…

Q2. This idea reappears in DAB-DETR (which makes the query a 4D box) and DINO-DETR. What’s the through-line?

…

3. DAB-DETR — queries as 4D anchor boxes (skim)

Central observation: the learned query_embed in vanilla DETR is mysterious — it’s a 256-d vector with no human-interpretable meaning. DAB-DETR replaces it with a 4D anchor box (cx, cy, w, h) per query, which is initialized and refined through the decoder layers.

Each query carries its anchor box as state. The cross-attention uses the box’s position as a spatial prior on where to look in the encoder memory.
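One way to turn an anchor box into something attention can use is a sinusoidal embedding of its four coordinates. The sketch below is an assumption-laden illustration rather than DAB-DETR's exact recipe: `box_sine_embedding` is a hypothetical helper, it gives each coordinate d_model/4 dimensions, and it omits the MLP that DAB-DETR applies on top of the encoding.

```python
import math

import torch


def box_sine_embedding(boxes, d_model=256, temperature=10000.0):
    """Map normalized (cx, cy, w, h) boxes [..., 4] to embeddings [..., d_model].

    Each coordinate gets d_model // 4 dimensions: half sines, half cosines.
    """
    assert d_model % 8 == 0, "need an even number of dims per coordinate"
    n_freq = d_model // 8
    # Geometric progression of wavelengths, as in standard sinusoidal PE
    dim_t = temperature ** (torch.arange(n_freq, dtype=torch.float32) / n_freq)
    angles = boxes.unsqueeze(-1) * 2 * math.pi / dim_t       # [..., 4, n_freq]
    emb = torch.cat([angles.sin(), angles.cos()], dim=-1)    # [..., 4, 2*n_freq]
    return emb.flatten(-2)                                   # [..., d_model]
```

Because the layout is one block per coordinate, two boxes that share cy, w and h but differ in cx differ only in the first d_model/4 dimensions. That makes the prior easy to inspect when debugging.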

Q. This makes DETR look more like a classical anchor-based detector — but in a learnable, queryable form. What’s preserved from set prediction (vs anchor methods)?

…

4. DN-DETR — denoising queries ⭐ THE KEY METHOD

This is the most important method in this week’s reading. You already spent time understanding it in our Week 1 conversations — now read the paper to lock it in.

Central trick: at training time, in addition to the N object queries, inject extra queries built from noisy versions of the ground-truth boxes. Each noisy query is pre-assigned to its corresponding clean gt — bypassing the Hungarian matcher entirely for these queries. The model is trained to denoise them (recover the clean gt from the noisy input).

The geometry:
- For each gt: make K noisy copies (jitter box coords + sometimes flip class). K here is the number of noise groups (num_groups in the exercise code), not the total number of DN queries.
- These K queries are all trained as positives toward the same clean gt.
- Effectively gives each gt ~K stable positives per step instead of 1 unstable positive.
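The jitter itself can be sketched in a few lines. This is a minimal version under stated assumptions: `add_dn_noise` and its parameter names are hypothetical, boxes are normalized (cx, cy, w, h) tensors on CPU, the center shift is bounded by λ₁·(w/2, h/2), each side is rescaled by a factor within 1 ± λ₂, and a flipped label is drawn uniformly from all classes.

```python
import torch


def add_dn_noise(boxes, labels, num_classes, lam_center=0.4, lam_scale=0.4,
                 label_noise_ratio=0.5, generator=None):
    """Return one noisy copy of every gt.

    boxes:  [M, 4] normalized (cx, cy, w, h)
    labels: [M] int64 class indices
    """
    cxcy, wh = boxes[:, :2], boxes[:, 2:]

    # Center shift: uniform within +/- lam_center * (w/2, h/2)
    shift = (torch.rand(cxcy.shape, generator=generator) * 2 - 1) * lam_center * wh / 2
    # Box scaling: each side multiplied by a factor within 1 +/- lam_scale
    factor = 1 + (torch.rand(wh.shape, generator=generator) * 2 - 1) * lam_scale
    noisy_boxes = torch.cat([cxcy + shift, wh * factor], dim=1).clamp(0.0, 1.0)

    # Label flipping: with probability label_noise_ratio, swap in a random class
    flip = torch.rand(labels.shape, generator=generator) < label_noise_ratio
    random_labels = torch.randint(0, num_classes, labels.shape, generator=generator)
    noisy_labels = torch.where(flip, random_labels, labels)
    return noisy_boxes, noisy_labels
```

Calling this num_groups times per image gives the K noisy copies. With all three noise parameters at 0 the function returns the clean gts unchanged, which is a useful sanity check before training.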

The architectural detail: attention mask prevents denoising queries from leaking gt info to regular object queries.

The result: 500 epochs → 50 epochs.

Q1. State the core mechanism in your own words (4 sentences max).

…

Q2. Why is the attention mask necessary? What would happen without it?

…

Q3. Denoising queries are training-only. Why? What changes between training and inference?

…

Q4. Compare denoising to anchor-based detection’s “many positive anchors per gt”. What does each provide, and what’s the same / different?

…

Q5. In §3.2, the paper says noise is parameterized by two scalars: `λ₁` (center) and `λ₂` (scale). What happens at `λ₁=0, λ₂=0`? At very high values?

…

5. DINO-DETR — contrastive denoising + mixed query selection

Note on naming: this is not the SSL DINO (Caron et al.) — it’s a confusingly-named DETR variant from Zhang et al. 2022. Often just called “DINO” in detection papers.

Three improvements over DN-DETR:

Contrastive denoising: in addition to positive denoising queries (small noise → predict the gt), add negative denoising queries (larger noise → predict “no object”). This teaches the model when to abstain.
Mixed query selection: initialize the positional part of each object query (its anchor box) from the encoder’s top-confidence features, while keeping the content part as a learned embedding, instead of learning both from scratch.
Look-forward-twice: during training, each decoder layer’s box refinement also receives gradients from the next layer’s box loss, rather than only from its own loss as in detached iterative box refinement.
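The contrastive part of the first item can be sketched as two noise bands around each gt. Everything named here is an assumption for illustration: `contrastive_dn_boxes` and `contrastive_dn_targets` are hypothetical helpers, the noise is applied directly to (cx, cy, w, h) in units of half the box size, the λ defaults are placeholders rather than the paper's settings, and index `num_classes` is assumed to be the “no object” class as in vanilla DETR.

```python
import torch


def contrastive_dn_boxes(boxes, lam_pos=0.4, lam_neg=0.8, generator=None):
    """Build a positive and a negative noisy copy of each gt box.

    boxes: [M, 4] normalized (cx, cy, w, h).
    Positive noise magnitude lies in [0, lam_pos); negative in [lam_pos, lam_neg).
    """
    # Noise unit per coordinate: half the box width / height
    half = torch.cat([boxes[:, 2:], boxes[:, 2:]], dim=1) / 2
    sign = torch.randint(0, 2, boxes.shape, generator=generator).to(boxes.dtype) * 2 - 1
    u = torch.rand(boxes.shape, generator=generator)

    pos = boxes + sign * (u * lam_pos) * half
    neg = boxes + sign * (lam_pos + u * (lam_neg - lam_pos)) * half
    return pos, neg


def contrastive_dn_targets(labels, num_classes):
    """Positives keep the clean class; negatives are trained toward 'no object'."""
    pos_targets = labels.clone()
    neg_targets = torch.full_like(labels, num_classes)  # assumed no-object index
    return pos_targets, neg_targets
```

The negatives sit just outside the positive band, so they are near-misses rather than random boxes. That is the case the model most needs to learn to reject.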

The result: 50 epochs (DN-DETR) → 12 epochs.

Q1. Why are negative denoising queries useful? What’s the failure mode they prevent?

…

Q2. How does mixed query selection differ from DAB-DETR’s anchor queries? Which is more flexible?

…

Q3. Why is DINO-DETR the reference baseline that almost every recent DETR variant is benchmarked against?

…

6. LW-DETR — lightweight, real-time DETR

The premise: DINO-DETR is accurate but heavy. LW-DETR aims to be faster than YOLO at similar accuracy while keeping DETR’s set-prediction simplicity.

Headline tricks:

Pre-trained ViT backbone — replace CNN with a lightweight ViT (e.g., ViT-S/16). Cheaper than the ResNet stacks used in earlier DETR variants for the same accuracy.
Multi-scale feature extraction — but cleverly aggregated to avoid the cost of a full FPN.
IoU-aware classification loss — classification confidence is calibrated against IoU (so highly-confident-but-poorly-localized predictions are penalized).
Heavy DETR-side compression — fewer queries, smaller decoder, distilled to a small model.

Q1. Why does using a ViT backbone work well here, given that vanilla DETR used ResNet?

…

Q2. IoU-aware classification — what’s the problem with vanilla DETR’s classification confidence?

…

Q3. The paper reports real-time speeds (>60 FPS) on consumer hardware. Where does that speedup come from — backbone, decoder, both, neither?

…

Q4. Is LW-DETR a fundamentally new method or an engineering recombination of earlier tricks?

…

7. RF-DETR — Roboflow’s real-time, open-source DETR

The premise: practical, ready-to-deploy DETR with Roboflow’s polish on top of LW-DETR-style ideas. Open source, with checkpoints and a clean training/inference API.

Not a single research paper — more of a release that combines proven tricks. Read the README and blog post rather than chasing a PDF.

What it inherits from earlier variants:
- DINO-DETR-style denoising and contrastive denoising.
- Deformable attention.
- LW-DETR-style pre-trained ViT backbone.
- Multi-scale features.

Q1. What gap in the open-source DETR ecosystem did RF-DETR fill?

…

Q2. From a practitioner’s perspective, when would you reach for RF-DETR vs DINO-DETR (research codebase) vs a YOLO variant?

…

Q3. The repo provides pretrained models. Run inference on a few of your toy synthetic images using their pretrained checkpoint and compare the predictions to your vanilla DETR’s. What’s the gap?

…

Comparison table — fill this in as you read

Method           | Year | Epochs (COCO) | COCO mAP | Key trick                                     | Speed (FPS)
-----------------|------|---------------|----------|-----------------------------------------------|------------
Vanilla DETR     | 2020 | 500           | …        | bipartite matching, set prediction            | …
Deformable DETR  | 2020 | 50            | …        | deformable attention + multi-scale            | …
Conditional DETR | 2021 | 50            | …        | content-dependent query PE                    | …
DAB-DETR         | 2022 | 50            | …        | queries as 4D anchor boxes                    | …
DN-DETR          | 2022 | 50            | …        | denoising queries (pre-matched positives)     | …
DINO-DETR        | 2022 | 12–24         | …        | contrastive denoising + mixed query selection | …
LW-DETR          | 2024 | …             | …        | ViT backbone + efficient multi-scale          | 60+
RF-DETR          | 2025 | …             | …        | open-source LW-DETR + denoising polish        | …

Fill numbers from each paper’s main results table. mAP is on COCO val2017 unless noted. Speed in FPS on a single A100 or V100 — note which GPU each paper used.

Implementation exercise — extend `vanilla_detr.py` with denoising queries

Goal: add the single most impactful trick (DN-DETR’s denoising) on top of your Week 1 implementation and observe the convergence speedup.

From scratch, this is a moderate-sized change (~100 lines). Step-by-step plan:

Step 1 — build the noisy queries

Given a batch of targets, build noisy versions of the gt boxes and labels:

def build_denoising_queries(targets, num_classes, num_queries, num_groups=5,
                            box_noise_scale=0.4, label_noise_ratio=0.5):
    """For each image and each of num_groups noise groups, make a noisy copy of every gt.

    Returns (in the order DETR.forward unpacks them):
        dn_query_pos: [B, K, d_model]  positional embedding built from noisy box + class
        dn_attn_mask: [K+N, K+N]       attention mask preventing leakage to regular queries
                                       (N = num_queries)
        dn_meta:      clean labels and clean boxes per image, repeated num_groups times

    The DN content stream is not returned: forward zero-initializes it together
    with the content stream of the regular queries.
    """
    ...

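Where K = num_groups × M_max, with M_max the largest number of gts in any image of the batch (so K varies from batch to batch, and images with fewer gts need padding to fit the [B, K, d_model] shape).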

Q. What’s the encoding from a (noisy class, noisy box) tuple into a d_model vector? (Hint: class embed + box positional embed, concatenated or summed.)

…

# TODO — build_denoising_queries(targets, ...)
# 1. For each image's gts, create num_groups noisy copies (jitter box + sometimes flip class)
# 2. Encode each noisy (class, box) as a d_model vector:
#    - class_emb: nn.Embedding(num_classes + 1, d_model)
#    - box positional: sinusoidal PE on (cx, cy, w, h)
#    - sum or concat to get the query positional embedding
# 3. Build the attention mask (K+N x K+N) such that:
#    - DN queries within the same group can see each other
#    - DN queries from different groups cannot see each other
#    - Regular object queries cannot see DN queries
#    - DN queries cannot see regular object queries
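The four mask rules above can be sketched as follows. This assumes PyTorch's boolean `attn_mask` convention (True = not allowed to attend, rows are queries, columns are keys) and a fixed number of gts per group; `build_dn_attn_mask` and its flag are hypothetical names. The fourth rule is exposed as a flag because the DN-DETR paper notes that letting DN queries see the regular queries does not affect performance, since the regular queries carry no gt information.

```python
import torch


def build_dn_attn_mask(num_groups, gts_per_group, num_queries,
                       block_dn_to_regular=True):
    """Boolean self-attention mask of shape [K+N, K+N]; True means 'blocked'.

    Layout of the query sequence: [group 0 | group 1 | ... | regular queries].
    """
    K = num_groups * gts_per_group
    total = K + num_queries
    mask = torch.zeros(total, total, dtype=torch.bool)  # start fully visible

    # Regular object queries (rows K:) must not see any DN query (cols :K),
    # otherwise gt information leaks into the matched branch.
    mask[K:, :K] = True

    # Each DN group sees itself but no other DN group.
    for g in range(num_groups):
        start, end = g * gts_per_group, (g + 1) * gts_per_group
        mask[start:end, :start] = True
        mask[start:end, end:K] = True

    # Optional: DN queries do not see the regular queries either.
    if block_dn_to_regular:
        mask[:K, K:] = True
    return mask
```

Printing `build_dn_attn_mask(2, 3, 4).int()` shows the block structure at a glance: two 3×3 blocks of zeros on the diagonal for the DN groups and a 4×4 block of zeros for the regular queries.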

Step 2 — modify `DETR.forward` to inject denoising queries

The decoder receives two streams of queries concatenated: the DN queries (size K) + the regular object queries (size N). Total decoder sequence length: K + N.

After the decoder, split the output back: - First K tokens → DN predictions (used for denoising loss). - Last N tokens → regular predictions (used for bipartite-matched loss + inference).

Skeleton:

def forward(self, images, targets=None):
    # ... backbone, PE, encoder as before (defines B, memory, pos_flat) ...

    # Regular object queries are always present.
    obj_pos = self.query_pos_emb.weight.unsqueeze(0).expand(B, -1, -1)  # [B, N, d]

    if self.training and targets is not None:
        dn_pos, dn_attn_mask, dn_meta = build_denoising_queries(targets, ...)
        K = dn_pos.shape[1]
        # Build full query stack: [dn_queries, object_queries]
        query_pos = torch.cat([dn_pos, obj_pos], dim=1)                 # [B, K+N, d]
    else:
        # Inference (or no targets): no DN queries, no mask, no metadata.
        K, dn_attn_mask, dn_meta = 0, None, None
        query_pos = obj_pos

    tgt = memory.new_zeros(B, K + self.num_queries, self.d_model)
    dec_outs = self.decoder(tgt, memory, pos_flat, query_pos, attn_mask=dn_attn_mask)
    # ... apply heads -> all_logits, all_boxes: [num_layers, B, K+N, ...] ...

    # Split predictions back (with K == 0 the DN slices are simply empty)
    dn_logits, obj_logits = all_logits[:, :, :K, :], all_logits[:, :, K:, :]
    dn_boxes,  obj_boxes  = all_boxes[:, :, :K, :],  all_boxes[:, :, K:, :]

    out = {
        'pred_logits': obj_logits[-1], 'pred_boxes': obj_boxes[-1],
        'aux_outputs': [...],
    }
    if K > 0:  # attach DN outputs only when DN queries were actually injected
        out['dn_outputs'] = {'pred_logits': dn_logits[-1], 'pred_boxes': dn_boxes[-1],
                             'dn_meta': dn_meta, 'aux': [...]}
    return out

Q. Why does forward need to receive targets during training? What does this break compared to vanilla DETR?

…

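Q. Modify TransformerDecoderLayer to accept an attn_mask and pass it to the self-attention call only. Why not to cross-attention as well? (Hint: the mask is [K+N, K+N], i.e. query-to-query; cross-attention is query-to-memory, and the encoder memory contains no gt information.)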

…

# TODO — modify DETR.forward to inject DN queries at training time
# Also modify TransformerDecoderLayer to accept an attn_mask
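Before wiring the mask into your decoder, it can help to check in isolation what it does to self-attention. The sketch below uses a bare `nn.MultiheadAttention` as a stand-in for the decoder's self-attention sublayer (your TransformerDecoderLayer may differ), with made-up sizes. It changes only the DN inputs and measures how much the regular-query outputs move.

```python
import torch
import torch.nn as nn


def regular_query_leak(K=4, N=3, d_model=16, use_mask=True, seed=0):
    """Max change in the regular-query outputs when ONLY the DN inputs change."""
    torch.manual_seed(seed)
    # Dropout defaults to 0.0, so the outputs are deterministic.
    self_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)

    mask = None
    if use_mask:
        mask = torch.zeros(K + N, K + N, dtype=torch.bool)  # True = blocked
        mask[K:, :K] = True  # regular queries cannot see DN queries

    x = torch.randn(1, K + N, d_model)           # [DN tokens | regular tokens]
    x_other = x.clone()
    x_other[:, :K] = torch.randn(1, K, d_model)  # replace only the DN tokens

    with torch.no_grad():
        out_a, _ = self_attn(x, x, x, attn_mask=mask)
        out_b, _ = self_attn(x_other, x_other, x_other, attn_mask=mask)
    return (out_a[:, K:] - out_b[:, K:]).abs().max().item()


if __name__ == "__main__":
    print("leak with mask:   ", regular_query_leak(use_mask=True))
    print("leak without mask:", regular_query_leak(use_mask=False))
```

With the mask, the regular outputs should be unaffected by the DN tokens up to floating-point noise. Without it they change, which is exactly the gt leakage the mask exists to prevent.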

Step 3 — add the denoising loss

DN queries don’t go through the matcher. Each DN query has a known target (the clean gt it was noised from). Loss:

Classification: cross-entropy with the clean class as target (regardless of noisy input class).
Box L1 + GIoU: against the clean box.

Same loss components as the main loss, just with explicit indices instead of matched ones.
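As a rough sketch of what “explicit indices instead of matched ones” looks like, the function below pairs DN prediction i directly with clean gt i. The names, the padded [B, K] layout with a `valid` mask, and the hand-rolled GIoU are assumptions made for this note. The weights 5 and 2 are vanilla DETR's default L1 and GIoU weights, so adjust them to whatever your DetrLoss uses.

```python
import torch
import torch.nn.functional as F


def cxcywh_to_xyxy(b):
    cx, cy, w, h = b.unbind(-1)
    return torch.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], dim=-1)


def paired_giou(a, b, eps=1e-7):
    """Element-wise GIoU between boxes a[i] and b[i], both (cx, cy, w, h)."""
    a, b = cxcywh_to_xyxy(a), cxcywh_to_xyxy(b)
    inter = (torch.min(a[..., 2:], b[..., 2:])
             - torch.max(a[..., :2], b[..., :2])).clamp(min=0).prod(-1)
    area_a = (a[..., 2:] - a[..., :2]).prod(-1)
    area_b = (b[..., 2:] - b[..., :2]).prod(-1)
    union = area_a + area_b - inter
    hull = (torch.max(a[..., 2:], b[..., 2:])
            - torch.min(a[..., :2], b[..., :2])).clamp(min=0).prod(-1)
    return inter / (union + eps) - (hull - union) / (hull + eps)


def denoising_loss(dn_logits, dn_boxes, clean_labels, clean_boxes, valid,
                   w_l1=5.0, w_giou=2.0):
    """No matcher: DN query i is trained toward clean gt i.

    dn_logits:    [B, K, num_classes + 1]
    dn_boxes:     [B, K, 4]
    clean_labels: [B, K] int64
    clean_boxes:  [B, K, 4]
    valid:        [B, K] bool, False on padded slots
    """
    n = valid.sum().clamp(min=1)
    loss_cls = F.cross_entropy(dn_logits[valid], clean_labels[valid], reduction="sum") / n
    loss_l1 = F.l1_loss(dn_boxes[valid], clean_boxes[valid], reduction="sum") / n
    loss_giou = (1 - paired_giou(dn_boxes[valid], clean_boxes[valid])).sum() / n
    return loss_cls + w_l1 * loss_l1 + w_giou * loss_giou
```

The same function can be applied to each auxiliary decoder layer's DN outputs, since the pairing does not change between layers.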

Q. Why don’t DN queries get a “no object” class option in the loss? (Hint: they’re always pre-paired to a real gt.)

…

# TODO — add denoising loss component
# Refactor DetrLoss to also compute losses on outputs['dn_outputs'] using the metadata.
# Total loss = main + aux + dn (and dn_aux for each decoder layer)

Step 4 — train and compare convergence

Train your DN-DETR with the same hyperparameters as your vanilla DETR run. Compare the class loss curve.

Expected: class loss should drop faster (and to a lower floor) than vanilla DETR. Box losses should also drop faster as queries specialize more quickly.

Plot to make: loss curves of vanilla vs DN-DETR side by side, log-scaled y-axis. Identify the point where DN-DETR’s class loss crosses vanilla’s eventual floor.

# TODO — train DN-DETR, plot loss curves
# 1. Train vanilla_detr (already done in Week 1) — log losses to a list
# 2. Train dn_detr — log losses to a list
# 3. plt.plot both, log-y
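For the crossing point, a small helper saves eyeballing the plot. This is one possible definition, not a standard one: `first_crossing` is a hypothetical name, and vanilla's “eventual floor” is taken as the mean of its last `tail` logged losses.

```python
def first_crossing(dn_losses, vanilla_losses, tail=5):
    """Find where DN-DETR's loss first reaches vanilla DETR's eventual floor.

    Returns (step, floor). step is the first index with dn_losses[step] <= floor,
    or None if the DN run never gets there.
    """
    if not vanilla_losses:
        raise ValueError("vanilla_losses must not be empty")
    last = vanilla_losses[-tail:]
    floor = sum(last) / len(last)  # mean of the final `tail` values
    for step, loss in enumerate(dn_losses):
        if loss <= floor:
            return step, floor
    return None, floor
```

Both lists should be logged at the same granularity (per epoch or per N steps) for the index to be comparable. On the plot, a horizontal line at `floor` and a vertical line at `step` make the speedup easy to read off.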

Exercises

Pick one or two:

Off-the-shelf comparison. Install RF-DETR via pip install rfdetr (or follow the install instructions in their repo). Run inference on your toy synthetic images and compare predictions visually + numerically to your vanilla DETR output.
Negative denoising queries (DINO-DETR slice). Extend your DN-DETR implementation to also include negative denoising queries (larger noise, target = “no object”). Does the model become better at rejecting near-misses?
Query specialization plot. For your trained DN-DETR, plot a 2D scatter of (avg cx, avg cy) per query across the validation set. You should see queries specializing on different image regions — a phenomenon that emerges much earlier with denoising than without.
Deformable attention prototype. Write a from-scratch DeformableAttention module (just the formula — no need for the optimized CUDA kernel). Replace one of your decoder’s cross-attention sublayers with it and compare convergence.

Synthesis — connecting the dots

Q1. Of the four problems with vanilla DETR (sparse supervision, slow attention, no multi-scale, weak query design), which is the most fundamental? Which method most directly attacks it?

…

Q2. DN-DETR is sometimes described as “putting anchors back into DETR.” Argue the case for and against this framing.

…

Q3. Why hasn’t DETR-family killed YOLO yet, despite SOTA accuracy on COCO since DINO-DETR? Speed, ecosystem, ease-of-deployment, or something else?

…

End-of-week recap

Write a ≤250-word summary you could send to a colleague who knows vanilla DETR but hasn’t read these papers. Cover:

The three core problems with vanilla DETR (recap).
The two highest-impact tricks (one for attention, one for supervision) that made DETR practical.
Why DN-DETR / DINO-DETR are the architectural defaults today.
What LW-DETR / RF-DETR add on top of DINO-DETR.
The one paper you’d recommend if your colleague only had time for one.

…

Next week (Module 2 / Week 3): segmentation and pose — MaskDINO (segmentation), DETR-Pose (pose estimation). Same set-prediction framing, different prediction targets.