# TODO — build_denoising_queries(targets, ...)
# 1. For each image's gts, create num_groups noisy copies (jitter box + sometimes flip class)
# 2. Encode each noisy (class, box) as a d_model vector:
# - class_emb: nn.Embedding(num_classes + 1, d_model)
# - box positional: sinusoidal PE on (cx, cy, w, h)
# - sum or concat to get the query positional embedding
# 3. Build the attention mask (K+N x K+N) such that:
# - DN queries within the same group can see each other
# - DN queries from different groups cannot see each other
# - Regular object queries cannot see DN queries
# - DN queries cannot see regular object queries

Module 2 / Week 2 — Building on top of DETR
Goal: understand the lineage of DETR variants that fixed vanilla DETR’s slow convergence and weak small-object detection. Read each paper, identify the one trick that defines it, and extend your vanilla_detr.py with the single most impactful trick (denoising queries).
This week’s reading (in chronological order — each builds on the previous):
- Zhu et al. — Deformable DETR (2020) — https://arxiv.org/abs/2010.04159
- Meng et al. — Conditional DETR (2021) — https://arxiv.org/abs/2108.06152 (skim)
- Liu et al. — DAB-DETR (2022) — https://arxiv.org/abs/2201.12329 (skim)
- Li et al. — DN-DETR (2022) — https://arxiv.org/abs/2203.01305
- Zhang et al. — DINO (the DETR variant, not the SSL DINO) (2022) — https://arxiv.org/abs/2203.03605
- Chen et al. — LW-DETR (2024) — https://arxiv.org/abs/2406.03459
- Roboflow — RF-DETR (2025) — https://github.com/roboflow/rf-detr (blog/repo)
What you should be able to do by the end of the week:
1. State, in one sentence each, the central convergence fix of Deformable DETR, DN-DETR, DINO-DETR, LW-DETR, and RF-DETR.
2. Explain deformable attention in your own words, including why it’s faster and better than dense cross-attention.
3. Implement denoising queries on top of your vanilla_detr.py and train — observe the convergence speedup.
4. Fill in the comparison table (epochs to converge, COCO AP, key trick) for each variant.
5. Argue when you’d reach for LW-DETR vs RF-DETR vs DINO-DETR for a real project.
Recap — the vanilla DETR convergence problem
You felt this firsthand in Week 1. Three concrete symptoms:
- Sparse supervision per gt. Bipartite matching assigns exactly one positive query per gt. That’s ~5 positives per image vs the ~150 that anchor-based methods provide, i.e. ~30× fewer gradient signals per object per step.
- Matching instability early in training. Which query owns which gt flickers between steps — queries can’t specialize.
- Global attention is slow and weak on small objects. Encoder self-attention is O(L²) over all spatial tokens at one feature scale (/32). Tiny objects get blurred away.
Result: vanilla DETR needs ~500 epochs on COCO to reach Faster R-CNN-level accuracy, and still loses to anchor-based methods on small objects.
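To put rough numbers on the third symptom, here is a back-of-the-envelope calculation. The 800 × 1216 input size is an illustrative assumption (it is not taken from the text above), and `num_tokens` is a hypothetical helper.

```python
import math


def num_tokens(height: int, width: int, stride: int) -> int:
    """Number of spatial tokens L = H' x W' after downsampling by `stride`."""
    return math.ceil(height / stride) * math.ceil(width / stride)


# Assumed input size: 800 x 1216 pixels (illustrative only).
for stride in (32, 16, 8):
    L = num_tokens(800, 1216, stride)
    # Dense self-attention builds an L x L score matrix per head.
    print(f"stride /{stride}: L = {L:>6}, L^2 = {L * L:>12,}")
```

Going from /32 to /8 multiplies L by 16 and the dense score matrix by 256, which is why vanilla DETR cannot simply add finer feature maps.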
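Every variant this week tackles one or more of these problems. The map below splits the third symptom into attention cost and multi-scale features, and adds query design as a fourth axis. Keep it in mind as you read: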
| Problem | Vanilla DETR | Variant that fixes it |
|---|---|---|
| 1. Sparse supervision | ~1 per gt | DN-DETR, DINO-DETR |
| 2. Slow attention | O(L²) full | Deformable DETR |
| 3. Multi-scale | only /32 | Deformable DETR |
| 4. Query design | learned PE | Conditional / DAB / DINO |
Genealogy at a glance
Vanilla DETR (2020, 500 epochs)
│
├── Deformable DETR (2020, 50 epochs)
│   └── deformable attention + multi-scale
│
├── Conditional DETR (2021)
│   └── content-dependent query positional encoding
│
├── DAB-DETR (2022)
│   └── object queries as explicit 4D anchor boxes
│
├── DN-DETR (2022, 50 epochs)
│   └── denoising queries: noisy GTs as extra training queries
│
└── DINO-DETR (2022, 12 epochs)
    └── contrastive denoising + mixed query selection + look-forward-twice
        │
        ├── LW-DETR (2024)
        │   └── DETR + ViT backbone + multi-scale efficient training
        │
        └── RF-DETR (2025)
            └── real-time variant: Roboflow's open-source production-ready DETR
Two things to notice as you read:
- Each variant claims a faster training schedule than the previous (500 → 50 → 12 epochs).
- LW-DETR and RF-DETR are combinations of the earlier tricks, not new tricks themselves. Understanding the intermediate methods is mandatory to understand them.
1. Deformable DETR — deformable attention + multi-scale
The two contributions:
1. Deformable attention — each query attends to a small, learnable set of keys (~4 per head per scale), not all L spatial tokens.
2. Multi-scale features — features at /8, /16, /32 scales (a feature pyramid), not just /32.
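The first contribution can be sketched in a few lines of PyTorch. This is a minimal, single-head, single-scale illustration, not the paper's implementation (which is multi-head, multi-scale, and backed by a custom CUDA kernel). The class name `ToyDeformableAttention`, the choice to express offsets in feature-map pixels, and the use of `F.grid_sample` for bilinear sampling are all assumptions made for this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyDeformableAttention(nn.Module):
    """Single-head, single-scale deformable attention (illustrative sketch)."""

    def __init__(self, d_model: int = 256, num_points: int = 4):
        super().__init__()
        self.num_points = num_points
        self.offset_proj = nn.Linear(d_model, num_points * 2)  # (dx, dy) per point
        self.weight_proj = nn.Linear(d_model, num_points)      # one weight per point
        self.value_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, query, ref_points, feat):
        # query:      [B, Q, d]     content vector of each query
        # ref_points: [B, Q, 2]     (x, y) reference point, normalized to [0, 1]
        # feat:       [B, d, H, W]  one feature map
        B, Q, d = query.shape
        H, W = feat.shape[-2:]

        # Project values, then restore the 2D layout for sampling.
        value = self.value_proj(feat.flatten(2).transpose(1, 2))  # [B, HW, d]
        value = value.transpose(1, 2).reshape(B, d, H, W)

        # "Where to look": offsets (in feature-map pixels) around the reference.
        offsets = self.offset_proj(query).view(B, Q, self.num_points, 2)
        scale = torch.tensor([W, H], dtype=query.dtype, device=query.device)
        loc = ref_points[:, :, None, :] + offsets / scale  # [B, Q, P, 2]

        # grid_sample expects (x, y) coordinates in [-1, 1].
        sampled = F.grid_sample(value, 2.0 * loc - 1.0, mode="bilinear",
                                padding_mode="zeros", align_corners=False)
        # sampled: [B, d, Q, P]

        # "How much to weight each look": softmax over the P sampled points.
        weights = self.weight_proj(query).softmax(dim=-1)  # [B, Q, P]
        out = (sampled * weights[:, None]).sum(dim=-1)     # [B, d, Q]
        return self.out_proj(out.transpose(1, 2))          # [B, Q, d]
```

Each query reads `num_points` locations regardless of H × W, so the sampling cost grows with the number of queries and points rather than with the square of the token count.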
Q1. Vanilla DETR’s encoder self-attention is O(L²), where L = H' × W' (its decoder cross-attention is O(N·L) for N queries). What’s the complexity of deformable attention, and why?
…
Q2. Why does restricting each query to ~4 keys help convergence, not just inference speed? (Hint: think about the matcher’s stability — fewer, more focused attention targets means…)
…
Q3. Why does multi-scale help with small objects? Compare to vanilla DETR’s single /32 feature map.
…
Q4. What’s a deformation? Each query predicts both where to look (offset from a reference point) and how much to weight each look (attention weight). Sketch this with a small example.
…
2. Conditional DETR — content-dependent query PE (skim)
Central observation: in vanilla DETR, the cross-attention’s query positional component is fixed (the learned query_embed doesn’t depend on the image content). This means queries can only weakly localize.
The fix: make the query’s positional component content-dependent by deriving it from the decoder’s current content stream (tgt).
Not load-bearing for you to implement, but two questions to keep in mind when reading later DETR variants:
Q1. Why does content-dependent query PE matter for cross-attention specifically (vs self-attention)?
…
Q2. This idea reappears in DAB-DETR (which makes the query a 4D box) and DINO-DETR. What’s the through-line?
…
3. DAB-DETR — queries as 4D anchor boxes (skim)
Central observation: the learned query_embed in vanilla DETR is mysterious — it’s a 256-d vector with no human-interpretable meaning. DAB-DETR replaces it with a 4D anchor box (cx, cy, w, h) per query, which is initialized and refined through the decoder layers.
Each query carries its anchor box as state. The cross-attention uses the box’s position as a spatial prior on where to look in the encoder memory.
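One way to turn a 4D box into a d_model-sized positional vector is a sinusoidal encoding per coordinate. The sketch below is an assumption-laden illustration: the function name `sine_box_embedding`, the temperature of 10000, and the channel layout (a sine block followed by a cosine block for each coordinate) are choices made here and may differ from the DAB-DETR code.

```python
import math

import torch


def sine_box_embedding(boxes: torch.Tensor, d_model: int = 256,
                       temperature: float = 10000.0) -> torch.Tensor:
    """Encode boxes [..., 4] in normalized (cx, cy, w, h) as [..., d_model]."""
    assert d_model % 8 == 0, "need an even number of channels per coordinate"
    d_coord = d_model // 4  # channels spent on each of cx, cy, w, h
    i = torch.arange(d_coord // 2, dtype=boxes.dtype, device=boxes.device)
    freqs = temperature ** (2 * i / d_coord)               # [d_coord / 2]
    angles = 2 * math.pi * boxes[..., None] / freqs        # [..., 4, d_coord / 2]
    emb = torch.cat([angles.sin(), angles.cos()], dim=-1)  # [..., 4, d_coord]
    return emb.flatten(-2)                                 # [..., d_model]


# Usage: 300 anchor boxes for a batch of 2 images.
anchors = torch.rand(2, 300, 4)
query_pos = sine_box_embedding(anchors, d_model=256)  # [2, 300, 256]
```

Because the encoding is a fixed function of the box, refining the box in a decoder layer immediately changes where the next layer's cross-attention is biased to look.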
Q. This makes DETR look more like a classical anchor-based detector — but in a learnable, queryable form. What’s preserved from set prediction (vs anchor methods)?
…
4. DN-DETR — denoising queries ⭐ THE KEY METHOD
This is the most important method in this week’s reading. You already spent time understanding it in our Week 1 conversations — now read the paper to lock it in.
Central trick: at training time, in addition to the N object queries, inject extra queries built from noisy versions of the ground-truth boxes. Each noisy query is pre-assigned to its corresponding clean gt — bypassing the Hungarian matcher entirely for these queries. The model is trained to denoise them (recover the clean gt from the noisy input).
The geometry:
- For each gt: make K noisy copies (jitter box coords + sometimes flip class).
- These K queries are all trained as positives toward the same clean gt.
- Effectively gives each gt ~K stable positives per step instead of 1 unstable positive.
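The noising step above can be sketched for a single image as follows. The function names (`add_box_noise`, `flip_labels`, `make_noisy_groups`) are hypothetical, and the exact noise distributions are assumptions for illustration: centers shift by at most `box_noise_scale` times half the box size, and width/height are rescaled by a factor in [1 - scale, 1 + scale]. Check the paper or reference code for the precise parameterization.

```python
import torch


def add_box_noise(boxes: torch.Tensor, box_noise_scale: float = 0.4) -> torch.Tensor:
    """Jitter boxes [M, 4] given in normalized (cx, cy, w, h)."""
    cxcy, wh = boxes[:, :2], boxes[:, 2:]
    # Center shift: uniform in [-1, 1] * (half box size) * scale.
    shift = (torch.rand_like(cxcy) * 2 - 1) * (wh / 2) * box_noise_scale
    # Size scaling: multiply w, h by a factor in [1 - scale, 1 + scale].
    factor = 1 + (torch.rand_like(wh) * 2 - 1) * box_noise_scale
    return torch.cat([cxcy + shift, wh * factor], dim=1).clamp(0.0, 1.0)


def flip_labels(labels: torch.Tensor, num_classes: int,
                label_noise_ratio: float = 0.5) -> torch.Tensor:
    """Replace each label with a random class with probability label_noise_ratio."""
    flip = torch.rand(labels.shape, device=labels.device) < label_noise_ratio
    random_labels = torch.randint_like(labels, num_classes)
    return torch.where(flip, random_labels, labels)


def make_noisy_groups(boxes, labels, num_classes, num_groups=5,
                      box_noise_scale=0.4, label_noise_ratio=0.5):
    """Return num_groups noisy copies of every gt, laid out group by group."""
    rep_boxes = boxes.repeat(num_groups, 1)  # [num_groups * M, 4]
    rep_labels = labels.repeat(num_groups)   # [num_groups * M]
    return (add_box_noise(rep_boxes, box_noise_scale),
            flip_labels(rep_labels, num_classes, label_noise_ratio))
```

The clean targets for the loss are simply `boxes.repeat(num_groups, 1)` and `labels.repeat(num_groups)`, in the same order as the noisy copies.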
The architectural detail: attention mask prevents denoising queries from leaking gt info to regular object queries.
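A minimal sketch of such a mask, assuming the DN queries come first in the sequence and every group has the same size. The name `build_dn_attn_mask` and the `dn_sees_regular` flag are hypothetical. The boolean convention (True = blocked) follows `torch.nn.MultiheadAttention`, where rows index the attending query and columns index the attended one. The DN-DETR paper notes that letting DN queries see the regular queries does not leak gt information, so that direction is left as a design choice; the default below blocks it, matching the rule listed in this worksheet's TODO.

```python
import torch


def build_dn_attn_mask(num_groups: int, group_size: int, num_queries: int,
                       dn_sees_regular: bool = False) -> torch.Tensor:
    """Boolean self-attention mask of shape [K+N, K+N]; True means blocked."""
    K = num_groups * group_size
    total = K + num_queries
    mask = torch.zeros(total, total, dtype=torch.bool)

    # Regular object queries must never see DN queries (this is the gt leak).
    mask[K:, :K] = True

    # A DN query may only see DN queries from its own group.
    for g in range(num_groups):
        start, end = g * group_size, (g + 1) * group_size
        mask[start:end, :start] = True
        mask[start:end, end:K] = True

    # Optional: also hide the regular queries from the DN queries.
    if not dn_sees_regular:
        mask[:K, K:] = True
    return mask


# Usage: 2 groups of 3 noisy gts each, plus 4 regular queries.
mask = build_dn_attn_mask(num_groups=2, group_size=3, num_queries=4)
print(mask.int())
```

Printing the mask for a tiny configuration like this is a quick way to check the block structure before wiring it into the decoder.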
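The result: vanilla DETR’s 500 epochs → 50 epochs. Note that the DAB-DETR baseline DN-DETR builds on already trains in 50 epochs; the paper’s headline claim is +1.9 AP at the same schedule, or comparable AP in about half the epochs.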
Q1. State the core mechanism in your own words (4 sentences max).
…
Q2. Why is the attention mask necessary? What would happen without it?
…
Q3. Denoising queries are training-only. Why? What changes between training and inference?
…
Q4. Compare denoising to anchor-based detection’s “many positive anchors per gt”. What does each provide, and what’s the same / different?
…
Q5. In §3.2, the paper says noise is parameterized by two scalars: λ₁ (center) and λ₂ (scale). What happens at λ₁=0, λ₂=0? At very high values?
…
5. DINO-DETR — contrastive denoising + mixed query selection
Note on naming: this is not the SSL DINO (Caron et al.) — it’s a confusingly-named DETR variant from Zhang et al. 2022. Often just called “DINO” in detection papers.
Three improvements over DN-DETR:
- Contrastive denoising — in addition to positive denoising queries (small noise → predict the gt), add negative denoising queries (larger noise → predict “no object”). Teaches the model when to abstain.
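- Mixed query selection — initialize the positional part of each object query (its anchor box) from the encoder’s top-confidence features, while keeping the content part as a learned embedding. Hence “mixed”.
- Look-forward-twice — during training, the box refinement of decoder layer i receives gradients from the losses of both layer i and layer i+1, instead of having the box detached between layers.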
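The query-selection idea can be sketched as a top-k over per-token encoder scores. Everything here is an illustrative assumption: the function name `init_mixed_queries`, the use of the maximum class logit as the confidence score, and the tensor names. It assumes the encoder output has already been passed through a classification head and a box head.

```python
import torch
import torch.nn as nn


def init_mixed_queries(enc_logits: torch.Tensor, enc_boxes: torch.Tensor,
                       content_embed: nn.Embedding):
    """Pick anchors from encoder proposals; keep content queries learned.

    enc_logits: [B, L, C] class logits per encoder token
    enc_boxes:  [B, L, 4] box proposal per encoder token
    content_embed: nn.Embedding(N, d) holding the learned content queries
    """
    B = enc_logits.shape[0]
    num_queries = content_embed.num_embeddings

    # Confidence of each token = its best class logit.
    scores = enc_logits.max(dim=-1).values                  # [B, L]
    topk_idx = scores.topk(num_queries, dim=1).indices      # [B, N]

    # Positional part: boxes of the top-k tokens, detached from the encoder.
    index = topk_idx[..., None].expand(-1, -1, 4)
    anchors = torch.gather(enc_boxes, 1, index).detach()    # [B, N, 4]

    # Content part: the same learned embedding for every image.
    content = content_embed.weight[None].expand(B, -1, -1)  # [B, N, d]
    return anchors, content, topk_idx
```

Compared with fixed learned anchors, the selected boxes change per image, while the content stream still carries no image-specific information at layer 0.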
The result: 50 epochs (DN-DETR) → 12 epochs.
Q1. Why are negative denoising queries useful? What’s the failure mode they prevent?
…
Q2. How does mixed query selection differ from DAB-DETR’s anchor queries? Which is more flexible?
…
Q3. Why is DINO-DETR the reference baseline that almost every recent DETR variant is benchmarked against?
…
6. LW-DETR — lightweight, real-time DETR
The premise: DINO-DETR is accurate but heavy. LW-DETR aims to be faster than YOLO at similar accuracy while keeping DETR’s set-prediction simplicity.
Headline tricks:
- Pre-trained ViT backbone — replace CNN with a lightweight ViT (e.g., ViT-S/16). Cheaper than the ResNet stacks used in earlier DETR variants for the same accuracy.
- Multi-scale feature extraction — but cleverly aggregated to avoid the cost of a full FPN.
- IoU-aware classification loss — classification confidence is calibrated against IoU (so highly-confident-but-poorly-localized predictions are penalized).
- Heavy DETR-side compression — fewer queries, smaller decoder, distilled to a small model.
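To make the IoU-aware classification idea concrete, here is a simplified sketch in which the classification target of a matched query is its IoU with the gt rather than a hard 1. This is a generic IoU-as-soft-target variant chosen for illustration, not LW-DETR's exact loss; the function name `iou_aware_bce` and the convention of marking unmatched queries with label -1 are assumptions.

```python
import torch
import torch.nn.functional as F


def iou_aware_bce(logits: torch.Tensor, labels: torch.Tensor,
                  ious: torch.Tensor) -> torch.Tensor:
    """BCE with IoU as the soft target for matched queries.

    logits: [Q, C] per-class logits
    labels: [Q]    matched class index, or -1 for unmatched queries
    ious:   [Q]    IoU between predicted and matched gt box (ignored if unmatched)
    """
    target = torch.zeros_like(logits)
    pos_idx = (labels >= 0).nonzero(as_tuple=True)[0]
    # Matched queries are asked to output their localization quality.
    target[pos_idx, labels[pos_idx]] = ious[pos_idx].to(logits.dtype)
    loss = F.binary_cross_entropy_with_logits(logits, target, reduction="sum")
    return loss / max(pos_idx.numel(), 1)


# A confident prediction is penalized more when its box overlaps poorly.
logits = torch.tensor([[5.0, -5.0]])
labels = torch.tensor([0])
print(iou_aware_bce(logits, labels, torch.tensor([0.9])))
print(iou_aware_bce(logits, labels, torch.tensor([0.3])))
```

With this target, the score a query learns to emit doubles as a ranking signal for localization quality, which is what the bullet above means by calibrating confidence against IoU.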
Q1. Why does using a ViT backbone work well here, given that vanilla DETR used ResNet?
…
Q2. IoU-aware classification — what’s the problem with vanilla DETR’s classification confidence?
…
Q3. The paper reports real-time speeds (>60 FPS) on consumer hardware. Where does that speedup come from — backbone, decoder, both, neither?
…
Q4. Is LW-DETR a fundamentally new method or an engineering recombination of earlier tricks?
…
7. RF-DETR — Roboflow’s real-time, open-source DETR
The premise: practical, ready-to-deploy DETR with Roboflow’s polish on top of LW-DETR-style ideas. Open source, with checkpoints and a clean training/inference API.
Not a single research paper — more of a release that combines proven tricks. Read the README and blog post rather than chasing a PDF.
What it inherits from earlier variants:
- DINO-DETR-style denoising and contrastive denoising.
- Deformable attention.
- LW-DETR-style pre-trained ViT backbone.
- Multi-scale features.
Q1. What gap in the open-source DETR ecosystem did RF-DETR fill?
…
Q2. From a practitioner’s perspective, when would you reach for RF-DETR vs DINO-DETR (research codebase) vs a YOLO variant?
…
Q3. The repo provides pretrained models. Run inference on a few of your toy synthetic images using their pretrained checkpoint and compare the predictions to your vanilla DETR’s. What’s the gap?
…
Comparison table — fill this in as you read
| Method | Year | Epochs (COCO) | COCO mAP | Key trick | Speed (FPS) |
|---|---|---|---|---|---|
| Vanilla DETR | 2020 | 500 | … | bipartite matching, set prediction | … |
| Deformable DETR | 2020 | 50 | … | deformable attention + multi-scale | … |
| Conditional DETR | 2021 | 50 | … | content-dependent query PE | … |
| DAB-DETR | 2022 | 50 | … | queries as 4D anchor boxes | … |
| DN-DETR | 2022 | 50 | … | denoising queries (pre-matched positives) | … |
| DINO-DETR | 2022 | 12–24 | … | contrastive denoising + mixed query selection | … |
| LW-DETR | 2024 | … | … | ViT backbone + efficient multi-scale | 60+ |
| RF-DETR | 2025 | … | … | open-source LW-DETR + denoising polish | … |
Fill numbers from each paper’s main results table. mAP is on COCO val2017 unless noted. Speed in FPS on a single A100 or V100 — note which GPU each paper used.
Implementation exercise — extend vanilla_detr.py with denoising queries
Goal: add the single most impactful trick (DN-DETR’s denoising) on top of your Week 1 implementation and observe the convergence speedup.
From scratch, this is a moderate-sized change (~100 lines). Step-by-step plan:
Step 1 — build the noisy queries
Given a batch of targets, build noisy versions of the gt boxes and labels:
def build_denoising_queries(targets, num_classes, num_groups=5,
                            box_noise_scale=0.4, label_noise_ratio=0.5):
    """For each image and each of num_groups noise groups, make a noisy copy of every gt.

    Returns (in the order DETR.forward unpacks them):
        dn_pos:       [B, K, d_model], positional embedding built from noisy box + class
        dn_attn_mask: [K+N, K+N], attention mask preventing leakage to regular queries
        dn_meta:      clean labels and boxes per image, repeated num_groups times,
                      plus whatever is needed to ignore padded DN slots

    The content stream for the DN queries is zero-initialized in DETR.forward.
    """
    ...

Where K = num_groups × M_max, and M_max is the largest gt count of any image in the batch. Images with fewer gts are padded up to K, so K varies from batch to batch.
Q. What’s the encoding from a (noisy class, noisy box) tuple into a d_model vector? (Hint: class embed + box positional embed, concatenated or summed.)
…
Step 2 — modify DETR.forward to inject denoising queries
The decoder receives two streams of queries concatenated: the DN queries (size K) + the regular object queries (size N). Total decoder sequence length: K + N.
After the decoder, split the output back:
- First K tokens → DN predictions (used for denoising loss).
- Last N tokens → regular predictions (used for bipartite-matched loss + inference).
Skeleton:
def forward(self, images, targets=None):
    # ... backbone, PE, encoder as before (defines B, memory, pos_flat) ...
    if self.training and targets is not None:
        dn_pos, dn_attn_mask, dn_meta = build_denoising_queries(targets, ...)
        K = dn_pos.shape[1]
        # Build full query stack: [dn_queries, object_queries]
        obj_pos = self.query_pos_emb.weight.unsqueeze(0).expand(B, -1, -1)
        query_pos = torch.cat([dn_pos, obj_pos], dim=1)  # [B, K+N, d]
        tgt = memory.new_zeros(B, K + self.num_queries, self.d_model)
    else:
        # Inference (or training without targets): no DN queries, no mask.
        K = 0
        query_pos = self.query_pos_emb.weight.unsqueeze(0).expand(B, -1, -1)
        tgt = memory.new_zeros(B, self.num_queries, self.d_model)
        dn_attn_mask = None
        dn_meta = None
    dec_outs = self.decoder(tgt, memory, pos_flat, query_pos, attn_mask=dn_attn_mask)
    # ... apply heads; all_logits / all_boxes are [num_layers, B, K+N, ...] ...
    # Split predictions back along the query dimension
    dn_logits = all_logits[:, :, :K, :]
    dn_boxes = all_boxes[:, :, :K, :]
    obj_logits = all_logits[:, :, K:, :]
    obj_boxes = all_boxes[:, :, K:, :]
    return {
        'pred_logits': obj_logits[-1], 'pred_boxes': obj_boxes[-1],
        'aux_outputs': [...],
        'dn_outputs': {'pred_logits': dn_logits[-1], 'pred_boxes': dn_boxes[-1],
                       'dn_meta': dn_meta, 'aux': [...]}
    }

Q. Why does forward need to receive targets during training? What does this break compared to vanilla DETR?
…
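Q. Modify TransformerDecoderLayer to accept an attn_mask and pass it to the self-attention call only. Why not to cross-attention as well? (Hint: compare the mask’s [K+N, K+N] shape with the shape of the cross-attention score matrix, and ask whether the encoder memory contains any gt information.)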
…
# TODO — modify DETR.forward to inject DN queries at training time
# Also modify TransformerDecoderLayer to accept an attn_mask

Step 3 — add the denoising loss
DN queries don’t go through the matcher. Each DN query has a known target (the clean gt it was noised from). Loss:
- Classification: cross-entropy with the clean class as target (regardless of noisy input class).
- Box L1 + GIoU: against the clean box.
Same loss components as the main loss, just with explicit indices instead of matched ones.
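A minimal sketch of that loss for the last decoder layer, assuming DN predictions and clean targets are already aligned slot by slot and that a boolean `valid` mask marks the non-padded DN slots. The function names, the `valid` mask, and the unweighted return values are assumptions for illustration; apply the same loss weights you use for the matched loss.

```python
import torch
import torch.nn.functional as F


def cxcywh_to_xyxy(b: torch.Tensor) -> torch.Tensor:
    cx, cy, w, h = b.unbind(-1)
    return torch.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], dim=-1)


def paired_giou(a: torch.Tensor, b: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """GIoU between a[i] and b[i]; both [M, 4] in (x1, y1, x2, y2)."""
    inter_wh = (torch.min(a[:, 2:], b[:, 2:]) - torch.max(a[:, :2], b[:, :2])).clamp(min=0)
    inter = inter_wh.prod(dim=1)
    area_a = (a[:, 2:] - a[:, :2]).prod(dim=1)
    area_b = (b[:, 2:] - b[:, :2]).prod(dim=1)
    union = area_a + area_b - inter
    # Smallest box enclosing both a[i] and b[i].
    hull_wh = (torch.max(a[:, 2:], b[:, 2:]) - torch.min(a[:, :2], b[:, :2])).clamp(min=0)
    hull = hull_wh.prod(dim=1)
    return inter / (union + eps) - (hull - union) / (hull + eps)


def denoising_loss(dn_logits, dn_boxes, clean_labels, clean_boxes, valid):
    """No matcher: DN slot k is trained toward clean target k.

    dn_logits [B, K, C], dn_boxes [B, K, 4] (cxcywh), clean_labels [B, K],
    clean_boxes [B, K, 4] (cxcywh), valid [B, K] bool (False = padding).
    """
    n = valid.sum().clamp(min=1)
    pred_boxes, tgt_boxes = dn_boxes[valid], clean_boxes[valid]
    loss_ce = F.cross_entropy(dn_logits[valid], clean_labels[valid], reduction="sum") / n
    loss_bbox = F.l1_loss(pred_boxes, tgt_boxes, reduction="sum") / n
    giou = paired_giou(cxcywh_to_xyxy(pred_boxes), cxcywh_to_xyxy(tgt_boxes))
    loss_giou = (1 - giou).sum() / n
    return {"loss_ce": loss_ce, "loss_bbox": loss_bbox, "loss_giou": loss_giou}
```

The same function can be called once per decoder layer on the DN slice of the auxiliary outputs.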
Q. Why don’t DN queries get a “no object” class option in the loss? (Hint: they’re always pre-paired to a real gt.)
…
# TODO — add denoising loss component
# Refactor DetrLoss to also compute losses on outputs['dn_outputs'] using the metadata.
# Total loss = main + aux + dn (and dn_aux for each decoder layer)

Step 4 — train and compare convergence
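Train your DN-DETR with the same hyperparameters as your vanilla DETR run. Compare the class loss curve of the regular (matched) queries only; the DN run’s total loss includes the extra denoising terms, so the totals are not comparable.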
Expected: class loss should drop faster (and to a lower floor) than vanilla DETR. Box losses should also drop faster as queries specialize more quickly.
Plot to make: loss curves of vanilla vs DN-DETR side by side, log-scaled y-axis. Identify the point where DN-DETR’s class loss crosses vanilla’s eventual floor.
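One possible way to produce that plot and locate the crossing point, assuming each run logged its class loss as a plain Python list with one value per step or epoch. The helper names (`tail_floor`, `first_step_below`, `plot_loss_curves`), the output filename, and the definition of the floor as the mean of the last 10% of the vanilla curve are all assumptions for this sketch.

```python
def tail_floor(losses, frac=0.1):
    """Estimate a curve's floor as the mean of its last `frac` of values."""
    n = max(1, int(len(losses) * frac))
    tail = losses[-n:]
    return sum(tail) / len(tail)


def first_step_below(losses, floor):
    """Index of the first value at or below `floor`, or None if never reached."""
    for step, value in enumerate(losses):
        if value <= floor:
            return step
    return None


def plot_loss_curves(vanilla_losses, dn_losses, path="dn_vs_vanilla.png"):
    """Plot both curves on a log y-axis and mark where DN-DETR reaches vanilla's floor."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file; no display needed
    import matplotlib.pyplot as plt

    floor = tail_floor(vanilla_losses)
    cross = first_step_below(dn_losses, floor)

    fig, ax = plt.subplots(figsize=(7, 4))
    ax.plot(vanilla_losses, label="vanilla DETR")
    ax.plot(dn_losses, label="DN-DETR")
    ax.axhline(floor, linestyle="--", linewidth=0.8, label="vanilla floor")
    if cross is not None:
        ax.axvline(cross, linestyle=":", linewidth=0.8, label=f"crossing at {cross}")
    ax.set_yscale("log")
    ax.set_xlabel("step")
    ax.set_ylabel("class loss")
    ax.legend()
    fig.savefig(path, dpi=150, bbox_inches="tight")
    plt.close(fig)
    return cross
```

If the raw curves are noisy, smooth them (for example with a moving average) before calling `first_step_below`, otherwise a single lucky step can register as the crossing.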
# TODO — train DN-DETR, plot loss curves
# 1. Train vanilla_detr (already done in Week 1) — log losses to a list
# 2. Train dn_detr — log losses to a list
# 3. plt.plot both, log-y

Exercises
Pick one or two:
- Off-the-shelf comparison. Install RF-DETR via pip install rfdetr (or follow their repo). Run inference on your toy synthetic images and compare predictions visually + numerically to your vanilla DETR output.
- Negative denoising queries (DINO-DETR slice). Extend your DN-DETR implementation to also include negative denoising queries (larger noise, target = “no object”). Does the model become better at rejecting near-misses?
- Query specialization plot. For your trained DN-DETR, plot a 2D scatter of (avg cx, avg cy) per query across the validation set. You should see queries specializing on different image regions — a phenomenon that emerges much earlier with denoising than without.
- Deformable attention prototype. Write a from-scratch DeformableAttention module (just the formula — no need for the optimized CUDA kernel). Replace one of your decoder’s cross-attention sublayers with it and compare convergence.
Synthesis — connecting the dots
Q1. Of the four problems with vanilla DETR (sparse supervision, slow attention, no multi-scale, weak query design), which is the most fundamental? Which method most directly attacks it?
…
Q2. DN-DETR is sometimes described as “putting anchors back into DETR.” Argue the case for and against this framing.
…
Q3. Why hasn’t DETR-family killed YOLO yet, despite SOTA accuracy on COCO since DINO-DETR? Speed, ecosystem, ease-of-deployment, or something else?
…
End-of-week recap
Write a ≤250-word summary you could send to a colleague who knows vanilla DETR but hasn’t read these papers. Cover:
- The three core problems with vanilla DETR (recap).
- The two highest-impact tricks (one for attention, one for supervision) that made DETR practical.
- Why DN-DETR / DINO-DETR are the architectural defaults today.
- What LW-DETR / RF-DETR add on top of DINO-DETR.
- The one paper you’d recommend if your colleague only had time for one.
…
Next week (Module 2 / Week 3): segmentation and pose — MaskDINO (segmentation), DETR-Pose (pose estimation). Same set-prediction framing, different prediction targets.