# TODO — build_denoising_queries(targets, ...)
# 1. For each image's gts, create num_groups noisy copies (jitter box + sometimes flip class)
# 2. Encode each noisy (class, box) as a d_model vector:
# - class_emb: nn.Embedding(num_classes + 1, d_model)
# - box positional: sinusoidal PE on (cx, cy, w, h)
# - sum or concat to get the query positional embedding
# 3. Build the attention mask (K+N x K+N) such that:
# - DN queries within the same group can see each other
# - DN queries from different groups cannot see each other
# - Regular object queries cannot see DN queries
# - DN queries cannot see regular object queries
Module 2 / Week 2 - Building on top of DETR
Goal: understand the lineage of DETR variants that fixed vanilla DETR’s slow convergence and weak small-object detection. Read each paper, identify the one trick that defines it, and extend your vanilla_detr.py with the single most impactful trick (denoising queries).
This week’s reading (in chronological order; each builds on the previous):
- Zhu et al., Deformable DETR (2020): https://arxiv.org/abs/2010.04159
- Meng et al., Conditional DETR (2021): https://arxiv.org/abs/2108.06152 (skim)
- Liu et al., DAB-DETR (2022): https://arxiv.org/abs/2201.12329 (skim)
- Li et al., DN-DETR (2022): https://arxiv.org/abs/2203.01305
- Zhang et al., DINO (the DETR variant, not the SSL DINO) (2022): https://arxiv.org/abs/2203.03605
- Chen et al., LW-DETR (2024): https://arxiv.org/abs/2406.03459
- Roboflow, RF-DETR (2025): https://github.com/roboflow/rf-detr (blog/repo)
What you should be able to do by the end of the week:
1. State, in one sentence each, the central convergence fix of Deformable DETR, DN-DETR, DINO-DETR, LW-DETR, and RF-DETR.
2. Explain deformable attention in your own words, including why it’s faster and better than dense cross-attention.
3. Implement denoising queries on top of your vanilla_detr.py, train, and observe the convergence speedup.
4. Fill in the comparison table (epochs to converge, COCO AP, key trick) for each variant.
5. Argue when you’d reach for LW-DETR vs RF-DETR vs DINO-DETR for a real project.
Recap — the vanilla DETR convergence problem
You felt this firsthand in Week 1. Three concrete symptoms:
- Sparse supervision per gt. Bipartite matching assigns exactly one positive query per gt. That’s ~5 positives per image vs the ~150 that anchor-based methods provide, i.e. ~30× fewer gradient signals per object per step.
- Matching instability early in training. Which query owns which gt flickers between steps — queries can’t specialize.
- Global attention is slow and weak on small objects. Encoder self-attention is O(L²) over all spatial tokens at one feature scale (/32). Tiny objects get blurred away.
Result: vanilla DETR needs ~500 epochs on COCO to reach Faster R-CNN-level accuracy, and still loses to anchor-based methods on small objects.
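To make the third symptom concrete, here is a small back-of-the-envelope sketch. It counts spatial tokens L at a few strides and estimates how many feature cells a 32×32-pixel object covers. The 800×1066 input size and the helper names `num_tokens` and `cells_spanned` are assumptions made for illustration, not values taken from the papers.

```python
import math


def num_tokens(height: int, width: int, stride: int) -> int:
    """Number of spatial tokens L = H' x W' after downsampling by `stride`."""
    return math.ceil(height / stride) * math.ceil(width / stride)


def cells_spanned(object_side: int, stride: int) -> float:
    """Approximate number of feature-map cells a square object covers."""
    return (object_side / stride) ** 2


# Assumed input resolution, chosen only for illustration.
HEIGHT, WIDTH = 800, 1066

for stride in (32, 16, 8):
    tokens = num_tokens(HEIGHT, WIDTH, stride)
    print(
        f"/{stride}: L = {tokens} tokens, "
        f"L^2 = {tokens * tokens:,} attention pairs, "
        f"a 32x32 object spans ~{cells_spanned(32, stride):.0f} cell(s)"
    )
```

The trade-off shows up directly: a finer stride gives small objects more cells to live in, but the L² pair count of dense attention grows much faster than L itself.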
Every variant this week tackles one or more of these problems. Keep this map in mind as you read (it splits the attention symptom into speed and scale, and adds a fourth axis, query design):
Problem | Vanilla DETR | Variant that fixes it
----------------------|--------------|-----------------------------
1. Sparse supervision | ~1 per gt | DN-DETR, DINO-DETR
2. Slow attention | O(L²) full | Deformable DETR
3. Multi-scale | only /32 | Deformable DETR
4. Query design | learned PE | Conditional / DAB / DINO
Genealogy at a glance
Vanilla DETR (2020, 500 epochs)
│
├── Deformable DETR (2020, 50 epochs)
│   └── deformable attention + multi-scale
│
├── Conditional DETR (2021)
│   └── content-dependent query positional encoding
│
├── DAB-DETR (2022)
│   └── object queries as explicit 4D anchor boxes
│
├── DN-DETR (2022, 50 epochs)
│   └── denoising queries: noisy GTs as extra training queries
│
└── DINO-DETR (2022, 12 epochs)
    └── contrastive denoising + mixed query selection + look-forward-twice
        │
        ├── LW-DETR (2024)
        │   └── DETR + ViT backbone + multi-scale efficient training
        │
        └── RF-DETR (2025)
            └── real-time variant: Roboflow's open-source production-ready DETR
Two things to notice as you read:
- Each variant claims a faster training schedule than the previous (500 → 50 → 12 epochs).
- LW-DETR and RF-DETR are combinations of the earlier tricks, not new tricks themselves. Understanding the intermediate methods is mandatory to understand them.
1. Deformable DETR — deformable attention + multi-scale
The two contributions:
1. Deformable attention: each query attends to a small, learnable set of keys (~4 per head per scale), not all L spatial tokens.
2. Multi-scale features: features at /8, /16, /32 scales (a feature pyramid), not just /32.
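A toy version of the first contribution may help before you read the paper's formula. The sketch below handles one query, one head, and one scale on a 4×4 map of scalar features. It samples K = 4 points around a reference point with bilinear interpolation and mixes them with softmax weights. In the real module the offsets and weight logits are predicted from the query by linear layers and the features are C-dimensional vectors. Here they are hard-coded, and all function names are made up for this example.

```python
import math


def bilinear_sample(feat, x, y):
    """Bilinearly sample a 2D grid feat[row][col] at continuous (x, y).

    Locations outside the grid contribute zero.
    """
    height, width = len(feat), len(feat[0])
    x0, y0 = math.floor(x), math.floor(y)
    value = 0.0
    for yi in (y0, y0 + 1):
        for xi in (x0, x0 + 1):
            if 0 <= xi < width and 0 <= yi < height:
                weight = (1 - abs(x - xi)) * (1 - abs(y - yi))
                value += weight * feat[yi][xi]
    return value


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def deformable_attend(feat, ref_xy, offsets, weight_logits):
    """Weighted sum of K values sampled at ref_xy + offset (one query, one head)."""
    weights = softmax(weight_logits)
    samples = [
        bilinear_sample(feat, ref_xy[0] + dx, ref_xy[1] + dy)
        for dx, dy in offsets
    ]
    return sum(w * s for w, s in zip(weights, samples))


# Toy 4x4 feature map with scalar "features" 0..15.
feat = [[float(row * 4 + col) for col in range(4)] for row in range(4)]
ref = (1.5, 1.5)                                    # reference point (x, y)
offsets = [(-0.5, -0.5), (0.5, -0.5), (-0.5, 0.5), (0.5, 0.5)]
uniform_out = deformable_attend(feat, ref, offsets, [0.0, 0.0, 0.0, 0.0])
peaked_out = deformable_attend(feat, ref, offsets, [10.0, 0.0, 0.0, 0.0])
print(uniform_out, peaked_out)
```

The query touches only four locations no matter how large `feat` is, which is where the cost saving comes from.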
Q1. Vanilla DETR’s encoder self-attention is O(L²) where L = H' × W'. What’s the complexity of deformable attention, and why?
So in DETR, the CNN outputs [1, 256, 4, 4], which we just flatten into the “token view” of [1, 16, 256]. In the transformer encoder, both the query and key elements are pixels of the feature map, so every pixel attends to every other pixel. The computational complexity of this self-attention is O(H²W²C), which grows quadratically in the spatial size. In a real DETR (unlike our toy example), the CNN features would be closer to [1, 256, 25, 19]. In the decoder, self-attention among the N object queries costs O(2NC² + N²C), and cross-attention from the N queries to the HW memory tokens costs O(HWC² + NHWC), so it grows linearly with the spatial size. DETR struggles with small objects; one way to mitigate that would be larger feature maps, but that increases the encoder cost quadratically. In DETR, the trained attention maps end up very sparse: the decoder’s cross-attention focuses mostly on the object extremities.
- In Deformable DETR: “we present a deformable attention module”. Inspired by deformable convolution, the deformable attention module attends only to a small set of key sampling points around a reference point, regardless of the spatial size of the feature maps. (They achieve this by assigning only a small, fixed number of keys to each query.)
- The complexity of the deformable attention module is O(2·N_q·C² + min(HW·C², N_q·K·C²)), where N_q is the number of queries and K the number of sampled keys per query.
- In the encoder (N_q = HW) we go from:
Vanilla: O(2 HW·C² + (HW)²·C) → dominant: O(H²W²C)
Deformable: O(2 HW·C² + min(HW·C², HW·K·C²)) → dominant: O(HW·C²), linear in HW
- In the decoder cross-attention (N_q = N), the deformable term becomes O(N·K·C²), independent of HW.
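You can check the scaling claim by plugging numbers into the two expressions quoted above. The sketch below treats the big-O expressions as literal operation counts, which is a simplification since constants and lower-order terms are ignored. C = 256 and K = 4 are assumed values for illustration.

```python
def vanilla_encoder_cost(hw: int, c: int) -> int:
    """2*HW*C^2 (projections) + (HW)^2*C (all-pairs attention)."""
    return 2 * hw * c * c + hw * hw * c


def deformable_cost(n_q: int, hw: int, c: int, k: int) -> int:
    """2*N_q*C^2 + min(HW*C^2, N_q*K*C^2), as quoted for deformable attention."""
    return 2 * n_q * c * c + min(hw * c * c, n_q * k * c * c)


C, K = 256, 4  # assumed channel width and keys per query

for hw in (850, 3400, 13600):  # each step quadruples the number of tokens
    dense = vanilla_encoder_cost(hw, C)
    sparse = deformable_cost(hw, hw, C, K)  # encoder case: N_q = HW
    print(f"HW = {hw:>6}: dense = {dense:.3e}, deformable = {sparse:.3e}, "
          f"ratio = {dense / sparse:.1f}x")
```

Quadrupling HW multiplies the deformable count by exactly four, while the dense count grows faster because of the (HW)² term.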
Q2. Why does restricting each query to ~4 keys help convergence, not just inference speed? (Hint: think about the matcher’s stability — fewer, more focused attention targets means…)
…
Q3. Why does multi-scale help with small objects? Compare to vanilla DETR’s single /32 feature map.
…
Q4. What’s a deformation? Each query predicts both where to look (offset from a reference point) and how much to weight each look (attention weight). Sketch this with a small example.
…
2. Conditional DETR — content-dependent query PE (skim)
Central observation: in vanilla DETR, the cross-attention’s query positional component is fixed (the learned query_embed doesn’t depend on the image content). This means queries can only weakly localize.
The fix: make the query’s positional component content-dependent by deriving it from the decoder’s current content stream (tgt).
Not load-bearing for you to implement, but two things to remember when reading later DETR variants:
Q1. Why does content-dependent query PE matter for cross-attention specifically (vs self-attention)?
…
Q2. This idea reappears in DAB-DETR (which makes the query a 4D box) and DINO-DETR. What’s the through-line?
…
3. DAB-DETR — queries as 4D anchor boxes (skim)
Central observation: the learned query_embed in vanilla DETR is mysterious — it’s a 256-d vector with no human-interpretable meaning. DAB-DETR replaces it with a 4D anchor box (cx, cy, w, h) per query, which is initialized and refined through the decoder layers.
Each query carries its anchor box as state. The cross-attention uses the box’s position as a spatial prior on where to look in the encoder memory.
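As a rough illustration of how a 4D anchor can become a d_model-sized positional query, the sketch below applies a sinusoidal encoding to each of (cx, cy, w, h) and concatenates the results. The split of d_model/4 dimensions per coordinate, the temperature of 10000, and the function names are assumptions for this example. The paper's exact layout, including any projection applied afterwards, may differ.

```python
import math


def sine_embed_scalar(value: float, num_feats: int,
                      temperature: float = 10000.0) -> list:
    """Sinusoidal embedding of a scalar in [0, 1] into num_feats dims (even)."""
    scaled = value * 2 * math.pi
    embedding = []
    for i in range(num_feats // 2):
        freq = temperature ** (2 * i / num_feats)
        embedding.append(math.sin(scaled / freq))
        embedding.append(math.cos(scaled / freq))
    return embedding


def embed_anchor_box(box, d_model: int = 256) -> list:
    """Concatenate per-coordinate embeddings of (cx, cy, w, h) into d_model dims."""
    assert d_model % 8 == 0, "need an even number of dims per coordinate"
    per_coord = d_model // 4
    embedding = []
    for coord in box:
        embedding.extend(sine_embed_scalar(coord, per_coord))
    return embedding


anchor = (0.5, 0.5, 0.2, 0.4)          # normalized (cx, cy, w, h)
query_pos = embed_anchor_box(anchor)   # length d_model
print(len(query_pos), query_pos[:4])
```

The same encoding is a natural candidate for the box part of the denoising queries later in the week, since those are also built from (cx, cy, w, h) tuples.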
Q. This makes DETR look more like a classical anchor-based detector — but in a learnable, queryable form. What’s preserved from set prediction (vs anchor methods)?
…
4. DN-DETR — denoising queries ⭐ THE KEY METHOD
This is the most important method in this week’s reading. You already spent time understanding it in our Week 1 conversations — now read the paper to lock it in.
Central trick: at training time, in addition to the N object queries, inject extra queries built from noisy versions of the ground-truth boxes. Each noisy query is pre-assigned to its corresponding clean gt — bypassing the Hungarian matcher entirely for these queries. The model is trained to denoise them (recover the clean gt from the noisy input).
The geometry:
- For each gt: make one noisy copy per noise group, i.e. num_groups copies in total (jitter box coords + sometimes flip class).
- These copies are all trained as positives toward the same clean gt.
- Effectively gives each gt ~num_groups stable positives per step instead of 1 unstable positive.
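The noising step can be sketched in plain Python as follows. The center shift is bounded by λ₁ times half the box size and the width/height are rescaled by a factor in [1 − λ₂, 1 + λ₂], which is how the λ₁/λ₂ parameterization is read here. The default values, the `flip_prob` parameter, and the record layout are illustrative assumptions.

```python
import random


def jitter_box(box, lambda1, lambda2, rng):
    """Return a noisy copy of a normalized (cx, cy, w, h) box."""
    cx, cy, w, h = box
    # Center shift: |dx| <= lambda1 * w / 2, |dy| <= lambda1 * h / 2.
    dx = rng.uniform(-1.0, 1.0) * lambda1 * w / 2
    dy = rng.uniform(-1.0, 1.0) * lambda1 * h / 2
    # Scale jitter: width and height scaled by a factor in [1 - l2, 1 + l2].
    scale_w = 1.0 + rng.uniform(-1.0, 1.0) * lambda2
    scale_h = 1.0 + rng.uniform(-1.0, 1.0) * lambda2
    noisy = (cx + dx, cy + dy, w * scale_w, h * scale_h)
    return tuple(min(max(v, 0.0), 1.0) for v in noisy)  # keep in [0, 1]


def flip_label(label, num_classes, flip_prob, rng):
    """With probability flip_prob, replace the label by a random class."""
    if rng.random() < flip_prob:
        return rng.randrange(num_classes)  # may coincide with the original
    return label


def make_noisy_copies(boxes, labels, num_groups, num_classes,
                      lambda1=0.4, lambda2=0.4, flip_prob=0.2, seed=0):
    """One noisy copy of every gt per group; each copy remembers its gt index."""
    rng = random.Random(seed)
    copies = []
    for group in range(num_groups):
        for gt_index, (box, label) in enumerate(zip(boxes, labels)):
            copies.append({
                "group": group,
                "gt_index": gt_index,  # pre-assigned target, no matcher
                "box": jitter_box(box, lambda1, lambda2, rng),
                "label": flip_label(label, num_classes, flip_prob, rng),
            })
    return copies


copies = make_noisy_copies([(0.5, 0.5, 0.2, 0.4)], [3],
                           num_groups=5, num_classes=10)
for c in copies:
    print(c)
```

Each record carries `gt_index`, which is what lets the loss pair the query with its clean gt without running the Hungarian matcher.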
The architectural detail: attention mask prevents denoising queries from leaking gt info to regular object queries.
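The result: 500 epochs (vanilla DETR) → 50 epochs, and the paper reports matching its baseline under the same setting in about half the training epochs.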
Q1. State the core mechanism in your own words (4 sentences max).
…
Q2. Why is the attention mask necessary? What would happen without it?
…
Q3. Denoising queries are training-only. Why? What changes between training and inference?
…
Q4. Compare denoising to anchor-based detection’s “many positive anchors per gt”. What does each provide, and what’s the same / different?
…
Q5. In §3.2, the paper says noise is parameterized by two scalars: λ₁ (center) and λ₂ (scale). What happens at λ₁=0, λ₂=0? At very high values?
…
5. DINO-DETR — contrastive denoising + mixed query selection
Note on naming: this is not the SSL DINO (Caron et al.) — it’s a confusingly-named DETR variant from Zhang et al. 2022. Often just called “DINO” in detection papers.
Three improvements over DN-DETR:
- Contrastive denoising: in addition to positive denoising queries (small noise → predict the gt), add negative denoising queries (larger noise → predict “no object”). Teaches the model when to abstain.
- Mixed query selection: initialize the positional part of the object queries (the anchor boxes) from the encoder’s top-confidence features, while the content part stays learned.
- Look-forward-twice: the box refinement of each decoder layer receives gradients from the next layer’s loss as well as its own, instead of being detached after every layer.
The result: 50 epochs (DN-DETR) → 12 epochs.
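The positive/negative split in contrastive denoising can be sketched by sampling the noise magnitude from two disjoint ranges: below λ₁ for positives, and between λ₁ and λ₂ for negatives. This simplified version shifts only the box center. The `NO_OBJECT` sentinel, the dictionary layout, and the function names are assumptions for illustration.

```python
import random

NO_OBJECT = -1  # hypothetical sentinel for the "no object" class


def shift_center(box, low, high, rng):
    """Shift the center by a relative magnitude drawn from [low, high]."""
    cx, cy, w, h = box

    def signed_shift(size):
        magnitude = rng.uniform(low, high)
        sign = rng.choice((-1.0, 1.0))
        return sign * magnitude * size / 2

    return (cx + signed_shift(w), cy + signed_shift(h), w, h)


def make_cdn_pair(box, label, lambda1, lambda2, rng):
    """One positive (small noise) and one negative (larger noise) query per gt."""
    positive = {
        "box": shift_center(box, 0.0, lambda1, rng),
        "target_label": label,      # should be denoised back to the gt
        "target_box": box,
    }
    negative = {
        "box": shift_center(box, lambda1, lambda2, rng),
        "target_label": NO_OBJECT,  # should be rejected
        "target_box": None,         # no box loss for negatives
    }
    return positive, negative


rng = random.Random(0)
gt_box, gt_label = (0.5, 0.5, 0.2, 0.2), 7
pos, neg = make_cdn_pair(gt_box, gt_label, lambda1=0.4, lambda2=0.8, rng=rng)
print(pos)
print(neg)
```

Because the negative lands close to the gt but is labeled "no object", the model has to learn that proximity alone is not enough to claim a detection.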
Q1. Why are negative denoising queries useful? What’s the failure mode they prevent?
…
Q2. How does mixed query selection differ from DAB-DETR’s anchor queries? Which is more flexible?
…
Q3. Why is DINO-DETR the SOTA-anchor that almost every recent DETR variant is benchmarked against?
…
6. LW-DETR — lightweight, real-time DETR
The premise: DINO-DETR is accurate but heavy. LW-DETR aims to be faster than YOLO at similar accuracy while keeping DETR’s set-prediction simplicity.
Headline tricks:
- Pre-trained ViT backbone — replace CNN with a lightweight ViT (e.g., ViT-S/16). Cheaper than the ResNet stacks used in earlier DETR variants for the same accuracy.
- Multi-scale feature extraction — but cleverly aggregated to avoid the cost of a full FPN.
- IoU-aware classification loss — classification confidence is calibrated against IoU (so highly-confident-but-poorly-localized predictions are penalized).
- Heavy DETR-side compression — fewer queries, smaller decoder, distilled to a small model.
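To get a feel for what "IoU-aware" means in the list above, the sketch below uses the IoU between the predicted and gt box as a soft classification target in a binary cross-entropy. This is a generic illustration of the idea, not LW-DETR's exact loss. Take the precise formula from the paper. All names and numbers here are made up for the example.

```python
import math


def to_xyxy(box):
    """Convert (cx, cy, w, h) to corner format (x0, y0, x1, y1)."""
    cx, cy, w, h = box
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


def iou(box_a, box_b):
    """Intersection over union of two (cx, cy, w, h) boxes."""
    ax0, ay0, ax1, ay1 = to_xyxy(box_a)
    bx0, by0, bx1, by1 = to_xyxy(box_b)
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


def bce(score, target, eps=1e-7):
    """Binary cross-entropy between a predicted score and a soft target."""
    score = min(max(score, eps), 1.0 - eps)
    return -(target * math.log(score) + (1.0 - target) * math.log(1.0 - score))


gt = (0.5, 0.5, 0.2, 0.2)
tight_pred = (0.5, 0.5, 0.2, 0.2)    # well localized
loose_pred = (0.58, 0.5, 0.2, 0.2)   # shifted, poorly localized
confidence = 0.9                     # same class score for both

loss_tight = bce(confidence, iou(tight_pred, gt))
loss_loose = bce(confidence, iou(loose_pred, gt))
print(f"IoU-as-target loss: tight = {loss_tight:.3f}, loose = {loss_loose:.3f}")
```

With a hard target of 1 both predictions would receive the same classification loss. With the IoU as target, the confident but poorly localized one is penalized more.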
Q1. Why does using a ViT backbone work well here, given that vanilla DETR used ResNet?
…
Q2. IoU-aware classification — what’s the problem with vanilla DETR’s classification confidence?
…
Q3. The paper reports real-time speeds (>60 FPS) on consumer hardware. Where does that speedup come from — backbone, decoder, both, neither?
…
Q4. Is LW-DETR a fundamentally new method or an engineering recombination of earlier tricks?
…
7. RF-DETR — Roboflow’s real-time, open-source DETR
The premise: practical, ready-to-deploy DETR with Roboflow’s polish on top of LW-DETR-style ideas. Open source, with checkpoints and a clean training/inference API.
Not a single research paper — more of a release that combines proven tricks. Read the README and blog post rather than chasing a PDF.
What it inherits from earlier variants:
- DINO-DETR-style denoising and contrastive denoising.
- Deformable attention.
- LW-DETR-style pre-trained ViT backbone.
- Multi-scale features.
Q1. What gap in the open-source DETR ecosystem did RF-DETR fill?
…
Q2. From a practitioner’s perspective, when would you reach for RF-DETR vs DINO-DETR (research codebase) vs a YOLO variant?
…
Q3. The repo provides pretrained models. Run inference on a few of your toy synthetic images using their pretrained checkpoint and compare the predictions to your vanilla DETR’s. What’s the gap?
…
Comparison table — fill this in as you read
| Method | Year | Epochs (COCO) | COCO mAP | Key trick | Speed (FPS) |
|---|---|---|---|---|---|
| Vanilla DETR | 2020 | 500 | … | bipartite matching, set prediction | … |
| Deformable DETR | 2020 | 50 | … | deformable attention + multi-scale | … |
| Conditional DETR | 2021 | 50 | … | content-dependent query PE | … |
| DAB-DETR | 2022 | 50 | … | queries as 4D anchor boxes | … |
| DN-DETR | 2022 | 50 | … | denoising queries (pre-matched positives) | … |
| DINO-DETR | 2022 | 12–24 | … | contrastive denoising + mixed query select | … |
| LW-DETR | 2024 | … | … | ViT backbone + efficient multi-scale | 60+ |
| RF-DETR | 2025 | … | … | open-source LW-DETR + denoising polish | … |
Fill numbers from each paper’s main results table. mAP is on COCO val2017 unless noted. Speed in FPS on a single A100 or V100 — note which GPU each paper used.
Implementation exercise — extend vanilla_detr.py with denoising queries
Goal: add the single most impactful trick (DN-DETR’s denoising) on top of your Week 1 implementation and observe the convergence speedup.
From scratch, this is a moderate-sized change (~100 lines). Step-by-step plan:
Step 1 — build the noisy queries
Given a batch of targets, build noisy versions of the gt boxes and labels:
def build_denoising_queries(targets, num_classes, num_groups=5,
                            box_noise_scale=0.4, label_noise_ratio=0.5):
    """For each image and each of num_groups noise groups, make a noisy copy of every gt.

    Returns:
        dn_queries:   [B, K, d_model], content stream (zero-initialized)
        dn_query_pos: [B, K, d_model], positional embedding built from noisy box + class
        dn_targets:   list of (clean labels, clean boxes) per image, repeated num_groups times
        dn_attn_mask: [K+N, K+N], attention mask preventing leakage to regular queries
    """
    ...

Where K = num_groups × total_M_per_image (variable per batch).
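For the dn_attn_mask entry, here is a minimal sketch of the mask layout as a nested Python list. It assumes the DN queries come first and are ordered group by group, followed by the N regular queries. It also uses the convention of PyTorch boolean attention masks, where True means "may not attend". The function name and the group-major ordering are assumptions.

```python
def build_dn_attn_mask(num_gt: int, num_groups: int, num_queries: int):
    """Boolean [K+N, K+N] mask; True = row query is blocked from column query.

    Layout: K = num_gt * num_groups DN queries (group 0 first, then group 1,
    and so on), followed by num_queries regular object queries.
    """
    k = num_gt * num_groups
    total = k + num_queries
    mask = [[False] * total for _ in range(total)]
    for row in range(total):
        for col in range(total):
            row_is_dn, col_is_dn = row < k, col < k
            if row_is_dn and col_is_dn:
                # DN queries see only their own group.
                mask[row][col] = (row // num_gt) != (col // num_gt)
            elif row_is_dn != col_is_dn:
                # Regular queries never see DN queries (no gt leakage);
                # DN queries are also kept from seeing regular ones here.
                mask[row][col] = True
    return mask


mask = build_dn_attn_mask(num_gt=2, num_groups=2, num_queries=3)
for row in mask:
    print("".join("x" if blocked else "." for blocked in row))
```

If you keep this layout, `torch.tensor(mask, dtype=torch.bool)` gives a tensor you can hand to the decoder's self-attention. For real batch sizes you would build it with tensor slicing instead of a double loop.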
Q. What’s the encoding from a (noisy class, noisy box) tuple into a d_model vector? (Hint: class embed + box positional embed, concatenated or summed.)
…
Step 2 — modify DETR.forward to inject denoising queries
The decoder receives two streams of queries concatenated: the DN queries (size K) + the regular object queries (size N). Total decoder sequence length: K + N.
After the decoder, split the output back:
- First K tokens → DN predictions (used for denoising loss).
- Last N tokens → regular predictions (used for bipartite-matched loss + inference).
Skeleton:
def forward(self, images, targets=None):
    # ... backbone, PE, encoder as before (defines B, memory, pos_flat) ...
    obj_pos = self.query_pos_emb.weight.unsqueeze(0).expand(B, -1, -1)  # [B, N, d]
    obj_tgt = memory.new_zeros(B, self.num_queries, self.d_model)
    if self.training and targets is not None:
        # Same four return values as documented in build_denoising_queries
        dn_queries, dn_pos, dn_targets, dn_attn_mask = build_denoising_queries(targets, ...)
        K = dn_pos.shape[1]
        # Build full query stack: [dn_queries, object_queries]
        query_pos = torch.cat([dn_pos, obj_pos], dim=1)  # [B, K+N, d]
        tgt = torch.cat([dn_queries, obj_tgt], dim=1)    # [B, K+N, d]
    else:
        # Inference (or no targets): regular object queries only, no mask
        K = 0
        query_pos, tgt = obj_pos, obj_tgt
        dn_targets, dn_attn_mask = None, None
    dec_outs = self.decoder(tgt, memory, pos_flat, query_pos, attn_mask=dn_attn_mask)
    # ... apply heads: all_logits / all_boxes are [num_layers, B, K+N, ...] ...
    # Split predictions back
    dn_logits = all_logits[:, :, :K, :]
    dn_boxes = all_boxes[:, :, :K, :]
    obj_logits = all_logits[:, :, K:, :]
    obj_boxes = all_boxes[:, :, K:, :]
    out = {
        'pred_logits': obj_logits[-1], 'pred_boxes': obj_boxes[-1],
        'aux_outputs': [...],
    }
    if K > 0:
        # Only present at training time; the loss reads the clean targets from dn_meta
        out['dn_outputs'] = {'pred_logits': dn_logits[-1], 'pred_boxes': dn_boxes[-1],
                             'dn_meta': dn_targets, 'aux': [...]}
    return out
Q. Why does forward need to receive targets during training? What does this break compared to vanilla DETR?
…
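Q. Modify TransformerDecoderLayer to accept an attn_mask and pass it to the self-attention call only. Why not to the cross-attention call as well? (Hint: compare the mask’s K+N × K+N shape with the number of keys in cross-attention.)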
…
# TODO — modify DETR.forward to inject DN queries at training time
# Also modify TransformerDecoderLayer to accept an attn_mask
Step 3 - add the denoising loss
DN queries don’t go through the matcher. Each DN query has a known target (the clean gt it was noised from). Loss:
- Classification: cross-entropy with the clean class as target (regardless of noisy input class).
- Box L1 + GIoU: against the clean box.
Same loss components as the main loss, just with explicit indices instead of matched ones.
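The box part of the denoising loss can be sketched without tensors. Each DN prediction is paired with its clean gt through an explicit index, then L1 and GIoU are computed on that pair. The loss weights of 5 and 2 and the normalization by the number of DN queries are assumptions borrowed from common DETR settings. Reuse whatever your Week 1 DetrLoss already uses.

```python
def to_xyxy(box):
    """Convert (cx, cy, w, h) to corner format (x0, y0, x1, y1)."""
    cx, cy, w, h = box
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


def giou(box_a, box_b):
    """Generalized IoU of two (cx, cy, w, h) boxes with positive width/height."""
    ax0, ay0, ax1, ay1 = to_xyxy(box_a)
    bx0, by0, bx1, by1 = to_xyxy(box_b)
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    hull = (max(ax1, bx1) - min(ax0, bx0)) * (max(ay1, by1) - min(ay0, by0))
    return inter / union - (hull - union) / hull


def l1(box_a, box_b):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(box_a, box_b))


def dn_box_loss(pred_boxes, clean_boxes, gt_index,
                l1_weight=5.0, giou_weight=2.0):
    """pred_boxes[i] is scored against clean_boxes[gt_index[i]]; no matcher."""
    total = 0.0
    for pred, idx in zip(pred_boxes, gt_index):
        target = clean_boxes[idx]
        total += l1_weight * l1(pred, target)
        total += giou_weight * (1.0 - giou(pred, target))
    return total / max(len(pred_boxes), 1)


clean = [(0.3, 0.3, 0.2, 0.2), (0.7, 0.6, 0.1, 0.3)]
gt_index = [0, 1, 0, 1]                         # two groups, two gts each
preds = [(0.31, 0.3, 0.2, 0.2), (0.7, 0.6, 0.1, 0.3),
         (0.3, 0.3, 0.25, 0.2), (0.72, 0.6, 0.1, 0.3)]
print(f"dn box loss = {dn_box_loss(preds, clean, gt_index):.4f}")
```

The only difference from the main loss is where the pairing comes from: `gt_index` is known in advance, so there is nothing for the matcher to decide.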
Q. Why don’t DN queries get a “no object” class option in the loss? (Hint: they’re always pre-paired to a real gt.)
…
# TODO — add denoising loss component
# Refactor DetrLoss to also compute losses on outputs['dn_outputs'] using the metadata.
# Total loss = main + aux + dn (and dn_aux for each decoder layer)
Step 4 - train and compare convergence
Train your DN-DETR with the same hyperparameters as your vanilla DETR run. Compare the class loss curve.
Expected: class loss should drop faster (and to a lower floor) than vanilla DETR. Box losses should also drop faster as queries specialize more quickly.
Plot to make: loss curves of vanilla vs DN-DETR side by side, log-scaled y-axis. Identify the point where DN-DETR’s class loss crosses vanilla’s eventual floor.
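Finding the crossing point by eye on a noisy curve is error-prone, so a small helper can do it on the logged lists. The sketch below smooths a curve with an exponential moving average, takes the vanilla floor as the mean of the last few logged values, and returns the first step at which the DN curve falls to that floor. The two short lists are placeholder numbers that only exercise the functions. They are not measured results. The helper names and the smoothing choice are assumptions.

```python
def ema(values, beta=0.9):
    """Exponential moving average; the first value seeds the running mean."""
    smoothed, running = [], None
    for v in values:
        running = v if running is None else beta * running + (1 - beta) * v
        smoothed.append(running)
    return smoothed


def floor_of(losses, tail=10):
    """Eventual floor: mean of the last `tail` logged values."""
    tail_values = losses[-tail:]
    return sum(tail_values) / len(tail_values)


def first_crossing(losses, threshold):
    """Index of the first logged value at or below threshold, else None."""
    for step, value in enumerate(losses):
        if value <= threshold:
            return step
    return None


# Placeholder curves, NOT real training logs.
vanilla_curve = [2.0, 1.6, 1.3, 1.1, 1.0, 0.95, 0.9, 0.9]
dn_curve = [2.0, 1.2, 0.9, 0.7, 0.6, 0.55, 0.5, 0.5]

vanilla_floor = floor_of(vanilla_curve, tail=2)
crossing = first_crossing(ema(dn_curve, beta=0.5), vanilla_floor)
print(f"vanilla floor = {vanilla_floor}, DN crosses it at logged step {crossing}")
```

On real logs you would pass your own two lists and mark the returned step on the log-scaled plot, for example with a vertical line.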
# TODO — train DN-DETR, plot loss curves
# 1. Train vanilla_detr (already done in Week 1) — log losses to a list
# 2. Train dn_detr — log losses to a list
# 3. plt.plot both, log-y
Exercises
Pick one or two:
- Off-the-shelf comparison. Install RF-DETR via pip install rfdetr (or follow their repo). Run inference on your toy synthetic images and compare predictions visually + numerically to your vanilla DETR output.
- Negative denoising queries (DINO-DETR slice). Extend your DN-DETR implementation to also include negative denoising queries (larger noise, target = “no object”). Does the model become better at rejecting near-misses?
- Query specialization plot. For your trained DN-DETR, plot a 2D scatter of (avg cx, avg cy) per query across the validation set. You should see queries specializing on different image regions, a phenomenon that emerges much earlier with denoising than without.
- Deformable attention prototype. Write a from-scratch DeformableAttention module (just the formula; no need for the optimized CUDA kernel). Replace one of your decoder’s cross-attention sublayers with it and compare convergence.
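For the query specialization plot, the aggregation step is the only non-obvious part. The sketch below averages the predicted (cx, cy) per query index over a set of images. It assumes you have already collected the predicted boxes as plain tuples, one list per image in query order. Whether you average over all predictions or only the confident ones is a choice left to you. The function name and the input layout are assumptions.

```python
from collections import defaultdict


def mean_center_per_query(per_image_boxes):
    """Average predicted (cx, cy) for each query index across images.

    per_image_boxes: list over images; each entry is a list of
    (cx, cy, w, h) tuples, one per query, in query order.
    """
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for boxes in per_image_boxes:
        for query_idx, (cx, cy, _w, _h) in enumerate(boxes):
            acc = sums[query_idx]
            acc[0] += cx
            acc[1] += cy
            acc[2] += 1
    return {q: (sx / n, sy / n) for q, (sx, sy, n) in sorted(sums.items())}


# Two images, two queries each (placeholder predictions).
predictions = [
    [(0.2, 0.2, 0.1, 0.1), (0.8, 0.8, 0.1, 0.1)],
    [(0.4, 0.2, 0.1, 0.1), (0.6, 0.8, 0.1, 0.1)],
]
centers = mean_center_per_query(predictions)
for query_idx, (cx, cy) in centers.items():
    print(f"query {query_idx}: mean center = ({cx:.2f}, {cy:.2f})")
```

The resulting dictionary maps directly onto a scatter plot: one point per query, x = mean cx, y = mean cy.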
Synthesis — connecting the dots
Q1. Of the four problems with vanilla DETR (sparse supervision, slow attention, no multi-scale, weak query design), which is the most fundamental? Which method most directly attacks it?
…
Q2. DN-DETR is sometimes described as “putting anchors back into DETR.” Argue the case for and against this framing.
…
Q3. Why hasn’t DETR-family killed YOLO yet, despite SOTA accuracy on COCO since DINO-DETR? Speed, ecosystem, ease-of-deployment, or something else?
…
End-of-week recap
Write a ≤250-word summary you could send to a colleague who knows vanilla DETR but hasn’t read these papers. Cover:
- The three core problems with vanilla DETR (recap).
- The two highest-impact tricks (one for attention, one for supervision) that made DETR practical.
- Why DN-DETR / DINO-DETR are the architectural defaults today.
- What LW-DETR / RF-DETR add on top of DINO-DETR.
- The one paper you’d recommend if your colleague only had time for one.
…
Next week (Module 2 / Week 3): segmentation and pose — MaskDINO (segmentation), DETR-Pose (pose estimation). Same set-prediction framing, different prediction targets.