Vision Transformers & SSL

Vision Transformers & Self-Supervised Learning

Transformer architectures for vision, from a from-scratch ViT to modern self-supervised representation learning. A hands-on course in three modules:

Foundations — implement ViT from scratch, compare against Swin / CoAtNet / DeiT, build masked image modeling (MAE).
Transformer-based perception — DETR family for detection, segmentation, and tracking.
Image & video SSL — contrastive methods, masked image modeling, DINOv2, V-JEPA, evaluation harness, and applied projects.

Each module has runnable notebooks (small enough to train on a single GPU) and exercise notebooks with placeholders for your own implementation. Solutions are included alongside.

Companion repo: intro-to-vlms picks up where this leaves off — from CLIP through modern VLMs, alignment, and agentic systems.

Module 1 — Transformer Foundations

Status	Week	Task / Goal	Category	Resources	Solutions
X	Week 1	Re-implement basic ViT from scratch (no framework)	Code	ViT paper (Dosovitskiy et al.) [https://arxiv.org/abs/2010.11929], lucidrains’ vit-pytorch	001_vit_from_scratch.ipynb
X	Week 2	Compare ViT with Swin, CoAtNet, DeiT + deep dive on multi-head self-attention in spatial domain	Theory/Compare	Papers: Swin, DeiT, CoAtNet; timm repo; “Attention Is All You Need”, Annotated Transformer	002_compare_ViTs.ipynb, 002_CoAtNet.ipynb
X	Week 3	Implement masked image modeling (MIM) pretraining	Code	MAE (He et al.), SimMIM	003_MAE.ipynb
X	Week 4	Visualize attention maps, frozen feature extraction, linear probing. Compare DINOv2 vs CLIP attention.	Analysis/Code	DINOv2, DINO, CLIP papers
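
The Week 1 and Week 3 tasks both start from the same step: cutting an image into patches, and (for masked image modeling) hiding most of them. The sketch below shows one way to do that in plain NumPy. The function names, the channels-last layout, and the 75% mask ratio are assumptions chosen for illustration and are not taken from the course notebooks.

```python
import numpy as np

def patchify(images, patch_size):
    """Split (B, H, W, C) images into (B, N, patch_size * patch_size * C) flattened patches."""
    b, h, w, c = images.shape
    assert h % patch_size == 0 and w % patch_size == 0
    gh, gw = h // patch_size, w // patch_size
    x = images.reshape(b, gh, patch_size, gw, patch_size, c)
    x = x.transpose(0, 1, 3, 2, 4, 5)  # (B, gh, gw, p, p, C)
    return x.reshape(b, gh * gw, patch_size * patch_size * c)

def random_masking(patches, mask_ratio, rng):
    """Per-sample random masking: keep a random subset of patches.

    Returns the kept patches, a mask in original patch order (1 = masked),
    and the indices that undo the shuffle.
    """
    b, n, _ = patches.shape
    n_keep = int(n * (1 - mask_ratio))
    noise = rng.random((b, n))
    ids_shuffle = np.argsort(noise, axis=1)        # random permutation per sample
    ids_restore = np.argsort(ids_shuffle, axis=1)  # inverse permutation
    ids_keep = ids_shuffle[:, :n_keep]
    kept = np.take_along_axis(patches, ids_keep[:, :, None], axis=1)
    mask = np.ones((b, n))
    mask[:, :n_keep] = 0                           # first n_keep shuffled slots are visible
    mask = np.take_along_axis(mask, ids_restore, axis=1)
    return kept, mask, ids_restore

rng = np.random.default_rng(0)
images = rng.random((2, 32, 32, 3))
patches = patchify(images, patch_size=8)
kept, mask, ids_restore = random_masking(patches, mask_ratio=0.75, rng=rng)
print(patches.shape, kept.shape, mask.sum(axis=1))
```

Only the kept patches would be passed to the encoder; the mask and restore indices are what a decoder needs to put mask tokens back in the right positions.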

Module 2 — Transformers for Detection, Segmentation & Tracking

Status	Week	Task / Goal	Category	Resources
X	Week 1	DETR — build a Transformer detector from scratch	Code	DETR paper (Carion et al.), facebookresearch/detr
🔲	Week 2	Building on top of DETR — LW-DETR, RF-DETR	Code/Compare	LW-DETR, RF-DETR papers
🔲	Week 3	Segmentation and pose — MaskDINO (segmentation), DETR-Pose (pose estimation); extend `vanilla_detr.py` for segmentation	Code	MaskDINO, DETR-Pose papers
🔲	Week 4	Multi-object tracking — MOTR, MOTRv2, TrackFormer, SAM	Code	MOTR, MOTRv2, TrackFormer, SAM papers
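
DETR-style detectors predict boxes as normalized (cx, cy, w, h) and score them against targets with a generalized IoU term, alongside an L1 term and a class term. As a small building block for the Week 1 detector, the following is a minimal NumPy sketch of the box conversion and pairwise GIoU. The function names are illustrative assumptions and not the API of `vanilla_detr.py` or facebookresearch/detr.

```python
import numpy as np

def box_cxcywh_to_xyxy(boxes):
    """Convert (..., 4) boxes from (cx, cy, w, h) to (x0, y0, x1, y1)."""
    cx, cy, w, h = boxes[..., 0], boxes[..., 1], boxes[..., 2], boxes[..., 3]
    return np.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], axis=-1)

def generalized_iou(boxes_a, boxes_b):
    """Pairwise GIoU between (N, 4) and (M, 4) xyxy boxes, returned as (N, M)."""
    area_a = (boxes_a[:, 2] - boxes_a[:, 0]) * (boxes_a[:, 3] - boxes_a[:, 1])
    area_b = (boxes_b[:, 2] - boxes_b[:, 0]) * (boxes_b[:, 3] - boxes_b[:, 1])

    # Intersection rectangle, clipped to zero size when the boxes do not overlap.
    top_left = np.maximum(boxes_a[:, None, :2], boxes_b[None, :, :2])
    bottom_right = np.minimum(boxes_a[:, None, 2:], boxes_b[None, :, 2:])
    inter_wh = np.clip(bottom_right - top_left, 0, None)
    inter = inter_wh[..., 0] * inter_wh[..., 1]

    union = area_a[:, None] + area_b[None, :] - inter
    iou = inter / union

    # Smallest axis-aligned box enclosing both boxes.
    hull_tl = np.minimum(boxes_a[:, None, :2], boxes_b[None, :, :2])
    hull_br = np.maximum(boxes_a[:, None, 2:], boxes_b[None, :, 2:])
    hull_wh = hull_br - hull_tl
    hull = hull_wh[..., 0] * hull_wh[..., 1]

    return iou - (hull - union) / hull

pred = box_cxcywh_to_xyxy(np.array([[1.0, 1.0, 2.0, 2.0], [0.5, 0.5, 1.0, 1.0]]))
target = np.array([[1.0, 0.0, 3.0, 2.0], [2.0, 0.0, 3.0, 1.0]])
giou = generalized_iou(pred, target)
print(giou)
```

Unlike plain IoU, the result stays informative (negative rather than zero) when two boxes do not overlap, which is why it is useful as a matching cost.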

Module 3 — Self-Supervised Image & Video Representation Learning

Goal: build up from the theory of what makes a good representation, work through every major family of image SSL, move into video SSL, and finish with evaluation + applied projects on real domains.
Anchor question for the whole module: what does a good representation look like, and how do we know we’ve learned one without labels?

Phase 1 — Theory & Foundations of Representation Learning

Status	Week	Task / Goal	Category	Resources	Solutions
◐	Week 1	Representation learning theory — explanatory factors, smoothness, invariance/equivariance, disentanglement, why SSL works at all	Theory	Bengio “Representation Learning: A Review and New Perspectives” (2013), LeCun’s “Cake” / EBM-SSL talks	intro_to_representation_learning.ipynb
🔲	Week 2	Information-theoretic view — mutual information, InfoNCE bound, information bottleneck, why MI bounds are loose in practice	Theory	“On Mutual Information Maximization for Representation Learning” (Tschannen et al.), InfoNCE (Oord et al.), “On Variational Bounds of Mutual Information” (Poole et al.)
🔲	Week 3	Pretext-task era — rotation prediction, jigsaw puzzles, colorization, context prediction. Implement one from scratch as a baseline	Code/Theory	Gidaris (rotation), Noroozi (jigsaw), Zhang (colorization), Doersch (context)
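
For the Week 3 pretext-task baseline, rotation prediction is probably the simplest to set up: each image is rotated by 0, 90, 180, and 270 degrees, and the network is trained to classify which rotation was applied. A minimal sketch of the data side follows, assuming square channels-last images; the function name is hypothetical.

```python
import numpy as np

def make_rotation_batch(images):
    """Return all four 90-degree rotations of (B, H, W, C) square images plus labels.

    Label k means the image was rotated by k * 90 degrees.
    """
    rotated = [np.rot90(images, k=k, axes=(1, 2)) for k in range(4)]
    x = np.concatenate(rotated, axis=0)           # (4B, H, W, C)
    y = np.repeat(np.arange(4), len(images))      # [0]*B + [1]*B + [2]*B + [3]*B
    return x, y

imgs = np.arange(2 * 4 * 4 * 3).reshape(2, 4, 4, 3)
x, y = make_rotation_batch(imgs)
print(x.shape, y)
```

Any classifier can then be trained on `(x, y)` with a standard 4-way cross-entropy loss, and its backbone evaluated as a frozen feature extractor.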

Phase 2 — Contrastive & Joint-Embedding Image SSL

Status	Week	Task / Goal	Category	Resources
🔲	Week 4	Implement SimCLR from scratch — augmentation pipeline, projection head, NT-Xent loss. Train on CIFAR-10 / TinyImageNet on a single GPU	Code	SimCLR paper, google-research/simclr, lightly-ai/lightly
🔲	Week 5	MoCo v1→v2→v3 — momentum encoder + queue, then ViT backbone. Compare against SimCLR on the same data	Code/Compare	MoCo papers, facebookresearch/moco, facebookresearch/moco-v3
🔲	Week 6	Non-contrastive methods — BYOL (predictor + stop-grad), SwAV (clustering), Barlow Twins (cross-corr), VICReg (var/inv/cov). Understand why no negatives still works	Theory/Code	BYOL, SwAV, Barlow Twins, VICReg papers; vturrisi/solo-learn
🔲	Week 7	Read “Understanding self-supervised learning dynamics without contrastive pairs” (Tian et al.) + the collapse-prevention literature	Theory	Tian et al. 2021, “Towards the Generalization of Contrastive Self-Supervised Learning” (Huang et al.), alignment & uniformity paper (Wang & Isola)
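
The NT-Xent loss named in the Week 4 row can be written in a few lines once the two augmented views are stacked into one batch. The sketch below is a NumPy version for reading and checking shapes, not a training implementation; the temperature of 0.5 and the function name are assumptions.

```python
import numpy as np

def nt_xent_loss(z1, z2, temperature=0.5):
    """NT-Xent over N positive pairs (z1[i], z2[i]); both inputs have shape (N, D)."""
    n = len(z1)
    z = np.concatenate([z1, z2], axis=0)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # unit vectors, so dot product = cosine
    sim = z @ z.T / temperature                        # (2N, 2N) similarity logits
    np.fill_diagonal(sim, -np.inf)                     # a sample is never its own negative

    # The positive for row i is the other view of the same image.
    pos_idx = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])

    # Numerically stable log-softmax over each row.
    row_max = sim.max(axis=1, keepdims=True)
    log_prob = sim - row_max - np.log(np.exp(sim - row_max).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(2 * n), pos_idx].mean()

rng = np.random.default_rng(0)
z1 = rng.normal(size=(8, 16))
aligned = nt_xent_loss(z1, z1 + 0.01 * rng.normal(size=z1.shape))
unrelated = nt_xent_loss(z1, rng.normal(size=z1.shape))
print(aligned, unrelated)
```

The loss is lower when the two views of each sample agree than when the second view is unrelated noise, which is a quick sanity check before porting the same logic to a framework with autograd.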

Phase 3 — Masked Image Modeling & Distillation-based SSL

Status	Week	Task / Goal	Category	Resources
🔲	Week 8	Re-visit MAE (from Module 1), compare against SimMIM, BEiT, iBOT, data2vec. Pixel target vs feature target vs discrete-token target	Theory/Compare	MAE, SimMIM, BEiT, iBOT, data2vec papers; 003_MAE.ipynb
🔲	Week 9	DINO & DINOv2 — student/teacher with EMA, centering, sharpening, multi-crop. Implement DINO from scratch on a small dataset	Code	DINO, DINOv2 papers; facebookresearch/dino, facebookresearch/dinov2
🔲	Week 10	I-JEPA — predicting in representation space instead of pixel space. Read paper, run inference, understand why latent prediction beats pixel reconstruction for semantic features	Theory/Code	I-JEPA paper, facebookresearch/ijepa
🔲	Week 11	DINOv3 + Franca (fully open-source DINOv2-class). Compare DINOv3 register tokens vs DINOv2; reproduce a small Franca run	Code/Compare	DINOv3, valeoai/Franca, “Vision Transformers Need Registers” (Darcet et al.)
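
The student/teacher ingredients listed for Week 9 (EMA teacher, centering, sharpening) are each a one-line operation. The sketch below puts them side by side in NumPy. The hyperparameter values and function names are assumptions for illustration and should be checked against the DINO paper and repo before use.

```python
import numpy as np

def log_softmax(x):
    """Numerically stable log-softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

def ema_update(teacher_params, student_params, momentum=0.996):
    """teacher <- m * teacher + (1 - m) * student, applied to each parameter array."""
    return {
        name: momentum * teacher_params[name] + (1 - momentum) * student_params[name]
        for name in teacher_params
    }

def dino_loss(student_logits, teacher_logits, center, student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy from centered, sharpened teacher targets to student predictions."""
    # Centering subtracts a running mean; sharpening is the low teacher temperature.
    targets = np.exp(log_softmax((teacher_logits - center) / teacher_temp))
    log_p = log_softmax(student_logits / student_temp)
    return -(targets * log_p).sum(axis=-1).mean()

def update_center(center, teacher_logits, momentum=0.9):
    """Running mean of teacher outputs, used for centering."""
    return momentum * center + (1 - momentum) * teacher_logits.mean(axis=0)

rng = np.random.default_rng(0)
student_logits = rng.normal(size=(4, 10))
teacher_logits = rng.normal(size=(4, 10))
center = np.zeros(10)

loss = dino_loss(student_logits, teacher_logits, center)
center = update_center(center, teacher_logits)
new_teacher = ema_update({"w": np.zeros(3)}, {"w": np.ones(3)}, momentum=0.9)
print(loss, new_teacher["w"])
```

In a real training loop the teacher receives no gradients: only the student is optimized, and the teacher and the center are updated after each step.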

Phase 4 — Video Self-Supervised Learning

Status	Week	Task / Goal	Category	Resources	Solutions
🔲	Week 12	VideoMAE / VideoMAEv2 — tube masking, why video needs higher masking ratios than images, dual masking. Run inference + small finetune on Kinetics subset	Code	VideoMAE, VideoMAEv2 papers; MCG-NJU/VideoMAE
X	Week 13	V-JEPA & V-JEPA 2.1 — life-of-an-input walkthrough: 3D patch embed, multi-level deep supervision, mask-token predictor, EMA target	Code/Theory	V-JEPA, V-JEPA 2 papers; facebookresearch/jepa	v-jepa2_1.ipynb
🔲	Week 14	Temporal-coherence & motion-aware SSL — CVRL, TimeContrast, MaskFeat, ST-MAE. Why slow-features and temporal-equivariance priors matter	Theory/Compare	CVRL, MaskFeat, ST-MAE, “Slow Feature Analysis” (Wiskott & Sejnowski)
🔲	Week 15	Cross-modal SSL for video — audio-visual (AVID-CMA, MAViL), video-text (VideoCLIP, InternVideo). Discuss CLIP itself as an SSL method	Theory	AVID-CMA, MAViL, VideoCLIP, InternVideo papers
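
Tube masking, from the Week 12 row, samples one spatial mask and reuses it for every frame, so a hidden patch cannot simply be copied from the same location in a neighboring frame. A minimal NumPy sketch is below; the grid size, the 90% ratio, and the function name are assumptions chosen for illustration.

```python
import numpy as np

def tube_mask(num_frames, grid_h, grid_w, mask_ratio, rng):
    """Sample one spatial mask and repeat it across time (True = masked)."""
    n = grid_h * grid_w
    n_mask = int(round(n * mask_ratio))
    flat = np.zeros(n, dtype=bool)
    flat[rng.choice(n, size=n_mask, replace=False)] = True
    spatial = flat.reshape(grid_h, grid_w)
    # Same spatial pattern in every frame, so each masked location forms a "tube".
    return np.broadcast_to(spatial, (num_frames, grid_h, grid_w)).copy()

rng = np.random.default_rng(0)
video_mask = tube_mask(num_frames=8, grid_h=14, grid_w=14, mask_ratio=0.9, rng=rng)
print(video_mask.shape, video_mask[0].sum())
```

Comparing this against independent per-frame masking at the same ratio is one way to explore why video tolerates, and benefits from, higher masking ratios than images.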

Phase 5 — Evaluation Frameworks for Representation Learning

How do we actually measure whether a representation is good? This phase is its own deliverable — a small eval/ library you can reuse for the rest of the course.

Status	Week	Task / Goal	Category	Resources
🔲	Week 16	Build a probe harness — linear probe, kNN probe, fine-tune. Apply to a frozen DINOv2 / V-JEPA backbone on CIFAR-100 / iNaturalist / UCF-101	Code/Eval	DINOv2 eval recipe, VISSL, lightly benchmarks
🔲	Week 17	Dense / structured evaluation — frozen-features for detection (COCO), segmentation (ADE20k), depth (NYUv2), tracking. The “DINOv2-style” eval suite	Eval	DINOv2 paper §5, “How well do SSL models transfer?” (Ericsson et al.)
🔲	Week 18	Robustness & OOD probing — ImageNet-A/C/R, ObjectNet, VTAB-1k. Disentanglement metrics (β-VAE, FactorVAE, DCI) on a synthetic dataset	Eval/Theory	VTAB paper, “Challenging Common Assumptions in the Unsupervised Learning of Disentangled Representations” (Locatello et al.)
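
A kNN probe is the cheapest piece of the Week 16 harness, since it needs no training on top of the frozen features. The sketch below uses cosine similarity and an unweighted majority vote over synthetic features. This is a simplification: published recipes often use similarity-weighted voting, and the function names here are hypothetical rather than part of the course's eval/ library.

```python
import numpy as np

def knn_probe(train_feats, train_labels, test_feats, k=5):
    """Predict each test label by majority vote among the k most cosine-similar train features."""
    a = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    b = test_feats / np.linalg.norm(test_feats, axis=1, keepdims=True)
    sim = b @ a.T                                   # (num_test, num_train)
    nn_idx = np.argsort(-sim, axis=1)[:, :k]        # indices of the k nearest neighbors
    nn_labels = train_labels[nn_idx]                # (num_test, k)
    num_classes = int(train_labels.max()) + 1
    votes = np.stack([(nn_labels == c).sum(axis=1) for c in range(num_classes)], axis=1)
    return votes.argmax(axis=1)

def accuracy(pred, labels):
    return float((pred == labels).mean())

# Synthetic stand-in for frozen backbone features: two well-separated clusters.
rng = np.random.default_rng(0)
centers = np.array([[5.0, 0.0], [0.0, 5.0]])
train_labels = np.repeat([0, 1], 50)
test_labels = np.repeat([0, 1], 20)
train_feats = centers[train_labels] + 0.5 * rng.normal(size=(100, 2))
test_feats = centers[test_labels] + 0.5 * rng.normal(size=(40, 2))

acc = accuracy(knn_probe(train_feats, train_labels, test_feats, k=5), test_labels)
print(acc)
```

Swapping the synthetic arrays for features extracted once from a frozen backbone and cached to disk keeps the probe fast enough to run across many checkpoints.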

Deep dive on SSL evaluation benchmarks

Benchmark	What it measures	Modality
Linear probe @ ImageNet-1k	Canonical headline number — is the feature linearly separable for 1k classes?	Image
kNN @ ImageNet-1k	Quality of the raw embedding geometry without any training on top	Image
VTAB-1k	19 diverse tasks in low-data regime — natural, specialized, structured	Image
ADE20k / Cityscapes (frozen features)	Dense semantic features — pixel-level not just image-level	Image (dense)
COCO detection / instance seg (frozen)	Localization quality of the frozen backbone	Image (dense)
NYUv2 depth / NAVI (frozen)	Geometric features, 3D awareness from 2D pretraining	Image (dense)
Kinetics-400 / SSv2 (linear / finetune)	Action recognition — Kinetics rewards appearance, SSv2 rewards temporal reasoning	Video
UCF-101 / HMDB-51 (linear / finetune)	Smaller video benchmarks, common for compute-constrained eval	Video
EPIC-Kitchens	Egocentric, long-tail action recognition + anticipation	Video
ImageNet-A / C / R / Sketch / ObjectNet	Robustness to natural adversarials, corruptions, renditions, OOD	Image
DCI / MIG / FactorVAE score	Disentanglement — does the representation align with true factors of variation?	Synthetic

Phase 6 — Applied Projects: SSL That Actually Shipped

Status	Week	Task / Goal	Category	Resources
🔲	Week 19	DINOv2-as-backbone project — pick a niche domain (medical, satellite, microscopy, retail) and beat a supervised baseline with frozen DINOv2 features + small head	Project	DINOv2 repo, RetFound (retinal), SatMAE (satellite), Prov-GigaPath (pathology)
🔲	Week 20	Video SSL for embodied AI — read VC-1 (“Where are we in the search for an artificial visual cortex?”) and run V-JEPA-2 features on a robotics / video-QA task	Project	VC-1 paper, V-JEPA 2 release, facebookresearch/eai-vc
🔲	Week 21	Capstone — train a small SSL model from scratch on your own domain (≤ 100k unlabeled images or ≤ 1k unlabeled videos), evaluate with the Phase-5 harness, write up what worked	Capstone	All of the above

Suggested reading / watching

Theory & survey

Talks / videos

Libraries to know

Domain success stories worth studying