Vision Transformers & Self-Supervised Learning
Transformer architectures for vision, from a from-scratch ViT to modern self-supervised representation learning. A hands-on course in three modules:
- Foundations — implement ViT from scratch, compare against Swin / CoAtNet / DeiT, build masked image modeling (MAE).
- Transformer-based perception — DETR family for detection, segmentation, and tracking.
- Image & video SSL — contrastive methods, masked image modeling, DINOv2, V-JEPA, evaluation harness, and applied projects.
Each module has runnable notebooks (small enough to train on a single GPU) and exercise notebooks with placeholders for your own implementation. Solutions are included alongside.
Companion repo: intro-to-vlms picks up where this leaves off — from CLIP through modern VLMs, alignment, and agentic systems.
Module 3 — Self-Supervised Image & Video Representation Learning
- Goal: build up from the theory of what makes a good representation, work through every major family of image SSL, move into video SSL, and finish with evaluation + applied projects on real domains.
- Anchor question for the whole module: what does a good representation look like, and how do we know we’ve learned one without labels?
Phase 1 — Theory & Foundations of Representation Learning
| Status | Week | Topic | Type | Resources | Notebook |
|---|---|---|---|---|---|
| ◐ | Week 1 | Representation learning theory: explanatory factors, smoothness, invariance/equivariance, disentanglement, why SSL works at all | Theory | Bengio “Representation Learning: A Review and New Perspectives” (2013), LeCun’s “Cake” / EBM-SSL talks | intro_to_representation_learning.ipynb |
| 🔲 | Week 2 | Information-theoretic view: mutual information, InfoNCE bound, information bottleneck, why MI bounds are loose in practice | Theory | “On Mutual Information Maximization for Representation Learning” (Tschannen et al.), InfoNCE (Oord et al.), “On Variational Bounds of Mutual Information” (Poole et al.) | |
| 🔲 | Week 3 | Pretext-task era: rotation prediction, jigsaw puzzles, colorization, context prediction. Implement one from scratch as a baseline | Code/Theory | Gidaris (rotation), Noroozi (jigsaw), Zhang (colorization), Doersch (context) | |
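As a concrete anchor for the Week 2 material, here is a minimal NumPy sketch of the InfoNCE estimate and of one reason the bound is loose in practice: with N samples per batch it can never exceed log N. The function name `infonce_lower_bound` and the toy critic score matrices are illustrative assumptions, not code from the course notebooks.

```python
import numpy as np


def infonce_lower_bound(scores: np.ndarray) -> float:
    """InfoNCE estimate of I(X; Y) from an (N, N) critic score matrix.

    scores[i, j] is a critic value f(x_i, y_j). The diagonal holds the
    positive (jointly sampled) pairs; off-diagonal entries act as negatives.
    """
    n = scores.shape[0]
    # Row-wise log-softmax, computed stably by subtracting the row maximum.
    shifted = scores - scores.max(axis=1, keepdims=True)
    log_softmax = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Mean log-probability of picking the positive, plus log N.
    # Because every log-softmax entry is <= 0, the result is <= log N.
    return float(np.mean(np.diag(log_softmax)) + np.log(n))


n = 8
perfect_critic = 1000.0 * np.eye(n)      # positives score far above negatives
uninformative_critic = np.zeros((n, n))  # critic cannot tell pairs apart

print(infonce_lower_bound(perfect_critic), np.log(n))  # estimate saturates at log N
print(infonce_lower_bound(uninformative_critic))       # estimate is numerically 0
```

Even a perfect critic reports at most log N nats, so estimating a large mutual information this way requires a batch size that grows exponentially with it.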
Phase 2 — Contrastive & Joint-Embedding Image SSL
| Status | Week | Topic | Type | Resources | Notebook |
|---|---|---|---|---|---|
| 🔲 | Week 4 | Implement SimCLR from scratch: augmentation pipeline, projection head, NT-Xent loss. Train on CIFAR-10 / TinyImageNet on a single GPU | Code | SimCLR paper, google-research/simclr, lightly-ai/lightly | |
| 🔲 | Week 5 | MoCo v1→v2→v3: momentum encoder + queue, then ViT backbone. Compare against SimCLR on the same data | Code/Compare | MoCo papers, facebookresearch/moco, facebookresearch/moco-v3 | |
| 🔲 | Week 6 | Non-contrastive methods: BYOL (predictor + stop-grad), SwAV (clustering), Barlow Twins (cross-correlation), VICReg (variance/invariance/covariance). Understand why training without negatives still works | Theory/Code | BYOL, SwAV, Barlow Twins, VICReg papers; vturrisi/solo-learn | |
| 🔲 | Week 7 | Read “Understanding self-supervised learning dynamics without contrastive pairs” (Tian et al.) + the collapse-prevention literature | Theory | Tian et al. 2021, “Towards the Generalization of Contrastive Self-Supervised Learning” (Huang et al.), “Understanding Contrastive Representation Learning through Alignment and Uniformity on the Hypersphere” (Wang & Isola) | |
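For the Week 4 exercise, the NT-Xent loss can be sketched in a few lines of NumPy. This is an illustration under stated assumptions (the function name `nt_xent`, a temperature of 0.5, forward pass only), not the reference SimCLR implementation; a version used for training would live in an autodiff framework.

```python
import numpy as np


def nt_xent(z1: np.ndarray, z2: np.ndarray, temperature: float = 0.5) -> float:
    """NT-Xent loss for two (N, D) batches of projections.

    Row i of z1 and row i of z2 are two augmented views of the same image.
    """
    n = z1.shape[0]
    z = np.concatenate([z1, z2], axis=0)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # unit vectors -> cosine similarity
    sim = z @ z.T / temperature                       # (2N, 2N) scaled similarities
    np.fill_diagonal(sim, -np.inf)                    # a view never competes with itself

    # Row i (first view) has its positive at row i + N, and vice versa.
    positive_index = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])
    positive_sim = sim[np.arange(2 * n), positive_index]

    # Stable log-sum-exp over all other 2N - 1 views in the batch.
    row_max = sim.max(axis=1, keepdims=True)
    log_denominator = row_max[:, 0] + np.log(np.exp(sim - row_max).sum(axis=1))
    return float(np.mean(log_denominator - positive_sim))


views = np.eye(4)                                    # four mutually orthogonal embeddings
aligned = nt_xent(views, views)                      # each positive pair is identical
mismatched = nt_xent(views, np.roll(views, 1, axis=0))  # positives point elsewhere
print(aligned < mismatched)
```

The toy comparison only checks the direction of the loss: pairs whose two views agree score lower than pairs whose views were shuffled against each other.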
Phase 3 — Masked Image Modeling & Distillation-based SSL
| Status | Week | Topic | Type | Resources | Notebook |
|---|---|---|---|---|---|
| 🔲 | Week 8 | Re-visit MAE (from Module 1), compare against SimMIM, BEiT, iBOT, data2vec. Pixel target vs feature target vs discrete-token target | Theory/Compare | MAE, SimMIM, BEiT, iBOT, data2vec papers; 003_MAE.ipynb | |
| 🔲 | Week 9 | DINO & DINOv2: student/teacher with EMA, centering, sharpening, multi-crop. Implement DINO from scratch on a small dataset | Code | DINO, DINOv2 papers; facebookresearch/dino, facebookresearch/dinov2 | |
| 🔲 | Week 10 | I-JEPA: predicting in representation space instead of pixel space. Read paper, run inference, understand why latent prediction beats pixel reconstruction for semantic features | Theory/Code | I-JEPA paper, facebookresearch/ijepa | |
| 🔲 | Week 11 | DINOv3 + Franca (fully open-source DINOv2-class). Compare DINOv3 register tokens vs DINOv2; reproduce a small Franca run | Code/Compare | DINOv3, valeoai/Franca, “Vision Transformers Need Registers” (Darcet et al.) | |
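The three DINO ingredients named in Week 9 (EMA teacher, centering, sharpening) can be sketched as below. This is a forward-only NumPy illustration for a single pair of views; the function names are hypothetical, and the temperatures and momenta are illustrative defaults rather than a tuned recipe. Multi-crop and the projection head are omitted.

```python
import numpy as np


def log_softmax(x: np.ndarray) -> np.ndarray:
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))


def dino_loss(student_logits, teacher_logits, center,
              student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between a centered, sharpened teacher and the student."""
    # Centering: subtract a running mean of teacher outputs.
    # Sharpening: divide by a low temperature so the target becomes peaky.
    # In a real training loop no gradient flows through the teacher branch.
    teacher_probs = np.exp(log_softmax((teacher_logits - center) / teacher_temp))
    student_log_probs = log_softmax(student_logits / student_temp)
    return float(-(teacher_probs * student_log_probs).sum(axis=-1).mean())


def update_center(center, teacher_logits, momentum=0.9):
    """Exponential moving average of the teacher's batch-mean output."""
    batch_mean = teacher_logits.mean(axis=0, keepdims=True)
    return momentum * center + (1.0 - momentum) * batch_mean


def ema_update(teacher_params, student_params, momentum=0.996):
    """Teacher weights follow the student through an exponential moving average."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]


rng = np.random.default_rng(0)
student_logits = rng.normal(size=(4, 16))
teacher_logits = rng.normal(size=(4, 16))
center = np.zeros((1, 16))

loss = dino_loss(student_logits, teacher_logits, center)
center = update_center(center, teacher_logits)
print(loss, center.shape)
```

Centering alone pushes the teacher toward a uniform output and sharpening alone pushes it toward a single dimension; a useful exercise is to disable one of them and watch which form of collapse appears.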
Phase 4 — Video Self-Supervised Learning
| Status | Week | Topic | Type | Resources | Notebook |
|---|---|---|---|---|---|
| 🔲 | Week 12 | VideoMAE / VideoMAEv2: tube masking, why video needs higher masking ratios than images, dual masking. Run inference + small finetune on Kinetics subset | Code | VideoMAE, VideoMAEv2 papers; MCG-NJU/VideoMAE | |
| X | Week 13 | V-JEPA & V-JEPA 2.1, life-of-an-input walkthrough: 3D patch embed, multi-level deep supervision, mask-token predictor, EMA target | Code/Theory | V-JEPA, V-JEPA 2 papers; facebookresearch/jepa | v-jepa2_1.ipynb |
| 🔲 | Week 14 | Temporal-coherence & motion-aware SSL: CVRL, TimeContrast, MaskFeat, ST-MAE. Why slow-features and temporal-equivariance priors matter | Theory/Compare | CVRL, MaskFeat, ST-MAE, “Slow Feature Analysis” (Wiskott & Sejnowski) | |
| 🔲 | Week 15 | Cross-modal SSL for video: audio-visual (AVID-CMA, MAViL), video-text (VideoCLIP, InternVideo). Discuss CLIP itself as an SSL method | Theory | AVID-CMA, MAViL, VideoCLIP, InternVideo papers | |
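Tube masking from Week 12 is easy to state in code: sample one spatial mask and repeat it across every frame, so a masked patch cannot be recovered by copying the same location from a neighbouring frame. The sketch below is a minimal NumPy version; the function name `tube_mask`, the 8×14×14 token grid, and the 0.9 ratio are illustrative assumptions, not values taken from the course notebooks.

```python
import numpy as np


def tube_mask(num_frames, height, width, mask_ratio=0.9, rng=None):
    """Boolean (T, H, W) mask over video tokens; True marks a masked token.

    The same spatial pattern is used for every temporal index, which
    forms "tubes" through time.
    """
    rng = np.random.default_rng() if rng is None else rng
    num_patches = height * width
    num_masked = int(round(mask_ratio * num_patches))

    # Pick which spatial positions to hide, once for the whole clip.
    spatial = np.zeros(num_patches, dtype=bool)
    spatial[rng.choice(num_patches, size=num_masked, replace=False)] = True

    # Repeat the spatial pattern along the time axis.
    spatial = spatial.reshape(1, height, width)
    return np.broadcast_to(spatial, (num_frames, height, width)).copy()


mask = tube_mask(num_frames=8, height=14, width=14, mask_ratio=0.9,
                 rng=np.random.default_rng(0))
visible_tokens = int((~mask).sum())  # tokens the encoder would actually see
print(mask.shape, visible_tokens)
```

Comparing this against independent per-frame masking at the same ratio is a quick way to see why temporal redundancy makes the per-frame variant a much easier pretext task.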
Phase 5 — Evaluation Frameworks for Representation Learning
- How do we actually measure whether a representation is good? This phase is its own deliverable: a small eval/ library you can reuse for the rest of the course.
| Status | Week | Topic | Type | Resources | Notebook |
|---|---|---|---|---|---|
| 🔲 | Week 16 | Build a probe harness: linear probe, kNN probe, fine-tune. Apply to a frozen DINOv2 / V-JEPA backbone on CIFAR-100 / iNaturalist / UCF-101 | Code/Eval | DINOv2 eval recipe, VISSL, lightly benchmarks | |
| 🔲 | Week 17 | Dense / structured evaluation: frozen features for detection (COCO), segmentation (ADE20k), depth (NYUv2), tracking. The “DINOv2-style” eval suite | Eval | DINOv2 paper §5, “How well do SSL models transfer?” (Ericsson et al.) | |
| 🔲 | Week 18 | Robustness & OOD probing: ImageNet-A/C/R, ObjectNet, VTAB-1k. Disentanglement metrics (β-VAE, FactorVAE, DCI) on a synthetic dataset | Eval/Theory | VTAB paper, “Challenging Common Assumptions in the Unsupervised Learning of Disentangled Representations” (Locatello et al.) | |
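A kNN probe, the simplest piece of the Week 16 harness, needs nothing beyond frozen features and labels. The sketch below assumes features have already been extracted into NumPy arrays and uses plain majority voting over cosine-similarity neighbours; published recipes often use similarity-weighted votes instead, so treat `knn_probe` as a hypothetical starting point for the eval/ library rather than a reproduction of any paper's protocol.

```python
import numpy as np


def knn_probe(train_feats, train_labels, test_feats, k=5):
    """Predict test labels by majority vote among the k most similar training features."""
    # Cosine similarity: L2-normalise, then take dot products.
    train = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    test = test_feats / np.linalg.norm(test_feats, axis=1, keepdims=True)
    sim = test @ train.T                                # (num_test, num_train)

    nearest = np.argsort(-sim, axis=1)[:, :k]           # indices of the top-k neighbours
    neighbor_labels = train_labels[nearest]             # (num_test, k)

    # Count votes per class; ties resolve to the lowest class index.
    num_classes = int(train_labels.max()) + 1
    votes = np.stack([(neighbor_labels == c).sum(axis=1)
                      for c in range(num_classes)], axis=1)
    return votes.argmax(axis=1)


# Synthetic stand-in for frozen backbone features: two well-separated directions.
rng = np.random.default_rng(0)
centers = np.array([[1.0, 0.0], [0.0, 1.0]])
train_labels = np.repeat([0, 1], 50)
test_labels = np.repeat([0, 1], 20)
train_feats = centers[train_labels] + 0.05 * rng.normal(size=(100, 2))
test_feats = centers[test_labels] + 0.05 * rng.normal(size=(40, 2))

predictions = knn_probe(train_feats, train_labels, test_feats, k=5)
accuracy = float((predictions == test_labels).mean())
print(accuracy)
```

Because nothing is trained on top of the features, this probe reflects the raw geometry of the embedding, which makes it a useful sanity check before running the slower linear and fine-tuning probes.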
Deep dive on SSL evaluation benchmarks
| Benchmark | What it measures | Modality |
|---|---|---|
| Linear probe @ ImageNet-1k | Canonical headline number: is the feature linearly separable for 1k classes? | Image |
| kNN @ ImageNet-1k | Quality of the raw embedding geometry without any training on top | Image |
| VTAB-1k | 19 diverse tasks in a low-data regime: natural, specialized, structured | Image |
| ADE20k / Cityscapes (frozen features) | Dense semantic features: pixel-level, not just image-level | Image (dense) |
| COCO detection / instance seg (frozen) | Localization quality of the frozen backbone | Image (dense) |
| NYUv2 depth / NAVI (frozen) | Geometric features, 3D awareness from 2D pretraining | Image (dense) |
| Kinetics-400 / SSv2 (linear / finetune) | Action recognition: Kinetics rewards appearance, SSv2 rewards temporal reasoning | Video |
| UCF-101 / HMDB-51 (linear / finetune) | Smaller video benchmarks, common for compute-constrained eval | Video |
| EPIC-Kitchens | Egocentric, long-tail action recognition + anticipation | Video |
| ImageNet-A / C / R / Sketch / ObjectNet | Robustness to natural adversarials, corruptions, renditions, OOD | Image |
| DCI / MIG / FactorVAE score | Disentanglement: does the representation align with true factors of variation? | Synthetic |
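To make the linear-probe rows above concrete, here is a small sketch that fits a linear classifier on frozen features. Note the swapped-in technique: instead of the usual logistic regression trained with SGD, it uses closed-form ridge regression onto one-hot targets, which is quick to run and adequate for a sanity check. The function names and the `l2` value are assumptions for illustration only.

```python
import numpy as np


def fit_linear_probe(feats, labels, num_classes, l2=1e-3):
    """Ridge regression from frozen features (plus a bias column) to one-hot labels."""
    x = np.hstack([feats, np.ones((feats.shape[0], 1))])  # append bias feature
    targets = np.eye(num_classes)[labels]                 # (N, num_classes) one-hot
    gram = x.T @ x + l2 * np.eye(x.shape[1])              # regulariser also covers the bias
    return np.linalg.solve(gram, x.T @ targets)           # (D + 1, num_classes)


def predict_linear_probe(weights, feats):
    x = np.hstack([feats, np.ones((feats.shape[0], 1))])
    return (x @ weights).argmax(axis=1)


# Synthetic stand-in for frozen features: two linearly separable clusters.
rng = np.random.default_rng(0)
centers = np.array([[2.0, 0.0], [-2.0, 0.0]])
train_labels = np.repeat([0, 1], 50)
test_labels = np.repeat([0, 1], 20)
train_feats = centers[train_labels] + 0.1 * rng.normal(size=(100, 2))
test_feats = centers[test_labels] + 0.1 * rng.normal(size=(40, 2))

weights = fit_linear_probe(train_feats, train_labels, num_classes=2)
probe_accuracy = float((predict_linear_probe(weights, test_feats) == test_labels).mean())
print(probe_accuracy)
```

The backbone is never updated here; only the `weights` matrix is fitted, which is what distinguishes a linear probe from the fine-tuning rows in the table.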
Phase 6 — Applied Projects: SSL That Actually Shipped
| Status | Week | Topic | Type | Resources |
|---|---|---|---|---|
| 🔲 | Week 19 | DINOv2-as-backbone project: pick a niche domain (medical, satellite, microscopy, retail) and beat a supervised baseline with frozen DINOv2 features + small head | Project | DINOv2 repo, RetFound (retinal), SatMAE (satellite), Prov-GigaPath (pathology) |
| 🔲 | Week 20 | Video SSL for embodied AI: read VC-1 (“Where are we in the search for an artificial visual cortex?”) and run V-JEPA-2 features on a robotics / video-QA task | Project | VC-1 paper, V-JEPA 2 release, facebookresearch/eai-vc |
| 🔲 | Week 21 | Capstone: train a small SSL model from scratch on your own domain (≤ 100k unlabeled images or ≤ 1k unlabeled videos), evaluate with the Phase-5 harness, write up what worked | Capstone | All of the above |
Suggested reading / watching
Theory & survey
- Bengio, Courville, Vincent — “Representation Learning: A Review and New Perspectives” (the canonical reference, still worth reading)
- Yann LeCun — “A Path Towards Autonomous Machine Intelligence” (the JEPA / world-model position paper)
- Ericsson, Gouk, Hospedales — “How Well Do Self-Supervised Models Transfer?” (CVPR 2021) — sober empirical comparison
- Balestriero et al. — “A Cookbook of Self-Supervised Learning” (Meta, 2023) — practical recipes and failure modes
Talks / videos
- Yann LeCun — Energy-Based SSL / JEPA talks (multiple venues, search “LeCun JEPA”)
- Lucas Beyer — “Vision Transformers” and “DINO / DINOv2” talks
- Mathilde Caron — DINO presentation (original author)
- Yannic Kilcher paper walkthroughs — SimCLR, MoCo, BYOL, DINO, MAE, V-JEPA
- Stanford CS231n lecture on self-supervised learning
- MIT 6.S898 Deep Learning — SSL lectures
Libraries to know
- lightly — clean PyTorch implementations of every major image SSL method, good for ablations
- solo-learn — research-grade SSL training library
- VISSL — Meta’s SSL benchmarking framework (older but still useful)
- mmselfsup — OpenMMLab’s SSL toolbox
Domain success stories worth studying
- RetFound (Nature 2023) — MAE on 1.6M retinal images, transfers to ocular and systemic disease
- SatMAE / Scale-MAE — masked autoencoders for satellite imagery
- Prov-GigaPath — pathology foundation model
- VC-1 / V-JEPA 2 — embodied / robotics
- DINOv2 + SAM — segmentation pipelines built on frozen SSL features