Phase 1 / Week 1 — Representation Learning: Theory & Foundations

Anchor question for Module 3: what does a good representation look like, and how do we know we’ve learned one without labels?

This week’s reading:
- Bengio, Courville, Vincent: Representation Learning: A Review and New Perspectives (2013), https://arxiv.org/abs/1206.5538
- Yann LeCun: Cake analogy + Energy-Based SSL / JEPA talks (e.g. NeurIPS 2016 keynote, AAAI 2020, A Path Towards Autonomous Machine Intelligence position paper, 2022)

What you should be able to do by the end of the week:
1. State, in one sentence each, the ~10 priors Bengio argues a good representation should respect.
2. Explain the difference between invariance and equivariance with a concrete image example.
3. Argue why pixel-space reconstruction is a weak SSL signal (LeCun’s framing; this motivates JEPA later).
4. Reproduce LeCun’s cake analogy from memory, and explain what changed between his 2016 and 2022 versions of it.
5. List 2–3 concrete failure modes of joint-embedding SSL (representation collapse) that any of the methods in Phases 2–3 must avoid.

Learning Representations

Instead of learning f(x) -> y directly, we can learn a representation g: X -> Z and then learn the classifier/predictor f: Z -> Y on top of it.

Goal: learn only one or a few representations for each domain, then learn simple predictors for different tasks.

Learning g should need no, less, or a different kind of supervision, so we can use more data.
“AI must […] learn to identify and disentangle the underlying explanatory factors hidden in the observed milieu of low-level sensory data.” (Bengio et al., 2013)

Explanatory factors / factors of variation

Observed data is a causal result of underlying factors; variations in these factors explain the variation in the data.
Example: factors of images -> objects depicted, rotation, lighting, etc.
We want to recover the underlying factors.

What makes a good representation?

Smoothness: if x1 ≈ x2 then g(x1) ≈ g(x2). This is the most basic prior.
“Less” supervised learning: train semi-supervised / self-supervised.
Invariance/equivariance/coherence: generally, small temporal/spatial changes should result in similar representations.
1. Domain-specific: image representations should be invariant under transformations like rotations, color jitter, etc.

Bengio — the full list of generic priors

Bengio §3 enumerates priors a good representation should encode. You have smoothness, less-supervised, and a version of coherence/invariance above. Below are the rest — fill each one in your own words as you read, with a concrete image/video example where possible.

Convention used below: prior name — one-sentence definition. Concrete example. Why it matters for SSL.

1. Smoothness (already noted above)

Already in your notes. One thing worth adding when you read: why is smoothness alone not enough for high-dimensional data? (Hint: curse of dimensionality — local generalization buys you nothing in a space the size of natural images.)

Your note on the curse-of-dim argument:

…

2. Multiple explanatory factors / distributed representations

One-hot vs distributed: why does a distributed code give exponentially more expressive power for the same number of parameters?
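A minimal counting sketch of the one-hot vs distributed contrast, assuming idealized binary units. Real encoders are continuous, and Bengio's argument is about the number of distinguishable input regions rather than literal bit patterns; the helper names here are illustrative only.

```python
import itertools

def one_hot_codes(k):
    """All codes over k units with exactly one active unit."""
    return [tuple(int(i == j) for j in range(k)) for i in range(k)]

def distributed_codes(k):
    """All binary codes over k units (any subset of units may be active)."""
    return list(itertools.product((0, 1), repeat=k))

for k in (4, 8, 16):
    # Same number of units, very different number of distinguishable cases.
    print(k, len(one_hot_codes(k)), len(distributed_codes(k)))
```

With k units, a one-hot code separates k cases while a binary distributed code separates 2^k, even though the layer producing those k units has the same number of parameters either way.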

Definition:

…

Example (image):

…

Why it matters for SSL:

…

3. Hierarchical organization / depth

Composition of simple features → complex features. Why does this map onto deep nets specifically?

Definition:

…

Example (image):

…

Why it matters for SSL:

…

4. Semi-supervised learning (already noted above)

Restate in Bengio’s exact framing: when is P(X) informative about P(Y|X)? When is it not?

Your note:

…

5. Shared factors across tasks

The reason a single backbone can serve detection + segmentation + classification.

Definition:

…

Example you’ll see later in the course (link this prior to a method):

…

6. Manifold hypothesis

Natural data concentrates near a low-dimensional manifold embedded in a high-dim ambient space.
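A toy illustration of the statement above, using synthetic data rather than natural images: each sample is a 64-pixel signal generated from a single factor (the position of a bump). The bump shape, its width, and the 95% threshold are assumptions chosen for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pixels = 64
positions = rng.uniform(16.0, 48.0, size=500)   # the single underlying factor
grid = np.arange(n_pixels)

# Each sample is a Gaussian bump centred at its position:
# 64 numbers per sample, but only 1 degree of freedom.
X = np.exp(-0.5 * ((grid[None, :] - positions[:, None]) / 2.0) ** 2)

# PCA via SVD of the centred data matrix.
Xc = X - X.mean(axis=0)
singular_values = np.linalg.svd(Xc, compute_uv=False)
explained = singular_values ** 2 / np.sum(singular_values ** 2)
n_components_95 = int(np.searchsorted(np.cumsum(explained), 0.95) + 1)

print("intrinsic dimension: 1")
print("linear components needed for 95% variance:", n_components_95)
```

The data has one intrinsic dimension, yet a linear method needs several components to cover it. That gap is what "low-dimensional but curved" means in practice.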

Definition:

…

Example — sketch (or describe) the manifold of MNIST 3s under rotation:

…

Why it matters for SSL (connect to contrastive geometry — Phase 2):

…

7. Natural clustering

Class-conditional density tends to concentrate, with low-density regions between classes.

Definition:

…

Where you’ll see this used explicitly (hint: SwAV, DeepCluster, DINO centering):

…

8. Temporal & spatial coherence (slow features)

Consecutive video frames share most factors; nearby image patches share most factors.

Definition:

…

Connection to a method you’ll implement (CVRL / V-JEPA — Phase 4):

…

9. Sparsity

Most factors are irrelevant for any given observation — only a few are ‘on’.

Definition:

…

Tension with smoothness — when do these priors fight each other?

…

10. Simplicity of factor dependencies

In a good representation, the factors should depend on each other in simple ways — ideally linearly.

Definition:

…

Why this is the assumption behind linear probing as an SSL eval (forward-link to Phase 5):

…
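Linear probing, the evaluation this prior motivates, can be sketched as follows. The features are a synthetic stand-in for frozen encoder outputs, and the probe is fit by least squares for brevity; in practice a probe is usually a logistic or softmax layer trained with gradient descent. All names and sizes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for frozen encoder outputs: two classes that are
# linearly separable in a 16-d feature space.
n, d = 400, 16
labels = rng.integers(0, 2, size=n)
class_direction = rng.normal(size=d)
features = rng.normal(size=(n, d)) + 3.0 * np.outer(2 * labels - 1, class_direction)

train, test = slice(0, 300), slice(300, None)

def add_bias(z):
    """Append a constant column so the probe can learn an offset."""
    return np.hstack([z, np.ones((z.shape[0], 1))])

# The probe: a single linear map fit on +/-1 targets.
# The features themselves are never updated.
w, *_ = np.linalg.lstsq(add_bias(features[train]), 2.0 * labels[train] - 1.0, rcond=None)

pred = (add_bias(features[test]) @ w > 0).astype(int)
probe_accuracy = float(np.mean(pred == labels[test]))
print("linear probe accuracy:", probe_accuracy)
```

If the class information were present in the features but only recoverable non-linearly, this probe would score poorly even though the representation "contains" the label.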

Disentanglement — a closer look

Bengio’s strongest claim: a representation is disentangled if the latent units align with the true underlying factors (object identity, pose, lighting, …) such that changing one factor changes one (or few) units.

Q1. Write down a definition of disentanglement that doesn’t presuppose access to the true factors. (Why is this hard?)

…

Q2. What’s the difference between disentanglement and invariance? A representation that’s invariant to lighting is, in a sense, throwing lighting away. A disentangled one separates it. When do you want which?

…

Q3. Locatello et al. 2019 (Challenging Common Assumptions in Disentanglement) showed that unsupervised disentanglement is fundamentally impossible without inductive biases. Skim the abstract — what’s the impossibility result?

…

Invariance vs. equivariance — make this precise

Your notes above bundle these together; the distinction matters a lot for detection/segmentation (Module 2) and for video SSL (Phase 4).

Let T be a transformation on the input (e.g. translation) and g the encoder.

Invariant: g(T x) = g(x) — the rep doesn’t move when the input moves.
Equivariant: g(T x) = T' g(x) — the rep moves in a predictable way (some T') when the input moves.
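The two definitions can be checked numerically on a toy 1-D "image". This sketch assumes circular shifts and a circular filter so that the identities hold exactly; real CNNs with padding, stride, and finite borders satisfy them only approximately. The filter and function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=32)          # a 1-D "image"
kernel = rng.normal(size=5)      # a hypothetical filter

def shift(v, s):
    """T: circular translation by s positions."""
    return np.roll(v, s)

def conv_features(v):
    """Circular cross-correlation of v with the filter (one conv layer, no pooling)."""
    return np.array([np.dot(np.roll(v, -i)[:kernel.size], kernel) for i in range(v.size)])

def pooled_features(v):
    """Global average pooling on top of the conv features."""
    return conv_features(v).mean()

s = 7
# Equivariance: shifting the input shifts the feature map by the same amount (T' = T).
equivariant = np.allclose(conv_features(shift(x, s)), shift(conv_features(x), s))
# Invariance: after global pooling the shift no longer affects the output.
invariant = np.isclose(pooled_features(shift(x, s)), pooled_features(x))
print(equivariant, invariant)
```

The feature map before pooling is equivariant but not invariant; pooling converts one into the other by discarding position.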

Q1. For a classification head, do you want translation invariance or equivariance in the backbone? Why?

…

Q2. For a detection head (DETR — you’ll build this in Module 2), which do you want?

…

Q3. Contrastive SSL with random crops forces approximate invariance to crop position. Is that always desirable? When does it hurt?

…

LeCun — the Cake analogy and the EBM/SSL framing

LeCun’s argument runs on two tracks: a quantitative one (information bits per training signal) and a structural one (Energy-Based Models as a unifying lens over all of SSL).

The cake analogy

LeCun, NeurIPS 2016 keynote — paraphrased: “If intelligence is a cake, the bulk of the cake is unsupervised learning, the icing is supervised learning, and the cherry on top is reinforcement learning.” He later (~2019+) replaced unsupervised with self-supervised.

Q1. What was his rough bits-per-sample estimate for each of the three? (RL: a few bits for some samples; supervised: roughly 10 to 10,000 bits per sample; SSL: millions of bits per sample.) Write it out as you find it in a talk.

…
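A back-of-envelope check on why those orders of magnitude are plausible. This is not LeCun's own derivation: the 1000-class label and the 224x224 RGB image are ImageNet-style assumptions, and the raw pixel count is only an upper bound because pixels are highly redundant.

```python
import math

# Supervised: one label out of 1000 classes carries at most log2(1000) bits.
bits_per_label = math.log2(1000)

# Self-supervised: the target is (part of) the input itself.
# Raw size of a 224x224 RGB image at 8 bits per channel.
raw_bits_per_image = 224 * 224 * 3 * 8

print(f"label: {bits_per_label:.1f} bits")
print(f"image: {raw_bits_per_image:,} raw bits")
```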

Q2. Why does that bit-count argument lead him to conclude most learning must be SSL?

…

Q3. Why did he change unsupervised → self-supervised? What did the rename clarify?

…

Energy-Based Models as a unifying framework

An EBM defines a scalar F(x, y) (the energy) with the property:
- F(x, y) is low for compatible (x, y) pairs,
- F(x, y) is high for incompatible ones.

Inference: y* = argmin_y F(x, y).

Learning: shape the energy landscape so the data points sit in valleys and everything else sits on hills.
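A minimal sketch of energy-based inference, assuming a quadratic energy and a small finite set of candidate y values. Both are simplifications: in general the argmin runs over a continuous space, and F is a learned network. W and the candidates are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 5))          # hypothetical compatibility parameters

def energy(x, y):
    """F(x, y): squared distance between y and a linear prediction from x.
    Low energy means (x, y) are compatible."""
    return float(np.sum((y - W @ x) ** 2))

x = rng.normal(size=5)
candidates = [rng.normal(size=3) for _ in range(4)]
candidates.append(W @ x)             # the one perfectly compatible y

energies = [energy(x, y) for y in candidates]
y_star_index = int(np.argmin(energies))   # inference: y* = argmin_y F(x, y)
print(energies, y_star_index)
```

Nothing here is probabilistic: the energy only has to rank compatible pairs below incompatible ones, which is why the framing covers methods that never define a normalized density.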

Q1. Pick three SSL methods you’ve heard of (e.g. SimCLR, BYOL, MAE). For each, what plays the role of x, y, and F?

Method	`x`	`y`	`F(x, y)`
SimCLR	…	…	…
BYOL	…	…	…
MAE	…	…	…

Q2. LeCun splits EBM training into two families:
- Contrastive: push down on data, push up on negatives.
- Regularized / architectural: restrict the capacity of the energy function so it can’t be low everywhere.

Which family is each of {SimCLR, MoCo, BYOL, MAE, VICReg, DINO} in? You’ll be revisiting this table every week in Phase 2/3 — start it now.

…
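For the contrastive family, "push down on data, push up on negatives" can be sketched with a simplified InfoNCE-style loss. This is a one-directional version with in-batch negatives, not the exact loss of any specific paper; the temperature value and function name are assumptions.

```python
import numpy as np

def info_nce(z_a, z_b, temperature=0.1):
    """z_a[i] and z_b[i] embed two views of sample i; all other rows act as negatives."""
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature                     # cosine similarity of every pair
    logits = logits - logits.max(axis=1, keepdims=True)    # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))              # positives sit on the diagonal

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))
aligned = info_nce(z, z + 0.01 * rng.normal(size=z.shape))   # views agree
shuffled = info_nce(z, rng.permutation(z))                   # pairing destroyed
print(aligned, shuffled)
```

In energy terms, the negative similarity plays the role of F: the loss lowers it on the diagonal (matching pairs) and raises it off the diagonal.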

Why pixel-space generative models are bad SSL (the JEPA argument)

LeCun argues that predicting in pixel space (à la classical autoencoders, even MAE) forces the model to model lots of irrelevant high-frequency detail — the model spends capacity on what shade of green a leaf is, when only “there is a leaf” matters for downstream tasks.

JEPA (Joint-Embedding Predictive Architecture) instead predicts in representation space: predict the embedding of the masked region, not its pixels.

Q1. State the JEPA argument in 2–3 sentences in your own words.

…

Q2. What’s the obvious failure mode of “predict embeddings”? (Hint: collapse — what stops g(x) = 0 from being optimal?)

…

Q3. Forward-link: you’ll see three families of collapse-prevention mechanisms in Phase 2/3 — contrastive (negatives), architectural (predictor + stop-grad), and regularization (var/cov). List which method maps to which.

…
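The regularization route (var/cov) can be sketched with two batch statistics in the style of VICReg's variance and covariance terms. This is a NumPy approximation for intuition, not the reference implementation; the target standard deviation and epsilon are assumed defaults.

```python
import numpy as np

def variance_term(z, target_std=1.0, eps=1e-4):
    """Hinge on the per-dimension standard deviation:
    zero once every dimension has std >= target_std."""
    std = np.sqrt(z.var(axis=0) + eps)
    return float(np.mean(np.maximum(0.0, target_std - std)))

def covariance_term(z):
    """Sum of squared off-diagonal covariances, divided by the embedding dimension."""
    n, d = z.shape
    zc = z - z.mean(axis=0)
    cov = zc.T @ zc / (n - 1)
    off_diag = cov - np.diag(np.diag(cov))
    return float(np.sum(off_diag ** 2) / d)

rng = np.random.default_rng(0)
healthy = rng.normal(size=(256, 8))
collapsed = np.zeros((256, 8))
print(variance_term(healthy), variance_term(collapsed))
print(covariance_term(healthy), covariance_term(collapsed))
```

Note that the covariance term is zero for the fully collapsed batch too. Only the variance term penalizes g(x) = const, which is why the two are used together.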

Synthesis — connecting Bengio to LeCun

Bengio (2013) wrote before the SSL revolution. LeCun’s framing came after the field had already tried — and mostly failed — at the generative route to representation learning.

Q1. Which of Bengio’s 10 priors does LeCun’s EBM/JEPA framing operationalize? Which does it leave out?

…

Q2. Bengio emphasized disentanglement. Modern SSL (DINOv2, V-JEPA) doesn’t optimize for disentanglement directly — it optimizes for useful features. Is disentanglement still the right north star? Argue both sides.

…

Exercises

These are small, single-cell exercises you can run on your laptop — no GPU needed. They’re designed to make the priors concrete.

Exercise 1 — Manifold hypothesis on MNIST

Take 100 images of a single MNIST digit. Apply small random rotations (±15°). Plot the first 2 PCA components and the first 2 t-SNE components of the raw pixel vectors. Does the rotated digit live on something that looks like a 1-D manifold in pixel space?

Then do the same for the embeddings of a pretrained encoder (e.g. torchvision ResNet18 — penultimate layer). What changes?

Expected takeaway: in pixel space the manifold is curved and tangled; in feature space it (often) untangles.

# TODO
# 1. load 100 MNIST images of one digit
# 2. apply rotations in [-15, 15] degrees
# 3. PCA(2) + t-SNE(2) on raw pixels  -> plot
# 4. PCA(2) + t-SNE(2) on ResNet18 features -> plot
# 5. write one sentence in a markdown cell below: what did you observe?

Exercise 2 — Invariance vs equivariance, empirically

Take the same pretrained ResNet18. For a single image, compute:
- g(x): features of the original image,
- g(T x): features of a translated copy (shift by 16 pixels),
- g(R x): features of a rotated copy (90°).

Compute cosine similarity for each pair. Is the network more invariant to translation or rotation? Why? (Hint: think about the training data and the architecture.)

# TODO
# 1. one image (any image, even a single PIL load)
# 2. compute features for x, shift(x, 16), rotate(x, 90)
# 3. cosine sims
# 4. one-sentence observation below
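A possible helper for step 3, assuming the features have already been extracted and converted to NumPy arrays (for ResNet18's penultimate layer these would be 512-dimensional vectors). The function name and the epsilon guard are choices made for this sketch.

```python
import numpy as np

def cosine_similarity(a, b, eps=1e-12):
    """Cosine of the angle between two feature vectors (inputs are flattened first)."""
    a = np.asarray(a, dtype=float).ravel()
    b = np.asarray(b, dtype=float).ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

v = np.array([1.0, 2.0, 3.0])
print(cosine_similarity(v, 2 * v))         # same direction
print(cosine_similarity(v, -v))            # opposite direction
print(cosine_similarity([1, 0], [0, 1]))   # orthogonal
```

Cosine similarity ignores vector length, so it answers "did the direction of the feature change?" rather than "did the feature change at all?". Keep that in mind when interpreting the result.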

Exercise 3 — The collapse problem, by hand

No training. Just construct two encoders g_A and g_B mapping R^d -> R^k:

g_A(x) = 0 for all x (the trivial collapsed solution).
g_B(x) = W x with W a random projection.

Both are perfectly smooth: g_A is constant and g_B is linear. g_A is even perfectly invariant to noise, while g_B only moves in proportion to it. Yet only one is useful.
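The two encoders can be built and measured in a few lines. The sizes d and k, the noise scale, and the two diagnostics (drift under noise, spread across inputs) are choices made for this sketch, not part of the exercise statement.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 32, 8
W = rng.normal(size=(k, d)) / np.sqrt(d)    # random projection

def g_A(x):
    """Collapsed encoder: every input maps to the zero vector."""
    return np.zeros((x.shape[0], k))

def g_B(x):
    """Linear encoder: g_B(x) = W x, applied row-wise to a batch."""
    return x @ W.T

x = rng.normal(size=(1000, d))
noise = 0.01 * rng.normal(size=x.shape)

for name, g in (("g_A", g_A), ("g_B", g_B)):
    drift = np.linalg.norm(g(x + noise) - g(x), axis=1).mean()   # response to small noise
    spread = g(x).std(axis=0).mean()                              # variation across inputs
    print(name, drift, spread)
```

A small drift is what smoothness asks for; a non-zero spread is what makes the output informative about the input. Compare the two numbers for each encoder before writing your answer.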

Write 2–3 sentences: which of Bengio’s priors does g_A violate? Which of LeCun’s EBM constraints rules it out?

Your answer:

…

End-of-week recap

Write a ≤200-word summary you could send to a colleague who hasn’t read either paper. Cover:
1. What a representation is and why we want one separate from the predictor.
2. Bengio’s priors: pick the 3 you think matter most for image SSL.
3. LeCun’s argument for SSL over supervised.
4. The one open question you most want to follow up on.

…

Next week (Phase 1 / Week 2): information-theoretic view — mutual information, InfoNCE bound, why MI bounds are loose in practice. Tschannen et al., Oord et al., Poole et al.