How AI Taught Itself to See

How do we design a good feature representation from images?

We can learn features by training a neural network to solve a specific task. In other words, we build a linear classifier on top of a neural network and train it to minimize the loss on a specific classification task. We can then take the learned weights and adapt them to a new task. This is called transfer learning. The learned weights give us a starting point, but because they were trained to separate a fixed set of classes, they often fail to capture the full semantic content of an image.
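
As a rough illustration of transfer learning, the sketch below freezes a feature extractor and trains only a new linear classifier on top of it. The "backbone" here is a fixed random projection standing in for a pretrained network, and the names `extract_features` and `train_linear_probe`, the toy data, and the hyperparameters are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pretrained backbone (assumption): a fixed random projection
# followed by ReLU. A real backbone would be a network trained on another task.
W_backbone = rng.normal(size=(32, 16)) / np.sqrt(32)

def extract_features(x):
    """Frozen feature extractor; its weights are never updated."""
    return np.maximum(x @ W_backbone, 0.0)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train_linear_probe(feats, labels, num_classes, lr=0.1, steps=200):
    """Train only a linear classifier on frozen features with gradient descent."""
    n, d = feats.shape
    W = np.zeros((d, num_classes))
    b = np.zeros(num_classes)
    onehot = np.eye(num_classes)[labels]
    losses = []
    for _ in range(steps):
        probs = softmax(feats @ W + b)
        # Mean cross-entropy of the correct class.
        losses.append(-np.mean(np.log(probs[np.arange(n), labels] + 1e-12)))
        grad_logits = (probs - onehot) / n  # gradient of the loss w.r.t. the logits
        W -= lr * feats.T @ grad_logits
        b -= lr * grad_logits.sum(axis=0)
    return W, b, losses

# Toy "new task": 64 random inputs with 3 classes.
x = rng.normal(size=(64, 32))
y = rng.integers(0, 3, size=64)
feats = extract_features(x)
W, b, losses = train_linear_probe(feats, y, num_classes=3)
```

In practice the backbone can also be fine-tuned together with the new classifier; freezing it, as here, is the simplest variant.
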
Natural language can give us more information about an image without constraining the representation to a fixed set of classes. A caption can describe the objects, the actions, and the background. For that we need a text encoder that turns natural language into a feature vector, and we want the feature vectors for an image and its description to be similar. We can achieve this with contrastive learning on pairs of images and description sentences. This is known as Contrastive Language-Image Pretraining (CLIP).
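
The contrastive objective on image-text pairs can be sketched as a symmetric cross-entropy over a similarity matrix, where the matching pair sits on the diagonal. This is a minimal NumPy sketch, not the reference implementation: the function name `clip_style_loss`, the fixed temperature of 0.07, and the random embeddings standing in for encoder outputs are assumptions.

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def log_softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

def clip_style_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss; row i of each input is a matching pair."""
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    logits = img @ txt.T / temperature  # (N, N) scaled cosine similarities
    idx = np.arange(len(logits))
    loss_i2t = -log_softmax(logits)[idx, idx].mean()    # image -> correct text
    loss_t2i = -log_softmax(logits.T)[idx, idx].mean()  # text -> correct image
    return 0.5 * (loss_i2t + loss_t2i)

# Random vectors stand in for encoder outputs (assumption).
rng = np.random.default_rng(0)
image_emb = rng.normal(size=(8, 16))
aligned = clip_style_loss(image_emb, image_emb)                       # pairs match
shuffled = clip_style_loss(image_emb, np.roll(image_emb, 1, axis=0))  # pairs mismatched
```

The loss is low when each image embedding is most similar to its own description and high when the pairing is scrambled, which is the signal that pulls matching pairs together during training.
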
This approach still requires sentence descriptions for images, which can be expensive to collect. Is there a way to use only the images? Enter self-supervised learning. We need a supervision signal for training without labels. Historically, researchers have used pretext tasks such as colorization, rotation-angle prediction, and masked-pixel prediction (inpainting) as the supervision signal.
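
To make the idea of a label-free supervision signal concrete, here is a small sketch of the rotation pretext task: each image is rotated by a random multiple of 90 degrees, and the rotation index becomes the classification target. The helper name `make_rotation_batch` and the toy square images are assumptions for illustration.

```python
import numpy as np

def make_rotation_batch(images, rng):
    """images: (N, H, W) array of square images.

    Returns the rotated images and labels k in {0, 1, 2, 3}, where each image
    was rotated by k * 90 degrees. The labels come for free from the data.
    """
    labels = rng.integers(0, 4, size=len(images))
    rotated = np.stack([np.rot90(img, int(k)) for img, k in zip(images, labels)])
    return rotated, labels

rng = np.random.default_rng(0)
images = rng.normal(size=(5, 4, 4))  # toy stand-ins for real images
rotated, labels = make_rotation_batch(images, rng)
```

A network trained to predict `labels` from `rotated` has to pick up on object shape and orientation, which is where the useful features come from.
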
- Contrastive methods such as SimCLR take augmented views of input images and train the encoder so that embeddings of views from the same source image (positive pairs) have high similarity, while embeddings of different images are pushed far apart. DINO keeps the augmented views but replaces the contrastive objective with a student-teacher setup: it takes a source image, makes two augmented views, and passes them through a student image encoder and a teacher image encoder. Features from both encoders are fed into a projection head to get logits over a large output dimension, and a softmax turns the logits into probability distributions. The student is trained to match the teacher by minimizing the cross-entropy between the two distributions. No gradient flows into the teacher; only the student is updated by backpropagation. This is a form of knowledge distillation (self-distillation, since no labels or pretrained teacher are involved). The teacher is updated only gradually, as an exponential moving average of the student's weights. To avoid collapsed representations, the authors center the teacher outputs, which keeps any single output dimension from dominating and spreads the softmaxed predictions more evenly across the output dimension. They also sharpen the teacher with a low softmax temperature so the predictions do not become uniform.
- DINOv2 adds Sinkhorn-Knopp centering, which replaces the simpler teacher centering, and trains on more data. It also adds a patch-level loss computed on masked patches.
- DINOv3 adds dense video features and Gram anchoring. Gram anchoring helps keep dense features sharper, less noisy, and more semantically coherent.
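
The original DINO student-teacher update from the first bullet can be sketched in a few functions: a cross-entropy loss against a centered teacher, an exponential moving average for the teacher weights, and a running center. This is a simplified NumPy sketch under stated assumptions: it handles a single pair of views, the temperatures and momentum values are illustrative defaults, and the function names are made up for the example. In a real training loop the teacher branch would be excluded from backpropagation.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def log_softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

def dino_loss(student_logits, teacher_logits, center,
              student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between the teacher and student distributions.

    The teacher logits are centered (subtract a running mean) and divided by a
    low temperature; the result is treated as a fixed target.
    """
    teacher_probs = softmax((teacher_logits - center) / teacher_temp)
    student_log_probs = log_softmax(student_logits / student_temp)
    return -(teacher_probs * student_log_probs).sum(axis=1).mean()

def ema_update(teacher_params, student_params, momentum=0.996):
    """Move the teacher weights slowly toward the student weights."""
    return momentum * teacher_params + (1.0 - momentum) * student_params

def update_center(center, teacher_logits, momentum=0.9):
    """Running mean of teacher logits, used for centering."""
    return momentum * center + (1.0 - momentum) * teacher_logits.mean(axis=0)

# Toy usage with random logits standing in for projection-head outputs.
rng = np.random.default_rng(0)
teacher_logits = rng.normal(size=(4, 10))
center = np.zeros(10)
# Student logits chosen so the student distribution equals the teacher's.
matched_student = teacher_logits * (0.1 / 0.04)
loss_match = dino_loss(matched_student, teacher_logits, center)
loss_mismatch = dino_loss(-matched_student, teacher_logits, center)
center = update_center(center, teacher_logits)
```

The loss is smallest when the student reproduces the teacher's distribution, so minimizing it with respect to the student alone pulls the student toward the slowly moving teacher.
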