Intro to Visual-Language Modeling

I couldn’t find a good, hands-on course on this topic. This set of notebooks and exercises are a walkthrough, starting from vanilla ViT, going through image-text contrastive learning, and ending with a research project to bring everything together.

  • Every module has the exercises I found useful to understand the models, along with training scripts on a small dataset that can easily run on a small GPU.

  • Every module has an exercise notebook that walks through steps and has place-holders for your own implementation of the code. My solutions are also included if we need to reference.

  • In the first module I walk through vanilla ViT from scratch, add some tricks like window-shifted attention, and work on self-supervised masked image modeling. Everything from scratch, and then we dig into other implementations from timm and transformers to see how the pros do it.

Companion repo: vision-transformers-and-ssl covers deeper transformer foundations and self-supervised learning — DETR family for detection/segmentation/tracking, and a full 6-phase module on image + video SSL (DINOv2, V-JEPA, evaluation harness, applied projects). Module 1 here mirrors Module 1 there.

Module 1 Advanced Vision Transformer Foundations

Status Week Task / Goal Category Resources Solutions
X Week 1 Re-implement basic ViT from scratch (no framework) Code ViT paper (Dosovitskiy), lucidrains’ vit-pytorch [https://arxiv.org/abs/2010.11929] 001_vit_from_scratch.ipynb
X Week 2 Compare ViT with Swin, CoAtNet, DeiT + deep dive on multi-head self-attention in spatial domain Theory/Compare Papers: Swin, DeiT, CoAtNet; timm repo; “Attention Is All You Need”, Annotated Transformer 002_compare_ViTs.ipynb, 002_CoAtNet.ipynb
X Week 3 Implement masked image modeling (MIM) pretraining Code MAE (He et al.), SimMIM 003_MAE.ipynb
X Week 4 Visualize attention maps, frozen feature extraction, linear probing. Compare DINOv2 vs CLIP attention. Analysis/Code DINOv2, DINO, CLIP papers

Module 2 Vision + Language Pretraining & Integration

Status Week Task / Goal Category Resources Solutions
X Week 1 Build a text encoder from scratch — tokenization, text transformer, sentence embeddings Code “Attention Is All You Need”, HuggingFace tokenizers, Annotated Transformer
X Week 2 Reproduce CLIP (image-text contrastive training) Code OpenCLIP repo, CLIP paper
X Week 3 Implement full SigLip from scratch. Fine-tune our SigLip. Implement PaliGemma from scratch following tutorial video Model Dev SigLip paper, Umar Jamil’s tutorial
X Week 4 Fine-tune pretrained Molmo2 model on niche domain Fine-tuning Molmo2 repo, visual instruction tuning
🔲 Week 5 Look into positional embeddings in vision-language transformers Theory “Sinusoidal Encoding Explained”, Flamingo paper
🔲 Week 6 Evaluate model on retrieval/captioning metrics + ablate cross-attention Eval/Research pycocoevalcap, Recall@K, Flamingo, BLIP2, LLaVA

Deep dive on VLM Benchmarks

Key benchmarks to know when evaluating vision-language models:

Benchmark What it measures Task type
MMMU College-level multi-discipline reasoning over images (charts, diagrams, photos) VQA / reasoning
MMBench Broad multi-modal understanding — perception, reasoning, OCR VQA
SEED-Bench Generative comprehension across 12 dimensions (spatial, temporal, etc.) VQA / multi-choice
TextVQA Reading and reasoning about text within images OCR + VQA
DocVQA Understanding documents — forms, tables, receipts Document VQA
ChartQA Answering questions about charts and plots Chart reasoning
VQAv2 Open-ended visual question answering on natural images VQA (classic)
GQA Compositional reasoning — multi-step spatial questions VQA / reasoning
COCO Captions Image captioning quality (CIDEr, BLEU, METEOR scores) Captioning
Flickr30k / COCO Retrieval Image-text retrieval (Recall@1, Recall@5, Recall@10) Retrieval
RealWorldQA Real-world spatial understanding and reasoning VQA
POPE Detecting object hallucination — does the model see things that aren’t there? Hallucination
  • Retrieval metrics (Recall@K): given a query, is the correct match in the top K results? Directly relevant to CLIP.
  • Captioning metrics (CIDEr, BLEU, METEOR): compare generated captions to human references. Relevant for BLIP/LLaVA.
  • VQA accuracy: exact-match or soft-match against ground truth answers.
  • Hallucination is a major open problem — models confidently describe objects that aren’t in the image. POPE specifically tests for this.

Module 3 Modern VLMs, Alignment & Agentic Systems

  • Now that we’ve built everything from scratch, this module shifts to: understand production VLM architectures → run inference → fine-tune → synthetic data → alignment → agentic systems

Phase 1 — Modern VLM Architectures & Inference

Status Week Task / Goal Category Resources
🔲 Week 1 Deep dive into modern VLM architectures — read LLaVA, InternVL2, Qwen2.5-VL, Flamingo. Understand how vision encoder + connector + LLM backbone fit together Theory Raschka’s “Understanding Multimodal LLMs”, LLaVA, InternVL2, Qwen2.5-VL, Flamingo papers
🔲 Week 2 Fine-Tune and Understand Qwen3.5 and Unsloth Code Qwen2.5-VL, HuggingFace Transformers
🔲 Week 3 Compare VLM architectures: Qwen-VL vs LLaVA vs Molmo — unified embedding vs cross-attention, connector design, tradeoffs Theory/Compare Qwen-VL, LLaVA, Molmo papers + repos

Phase 2 — Fine-Tuning & Image Tokenization

Status Week Task / Goal Category Resources
🔲 Week 4 Fine-tune a VLM (Qwen-VL or LLaVA-1.5) with LoRA/QLoRA on a custom domain Fine-tuning LLaMA-Factory, Swift, HuggingFace PEFT, bitsandbytes
🔲 Week 5 Explore image tokenization: VQ-VAE, visual tokens in modern VLMs Theory/Code VQ-VAE paper, Chameleon, Emu
🔲 Week 6 Ablations on fine-tuned model + build Streamlit dashboard Experiment/Viz Torch hooks, wandb, Streamlit, seaborn, t-SNE

Phase 3 — Data: Curation, Captioning & Synthetic Pipelines

Status Week Task / Goal Category Resources
🔲 Week 7 Create image-caption dataset (10k+ pairs) + vision prompt QA pairs. Study data curation at scale — LAION, DataComp Data/Theory LAION viewer, COCO, local annotations, GPT-based auto-captioning, DataComp papers, “Scaling Data-Constrained Language Models” (Muennighoff et al.)
🔲 Week 8 Learn data mixture & quality research — optimal ratios, data selection Theory DoReMi, “Data Selection for LLMs” papers, data mixing laws
🔲 Week 9 Build a synthetic VQA dataset using a VLM as the labeler, fine-tune on it, evaluate Code/Data LLM-as-oracle approach, VQA generation pipelines

Phase 4 — RL & Alignment for VLMs

Status Week Task / Goal Category Resources
🔲 Week 10 RLHF foundations — InstructGPT, practical PPO/DPO tooling Theory/Code InstructGPT paper, TRL library (HuggingFace)
🔲 Week 11 Modern alignment methods — DPO, GRPO (DeepSeek-R1), RLCS (GLM-4.1V-Thinking) Theory DPO paper, DeepSeek-R1 report, GLM-4.1V-Thinking
🔲 Week 12 VLM-specific RL — RLVR (reinforcement learning from verifiable rewards). Hands-on: run a DPO fine-tune with TRL Code TRL, RLVR papers, HuggingFace PEFT

Suggested reading

Reinforcement Learning (RL) Guide from Unsloth

  • RL is where and agent learns to make decisions by interacting with an environment and receiving feedback [rewards, penalties].

  • Action: What the model generates (an answer to a question)

  • Reward: A signal that indicates how good or bad the model’s answer is. Dit it follow instructions, does it handle safety?

  • Environment: The scenario or task that the model is working on. For example code generation, helpfullness, etc..

  • Things to pay attention to: RL, RLVR, PPO, GRPO, RLHF, RF, DPO.

Phase 5 — Agentic Systems

  • What are agentic systems?
  • A ‘normal LLM interaction’ is basically you ask a question and it gives back a text answer.
  • An agent is adding actions into a loop. So instead of answering right away the loop can:
    1. Reason or think about the steps
    2. Call a tool (code, calculator, use an API, read a file for context)
    3. observe a result and decide if it makes sense, or if it needs another tool call or more reasoning
    4. repeat until answer is satisfactory.
  • It boils down to think/act/observe/loop

How do agents and VLMS interact?

  • If you give an agent vision capabilities, it now can read markdown and also a pdf to understand a chart for example. It can see a screenshot and understand the layout and feed that into the generation. It can watch a video and understand.
  • So the VLM becomes the perception ‘brain’ that an agent can use in its loop.
Status Week Task / Goal Category Resources
🔲 Week 13 Study agent foundations — ReAct, Toolformer, AgentBench. Learn tool-use patterns (function calling, structured output) Theory ReAct, Toolformer, AgentBench papers, Anthropic API docs
🔲 Week 14 Build a small agent using smolagents or LangGraph with a VLM as perception module Code smolagents, LangGraph, Anthropic Claude agent SDK

Phase 6 — Large-Scale Pretraining & JAX

Status Week Task / Goal Category Resources
🔲 Week 15 JAX/Flax basics — functional paradigm, jit, vmap, pmap Code JAX docs, Flax tutorials
🔲 Week 16 Scaling & systems papers — Chinchilla, LLM.int8(), FlashAttention Theory Chinchilla, LLM.int8(), FlashAttention papers
🔲 Week 17 Distributed pretraining concepts — Megatron-LM, NeMo Theory/Code Megatron-LM, NVIDIA NeMo

Suggested reading/watching

Umar Jamil’s Coding a Multimodal Vision Language Model from Scratch

Understanding MultiModal LLMs

How AI Taught Itself to See

Attend Before Attention: Efficient and Scalable Video Understanding via Autoregressive Gazing

Open-source Multi-modal models

We can take a look at the fully open-sourced Molmo2-Models

Also, a competitive fully open source vision encoder in (Franca)[https://github.com/valeoai/Franca]