Intro to Visual-Language Modeling
I couldn’t find a good, hands-on course on this topic. This set of notebooks and exercises is a walkthrough, starting from vanilla ViT, going through image-text contrastive learning, and ending with a research project that brings everything together.
Every module has the exercises I found useful to understand the models, along with training scripts on a small dataset that can easily run on a small GPU.
Every module has an exercise notebook that walks through the steps and has placeholders for your own implementation of the code. My solutions are also included in case you need a reference.
In the first module I walk through vanilla ViT from scratch, add some tricks like window-shifted attention, and work on self-supervised masked image modeling. Everything is built from scratch, and then we dig into other implementations from timm and transformers to see how the pros do it.
Companion repo: vision-transformers-and-ssl covers deeper transformer foundations and self-supervised learning — DETR family for detection/segmentation/tracking, and a full 6-phase module on image + video SSL (DINOv2, V-JEPA, evaluation harness, applied projects). Module 1 here mirrors Module 1 there.
Module 1 Advanced Vision Transformer Foundations
| Status | Week | Task / Goal | Category | Resources | Solutions |
|---|---|---|---|---|---|
| X | Week 1 | Re-implement basic ViT from scratch (no framework) | Code | ViT paper (Dosovitskiy et al.) [https://arxiv.org/abs/2010.11929], lucidrains’ vit-pytorch | 001_vit_from_scratch.ipynb |
| X | Week 2 | Compare ViT with Swin, CoAtNet, DeiT + deep dive on multi-head self-attention in spatial domain | Theory/Compare | Papers: Swin, DeiT, CoAtNet; timm repo; “Attention Is All You Need”, Annotated Transformer | 002_compare_ViTs.ipynb, 002_CoAtNet.ipynb |
| X | Week 3 | Implement masked image modeling (MIM) pretraining | Code | MAE (He et al.), SimMIM | 003_MAE.ipynb |
| X | Week 4 | Visualize attention maps, frozen feature extraction, linear probing. Compare DINOv2 vs CLIP attention. | Analysis/Code | DINOv2, DINO, CLIP papers | |
Module 2 Vision + Language Pretraining & Integration
| Status | Week | Task / Goal | Category | Resources | Solutions |
|---|---|---|---|---|---|
| X | Week 1 | Build a text encoder from scratch — tokenization, text transformer, sentence embeddings | Code | “Attention Is All You Need”, HuggingFace tokenizers, Annotated Transformer | |
| X | Week 2 | Reproduce CLIP (image-text contrastive training) | Code | OpenCLIP repo, CLIP paper | |
| X | Week 3 | Implement full SigLIP from scratch. Fine-tune our SigLIP. Implement PaliGemma from scratch following tutorial video | Model Dev | SigLIP paper, Umar Jamil’s tutorial | |
| X | Week 4 | Fine-tune pretrained Molmo2 model on niche domain | Fine-tuning | Molmo2 repo, visual instruction tuning | |
| 🔲 | Week 5 | Look into positional embeddings in vision-language transformers | Theory | “Sinusoidal Encoding Explained”, Flamingo paper | |
| 🔲 | Week 6 | Evaluate model on retrieval/captioning metrics + ablate cross-attention | Eval/Research | pycocoevalcap, Recall@K, Flamingo, BLIP2, LLaVA | |
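To make the Week 2 and Week 3 objectives concrete, here is a minimal sketch of the two contrastive losses in plain Python. It assumes `logits` is an N x N matrix of already scaled image-text similarities where pair `i` matches pair `i`; the function names and toy numbers are illustrative and are not taken from the notebooks. A real implementation would use PyTorch tensors plus a learned temperature (and, for SigLIP, a learned bias).

```python
import math


def clip_loss(logits):
    """Symmetric cross-entropy over an N x N image-text logit matrix (CLIP-style)."""
    n = len(logits)

    def cross_entropy(rows):
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for a numerically stable log-sum-exp
            log_sum = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_sum - row[i]  # the target for row i is column i
        return total / n

    columns = [list(col) for col in zip(*logits)]  # transpose: text-to-image view
    return 0.5 * (cross_entropy(logits) + cross_entropy(columns))


def siglip_loss(logits):
    """Pairwise sigmoid loss (SigLIP-style): label +1 on the diagonal, -1 elsewhere."""
    n = len(logits)
    total = 0.0
    for i in range(n):
        for j in range(n):
            z = logits[i][j] if i == j else -logits[i][j]
            # -log(sigmoid(z)) written as a stable softplus(-z)
            total += max(-z, 0.0) + math.log1p(math.exp(-abs(z)))
    return total / n


aligned = [[5.0, 0.0], [0.0, 5.0]]     # matching pairs score highest
misaligned = [[0.0, 5.0], [5.0, 0.0]]  # mismatched pairs score highest
print(clip_loss(aligned), clip_loss(misaligned))
print(siglip_loss(aligned), siglip_loss(misaligned))
```

The main design difference is visible here: the CLIP-style loss normalizes each row and column with a softmax over the batch, while the sigmoid loss treats every image-text pair as an independent binary decision.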
Deep dive on VLM Benchmarks
Key benchmarks to know when evaluating vision-language models:
| Benchmark | What it measures | Task type |
|---|---|---|
| MMMU | College-level multi-discipline reasoning over images (charts, diagrams, photos) | VQA / reasoning |
| MMBench | Broad multi-modal understanding — perception, reasoning, OCR | VQA |
| SEED-Bench | Generative comprehension across 12 dimensions (spatial, temporal, etc.) | VQA / multi-choice |
| TextVQA | Reading and reasoning about text within images | OCR + VQA |
| DocVQA | Understanding documents — forms, tables, receipts | Document VQA |
| ChartQA | Answering questions about charts and plots | Chart reasoning |
| VQAv2 | Open-ended visual question answering on natural images | VQA (classic) |
| GQA | Compositional reasoning — multi-step spatial questions | VQA / reasoning |
| COCO Captions | Image captioning quality (CIDEr, BLEU, METEOR scores) | Captioning |
| Flickr30k / COCO Retrieval | Image-text retrieval (Recall@1, Recall@5, Recall@10) | Retrieval |
| RealWorldQA | Real-world spatial understanding and reasoning | VQA |
| POPE | Detecting object hallucination — does the model see things that aren’t there? | Hallucination |
- Retrieval metrics (Recall@K): given a query, is the correct match in the top K results? Directly relevant to CLIP.
- Captioning metrics (CIDEr, BLEU, METEOR): compare generated captions to human references. Relevant for BLIP/LLaVA.
- VQA accuracy: exact-match or soft-match against ground truth answers.
- Hallucination is a major open problem — models confidently describe objects that aren’t in the image. POPE specifically tests for this.
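As a rough illustration of the retrieval metric above, Recall@K can be computed from a query-by-candidate similarity matrix. This sketch assumes exactly one correct candidate per query, located at the same index as the query; the function name and the toy scores are made up for the example. Real benchmarks such as COCO have several captions per image, so their evaluation code has to handle multiple correct matches.

```python
def recall_at_k(similarity, k):
    """Fraction of queries whose correct match ranks in the top k.

    similarity[i][j] is the score between query i and candidate j.
    Assumption: the correct candidate for query i sits at index i.
    """
    hits = 0
    for i, scores in enumerate(similarity):
        # Candidate indices sorted from highest to lowest score.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy 3x3 text-to-image similarity matrix (made-up numbers).
sim = [
    [0.9, 0.1, 0.3],  # query 0: correct match ranked 1st
    [0.8, 0.2, 0.5],  # query 1: correct match ranked 3rd
    [0.2, 0.7, 0.6],  # query 2: correct match ranked 2nd
]

for k in (1, 2, 3):
    print(f"Recall@{k} = {recall_at_k(sim, k):.2f}")
```

With a CLIP-style model, `similarity` would come from the dot products of normalized text and image embeddings.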
Module 3 Modern VLMs, Alignment & Agentic Systems
- Now that we’ve built everything from scratch, this module shifts to: understand production VLM architectures → run inference → fine-tune → synthetic data → alignment → agentic systems
Phase 1 — Modern VLM Architectures & Inference
| Status | Week | Task / Goal | Category | Resources |
|---|---|---|---|---|
| 🔲 | Week 1 | Deep dive into modern VLM architectures — read LLaVA, InternVL2, Qwen2.5-VL, Flamingo. Understand how vision encoder + connector + LLM backbone fit together | Theory | Raschka’s “Understanding Multimodal LLMs”, LLaVA, InternVL2, Qwen2.5-VL, Flamingo papers |
| 🔲 | Week 2 | Fine-Tune and Understand Qwen3.5 and Unsloth | Code | Qwen2.5-VL, HuggingFace Transformers |
| 🔲 | Week 3 | Compare VLM architectures: Qwen-VL vs LLaVA vs Molmo — unified embedding vs cross-attention, connector design, tradeoffs | Theory/Compare | Qwen-VL, LLaVA, Molmo papers + repos |
Phase 2 — Fine-Tuning & Image Tokenization
| Status | Week | Task / Goal | Category | Resources |
|---|---|---|---|---|
| 🔲 | Week 4 | Fine-tune a VLM (Qwen-VL or LLaVA-1.5) with LoRA/QLoRA on a custom domain | Fine-tuning | LLaMA-Factory, Swift, HuggingFace PEFT, bitsandbytes |
| 🔲 | Week 5 | Explore image tokenization: VQ-VAE, visual tokens in modern VLMs | Theory/Code | VQ-VAE paper, Chameleon, Emu |
| 🔲 | Week 6 | Ablations on fine-tuned model + build Streamlit dashboard | Experiment/Viz | Torch hooks, wandb, Streamlit, seaborn, t-SNE |
Phase 3 — Data: Curation, Captioning & Synthetic Pipelines
| Status | Week | Task / Goal | Category | Resources |
|---|---|---|---|---|
| 🔲 | Week 7 | Create image-caption dataset (10k+ pairs) + vision prompt QA pairs. Study data curation at scale — LAION, DataComp | Data/Theory | LAION viewer, COCO, local annotations, GPT-based auto-captioning, DataComp papers, “Scaling Data-Constrained Language Models” (Muennighoff et al.) |
| 🔲 | Week 8 | Learn data mixture & quality research — optimal ratios, data selection | Theory | DoReMi, “Data Selection for LLMs” papers, data mixing laws |
| 🔲 | Week 9 | Build a synthetic VQA dataset using a VLM as the labeler, fine-tune on it, evaluate | Code/Data | LLM-as-oracle approach, VQA generation pipelines |
Phase 4 — RL & Alignment for VLMs
| Status | Week | Task / Goal | Category | Resources |
|---|---|---|---|---|
| 🔲 | Week 10 | RLHF foundations — InstructGPT, practical PPO/DPO tooling | Theory/Code | InstructGPT paper, TRL library (HuggingFace) |
| 🔲 | Week 11 | Modern alignment methods — DPO, GRPO (DeepSeek-R1), RLCS (GLM-4.1V-Thinking) | Theory | DPO paper, DeepSeek-R1 report, GLM-4.1V-Thinking |
| 🔲 | Week 12 | VLM-specific RL — RLVR (reinforcement learning from verifiable rewards). Hands-on: run a DPO fine-tune with TRL | Code | TRL, RLVR papers, HuggingFace PEFT |
Suggested reading
Reinforcement Learning (RL) Guide from Unsloth
RL is where an agent learns to make decisions by interacting with an environment and receiving feedback (rewards, penalties).
Action: What the model generates (an answer to a question).
Reward: A signal that indicates how good or bad the model’s answer is. Did it follow instructions? Does it handle safety?
Environment: The scenario or task that the model is working on, for example code generation or helpfulness.
Things to pay attention to: RL, RLVR, PPO, GRPO, RLHF, RF, DPO.
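To connect these terms, here is a small sketch of a verifiable reward (the RLVR idea) feeding a GRPO-style group-relative advantage. The exact-match reward, the counting task, and the use of the population standard deviation are assumptions made for illustration; actual libraries differ in these details.

```python
import statistics


def verifiable_reward(answer, ground_truth):
    """Return 1.0 for an exact match after simple normalization, else 0.0."""
    return 1.0 if answer.strip().lower() == ground_truth.strip().lower() else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of sampled answers (GRPO-style)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumption: population std
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled answers (actions) to one prompt in a hypothetical counting task.
samples = ["3", "4", " 3 ", "three"]
rewards = [verifiable_reward(s, "3") for s in samples]
advantages = group_relative_advantages(rewards)

for sample, reward, adv in zip(samples, rewards, advantages):
    print(f"{sample!r}: reward={reward}, advantage={adv:+.2f}")
```

Answers that beat the group average get a positive advantage and are reinforced; the rest are pushed down, without needing a separate value model.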
Phase 5 — Agentic Systems
- What are agentic systems?
- A ‘normal LLM interaction’ is basically you ask a question and it gives back a text answer.
- An agent adds actions in a loop. Instead of answering right away, the loop can:
  - Reason or think about the steps
  - Call a tool (run code, use a calculator, call an API, read a file for context)
  - Observe the result and decide whether it makes sense, or whether it needs another tool call or more reasoning
  - Repeat until the answer is satisfactory
- It boils down to think/act/observe/loop
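The think/act/observe loop can be sketched in a few lines. Everything here is a toy: `scripted_model` is a hypothetical stand-in for an LLM or VLM, and the tuple-based decision format is an assumption for this example, not the API of any agent framework.

```python
def calculator(expression):
    """Toy tool: evaluate a simple 'a op b' arithmetic expression."""
    left, op, right = expression.split()
    a, b = float(left), float(right)
    ops = {"+": a + b, "-": a - b, "*": a * b}
    return ops[op]


TOOLS = {"calculator": calculator}


def scripted_model(question, observations):
    """Hypothetical stand-in for an LLM/VLM policy.

    Returns ("call", tool_name, tool_input) or ("answer", text).
    """
    if not observations:
        # No result yet, so decide to call a tool.
        return ("call", "calculator", question)
    return ("answer", f"The result is {observations[-1]}")


def run_agent(question, model, tools, max_steps=5):
    observations = []
    for _ in range(max_steps):
        decision = model(question, observations)  # think
        if decision[0] == "answer":
            return decision[1]
        _, tool_name, tool_input = decision
        result = tools[tool_name](tool_input)     # act
        observations.append(result)               # observe
    return "Stopped: step budget exhausted"


print(run_agent("6 * 7", scripted_model, TOOLS))  # prints "The result is 42.0"
```

The `max_steps` budget matters in practice: without it, a model that keeps requesting tools would loop forever.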
How do agents and VLMs interact?
- If you give an agent vision capabilities, it can read not only markdown but also a PDF, for example to understand a chart. It can look at a screenshot, understand the layout, and feed that into its generation. It can also watch a video and understand its content.
- So the VLM becomes the perception ‘brain’ that an agent can use in its loop.
| Status | Week | Task / Goal | Category | Resources |
|---|---|---|---|---|
| 🔲 | Week 13 | Study agent foundations — ReAct, Toolformer, AgentBench. Learn tool-use patterns (function calling, structured output) | Theory | ReAct, Toolformer, AgentBench papers, Anthropic API docs |
| 🔲 | Week 14 | Build a small agent using smolagents or LangGraph with a VLM as perception module | Code | smolagents, LangGraph, Anthropic Claude agent SDK |
Phase 6 — Large-Scale Pretraining & JAX
| Status | Week | Task / Goal | Category | Resources |
|---|---|---|---|---|
| 🔲 | Week 15 | JAX/Flax basics — functional paradigm, jit, vmap, pmap | Code | JAX docs, Flax tutorials |
| 🔲 | Week 16 | Scaling & systems papers — Chinchilla, LLM.int8(), FlashAttention | Theory | Chinchilla, LLM.int8(), FlashAttention papers |
| 🔲 | Week 17 | Distributed pretraining concepts — Megatron-LM, NeMo | Theory/Code | Megatron-LM, NVIDIA NeMo |
Suggested reading/watching
Umar Jamil’s Coding a Multimodal Vision Language Model from Scratch
Understanding MultiModal LLMs
How AI Taught Itself to See
Attend Before Attention: Efficient and Scalable Video Understanding via Autoregressive Gazing
Open-source Multi-modal models
We can take a look at the fully open-sourced Molmo2-Models
Also, a competitive fully open-source vision encoder is [Franca](https://github.com/valeoai/Franca).