LLaVa

The main idea for this week is to turn what we have learned into an actual model we can use for a niche domain or out of distribution domain. Models like LLaVA are used as bases for more interesting VQA like this project that uses it as a base for action-recognition in videos
Understanding the architecture
Cloning the repo and understand how vision encoder connect to langauge model via prokection layer
Collecting 500-1000 images with question-answer pairs
Fine tune a LLaVA-1.5TB with LoRA/QLoRA
first, only fine tune the projection later
then try LoRA on language model
evaluate and document

github repo is here

Suggested watching: this video by Trellis

Lets review some things:

LLaVA: Uses a CLIP Vit -> linear projection -> Vicuna/LLaMA. Projection is a linear layer of small MLPs.
PaliGemma: USes a SigLip Vit -> linear projection -> Gemma.

To train: * LLaVA: 2 stage. * 1. Freeze vision encoder + LLM and train only projection layer in image-caption pairs. * 2. Unfreeze the LLM and fine-tune on visual instruction data * PaliGemma: 3 stage. * 1. unimodal pretraining of both encoders separately * 2. multi-modal pretraining on image-text pairs on both vision and language components * 3. task-specific fine-tuning