LLaVA

The GitHub repo is here

Suggested watching: this video by Trellis

Let's review some things:

To train:

* LLaVA: 2 stages.
    1. Freeze the vision encoder and the LLM, and train only the projection layer on image-caption pairs.
    2. Unfreeze the LLM and fine-tune it, together with the projection layer, on visual instruction data. The vision encoder stays frozen.
* PaliGemma: 3 stages.
    1. Unimodal pretraining of the vision and language components separately.
    2. Multimodal pretraining on image-text pairs, updating both the vision and language components.
    3. Task-specific fine-tuning.
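The LLaVA schedule can be sketched as a small table of which components are trainable in each stage. This is a toy, dependency-free illustration, not the LLaVA training code: `Component`, `ToyVLM`, `LLAVA_STAGES`, `configure_stage`, and `trainable_names` are hypothetical names made up for this note. In a real PyTorch setup, flipping `trainable` would roughly correspond to calling `requires_grad_(flag)` on the matching submodule.

```python
from dataclasses import dataclass


@dataclass
class Component:
    """Stand-in for a model submodule (hypothetical, for illustration only)."""
    name: str
    trainable: bool = True


class ToyVLM:
    """Toy vision-language model: vision encoder -> projector -> LLM."""

    def __init__(self):
        self.vision_encoder = Component("vision_encoder")
        self.projector = Component("projector")
        self.llm = Component("llm")

    def components(self):
        return [self.vision_encoder, self.projector, self.llm]


# Which components receive gradient updates in each LLaVA stage.
# Stage 1: only the projection layer (trained on image-caption pairs).
# Stage 2: projection layer + LLM (trained on visual instruction data).
# The vision encoder is frozen in both stages.
LLAVA_STAGES = {
    1: {"projector"},
    2: {"projector", "llm"},
}


def configure_stage(model, stage):
    """Freeze or unfreeze each component according to the stage table."""
    trainable = LLAVA_STAGES[stage]
    for component in model.components():
        component.trainable = component.name in trainable


def trainable_names(model):
    """Return the names of the components that would be updated."""
    return [c.name for c in model.components() if c.trainable]


model = ToyVLM()
for stage in (1, 2):
    configure_stage(model, stage)
    print(f"stage {stage}: trainable = {trainable_names(model)}")
```

Keeping the schedule in a single table makes it easy to compare recipes: a PaliGemma-style multimodal pretraining stage, as described above, would list the vision encoder as trainable too.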