Understanding MultiModal LLMs
Below are the highlights of my understanding of the article, which can hopefully help build a better intuition for these kinds of models.
Common approaches to building multimodal LLMs
1. Unified embedding decoder architecture
- Convert images into tokens that look like text tokens. Everything goes into a single LLM decoder.
- So images -> vision encoder -> projection -> [visual tokens]. Then we feed [visual tokens] and [text tokens] into the LLM decoder.
- [visual tokens] and [text tokens] go into the same self-attention layers.
- Examples: LLaVA, Qwen-VL, PaLM-E.
- Fine-tuning is fairly straightforward since the standard LLM architecture remains unchanged.
- The LLM's self-attention now needs to handle both language and cross-modal reasoning in the same layers. Visual tokens also use up our context length, so this might not be the best fit in some cases, for example with many or high-resolution images.
- How does this tie into our work? The image encoder is usually a pre-trained vision transformer like CLIP or OpenCLIP.
- Molmo models use an off-the-shelf vision transformer (CLIP) and a projector that aligns image features to the language model. They update all parameters in a single unified approach instead of using multi-stage training.
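To make the pipeline above concrete, here is a minimal NumPy sketch of the unified embedding idea. It is not how any particular model is implemented: the sizes, the random "encoder output", and the single-matrix projector are all made-up placeholders, and the attention is a single head with no trained weights. It only shows the shapes: patch features get projected to the LLM embedding size, concatenated with the text embeddings, and pushed through one shared self-attention layer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
num_patches = 16       # patch features coming out of the vision encoder
vision_dim = 32        # feature size of the vision encoder
llm_dim = 64           # embedding size of the LLM
num_text_tokens = 8
context_length = 32    # pretend context window of the LLM

# Stand-in for the vision encoder output (a real system would run a ViT here).
patch_features = rng.normal(size=(num_patches, vision_dim))

# Projector: here just one linear map from vision space to LLM embedding space.
w_proj = rng.normal(size=(vision_dim, llm_dim)) * 0.02
visual_tokens = patch_features @ w_proj            # (16, 64)

# Stand-in for the embedded text tokens.
text_tokens = rng.normal(size=(num_text_tokens, llm_dim))

# Both modalities become one sequence for the decoder.
decoder_input = np.concatenate([visual_tokens, text_tokens], axis=0)  # (24, 64)


def softmax(scores):
    scores = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(scores)
    return exp / exp.sum(axis=-1, keepdims=True)


def causal_self_attention(x, w_q, w_k, w_v):
    """Single-head causal self-attention over the whole mixed sequence."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Mask out future positions, as in a decoder-only LLM.
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    return softmax(scores) @ v


w_q, w_k, w_v = (rng.normal(size=(llm_dim, llm_dim)) * 0.02 for _ in range(3))
hidden = causal_self_attention(decoder_input, w_q, w_k, w_v)          # (24, 64)

# Visual tokens count against the context window just like text tokens.
remaining_context = context_length - decoder_input.shape[0]
```

With these toy numbers the image alone takes 16 of the 32 available positions, which is the context-length cost mentioned above.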
2. Cross-modality attention architecture
- The main idea is to keep vision and language in separate pipelines. Then we add cross-attention layers in which the text features attend to the visual features.
- The LLM needs extra layers to handle the cross-attention.
- Examples: Flamingo, Emu, BLIP-2.
- The multimodal models in the Llama family use a cross-attention-based approach. They also include video and speech as additional modalities.
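As a rough illustration of the cross-attention idea, here is a minimal NumPy sketch. Again, the sizes and random inputs are placeholders I made up, and this is a single-head layer with untrained weights rather than the actual layer used by any of the models above. The point is that queries come from the text hidden states while keys and values come from the image features, so the text sequence keeps its length and the image never enters the LLM's input sequence.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
num_patches = 16
vision_dim = 32
llm_dim = 64
num_text_tokens = 8

# Stand-ins for the two separate pipelines.
image_features = rng.normal(size=(num_patches, vision_dim))   # from the vision encoder
text_hidden = rng.normal(size=(num_text_tokens, llm_dim))     # from an LLM layer


def softmax(scores):
    scores = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(scores)
    return exp / exp.sum(axis=-1, keepdims=True)


def cross_attention(text_hidden, image_features, w_q, w_k, w_v):
    """Single-head cross-attention: text queries attend to image keys/values."""
    q = text_hidden @ w_q          # (num_text_tokens, llm_dim)
    k = image_features @ w_k       # (num_patches, llm_dim)
    v = image_features @ w_v       # (num_patches, llm_dim)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores)      # one distribution over patches per text token
    return weights @ v, weights


w_q = rng.normal(size=(llm_dim, llm_dim)) * 0.02
w_k = rng.normal(size=(vision_dim, llm_dim)) * 0.02
w_v = rng.normal(size=(vision_dim, llm_dim)) * 0.02

attended, weights = cross_attention(text_hidden, image_features, w_q, w_k, w_v)

# Residual connection: the visual information is added onto the text stream.
text_out = text_hidden + attended   # still (8, 64), sequence length unchanged
```

Compared with the unified embedding sketch, the sequence the LLM processes stays at 8 positions here, and the price is the extra projection weights for the new layer.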