Understanding MultiModal LLMs

article link here
Below are the highlights of my understanding of the article, which will hopefully help build a better intuition for these kinds of models.

Common approaches to building multimodal LLMs

The first approach is a unified embedding decoder: convert images into tokens that look like text tokens, so everything goes into a single LLM decoder.
The pipeline is: image -> vision encoder -> projection -> [visual tokens]. We then feed [visual tokens] and [text tokens] into the LLM decoder.
[visual tokens] and [text tokens] go through the same self-attention layers.
Examples: LLaVA, Qwen-VL, PaLM-E.
Fine-tuning is fairly straightforward since the architecture of the standard LLM remains unchanged.
The downside is that the LLM's self-attention now has to handle both language and cross-modal reasoning in the same layers. Visual tokens also use up context length, which can be a problem for high-resolution or multi-image inputs.
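To get a feel for how much context the visual tokens consume, here is a rough back-of-the-envelope sketch. It assumes a ViT-style encoder that emits one token per image patch; the image size (336), patch size (14), and context length (4096) are illustrative values I picked, not numbers from the article, and real models may add special tokens or pool the patches down to fewer tokens.

```python
def num_visual_tokens(image_size: int, patch_size: int) -> int:
    """Number of patch tokens a ViT-style encoder emits for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


def remaining_context(context_length: int, n_images: int,
                      image_size: int = 336, patch_size: int = 14) -> int:
    """Tokens left for text once the images are placed in the sequence."""
    used = n_images * num_visual_tokens(image_size, patch_size)
    return context_length - used


# Illustrative (assumed) settings: 336px images, 14px patches, 4096-token context
print(num_visual_tokens(336, 14))   # 576 tokens per image
print(remaining_context(4096, 4))   # 1792 tokens left after four images
```

With these assumed numbers, four images already take more than half of the context window.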
How does this tie into our work? The image encoder is usually a pre-trained vision transformer like CLIP or OpenCLIP.
Molmo models use an off-the-shelf vision transformer (CLIP) and a projector that aligns image features to the language model. They update all parameters in a single unified approach instead of using multi-stage training.
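The encoder-plus-projector setup can be sketched in a few lines of plain Python to make the shapes concrete. This is only a toy illustration under my own assumptions: the dimensions are made up, random matrices stand in for the real vision encoder output and text embeddings, and the projector is a single linear map (actual models may use a small MLP instead).

```python
import random

random.seed(0)


def matmul(a, b):
    """Multiply an (n x k) list-of-lists by a (k x m) list-of-lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def random_matrix(rows, cols):
    """Small random matrix used as a stand-in for learned weights or features."""
    return [[random.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]


# Hypothetical sizes, chosen only to keep the example small
VISION_DIM = 8   # width of the vision encoder output
LLM_DIM = 6      # width of the LLM token embeddings
N_PATCHES = 4    # image patches produced by the encoder
N_TEXT = 3       # text tokens in the prompt

# Stand-in for the vision encoder output: one feature vector per patch
patch_features = random_matrix(N_PATCHES, VISION_DIM)

# Projector: maps vision features into the LLM embedding space
projector = random_matrix(VISION_DIM, LLM_DIM)
visual_tokens = matmul(patch_features, projector)

# Stand-in for the embedded text tokens
text_tokens = random_matrix(N_TEXT, LLM_DIM)

# The decoder sees one sequence: visual tokens followed by text tokens
decoder_input = visual_tokens + text_tokens

print(len(decoder_input), len(decoder_input[0]))  # 7 6
```

The point is that after projection the visual tokens have the same width as text embeddings, so the decoder can treat them as ordinary sequence positions.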

The second approach is cross-modality attention: keep vision and language in separate pipelines, then add cross-attention layers in which the text features attend to the visual features.
The LLM needs extra layers to handle the cross-attention.
Examples: Flamingo, Emu, BLIP-2.
The Llama family of models uses a cross-attention-based approach. They also include video and speech as additional modalities.
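A minimal single-head sketch of the cross-attention idea, in plain Python: queries come from the text hidden states, while keys and values come from the image features. The function name, the toy inputs, and the identity weight matrices are my own assumptions for illustration; real models use learned weights, multiple heads, and typically a residual connection (sometimes gated) around this layer.

```python
import math


def matmul(a, b):
    """Multiply an (n x k) list-of-lists by a (k x m) list-of-lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    top = max(row)
    exps = [math.exp(x - top) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(text_states, image_features, w_q, w_k, w_v):
    """Single-head cross-attention: text queries attend over image keys/values."""
    q = matmul(text_states, w_q)        # (n_text, d)
    k = matmul(image_features, w_k)     # (n_img, d)
    v = matmul(image_features, w_v)     # (n_img, d)
    scale = math.sqrt(len(q[0]))
    scores = [[s / scale for s in row] for row in matmul(q, transpose(k))]
    weights = [softmax(row) for row in scores]   # (n_text, n_img)
    return matmul(weights, v), weights


# Toy inputs (assumed): 2 text positions, 3 image patches, width 2
text_states = [[1.0, 0.0], [0.0, 1.0]]
image_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]   # placeholder for learned projections

attended, weights = cross_attention(text_states, image_features,
                                    identity, identity, identity)

print(len(attended), len(attended[0]))   # 2 2
print(len(weights), len(weights[0]))     # 2 3
```

Note that the output has one row per text position: the image features are read through attention rather than appended to the sequence, so they do not take up positions in the text context.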