Understanding Multimodal LLMs

Common approaches to building multimodal LLMs

1. Unified embedding decoder architecture

  • Convert images into tokens that look like text tokens. Everything goes into a single LLM decoder.
  • Pipeline: image -> vision encoder -> projection -> [visual tokens]. The [visual tokens] and [text tokens] are then concatenated and fed into the LLM decoder.
  • [visual tokens] and [text tokens] go into the same self-attention layers.
  • Examples: LLaVA, Qwen-VL, PaLM-E.
  • Fine-tuning is fairly straightforward since the LLM architecture itself remains unchanged.
  • The LLM's self-attention now has to handle both language modeling and cross-modal reasoning in the same layers. Visual tokens also consume context length, which can be a drawback in some cases (e.g. high-resolution or multiple images).
  • How does this tie into our work? The image encoder is usually a pre-trained vision transformer like CLIP or OpenCLIP.
  • Molmo models use an off-the-shelf vision transformer (CLIP) and a projector that aligns image features to the language model. They update all parameters in a single unified training approach instead of using multi-stage training.
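
A minimal sketch of how the decoder input is assembled in this approach, to make the pipeline and the context-length cost concrete. Everything here is an assumption for illustration: the vision encoder is a random stand-in (not CLIP), and the names num_patches, fake_vision_encoder, project and build_decoder_input, as well as the sizes (224x224 image, patch size 14, feature widths 8 and 16), are hypothetical.

```python
import random

random.seed(0)

D_VISION = 8   # hypothetical vision-encoder feature width
D_MODEL = 16   # hypothetical LLM embedding width
PATCH = 14     # hypothetical patch size in pixels


def num_patches(height, width, patch=PATCH):
    """Number of non-overlapping patches a ViT-style encoder would produce."""
    return (height // patch) * (width // patch)


def fake_vision_encoder(n_patches, d_vision=D_VISION):
    """Stand-in for a pre-trained encoder: one random feature vector per patch."""
    return [[random.gauss(0, 1) for _ in range(d_vision)] for _ in range(n_patches)]


def project(features, weight):
    """Linear projector: maps each d_vision vector to a d_model vector.

    `weight` holds d_model rows, each of length d_vision.
    """
    return [[sum(x * w for x, w in zip(vec, row)) for row in weight] for vec in features]


def build_decoder_input(image_hw, token_ids, weight, embedding_table):
    """image -> vision encoder -> projection -> [visual tokens] + [text tokens]."""
    patch_features = fake_vision_encoder(num_patches(*image_hw))
    visual_tokens = project(patch_features, weight)
    text_tokens = [embedding_table[t] for t in token_ids]
    # Both kinds of token now have width d_model and share one sequence,
    # so the decoder's self-attention sees them side by side.
    return visual_tokens + text_tokens


weight = [[random.gauss(0, 0.1) for _ in range(D_VISION)] for _ in range(D_MODEL)]
embedding_table = [[random.gauss(0, 0.1) for _ in range(D_MODEL)] for _ in range(100)]

sequence = build_decoder_input((224, 224), [5, 17, 42], weight, embedding_table)
print(len(sequence), len(sequence[0]))  # 259 16
```

With these illustrative sizes, a single image contributes 256 of the 259 positions in the sequence, which is the context-length cost noted above.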

2. Cross-modality attention architecture

  • Main idea is to keep vision and language in separate pipelines, then add cross-attention layers in which the text features attend to the visual features.
  • The LLM needs extra layers to handle the cross-attention.
  • Examples: Flamingo, Emu, BLIP-2.
  • The multimodal Llama 3 models use a cross-attention based approach. They also cover video and speech as additional modalities.
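
A rough sketch of one such cross-attention layer, assuming a single head, no gating or normalization, and random weights. The helper names (matmul, softmax, rand_matrix, cross_attention) and all sizes are hypothetical and chosen only for illustration; real models such as Flamingo add further components around this core.

```python
import math
import random

random.seed(0)


def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def rand_matrix(rows, cols, scale=0.1):
    return [[random.gauss(0, scale) for _ in range(cols)] for _ in range(rows)]


def cross_attention(text_states, visual_feats, w_q, w_k, w_v):
    """Queries come from the text stream; keys and values come from the image."""
    q = matmul(text_states, w_q)    # (n_text, d_text)
    k = matmul(visual_feats, w_k)   # (n_vis, d_text)
    v = matmul(visual_feats, w_v)   # (n_vis, d_text)
    k_t = [list(col) for col in zip(*k)]          # transpose of k
    scale = math.sqrt(len(q[0]))
    scores = matmul(q, k_t)                       # (n_text, n_vis)
    weights = [softmax([s / scale for s in row]) for row in scores]
    attended = matmul(weights, v)                 # (n_text, d_text)
    # Residual connection: the text stream keeps its own length and width.
    fused = [[t + a for t, a in zip(t_row, a_row)]
             for t_row, a_row in zip(text_states, attended)]
    return fused, weights


n_text, d_text = 4, 8   # hypothetical text length and hidden width
n_vis, d_vis = 6, 5     # hypothetical number and width of visual features

text_states = rand_matrix(n_text, d_text, scale=1.0)
visual_feats = rand_matrix(n_vis, d_vis, scale=1.0)
w_q = rand_matrix(d_text, d_text)
w_k = rand_matrix(d_vis, d_text)
w_v = rand_matrix(d_vis, d_text)

fused, weights = cross_attention(text_states, visual_feats, w_q, w_k, w_v)
print(len(fused), len(fused[0]))  # 4 8
```

The output has the same shape as the text input, so in this sketch the visual features never occupy positions in the text sequence, in contrast to the unified embedding approach.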