1 year ago in Multimodal AI Systems By Meghna R

How are terms like MLLM, VLM, and LMM used inconsistently in multimodal AI?

What are the consequences of the inconsistent use and overlapping definitions for models labeled as MLLMs, VLMs, and LMMs in the multimodal AI field?

All Answers (1 Answer in All)

By Morgan Wurst Answered 7 months ago

These terms are used inconsistently across the field. Vision-Language Model (VLM) typically refers to any model that processes both vision and language, often for tasks such as visual question answering (VQA). Multimodal Large Language Model (MLLM) usually emphasizes a large language model as the core reasoning engine, augmented with multimodal encoders. Large Multimodal Model (LMM) is a broader term that is often used interchangeably with MLLM. "Multivision" is not a standard term. Usage also varies by community: NLP papers tend to favor MLLM, while computer vision papers tend to use VLM. Because the labels overlap, it is crucial to define them explicitly in any work.
