True Multimodal In-Context Learning Needs Attention to the Visual Context

DARA Method Animation

Multimodal Large Language Models (MLLMs), built on powerful language backbones, have enabled Multimodal In-Context Learning (MICL)—adapting to new tasks from a few multimodal demonstrations consisting of images, questions, and answers. Despite showing noticeable improvement on standard vision-language datasets, current MLLMs struggle to leverage visual information in the demonstrations. Specifically, they tend to neglect visual cues and over-rely on textual patterns, leading to mere text imitation rather than genuine multimodal adaptation. This behavior makes MICL still unimodal and largely restricts its practical utility.

MICL and Visual Context Neglect

Multimodal In-Context Learning (MICL) enables models to adapt to new tasks from a few multimodal demonstrations. However, current MLLMs struggle to effectively utilize the visual information in demonstrations, tending to overlook visual cues and over-rely on textual patterns. This results in textual pattern imitation rather than genuine multimodal adaptation.

MICL Problem Illustration
Examples of using MICL to solve image captioning from MSCOCO (top) and Clock Math from our proposed dataset, TrueMICL (bottom). Image captioning relies mainly on task recognition and can be solved by following the text patterns in the demos, without any deep understanding of the demo images. In contrast, our task requires task learning: the model must learn the relationship between the text and images in the demos.

This limitation is often concealed by improved performance on tasks that do not require deep visual context understanding. For example, models can generate reasonable captions for query images even without referencing demo images, as they rely on textual pattern following rather than multimodal understanding.
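
For concreteness, a MICL episode interleaves demonstration images, questions, and answers before the unanswered query. The sketch below assembles such an episode in a generic chat-message format; the exact fields depend on the target MLLM's API, so the structure and field names here are illustrative, not a specific model's interface.

```python
def build_micl_prompt(demos, query_image, query_question):
    """Assemble an interleaved multimodal in-context prompt.

    demos: list of (image, question, answer) tuples; the query has no answer.
    Returns a chat-style message whose content interleaves images and text.
    """
    content = []
    for image, question, answer in demos:
        content += [
            {"type": "image", "image": image},
            {"type": "text", "text": f"Question: {question}\nAnswer: {answer}"},
        ]
    # The query image and question come last; the model must produce the answer.
    content += [
        {"type": "image", "image": query_image},
        {"type": "text", "text": f"Question: {query_question}\nAnswer:"},
    ]
    return [{"role": "user", "content": content}]
```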

DARA: Dynamic Attention Reallocation

To address the visual context neglect in MICL, we introduce Dynamic Attention Reallocation (DARA), an efficient fine-tuning strategy that encourages models to attend to visual context by rebalancing attention across visual and textual tokens. DARA introduces a set of learnable attention-balancing parameters that dynamically regulate the influence of visual and textual tokens during attention computation.
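
The mechanism can be pictured with a small sketch. The snippet below is our illustrative reading of DARA, not the released implementation: one learnable factor per (attention head, image) rescales the attention that each query token pays to that image's visual tokens, after which the attention distribution is renormalized. The tensor shapes, the post-softmax placement, and the renormalization step are assumptions.

```python
import torch
import torch.nn as nn

class DARASketch(nn.Module):
    """Minimal sketch of DARA-style attention reallocation (illustrative only)."""

    def __init__(self, num_heads: int, num_images: int):
        super().__init__()
        # One factor per (head, image), initialized to 1.0 so training starts
        # from the unmodified attention pattern.
        self.factors = nn.Parameter(torch.ones(num_heads, num_images))

    def forward(self, attn_weights: torch.Tensor, image_token_masks: torch.Tensor) -> torch.Tensor:
        """
        attn_weights:      (batch, heads, query_len, key_len) post-softmax attention.
        image_token_masks: (num_images, key_len) bool; True where a key token
                           belongs to that image (masks assumed disjoint).
        """
        masks = image_token_masks.to(self.factors.dtype)        # (images, key_len)
        # Per-head scale over key positions: learned factor on image tokens, 1.0 on text tokens.
        scale = 1.0 + (self.factors - 1.0) @ masks              # (heads, key_len)
        reweighted = attn_weights * scale[None, :, None, :]
        # Renormalize so each query's attention distribution still sums to 1.
        return reweighted / reweighted.sum(dim=-1, keepdim=True).clamp_min(1e-9)
```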

Lightweight and Efficient

DARA is remarkably lightweight, introducing only a small number of learnable parameters for rapid adaptation. We insert DARA into the first transformer layer of the language backbone, specifically targeting the attention score matrices in all attention heads. Given 5 images (4-shot demos plus the query) and 32 attention heads, the total number of learnable parameters is only 5×32 = 160. With this handful of parameters, DARA achieves up to a 10% improvement on downstream tasks.
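
Continuing the sketch above, the factor table for this 4-shot setting has exactly one entry per head and per image:

```python
# 5 images (4 demos + 1 query) x 32 attention heads = 160 learnable factors.
dara = DARASketch(num_heads=32, num_images=5)
print(sum(p.numel() for p in dara.parameters()))  # 160
```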

Visualization of DARA's Reallocation Effects

To better understand how DARA affects attention, we present both qualitative and quantitative visualizations showing DARA's impact on attention patterns.

Spatial Attention Analysis

The spatial attention heatmap shows that without DARA, both demonstration and query images receive minimal attention, as indicated by predominantly blue regions. After applying DARA, attention over image tokens increases markedly (more red/yellow areas), indicating enhanced attention to visual input and improved visual grounding.

Attention Heatmaps
Attention heatmaps over the input images without (top) and with (bottom) DARA. DARA qualitatively enhances visual attention, showing increased focus on image regions.

Quantitative Attention Distribution

We quantitatively compare attention allocation across different modality tokens. Without DARA, the model allocates only 28% of attention to image tokens, focusing primarily on text. With DARA, this increases to 46.7%, demonstrating a substantial shift toward visual content during response generation.

Attention Distribution Chart
Normalized attention ratios over image and text tokens without and with applying DARA. DARA significantly increases attention allocation to visual content from 28% to 46.7%.
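
The reported ratios can be computed from the model's attention maps during response generation. Below is a rough sketch, assuming we already have post-softmax attention weights for the generated positions and a boolean mask over image-token positions; the variable names are illustrative.

```python
import torch

def visual_attention_ratio(attn_weights: torch.Tensor, image_token_mask: torch.Tensor) -> float:
    """Fraction of attention mass placed on image tokens.

    attn_weights:     (heads, query_len, key_len) post-softmax attention
                      for the positions where the response is generated.
    image_token_mask: (key_len,) bool, True at visual-token positions.
    """
    mass_on_images = attn_weights[..., image_token_mask].sum()
    return (mass_on_images / attn_weights.sum()).item()
```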

Head-wise Attention Amplification

The learned attention amplification factors across attention heads and images reveal structured visual emphasis. DARA induces a clear redistribution: demo images consistently receive factors larger than 1, encouraging stronger reliance on the context, and different heads specialize in different images. For example, Head 1 emphasizes Demo 2 (factor 1.27), while Head 5 emphasizes Demo 4 (1.32).

Head-wise Analysis
Learned attention amplification factors across 8 heads and 5 images in the first transformer layer of Qwen2-VL. Values larger than 1 indicate attention amplification on visual tokens, showing DARA's selective, context-aware visual attention during MICL.
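
Assuming the DARASketch module above after fine-tuning, reading off which demo image each head amplifies most is straightforward. The column layout is an assumption: columns 0-3 hold the demo images and column 4 the query.

```python
factors = dara.factors.detach()                  # (heads, images)
for h, row in enumerate(factors, start=1):
    i = int(row[:4].argmax())                    # demo image this head amplifies most
    print(f"Head {h}: Demo {i + 1} amplified by {row[i].item():.2f}")
```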

TrueMICL: A Dedicated MICL Dataset

We introduce TrueMICL, a MICL-dedicated dataset designed with a critical principle: correct responses must rely on comprehensive understanding of multimodal context, especially visual information. Unlike existing MICL datasets that focus on task recognition, TrueMICL emphasizes task learning where models must understand relationships between visual and textual elements.

Dataset Design Principles

To this end, we have designed a novel MICL-dedicated dataset, TrueMICL, guided by the following principles:

  1. Context Dependency: The task must be unsolvable without the context images. This ensures that models cannot rely solely on textual patterns or prior knowledge.
  2. Novelty: The task should introduce novel image-text relationships that are uncommon in pre-training or instruction tuning, to effectively challenge task learning ability.
  3. Perceivable Visual Information: The necessary information extracted from the images should not be overly complex, ensuring that the visual encoder can perceive it accurately. This allows us to focus on MICL ability rather than visual perception challenges.
  4. Compatibility with Backbone: The task should push the boundaries of multimodal in-context learning without exceeding the language backbone's capabilities.
  5. Configurability and Extensibility: The data generation pipeline should be easily configurable to produce additional samples at different levels of difficulty.

TrueMICL comprises 867 samples across 4 task types and 7 distinct tasks, covering mathematical reasoning, pattern finding, and novel visual concept learning. The dataset is designed to be scalable and configurable for different levels of difficulty.
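
As an illustration of the configurability principle, the toy sketch below generates an Operator Induction-style sample (two rendered numbers whose hidden relation is multiplication, following the example in the table below). This is not the authors' generation pipeline; it assumes Pillow for rendering, and difficulty is controlled simply by the number range.

```python
import random
from PIL import Image, ImageDraw

def make_operator_induction_sample(low: int = 0, high: int = 9):
    """Toy generator: two numbers rendered in an image; the hidden rule is multiplication."""
    a, b = random.randint(low, high), random.randint(low, high)
    img = Image.new("RGB", (224, 224), "white")
    ImageDraw.Draw(img).text((70, 105), f"{a}    {b}", fill="black")
    return img, "What is the result of the hidden operation on the two numbers?", a * b

# A 2-shot episode: two demonstrations with answers, plus one unanswered query.
demos = [make_operator_induction_sample() for _ in range(2)]
query_image, query_question, _ = make_operator_induction_sample()
```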

Dataset Examples

The table below shows representative examples from TrueMICL, illustrating how each task requires understanding the relationship between images and text in demonstrations to correctly answer queries.

| Category | Task | Demo 1 (image / answer) | Demo 2 (image / answer) | Query (image) | Label | Explanation |
|---|---|---|---|---|---|---|
| Math Reasoning | Operator Induction | OI Demo 1 / 0 | OI Demo 2 / 12 | OI Query | 8 | Multiplying the two numbers in the image |
| Math Reasoning | Clock Math | CM Demo 1 / 14 | CM Demo 2 / 3 | CM Query | 20 | Adding the two numbers in the clock |
| Concept Binding | Outlier Detection | SC Demo 1 / Green | SC Demo 2 / Black | SC Query | Red | The outlier color in the image |
| Concept Binding | CLEVR Count | CL Demo 1 / 2 | CL Demo 2 / 5 | CL Query | 2 | The number of spheres |
| Pattern Finding | Sudoku | SD Demo 1 / 918 | SD Demo 2 / 217 | SD Query | 470 | The missing number in between |
| Pattern Finding | Palindrome | PN Demo 1 / 7 | PN Demo 2 / 9 | PN Query | 9 | To form a palindrome number |
| Novel Concept | Character Classification | CC Demo 1 / Flora | CC Demo 2 / Walter | CC Query | Flora | The same character as in the demos |

Table: An overview of task examples in TrueMICL. Answering each query correctly requires the model to learn the relationship between images and text in the demonstrations.

Experimental Results

Our experiments across a range of MLLMs demonstrate the effectiveness of both DARA and TrueMICL. Current MLLMs find the TrueMICL tasks challenging, while DARA significantly improves MICL performance on both TrueMICL and standard VL tasks.

Experimental Results
Comparison of different methods on the TrueMICL dataset, showing significant improvements from DARA fine-tuning across multiple model architectures.

Key Findings and Takeaways

🎯 Main Contributions

  • Visual Context Neglect Problem: We identify and address a critical limitation of current MLLMs, namely their tendency to neglect visual information in multimodal demonstrations, which leads to superficial text imitation rather than genuine multimodal learning.
  • DARA Method: Our Dynamic Attention Reallocation approach provides an efficient solution with minimal parameters (as few as 160), achieving up to 10% performance improvements by strategically rebalancing attention toward visual content.
  • TrueMICL Benchmark: We introduce a rigorous evaluation dataset specifically designed to test true multimodal in-context learning capabilities, revealing significant gaps in current model performance.

💡 Key Insights

  • Attention Matters: Visualization analysis confirms that DARA successfully redirects model attention from text-dominant (28%) to more balanced multimodal processing (46.7% visual attention).
  • Parameter Efficiency: DARA demonstrates that targeted architectural modifications can be more effective than large-scale parameter tuning, requiring 40-50× fewer parameters than LoRA to achieve similar performance.
  • Evaluation Gap: Standard VL benchmarks fail to capture true MICL capabilities, highlighting the need for dedicated evaluation frameworks like TrueMICL.

🚀 Future Implications

  • Practical Impact: DARA's lightweight design makes it easily deployable in resource-constrained environments while providing consistent improvements across different model architectures.
  • Research Direction: Our work opens new avenues for investigating attention mechanisms in multimodal learning and designing more effective MICL evaluation protocols.
  • Broader Applications: The principles behind DARA can potentially be extended to other multimodal tasks beyond in-context learning, wherever visual attention reallocation is beneficial.

BibTeX

@misc{chen2024neglected,
  title={From Neglected to Indispensable: Emphasizing the Visual Context for Multimodal In-Context Learning},
  author={Shuo Chen and Jianzhe Liu and Zhen Han and Yan Xia and Daniel Cremers and Philip Torr and Volker Tresp and Jindong Gu},
  year={2024},
  conference={COLM 2025},
  primaryClass={cs.CV}
}

Acknowledgement

This template is adapted from Eyes Wide Shut.