AVIS: Adaptive Test-Time Scaling for Vision-Language Models

Ahmadreza Jeddi1,2,3,†,*, Minh N. Le1,2,3,†,*, Amirhossein Kazerouni1,2,3,*, Hakki Karaimer1, Hue Nguyen1, Iqbal Mohomed1, Michael Brudno2,3, Konstantinos G. Derpanis1,3,4, Alex Levinshtein1, Babak Taati2,3, Radek Grzeszczuk1,†
1Samsung AI Center-Toronto, 2University of Toronto, 3Vector Institute, 4York University
Preprint

*Equal contribution. Work done at Samsung.

AVIS accuracy-compute trade-off on image and video benchmarks

AVIS treats VLM inference as a balance between seeing and thinking: it removes redundant visual context, then spends the saved compute on reasoning rollouts only when they are useful.

Overview

Chain-of-thought prompting and test-time scaling can improve visual reasoning, but they are expensive: high-resolution images and videos create long visual prefixes, while repeated reasoning rollouts increase decoding cost. AVIS frames this as a joint test-time allocation problem over two coupled axes: Visual Context Scaling (VCS), which controls how many visual tokens are retained, and Visual Reasoning Scaling (VRS), which controls how many reasoning trajectories are sampled and aggregated.

1. See less

Use Key Diversity Visual (KDV) pruning to remove redundant visual tokens before the language-model prefill.

2. Think adaptively

Use a lightweight difficulty predictor to choose the rollout budget K for self-consistency.

3. Reuse prefill

Run K shared-prefill rollouts, aggregate answers by majority vote, and keep compute below the vanilla baseline.
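The aggregation in step 3 can be sketched in a few lines. This is a minimal illustration of majority voting over rollout answers, not the authors' code: the function name `majority_vote`, the light normalization (strip and lowercase), and the tie-breaking rule are all assumptions made for the example.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer among the K rollout answers.

    Assumption: answers are short strings (e.g. multiple-choice letters),
    compared after stripping whitespace and lowercasing. Ties go to the
    answer that appeared first.
    """
    if not answers:
        raise ValueError("need at least one rollout answer")
    normalized = [a.strip().lower() for a in answers]
    counts = Counter(normalized)
    best = max(counts.values())
    # Scan in rollout order so ties resolve to the earliest answer.
    for answer in normalized:
        if counts[answer] == best:
            return answer


# Hypothetical answers from K = 5 rollouts.
rollout_answers = ["B", "b", "C", "B ", "A"]
final_answer = majority_vote(rollout_answers)  # -> "b"
```

With K = 1 the vote simply returns the single rollout's answer, so the same code path covers every rollout budget.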

Method

AVIS method diagram

AVIS has two small, deployment-friendly decisions. First, adaptive KDV scores visual tokens using diversity in attention-key space and keeps a sample-dependent subset of tokens. Second, a difficulty-aware rollout selector maps a predicted solvability score to K in {1, 3, 5, 7}. Easy examples use little compute, hopeless examples avoid wasteful rollouts, and hard-but-solvable examples receive the largest reasoning budget.
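The two decisions can be sketched as follows. This is a hedged illustration under stated assumptions, not the AVIS implementation: the page does not give the KDV scoring rule or the selector thresholds, so the example uses greedy farthest-point selection with cosine similarity as a stand-in for key-space diversity, a fixed `keep` count instead of the sample-dependent one, and hypothetical thresholds `low` and `high` for the solvability score. The names `select_diverse_tokens` and `choose_rollout_budget` are invented for the example.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def select_diverse_tokens(keys, keep):
    """Keep `keep` token indices whose attention keys are mutually dissimilar.

    Assumption: greedy farthest-point selection, seeded with token 0.
    """
    if not 0 < keep <= len(keys):
        raise ValueError("keep must be between 1 and len(keys)")
    kept = [0]
    # max_sim[i] = highest similarity between token i and any kept token.
    max_sim = [cosine_similarity(k, keys[0]) for k in keys]
    while len(kept) < keep:
        candidates = [i for i in range(len(keys)) if i not in kept]
        # The least-covered token adds the most diversity.
        nxt = min(candidates, key=lambda i: max_sim[i])
        kept.append(nxt)
        max_sim = [max(s, cosine_similarity(k, keys[nxt]))
                   for s, k in zip(max_sim, keys)]
    return sorted(kept)


def choose_rollout_budget(solvability, low=0.2, high=0.8):
    """Map a predicted solvability score in [0, 1] to K in {1, 3, 5, 7}.

    Assumption: scores above `high` are easy and scores below `low` are
    hopeless; both get K = 1. In between, lower scores get more rollouts.
    """
    if solvability >= high or solvability <= low:
        return 1
    position = (high - solvability) / (high - low)  # 0 near easy, 1 near hard
    budgets = (3, 5, 7)
    index = min(int(position * len(budgets)), len(budgets) - 1)
    return budgets[index]


# Toy 2-D keys: token 1 nearly duplicates token 0 and is pruned first.
toy_keys = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [-1.0, 0.0]]
kept_indices = select_diverse_tokens(toy_keys, keep=3)  # -> [0, 2, 3]
budgets = [choose_rollout_budget(s) for s in (0.95, 0.7, 0.5, 0.3, 0.05)]
# -> [1, 3, 5, 7, 1]
```

The budget curve is deliberately non-monotonic: compute rises as examples get harder and then drops back to a single rollout once the predictor judges extra samples unlikely to help.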

Main Results

-52% image FLOPs, with +1.9 average score over vanilla

-66% video FLOPs, with +1.3 average score over vanilla

3.7% matched-FLOPs gain over the closest fixed policy

Main benchmark results across image and video VQA tasks
VCS-VRS trade-off heatmap
Matched-FLOPs comparison table

Latency and RL Post-Training

Shared-prefill inference makes adaptive reasoning practical: AVIS keeps wall-clock latency close to vanilla while improving the accuracy-compute trade-off. The same allocation idea also transfers to RL post-trained VLMs such as VL-Rethinker, Vision-R1, and OpenVLThinker.
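A rough way to see why sharing the prefill matters is to count how many tokens the model processes. The sketch below is a simplified proxy, not a FLOPs or latency measurement from the paper: the function `rollout_cost` and the token counts are hypothetical, and real cost also depends on attention over the growing context.

```python
def rollout_cost(prefill_tokens, decode_tokens, k, shared_prefill=True):
    """Token-processing proxy for K rollouts.

    Assumption: with a shared prefill the prompt (visual + text tokens) is
    encoded once and its cache is reused by every rollout; without sharing,
    each rollout re-encodes the prompt.
    """
    prefill_passes = 1 if shared_prefill else k
    return prefill_passes * prefill_tokens + k * decode_tokens


# Hypothetical sizes: a 2000-token prompt and 200 decoded tokens per rollout.
shared = rollout_cost(2000, 200, k=5)                            # -> 3000
unshared = rollout_cost(2000, 200, k=5, shared_prefill=False)    # -> 11000
```

Under this proxy, pruning visual tokens shrinks `prefill_tokens` once, while the rollout budget K only multiplies the much shorter decode term.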

Latency comparison for shared-prefill inference
AVIS results on RL post-trained VLMs

BibTeX

@misc{jeddi2026avis,
  title        = {AVIS: Adaptive Test-Time Scaling for Vision--Language Models},
  author       = {Ahmadreza Jeddi and Minh N. Le and Amirhossein Kazerouni and Hakki Karaimer and Hue Nguyen and Iqbal Mohomed and Michael Brudno and Konstantinos G. Derpanis and Alex Levinshtein and Babak Taati and Radek Grzeszczuk},
  year         = {2026},
  note         = {Preprint},
  url          = {https://avis-vlm.github.io/}
}