AVIS: Adaptive Test-Time Scaling for Vision-Language Models

Ahmadreza Jeddi^{1,2,3,5,†,*}, Minh N. Le^{1,2,3,†,*}, Amirhossein Kazerouni^{1,2,3,5,*}, Hakki Karaimer^{1}, Hue Nguyen^{1}, Iqbal Mohomed^{1}, Michael Brudno^{2,3,5}, Konstantinos G. Derpanis^{1,2,3,4}, Alex Levinshtein^{1}, Babak Taati^{2,3,5}, Radek Grzeszczuk^{1,†}

^{1}Samsung AI Center-Toronto, ^{2}University of Toronto, ^{3}Vector Institute, ^{4}York University, ^{5}UHN

Preprint

^*Equal contribution. ^†Work done at Samsung.


AVIS accuracy-compute trade-off on image and video benchmarks

AVIS treats VLM inference as a balance between seeing and thinking: it removes redundant visual context, then spends the saved compute on reasoning rollouts only when they are useful.

Overview

Chain-of-thought prompting and test-time scaling can improve visual reasoning, but they are expensive: high-resolution images and videos create long visual prefixes, while repeated reasoning rollouts increase decoding cost. AVIS frames this as a joint test-time allocation problem over two coupled axes: Visual Context Scaling (VCS), which controls how many visual tokens are retained, and Visual Reasoning Scaling (VRS), which controls how many reasoning trajectories are sampled and aggregated.

1. See less

Use Key Diversity Visual (KDV) pruning to remove redundant visual tokens before the language-model prefill.

2. Think adaptively

Use a lightweight difficulty predictor to choose the rollout budget K for self-consistency.

3. Reuse prefill

Run K shared-prefill rollouts, aggregate answers by majority vote, and keep compute below the vanilla baseline.
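The third step above can be sketched as follows. This is a minimal illustration of self-consistency with a shared prefill, not the authors' implementation: `prefill` and `decode` are hypothetical callables standing in for the model's prompt encoding and one sampled rollout, and the first-seen tie-break rule is an assumption.

```python
from collections import Counter
from typing import Callable, Hashable, List


def majority_vote(answers: List[Hashable]) -> Hashable:
    """Return the most frequent answer; ties go to the answer seen first."""
    if not answers:
        raise ValueError("need at least one answer")
    counts = Counter(answers)
    best = max(counts.values())
    # Scan in rollout order so tie-breaking is deterministic.
    for answer in answers:
        if counts[answer] == best:
            return answer


def self_consistency(prefill: Callable[[], object],
                     decode: Callable[[object, int], Hashable],
                     k: int) -> Hashable:
    """Run the (hypothetical) prefill once, then decode k rollouts from it."""
    if k < 1:
        raise ValueError("k must be at least 1")
    state = prefill()  # computed once and reused by every rollout
    answers = [decode(state, seed) for seed in range(k)]
    return majority_vote(answers)
```

Because the prompt state is built once, the extra cost of going from 1 to K rollouts is only the additional decoding, which is what makes a larger K affordable after visual tokens have been pruned.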

Method

AVIS makes two lightweight, deployment-friendly decisions for each input. First, adaptive KDV scores visual tokens by their diversity in attention-key space and keeps a sample-dependent subset. Second, a difficulty-aware rollout selector maps a predicted solvability score to a rollout budget K in {1, 3, 5, 7}. Easy examples use little compute, examples predicted to be unsolvable avoid wasteful rollouts, and hard-but-solvable examples receive the largest reasoning budget.
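To make the two decisions concrete, here is a rough sketch under stated assumptions. The page does not give the exact KDV scoring rule or the selector's thresholds, so the greedy farthest-point selection in cosine distance, the `min_dist` stopping rule, and the score bands below are all hypothetical stand-ins chosen only to show the shape of the idea.

```python
import math
from typing import List, Sequence


def _unit(v: Sequence[float]) -> List[float]:
    """L2-normalise a vector (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def _cos_dist(a: Sequence[float], b: Sequence[float]) -> float:
    return 1.0 - sum(x * y for x, y in zip(a, b))


def select_diverse_tokens(keys: Sequence[Sequence[float]],
                          min_dist: float = 0.1) -> List[int]:
    """Greedy farthest-point selection over key vectors (assumed rule).

    Stops once every remaining token lies within `min_dist` of a kept
    token, so the number of kept tokens depends on the sample.
    """
    if not keys:
        return []
    unit = [_unit(k) for k in keys]
    kept = [0]  # seed with the first token (arbitrary choice)
    dist = [_cos_dist(u, unit[0]) for u in unit]
    dist[0] = float("-inf")  # never pick a kept token again
    while len(kept) < len(unit):
        nxt = max(range(len(unit)), key=dist.__getitem__)
        if dist[nxt] < min_dist:
            break  # everything left is redundant with a kept token
        kept.append(nxt)
        dist = [min(d, _cos_dist(u, unit[nxt])) for d, u in zip(dist, unit)]
        dist[nxt] = float("-inf")
    return sorted(kept)


def choose_rollouts(p_solvable: float) -> int:
    """Map a solvability score in [0, 1] to K in {1, 3, 5, 7}.

    The band edges are illustrative assumptions, not values from the paper.
    """
    if not 0.0 <= p_solvable <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    if p_solvable < 0.2:
        return 1  # predicted unsolvable: extra rollouts would be wasted
    if p_solvable < 0.4:
        return 7  # hard but solvable: largest budget
    if p_solvable < 0.6:
        return 5
    if p_solvable < 0.8:
        return 3
    return 1      # easy: a single rollout is enough
```

Note that the mapping is deliberately non-monotonic: both ends of the score range receive K = 1, and only the uncertain middle is given more rollouts.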

Main Results

-52% image FLOPs, with +1.9 average score over vanilla

-66% video FLOPs, with +1.3 average score over vanilla

3.7% matched-FLOPs gain over the closest fixed policy

Main benchmark results across image and video VQA tasks

Latency and RL Post-Training

Shared-prefill inference makes adaptive reasoning practical: AVIS keeps wall-clock latency close to vanilla while improving the accuracy-compute trade-off. The same allocation idea also transfers to RL post-trained VLMs such as VL-Rethinker, Vision-R1, and OpenVLThinker.
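A back-of-the-envelope cost model helps show why pruning plus a shared prefill can leave room for several rollouts. The sketch below assumes the common approximation of about 2 FLOPs per parameter per token for a forward pass and ignores the attention term that grows with sequence length; the function names and token counts are hypothetical and are not measurements from the paper.

```python
def forward_flops(n_params: float, n_tokens: int) -> float:
    """Rough forward-pass cost: about 2 FLOPs per parameter per token."""
    return 2.0 * n_params * n_tokens


def inference_flops(n_params: float, n_visual: int, n_text: int,
                    n_decode: int, k: int = 1,
                    shared_prefill: bool = True) -> float:
    """Estimate total FLOPs for k rollouts of n_decode tokens each."""
    prefill = forward_flops(n_params, n_visual + n_text)
    decode = forward_flops(n_params, n_decode)
    n_prefills = 1 if shared_prefill else k  # sharing pays the prefill once
    return n_prefills * prefill + k * decode


# Hypothetical token counts, for illustration only.
vanilla = inference_flops(1.0, n_visual=1000, n_text=100, n_decode=100, k=1)
pruned = inference_flops(1.0, n_visual=300, n_text=100, n_decode=100, k=5)
unshared = inference_flops(1.0, n_visual=300, n_text=100, n_decode=100,
                           k=5, shared_prefill=False)
```

Under these made-up numbers, five rollouts on the pruned input cost less than one rollout on the full input, while the same five rollouts without prefill sharing cost more than twice the vanilla run.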

Latency comparison for shared-prefill inference

BibTeX

@misc{jeddi2026avisadaptivetesttimescaling,
  title={AVIS: Adaptive Test-Time Scaling for Vision-Language Models},
  author={Ahmadreza Jeddi and Minh Ngoc Le and Amirhossein Kazerouni and Hakki Can Karaimer and Hue Nguyen and Iqbal Mohomed and Michael Brudno and Alex Levinshtein and Konstantinos G. Derpanis and Babak Taati and Radek Grzeszczuk},
  year={2026},
  eprint={2606.11576},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2606.11576},
}