TACO: Steering Vision-Language-Action Models as Anti-Exploration

Steering Vision-Language-Action Models
as Anti-Exploration: A Test-Time Scaling Approach

¹Institute of Artificial Intelligence, China Telecom     ²University of Science and Technology of China
³Tsinghua University     ⁴The Hong Kong University of Science and Technology
^*Equal contributions    ^†Corresponding authors    ^‡Project Leader

Abstract

Vision-Language-Action (VLA) models, trained via flow-matching or diffusion objectives, excel at learning complex behaviors from large-scale, multi-modal datasets (e.g., human teleoperation, scripted policies). However, because VLAs incorporate diverse data modes during pre-training, and the finetuning dataset often contains demonstrations collected in a kinematically suboptimal or undesirable way, there exist redundant action modes that are irrelevant to the successful action modes of the downstream task.

Specifically, after supervised finetuning (SFT) of pre-trained VLAs, we observe a critical inference-time fragility: the outcome varies markedly with the sampled noise. In this paper, we attribute this instability to the distribution shift between the VLA policy and the policy induced by the stable success modes of the downstream task dataset.

Thus, we propose TACO, a test-time-scaling (TTS) framework that applies a lightweight pseudo-count estimator as a high-fidelity verifier of action chunks. A VLA integrated with TACO samples several action chunks and executes the one with the maximum pseudo-count, which prevents distribution shift while preserving the generalization ability of the VLA, since the constraint is applied only during inference.

Our method resembles the classical anti-exploration principle in offline reinforcement learning (RL). Being gradient-free, it offers significant computational benefits over RL updates, especially for flow- or diffusion-based VLAs, for which RL updates are difficult because of the denoising process. Extensive experiments across four simulation benchmarks (RoboTwin 2.0, RoboTwin 1.0, LIBERO, SimplerEnv) and a dual-arm platform demonstrate that our method significantly improves inference stability and success rates in downstream-task adaptation.

Coupled Pseudo-Count Estimation for VLAs
& Test-Time Scaling as Anti-Exploration

We now present TACO, a framework that treats the pseudo-count estimator as an off-the-shelf verifier to scale test-time compute for VLAs, implementing the anti-exploration principle. TACO can be directly integrated into VLAs that model the action distribution either with discrete-token autoregression (e.g., OpenVLA) or with a score-based generative formulation (e.g., RDT, $\pi_0$, $\pi_{0.5}$).
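The selection rule can be illustrated with a minimal sketch: draw several candidate action chunks, score each with a pseudo-count verifier, and execute the highest-scoring one. Everything named here (`select_by_pseudo_count`, `toy_policy`, `toy_pseudo_count`, `DATASET_ACTIONS`) is a hypothetical stand-in for illustration and not the authors' implementation; the toy verifier simply counts nearby dataset actions.

```python
import random
from typing import Callable, List, Sequence, Tuple

ActionChunk = List[List[float]]  # horizon x action_dim


def select_by_pseudo_count(
    sample_candidate: Callable[[random.Random], Tuple[ActionChunk, Sequence[float]]],
    pseudo_count: Callable[[Sequence[float]], float],
    num_candidates: int,
    rng: random.Random,
) -> Tuple[ActionChunk, float]:
    """Draw candidates from the policy and return the one the verifier scores highest."""
    if num_candidates < 1:
        raise ValueError("num_candidates must be at least 1")
    best_chunk, best_score = None, float("-inf")
    for _ in range(num_candidates):
        chunk, features = sample_candidate(rng)
        score = pseudo_count(features)
        if score > best_score:  # ties keep the earliest candidate
            best_chunk, best_score = chunk, score
    return best_chunk, best_score


# Hypothetical toy data: the mode near 0 is well supported, the mode near 1 is rare.
DATASET_ACTIONS = [0.0, 0.05, -0.05, 0.1, 1.0]


def toy_policy(rng: random.Random) -> Tuple[ActionChunk, Sequence[float]]:
    """Bimodal 1-D policy: returns a one-step chunk and its (toy) features."""
    center = 0.0 if rng.random() < 0.5 else 1.0
    action = center + rng.gauss(0.0, 0.05)
    return [[action]], [action]


def toy_pseudo_count(features: Sequence[float]) -> float:
    """Toy verifier: number of dataset actions within 0.2 of the candidate."""
    return float(sum(abs(a - features[0]) < 0.2 for a in DATASET_ACTIONS))


chunk, score = select_by_pseudo_count(toy_policy, toy_pseudo_count, 16, random.Random(0))
```

Because the policy itself is never updated, the only tunable cost in this sketch is `num_candidates`, which trades extra sampling compute for a better chance of landing in a well-supported mode.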

In the training stage (Stage 1), we sample data from the SFT dataset, add noise to the expert actions, and feed the noised actions into the VLA, which denoises them while we extract its internal representations. These representations are then used to train the CFN, the pseudo-count estimator. During inference (Stage 2), the VLA generates multiple candidate actions along with their corresponding internal representations, and the CFN acts as a selector, choosing the action with the highest count for execution.
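This page does not expand "CFN" or give its training objective. The sketch below assumes a coin-flipping-style pseudo-count estimator: every training sample is paired with a random ±1 vector, a regressor is fit to those vectors, and the mean squared magnitude of its prediction for an input seen n times is about 1/n. To stay self-contained, the regressor is replaced by its closed-form optimum (the per-key mean), and hashable keys stand in for the VLA's internal representations; all function names are hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def coin_flip_targets(num_samples: int, dim: int, rng: random.Random) -> List[List[int]]:
    """Draw one random +/-1 vector per training sample."""
    return [[rng.choice((-1, 1)) for _ in range(dim)] for _ in range(num_samples)]


def fit_mean_predictor(
    keys: Sequence[Hashable], targets: Sequence[Sequence[int]]
) -> Dict[Hashable, List[float]]:
    """Closed-form minimiser of squared error: the per-key mean of the targets.

    A neural regressor trained on the same (input, target) pairs would
    approximate this mean without storing visit counts explicitly.
    """
    grouped = defaultdict(list)
    for key, target in zip(keys, targets):
        grouped[key].append(target)
    return {
        key: [sum(column) / len(rows) for column in zip(*rows)]
        for key, rows in grouped.items()
    }


def pseudo_count(prediction: Sequence[float]) -> float:
    """Invert the expected mean-square of the prediction, which is 1/n for n visits."""
    mean_square = sum(p * p for p in prediction) / len(prediction)
    return 1.0 / max(mean_square, 1e-12)  # guard against an all-zero prediction


# Toy usage: one representation seen 8 times, another seen once.
rng = random.Random(0)
keys = ["common"] * 8 + ["rare"]
predictor = fit_mean_predictor(keys, coin_flip_targets(len(keys), 2000, rng))
common_count = pseudo_count(predictor["common"])  # close to 8
rare_count = pseudo_count(predictor["rare"])      # equal to 1 for a single visit
```

At inference time, a selector built on such an estimator would prefer the candidate whose representation receives the larger count, which matches the role the CFN plays in Stage 2.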

Simulation Results

Our method demonstrates strong performance in simulation. For flow-matching-based VLAs, it improves the success rate of π₀ by 9.1% on RoboTwin 1.0 and 7.5% on SimplerEnv, and the success rate of π₀.₅ by 4.7% on RoboTwin 2.0 and 1.8% on LIBERO. We also apply our test-time scaling approach to the autoregressive VLA OpenVLA, achieving an average improvement of 6.0% in success rate on LIBERO. Across extensive simulation experiments and multiple VLAs with different architectures, our method consistently yields substantial improvements over the base policy.

RoboTwin 1.0 Results

Simpler-WidowX Results

Baseline results are taken from Simpler, RoboVLM, and SpatialVLA. Our method achieves an average improvement of 7.5% over $\pi_0$.

RoboTwin 2.0 Results

Each task is tested across 100 randomly generated scenes using 100 different seeds. For test-time scaling, the number of candidate actions is set to 50.

LIBERO-Long Results

Our method, TACO, is applied to both $\pi_{0.5}$ and OpenVLA. For the autoregressive VLA architecture, we set $\text{temperature}=1$ for action sampling. Results marked with $\mathbf{*}$ are taken directly from RoboMonkey. OpenVLA (reproduced) denotes our own reproduction results, which serve as the baseline for our TACO implementation. We observe that TACO can be effectively applied to autoregressive VLAs and consistently improves performance.
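For an autoregressive policy, candidate diversity comes from stochastic token sampling: greedy decoding would return the same action every time and leave the verifier nothing to choose from. The following is a generic sketch of temperature sampling over action-token logits, not OpenVLA's actual decoding code; the function names and example logits are hypothetical.

```python
import math
import random
from typing import List, Sequence


def softmax_with_temperature(logits: Sequence[float], temperature: float) -> List[float]:
    """Convert logits to probabilities; temperature=1 leaves the model distribution unchanged."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits: Sequence[float], temperature: float, rng: random.Random) -> int:
    """Draw one token index from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Toy usage: several draws from the same logits give a set of candidate tokens.
rng = random.Random(0)
example_logits = [0.0, 1.0, 2.0]
candidates = [sample_token(example_logits, 1.0, rng) for _ in range(8)]
```

Lower temperatures concentrate probability on the top token and shrink the candidate set, while a temperature of 1 samples from the distribution the model was trained to produce.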

Real-world Results: Dual Realman RM75 Robot Arm

Laptop

$\pi_0$ + TACO ✓

$\pi_0$ ✗

Paper and Pen

$\pi_0$ + TACO ✓

$\pi_0$ ✗

Receive Book

$\pi_0$ + TACO ✓

$\pi_0$ ✗

Storage Charger

$\pi_0$ + TACO ✓

$\pi_0$ ✗

In our five real-world experiments, TACO yields a 16.0% performance improvement. Moreover, to reduce the latency introduced by sampling multiple actions during test-time scaling, we apply a KV-cache optimization that significantly decreases inference delay, ensuring the control frequency and smoothness required for real-robot operation.
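The page does not describe how the KV cache is used. One common pattern, assumed here, is to encode the shared context (observation and instruction) once and reuse the result for every candidate, so that only the per-candidate sampling is repeated. The toy class below is a hypothetical stand-in that counts context encodings; it is not a real model API.

```python
import random
from typing import List, Sequence, Tuple


class ToyPolicy:
    """Hypothetical policy with an expensive context pass and a cheap action-sampling pass."""

    def __init__(self) -> None:
        self.prefix_calls = 0

    def encode_prefix(self, observation: Sequence[float]) -> Tuple[float, ...]:
        """Stands in for computing the keys/values of the context tokens."""
        self.prefix_calls += 1
        return tuple(2.0 * x for x in observation)

    def decode_action(self, prefix_cache: Tuple[float, ...], rng: random.Random) -> List[float]:
        """Stands in for sampling one action conditioned on the cached context."""
        return [value + rng.gauss(0.0, 0.1) for value in prefix_cache]


def sample_naive(policy: ToyPolicy, observation: Sequence[float], n: int, rng: random.Random):
    """Re-encode the context for every candidate."""
    return [policy.decode_action(policy.encode_prefix(observation), rng) for _ in range(n)]


def sample_cached(policy: ToyPolicy, observation: Sequence[float], n: int, rng: random.Random):
    """Encode the context once and reuse the cache for all candidates."""
    cache = policy.encode_prefix(observation)
    return [policy.decode_action(cache, rng) for _ in range(n)]


naive_policy, cached_policy = ToyPolicy(), ToyPolicy()
naive_out = sample_naive(naive_policy, [0.1, 0.2], 5, random.Random(0))
cached_out = sample_cached(cached_policy, [0.1, 0.2], 5, random.Random(0))
```

In this sketch both paths return identical candidates for the same seed, while the cached path performs one context encoding instead of one per candidate.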

Key Moments: Grasping a Marker

$\pi_0$ + TACO ✓

Using TACO for test-time scaling lets the robot choose a better grasp of the pen and eliminates hesitation during grasping.

$\pi_0$ ✗

When using π₀ alone for inference, the robot tends to fall into suboptimal modes during grasping and often hesitates at the moment of grasp, leading to failures.

BibTeX

@article{yang2025taco,
  title={Steering Vision-Language-Action Models as Anti-Exploration: A Test-Time Scaling Approach},
  author={Siyuan Yang and Yang Zhang and Haoran He and Ling Pan and Xiu Li and Chenjia Bai and Xuelong Li},
  journal={arXiv preprint arXiv:2512.02834},
  year={2025}
}
