★ 6/10 · AI · 2026-03-17

Holotron-12B - High-Throughput Computer-Use Agent

Summary

H Company has released Holotron-12B, a multimodal model optimized for high-throughput computer-use agent tasks. The model is post-trained from NVIDIA's Nemotron-Nano-2 VL architecture to enhance performance in screen understanding, localization, and navigation within interactive environments.

Key Points

  • Holotron-12B is post-trained from the NVIDIA Nemotron-Nano-12B-v2-VL-BF16 base model using approximately 14 billion tokens of proprietary localization and navigation data.
  • The model achieved an 80.5% score on the WebVoyager benchmark, a significant increase from the 35.1% achieved by the base Nemotron model.
  • When deployed on a single H100 GPU with vLLM (v0.14.1), Holotron-12B delivered more than twice the throughput of Holo2-8B (see the loading sketch after this list).
  • The model reaches a peak throughput of 8.9k tokens/s at a maximum concurrency of 100 benchmark workers.
  • It shows improved performance on localization and grounding benchmarks, including OS-World-G, GroundUI, and WebClick.
  • The model is available on Hugging Face under an NVIDIA Open Model License.
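
For context, here is a minimal sketch of loading the released checkpoint for offline inference with vLLM's Python API. The Hugging Face repo id, prompt, and sampling settings are illustrative assumptions, not values confirmed by the release.

    # Hedged sketch: loading Holotron-12B for offline inference with vLLM.
    # The repo id "Hcompany/Holotron-12B" is an assumed placeholder.
    from vllm import LLM, SamplingParams

    llm = LLM(model="Hcompany/Holotron-12B")  # single-H100 deployment per the release
    params = SamplingParams(temperature=0.0, max_tokens=64)

    # Text-only smoke test; real computer-use prompts would also attach
    # screenshots as multimodal inputs.
    outputs = llm.generate(["List the UI elements on a browser settings page."], params)
    print(outputs[0].outputs[0].text)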

Technical Details

Holotron-12B uses a hybrid architecture that combines a State-Space Model (SSM) with an attention mechanism. This design sidesteps the quadratic compute cost of standard transformer attention by relying on a linear recurrent model. Unlike vanilla attention, which must store K and V activations for every token (the KV cache), the SSM component keeps a constant-size state per layer per generated sequence, so its memory footprint is independent of sequence length. The architecture is specifically optimized for long-context inference involving multiple high-resolution images and lengthy interaction histories.
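
To make the memory argument concrete, the back-of-the-envelope Python sketch below compares the two regimes; the layer count, head, and state dimensions are illustrative assumptions, not Holotron-12B's published configuration.

    # Illustrative per-sequence decode memory: a vanilla-attention KV cache
    # versus a constant-size SSM state. All dimensions are assumptions.

    def kv_cache_bytes(seq_len: int, n_layers: int = 40, n_kv_heads: int = 8,
                       head_dim: int = 128, bytes_per_el: int = 2) -> int:
        # K and V activations are kept for every token at every attention
        # layer, so memory grows linearly with sequence length.
        return 2 * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_el

    def ssm_state_bytes(n_layers: int = 40, state_dim: int = 128,
                        inner_dim: int = 4096, bytes_per_el: int = 2) -> int:
        # The recurrent state has a fixed size per layer per sequence,
        # independent of how many tokens have been generated.
        return n_layers * state_dim * inner_dim * bytes_per_el

    for seq_len in (1_000, 32_000, 128_000):
        kv = kv_cache_bytes(seq_len) / 2**20
        ssm = ssm_state_bytes() / 2**20
        print(f"{seq_len:>7} tokens: KV cache ~{kv:8.1f} MiB | SSM state ~{ssm:5.1f} MiB")

Under these assumed dimensions, the KV cache grows from roughly 156 MiB at 1k tokens to about 20 GiB at 128k, while the recurrent state holds steady at around 40 MiB per sequence.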

The training process consisted of two stages, the first being supervised fine-tuning on a specialized data mixture focused on UI-level interactions and screen grounding. In production-style workloads, the model's efficiency shows up as lower VRAM use per sequence, which allows larger effective batch sizes. In controlled experiments, Holotron-12B's throughput scaled steadily as concurrency increased, whereas Holo2-8B plateaued at approximately 5.1k tokens/s.
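
A concurrency sweep of this kind can be approximated against a vLLM OpenAI-compatible server, as in the sketch below; the endpoint, model id, prompt, and worker counts are placeholder assumptions rather than the benchmark's actual harness.

    # Hedged sketch: measure aggregate output tokens/s at increasing
    # concurrency against a locally running vLLM OpenAI-compatible server.
    import asyncio
    import time

    import aiohttp

    URL = "http://localhost:8000/v1/completions"  # default vLLM serve endpoint

    async def one_request(session: aiohttp.ClientSession) -> int:
        payload = {
            "model": "Hcompany/Holotron-12B",  # placeholder model id
            "prompt": "Describe the visible UI elements.",
            "max_tokens": 256,
        }
        async with session.post(URL, json=payload) as resp:
            body = await resp.json()
            return body["usage"]["completion_tokens"]

    async def sweep(concurrency: int) -> float:
        async with aiohttp.ClientSession() as session:
            start = time.perf_counter()
            tokens = await asyncio.gather(*(one_request(session)
                                            for _ in range(concurrency)))
            return sum(tokens) / (time.perf_counter() - start)

    if __name__ == "__main__":
        for c in (1, 10, 50, 100):  # up to the 100-worker peak cited above
            print(f"concurrency={c:3d}: ~{asyncio.run(sweep(c)):,.0f} tokens/s")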

Impact / Why It Matters

Holotron-12B provides a scalable foundation for deploying high-concurrency, autonomous computer-use agents in production environments. Its reduced memory footprint and high throughput make it particularly suitable for resource-intensive workloads such as automated data generation, annotation, and online reinforcement learning.

ai automation llm