Inside VAKRA: Reasoning, Tool Use, and Failure Modes of Agents
Summary
VAKRA is an executable benchmark designed to evaluate the reasoning and tool-use capabilities of AI agents within enterprise-like environments. It moves beyond testing isolated skills by measuring compositional reasoning across a large-scale set of APIs and documents using full execution traces.
Key Points
- The benchmark features over 8,000 locally hosted APIs across 62 domains, paired with domain-aligned document collections.
- It evaluates four distinct capabilities: API Chaining, Tool Selection, Multi-Hop Reasoning, and Multi-Hop/Multi-Source Reasoning with Policy Adherence.
- Capability 1 (API Chaining) utilizes the SLOT-BIRD and SEL-BIRD collections, requiring agents to execute 1–12 tool calls per instance.
- Capability 2 (Tool Selection) uses the REST-BIRD collection, where domains contain between 6 and 328 tools, necessitating a shortlisting mechanism to stay within the OpenAI API's limit of 128 tools per request.
- Capability 4 introduces multi-turn dialogs and tool-use policies, where agents must follow plain-text instructions regarding which knowledge sources to access.
- The evaluation uses a waterfall-style pipeline: it first verifies policy adherence, then compares predicted tool-call trajectories against the ground truth, and finally assesses response correctness, with each stage gating the next.
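The waterfall evaluation described above can be sketched as follows. This is a minimal illustration, not VAKRA's actual implementation; the field names, stage checks, and exact-match scoring are all assumptions.

```python
from dataclasses import dataclass

# Hypothetical instance record; the schema is illustrative, not VAKRA's.
@dataclass
class Instance:
    predicted_calls: list   # tool calls the agent made
    gold_calls: list        # ground-truth tool-call trajectory
    predicted_answer: str
    gold_answer: str
    allowed_sources: set    # knowledge sources the policy permits

def policy_ok(calls, allowed):
    # Stage 1: every call must target a permitted knowledge source.
    return all(c["source"] in allowed for c in calls)

def trajectory_ok(pred, gold):
    # Stage 2: predicted tool-call trajectory must match the ground truth.
    return [c["tool"] for c in pred] == [c["tool"] for c in gold]

def response_ok(pred, gold):
    # Stage 3: final-answer correctness (exact match, for simplicity).
    return pred.strip().lower() == gold.strip().lower()

def evaluate(inst: Instance) -> dict:
    # Waterfall: each stage short-circuits, so later stages never run
    # once an earlier one fails.
    if not policy_ok(inst.predicted_calls, inst.allowed_sources):
        return {"passed": False, "failed_stage": "policy"}
    if not trajectory_ok(inst.predicted_calls, inst.gold_calls):
        return {"passed": False, "failed_stage": "trajectory"}
    if not response_ok(inst.predicted_answer, inst.gold_answer):
        return {"passed": False, "failed_stage": "response"}
    return {"passed": True, "failed_stage": None}
```

The short-circuiting matters for diagnostics: an instance that violates policy is reported as a policy failure even if its final answer happens to be correct.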
Technical Details
The VAKRA environment is built on MCP (Model Context Protocol) servers. In the API Chaining task, a specialized get_data(tool_universe_id=id) tool initializes data sources and returns a lightweight preview of the dataset. Because the full dataset is stored server-side, this design avoids transferring large amounts of data over the MCP transport. The SEL-BIRD collection extends the generic SLOT-BIRD set with specialized tools and query-specific getters (e.g., get_KEY_NAME), averaging four functions per instance.
For multi-source reasoning tasks, the benchmark applies a decontamination process during data generation. This ensures that the information required for any given logical hop is available in only one source (either an API or a RAG-based document index), preventing the agent from exploiting information leaked into the wrong source. The evaluation metric is execution-centric: rather than relying on simple string matching, the framework executes the agent's predicted tool calls in the environment and compares the resulting tool responses against the ground truth, thereby crediting alternative but valid reasoning paths.
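Execution-centric scoring can be illustrated with a toy environment. Everything below is an assumption for demonstration purposes: the tools, the environment state, and the order-insensitive comparison are not part of VAKRA, but they show why comparing executed responses rather than call strings tolerates alternative trajectories.

```python
# Toy environment state; purely illustrative.
ENV = {"users": {"u1": {"name": "Ada", "city": "Paris"}}}

def execute(call: dict):
    # Dispatch a predicted tool call against the environment and
    # return the tool's response payload.
    tool, args = call["tool"], call["args"]
    if tool == "get_user":
        return ENV["users"][args["user_id"]]
    if tool == "get_user_city":
        return {"city": ENV["users"][args["user_id"]]["city"]}
    raise KeyError(f"unknown tool: {tool}")

def responses_match(pred_calls: list, gold_responses: list) -> bool:
    # Execute the agent's predicted calls and check that every
    # ground-truth response payload was produced, regardless of which
    # call sequence produced it.
    pred_responses = [execute(c) for c in pred_calls]
    return all(g in pred_responses for g in gold_responses)
```

Under string matching, an agent calling get_user where the reference trajectory called a different tool would be penalized even when the returned data is equivalent; comparing executed responses avoids that false negative.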
Impact / Why It Matters
VAKRA gives developers a rigorous framework for isolating specific failure modes in agentic workflows, particularly in tool selection and policy adherence. It enables testing of complex, multi-step reasoning chains in a controlled, executable environment that mirrors real-world enterprise constraints.