About LOCA-bench

LOCA-bench evaluates language agents under extreme, controllable context-growth scenarios. Given a task prompt, it automatically scales the environment state to regulate the agent's context length from 8K to 256K tokens.

Tasks use mock MCP servers simulating real services: Google Calendar, Canvas LMS, Email, BigQuery, Google Sheets, Snowflake, and WooCommerce.
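To illustrate the idea of regulating context length through environment state, here is a minimal hypothetical sketch: the size of a mock service's state (calendar events, in this example) is scaled so that the serialized tool output grows accordingly. All names and the JSON schema are illustrative assumptions, not LOCA-bench's actual API.

```python
# Hypothetical sketch: controlling context growth by scaling mock
# environment state. Names/schema are illustrative, not the benchmark's.
import json


def make_state(n_events):
    """Generate a mock calendar state with n_events entries."""
    return {
        "events": [
            {"id": i, "title": f"Event {i}", "date": "2026-01-15"}
            for i in range(n_events)
        ]
    }


def serialized_size(state):
    """Approximate how much context the state consumes when returned
    to the agent as a serialized tool result."""
    return len(json.dumps(state))


# A larger environment state yields a longer tool output, and hence
# a longer agent context.
small = make_state(10)
large = make_state(1000)
assert serialized_size(large) > serialized_size(small)
```

In this sketch the number of events is the knob; a benchmark harness could tune it until the serialized output hits a target token budget.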

Citation

Use the following citation when referencing LOCA-bench in your research:

@article{zeng2026loca,
  title={LOCA-bench: Benchmarking Language Agents Under Controllable and Extreme Context Growth},
  author={Zeng, Weihao and Huang, Yuzhen and He, Junxian},
  journal={arXiv preprint arXiv:2602.07962},
  year={2026}
}