OpenCastor Agent Harness Evaluator Leaderboard

I've been building OpenCastor, a runtime layer that sits between a robot's hardware and its AI agent. One thing that surprised me: the order you arrange the skill pipeline (context builder → model router → error handler, etc.) and parameters like thinking_budget and context_budget affect task success rates as much as model choice does.

So I built a distributed evaluator. Robots contribute idle compute to benchmark harness configurations against OHB-1, a small benchmark of 30 real-world robot tasks (grip, navigate, respond, etc.), using local LLM calls via Ollama. The search space is 263,424 configs (8 dimensions: model routing, context budget, retry logic, drift detection, etc.). The demo leaderboard shows results so far, broken down by hardware tier (Pi5+Hailo, Jetson, server, budget boards). The current champion config is free to download as a YAML file and apply to any robot. P66 safety parameters are stripped on apply: no harness config can touch motor limits or ESTOP logic.

Looking for feedback on:

  • whether the benchmark tasks are representative,
  • whether the hardware tier breakdown is useful, and
  • anyone who's run fleet-wide distributed evals of agent configs, for robotics or otherwise.
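For intuition on where a number like 263,424 comes from: the search space is just the product of the per-dimension option counts. A minimal sketch, assuming hypothetical dimension names and sizes (the real OpenCastor dimensions and counts may differ) chosen so the product matches the stated total:

```python
import itertools
import math

# Hypothetical option counts per harness dimension -- illustrative only;
# chosen so the product equals the stated 263,424 search space.
dimensions = {
    "model_routing": 4,
    "context_budget": 4,
    "thinking_budget": 4,
    "retry_logic": 4,
    "drift_detection": 7,
    "error_handler": 7,
    "context_builder_order": 7,
    "skill_pipeline_variant": 3,
}

total = math.prod(dimensions.values())
print(total)  # 263424

# Enumerate concrete configs lazily: one tuple of option indices per config.
configs = itertools.product(*(range(n) for n in dimensions.values()))
first = next(configs)
print(first)  # (0, 0, 0, 0, 0, 0, 0, 0)
```

The lazy `itertools.product` matters here: a distributed evaluator hands out configs incrementally rather than materializing a quarter-million dicts up front.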

  • AI Agents
  • Data Analysis
  • Workflow Automation
Mar 23, 2026

AI Summary

OpenCastor Agent Harness Evaluator Leaderboard is a distributed system that benchmarks AI agent configurations for robots using idle compute. It evaluates over 263,000 configurations across 8 dimensions against a set of 30 real-world robot tasks.

Best for

Robotics engineers, AI researchers, MLOps engineers

Why it matters

Optimizes AI agent performance on robots by systematically evaluating harness configurations against real-world tasks across diverse hardware tiers.

Key features

  • Distributed evaluator for robot AI agent configurations.
  • Benchmarks harness configurations against real-world robot tasks (OHB-1 benchmark).
  • Evaluates over 263,000 configurations across 8 dimensions.
  • Supports local LLM calls via Ollama.
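A sketch of what a local evaluation call via Ollama might look like. The endpoint and payload shape follow Ollama's standard `/api/generate` REST API; the model name and prompt are placeholders, and how OpenCastor actually wraps these calls is not specified in the post:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def build_request(model: str, prompt: str) -> dict:
    # stream=False makes Ollama return one JSON object instead of a stream.
    return {"model": model, "prompt": prompt, "stream": False}

def run_task_step(model: str, prompt: str) -> str:
    payload = json.dumps(build_request(model, prompt)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Usage (requires a local Ollama daemon with the model pulled):
# print(run_task_step("llama3.2", "Plan a grasp for a 5 cm cube on the table."))
```

Keeping inference local is what lets robots contribute idle compute without shipping task data off-device.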

Use cases

  • A robotics engineer is developing a new autonomous navigation system for a warehouse robot. They can use the OpenCastor Evaluator to test various pipeline configurations and parameter settings (like context budget and retry logic) to find the optimal setup for reliable pathfinding, even with limited onboard processing power like a Jetson.
  • A researcher in AI for robotics wants to validate the effectiveness of their agent's grasping skills across different hardware platforms. By leveraging the distributed evaluation capabilities, they can benchmark their agent's performance on tasks like picking up objects using configurations optimized for hardware tiers ranging from budget boards to more powerful server-grade systems.
  • A hobbyist building a custom robot for home automation needs to integrate an LLM for natural language interaction. They can use the OpenCastor Leaderboard to discover and download a pre-optimized YAML configuration that balances performance and resource usage for their chosen hardware, ensuring smooth operation of tasks like responding to voice commands.
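The "safety parameters are stripped on apply" guarantee can be sketched as a filter over the downloaded config. The key names below (`p66`, `motor_limits`, `estop`) are hypothetical; the post only says harness configs cannot touch motor limits or ESTOP logic:

```python
# Hypothetical protected keys -- the real P66 schema is not described in the post.
PROTECTED_KEYS = {"p66", "motor_limits", "estop"}

def strip_safety_params(config: dict) -> dict:
    """Return a copy of a downloaded harness config with any
    safety-related keys removed, recursively."""
    return {
        k: strip_safety_params(v) if isinstance(v, dict) else v
        for k, v in config.items()
        if k not in PROTECTED_KEYS
    }

champion = {
    "model_routing": "local-first",
    "context_budget": 4096,
    "p66": {"motor_limits": [1.0, 2.0]},   # must never be applied from a download
    "retry_logic": {"max_retries": 3},
}
print(strip_safety_params(champion))
# {'model_routing': 'local-first', 'context_budget': 4096, 'retry_logic': {'max_retries': 3}}
```

Stripping at apply time, rather than trusting the leaderboard to publish clean YAML, keeps the safety boundary on the robot itself.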