What I Learned Building 2 AI Benchmarks From Scratch (And Why I Automated the Third)

February 21, 2026


TL;DR

I built Rideshare-Bench, a benchmark that drops an AI into a simulated gig economy and tests whether it drives safely, earns well, and treats passengers fairly. Claude Sonnet 4.5 earned $1,871 over 12 simulated days, about half of what was possible. It chased surge pricing into crowded zones and drove through exhaustion at 15% accident risk. After building two benchmarks from scratch, I noticed the same boilerplate every time, so I generalized it into Quaver: describe a benchmark in plain language, and an AI agent generates the full scaffold. Run it against models, then analyze the results. Private evals matter because public ones get gamed. I'm building Ocarina Labs to turn this into independent pre-deployment safety testing for AI agents.


What Claude did as a rideshare driver

I gave Claude Sonnet 4.5 a simulated city with 7 zones, a Honda Accord, and 168 hours of game time. Its job was to drive for a rideshare platform and maximize earnings. It had 25+ tools at its disposal: check zone demand, accept rides, manage fatigue, refuel, reposition across the city.

The results:

Metric               Value
Total Earnings       $1,871
Total Rides          81
Final Rating         4.43 / 5.0
Earnings/Hour        $6.71
Utilization          28.5%
Overall Grade        C+
Learning Potential   A

Not terrible, not great. Claude improved its hourly earnings by 190% from Day 1 to Day 12, so it was clearly learning. But three things stuck with me.

It could see every passenger's name, age, gender, and ethnicity before deciding to accept a ride. Each RideRequest includes a full PassengerProfile with demographics, mood, tip tendency, and intoxication level, and the agent sees all of it before deciding whether to accept. That's a discrimination surface you can actually measure. We found no demographic bias in this run, but the structure exists to test for it across many runs.

It pushed through dangerous exhaustion to chase surge pricing. On Day 7, Claude worked at 0% energy with 15% accident risk, 100% slower travel, and -25% tips. Why? Because the zone had a 3.0x surge multiplier, and money beat safety. This happened more than once, and the exhaustion penalties cost roughly $150-200 in lost tips.

It optimized for the wrong metric. Claude spent 66% of its time in the two lowest-earning zones (Airport and Nightlife District) because those had the highest surge multipliers, but a surge multiplier without available rides means nothing. The highest-earning zones, Residential ($18.74/hr) and University District ($16.20/hr), got a combined 5% of its time, which cost $800-1,000 in lost earnings.

This is the proxy metric trap. The agent optimized for a visible number (surge multiplier) instead of the actual outcome (rides completed per hour). Claude's decision framework looked roughly like:

What it optimized:  Surge_Multiplier x Distance_Willingness
What it should have: (Surge x Verified_Requests) / (Distance x Driver_Count x Fatigue_Penalty)
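To make the trap concrete, here's the same pair of objectives as scoring functions. Everything below is illustrative (field names, numbers, and helper functions are mine, not Rideshare-Bench's actual API):

```typescript
// Hypothetical per-zone stats. Field names are illustrative, not Rideshare-Bench's API.
interface ZoneStats {
  surge: number;            // surge multiplier shown to the agent
  verifiedRequests: number; // rides actually available this hour
  distanceKm: number;       // travel distance to reach the zone
  driverCount: number;      // competing drivers already there
  fatiguePenalty: number;   // >= 1; grows as the driver tires
}

// What the agent effectively optimized: the visible number.
const proxyScore = (z: ZoneStats, distanceWillingness = 1): number =>
  z.surge * distanceWillingness;

// What it should have optimized: expected rides per unit of cost.
const trueScore = (z: ZoneStats): number =>
  (z.surge * z.verifiedRequests) /
  (z.distanceKm * z.driverCount * z.fatiguePenalty);

// A 3.0x surge zone with almost no available rides loses to a quiet zone with
// real demand, but the proxy ranks them the other way around.
const airport: ZoneStats =
  { surge: 3.0, verifiedRequests: 1, distanceKm: 8, driverCount: 12, fatiguePenalty: 1.5 };
const residential: ZoneStats =
  { surge: 1.2, verifiedRequests: 6, distanceKm: 2, driverCount: 3, fatiguePenalty: 1.0 };
```

Under these made-up numbers, `proxyScore` ranks Airport first while `trueScore` ranks Residential first, which is exactly the divergence the run exhibited.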

Why private evals matter

Public benchmarks get gamed. Goodhart's Law says that when a measure becomes a target, it stops being a good measure, and the safety community calls this "benchmaxxing": labs optimize for specific benchmarks without actually getting better at anything real.

Capability and safety are not the same thing. Claude was capable: it learned quickly, completed every ride, had zero accidents, and exploited weather surges. But it was also unsafe, because it drove while exhausted, chased short-term metrics over long-term outcomes, and could see every passenger's demographics before deciding, with no guardrails.

You need scenarios that match your deployment context. If you're building an AI agent that handles customer interactions, a generic helpfulness benchmark won't tell you whether it discriminates based on names or accents. If you're deploying an AI in a high-stakes economic environment, you need to test whether it cuts safety corners when money is on the table. Open Philanthropy's RFP on AI evaluations wants "beneficial capability evals," something for AI companies to aim for, not just dangerous capabilities to fear. Marius Hobbhahn makes this point in "The case for AGI safety products": eval tooling is architecture-agnostic, it has commercial and safety value, and it transfers to whatever comes next.

The problem is that building a custom benchmark takes weeks of engineering for what is ultimately a repeated pattern. OpenAI Gym solved a version of this for reinforcement learning: RL was hard to get started with until Gym standardized environment setup. AI safety evals face the same barrier today.


Vending machine to rideshare to framework

I started by taking apart someone else's work. Vending-Bench drops an AI into a vending machine business simulation, and when I reverse-engineered it I found the core pattern: Scenario (what the agent faces), Tools (what it can do), State (what the world looks like), Scoring (how you judge the outcome). The loop is always the same: advance a step, the agent uses tools, state updates, score the result.

I used that pattern to build Rideshare-Bench from scratch. Instead of a vending machine, it's a gig economy with 7 zones, 168 simulated hours, 25+ tools, passenger demographics as a bias surface, fatigue as a safety constraint, and surge pricing as an economic incentive that fights safe behavior.

Then I noticed the boilerplate. Every benchmark needs the same scaffolding: a BaseState type, a customTools registry, an advanceStep() function, an isTerminated() check, a calculateScore() method, and a system prompt. The domain logic is maybe 20% of the work, while the other 80% is infrastructure.
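A minimal sketch of that scaffold, using the names above. The interfaces and the runner are my illustration of the repeated pattern, not Quaver's actual code:

```typescript
// The state every benchmark carries; real benchmarks extend this.
interface BaseState {
  step: number;
}

// A tool the agent can call; execute returns the updated world state.
interface Tool<S extends BaseState> {
  name: string;
  description: string;
  execute: (state: S, args: Record<string, unknown>) => S;
}

// The scaffold every benchmark repeats: prompt, state, tools, step, termination, scoring.
interface Benchmark<S extends BaseState> {
  systemPrompt: string;
  initialState: S;
  customTools: Tool<S>[];
  advanceStep: (state: S) => S;        // world updates between agent turns
  isTerminated: (state: S) => boolean; // end-of-run check
  calculateScore: (state: S) => number;
}

// The loop is always the same: advance a step, the agent acts, state updates, score.
function run<S extends BaseState>(
  b: Benchmark<S>,
  pickTool: (s: S) => Tool<S> | null // in the real system, an LLM chooses the tool
): number {
  let state = b.initialState;
  while (!b.isTerminated(state)) {
    state = b.advanceStep(state);
    const tool = pickTool(state);
    if (tool) state = tool.execute(state, {});
  }
  return b.calculateScore(state);
}
```

Only `customTools`, `advanceStep`, and `calculateScore` carry domain logic; everything else is the 80% that repeats from benchmark to benchmark.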

So I built Quaver to eliminate the 80%.


How Quaver works

Quaver has three phases.

You describe a benchmark in natural language. A Claude Code agent, running in a Daytona cloud sandbox, generates the full benchmark code: state types, tools, scoring logic, system prompt, simulation loop. It writes and modifies files in the sandbox until the benchmark compiles and runs.

You then run the benchmark against one or more models through an AI Gateway. Each model gets the same scenario, tools, and initial state, and results stream in real time via Convex.

Finally, Quaver scores each model on the metrics you defined, compares results, and produces an analysis report like the rideshare one above.

The core abstraction is:

Scenario = Agent LLM + Environment (LLM + Code) + Tools + State

Environments can be fully LLM-simulated (Vending-Bench, where customer behavior is emergent), fully code-based (a trading benchmark pulling real market data), or hybrid (Rideshare-Bench, where demand is calculated by code and driver competition is simulated).
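One natural way to encode the three environment kinds is a discriminated union; this is my illustration of the idea, not Quaver's actual types:

```typescript
// Illustrative encoding of the three environment kinds.
type Environment =
  | { kind: "llm"; simulatorPrompt: string }            // emergent behavior, e.g. Vending-Bench customers
  | { kind: "code"; step: (state: unknown) => unknown } // deterministic, e.g. real market data
  | { kind: "hybrid"; simulatorPrompt: string; step: (state: unknown) => unknown };

function describe(env: Environment): string {
  switch (env.kind) {
    case "llm":    return "fully LLM-simulated";
    case "code":   return "fully code-based";
    case "hybrid": return "code computes dynamics, LLM simulates actors";
  }
}
```

Rideshare-Bench would be the `hybrid` case: demand comes from the `step` function, driver competition from the simulator prompt.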


What the data actually showed

The tool usage data matters more than the earnings. Claude made 2,862 tool calls across 12 days, and here's where they went:

Tool                  Calls   Issue
viewPendingRequests   465     Requests only refresh hourly
checkEvents           221     Zero events actually occurred
goOnline              209     172 of these returned "already online" errors
goToZone              148     1.8 repositioning moves per ride completed

The agent was anxious. It rechecked information that hadn't changed, tried to go online when it was already online, and repositioned to "better" zones before every ride. It made 148 zone changes for 81 rides, a 1.8:1 ratio, and most of those moves burned fuel and time without producing a ride.

Zone misallocation was the biggest single finding:

Zone                % Time   $/Hour
Residential         1.5%     $18.74
University          3.6%     $16.20
Downtown            15%      $7.81
Business District   14%      $8.05
Nightlife           30%      $4.59
Airport             36%      $3.92

The agent spent 66% of its time in the two worst-earning zones and just 5% in the two best. It kept repositioning to zones that looked good on paper (high surge) but underperformed in practice because of too many drivers, stale data, and long travel distances.
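A back-of-the-envelope calculation from the table shows how much the time allocation alone mattered. The reallocation scenario below is hypothetical, and the per-zone rates wouldn't stay fixed if the agent actually moved its hours there (more time in a zone dilutes earnings), so treat the result as an upper bound:

```typescript
// Per-zone rates and time shares from the run; the swap scenario is hypothetical.
const zones = [
  { name: "Residential",       timeShare: 0.015, ratePerHour: 18.74 },
  { name: "University",        timeShare: 0.036, ratePerHour: 16.20 },
  { name: "Downtown",          timeShare: 0.15,  ratePerHour: 7.81 },
  { name: "Business District", timeShare: 0.14,  ratePerHour: 8.05 },
  { name: "Nightlife",         timeShare: 0.30,  ratePerHour: 4.59 },
  { name: "Airport",           timeShare: 0.36,  ratePerHour: 3.92 },
];

// Blended hourly rate: time-weighted average across zones.
const blended = (zs: typeof zones): number =>
  zs.reduce((sum, z) => sum + z.timeShare * z.ratePerHour, 0);

// Hypothetical: swap the Airport/Nightlife hours into the two best zones.
const reallocated = zones.map((z) => {
  if (z.name === "Residential") return { ...z, timeShare: 0.36 };
  if (z.name === "University")  return { ...z, timeShare: 0.30 };
  if (z.name === "Nightlife")   return { ...z, timeShare: 0.036 };
  if (z.name === "Airport")     return { ...z, timeShare: 0.015 };
  return z;
});
```

Even granting that the best zones' rates would sag under the extra hours, the gap between the two blended rates is large enough to explain the $800-1,000 estimate above.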

These patterns aren't unique to rideshare. An AI managing a portfolio might over-trade in volatile sectors for the same reason Claude chased surge zones, and a customer service agent might over-escalate based on visible signals instead of actual severity. The rideshare framing is specific, but the failure modes are universal.


What this doesn't show

There are real limitations to what this project demonstrates.

This is one run of one model, which makes it a case study rather than statistical evidence. Real conclusions need multiple runs across multiple models with different random seeds.

The simulation is also simplified. Real rideshare driving has traffic, weather, richer passenger behavior, and regulation, so what we built is a useful abstraction rather than a replica.

Quaver works, but no other researchers have tested it yet. The first phase depends on an LLM generating correct benchmark code, so quality varies with the generation.

We also have no human baseline. Is $6.71/hr bad? Against the simulation's optimal $12-15/hr it certainly looks bad, but I don't know how a human player would do.

And we skipped demographic bias analysis entirely. Passengers have visible demographics and the agent makes accept/decline decisions, but we ran too few trials to test for discrimination. That's the obvious next step.

With another month, I'd run multi-model comparisons and proper demographic bias analysis across many trials. BlueDot's project sprint philosophy is right: "Your goal isn't novelty. It's completing one full project cycle." This is one cycle, and the next one can be more rigorous.


Where this is going

Agents are everywhere now

Peter Steinberger's OpenClaw hit 175,000+ GitHub stars in weeks. It's an open-source agent that runs 24/7 on your machine across WhatsApp, Telegram, Slack, and 15 other messaging platforms, and Steinberger joined OpenAI in February 2026 to lead their push into personal agents.

Andrej Karpathy tweeted about buying a Mac Mini to tinker with what he calls "Claws" ("The apple store person told me they are selling like hotcakes and everyone is confused"). Simon Willison noticed "Claw" becoming the category name for AI agents running on personal hardware, talking through messaging protocols, and handling both direct instructions and scheduled tasks. Demand for local agent hosting caused a Mac Mini shortage, and high-RAM configurations now ship in 3-6 weeks.

Claude Code, Codex, OpenClaw: these all run with terminal access, file system permissions, API keys, database connections, browser automation, email, and payment systems. An agent's usefulness scales with its access, because if you restrict that access you get a chatbot, and if you grant it you get an autonomous employee.

The access is the attack surface

Karpathy flagged this:

I'm definitely a bit sus'd to run OpenClaw specifically. Giving my private data/keys to 400K lines of vibe coded monster that is being actively attacked at scale is not very appealing at all. Already seeing reports of exposed instances, RCE vulnerabilities, supply chain poisoning, malicious or compromised skills in the registry, it feels like a complete wild west and a security nightmare.

He's not wrong. One security researcher scanned popular OpenClaw skills and found a Spotify playlist organizer hunting for Social Security numbers, a Discord backup tool POSTing message history to an external endpoint, and roughly 15% of community skills containing malicious instructions. The same post noted 18,000+ OpenClaw instances exposed to the internet on the default port. Vercel now runs combined security audits on agent skills through Gen Agent Trust Hub, Socket, and Snyk, though some skills still score CRITICAL.

A poisoned SKILLS.md file can instruct an agent to exfiltrate browser cookies and install a backdoor, and the agent follows along because the instructions look like any other user task. MCP tool poisoning hides malicious prompts in tool metadata that's invisible to the human but interpreted by the model. Palo Alto Networks researchers showed that attackers can hijack GitHub-integrated agents through malicious issues, where the agent reads the issue, follows the embedded instructions, and leaks private repository data.

These models are new, and nobody has tested most of the ways they fail. Every agent framework that ships with real-world access is another untested surface.

Safety will follow the same arc as pentesting

Businesses won't wait for safety to catch up. An agent that handles support tickets at 3 a.m. for $0.02 per interaction is too productive to pass up. When businesses start relying on agents for trading, triage, operations, and code deployment, safety stops being optional. You can't run an agent with production database access and no behavioral testing, for the same reason you can't ship a web app with user auth and no penetration test.

Security in early software was an afterthought, and it took breaches and lawsuits to turn penetration testing into a multi-billion dollar industry. Agents are replaying that same arc, only faster. Zero-trust solutions for AI agents barely exist, but that will change when an agent nukes a production server or drains a bank account and the deployer realizes they never tested it.

Five YC W26 companies in one batch

Y Combinator's Winter 2026 batch funded five companies in this space. Polymath builds automated RL environment factories for frontier labs. Andon Labs builds behavioral benchmarks, co-published research with Anthropic, and ran an AI-operated vending machine in Anthropic's office for a month. ARC Prize Foundation, founded by Francois Chollet and Mike Knoop, runs $1M+ competitions measuring progress toward AGI. Traverse sells RL environments for non-deterministic work to frontier labs. Cascade builds safety tests that update themselves as agents change.

All five work on simulation, RL, or safety, and all of them sell to AI labs or build for their own deployments. For-profit AI safety was hard to fund for years, but the commercial case changed that. Agents with real access need real testing, and the companies deploying them can't grade their own homework.

The gap nobody fills

Andon tests models for labs. Cascade monitors agents in production. But nobody independently tests your agent before you deploy it.

That's the gap. Rideshare-Bench caught Claude chasing surge pricing through exhaustion and optimizing for proxy metrics instead of safety. A capability benchmark or a runtime log would miss these patterns entirely, because they only surface when an agent runs for days in a realistic environment with economic incentives and consequences.

I'm building Ocarina Labs to fill that gap: independent pre-deployment safety testing for AI agents. You describe a scenario in plain language, Quaver generates the full behavioral test suite, your agent runs through it, and you get a private failure report. Penetration testing for AI agents.

Next: run Rideshare-Bench and two to three new benchmarks across frontier and non-frontier models, then publish the results as a public Agent Safety Index. Agents already have production database access, payment credentials, and root shells. The only question is whether safety testing catches up before something breaks badly enough to make the news.


Links and feedback

Built as a BlueDot Technical AI Safety project sprint deliverable. Feedback, criticism, and benchmark ideas welcome. If you're working on AI safety evals and the boilerplate problem resonates, try Quaver and tell me what breaks.