Model result · rank #7
Claude Sonnet 5
OpenRouter · extra-high reasoning · Full + SWE + Hard measured. Public result card with the model’s overall score, lane measurements, runtime/cost telemetry, and ranking formula.
Full rank #12
SWE rank #10
Hard rank #3
100.0% reliability
All-around publication view
The overall score averages the measured major lanes while keeping each source measurement visible.
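As a rough illustration of that averaging, the sketch below assumes an unweighted mean over whichever lanes were measured; the card does not state weights, so treat that as an assumption. The function name `overall_score`, the lane keys, and the scores are hypothetical placeholders, not published values.

```python
from statistics import mean
from typing import Optional


def overall_score(lanes: dict[str, Optional[float]]) -> dict:
    """Average the measured lanes and keep every lane value visible.

    Assumption: equal weighting across measured lanes. A lane that was
    not measured is passed as None and is skipped in the average.
    """
    measured = [score for score in lanes.values() if score is not None]
    if not measured:
        raise ValueError("no measured lanes to average")
    return {
        "overall": round(mean(measured), 2),
        "lanes": dict(lanes),  # source measurements stay alongside the average
    }


# Placeholder scores for illustration only, not the published lane results.
example = overall_score({"full_agentic": 80.0, "swe": 70.0, "hard": 90.0})
print(example["overall"])  # 80.0
```

Returning the lane values next to the average mirrors the card's layout, where the single overall number never hides the measurements behind it.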
Full / Agentic benchmark
This lane captures instruction following, structured behavior, tool discipline, and general agentic reliability.
Software engineering MVP
This lane is closer to implementation usefulness: source handling, architecture cleanliness, and deliverable quality.
Hard Intelligence diagnostic
Hard Intelligence measures active inquiry, online adaptation, evidence-driven self-repair, and authority/salience integrity.
Runtime economics
Cost, time, and token basis are normalized telemetry. They explain tradeoffs; they do not overwrite the capability score yet.
Why this result lands here.
The result is best read as a balanced benchmark entry: one overall score plus the lane measurements that produced it. The Hard Intelligence score is 86.53, and it contributes to the overall score alongside the Full/Agentic and SWE lanes. All three lanes (Full/Agentic, SWE MVP, and Hard Intelligence) were measured through OpenRouter with extra-high reasoning. Hard Intelligence is a public diagnostic, not a hidden official score.