Model result · rank #1
DeepSeek V4 Flash
DeepSeek direct API. Public result card with the model’s overall score, lane measurements, runtime/cost telemetry, and ranking formula.
Full rank #1
SWE rank #3
100.0% reliability
All-around publication view
The overall score is calculated from the Full/Agentic and SWE lanes, keeping the aggregate comparable while preserving the measurements behind it.
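The ranking formula itself is not reproduced in this text, so the following is only a minimal sketch of how a two-lane aggregate of this kind could work. The equal weighting, the 0-100 score scale, and the names `LaneResult` and `overall_score` are all assumptions made for illustration, not the leaderboard's actual method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LaneResult:
    """Hypothetical container for one model's measurements."""
    full_agentic: float  # Full/Agentic lane score (assumed 0-100 scale)
    swe: float           # SWE lane score (assumed 0-100 scale)
    cost_usd: float      # telemetry only, never used in scoring
    runtime_s: float     # telemetry only, never used in scoring


def overall_score(result: LaneResult, full_weight: float = 0.5) -> float:
    """Weighted mean of the two lanes.

    full_weight is an assumed parameter; the real weighting is not
    given in the source.
    """
    if not 0.0 <= full_weight <= 1.0:
        raise ValueError("full_weight must be between 0 and 1")
    return full_weight * result.full_agentic + (1.0 - full_weight) * result.swe
```

Because the aggregate reads only the two lane fields, the per-lane measurements stay available alongside it, and cost or runtime values cannot change the score in this sketch.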
Full / Agentic benchmark
This lane captures instruction following, structured behavior, tool discipline, and general agentic reliability.
Software engineering MVP
This lane measures practical implementation quality: source handling, architecture cleanliness, and deliverable quality.
Runtime economics
Cost, time, and runtime basis are reported as telemetry. They explain tradeoffs but do not alter the capability scores.
Why this result lands here.
Rank #1 is the current all-around reference point: strong Full/Agentic performance, competitive SWE delivery, and transparent runtime telemetry.