Model result · rank #1

DeepSeek V4 Flash

DeepSeek direct API. Public result card with the model’s overall score, lane measurements, runtime/cost telemetry, and ranking formula.

Overall score: 91.38

Rank #1

Full / Agentic: 95.80

Full rank #1

SWE MVP: 86.95

SWE rank #3

Measured cost: $0.086

100.0% reliability

Overall

All-around publication view

Score: 91.38
Formula: 50% Full + 50% SWE
Basis: DeepSeek direct API

The overall score is an equal-weight average of the Full/Agentic final score (95.80) and the SWE score (86.95). This keeps the aggregate comparable across models while the lane measurements behind it stay visible.
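As a rough check, the stated formula can be reproduced from the two lane scores on this card. The sketch below is illustrative only: the function name `overall_score` and its weight parameters are assumptions, not part of the ranking site's code, and the rounding rule used for the published 91.38 is not stated on the card.

```python
def overall_score(full: float, swe: float,
                  w_full: float = 0.5, w_swe: float = 0.5) -> float:
    """Weighted average of the two lane scores (hypothetical helper)."""
    # The card states a 50/50 split; other weights are allowed only if they sum to 1.
    if abs(w_full + w_swe - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return w_full * full + w_swe * swe


# Lane scores taken from this card.
score = overall_score(full=95.80, swe=86.95)
print(f"unrounded overall score: {score:.3f}")
```

The unrounded value is 91.375, which agrees with the published 91.38 to two decimal places if the site rounds half up.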

Lane 01

Full / Agentic benchmark

Final: 95.80
Capability: 96.80
Agentic: 96.32
Pass rate: 97.7%
Prompts: 43

This lane captures instruction following, structured behavior, tool discipline, and general agentic reliability.

Lane 02

Software engineering MVP

SWE score: 86.95
Focused final: 86.59
Capability: 84.53
Daily driver: 87.04
Prompts: 24

This lane is closer to implementation usefulness: source handling, architecture cleanliness, and deliverable quality.

Telemetry

Runtime economics

Full cost: $0.079
SWE cost: $0.0071
Full avg seconds: 4.60
SWE time: 198.43s
Decode: 77.09

Cost, time, and runtime basis are telemetry. They explain tradeoffs; they do not secretly overwrite the capability scores.
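The headline measured cost of $0.086 appears to be the sum of the two lane costs. That is an inference from the numbers on this card, not something the card states. A minimal sketch of the arithmetic, with hypothetical variable names:

```python
# Lane costs in USD, taken from the telemetry on this card.
full_cost = 0.079
swe_cost = 0.0071

# Assumption: the headline "Measured cost" is the plain sum of the lane costs.
total_cost = full_cost + swe_cost
print(f"total measured cost: ${total_cost:.4f}")
```

The sum is $0.0861, which matches the headline $0.086 at three decimal places.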

Interpretation

Why this result lands here.

Rank #1 is the current all-around reference point: strong Full/Agentic performance, competitive SWE delivery, and transparent runtime telemetry.