Model result · rank #11
Step 3.7 Flash
OpenRouter · stepfun/step-3.7-flash · extra-high reasoning · Full + SWE + Hard measured. Public result card with the model’s overall score, lane measurements, runtime/cost telemetry, and ranking formula.
Full rank #3
SWE rank #7
Hard rank #11
100.0% reliability
All-around publication view
The overall score averages the measured major lanes while keeping each source measurement visible.
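As a rough illustration of the averaging described above, the sketch below takes an unweighted mean over whichever lanes have been measured. The card does not publish the exact formula here, so the equal weighting, the function name `overall_score`, and the Full and SWE values are all assumptions for illustration. Only the Hard Intelligence value (33.99) comes from this card.

```python
from statistics import mean
from typing import Mapping, Optional


def overall_score(lanes: Mapping[str, Optional[float]]) -> float:
    """Average the measured lanes; lanes set to None are treated as unmeasured.

    Assumption: an unweighted mean. The real ranking formula may weight lanes.
    """
    measured = [score for score in lanes.values() if score is not None]
    if not measured:
        raise ValueError("at least one lane must be measured")
    return mean(measured)


if __name__ == "__main__":
    lanes = {
        "full": 80.0,   # hypothetical placeholder, not a published value
        "swe": 60.0,    # hypothetical placeholder, not a published value
        "hard": 33.99,  # Hard Intelligence score stated on this card
    }
    # Each source measurement stays visible next to the combined figure.
    for name, score in lanes.items():
        print(f"{name}: {score}")
    print(f"overall: {overall_score(lanes):.2f}")
```

Keeping the per-lane values in a mapping, rather than storing only the combined number, mirrors the card's intent that each source measurement remains visible.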
Full / Agentic benchmark
This lane captures instruction following, structured behavior, tool discipline, and general agentic reliability.
Software engineering MVP
This lane is closer to practical implementation work: source handling, architecture cleanliness, and deliverable quality.
Hard Intelligence diagnostic
Hard Intelligence measures active inquiry, online adaptation, evidence-driven self-repair, and authority/salience integrity.
Runtime economics
Cost, time, and runtime basis are telemetry. They explain tradeoffs; they do not secretly overwrite the capability scores.
Why this result lands here.
The model is stronger in the Full/Agentic lane (rank #3) than in the SWE lane (rank #7), so the overall score is shown with both component lanes visible. The Hard Intelligence score is 33.99; it is published alongside Full/Agentic and SWE and contributes to the overall score for the current ranking.