
Unified ranking · lane-aware explanation

Why the ranking looks like this.

The public ranking is not a single vibe score. It orders measured entrants by a transparent overall formula while keeping Full / Agentic, SWE MVP, Hard Intelligence, cost, and reliability visible.

Ranked entrants: 12 (11 with Hard Intelligence data)

Current leader: GPT‑5.5 (overall 88.22)

Score spread: 20.65 (#1 to #12)

Formula: lane mean (Full + SWE + published Hard Intelligence when measured)

Data refresh: Jun 13, 2026 (static HTML plus public JSON)

Chart

Overall ladder

Every ranked entrant ordered by public overall score.

GPT‑5.5 88.22 rank #1
DeepSeek V4 Flash 86.76 rank #2
Claude Opus 4.8 84.45 rank #3
Gemini 3.5 Flash 83.38 rank #4
DeepSeek V4 Pro 83.36 rank #5
Qwen3.7 Max 80.71 rank #6
MiniMax M3 79.66 rank #7
Claude Fable 5 79.22 rank #8
MiniMax M3 Direct Plus 74.63 rank #9
Kimi K2.7 Code 71.01 rank #10
Step 3.7 Flash 68.06 rank #11
NVIDIA Nemotron 3 Ultra 67.58 rank #12

Overall is a lane mean, not a hidden replacement for source measurements.
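
The lane mean can be sketched in a few lines of Python. This is a minimal illustration, not the site's published code: the function name `overall_score` and the use of `None` for an unmeasured Hard Intelligence lane are assumptions of this sketch.

```python
from statistics import fmean
from typing import Optional


def overall_score(full: float, swe: float, hard: Optional[float] = None) -> float:
    """Mean of the measured lanes; Hard Intelligence is skipped when it is None."""
    lanes = [full, swe] + ([hard] if hard is not None else [])
    return fmean(lanes)


# Lane scores copied from the public table.
gpt_55 = overall_score(85.76, 85.51, 93.40)  # three measured lanes
fable_5 = overall_score(77.02, 81.42)        # Hard Intelligence not yet measured

print(f"{gpt_55:.2f}")   # 88.22
print(f"{fable_5:.2f}")  # 79.22
```

If the published lane values are rounded for display, recomputing from them may differ from the published overall in the last digit.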

Chart

Tradeoff scatter maps

Each point is one tested model at the intersection of two public telemetry axes.

Cost × overall
Measured cost on the X-axis ($0 to $12.76), overall score on the Y-axis (65.9 to 89.9). Per-model values appear under Overall ladder and Measured cost context.

Runtime × overall
Avg seconds per item on the X-axis (0 s to 76.3 s), overall score on the Y-axis (65.9 to 89.9).

GPT‑5.5 15.4s rank #1
DeepSeek V4 Flash 1.29s rank #2
Claude Opus 4.8 3.75s rank #3
Gemini 3.5 Flash 9.13s rank #4
DeepSeek V4 Pro 23.2s rank #5
Qwen3.7 Max 13.4s rank #6
MiniMax M3 19.6s rank #7
Claude Fable 5 12.3s rank #8
MiniMax M3 Direct Plus 13.7s rank #9
Kimi K2.7 Code 28.6s rank #10
Step 3.7 Flash 27.8s rank #11
NVIDIA Nemotron 3 Ultra 70.7s rank #12

Tokens × cost
Public token volume on the X-axis (0 to 1.2M), measured cost on the Y-axis ($0 to $12.76).

GPT‑5.5 1.5k rank #1
DeepSeek V4 Flash 540.8k rank #2
Claude Opus 4.8 10.1k rank #3
Gemini 3.5 Flash 32.8k rank #4
DeepSeek V4 Pro 124.0k rank #5
Qwen3.7 Max 29.4k rank #6
MiniMax M3 25.1k rank #7
Claude Fable 5 1.1M rank #8
MiniMax M3 Direct Plus 84.4k rank #9
Kimi K2.7 Code 89.8k rank #10
Step 3.7 Flash 103.0k rank #11
NVIDIA Nemotron 3 Ultra 38.8k rank #12

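Use these maps to read quality against cost, speed, and token volume. On the two maps with overall score on the Y-axis, the upper-left region (low cost or runtime, high score) is usually the best tradeoff; on Tokens × cost, the lower-left region (few tokens, low cost) is the most efficient.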
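
One way to make "upper-left" concrete on the Cost × overall map is a Pareto front: the models that no other model beats on both cost and score. That framing is an assumption of this sketch, not something the page computes; the names `Point` and `pareto_front` are illustrative, and the values are copied from the public cost and overall figures.

```python
from typing import NamedTuple


class Point(NamedTuple):
    model: str
    cost: float     # measured cost in dollars (X-axis)
    overall: float  # overall score (Y-axis)


POINTS = [
    Point("GPT‑5.5", 4.071, 88.22),
    Point("DeepSeek V4 Flash", 0.225, 86.76),
    Point("Claude Opus 4.8", 6.115, 84.45),
    Point("Gemini 3.5 Flash", 1.646, 83.38),
    Point("DeepSeek V4 Pro", 0.335, 83.36),
    Point("Qwen3.7 Max", 0.906, 80.71),
    Point("MiniMax M3", 0.182, 79.66),
    Point("Claude Fable 5", 11.815, 79.22),
    Point("MiniMax M3 Direct Plus", 0.0041, 74.63),
    Point("Kimi K2.7 Code", 0.732, 71.01),
    Point("Step 3.7 Flash", 0.494, 68.06),
    Point("NVIDIA Nemotron 3 Ultra", 0.489, 67.58),
]


def pareto_front(points: list[Point]) -> list[Point]:
    """Keep points that no cheaper-or-equal point matches or beats on score."""
    front = []
    best_score = float("-inf")
    # Walk from cheapest to most expensive; on equal cost, higher score first.
    for p in sorted(points, key=lambda p: (p.cost, -p.overall)):
        if p.overall > best_score:
            front.append(p)
            best_score = p.overall
    return front


print([p.model for p in pareto_front(POINTS)])
# ['MiniMax M3 Direct Plus', 'MiniMax M3', 'DeepSeek V4 Flash', 'GPT‑5.5']
```

Every model off this list has at least one cheaper entrant with a higher overall score. That is a reading aid only; it does not change the ranking.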

Chart

Lane contrast

Top eight entrants with Full, SWE, and Hard Intelligence shown side by side.

#1 GPT‑5.5
Full 85.76 SWE 85.51 Hard 93.40
#2 DeepSeek V4 Flash
Full 95.80 SWE 86.95 Hard 77.53
#3 Claude Opus 4.8
Full 83.32 SWE 88.67 Hard 81.37
#4 Gemini 3.5 Flash
Full 88.79 SWE 73.59 Hard 87.75
#5 DeepSeek V4 Pro
Full 92.04 SWE 79.68 Hard 78.38
#6 Qwen3.7 Max
Full 83.24 SWE 88.99 Hard 69.90
#7 MiniMax M3
Full 85.33 SWE 86.88 Hard 66.77
#8 Claude Fable 5
Full 77.02 SWE 81.42 Hard (not yet measured)

Hard Intelligence is shown as its own lane so cross-lane strengths and weaknesses stay visible.

Chart

Measured cost context

Cost is shown because deployment economics matter, but it does not secretly rewrite capability scores.

GPT‑5.5 $4.071 rank #1
DeepSeek V4 Flash $0.225 rank #2
Claude Opus 4.8 $6.115 rank #3
Gemini 3.5 Flash $1.646 rank #4
DeepSeek V4 Pro $0.335 rank #5
Qwen3.7 Max $0.906 rank #6
MiniMax M3 $0.182 rank #7
Claude Fable 5 $11.815 rank #8
MiniMax M3 Direct Plus $0.0041 rank #9
Kimi K2.7 Code $0.732 rank #10
Step 3.7 Flash $0.494 rank #11
NVIDIA Nemotron 3 Ultra $0.489 rank #12

Cost does not enter the overall formula, so very expensive rows carry no score penalty; cost is visible telemetry and part of the public interpretation.

Chart

Lane balance pressure

Gap between each entrant’s strongest and weakest measured major lane, largest first.

Step 3.7 Flash 55.80 Hard Intelligence 33.99 vs Full / Agentic 89.79 · rank #11
Kimi K2.7 Code 29.34 SWE MVP 58.61 vs Full / Agentic 87.95 · rank #10
NVIDIA Nemotron 3 Ultra 20.91 SWE MVP 58.63 vs Full / Agentic 79.54 · rank #12
MiniMax M3 20.11 Hard Intelligence 66.77 vs SWE MVP 86.88 · rank #7
Qwen3.7 Max 19.09 Hard Intelligence 69.90 vs SWE MVP 88.99 · rank #6
DeepSeek V4 Flash 18.27 Hard Intelligence 77.53 vs Full / Agentic 95.80 · rank #2
MiniMax M3 Direct Plus 16.55 SWE MVP 67.77 vs Full / Agentic 84.32 · rank #9
Gemini 3.5 Flash 15.20 SWE MVP 73.59 vs Full / Agentic 88.79 · rank #4
DeepSeek V4 Pro 13.67 Hard Intelligence 78.38 vs Full / Agentic 92.04 · rank #5
GPT‑5.5 7.89 SWE MVP 85.51 vs Hard Intelligence 93.40 · rank #1

Lower pressure means a more even profile; higher pressure explains why one strong lane may not lift the overall rank by itself.
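
The pressure figure can be reproduced as the strongest measured lane minus the weakest. The sketch below assumes lanes are held in a plain dictionary with `None` for an unmeasured lane; `lane_pressure` is a hypothetical helper name.

```python
from typing import Optional


def lane_pressure(lanes: dict[str, Optional[float]]) -> tuple[float, str, str]:
    """Return (gap, weakest lane, strongest lane) over the measured lanes only."""
    measured = {name: score for name, score in lanes.items() if score is not None}
    weakest = min(measured, key=measured.get)
    strongest = max(measured, key=measured.get)
    return measured[strongest] - measured[weakest], weakest, strongest


# Step 3.7 Flash lane scores from the public table.
step = {"Full / Agentic": 89.79, "SWE MVP": 80.39, "Hard Intelligence": 33.99}
gap, weakest, strongest = lane_pressure(step)
print(f"{gap:.2f} {weakest} vs {strongest}")
# 55.80 Hard Intelligence vs Full / Agentic
```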

Breadth wins the top spot

GPT‑5.5 leads because its measured lanes stay high together: overall 88.22, Full 85.76, SWE 85.51, and Hard Intelligence 93.40.

Full / Agentic alone does not decide

DeepSeek V4 Flash owns Full rank #1 at 95.80, but the overall formula also averages in SWE (86.95) and Hard Intelligence (77.53), which places it #2 overall at 86.76.

SWE is a separate capability signal

Qwen3.7 Max owns SWE rank #1 at 88.99. That lane rewards practical implementation and review behavior rather than only general prompt competence.

Hard Intelligence reshapes the table

GPT‑5.5 owns Hard Intelligence rank #1 at 93.40. That lane tests active inquiry, adaptation, repair, and authority integrity separately from Full and SWE.

The clearest drag is visible

Step 3.7 Flash has a Full/SWE average near 85.09, but Hard Intelligence is 33.99, so the blended overall lands at 68.06.
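
The arithmetic behind that drag, worked through as a small sketch using the published lane scores (the variable names are illustrative):

```python
from statistics import fmean

full, swe, hard = 89.79, 80.39, 33.99  # Step 3.7 Flash lane scores

two_lane = fmean([full, swe])        # average before Hard Intelligence is added
blended = fmean([full, swe, hard])   # three-lane overall
drag = two_lane - blended            # points lost to the weak lane

print(f"{two_lane:.2f} {blended:.2f} {drag:.2f}")
# 85.09 68.06 17.03
```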

Full ranking table

Table with reasons, not just numbers.

Each row states the score formula, lane ranks, cost context, and the main reason the entrant lands at its current position.

Ranking data

| Rank | Model | Overall | Full | SWE | Hard IQ | Formula | Cost | Why here |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| #1 | GPT‑5.5 | 88.22 | 85.76 #6 | 85.51 #5 | 93.40 #1 | mean(Full, SWE, Hard Intelligence) | $4.071 | Overall 88.22 uses mean(Full, SWE, Hard Intelligence). Strength signal: Hard Intelligence rank #1. Main limiter: SWE MVP at 85.51. Hard Intelligence contributes to the ranking as a separate measured lane. |
| #2 | DeepSeek V4 Flash | 86.76 | 95.80 #1 | 86.95 #3 | 77.53 #5 | mean(Full, SWE, Hard Intelligence) | $0.225 | Overall 86.76 uses mean(Full, SWE, Hard Intelligence). Strength signal: Full rank #1, SWE rank #3. Main limiter: Hard Intelligence at 77.53. Hard Intelligence contributes to the ranking as a separate measured lane. |
| #3 | Claude Opus 4.8 | 84.45 | 83.32 #9 | 88.67 #2 | 81.37 #3 | mean(Full, SWE, Hard Intelligence) | $6.115 | Overall 84.45 uses mean(Full, SWE, Hard Intelligence). Strength signal: SWE rank #2, Hard Intelligence rank #3. Main limiter: Hard Intelligence at 81.37. Hard Intelligence contributes to the ranking as a separate measured lane. |
| #4 | Gemini 3.5 Flash | 83.38 | 88.79 #4 | 73.59 #9 | 87.75 #2 | mean(Full, SWE, Hard Intelligence) | $1.646 | Overall 83.38 uses mean(Full, SWE, Hard Intelligence). Strength signal: Hard Intelligence rank #2. Main limiter: SWE MVP at 73.59. Hard Intelligence contributes to the ranking as a separate measured lane. |
| #5 | DeepSeek V4 Pro | 83.36 | 92.04 #2 | 79.68 #8 | 78.38 #4 | mean(Full, SWE, Hard Intelligence) | $0.335 | Overall 83.36 uses mean(Full, SWE, Hard Intelligence). Strength signal: Full rank #2. Main limiter: Hard Intelligence at 78.38. Hard Intelligence contributes to the ranking as a separate measured lane. |
| #6 | Qwen3.7 Max | 80.71 | 83.24 #10 | 88.99 #1 | 69.90 #7 | mean(Full, SWE, Hard Intelligence) | $0.906 | Overall 80.71 uses mean(Full, SWE, Hard Intelligence). Strength signal: SWE rank #1. Main limiter: Hard Intelligence at 69.90. Hard Intelligence contributes to the ranking as a separate measured lane. |
| #7 | MiniMax M3 | 79.66 | 85.33 #7 | 86.88 #4 | 66.77 #8 | mean(Full, SWE, Hard Intelligence) | $0.182 | Overall 79.66 uses mean(Full, SWE, Hard Intelligence). Strength signal: SWE MVP at 86.88. Main limiter: Hard Intelligence at 66.77. Hard Intelligence contributes to the ranking as a separate measured lane. |
| #8 | Claude Fable 5 | 79.22 | 77.02 #12 | 81.42 #6 | | mean(Full, SWE) | $11.815 | Overall 79.22 uses mean(Full, SWE). Strength signal: SWE MVP at 81.42. Main limiter: Full / Agentic at 77.02. Hard Intelligence is blank, so the overall score currently averages Full and SWE only. |
| #9 | MiniMax M3 Direct Plus | 74.63 | 84.32 #8 | 67.77 #10 | 71.80 #6 | mean(Full, SWE, Hard Intelligence) | $0.0041 | Overall 74.63 uses mean(Full, SWE, Hard Intelligence). Strength signal: Full / Agentic at 84.32. Main limiter: SWE MVP at 67.77. Hard Intelligence contributes to the ranking as a separate measured lane. |
| #10 | Kimi K2.7 Code | 71.01 | 87.95 #5 | 58.61 #12 | 66.46 #9 | mean(Full, SWE, Hard Intelligence) | $0.732 | Overall 71.01 uses mean(Full, SWE, Hard Intelligence). Strength signal: Full / Agentic at 87.95. Main limiter: SWE MVP at 58.61. Hard Intelligence contributes to the ranking as a separate measured lane. |
| #11 | Step 3.7 Flash | 68.06 | 89.79 #3 | 80.39 #7 | 33.99 #11 | mean(Full, SWE, Hard Intelligence) | $0.494 | Overall 68.06 uses mean(Full, SWE, Hard Intelligence). Strength signal: Full rank #3. Main limiter: Hard Intelligence at 33.99. Hard Intelligence contributes to the ranking as a separate measured lane. |
| #12 | NVIDIA Nemotron 3 Ultra | 67.58 | 79.54 #11 | 58.63 #11 | 64.56 #10 | mean(Full, SWE, Hard Intelligence) | $0.489 | Overall 67.58 uses mean(Full, SWE, Hard Intelligence). Strength signal: Full / Agentic at 79.54. Main limiter: SWE MVP at 58.63. Hard Intelligence contributes to the ranking as a separate measured lane. |

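
The lane ranks shown beside each lane score (for example, Hard IQ 87.75 #2 for Gemini 3.5 Flash) can be derived by sorting one lane's measured scores in descending order. The sketch below is a minimal illustration: `lane_ranks` is a hypothetical helper, `None` stands in for a blank cell, and ties (none occur in this data) would simply receive consecutive ranks in input order.

```python
from typing import Optional


def lane_ranks(scores: dict[str, Optional[float]]) -> dict[str, int]:
    """Rank measured entrants in one lane, 1 = highest; blank rows get no rank."""
    measured = [(name, s) for name, s in scores.items() if s is not None]
    ordered = sorted(measured, key=lambda item: item[1], reverse=True)
    return {name: position for position, (name, _) in enumerate(ordered, start=1)}


# Hard Intelligence column from the table; None marks the blank cell.
hard = {
    "GPT‑5.5": 93.40,
    "DeepSeek V4 Flash": 77.53,
    "Claude Opus 4.8": 81.37,
    "Gemini 3.5 Flash": 87.75,
    "DeepSeek V4 Pro": 78.38,
    "Qwen3.7 Max": 69.90,
    "MiniMax M3": 66.77,
    "Claude Fable 5": None,
    "MiniMax M3 Direct Plus": 71.80,
    "Kimi K2.7 Code": 66.46,
    "Step 3.7 Flash": 33.99,
    "NVIDIA Nemotron 3 Ultra": 64.56,
}

ranks = lane_ranks(hard)
print(ranks["Gemini 3.5 Flash"], ranks["Step 3.7 Flash"], "Claude Fable 5" in ranks)
# 2 11 False
```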
Interpretation

Why the leader leads

Leader: GPT‑5.5
Overall: 88.22
Full: 85.76
SWE: 85.51
Hard IQ: 93.40

The top rank belongs to the entrant with the strongest cross-lane balance under the current formula, not simply the best isolated lane score.

Lane policy

How Hard Intelligence is handled

Scope: active inquiry + adaptation + repair
Formula role: included when measured
Blank cells: not yet measured
Interpretation: separate from Full and SWE

When a Hard Intelligence score is published, it becomes the third major lane in the overall mean. Otherwise the row remains ranked by the measured lanes it has.

Tie-break reading

How to compare close rows

Overall: first glance
Lane ranks: diagnosis
Cost: runtime context
Reliability: operational risk

Close overall scores should be read through the lane breakdown. A model can be strong for building software while weaker at active inquiry, or the reverse.
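
As an illustration of that reading, the sketch below compares the two closest rows in the table, Gemini 3.5 Flash (83.38) and DeepSeek V4 Pro (83.36), lane by lane. `lane_deltas` is a hypothetical helper; it reports only lanes measured in both rows.

```python
from typing import Optional

Lanes = dict[str, Optional[float]]


def lane_deltas(a: Lanes, b: Lanes) -> dict[str, float]:
    """Per-lane score difference (a minus b) for lanes measured in both rows."""
    return {
        lane: round(a[lane] - b[lane], 2)
        for lane in a
        if a.get(lane) is not None and b.get(lane) is not None
    }


gemini = {"Full": 88.79, "SWE": 73.59, "Hard": 87.75}  # overall 83.38, rank #4
ds_pro = {"Full": 92.04, "SWE": 79.68, "Hard": 78.38}  # overall 83.36, rank #5

print(lane_deltas(gemini, ds_pro))
# {'Full': -3.25, 'SWE': -6.09, 'Hard': 9.37}
```

The overall scores differ by 0.02, yet DeepSeek V4 Pro leads by about 6 points on SWE while Gemini 3.5 Flash leads by about 9 on Hard Intelligence.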