11 with Hard Intelligence data
Unified ranking · lane-aware explanation
Why the ranking looks like this.
The public ranking is not a single vibe score. It orders measured entrants by a transparent overall formula while keeping Full / Agentic, SWE MVP, Hard Intelligence, cost, and reliability visible.
Overall 88.22
#1 to #12
Full + SWE + published Hard Intelligence when measured
Static HTML plus public JSON
Overall ladder
Every ranked entrant ordered by public overall score.
Overall is a lane mean, not a hidden replacement for source measurements.
Tradeoff scatter maps
Each point is one tested model at the intersection of two public telemetry axes.
Use these maps to read quality versus cost, speed, and token volume. On maps that plot a score against a cost-like axis, upper-left (high score, low cost) is usually the best region; on maps where both axes are cost-like, lower-left is best for token/cost efficiency.
Lane contrast
Top eight entrants with Full, SWE, and Hard Intelligence shown side by side.
Hard Intelligence is shown as its own lane so cross-lane strengths and weaknesses stay visible.
Measured cost context
Cost is shown because deployment economics matter, but it does not secretly rewrite capability scores.
Very expensive rows are not punished twice; cost is visible telemetry and part of the public interpretation.
Lane balance pressure
Largest gap between each entrant’s strongest and weakest measured major lane.
Lower pressure means a more even profile; higher pressure explains why one strong lane may not lift the overall rank by itself.
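As a minimal sketch of that definition, the pressure value can be taken as the highest measured lane score minus the lowest. The function name `lane_pressure` and the two-decimal rounding are assumptions made for illustration, not taken from the published JSON.

```python
from typing import Optional


def lane_pressure(full: float, swe: float, hard: Optional[float] = None) -> float:
    """Gap between the strongest and weakest measured major lane.

    Hypothetical helper: a lane passed as None is treated as unmeasured
    and left out of the comparison.
    """
    measured = [score for score in (full, swe, hard) if score is not None]
    return round(max(measured) - min(measured), 2)


# Lane scores copied from the ranking table.
print(lane_pressure(85.76, 85.51, 93.40))  # GPT-5.5
print(lane_pressure(89.79, 80.39, 33.99))  # Step 3.7 Flash
print(lane_pressure(77.02, 81.42))         # Claude Fable 5 (no Hard Intelligence)
```

With the table's values, GPT‑5.5 has a gap of about 7.89 points while Step 3.7 Flash has a gap of about 55.80, which is why a Full rank of #3 does not carry Step 3.7 Flash up the overall ladder.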
Breadth wins the top spot
GPT‑5.5 leads because its measured lanes stay high together: overall 88.22, Full 85.76, SWE 85.51, and Hard Intelligence 93.40.
Full / Agentic alone does not decide
DeepSeek V4 Flash owns Full rank #1 at 95.80, but the overall formula still checks SWE and Hard Intelligence before ordering the table.
SWE is a separate capability signal
Qwen3.7 Max owns SWE rank #1 at 88.99. That lane rewards practical implementation and review behavior rather than only general prompt competence.
Hard Intelligence reshapes the table
GPT‑5.5 owns Hard Intelligence rank #1 at 93.40. That lane tests active inquiry, adaptation, repair, and authority integrity separately from Full and SWE.
The clearest drag is visible
Step 3.7 Flash has a Full/SWE average near 85.09, but Hard Intelligence is 33.99, so the blended overall lands at 68.06.
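The arithmetic behind those figures, using the rounded lane scores shown in the table, works out as follows:

```latex
\frac{89.79 + 80.39}{2} = \frac{170.18}{2} = 85.09,
\qquad
\frac{89.79 + 80.39 + 33.99}{3} = \frac{204.17}{3} \approx 68.06
```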
Table with reasons, not just numbers.
Each row states the score formula, lane ranks, cost context, and the main reason the entrant lands at its current position.
| Rank | Model | Overall | Full | SWE | Hard Intelligence | Formula | Cost | Why here |
|---|---|---|---|---|---|---|---|---|
| #1 | GPT‑5.5 | 88.22 | 85.76 #6 | 85.51 #5 | 93.40 #1 | mean(Full, SWE, Hard Intelligence) | $4.071 | Overall 88.22 uses mean(Full, SWE, Hard Intelligence). Strength signal: Hard Intelligence rank #1. Main limiter: SWE MVP at 85.51. Hard Intelligence contributes to the ranking as a separate measured lane. |
| #2 | DeepSeek V4 Flash | 86.76 | 95.80 #1 | 86.95 #3 | 77.53 #5 | mean(Full, SWE, Hard Intelligence) | $0.225 | Overall 86.76 uses mean(Full, SWE, Hard Intelligence). Strength signal: Full rank #1, SWE rank #3. Main limiter: Hard Intelligence at 77.53. Hard Intelligence contributes to the ranking as a separate measured lane. |
| #3 | Claude Opus 4.8 | 84.45 | 83.32 #9 | 88.67 #2 | 81.37 #3 | mean(Full, SWE, Hard Intelligence) | $6.115 | Overall 84.45 uses mean(Full, SWE, Hard Intelligence). Strength signal: SWE rank #2, Hard Intelligence rank #3. Main limiter: Hard Intelligence at 81.37. Hard Intelligence contributes to the ranking as a separate measured lane. |
| #4 | Gemini 3.5 Flash | 83.38 | 88.79 #4 | 73.59 #9 | 87.75 #2 | mean(Full, SWE, Hard Intelligence) | $1.646 | Overall 83.38 uses mean(Full, SWE, Hard Intelligence). Strength signal: Hard Intelligence rank #2. Main limiter: SWE MVP at 73.59. Hard Intelligence contributes to the ranking as a separate measured lane. |
| #5 | DeepSeek V4 Pro | 83.36 | 92.04 #2 | 79.68 #8 | 78.38 #4 | mean(Full, SWE, Hard Intelligence) | $0.335 | Overall 83.36 uses mean(Full, SWE, Hard Intelligence). Strength signal: Full rank #2. Main limiter: Hard Intelligence at 78.38. Hard Intelligence contributes to the ranking as a separate measured lane. |
| #6 | Qwen3.7 Max | 80.71 | 83.24 #10 | 88.99 #1 | 69.90 #7 | mean(Full, SWE, Hard Intelligence) | $0.906 | Overall 80.71 uses mean(Full, SWE, Hard Intelligence). Strength signal: SWE rank #1. Main limiter: Hard Intelligence at 69.90. Hard Intelligence contributes to the ranking as a separate measured lane. |
| #7 | MiniMax M3 | 79.66 | 85.33 #7 | 86.88 #4 | 66.77 #8 | mean(Full, SWE, Hard Intelligence) | $0.182 | Overall 79.66 uses mean(Full, SWE, Hard Intelligence). Strength signal: SWE MVP at 86.88. Main limiter: Hard Intelligence at 66.77. Hard Intelligence contributes to the ranking as a separate measured lane. |
| #8 | Claude Fable 5 | 79.22 | 77.02 #12 | 81.42 #6 | | mean(Full, SWE) | $11.815 | Overall 79.22 uses mean(Full, SWE). Strength signal: SWE MVP at 81.42. Main limiter: Full / Agentic at 77.02. Hard Intelligence is blank, so the overall score currently averages Full and SWE only. |
| #9 | MiniMax M3 Direct Plus | 74.63 | 84.32 #8 | 67.77 #10 | 71.80 #6 | mean(Full, SWE, Hard Intelligence) | $0.0041 | Overall 74.63 uses mean(Full, SWE, Hard Intelligence). Strength signal: Full / Agentic at 84.32. Main limiter: SWE MVP at 67.77. Hard Intelligence contributes to the ranking as a separate measured lane. |
| #10 | Kimi K2.7 Code | 71.01 | 87.95 #5 | 58.61 #12 | 66.46 #9 | mean(Full, SWE, Hard Intelligence) | $0.732 | Overall 71.01 uses mean(Full, SWE, Hard Intelligence). Strength signal: Full / Agentic at 87.95. Main limiter: SWE MVP at 58.61. Hard Intelligence contributes to the ranking as a separate measured lane. |
| #11 | Step 3.7 Flash | 68.06 | 89.79 #3 | 80.39 #7 | 33.99 #11 | mean(Full, SWE, Hard Intelligence) | $0.494 | Overall 68.06 uses mean(Full, SWE, Hard Intelligence). Strength signal: Full rank #3. Main limiter: Hard Intelligence at 33.99. Hard Intelligence contributes to the ranking as a separate measured lane. |
| #12 | NVIDIA Nemotron 3 Ultra | 67.58 | 79.54 #11 | 58.63 #11 | 64.56 #10 | mean(Full, SWE, Hard Intelligence) | $0.489 | Overall 67.58 uses mean(Full, SWE, Hard Intelligence). Strength signal: Full / Agentic at 79.54. Main limiter: SWE MVP at 58.63. Hard Intelligence contributes to the ranking as a separate measured lane. |
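
The `#n` annotations next to each lane score can be reproduced by sorting one lane from highest to lowest score. The sketch below does this for the Hard Intelligence column; `lane_ranks` is a hypothetical helper, and it assumes unmeasured entrants are simply skipped and that no two entrants tie (true of the table as published).

```python
from typing import Dict, Optional


def lane_ranks(scores: Dict[str, Optional[float]]) -> Dict[str, int]:
    """Rank entrants within one lane, highest score first.

    Entrants whose score is None (unmeasured) receive no rank.
    Ties are not handled specially; this is an assumption of the sketch.
    """
    measured = {name: s for name, s in scores.items() if s is not None}
    ordered = sorted(measured, key=lambda name: measured[name], reverse=True)
    return {name: position for position, name in enumerate(ordered, start=1)}


# Hard Intelligence scores copied from the table; None marks the blank cell.
hard_intelligence = {
    "GPT-5.5": 93.40,
    "DeepSeek V4 Flash": 77.53,
    "Claude Opus 4.8": 81.37,
    "Gemini 3.5 Flash": 87.75,
    "DeepSeek V4 Pro": 78.38,
    "Qwen3.7 Max": 69.90,
    "MiniMax M3": 66.77,
    "Claude Fable 5": None,
    "MiniMax M3 Direct Plus": 71.80,
    "Kimi K2.7 Code": 66.46,
    "Step 3.7 Flash": 33.99,
    "NVIDIA Nemotron 3 Ultra": 64.56,
}

ranks = lane_ranks(hard_intelligence)
for name, rank in sorted(ranks.items(), key=lambda item: item[1]):
    print(f"#{rank} {name}")
```

Only eleven entrants receive a Hard Intelligence rank here, which matches the eleven rows that carry a Hard Intelligence score in the table.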
Why the leader leads
The top rank belongs to the entrant with the highest mean across its measured lanes under the current formula. That rewards staying strong in every lane, not simply posting the best isolated lane score.
How Hard Intelligence is handled
When a Hard Intelligence score is published, it becomes the third major lane in the overall mean. Otherwise the row remains ranked by the measured lanes it has.
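A minimal sketch of that rule, assuming the overall score is the plain arithmetic mean of whichever major lanes are measured and is rounded to two decimals. The name `overall_score` and the use of `None` for an unpublished lane are illustrative choices, not the site's actual code.

```python
from statistics import mean
from typing import Optional


def overall_score(full: float, swe: float, hard: Optional[float] = None) -> float:
    """Mean of the measured major lanes.

    Hard Intelligence joins the mean only when a score is published;
    otherwise the result is mean(Full, SWE).
    """
    lanes = [full, swe] if hard is None else [full, swe, hard]
    return round(mean(lanes), 2)


# Lane scores copied from the ranking table.
print(overall_score(85.76, 85.51, 93.40))  # GPT-5.5, three lanes
print(overall_score(77.02, 81.42))         # Claude Fable 5, Hard Intelligence blank
```

These two calls reproduce the published 88.22 and 79.22. Recomputing from the rounded lane values in the table can differ from a published overall in the last digit (DeepSeek V4 Pro recomputes to 83.37 against a published 83.36), presumably because the published figure is derived from unrounded lane scores.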
How to compare close rows
Close overall scores should be read through the lane breakdown. A model can be strong for building software while weaker at active inquiry, or the reverse.
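One way to read two close rows is to subtract them lane by lane. The sketch below compares rank #4 and rank #5, whose overall scores differ by 0.02; `lane_deltas` and the dictionary layout are assumptions made for illustration.

```python
from typing import Dict, Optional

LaneScores = Dict[str, Optional[float]]


def lane_deltas(a: LaneScores, b: LaneScores) -> Dict[str, float]:
    """Per-lane difference (a minus b) for lanes measured in both rows."""
    return {
        lane: round(a[lane] - b[lane], 2)
        for lane in a
        if a[lane] is not None and b.get(lane) is not None
    }


# Lane scores copied from the ranking table.
gemini_flash = {"Full": 88.79, "SWE": 73.59, "Hard Intelligence": 87.75}
deepseek_pro = {"Full": 92.04, "SWE": 79.68, "Hard Intelligence": 78.38}

for lane, delta in lane_deltas(gemini_flash, deepseek_pro).items():
    print(f"{lane}: {delta:+.2f}")
```

Here Gemini 3.5 Flash trails DeepSeek V4 Pro by 3.25 on Full and 6.09 on SWE but leads by 9.37 on Hard Intelligence, so the near-identical overall scores hide opposite profiles.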