Model result · rank #13
Gemma4‑12B‑Coder Fable5/Composer2.5 Q4_K_M
Local model · Local GGUF · llama.cpp Vulkan · Q4_K_M. Public result card with the model’s overall score, lane measurements, runtime/cost telemetry, and ranking formula.
Full rank #11
SWE rank #13
Hard rank #11
100.0% reliability
All-around publication view
The overall score averages the measured major lanes while keeping each source measurement visible.
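As a rough illustration of that averaging, here is a minimal Python sketch. It assumes an unweighted mean over whichever lanes were actually measured; the card does not state the exact formula or weights. The function name `overall_score`, the lane keys, and the Full/Agentic and SWE values are hypothetical placeholders. Only the Hard Intelligence value (48.96) comes from this card.

```python
from statistics import mean
from typing import Dict, Optional


def overall_score(lanes: Dict[str, Optional[float]]) -> float:
    """Average the measured lanes; lanes with no measurement (None) are skipped.

    Assumption: a plain unweighted mean. The published formula may differ.
    """
    measured = [score for score in lanes.values() if score is not None]
    if not measured:
        raise ValueError("no measured lanes to average")
    return mean(measured)


# Placeholder lane scores: only hard_intelligence is taken from the card.
lanes = {
    "full_agentic": 80.0,        # hypothetical value, not a published score
    "swe": 60.0,                 # hypothetical value, not a published score
    "hard_intelligence": 48.96,  # stated on this result card
}

# Each source measurement stays visible next to the aggregate.
for name, score in lanes.items():
    print(f"{name}: {score}")
print(f"overall: {overall_score(lanes):.2f}")
```

Returning the aggregate separately from the per-lane values mirrors the card's layout, where the overall score never hides the lane it was derived from.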
Full / Agentic benchmark
This lane captures instruction following, structured behavior, tool discipline, and general agentic reliability.
Software engineering MVP
This lane is closer to implementation usefulness: source handling, architecture cleanliness, and deliverable quality.
Hard Intelligence diagnostic
Hard Intelligence measures active inquiry, online adaptation, evidence-driven self-repair, and authority/salience integrity.
Runtime economics
Cost, time, and token basis are normalized telemetry. They explain tradeoffs; they do not overwrite the capability score yet.
Why this result lands here.
The model is stronger in the Full/Agentic lane than in the SWE lane, so the overall score is shown with both component lanes visible. The Hard Intelligence score is 48.96 and contributes to the overall score alongside Full/Agentic and SWE. This is a local model row: it was benchmarked on local hardware with no API metering. In this max-thinking run, local Full/Agentic performance and focused tool-loop behavior are strong, while SWE review/audit and Hard Intelligence inquiry/adaptation are weak.