Model result · rank #13

Gemma4‑12B‑Coder Fable5/Composer2.5 Q4_K_M

Local model

Local GGUF · llama.cpp Vulkan · Q4_K_M. Public result card with the model’s overall score, lane measurements, runtime/cost telemetry, and ranking formula.

Overall score: 61.19

Rank #13

Full / Agentic: 80.60

Full rank #11

SWE MVP: 54.01

SWE rank #13

Hard Intelligence: 48.96

Hard rank #11

Measured cost: $0

100.0% reliability

Overall

All-around publication view

Score: 61.19
Formula: mean(Full, SWE, Hard Intelligence)
Basis: Local GGUF · llama.cpp Vulkan · Q4_K_M

The overall score averages the measured major lanes while keeping each source measurement visible.
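As a quick check of the formula above, the following sketch recomputes the overall score from the three lane scores reported on this card. The function name `overall_score` and the dictionary keys are illustrative assumptions, not part of the benchmark's own tooling.

```python
from statistics import mean

# Lane scores as reported on this card.
LANE_SCORES = {
    "full_agentic": 80.60,
    "swe_mvp": 54.01,
    "hard_intelligence": 48.96,
}


def overall_score(lanes: dict[str, float]) -> float:
    """Unweighted mean of the lane scores, rounded to two decimals."""
    return round(mean(list(lanes.values())), 2)


print(overall_score(LANE_SCORES))  # 61.19
```

Because the mean is unweighted, each lane moves the overall score equally: the 8-prompt SWE lane counts as much as the 43-prompt Full / Agentic lane.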

Lane 01

Full / Agentic benchmark

Final: 80.60
Capability: 87.07
Agentic: 79.75
Pass rate: 88.4%
Prompts: 43

This lane captures instruction following, structured behavior, tool discipline, and general agentic reliability.

Lane 02

Software engineering MVP

SWE score: 54.01
Focused final: 54.01
Capability: 32.70
Daily driver: 60.76
Prompts: 8

This lane is closer to implementation usefulness: source handling, architecture cleanliness, and deliverable quality.

Lane 03

Hard Intelligence diagnostic

Hard score: 48.96
Active inquiry: 17.50
Online adaptation: 6.25
Self-repair: 82.08
Authority integrity: 90.00

Hard Intelligence measures active inquiry, online adaptation, evidence-driven self-repair, and authority/salience integrity.
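The card does not state how the Hard score is derived from its four components. As an observation only, an unweighted mean of the four reported sub-scores comes to 48.9575, which agrees with the reported 48.96 at two decimals. The sketch below checks that arithmetic; treating the lane as a plain mean is an assumption, and all names are hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP

# Sub-scores as reported on this card (Decimal avoids float rounding noise).
HARD_COMPONENTS = {
    "active_inquiry": Decimal("17.50"),
    "online_adaptation": Decimal("6.25"),
    "self_repair": Decimal("82.08"),
    "authority_integrity": Decimal("90.00"),
}


def unweighted_mean(parts: dict[str, Decimal]) -> Decimal:
    """Plain average of the component scores (assumed, not documented)."""
    return sum(parts.values(), Decimal("0")) / len(parts)


hard_mean = unweighted_mean(HARD_COMPONENTS)
hard_rounded = hard_mean.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
print(hard_mean, hard_rounded)  # 48.9575 48.96
```

If that reading is right, the two weak components (active inquiry and online adaptation) pull the lane down by roughly as much as the two strong ones lift it.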

Telemetry

Runtime economics

Total cost: $0
Cost / scored item: $0
Seconds / timed item: 21.52s
Runtime coverage: 100.0%
Recorded tokens / item: 8.3k
Token coverage: 100.0%

Cost, time, and token basis are normalized telemetry. They explain tradeoffs; they do not overwrite the capability score yet.

Interpretation

Why this result lands here.

The model is much stronger in the Full/Agentic lane (80.60) than in the SWE lane (54.01), so the overall score is shown with each component lane visible. The Hard Intelligence score of 48.96 contributes to the overall score alongside Full/Agentic and SWE. This is a local model row: it was benchmarked on local hardware with no API metering, which is why the measured cost is $0. Full/Agentic and focused tool-loop behavior are strong; SWE review/audit and Hard Intelligence inquiry and adaptation (17.50 and 6.25) are weak in this max-thinking run.