Model result · rank #15
Ornith‑1.0‑35B Q4_K_M
Local model
Local GGUF · llama.cpp Vulkan · Q4_K_M · 35B MoE. Public result card with the model’s overall score, lane measurements, runtime/cost telemetry, and ranking formula.
Full rank #14
SWE rank #16
Hard rank #12
100.0% reliability
All-around publication view
The overall score averages the measured major lanes while keeping each source measurement visible.
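The averaging described above can be sketched as follows. This is a minimal illustration, assuming an unweighted arithmetic mean over the lanes that have a measurement; the card does not state the weights here. The function name `overall_score`, the lane keys, and the numbers are hypothetical placeholders, not published results.

```python
from statistics import mean
from typing import Dict, Optional


def overall_score(lanes: Dict[str, Optional[float]]) -> dict:
    """Average the measured lanes and keep each source measurement visible.

    Assumption: every measured lane carries equal weight. Lanes with no
    measurement (None) are left out of the average rather than counted as 0.
    """
    measured = {name: score for name, score in lanes.items() if score is not None}
    if not measured:
        raise ValueError("at least one lane must be measured")
    return {
        "overall": round(mean(measured.values()), 2),
        "lanes": measured,  # component scores stay alongside the average
    }


# Placeholder values for illustration only, not scores from this card.
result = overall_score({"full_agentic": 80.0, "swe": 60.0, "hard_intelligence": 40.0})
print(result["overall"])
print(result["lanes"])
```

Returning the lane scores next to the average mirrors the card's layout: a reader can see which lane pulls the overall number up or down instead of reading a single figure.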
Full / Agentic benchmark
This lane captures instruction following, structured behavior, tool discipline, and general agentic reliability.
Software engineering MVP
This lane is closer to implementation usefulness: source handling, architecture cleanliness, and deliverable quality.
Hard Intelligence diagnostic
Hard Intelligence measures active inquiry, online adaptation, evidence-driven self-repair, and authority/salience integrity.
Runtime economics
Cost, time, and token basis are reported as normalized telemetry. They explain tradeoffs; they do not yet change the capability score.
Why this result lands here.
The model is stronger in the Full/Agentic lane than in the SWE lane, so the overall score is shown with both component lanes visible. The Hard Intelligence score is 49.58 and contributes to the overall score alongside Full/Agentic and SWE. This is a local model row: it was benchmarked on local hardware with no API metering. The run is a strong local Full/Agentic baseline for a 35B MoE GGUF, but SWE implementation/review and Hard Intelligence inquiry/adaptation were weak. The local entrant shares the unified public tournament table with API-backed entrants while exposing its local runtime and cost metadata.