Model result · rank #4

GLM‑5.2

OpenRouter · z-ai/glm-5.2 · maximum reasoning · Full + SWE + Hard measured. Public result card with the model’s overall score, lane measurements, runtime/cost telemetry, and ranking formula.


Overall score: 83.68

Rank #4

Full / Agentic: 86.87

Full rank #6

SWE MVP: 89.58

SWE rank #1

Hard Intelligence: 74.58

Hard rank #6

Measured cost: $2.768

100.0% reliability

Overall

All-around publication view

Score: 83.68

Formula: mean(Full, SWE, Hard Intelligence)

Basis: OpenRouter · z-ai/glm-5.2 · maximum reasoning · Full + SWE + Hard measured

The overall score averages the measured major lanes while keeping each source measurement visible.
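As a quick check, the formula can be reproduced from the three lane scores on this card. This is a minimal sketch that assumes an unweighted arithmetic mean rounded to two decimals; the dictionary keys and variable names are illustrative and are not taken from the public JSON.

```python
from statistics import mean

# Lane scores as published on this card.
lane_scores = {
    "Full / Agentic": 86.87,
    "SWE MVP": 89.58,
    "Hard Intelligence": 74.58,
}

# Assumption: unweighted mean, rounded to two decimals for display.
overall = round(mean(lane_scores.values()), 2)
print(overall)  # 83.68
```

Under that assumption the result matches the published overall score of 83.68.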

Lane 01

Full / Agentic benchmark

Final: 86.87

Capability: 99.37

Agentic: 96.24

Pass rate: 100.0%

Prompts: 43

This lane captures instruction following, structured behavior, tool discipline, and general agentic reliability.

Lane 02

Software engineering MVP

SWE score: 89.58

Focused final: 72.41

Capability: 89.58

Daily driver: 65.89

Prompts: 24

This lane is closer to implementation usefulness: source handling, architecture cleanliness, and deliverable quality.

Lane 03

Hard Intelligence diagnostic

Hard score: 74.58

Active inquiry: 17.50

Online adaptation: 98.13

Self-repair: 82.71

Authority integrity: 100.00

Hard Intelligence measures active inquiry, online adaptation, evidence-driven self-repair, and authority/salience integrity.

Telemetry

Runtime economics

Total cost: $2.768

Cost / scored item: $0.037

Seconds / timed item: 91.19 s

Runtime coverage: 100.0%

Recorded tokens / item: 22.4k

Token coverage: 100.0%

Cost, time, and token basis are normalized telemetry. They explain tradeoffs; they do not overwrite the capability score yet.
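The card gives the total cost and the cost per scored item but not the number of scored items. The sketch below back-solves the item counts that are consistent with both figures. It assumes that cost per item is simply total cost divided by the item count, shown to three decimals; the function name and search range are hypothetical.

```python
TOTAL_COST = 2.768        # USD, as published
COST_PER_ITEM = "0.037"   # USD, as displayed (three decimals)

def implied_item_counts(total, per_item_display, search=range(1, 501)):
    """Return item counts n for which total / n displays as per_item_display."""
    return [n for n in search if f"{total / n:.3f}" == per_item_display]

candidates = implied_item_counts(TOTAL_COST, COST_PER_ITEM)
print(candidates)  # [74, 75]
```

Under those assumptions the telemetry implies roughly 74 to 75 scored items. The Full and SWE lanes list 43 and 24 prompts; the card does not state the Hard Intelligence prompt count, so the remainder cannot be attributed with certainty.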

Interpretation

Why this result lands here.

The result is best read as a balanced benchmark entry: one overall score plus the lane measurements that produced it. The Hard Intelligence score is 74.58 and contributes to the overall score alongside Full / Agentic and SWE. GLM‑5.2 ran with OpenRouter maximum reasoning across the public benchmark suites. The SWE score reflects all prompt outcomes from the public software-engineering suite.