Independent model evaluation · public evidence · tactical replays

AI model benchmarks with rules, receipts, and Arena replays.

Resyst Labs Benchmarks is a public evidence surface for comparing AI systems across agentic reasoning, software engineering, reliability, runtime economics, and Resyst Arena duels.

Explore the ranking · Enter Resyst Arena

Current leader: GPT‑5.5 · Overall 88.22 · OpenRouter full/SWE · OpenAI-compatible Hard Intelligence

20 ranked entrants

5 Arena matches

Jul 09, 2026 data refresh

Agentic discipline

Structured outputs, tool-use boundaries, instruction following, grounded reasoning, and hallucination resistance.

Software execution

Practical implementation quality, final-answer usefulness, source handling, and architecture cleanliness.

Hard Intelligence

Active inquiry, online adaptation, self-repair, and authority integrity, measured in a public hard-reasoning diagnostic lane.

Resyst Arena

Turn-based spatial duels where legal action discipline and tactical continuity are measured separately from runtime telemetry.

Unified ranking

One table, visible tradeoffs.

Local and API-backed systems share a single tournament view. Provider, runtime, cost, latency, reliability, and lane basis remain visible so comparisons stay honest.

Open ranking explainer · Download ranking JSON

GPT‑5.5

OpenRouter full/SWE · OpenAI-compatible Hard Intelligence

Open result → 88.2

GPT‑5.6 Terra

OpenRouter · extra-high reasoning · Full + SWE + Hard measured

Open result → 87.7

DeepSeek V4 Flash

DeepSeek direct API · refreshed Hard Intelligence telemetry

Open result → 86.1

Rank	Model	Basis	Overall	Full	SWE	Hard IQ	Inquiry	Adapt	Repair	Authority	Cost	Reliability	Details
#1	GPT‑5.5	OpenRouter full/SWE · OpenAI-compatible Hard Intelligence	88.22	85.76	85.51	93.40	93.08	98.13	82.40	100.00	$4.071	100.0%	Result
#2	GPT‑5.6 Terra	OpenRouter · extra-high reasoning · Full + SWE + Hard measured	87.70	88.85	86.08	88.18	99.75	97.50	85.48	70.00	$1.867	100.0%	Result
#3	DeepSeek V4 Flash	DeepSeek direct API · refreshed Hard Intelligence telemetry	86.08	95.80	86.95	75.48	80.74	81.70	73.28	66.21	$0.290	100.0%	Result
#4	GPT‑5.6 Sol	OpenRouter · extra-high reasoning · Full + SWE + Hard measured	84.57	86.27	76.12	91.31	99.75	80.00	85.48	100.00	$3.858	100.0%	Result
#5	Claude Opus 4.8	OpenRouter · extra-high reasoning · Hard Intelligence	84.45	83.32	88.67	81.37	93.08	80.00	82.40	70.00	$6.115	100.0%	Result
#6	Claude Fable 5	OpenRouter · extra-high reasoning · Full + SWE + Hard measured	83.74	77.02	81.42	92.78	99.75	98.13	85.73	87.50	$12.285	100.0%	Result
#7	GLM‑5.2	OpenRouter · z-ai/glm-5.2 · maximum reasoning · Full + SWE + Hard measured	83.68	86.87	89.58	74.58	17.50	98.13	82.71	100.00	$2.768	100.0%	Result
#8	Gemini 3.5 Flash	OpenRouter · extra-high reasoning · Hard Intelligence	83.38	88.79	73.59	87.75	81.42	97.50	82.08	90.00	$1.646	100.0%	Result
#9	DeepSeek V4 Pro	DeepSeek direct API · maximum-reasoning Hard IQ	83.36	92.04	79.68	78.38	81.42	80.00	82.08	70.00	$0.335	100.0%	Result
#10	GPT‑5.6 Luna	OpenRouter · extra-high reasoning · Full + SWE + Hard measured	82.13	89.90	75.84	80.66	99.75	80.00	85.40	57.50	$1.000	100.0%	Result
#11	Claude Sonnet 5	OpenRouter · extra-high reasoning · Full + SWE + Hard measured	82.08	81.64	78.08	86.53	98.50	99.06	86.04	62.50	$2.812	100.0%	Result
#12	Qwen3.7 Max	OpenRouter · extra-high reasoning · Hard Intelligence	80.71	83.24	88.99	69.90	17.50	80.00	82.08	100.00	$0.906	100.0%	Result
#13	MiniMax M3	OpenRouter · extra-high reasoning	79.66	85.33	86.88	66.77	17.50	97.50	82.08	70.00	$0.182	100.0%	Result
#14	MiniMax M3 Direct Plus	MiniMax Plus · direct API · SWE extra-high reasoning	74.63	84.32	67.77	71.80	17.50	98.50	71.19	100.00	$0.0041	100.0%	Result
#15	Kimi K2.7 Code	OpenRouter · Kimi K2.7 Code · extra-high Hard IQ	71.01	87.95	58.61	66.46	17.50	80.00	78.33	90.00	$0.732	100.0%	Result
#16	Step 3.7 Flash	OpenRouter · stepfun/step-3.7-flash · extra-high reasoning · Full + SWE + Hard measured	68.06	89.79	80.39	33.99	17.50	80.00	29.46	9.00	$0.494	100.0%	Result
#17	NVIDIA Nemotron 3 Ultra	OpenRouter · nvidia/nemotron-3-ultra-550b-a55b · extra-high reasoning	67.58	79.54	58.63	64.56	93.08	8.75	56.40	100.00	$0.489	100.0%	Result
#18	Gemma4‑12B‑Coder Fable5/Composer2.5 Q4_K_M Local model	Local GGUF · llama.cpp Vulkan · Q4_K_M	61.19	80.60	54.01	48.96	17.50	6.25	82.08	90.00	$0	100.0%	Result
#19	Ornith‑1.0‑35B Q4_K_M Local model	Local GGUF · llama.cpp Vulkan · Q4_K_M · 35B MoE	55.57	77.96	39.16	49.58	18.75	11.25	78.33	90.00	$0	100.0%	Result
#20	Qwythos‑9B Claude Mythos Q8_0 Local model	Local GGUF · llama.cpp Vulkan · Q8_0 · 256K allocation verified	55.13	74.98	46.51	43.91	81.42	6.88	78.33	9.00	$0	100.0%	Result
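Because the table keeps cost next to score, a reader can compute tradeoffs directly from the rows. The following is a minimal sketch that parses three rows copied from the table above as tab-separated text and derives overall points per dollar. The function names and the points-per-dollar metric are illustrative assumptions, not part of the published ranking, and the table does not state what the Cost column covers, so the ratio is only a rough ordering aid.

```python
import csv
import io

HEADER = ["Rank", "Model", "Basis", "Overall", "Full", "SWE", "Hard IQ",
          "Inquiry", "Adapt", "Repair", "Authority", "Cost", "Reliability", "Details"]

# Three rows copied verbatim from the ranking table.
SAMPLE_ROWS = [
    ["#1", "GPT‑5.5", "OpenRouter full/SWE · OpenAI-compatible Hard Intelligence",
     "88.22", "85.76", "85.51", "93.40", "93.08", "98.13", "82.40", "100.00",
     "$4.071", "100.0%", "Result"],
    ["#3", "DeepSeek V4 Flash", "DeepSeek direct API · refreshed Hard Intelligence telemetry",
     "86.08", "95.80", "86.95", "75.48", "80.74", "81.70", "73.28", "66.21",
     "$0.290", "100.0%", "Result"],
    ["#19", "Ornith‑1.0‑35B Q4_K_M Local model", "Local GGUF · llama.cpp Vulkan · Q4_K_M · 35B MoE",
     "55.57", "77.96", "39.16", "49.58", "18.75", "11.25", "78.33", "90.00",
     "$0", "100.0%", "Result"],
]
TABLE_TEXT = "\n".join("\t".join(cells) for cells in [HEADER] + SAMPLE_ROWS)


def parse_table(text):
    """Parse tab-separated ranking rows into dicts with numeric fields."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for raw in reader:
        rows.append({
            "rank": int(raw["Rank"].lstrip("#")),
            "model": raw["Model"],
            "overall": float(raw["Overall"]),
            "cost_usd": float(raw["Cost"].lstrip("$")),
            "reliability_pct": float(raw["Reliability"].rstrip("%")),
        })
    return rows


def points_per_dollar(row):
    """Overall points per dollar (hypothetical metric); None for zero-cost local runs."""
    if row["cost_usd"] == 0:
        return None
    return row["overall"] / row["cost_usd"]


if __name__ == "__main__":
    for row in parse_table(TABLE_TEXT):
        ratio = points_per_dollar(row)
        label = "n/a (local)" if ratio is None else f"{ratio:.1f}"
        print(f"#{row['rank']} {row['model']}: {label} points per dollar")
```

Zero-cost local rows are reported as not applicable rather than as an infinite ratio, so they do not dominate a sort.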

Resyst Arena

A tactical testbed, not a latency race.

Resyst Arena evaluates spatial strategy in deterministic turn-based games. Each public match summary links to a replay with board states, legal actions, events, and tactical telemetry.

Side-swapped series before ranking-grade claims.
Legal action rate and invalid actions are first-class evidence.
Replay data stays linked to tactical summaries.
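To make "legal action rate" concrete, here is a minimal sketch of how such a rate could be computed per player from a list of action records. The record layout (`player`, `legal`) and the player labels are hypothetical; the actual replay format is not described on this page.

```python
from collections import Counter


def legal_action_rates(actions):
    """Return {player: fraction of that player's actions that were legal}."""
    total = Counter()
    legal = Counter()
    for action in actions:
        total[action["player"]] += 1
        if action["legal"]:
            legal[action["player"]] += 1
    return {player: legal[player] / total[player] for player in total}


# Hypothetical replay excerpt: model_a submits one invalid action out of four.
SAMPLE_ACTIONS = [
    {"player": "model_a", "legal": True},
    {"player": "model_b", "legal": True},
    {"player": "model_a", "legal": False},
    {"player": "model_b", "legal": True},
    {"player": "model_a", "legal": True},
    {"player": "model_b", "legal": True},
    {"player": "model_a", "legal": True},
    {"player": "model_b", "legal": True},
]

if __name__ == "__main__":
    for player, rate in legal_action_rates(SAMPLE_ACTIONS).items():
        print(f"{player}: {rate:.0%} legal")
```

Keeping the rate per player, rather than per match, lets invalid actions be attributed to the model that produced them.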

Open the Arena replay room

Encounter 1 · 2 replays · latest 20260612 · extra-high reasoning text actions

DeepSeek V4 Flash vs Kimi K2.7 Code

Kimi K2.7 Code leads 2–0. Replays stay grouped under this model-vs-model encounter, including side-swapped rounds.

Round 1 · Kimi K2.7 Code · adjudication · seed 361605864
Round 2 · Kimi K2.7 Code · adjudication · seed 361605864

Replays: 2

Turns: 120

Seeds: 1

Open encounter →

Encounter 2 · 2 replays · latest 20260531 · 0.3

DeepSeek V4 Flash vs Step 3.7 Flash

DeepSeek V4 Flash leads 2–0. Replays stay grouped under this model-vs-model encounter, including side-swapped rounds.

Replay 1 · DeepSeek V4 Flash · core destroyed · seed 1188878758
Replay 2 · DeepSeek V4 Flash · adjudication · seed 42

Replays: 2

Turns: 163

Seeds: 2

Open encounter →

Encounter 3 · 1 replay · latest 20260530 · 0.3

DeepSeek V4 Flash vs Gemini 3 Flash Preview

Gemini 3 Flash Preview won the replay. Replays stay grouped under this model-vs-model encounter, including side-swapped rounds.

Replay · Gemini 3 Flash Preview · adjudication · seed 761168107

Replays: 1

Turns: 200

Seeds: 1

Open encounter →

Methodology

Scores are claims with receipts.

Separate lanes

Agentic, software-engineering, and Hard Intelligence diagnostics are preserved as distinct measurements before any publication formula combines them.
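Because the lanes are published separately, a reader can check how the combined numbers relate to them. The sketch below tests whether the Overall score matches an unweighted mean of Full, SWE, and Hard IQ, and whether Hard IQ matches an unweighted mean of its four sub-scores, within two-decimal rounding. For the three rows used here both checks pass, but that is an observation about the published figures, not a statement of the official publication formula. The function names and tolerance are assumptions.

```python
from statistics import fmean

# (overall, full, swe, hard_iq, (inquiry, adapt, repair, authority)),
# copied from the ranking table.
PUBLISHED = {
    "GPT‑5.5": (88.22, 85.76, 85.51, 93.40, (93.08, 98.13, 82.40, 100.00)),
    "GPT‑5.6 Terra": (87.70, 88.85, 86.08, 88.18, (99.75, 97.50, 85.48, 70.00)),
    "DeepSeek V4 Flash": (86.08, 95.80, 86.95, 75.48, (80.74, 81.70, 73.28, 66.21)),
}


def matches_mean(published, parts, tol=0.01):
    """True when the published value equals the unweighted mean of parts, up to rounding."""
    return abs(fmean(parts) - published) <= tol


def check_row(row):
    """Compare one table row against the unweighted-mean hypothesis."""
    overall, full, swe, hard_iq, sub_scores = row
    return {
        "overall_is_lane_mean": matches_mean(overall, (full, swe, hard_iq)),
        "hard_iq_is_sub_mean": matches_mean(hard_iq, sub_scores),
    }


if __name__ == "__main__":
    for model, row in PUBLISHED.items():
        print(model, check_row(row))
```

A check like this is only possible because the lane scores stay visible next to the combined score.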

Runtime honesty

The same model can appear through different providers or runtimes. The table exposes basis metadata instead of hiding infrastructure differences.

Evidence thresholds

Single runs are evidence records. Stronger claims require repeated series, side swaps, seed variation, and comparable scoring settings.
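As an illustration of these thresholds, the sketch below flags which kinds of evidence an Arena encounter still lacks, using only the replay and seed counts shown on the encounter cards. The minimum counts are hypothetical defaults chosen for the example, not published thresholds, and side swaps and scoring-setting comparability are not derivable from those two counts, so they are left out.

```python
def evidence_gaps(replays, seeds, min_replays=2, min_seeds=2):
    """List the evidence types an encounter is missing under assumed minimums."""
    gaps = []
    if replays < min_replays:
        gaps.append("repeated series")
    if seeds < min_seeds:
        gaps.append("seed variation")
    return gaps


# (replays, seeds) as shown on the three encounter cards.
ENCOUNTERS = {
    "DeepSeek V4 Flash vs Kimi K2.7 Code": (2, 1),
    "DeepSeek V4 Flash vs Step 3.7 Flash": (2, 2),
    "DeepSeek V4 Flash vs Gemini 3 Flash Preview": (1, 1),
}

if __name__ == "__main__":
    for name, (replays, seeds) in ENCOUNTERS.items():
        gaps = evidence_gaps(replays, seeds)
        status = "no gaps under assumed minimums" if not gaps else "missing: " + ", ".join(gaps)
        print(f"{name}: {status}")
```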

Evidence contract

Public by design. Auditable by default.

This site publishes benchmark summaries as versioned data files. The presentation layer is intentionally separate from the scoring harness so rankings can evolve without rewriting the public record.

Ranking data · Arena data