Independent model evaluation · public artifacts · tactical evidence

AI model benchmarks with rules, receipts, and Arena replays.

Resyst Labs Benchmarks is a public evidence surface for comparing AI systems across agentic reasoning, software engineering, reliability, runtime economics, and Resyst Arena duels.

Current leader: DeepSeek V4 Flash · Overall 91.38 · DeepSeek direct API
Ranked entrants: 11
Arena matches: 5
Data refresh: Jun 11, 2026

01

Agentic discipline

Structured outputs, tool-use boundaries, instruction following, grounded reasoning, and hallucination resistance.

02

Software execution

Practical implementation quality, final-answer usefulness, source handling, and architecture cleanliness.

03

Resyst Arena

Turn-based spatial duels where legal action discipline and tactical continuity are measured separately from runtime telemetry.

Unified ranking

One table, visible tradeoffs.

Local and API-backed systems share a single tournament view. Provider, runtime, cost, latency, reliability, and lane basis remain visible so comparisons stay honest.

Download ranking JSON

#1 DeepSeek V4 Flash (DeepSeek direct API): 91.4
#2 Qwen3.7 Max (OpenRouter · extra-high reasoning): 86.1
#3 MiniMax M3 (OpenRouter · extra-high reasoning): 86.1

| Rank | Model | Basis | Overall | Full | SWE | Cost | Reliability |
|------|-------|-------|---------|------|-----|------|-------------|
| #1 | DeepSeek V4 Flash (deepseek-v4-flash-direct) | DeepSeek direct API | 91.38 | 95.80 | 86.95 | $0.086 | 100.0% |
| #2 | Qwen3.7 Max (qwen3.7-max-openrouter-xhigh) | OpenRouter · extra-high reasoning | 86.11 | 83.24 | 88.99 | $0.851 | 100.0% |
| #3 | MiniMax M3 (minimax-m3-openrouter-xhigh) | OpenRouter · extra-high reasoning | 86.11 | 85.33 | 86.88 | $0.157 | 100.0% |
| #4 | Claude Opus 4.8 (claude-opus-4.8-openrouter-xhigh) | OpenRouter · extra-high reasoning | 86.00 | 83.32 | 88.67 | $5.981 | 100.0% |
| #5 | DeepSeek V4 Pro (deepseek-v4-pro-direct) | DeepSeek direct API | 85.86 | 92.04 | 79.68 | $0.280 | 100.0% |
| #6 | GPT‑5.5 (gpt-5.5-openrouter-xhigh) | OpenRouter · extra-high reasoning | 85.64 | 85.76 | 85.51 | $4.071 | 100.0% |
| #7 | Gemini 3.5 Flash (gemini-3.5-flash-openrouter) | OpenRouter · google/gemini-3.5-flash | 81.19 | 88.79 | 73.59 | $1.502 | 100.0% |
| #8 | Claude Fable 5 (claude-fable-5-openrouter-xhigh) | OpenRouter · extra-high reasoning | 79.22 | 77.02 | 81.42 | $11.815 | 100.0% |
| #9 | MiniMax M3 Direct Plus (minimax-m3-direct-anthropic) | MiniMax Plus · direct API · SWE xhigh | 76.05 | 84.32 | 67.77 | $0.0037 | 100.0% |
| #10 | Kimi K2.7 Code (kimi-k2.7-code-openrouter-xhigh) | OpenRouter · extra-high reasoning | 73.28 | 87.95 | 58.61 | $0.569 | 100.0% |
| #11 | NVIDIA Nemotron 3 Ultra (nemotron-3-ultra-openrouter-xhigh) | OpenRouter · nvidia/nemotron-3-ultra-550b-a55b · xhigh reasoning | 69.08 | 79.54 | 58.63 | $0.369 | 100.0% |

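
One way to read the score and cost tradeoff in the rows above is a Pareto frontier over Overall and Cost. The sketch below is illustrative only: the `Entry` type and function names are assumptions, not part of any published Resyst tooling, and the Cost values are used exactly as listed because the page does not state their unit basis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    model: str
    overall: float   # Overall column, as published
    cost_usd: float  # Cost column, as published (unit basis not stated on the page)


# Rows copied from the ranking table (Overall and Cost columns only).
ENTRIES = [
    Entry("DeepSeek V4 Flash", 91.38, 0.086),
    Entry("Qwen3.7 Max", 86.11, 0.851),
    Entry("MiniMax M3", 86.11, 0.157),
    Entry("Claude Opus 4.8", 86.00, 5.981),
    Entry("DeepSeek V4 Pro", 85.86, 0.280),
    Entry("GPT-5.5", 85.64, 4.071),
    Entry("Gemini 3.5 Flash", 81.19, 1.502),
    Entry("Claude Fable 5", 79.22, 11.815),
    Entry("MiniMax M3 Direct Plus", 76.05, 0.0037),
    Entry("Kimi K2.7 Code", 73.28, 0.569),
    Entry("NVIDIA Nemotron 3 Ultra", 69.08, 0.369),
]


def dominates(a: Entry, b: Entry) -> bool:
    """True if a is no worse than b on both axes and strictly better on at least one."""
    no_worse = a.overall >= b.overall and a.cost_usd <= b.cost_usd
    strictly_better = a.overall > b.overall or a.cost_usd < b.cost_usd
    return no_worse and strictly_better


def pareto_frontier(entries: list[Entry]) -> list[Entry]:
    """Entries that no other entry beats on both score and cost."""
    return [e for e in entries if not any(dominates(o, e) for o in entries)]


frontier = sorted(pareto_frontier(ENTRIES), key=lambda e: e.cost_usd)
for e in frontier:
    print(f"{e.model}: overall {e.overall:.2f}, cost ${e.cost_usd}")
```

With the published numbers, only the cheapest entrant and the top-ranked entrant survive on the frontier. Latency and lane scores are left out of this sketch, so it does not replace the full table.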
Resyst Arena

A tactical testbed, not a latency race.

Resyst Arena evaluates spatial strategy in deterministic turn-based games. Each public match summary links to a replay with board states, legal actions, events, and tactical telemetry.

  • Side-swapped series come before any ranking-grade claim.
  • Legal action rate and invalid actions are first-class evidence.
  • Replay artifacts stay linked to tactical summaries.
Open the Arena replay room
Methodology

Scores are claims with receipts.

Separate lanes

Agentic and software-engineering results are preserved as distinct measurements before any publication formula combines them.
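
The publication formula itself is not stated on this page. As an observation only, the published rows are consistent with Overall being the unweighted mean of the Full and SWE columns, within rounding. The sketch below checks that reading. The tolerance and helper name are assumptions, and whether Full corresponds to the agentic lane is not confirmed here.

```python
# (model, Overall, Full, SWE) copied from the published ranking rows.
ROWS = [
    ("DeepSeek V4 Flash", 91.38, 95.80, 86.95),
    ("Qwen3.7 Max", 86.11, 83.24, 88.99),
    ("MiniMax M3", 86.11, 85.33, 86.88),
    ("Claude Opus 4.8", 86.00, 83.32, 88.67),
    ("DeepSeek V4 Pro", 85.86, 92.04, 79.68),
    ("GPT-5.5", 85.64, 85.76, 85.51),
    ("Gemini 3.5 Flash", 81.19, 88.79, 73.59),
    ("Claude Fable 5", 79.22, 77.02, 81.42),
    ("MiniMax M3 Direct Plus", 76.05, 84.32, 67.77),
    ("Kimi K2.7 Code", 73.28, 87.95, 58.61),
    ("NVIDIA Nemotron 3 Ultra", 69.08, 79.54, 58.63),
]

# Assumed tolerance: the table shows two decimals, so allow about one unit in the last place.
TOLERANCE = 0.011


def mean_gap(overall: float, full: float, swe: float) -> float:
    """Absolute difference between the published Overall and the plain mean of the lanes."""
    return abs(overall - (full + swe) / 2)


mismatches = [name for name, o, f, s in ROWS if mean_gap(o, f, s) > TOLERANCE]
print("rows inconsistent with a plain mean:", mismatches)
```

A check like this can only confirm that the data do not contradict a simple mean. A different weighting that happens to coincide on these rows would pass as well.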

Runtime honesty

The same model can appear through different providers or runtimes. The table exposes basis metadata instead of hiding infrastructure differences.

Evidence thresholds

Single runs are artifacts. Stronger claims require repeated series, side swaps, seed variation, and comparable scoring settings.
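
A gate of this kind could be expressed as a small classifier over run metadata. Everything in the sketch below is hypothetical: the `Run` fields, the level names "artifact", "partial", and "series", and the default thresholds are illustrations, not the harness's actual rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    seed: int
    side: str             # hypothetical side labels: "A" or "B"
    scoring_version: str  # identifier for the scoring settings used


def evidence_level(runs: list[Run], min_runs: int = 4, min_seeds: int = 2) -> str:
    """Classify a set of runs; thresholds are illustrative defaults."""
    if not runs:
        raise ValueError("no runs supplied")
    if len(runs) == 1:
        return "artifact"  # a single run supports no comparative claim
    comparable = len({r.scoring_version for r in runs}) == 1
    side_swapped = {r.side for r in runs} >= {"A", "B"}
    seed_varied = len({r.seed for r in runs}) >= min_seeds
    repeated = len(runs) >= min_runs
    if comparable and side_swapped and seed_varied and repeated:
        return "series"
    return "partial"


single = [Run(1, "A", "v1")]
full_series = [Run(1, "A", "v1"), Run(1, "B", "v1"), Run(2, "A", "v1"), Run(2, "B", "v1")]
mixed_settings = full_series[:3] + [Run(2, "B", "v2")]

print(evidence_level(single), evidence_level(full_series), evidence_level(mixed_settings))
```

Returning a named level instead of a boolean keeps the distinction between a lone artifact and an incomplete series visible to whoever publishes the claim.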

Evidence contract

Public by design. Auditable by default.

This site publishes benchmark summaries as versioned data artifacts. The presentation layer is intentionally separate from the scoring harness so rankings can evolve without rewriting the public record.