Agentic discipline
Structured outputs, tool-use boundaries, instruction following, grounded reasoning, and hallucination resistance.
Independent model evaluation · public artifacts · tactical evidence
Resyst Labs Benchmarks is a public evidence surface for comparing AI systems across agentic reasoning, software engineering, reliability, runtime economics, and Resyst Arena duels.
Practical implementation quality, final-answer usefulness, source handling, and architecture cleanliness.
Turn-based spatial duels where legal-action discipline and tactical continuity are measured separately from runtime telemetry.
Local and API-backed systems share a single tournament view. Provider, runtime, cost, latency, reliability, and lane basis remain visible so comparisons stay honest.
| Rank | Model | Basis | Overall | Full | SWE | Cost | Reliability |
|---|---|---|---|---|---|---|---|
| #1 | DeepSeek V4 Flash (`deepseek-v4-flash-direct`) | DeepSeek direct API | 91.38 | 95.80 | 86.95 | $0.086 | 100.0% |
| #2 | Qwen3.7 Max (`qwen3.7-max-openrouter-xhigh`) | OpenRouter · extra-high reasoning | 86.11 | 83.24 | 88.99 | $0.851 | 100.0% |
| #3 | MiniMax M3 (`minimax-m3-openrouter-xhigh`) | OpenRouter · extra-high reasoning | 86.11 | 85.33 | 86.88 | $0.157 | 100.0% |
| #4 | Claude Opus 4.8 (`claude-opus-4.8-openrouter-xhigh`) | OpenRouter · extra-high reasoning | 86.00 | 83.32 | 88.67 | $5.981 | 100.0% |
| #5 | DeepSeek V4 Pro (`deepseek-v4-pro-direct`) | DeepSeek direct API | 85.86 | 92.04 | 79.68 | $0.280 | 100.0% |
| #6 | GPT‑5.5 (`gpt-5.5-openrouter-xhigh`) | OpenRouter · extra-high reasoning | 85.64 | 85.76 | 85.51 | $4.071 | 100.0% |
| #7 | Gemini 3.5 Flash (`gemini-3.5-flash-openrouter`) | OpenRouter · google/gemini-3.5-flash | 81.19 | 88.79 | 73.59 | $1.502 | 100.0% |
| #8 | Claude Fable 5 (`claude-fable-5-openrouter-xhigh`) | OpenRouter · extra-high reasoning | 79.22 | 77.02 | 81.42 | $11.815 | 100.0% |
| #9 | MiniMax M3 Direct Plus (`minimax-m3-direct-anthropic`) | MiniMax Plus · direct API · SWE xhigh | 76.05 | 84.32 | 67.77 | $0.0037 | 100.0% |
| #10 | Kimi K2.7 Code (`kimi-k2.7-code-openrouter-xhigh`) | OpenRouter · extra-high reasoning | 73.28 | 87.95 | 58.61 | $0.569 | 100.0% |
| #11 | NVIDIA Nemotron 3 Ultra (`nemotron-3-ultra-openrouter-xhigh`) | OpenRouter · nvidia/nemotron-3-ultra-550b-a55b · xhigh reasoning | 69.08 | 79.54 | 58.63 | $0.369 | 100.0% |
Resyst Arena evaluates spatial strategy in deterministic turn-based games. Each public match summary links to a replay with board states, legal actions, events, and tactical telemetry.
Kimi K2.7 Code leads 2–0. Replays stay grouped under this model-vs-model encounter, including side-swapped rounds.
DeepSeek V4 Flash leads 2–0. Replays stay grouped under this model-vs-model encounter, including side-swapped rounds.
Gemini 3 Flash Preview won the replay. Replays stay grouped under this model-vs-model encounter, including side-swapped rounds.
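To make the replay contents concrete, here is a minimal sketch of how one turn of a replay might be represented and how legal-action discipline could be computed from it. The `Turn` fields, the action names, and the `legal_action_rate` helper are all hypothetical; the actual replay schema is not described on this page.

```python
from dataclasses import dataclass, field


@dataclass
class Turn:
    """One turn of a replay. All field names here are hypothetical."""
    player: str
    board: list[str]           # board state as rows of text
    legal_actions: list[str]   # actions the engine allowed on this turn
    chosen_action: str         # action the model actually submitted
    events: list[str] = field(default_factory=list)


def legal_action_rate(turns: list[Turn]) -> float:
    """Fraction of turns on which the submitted action was legal."""
    if not turns:
        return 0.0
    legal = sum(1 for t in turns if t.chosen_action in t.legal_actions)
    return legal / len(turns)


# Illustrative data only, not taken from a published replay.
replay = [
    Turn("A", ["..", "A."], ["move_n", "move_e"], "move_n"),
    Turn("B", ["A.", ".B"], ["move_w", "move_n"], "jump"),  # illegal action
    Turn("A", ["A.", ".B"], ["move_e", "move_s"], "move_e"),
]
print(f"legal action rate: {legal_action_rate(replay):.2f}")
```

Keeping the legal-action set next to the chosen action in each turn means discipline can be recomputed from the replay alone, without consulting runtime telemetry.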
Agentic and software-engineering results are preserved as distinct measurements before any publication formula combines them.
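As an illustration of keeping the two measurements separate until publication, the sketch below stores the Full and SWE scores independently and combines them only on demand. The unweighted mean (the `w_full=0.5` default) is an assumption, not the documented formula; it is used here because the published Overall values in the table are consistent with it to within rounding. `LaneScores` and `overall` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LaneScores:
    full: float  # "Full" column of the table
    swe: float   # "SWE" column of the table


def overall(scores: LaneScores, w_full: float = 0.5) -> float:
    """Weighted combination of the two lanes; the 0.5 default is an assumption."""
    return w_full * scores.full + (1.0 - w_full) * scores.swe


# Three rows copied from the table: (lane scores, published Overall).
rows = {
    "deepseek-v4-flash-direct": (LaneScores(95.80, 86.95), 91.38),
    "deepseek-v4-pro-direct": (LaneScores(92.04, 79.68), 85.86),
    "gemini-3.5-flash-openrouter": (LaneScores(88.79, 73.59), 81.19),
}

for model_id, (scores, published) in rows.items():
    print(model_id, round(overall(scores), 3), published)
```

Because the lane scores are stored unchanged, a different weighting can be applied later without touching the underlying measurements.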
The same model can appear through different providers or runtimes. The table exposes basis metadata instead of hiding infrastructure differences.
Single runs are artifacts. Stronger claims require repeated series, side swaps, seed variation, and comparable scoring settings.
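A repeated series with side swaps and seed variation could be planned and summarized along the lines below. This is a sketch under stated assumptions: `series_plan` and `summarize` are hypothetical helpers, the model names are placeholders, and the scores are invented for illustration rather than taken from any published run.

```python
import itertools
import statistics


def series_plan(model_a, model_b, seeds):
    """Schedule every seed twice, once with each model moving first."""
    sides = [(model_a, model_b), (model_b, model_a)]
    return [
        {"seed": seed, "first": first, "second": second}
        for seed, (first, second) in itertools.product(seeds, sides)
    ]


def summarize(scores):
    """Mean and sample standard deviation of scores from repeated runs."""
    spread = statistics.stdev(scores) if len(scores) > 1 else None
    return {"n": len(scores), "mean": statistics.mean(scores), "stdev": spread}


plan = series_plan("model-a", "model-b", seeds=[1, 2, 3])
print(len(plan), "matches scheduled")

# Invented scores for illustration only.
print(summarize([84.0, 86.5, 85.5]))
```

Reporting the run count and spread alongside the mean makes it visible when a claim rests on a single run, where no spread can be computed at all.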
This site publishes benchmark summaries as versioned data artifacts. The presentation layer is intentionally separate from the scoring harness so rankings can evolve without rewriting the public record.
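The separation between the published record and the presentation layer might look roughly like this: the artifact is plain versioned data, and the renderer only reads, checks the version, and sorts. The `schema_version` field, the entry keys, and `render_rows` are assumptions made for illustration; the two entries reuse values from the table above.

```python
import json

# A hypothetical artifact; key names are assumptions, values come from the table.
ARTIFACT = json.dumps({
    "schema_version": "1",
    "entries": [
        {"model_id": "minimax-m3-openrouter-xhigh",
         "basis": "OpenRouter · extra-high reasoning", "overall": 86.11},
        {"model_id": "deepseek-v4-flash-direct",
         "basis": "DeepSeek direct API", "overall": 91.38},
    ],
})

SUPPORTED_VERSIONS = {"1"}


def render_rows(artifact_json):
    """Presentation only: rank and format entries without rescoring them."""
    data = json.loads(artifact_json)
    if data["schema_version"] not in SUPPORTED_VERSIONS:
        raise ValueError(f"unsupported schema_version: {data['schema_version']}")
    ranked = sorted(data["entries"], key=lambda e: e["overall"], reverse=True)
    return [
        f"#{rank} {e['model_id']} ({e['basis']}): {e['overall']:.2f}"
        for rank, e in enumerate(ranked, start=1)
    ]


for line in render_rows(ARTIFACT):
    print(line)
```

Since the renderer never writes back to the artifact, a change in ranking logic yields a new view of the same record instead of a rewritten one.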