Model benchmarks, for people who work.
The ORCFLO Index evaluates large language models the way business professionals actually use them: across 40 real-world tasks spanning analysis, writing, extraction, summarization, and behavioral reliability. Each model is scored by a panel of four independent judges.
Published Benchmarks.
32 model evaluations from the latest cohort. Each report is an analyst-style note covering quality, cost, and speed against the full field.
GPT 5
GPT 5 holds the #1 overall quality rank in a 32-model field with a score of 95.04, ahead of Gemini 3 Pro Preview (93.17) and stablemate GPT 5.5 (93.02). It leads or sits near the lead in nearly every category measured, including a clean #1 in Refusal Calibration. The trade-off is sev…
Read the full evaluation
Claude Opus 4.7
Claude Opus 4.6
Claude Opus 4.5
Claude Sonnet 4.6
Claude Sonnet 4.5
Claude Sonnet 4
Claude Haiku 4.5
GPT 5.5
GPT 5.4
GPT 5.2
GPT 5.1
GPT 5 Mini
GPT 5 Nano
o3
o4-mini
o3-mini
GPT 4.1
GPT 4.1 Mini
GPT 4.1 Nano
GPT 4o
GPT 4o Mini
Gemini 3 Pro (Preview)
Gemini 3 Flash (Preview)
Gemini 2.5 Pro
Gemini 2.5 Flash
Gemini 2.5 Flash-Lite
Gemini 2.0 Flash
Gemini 2.0 Flash-Lite
Mistral Large 3 (2512)
Mistral Small 3 (24B)
Codestral (2508)
How we test.
Three dimensions, no composite. A four-model judge panel scoring against case-specific rubrics. Built for evidence-based model selection — not leaderboard chasing.
Three dimensions, separately ranked
Quality, cost, and speed each get their own rank — no composite scores. Composites hide the tradeoffs that drive real model-selection decisions.
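To make "no composite" concrete, here is a minimal sketch of per-dimension ranking. The model names, metric fields, and numbers are placeholders, not ORCFLO data or the actual pipeline:

```python
# Minimal sketch of per-dimension ranking; illustrative only, not the
# ORCFLO pipeline. Model names, metric fields, and numbers are placeholders.
results = {
    "model-a": {"quality": 95.0, "cost_usd_per_1k": 0.012, "speed_tok_s": 40.0},
    "model-b": {"quality": 93.0, "cost_usd_per_1k": 0.004, "speed_tok_s": 110.0},
    "model-c": {"quality": 90.5, "cost_usd_per_1k": 0.001, "speed_tok_s": 180.0},
}

def rank(models, key, higher_is_better=True):
    """Return {model_name: rank} for one dimension, where rank 1 is best."""
    ordered = sorted(models, key=lambda m: models[m][key], reverse=higher_is_better)
    return {m: i + 1 for i, m in enumerate(ordered)}

# Each dimension keeps its own leaderboard; nothing is blended into a composite.
quality_rank = rank(results, "quality")
cost_rank = rank(results, "cost_usd_per_1k", higher_is_better=False)  # cheaper is better
speed_rank = rank(results, "speed_tok_s")

for m in results:
    print(f"{m}: quality #{quality_rank[m]}  cost #{cost_rank[m]}  speed #{speed_rank[m]}")
```

Keeping the three leaderboards separate means a model can be #1 on quality and #20 on cost at the same time, and the reader sees both.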
Independent four-model panel
Every response is scored by Gemini 2.5 Pro, Claude Opus 4.7, GPT 5.5, and Mistral Large. Judges score independently; the final quality score is the average of the four.
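In rough terms, the panel score reduces to a plain average of four independent rubric scores. A minimal sketch with placeholder scores, standing in for the real judging harness:

```python
# Minimal sketch of panel averaging. The judge list mirrors the panel named
# above; the function and the score values are placeholders, not the
# production harness.
JUDGES = ("gemini-2.5-pro", "claude-opus-4.7", "gpt-5.5", "mistral-large")

def panel_quality(scores_by_judge: dict) -> float:
    """Average four independent 0-100 rubric scores into one quality score."""
    missing = set(JUDGES) - set(scores_by_judge)
    if missing:
        raise ValueError(f"every judge must score the response; missing {missing}")
    return sum(scores_by_judge[j] for j in JUDGES) / len(JUDGES)

# Placeholder scores for a single response to a single test case:
print(panel_quality({
    "gemini-2.5-pro": 92.0,
    "claude-opus-4.7": 95.0,
    "gpt-5.5": 94.0,
    "mistral-large": 90.0,
}))  # -> 92.75
```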
Case-specific rubrics
Every test case ships with a rubric written specifically for what that case is measuring. Each rubric is authored before any model is run, so there is no scoring drift.
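One way to picture a case-specific rubric is as a small immutable record frozen to the test case before any model output exists. A minimal sketch; the field names and example case are invented for illustration, not the ORCFLO schema:

```python
# Minimal sketch of a case-specific rubric frozen before any model runs.
# Field names and the example case are invented; this is not the ORCFLO schema.
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen: authored once, not edited after runs begin
class Rubric:
    case_id: str
    what_it_measures: str
    criteria: tuple  # each criterion is checked by every judge

invoice_rubric = Rubric(
    case_id="extraction-07",
    what_it_measures="Pulling line items from a messy invoice into a clean table",
    criteria=(
        "Every line item in the source appears exactly once",
        "Amounts and currencies are transcribed without alteration",
        "No invented rows, totals, or fields",
    ),
)
```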