ORCFLO Index · updated continuously · 32 models tested

Model benchmarks,
for people who work.

The ORCFLO Index evaluates large language models the way business professionals actually use them: on 40 real-world tasks in 8 categories of business work, spanning analysis, writing, extraction, summarization, and behavioral reliability. Each model is scored by a panel of four independent judge models, and the judges' scores are averaged.


Models scored: 32

Anthropic, OpenAI, Google, Mistral

Real-world tasks: 40

Across 8 categories of business work

Judge models: 4

Independent panel, scores averaged

Latest refresh: May 10, 2026

32 reports in the current cohort

Published Benchmarks.

32 model evaluations from the latest cohort. Each report is an analyst-style note covering quality, cost, and speed against the full field.


Top of the field · May 10, 2026

OpenAI · Tested May 10, 2026

GPT 5

GPT 5 holds the #1 overall quality rank in a 32-model field with a score of 95.04, ahead of Gemini 3 Pro Preview (93.17) and stablemate GPT 5.5 (93.02). It leads or near-leads across nearly every category measured, including a clean #1 in Refusal Calibration. The trade-off is sev…

Published Benchmarks.

GPT 5

Claude Opus 4.7

Claude Opus 4.6

Claude Opus 4.5

Claude Sonnet 4.6

Claude Sonnet 4.5

Claude Sonnet 4

Claude Haiku 4.5

GPT 5.5

GPT 5.4

GPT 5.2

GPT 5.1

GPT 5 Mini

GPT 5 Nano

o3

o4-mini

o3-mini

GPT 4.1

GPT 4.1 Mini

GPT 4.1 Nano

GPT 4o

GPT 4o Mini

Gemini 3 Pro (Preview)

Gemini 3 Flash (Preview)

Gemini 2.5 Pro

Gemini 2.5 Flash

Gemini 2.5 Flash-Lite

Gemini 2.0 Flash

Gemini 2.0 Flash-Lite

Mistral Large 3 (2512)

Mistral Small 3 (24B)

Codestral (2508)

How we test.

Three dimensions, separately ranked

Independent four-model panel

Case-specific rubrics
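
As a rough illustration of how a four-judge panel average and separate per-dimension rankings could fit together, here is a minimal Python sketch. It is not the ORCFLO implementation: the function names, the field names (quality, cost_usd, latency_s), the sample numbers, and the choice to average panel scores across tasks are all assumptions made for the example. Case-specific rubrics are not modeled.

```python
from statistics import mean


def panel_score(judge_scores):
    """Average one response's scores across the judge panel."""
    return mean(judge_scores)


def model_quality(task_scores):
    """Mean of per-task panel averages (assumed aggregation, not from the source).

    task_scores is a list with one entry per task; each entry is the list
    of scores the judges gave for that task.
    """
    return mean(panel_score(scores) for scores in task_scores)


def rank(models, key, higher_is_better):
    """Return {model name: rank} for a single dimension; rank 1 is best."""
    ordered = sorted(models, key=lambda m: m[key], reverse=higher_is_better)
    return {m["name"]: position for position, m in enumerate(ordered, start=1)}


# Hypothetical models, two tasks each, four judge scores per task.
models = [
    {
        "name": "model-a",
        "quality": model_quality([[90, 92, 94, 96], [80, 84, 88, 92]]),
        "cost_usd": 4.0,
        "latency_s": 12.0,
    },
    {
        "name": "model-b",
        "quality": model_quality([[70, 72, 74, 76], [60, 64, 68, 72]]),
        "cost_usd": 0.5,
        "latency_s": 3.0,
    },
]

# Each dimension gets its own ranking; nothing is blended into one number.
quality_rank = rank(models, "quality", higher_is_better=True)
cost_rank = rank(models, "cost_usd", higher_is_better=False)
speed_rank = rank(models, "latency_s", higher_is_better=False)
```

Keeping the three rankings separate means a model can be first on quality and last on cost without either fact hiding the other, which is the trade-off the reports describe.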

Build with the right models.