ORCFLO Index · updated continuously · 32 models tested

Model benchmarks,
for people who work.

The ORCFLO Index evaluates large language models the way business professionals actually use them: across 40 real-world tasks spanning analysis, writing, extraction, summarization, and behavioral reliability. Each model is scored by a panel of four independent judges.

Models scored
32
Anthropic, OpenAI, Google, Mistral
Real-world tasks
40
Across 8 categories of business work
Judge models
4
Independent panel, scores averaged
Latest refresh
May 10
32 reports in the current cohort

Published Benchmarks.

32 model evaluations from the latest cohort. Each report is an analyst-style note covering quality, cost, and speed against the full field.

32 notes
Top of the field · May 10, 2026
OpenAI · Tested May 10, 2026

GPT 5

GPT 5 holds the #1 overall quality rank in a 32-model field with a score of 95.04, ahead of Gemini 3 Pro Preview (93.17) and stablemate GPT 5.5 (93.02). It leads or near-leads across nearly every category measured, including a clean #1 in Refusal Calibration. The trade-off is sev…

Read the full evaluation
Score · vs. 32 models
Anthropic · May 10, 2026

Claude Opus 4.7

Quality
88.6
#12 of 32
Cost
3.9×
#30 of 32
Speed
1.4×
#25 of 32
Anthropic · May 10, 2026

Claude Opus 4.6

Quality
92.8
#4 of 32
Cost
3.3×
#29 of 32
Speed
1.8×
#26 of 32
Anthropic · May 10, 2026

Claude Opus 4.5

Quality
90.7
#9 of 32
Cost
2.3×
#28 of 32
Speed
1.0×
#15 of 32
Anthropic · May 10, 2026

Claude Sonnet 4.6

Quality
88.2
#14 of 32
Cost
1.8×
#24 of 32
Speed
1.3×
#24 of 32
Anthropic · May 10, 2026

Claude Sonnet 4.5

Quality
84.6
#21 of 32
Cost
1.3×
#20 of 32
Speed
1.0×
#17 of 32
Anthropic · May 10, 2026

Claude Sonnet 4

Quality
85.5
#18 of 32
Cost
1.2×
#17 of 32
Speed
0.8×
#14 of 32
Anthropic · May 10, 2026

Claude Haiku 4.5

Quality
84.7
#20 of 32
Cost
0.5×
#13 of 32
Speed
0.5×
#11 of 32
OpenAI · May 10, 2026

GPT 5.5

Quality
93.0
#3 of 32
Cost
6.3×
#32 of 32
Speed
2.2×
#29 of 32
OpenAI · May 10, 2026

GPT 5.4

Quality
90.8
#8 of 32
Cost
2.2×
#25 of 32
Speed
1.3×
#23 of 32
OpenAI · May 10, 2026

GPT 5.2

Quality
91.9
#7 of 32
Cost
1.6×
#23 of 32
Speed
1.1×
#19 of 32
OpenAI · May 10, 2026

GPT 5.1

Quality
92.7
#5 of 32
Cost
1.5×
#22 of 32
Speed
0.7×
#13 of 32
OpenAI · May 10, 2026

GPT 5 Mini

Quality
89.6
#10 of 32
Cost
0.6×
#14 of 32
Speed
2.8×
#31 of 32
OpenAI · May 10, 2026

GPT 5 Nano

Quality
85.9
#16 of 32
Cost
0.2×
#10 of 32
Speed
2.5×
#30 of 32
OpenAI · May 10, 2026

o3

Quality
86.2
#15 of 32
Cost
2.3×
#26 of 32
Speed
1.3×
#22 of 32
OpenAI · May 10, 2026

o4-mini

Quality
85.9
#17 of 32
Cost
1.4×
#21 of 32
Speed
0.7×
#12 of 32
OpenAI · May 10, 2026

o3-mini

Quality
77.4
#28 of 32
Cost
2.3×
#27 of 32
Speed
1.2×
#20 of 32
OpenAI · May 10, 2026

GPT 4.1

Quality
85.0
#19 of 32
Cost
0.8×
#15 of 32
Speed
0.5×
#10 of 32
OpenAI · May 10, 2026

GPT 4.1 Mini

Quality
80.4
#25 of 32
Cost
0.2×
#9 of 32
Speed
0.5×
#9 of 32
OpenAI · May 10, 2026

GPT 4.1 Nano

Quality
71.3
#31 of 32
Cost
0.0×
#2 of 32
Speed
0.4×
#3 of 32
OpenAI · May 10, 2026

GPT 4o

Quality
75.4
#29 of 32
Cost
0.8×
#16 of 32
Speed
0.4×
#6 of 32
OpenAI · May 10, 2026

GPT 4o Mini

Quality
72.8
#30 of 32
Cost
0.0×
#6 of 32
Speed
0.5×
#8 of 32
Google · May 10, 2026

Gemini 3 Pro (Preview)

Quality
93.2
#2 of 32
Cost
1.2×
#18 of 32
Speed
2.1×
#28 of 32
Google · May 10, 2026

Gemini 3 Flash (Preview)

Quality
88.7
#11 of 32
Cost
0.3×
#12 of 32
Speed
1.0×
#16 of 32
Google · May 10, 2026

Gemini 2.5 Pro

Quality
92.0
#6 of 32
Cost
1.2×
#19 of 32
Speed
1.9×
#27 of 32
Google · May 10, 2026

Gemini 2.5 Flash

Quality
88.4
#13 of 32
Cost
0.1×
#7 of 32
Speed
1.0×
#18 of 32
Google · May 10, 2026

Gemini 2.5 Flash-Lite

Quality
80.9
#24 of 32
Cost
0.0×
#4 of 32
Speed
0.4×
#5 of 32
Google · May 10, 2026

Gemini 2.0 Flash

Quality
83.1
#23 of 32
Cost
0.0×
#5 of 32
Speed
0.4×
#4 of 32
Google · May 10, 2026

Gemini 2.0 Flash-Lite

Quality
78.0
#27 of 32
Cost
0.0×
#3 of 32
Speed
0.4×
#2 of 32
Mistral · May 10, 2026

Mistral Large 3 (2512)

Quality
83.2
#22 of 32
Cost
0.2×
#11 of 32
Speed
1.3×
#21 of 32
Mistral · May 10, 2026

Mistral Small 3 (24B)

Quality
79.9
#26 of 32
Cost
0.0×
#1 of 32
Speed
0.4×
#7 of 32
Mistral · May 10, 2026

Codestral (2508)

Quality
71.1
#32 of 32
Cost
0.1×
#8 of 32
Speed
0.2×
#1 of 32

How we test.

Three dimensions, no composite score. A four-model judge panel scores every task against a case-specific rubric. Built for evidence-based model selection, not leaderboard chasing.
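In outline, the scoring described above reduces to two averages: each task's score is the mean of the four judges' rubric scores, and a model's quality score is the mean of those per-task scores. The sketch below illustrates that arithmetic only; the judge names and score values are invented, and ORCFLO's actual rubrics and weighting are not public in this excerpt.

```python
from statistics import mean

# Hypothetical judge identifiers; the real panel composition is not
# specified here.
JUDGES = ["judge-a", "judge-b", "judge-c", "judge-d"]

def task_score(judge_scores: dict[str, float]) -> float:
    """Average the four judges' rubric scores for a single task."""
    return mean(judge_scores[j] for j in JUDGES)

def quality_score(per_task_scores: list[dict[str, float]]) -> float:
    """Mean of per-task panel averages across all tasks (40 in the Index)."""
    return mean(task_score(s) for s in per_task_scores)

# Two example tasks with made-up judge scores on a 0-100 scale:
tasks = [
    {"judge-a": 92.0, "judge-b": 90.0, "judge-c": 94.0, "judge-d": 88.0},
    {"judge-a": 85.0, "judge-b": 87.0, "judge-c": 89.0, "judge-d": 83.0},
]
print(quality_score(tasks))  # 88.5
```

Cost and speed are reported as separate multiples of a baseline rather than folded into this number, which is why the Index publishes three ranks per model instead of one composite.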

Stop guessing

Build with the right models.

The ORCFLO Index is free to read. To filter by your own tasks, swap models in your workflows, and run A/Bs — sign up for ORCFLO.

500 free credits · No credit card required · Cancel anytime