How the ORCFLO Index works.
The ORCFLO Index is a public benchmark that ranks large language models on the dimensions business decision-makers actually weigh: quality, cost, and speed. We publish results in periodic cohorts so that organizations evaluating models for production use have an independent, repeatable reference.
Most public benchmarks reduce model performance to a single number. We deliberately don't, because composites hide the tradeoffs that drive real decisions.
A model that ranks #1 on quality but #32 on cost is not interchangeable with a model that ranks #5 on quality and #5 on cost. The first might be right for high-stakes strategic analysis; the second is the better fit for high-volume customer support.
A composite would average those positions into a number that obscures the choice.
We report each dimension on its own independent rank. Readers compose their own weighting based on their use case.
Quality. Cost. Speed.
Three ranks, no composite.
Each model is measured on three independent axes. You see them side by side and decide what matters for your workflow.
DIMENSION 01 · QUALITY
Average across 40 test cases, on a 0–100 scale.
Each case score is itself the mean of independent ratings from a four-model judge panel. With 40 cases × 4 judges, a model's quality score is grounded in 160 scoring events per cohort. Higher is better.
Scale: 0–100, reported per case, per category, and overall.
Per cohort: 160 scoring events per model.
DIMENSION 02 · COST
Average cost per test case, at each provider's published API pricing.
In published reports we display cost as a multiplier vs. the cohort median — for example, “4.0× median”. The raw per-case figure is preserved as a secondary line for buyers running specific cost calculations. Why the multiplier framing? Raw cost figures have no anchor for a reader scanning the report; “4.0× median” communicates the tradeoff at a glance. Lower is better.
Anchor: median = 1.0× (e.g. "4.0× median")
Detail: raw $, preserved as a secondary line
DIMENSION 03 · SPEED
Wall-clock time per case, in seconds, from request to response.
Displayed the same way as cost — as a multiplier vs. the cohort median — for the same reason: an absolute number like “22.8 seconds per case” needs an anchor. Lower is better.
Anchor: median = 1.0× (multiplier vs. cohort median)
Detail: raw seconds, preserved as a secondary line
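To make the multiplier framing concrete, here is a minimal sketch of the computation, assuming only a table of raw per-model figures; the model names and numbers below are invented placeholders, not cohort data:

```python
from statistics import median

# Illustrative raw measurements for a small cohort (placeholder values only).
raw_cost_per_case = {"model_a": 0.08, "model_b": 0.32, "model_c": 0.02}    # dollars
raw_seconds_per_case = {"model_a": 11.4, "model_b": 22.8, "model_c": 5.7}  # seconds

def to_multipliers(raw: dict[str, float]) -> dict[str, float]:
    """Express each raw figure as a multiple of the cohort median (median = 1.0x)."""
    anchor = median(raw.values())
    return {model: round(value / anchor, 1) for model, value in raw.items()}

print(to_multipliers(raw_cost_per_case))     # {'model_a': 1.0, 'model_b': 4.0, 'model_c': 0.2}
print(to_multipliers(raw_seconds_per_case))  # {'model_a': 1.0, 'model_b': 2.0, 'model_c': 0.5}
```

The same division-by-median step produces both the cost and the speed multipliers; only the raw inputs differ.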
40 cases, 8 categories, one cohort.
Each cohort uses a fixed set of 40 cases spanning 8 categories. The case set is published with the cohort and does not change within a cohort.
Cases that discriminate
Every case has an explicit specification of what behavior it probes — recorded in the case bank as a “what it exposes” field. A case that produces near-identical scores across all 32 models in the field is not doing its job; we replace such cases between cohorts.
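As a rough illustration of that replacement criterion, a per-case spread check might look like the sketch below; the standard-deviation measure and the threshold are our own assumptions for illustration, not the Index's published rule:

```python
from statistics import pstdev

def flags_low_discrimination(scores_by_model: dict[str, float], min_spread: float = 3.0) -> bool:
    """Flag a case whose scores are nearly identical across the field.

    scores_by_model maps each contestant to its 0-100 score on one case;
    min_spread is an illustrative threshold, not a published cutoff.
    """
    return pstdev(scores_by_model.values()) < min_spread

# A case on which all 32 models score ~80 contributes no ranking signal and
# would be a candidate for replacement in the next cohort.
flat_case = {f"model_{i:02d}": 80.0 + (i % 2) for i in range(32)}
print(flags_low_discrimination(flat_case))  # True
```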
Real-world business prompts
Strategic analysis questions, document extraction tasks, summarization assignments, business writing prompts. No academic-style multiple-choice questions, contrived puzzles, or trick questions. The goal is to expose differences that matter for production use.
Abilities
4 categories
What models produce when given a well-formed prompt.
Analysis
Strategic judgment, reasoning, disqualifying-factor detection.
Extraction
Field accuracy, null handling, format compliance, zero fabrication.
Summarization
Compression quality, key-point retention, length compliance.
Writing
Tone, structure, persuasion, audience adaptation.
Behaviors
3 categories
How the models act under pressure.
Hallucination Resistance
Fabrication detection, factual grounding, source fidelity.
Instruction Following
Constraint adherence, format compliance, multi-part directives.
Refusal Calibration
Appropriate refusal vs. over-refusal on legitimate requests.
Stability
1 category
Repeatability across identical inputs.
Output Consistency
Run-to-run reproducibility, format stability.
Why 40 cases
5 cases per category × 8 categories = 40 cases per cohort. This balances enough cases per category to surface meaningful per-category differences, and a total small enough that all models in the cohort can be tested under consistent conditions within a single coordinated run.
Each category is ranked independently across the cohort. A model may lead in one category and lag in another; we report all eight separately so readers can match category strength to their specific use case.
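A small sketch of what independent per-category ranking means in practice; the category names are the Index's own, while the models and scores are placeholders:

```python
# Rank each category independently: a model's position in one category says
# nothing about its position in another. Scores are illustrative placeholders.
category_scores = {
    "Analysis":   {"model_a": 88.0, "model_b": 74.0, "model_c": 81.0},
    "Extraction": {"model_a": 70.0, "model_b": 92.0, "model_c": 85.0},
}

for category, scores in category_scores.items():
    ranking = sorted(scores, key=scores.get, reverse=True)  # highest score first
    print(category, ranking)
# Analysis   ['model_a', 'model_c', 'model_b']
# Extraction ['model_b', 'model_c', 'model_a']
```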
Four judges, four providers.
Every model is scored by a panel of four independent judge models. Using a panel rather than a single judge produces a more stable reference: any single judge has its own scoring tendencies, and averaging across four produces a more reproducible rank.
Google
Gemini 2.5 Pro
Anthropic
Claude Opus 4.7
OpenAI
GPT 5.5
Mistral
Mistral Large
WHAT JUDGES EVALUATE
Case-specific rubrics, not generic helpfulness scores.
Each test case carries its own scoring rubric. Judges are not asked “how good is this response generally?” — they are asked specific questions tied to what the case was designed to probe.
Each case has a published “what it exposes” specification that defines the scoring criteria. Judges apply that case-specific rubric when producing their numeric score.
WHY FOUR
Broad representation, tractable to coordinate.
Four gives us broad representation of frontier models from the vendors in our portfolio while keeping the scoring panel small enough to be tractable for each cohort.
The judge panel composition is published with every cohort.
For each (case, contestant) pair.
The same three steps run for every (case, contestant model) combination, every cohort.
STEP 01
Inputs delivered
The judge receives the case prompt, the contestant model's response, and the case-specific rubric.
STEP 02
Score + rationale
The judge produces a numeric quality score on a 0–100 scale and a written rationale citing the rubric criteria.
STEP 03
Independent scoring
Each judge scores independently — judges do not see other judges' scores or rationales.
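Put as code, the per-pair protocol reduces to a short loop. This is a sketch only: ask_judge is a hypothetical stand-in for whatever API call each judge provider exposes, not a real client.

```python
from dataclasses import dataclass

@dataclass
class JudgeScore:
    judge: str
    score: float    # 0-100, per the case-specific rubric
    rationale: str  # written justification citing rubric criteria

def ask_judge(judge: str, case_prompt: str, response: str, rubric: str) -> JudgeScore:
    """Hypothetical call to one judge model's API; not a real client."""
    raise NotImplementedError

def score_pair(judges: list[str], case_prompt: str, response: str, rubric: str) -> list[JudgeScore]:
    """Run the three-step protocol for one (case, contestant) pair.

    Each judge receives the case prompt, the contestant's response, and the
    case-specific rubric (step 1), returns a 0-100 score plus rationale
    (step 2), and never sees the other judges' outputs (step 3): the calls
    are made separately, with no shared state between them.
    """
    return [ask_judge(judge, case_prompt, response, rubric) for judge in judges]
```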
AGGREGATION
Simple unweighted means, all the way up.
Per case
mean of 4 judge scores
A contestant's score on a single case is the simple unweighted mean of the four judges' scores.
Per category
mean of 5 cases
A contestant's score for a category is the mean across the five cases in that category.
Overall
mean of 40 cases
A contestant's overall quality score is the mean across all 40 cases.
There are no weighting schemes. No judge counts more than another. No case counts more than another.
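The whole roll-up can be written in a few lines. The nested mapping below (category to case to the four judge scores) is an assumed data shape for illustration, not the Index's internal format:

```python
from statistics import mean

def aggregate(judge_scores: dict[str, dict[str, list[float]]]) -> tuple[dict[str, float], float]:
    """Roll up judge scores with simple unweighted means at every level.

    judge_scores maps category -> case -> the four judges' 0-100 scores.
    Returns (per-category means, overall mean across all case means).
    """
    case_means = {
        category: [mean(scores) for scores in cases.values()]  # mean of 4 judges per case
        for category, cases in judge_scores.items()
    }
    per_category = {category: mean(means) for category, means in case_means.items()}  # mean of 5 cases
    overall = mean(m for means in case_means.values() for m in means)  # mean of all 40 cases
    return per_category, overall
```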
Quartile tiers, per dimension.
For each dimension, we group the cohort into quartiles. Cutoffs are recomputed per cohort from the cohort's actual distribution, not from fixed thresholds.
Top 25%: Quality = Leader · Cost = Budget · Speed = Fast
25–50%: Quality = Strong · Cost = Standard · Speed = Quick
50–75%: Quality = Contender · Cost = Premium · Speed = Moderate
Bottom 25%: Quality = Trailing · Cost = Highest · Speed = Slow
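A sketch of how cohort-relative cutoffs can translate into tier labels for the quality dimension; the quartile function and comparison direction here are our reading of the table above, and exact cutoff and tie handling follow each cohort's published values:

```python
from statistics import quantiles

# Quality tier labels, best quartile first.
QUALITY_TIERS = ["Leader", "Strong", "Contender", "Trailing"]

def quality_tier(model_score: float, cohort_scores: list[float]) -> str:
    """Assign a quality tier from cutoffs recomputed on this cohort's distribution.

    For quality, higher is better, so the top quartile starts at the 75th
    percentile. Cost and speed would use the same cutoff logic with the
    comparison reversed, since lower is better on those dimensions.
    """
    q1, q2, q3 = quantiles(cohort_scores, n=4)  # cohort-specific quartile cutoffs
    if model_score >= q3:
        return QUALITY_TIERS[0]  # Leader
    if model_score >= q2:
        return QUALITY_TIERS[1]  # Strong
    if model_score >= q1:
        return QUALITY_TIERS[2]  # Contender
    return QUALITY_TIERS[3]      # Trailing
```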
Every claim, traceable.
We publish in cohorts — periodic, dated snapshots of the full field. Each cohort tests a complete set of models in a single coordinated run, uses the same judge panel and case set throughout, and is archived as a permanent record.
In scope for this cohort
- Text-only, English-only tasks
- Single-turn prompts (one prompt → one response, no follow-up)
- Real-world business scenarios
Not tested in this cohort
- Code generation
- Multi-turn conversation
- Multimodal tasks (image, audio, video)
- Agentic workflows (tool use, multi-step planning, autonomous action)
These omissions reflect the philosophy behind the ORCFLO Index: it is rooted in business-centric activities, organized as discrete tasks that can later be combined into more sophisticated AI workflows.
What every report carries
Every published report carries the information needed to verify and reconstruct its claims. The case set, judge panel, and scoring approach are fixed within a cohort.
- Test date: when the cohort was run.
- Cohort identifier: a unique slug (e.g. gpt-5-2026-05-10), shown prominently in each report header.
- Judge panel: the four judges, named explicitly in the methodology footer.
- Tier cutoffs: the exact threshold values defining each tier band, published in the Tiering section.
- Case specifications: each case has a name and a published description of what it probes; the subscriber report shows these alongside the model's per-case scores.
A reader who wants to verify the rankings could in principle reconstruct them: the same case prompts scored by the same judges produce the same scores.
Limitations worth knowing.
We'd rather state the boundaries of the benchmark up front than have readers discover them later.
Point-in-time pricing
Cost and speed measurements reflect public API pricing and latency at the cohort test date. Provider pricing and infrastructure change frequently; treat absolute dollar and second figures as a snapshot. Relative rankings are the more durable signal.
Production-grade conditions
Models are tested under standard API conditions with no model-specific prompt optimization, fine-tuning, or scoring tricks. A model that performs better in this benchmark with custom prompt engineering may or may not perform proportionally better in production.
Single-language, single-turn
This cohort tests English-language single-turn prompts. Performance on multilingual tasks, conversational threads, or agentic workflows may differ substantially and is not measured here.
Inclusion is not endorsement
Some tested models are not currently offered by ORCFLO. Each report indicates availability status clearly; the cohort represents a test field, not a recommended shortlist.
Cohorts are dated artifacts
Once a cohort is published, its reports do not change. New evaluations are published as new cohorts, preserving the historical reference.
Audience & intended use
The ORCFLO Index is built for business decision-makers. It isn't peer-reviewed research and isn't intended to be cited as one.
Methodology evolution
The ORCFLO Index methodology is documented here for the current release. As we learn from each cohort, we refine the methodology for the next — adding categories, tuning cases that don't discriminate well, expanding the judge panel composition, broadening scope to areas not currently tested.
Material methodology changes are published with each cohort and noted in the cohort's methodology section. The methodology page on this site always reflects the most recently published cohort.
Have methodology questions, or want to flag a specific concern? info@orcflo.com