Methodology · Cohort May 10, 2026 · 32 models

How the ORCFLO Index
works.

The ORCFLO Index is a public benchmark that ranks large language models on the dimensions business decision-makers actually weigh: quality, cost, and speed. We publish results in periodic cohorts so that organizations evaluating models for production use have an independent, repeatable reference.

01 · Why no composite score

Most public benchmarks reduce model performance to a single number. We deliberately don't, because composites hide the tradeoffs that drive real decisions.

A model that ranks #1 on quality but #32 on cost is not interchangeable with a model that ranks #5 on quality and #5 on cost. The first might be right for high-stakes strategic analysis; the second is the better fit for high-volume customer support.

A composite would average those positions into a number that obscures the choice.

We report each dimension on its own independent rank. Readers compose their own weighting based on their use case.
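The "compose your own weighting" idea can be sketched as a simple filter over the three independent ranks. This is an illustration only; the model names and rank values below are invented, not cohort data:

```python
# Hypothetical sketch: shortlisting models from three independent ranks
# instead of one composite score. Lower rank = better.
models = {
    "model-a": (1, 32, 20),   # (quality_rank, cost_rank, speed_rank)
    "model-b": (5, 5, 8),
    "model-c": (12, 2, 3),
}

def shortlist(max_quality_rank, max_cost_rank):
    """Filter by per-dimension rank thresholds the reader chooses."""
    return sorted(
        name for name, (q, c, _s) in models.items()
        if q <= max_quality_rank and c <= max_cost_rank
    )

# High-stakes analysis: quality first, cost secondary.
print(shortlist(max_quality_rank=5, max_cost_rank=32))   # ['model-a', 'model-b']
# High-volume support: quality and cost both matter.
print(shortlist(max_quality_rank=10, max_cost_rank=10))  # ['model-b']
```

The two calls produce different shortlists from the same ranks, which is the point: the weighting lives with the reader, not in the benchmark.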

02 · The three dimensions

Quality. Cost. Speed.
Three ranks, no composite.

Each model is measured on three independent axes. You see them side by side and decide what matters for your workflow.

DIMENSION 01 · QUALITY

Average across 40 test cases, on a 0–100 scale.

Each case score is itself the mean of independent ratings from a four-model judge panel. With 40 cases × 4 judges, a model's quality score is grounded in 160 scoring events per cohort. Higher is better.

Scale: 0–100, reported per case, per category, and overall.

Scoring events per model: 160 per cohort (40 cases × 4 judges).


DIMENSION 02 · COST

Average cost per test case, at each provider's published API pricing.

In published reports we display cost as a multiplier vs. the cohort median — for example, “4.0× median”. The raw per-case figure is preserved as a secondary line for buyers running specific cost calculations. Why the multiplier framing? Raw cost figures have no anchor for a reader scanning the report; “4.0× median” communicates the tradeoff at a glance. Lower is better.

Anchor: median = 1.0× (e.g. "4.0× median").

Detail: raw $ preserved as a secondary line.
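The multiplier framing is just each model's raw figure divided by the cohort median. A minimal sketch using Python's standard library; the per-case costs are invented:

```python
import statistics

# Hypothetical per-case costs in dollars for a small cohort (invented numbers).
cost_per_case = {"model-a": 0.12, "model-b": 0.03, "model-c": 0.06, "model-d": 0.24}

# The cohort median anchors the scale at 1.0×.
median = statistics.median(cost_per_case.values())

# Each model's published figure is its raw cost over the median.
multipliers = {m: round(c / median, 1) for m, c in cost_per_case.items()}
print(multipliers)  # {'model-a': 1.3, 'model-b': 0.3, 'model-c': 0.7, 'model-d': 2.7}
```

The same division applies to speed, with per-case seconds in place of dollars.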

DIMENSION 03 · SPEED

Wall-clock time per case, in seconds, from request to response.

Displayed the same way as cost — as a multiplier vs. the cohort median — for the same reason: an absolute number like “22.8 seconds per case” needs an anchor. Lower is better.

Anchor: median = 1.0× (multiplier vs. cohort median).

Detail: raw seconds preserved as a secondary line.


03 · Test design

40 cases, 8 categories,
one cohort.

Each cohort uses a fixed set of 40 cases spanning 8 categories. The case set is published with the cohort and does not change within a cohort.

Cases that discriminate

Every case has an explicit specification of what behavior it probes — recorded in the case bank as a “what it exposes” field. A case that produces near-identical scores across all 32 models in the field is not doing its job; we replace such cases between cohorts.
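One way to operationalize that replacement rule is to flag cases with too little score spread across the field. A hypothetical sketch; the threshold and scores are invented, and the Index's actual criterion may differ:

```python
import statistics

def flag_non_discriminating(case_scores, min_spread=5.0):
    """Flag cases whose score spread (stdev across models) is too small."""
    return [
        case for case, scores in case_scores.items()
        if statistics.stdev(scores) < min_spread
    ]

scores_by_case = {
    "case-01": [88, 90, 89, 91],   # near-identical across models: not discriminating
    "case-02": [40, 72, 55, 95],   # wide spread: doing its job
}
print(flag_non_discriminating(scores_by_case))  # ['case-01']
```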

Real-world business prompts

Strategic analysis questions, document extraction tasks, summarization assignments, business writing prompts. No academic-style multiple-choice questions, contrived puzzles, or trick questions. The goal is to expose differences that matter for production use.

Abilities · 4 categories

What models produce when given a well-formed prompt.

  • Analysis: strategic judgment, reasoning, disqualifying-factor detection.
  • Extraction: field accuracy, null handling, format compliance, zero fabrication.
  • Summarization: compression quality, key-point retention, length compliance.
  • Writing: tone, structure, persuasion, audience adaptation.

Behaviors · 3 categories

How the models act under pressure.

  • Hallucination Resistance: fabrication detection, factual grounding, source fidelity.
  • Instruction Following: constraint adherence, format compliance, multi-part directives.
  • Refusal Calibration: appropriate refusal vs. over-refusal on legitimate requests.

Stability · 1 category

Repeatability across identical inputs.

  • Output Consistency: run-to-run reproducibility, format stability.

Why 40 cases

5 cases per category × 8 categories = 40 cases per cohort. This balances two needs: enough cases per category to surface meaningful per-category differences, and a total small enough that every model in the cohort can be tested under consistent conditions within a single coordinated run.


Each category is ranked independently across the cohort. A model may lead in one category and lag in another; we report all eight separately so readers can match category strength to their specific use case.

04 · The judge panel

Four judges,
four providers.

Every model is scored by a panel of four independent judge models. A panel is a more stable reference than a single judge: each judge has its own scoring tendencies, and averaging across four dampens them, yielding a more reproducible rank.

  • Google · Gemini 2.5 Pro
  • Anthropic · Claude Opus 4.7
  • OpenAI · GPT 5.5
  • Mistral · Mistral Large

WHAT JUDGES EVALUATE

Case-specific rubrics, not generic helpfulness scores.

Each test case carries its own scoring rubric. Judges are not asked “how good is this response generally?” — they are asked specific questions tied to what the case was designed to probe.

Each case has a published “what it exposes” specification that defines the scoring criteria. Judges apply that case-specific rubric when producing their numeric score.

WHY FOUR

Broad representation, tractable to coordinate.

Four gives us broad representation of frontier models from the vendors in our portfolio while keeping the scoring panel small enough to be tractable for each cohort.

The judge panel composition is published with every cohort.

Judges that are also contestants

Some models in the contestant field are also judges in the panel (for example, GPT 5.5 is judged by a panel that includes GPT 5.5). This is a structural property of evaluating LLMs with LLM judges. Each judge scores every contestant — including itself, where applicable — under identical conditions. We make no claims about whether or how this affects any individual model's score; the relevant facts are disclosed and readers can interpret them.
05 · How judges score

For each (case, contestant) pair.

The same three steps run for every (case, contestant model) combination, every cohort.

  1. Inputs delivered. The judge receives the case prompt, the contestant model's response, and the case-specific rubric.

  2. Score + rationale. The judge produces a numeric quality score on a 0–100 scale and a written rationale citing the rubric criteria.

  3. Independent. Each judge scores independently; judges do not see other judges' scores or rationales.

AGGREGATION

Simple unweighted means, all the way up.

Per case · mean of 4 judge scores

A contestant's score on a single case is the simple unweighted mean of the four judges' scores.

Per category · mean of 5 cases

A contestant's score for a category is the mean across the five cases in that category.

Overall · mean of 40 cases

A contestant's overall quality score is the mean across all 40 cases.

There are no weighting schemes. No judge counts more than another. No case counts more than another.
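The unweighted-mean aggregation can be sketched in a few lines. This is an illustration, assuming scores are stored as `{category: {case: [judge scores]}}`; the numbers are invented:

```python
from statistics import mean

def aggregate(scores):
    """Roll judge scores up to per-case, per-category, and overall means."""
    per_case = {
        (cat, case): mean(judges)
        for cat, cases in scores.items()
        for case, judges in cases.items()
    }
    per_category = {
        cat: mean(per_case[(cat, case)] for case in cases)
        for cat, cases in scores.items()
    }
    # With an equal case count per category, the mean over all cases
    # equals the mean of the category means.
    overall = mean(per_case.values())
    return per_case, per_category, overall

scores = {
    "Analysis": {"a1": [80, 84, 78, 82]},   # case mean 81
    "Writing":  {"w1": [70, 74, 72, 76]},   # case mean 73
}
_, cats, overall = aggregate(scores)
print(cats, overall)
```

No weights appear anywhere in the rollup, matching the "simple unweighted means, all the way up" rule.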

06 · Tier system

Quartile tiers,
per dimension.

For each dimension, we group the cohort into quartiles. Cutoffs are recomputed per cohort from the cohort's actual distribution, not from fixed thresholds.

Band          Quality      Cost       Speed
Top 25%       Leader       Budget     Fast
25–50%        Strong       Standard   Quick
50–75%        Contender    Premium    Moderate
Bottom 25%    Trailing     Highest    Slow
Green = good, red = watch out

In published reports, each tier carries a color: green when the tier is favorable for the buyer, red when it isn't. The same logic applies across all three dimensions — green is always "good for you," red is always "watch out." The exact threshold values are published inside each report so the cutoff is auditable.
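The per-cohort quartile grouping can be sketched as follows. This is a minimal illustration, not the Index's implementation; the cutoff method (`statistics.quantiles`) and the cohort numbers are assumptions:

```python
import statistics

QUALITY_TIERS = ["Trailing", "Contender", "Strong", "Leader"]  # low score → high

def tier(score, cohort_scores, labels=QUALITY_TIERS):
    """Assign a quartile tier using cutoffs recomputed from this cohort."""
    # Cutoffs come from the cohort's actual distribution, not fixed thresholds.
    q1, q2, q3 = statistics.quantiles(cohort_scores, n=4)
    if score <= q1:
        return labels[0]
    if score <= q2:
        return labels[1]
    if score <= q3:
        return labels[2]
    return labels[3]

cohort = [50, 55, 60, 65, 70, 75, 80, 90]  # invented quality scores
print(tier(90, cohort))  # Leader
print(tier(70, cohort))  # Strong
print(tier(50, cohort))  # Trailing
```

For cost and speed, lower raw values are better, so the favorable labels ("Budget", "Fast") attach to the bottom quartile of the raw-value distribution rather than the top.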
07 · Cohorts & reproducibility

Every claim,
traceable.

We publish in cohorts — periodic, dated snapshots of the full field. Each cohort tests a complete set of models in a single coordinated run, uses the same judge panel and case set throughout, and is archived as a permanent record.

In scope for this cohort

  • Text-only, English-only tasks
  • Single-turn prompts (one prompt → one response, no follow-up)
  • Real-world business scenarios

Not tested in this cohort

  • Code generation
  • Multi-turn conversation
  • Multimodal tasks (image, audio, video)
  • Agentic workflows (tool use, multi-step planning, autonomous action)

These omissions reflect the ORCFLO Index's philosophy: it is rooted in business-centric activities, organized as discrete tasks that compose into sophisticated AI workflows.

What every report carries

Every published report carries the information needed to verify and reconstruct its claims. The case set, judge panel, and scoring approach are fixed within a cohort.

Test date: when the cohort was run.
Cohort identifier: a unique slug (e.g. gpt-5-2026-05-10) shown prominently in each report header.
Judge panel: the four judges named explicitly in the methodology footer.
Tier cutoffs: the exact threshold values defining each tier band, published in the Tiering section.
Case specifications: each case has a name and a published description of what it probes; the subscriber report shows these alongside the model's per-case scores.

A reader who wants to verify the rankings could in principle reconstruct them: the same case prompts scored by the same judges produce the same scores.

08 · Limitations

Limitations
worth knowing.

We'd rather state the boundaries of the benchmark up front than have readers discover them later.

Point-in-time pricing

Cost and speed measurements reflect public API pricing and latency at the cohort test date. Provider pricing and infrastructure change frequently; treat absolute dollar and second figures as a snapshot. Relative rankings are the more durable signal.

Production-grade conditions

Models are tested under standard API conditions with no model-specific prompt optimization, fine-tuning, or scoring tricks. A model that performs better in this benchmark with custom prompt engineering may or may not perform proportionally better in production.

Single-language, single-turn

This cohort tests English-language single-turn prompts. Performance on multilingual tasks, conversational threads, or agentic workflows may differ substantially and is not measured here.

Inclusion is not endorsement

Some tested models are not currently offered by ORCFLO. Each report indicates availability status clearly; the cohort represents a test field, not a recommended shortlist.

Cohorts are dated artifacts

Once a cohort is published, its reports do not change. New evaluations are published as new cohorts, preserving the historical reference.

Audience & intended use

The ORCFLO Index is built for business decision-makers. It isn't peer-reviewed research and isn't intended to be cited as one.

Methodology evolution

The ORCFLO Index methodology is documented here for the current release. As we learn from each cohort, we refine the methodology for the next — adding categories, tuning cases that don't discriminate well, expanding the judge panel composition, broadening scope to areas not currently tested.

Material methodology changes are published with each cohort and noted in the cohort's methodology section. The methodology page on this site always reflects the most recently published cohort.

Methodology questions or want to flag specific concerns? info@orcflo.com

Stop guessing

Build with the right models.

You've read how the ORCFLO Index works — now use it. Filter models by your own tasks, swap them in your workflows, and run A/Bs.

500 free credits · No credit card required · Cancel anytime

Current cohort: May 10, 2026
Models in field: 32
Cases per model: 40
Categories: 8
Judge models: 4
Scoring events / model: 160