Scores are deterministic. AI writes editorial prose only. The public site reads a published snapshot through the DAL, never from a live benchmark scrape. When the methodology changes, the update is versioned and applied only through a republished snapshot, so buyers always read the same rules that produced the current roster.
The two scores that determine the public leaderboard.
Quality and Value are the only official PickAI scores. They rank the same published roster with different rules, so buyers can compare raw capability separately from buying power.
Quality Score
Shared benchmark pool
0.40 HLE + 0.20 coding + 0.20 factual + 0.10 context + 0.10 speed
40% HLE - The universal floor and primary quality anchor.
20% Coding - Only scored when the benchmark exists across the eligible pool; missing optional signals are reweighted.
20% Factual grounding - Only scored when the benchmark exists across the eligible pool; missing optional signals are reweighted.
10% Context - Included only when the context benchmark is broadly available; otherwise it stays display-only.
10% Speed - Observed performance stays separate from price and editorial prose; it is not a hidden editorial input.
Price is not part of the Quality formula. Any signal that is not broadly available stays display-only and is removed from the score calculation.
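As a rough illustration of how this weighting and the reweighting rule behave, the sketch below applies the published weights to normalized 0-100 signals. The type, field names, and helper are assumptions for illustration, not the production PickAI code.

```ts
// Illustrative only: the type, field names, and normalization are assumptions,
// not the production PickAI implementation.
type QualitySignals = {
  hle: number;      // required floor, normalized 0-100
  coding?: number;  // optional: scored only when broadly available
  factual?: number; // optional
  context?: number; // optional
  speed?: number;   // optional
};

const QUALITY_WEIGHTS = {
  hle: 0.4,
  coding: 0.2,
  factual: 0.2,
  context: 0.1,
  speed: 0.1,
} as const;

function qualityScore(s: QualitySignals): number {
  // Keep only the signals that are actually present for this model.
  const present = (Object.keys(QUALITY_WEIGHTS) as (keyof typeof QUALITY_WEIGHTS)[])
    .filter((k) => typeof s[k] === "number");

  // Missing optional signals are dropped and the remaining weights are
  // renormalized so they still sum to 1 (deterministic reweighting).
  const totalWeight = present.reduce((sum, k) => sum + QUALITY_WEIGHTS[k], 0);

  return present.reduce(
    (score, k) => score + (s[k] as number) * (QUALITY_WEIGHTS[k] / totalWeight),
    0,
  );
}
```

Price never enters a function like this; it stays display-only, in line with the rule above.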
Official vs raw views
Quality and Value are scores. Intelligence and Coding are benchmark lenses.
The product has two official scores and two raw benchmark views. Intelligence exposes Humanity's Last Exam directly. Coding exposes SWE-bench Verified directly. Both keep companion benchmarks visible without turning them into extra PickAI scores.
Official scores
Quality ranks capability across the shared benchmark pool. Value ranks buying power through HLE plus published API token cost.
Raw benchmark views
Intelligence and Coding show direct benchmark evidence beside the same roster. They surface supporting signals, but they do not rewrite the official score logic.
What stays separate
Display-only signals stay visible without becoming hidden score inputs.
Subscription pricing can appear on Quality pages without changing the Quality score.
LiveCodeBench and Aider Polyglot can appear beside SWE-bench Verified without changing Coding order.
Score coverage
What the official scores include, and what the raw benchmark views show.
Y means the signal is part of the ranking rule. ? means the signal is tracked and shown when available, but it does not yet determine the public order. X means the signal stays out of the scoring formula.
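Read as data, the legend amounts to three coverage states. The names below are illustrative, not taken from the product schema.

```ts
// Illustrative mapping of the Y / ? / X legend; names are assumptions.
type SignalCoverage =
  | "ranked"    // Y: part of the ranking rule
  | "tracked"   // ?: shown when available, does not set the public order
  | "excluded"; // X: stays out of the scoring formula
```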
Official PickAI scores
These cards define the published order.
Quality and Value are formula-driven. The score order is public, deterministic, and separate from the benchmark-only tabs.
Quality Score
Price is shown for buyer context, but it does not affect the Quality score. Optional signals are scored only when they are broadly available across the eligible published roster.
HLE
Required floor and primary capability anchor.
Coding / math
Scored only when the benchmark is broadly available.
Supporting buyer signal
Ease of use stays outside the rank.
45 app access + 30 no setup + 15 free tier + 10 official surfaces
45 points - Consumer web app exists: a hosted, browser-accessible product a normal buyer can open and use today.
30 points - No technical setup required: the normal use path does not require an API key, terminal, or developer console.
15 points - Free tier or trial available: a buyer can test the product before paying.
10 points - Multiple official surfaces: the current snapshot lists more than one verified first-party surface or tool.
Ease of use is a buyer-accessibility signal for non-technical users. It helps explain friction to first use, but it does not change the published leaderboard order.
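A minimal sketch of the 45 + 30 + 15 + 10 breakdown above, assuming simple boolean inputs; the flag names are illustrative.

```ts
// Illustrative only: flag names are assumptions; the weights come from the breakdown above.
type AccessFacts = {
  hasConsumerWebApp: boolean;        // 45 points
  noTechnicalSetup: boolean;         // 30 points
  hasFreeTierOrTrial: boolean;       // 15 points
  multipleOfficialSurfaces: boolean; // 10 points
};

function easeOfUsePoints(f: AccessFacts): number {
  return (
    (f.hasConsumerWebApp ? 45 : 0) +
    (f.noTechnicalSetup ? 30 : 0) +
    (f.hasFreeTierOrTrial ? 15 : 0) +
    (f.multipleOfficialSurfaces ? 10 : 0)
  );
}
```

The total tops out at 100 by construction and, as stated above, never feeds the leaderboard order.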
Why the separation matters
Trust & governance
Why you can trust what you are reading.
This page is public and buyer-facing, but the safeguards behind it are still explicit. Publication rules, update cadence, and operational controls all exist to stop unverified benchmark noise from leaking into the public ranking.
How publication works
Editorial prose is allowed. Score edits are not.
Scores are read-only after snapshot generation.
The public site reads a published Supabase snapshot instead of fetching live benchmark pages.
Methodology changes ship through a versioned update and a republished snapshot, not a hidden runtime override.
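A minimal sketch of that read path, assuming a supabase-js client; the table and column names are hypothetical, not the real schema.

```ts
import { createClient } from "@supabase/supabase-js";

// Illustrative DAL read: the table and column names are assumptions, not the real schema.
const supabase = createClient(process.env.SUPABASE_URL!, process.env.SUPABASE_ANON_KEY!);

export async function getPublishedRoster() {
  // The public site only reads already-published snapshot rows. Nothing here
  // fetches benchmark sources or calls AI at request time.
  const { data, error } = await supabase
    .from("published_snapshot_models") // hypothetical table name
    .select("model_id, quality_score, value_score, methodology_version")
    .order("quality_score", { ascending: false })
    .limit(10); // public pages and endpoints are capped at 10 models

  if (error) throw error;
  return data; // read-only scores: no runtime recalculation, no overrides
}
```

A methodology change would surface here only as a new version value on a republished snapshot, never as a runtime override.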
Evidence & benchmark sources
Accepted reputable citations only, grouped by the evidence they support.
Public ranking rows are published only when the underlying benchmark and pricing inputs can be tied to accepted reputable sources that any buyer can fact-check. Unsupported sources are omitted rather than estimated.
Reasoning & novel problem solving
Primary benchmark evidence for raw reasoning views and supporting problem-solving context.
Primary public leaderboard source for MathArena Expected Performance.
FAQ
Questions buyers ask before they trust a ranking.
What is the difference between Quality and Value?
Quality ranks the published roster by capability signals such as HLE, coding, grounding, context, and speed. Value uses a separate deterministic formula and combines HLE with published API token cost instead of blending into Quality.
Why can the Intelligence view rank models differently from Quality?
The Intelligence view is the raw Humanity's Last Exam lens over the same published roster. It surfaces HLE evidence directly, while Quality remains the broader deterministic ranking.
Why does the Coding view show more than one benchmark?
SWE-bench Verified sets the rank, but LiveCodeBench and Aider Polyglot are kept beside it so buyers can see whether a model is genuinely strong across fresh tasks and multi-file work, not just one harness.
Value Score
HLE and token cost
0.40 HLE + 0.60 API token cost
40% HLE - A model must clear the HLE floor to appear on Value.
60% API token cost - The buying-power signal comes from published API pricing, not consumer subscription pricing.
Free or zero-priced API models can still qualify if they clear the HLE floor. Subscription pricing stays visible, but Value leans on published API token cost instead.
Ease of use remains a separate buyer-accessibility signal and does not rank the public leaderboard.
AI-written prose can explain the roster, but it never edits numeric values.
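A minimal sketch of the 0.40 / 0.60 blend described above. The cost-to-signal conversion is a placeholder assumption, since this page does not spell out how published token cost is turned into a 0-100 signal; the handling of free and missing pricing follows the rules stated here.

```ts
// Illustrative only. The cost-to-signal conversion is a placeholder assumption;
// the published methodology defines the real normalization.
type ValueInputs = {
  hle: number;              // 0-100; a model must clear the HLE floor to appear on Value
  apiTokenCostUsd?: number; // published API token cost; undefined when no price is published
};

// cheapestPaidCostUsd: cheapest nonzero published cost on the roster (assumption).
function valueScore(m: ValueInputs, cheapestPaidCostUsd: number): number {
  // Cheaper published cost => stronger buying-power signal (placeholder scaling).
  // Zero-priced APIs keep the full cost signal; missing pricing weakens the
  // signal without removing the model from the Value view.
  const costSignal =
    m.apiTokenCostUsd === undefined
      ? 0
      : m.apiTokenCostUsd === 0
        ? 100
        : Math.min(100, (cheapestPaidCostUsd / m.apiTokenCostUsd) * 100);

  return 0.4 * m.hle + 0.6 * costSignal;
}
```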
Factual grounding
Scored only when the benchmark is broadly available.
Context window
Scored only when the benchmark is broadly available.
Speed
Observed performance is scored separately from price.
Subscription price
Displayed for buyer context only.
Token costs
Displayed for buyer context only.
Value Score
Value ranks the same published roster by a cost-heavy blend of HLE and published API token cost. Missing pricing weakens the cost signal, but it does not automatically remove a model from Value.
HLE
Required floor and quality anchor.
API token cost
Primary buying-power ranking signal.
Subscription price
Shown as context, not scored separately.
Missing pricing
Tracked as weaker evidence, not a hard blocker.
Raw benchmark views
These cards explain the benchmark-only tabs.
Intelligence and Coding remain direct benchmark views over the same roster. Companion metrics stay visible, but they do not become hidden score inputs.
Intelligence View
This remains a raw Humanity's Last Exam view over the same roster. Supporting benchmarks stay visible, but they do not create a third PickAI score.
HLE
Ranking signal.
MathArena
Displayed as supporting evidence when available.
ARC-AGI-2
Displayed as supporting evidence when available.
GPQA
Displayed as supporting evidence when available.
Context window
Not part of the raw intelligence view.
Speed
Not part of the raw intelligence view.
Price
Not part of the raw intelligence view.
Coding View
This remains a raw SWE-bench Verified view over the same roster. LiveCodeBench and Aider Polyglot stay visible as supporting coding evidence, but they do not create a fourth PickAI score.
SWE-bench Verified
Ranking signal.
LiveCodeBench
Displayed as supporting evidence when available.
Aider Polyglot
Displayed as supporting evidence when verified.
HLE
Shown beside the coding roster for buyer context; it does not set the coding rank, so buyers can see more context without losing the ranking logic.
Capability stays capability
Quality and Value do the ranking work. Supporting context does not leak into the official order.
Buyer context stays visible
Pricing, app access, benchmark companions, and editorial notes stay available where they help a purchase decision.
Raw views stay raw
Intelligence and Coding expose direct benchmark evidence rather than a blended third or fourth PickAI score.
Trust stays explicit
The page tells you which signals rank, which signals inform, and which signals are intentionally excluded.
Refresh remains manual and admin-gated; there is no public or scheduled refresh trigger.
AI-written editorial copy refreshes after publication without touching numeric score values.
The operational diagnostics that explain candidate-pool blockers and publication withholding live in the protected status console, not on the public methodology page.
The rules we do not bend
01
Scores are immutable
Quality Score, Value Score, and benchmark sub-scores are read-only numbers. AI can write editorial prose, but it never changes the score fields.
02
Snapshot is the source of truth
The public site reads from the snapshot through the DAL. It does not fetch benchmark sources or call AI at request time, and unsupported benchmark packs are withheld rather than rendered.
03
Quality and value stay independent
There is no combined score. Quality and Value remain the only official PickAI scores, while the Intelligence and Coding tabs stay raw benchmark views over the same roster.
04
Up to 10 models
Public pages and endpoints are capped at 10 models. The published roster tracks the latest stable models, up to 10, in the current snapshot, with no overflow pagination or hidden extra rows.
05
AI prose is labeled
Any AI-written verdict, pros/cons list, or best-for block carries a visible AI-generated label and data-ai-generated="true".
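As a rough illustration, the labeling contract could be applied like this; the helper and class name are assumptions, while the visible label and data-ai-generated="true" attribute come from the rule above.

```ts
// Illustrative only: the helper and class name are assumptions. The documented
// contract is a visible label plus data-ai-generated="true" on AI-written blocks.
function wrapAiProse(html: string): string {
  return `<section data-ai-generated="true">
  <span class="ai-label">AI-generated</span>
  ${html}
</section>`;
}
```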
When updates happen & what we refuse
Update cadence
Admin-reviewed refresh: the public snapshot updates only when the protected refresh action is run after source review.
Narrow refresh scope: only newly discovered supported models and existing models with missing benchmark facts are targeted.
Operator diagnostics stay in the protected status console rather than the public methodology page.
Hard exclusions
We do not let AI change numeric scores.
We do not rank more than 10 models on public pages.
We do not expose public refresh triggers.
We do not add affiliate links to leaderboard rows or this page.
We do not leave AI-written editorial copy unlabeled; it carries a visible label wherever it appears.
First-party leaderboard maintained by the benchmark creators. Scores are reported as Aider publishes them, using the aider harness and the open-source benchmark dataset on GitHub.
Official model-card benchmark tables hosted on Hugging Face, cited when they provide current public benchmark details.
PickAI Conversation Value remains our disclosed buying-power benchmark at published API rates, while Ease of use is the separate accessibility signal for non-technical buyers.
Benchmarks under review
These inputs are tracked publicly before they are allowed into the score formula.
These coding benchmarks test different facets of software capability. SWE-bench Verified ranks the Coding view, LiveCodeBench stays visible as a companion signal, and Aider Polyglot is shown when verified. None of them are blended into Quality or Value.
ARC-AGI-2
ARC-AGI-2 tests novel visual pattern reasoning without prior exposure. It will be considered for inclusion only when comparable verified scores exist across the full published roster of up to 10 models and a methodology version update is published.
ARC-AGI-3
ARC-AGI-3 is a watchlist benchmark for interactive agent reasoning. It will not enter the formula until stable comparative data exists across the published roster.
Where do I compare price, app access, and buyer notes in more detail?
Use the models index and the model detail pages when you want pricing, app access, buyer-facing guidance, and the supporting benchmark context next to the ranking.