
Behavioral fidelity leaderboard

This benchmark shows which models actually hold up against published studies of human choice.

Request methodology

Evidence before adoption

Why this benchmark exists

Most benchmarks reward style, speed, or skill at gaming the test. We care about whether a system can reproduce the structure of human choice.

That means comparing simulated outcomes to published studies and scoring the same decision models researchers used in the original work.

The point is not to crown a winner for a week. The point is to show what these systems can actually support.

If you cannot measure behavioral fidelity, you cannot responsibly deploy a behavioral system.

417 replicated study runs
5 research domains
93% accuracy vs. real human decisions
Open, published methodology
Signal | Value | Interpretation
Replicated study runs | 417 | Published and reconstructed experiments currently represented in the benchmark dataset
Domains covered | 5 | Public Health, Consumer Research, Economics, Agricultural Sciences, and Business Administration
Top weighted Spearman correlation | 0.55 | Best current cross-domain rank-order alignment in the dataset
Top weighted parameter sign agreement | 75.4% | Best current match on coefficient direction across benchmarked studies
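As a minimal sketch, the two headline signals can be computed as below. The per-study weighting scheme and function names are illustrative assumptions, not the benchmark's published implementation.

```python
# Illustrative sketch: the study weights and names are assumptions,
# not the benchmark's published code.
import numpy as np
from scipy.stats import spearmanr

def weighted_spearman(human_ranks, sim_ranks, study_weights):
    """Per-study rank correlations, combined with study-level weights."""
    rhos = [spearmanr(h, s).correlation for h, s in zip(human_ranks, sim_ranks)]
    return np.average(rhos, weights=study_weights)

def weighted_sign_agreement(human_betas, sim_betas, study_weights):
    """Weighted share of coefficients whose estimated direction matches."""
    rates = [np.mean(np.sign(h) == np.sign(s)) for h, s in zip(human_betas, sim_betas)]
    return np.average(rates, weights=study_weights)
```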

Transparent measurement

Benchmark methodology

The leaderboard is built by replicating published studies of human choice.

We reconstruct the original survey, match the respondent profile, estimate the same models, and compare the results back to human data.

Every score is attached to a documented protocol.

Live rankings

Model | Score | Coverage | Match | Studies
sonnet | 56.78 | 40.07 | 82.63 | 19
gpt4 | 55.62 | 46.02 | 78.21 | 19
sonnet | 54.63 | 45.79 | 70.95 | 23
gemini flash | 54.63 | 23.84 | 81.59 | 19
gpt4 | 53.87 | 35.27 | 73.22 | 2
haiku | 53.62 | 39.92 | 76.60 | 19
gpt3-instruct | 52.79 | 41.50 | 75.97 | 18
gpt4 | 52.67 | 34.92 | 78.74 | 9
gemini | 51.09 | 33.46 | 76.83 | 9
sonnet | 50.82 | 28.54 | 77.55 | 9
Similarity: parameter proximity · Coverage: confidence overlap · Match: effect direction · Rank Corr.: preference ordering

Domain breakdowns

Each table preserves the domain-level rankings from the legacy leaderboard.

Public Health

8 models ranked

Model | Score | Coverage | Match | Studies
sonnet | 54.63 | 45.79 | 70.95 | 23
haiku | 50.30 | 44.93 | 65.83 | 23
gpt4 | 50.23 | 43.12 | 67.59 | 23
gemini flash | 49.48 | 25.37 | 68.93 | 22
o1 | 48.49 | 25.67 | 68.72 | 20
gemini | 39.71 | 40.47 | 55.42 | 23
gpt3-instruct | 38.94 | 41.12 | 52.27 | 22
gpt o3 | 37.51 | 21.05 | 55.21 | 22

Consumer Research

8 models ranked

Model | Score | Coverage | Match | Studies
sonnet | 56.78 | 40.07 | 82.63 | 19
gpt4 | 55.62 | 46.02 | 78.21 | 19
gemini flash | 54.63 | 23.84 | 81.59 | 19
haiku | 53.62 | 39.92 | 76.60 | 19
gpt3-instruct | 52.79 | 41.50 | 75.97 | 18
o1 | 50.65 | 25.33 | 75.20 | 19
gemini | 50.01 | 37.12 | 74.81 | 19
gpt o3 | 33.11 | 17.91 | 56.59 | 19

Economics

8 models ranked

Model | Score | Coverage | Match | Studies
gpt4 | 52.67 | 34.92 | 78.74 | 9
gemini | 51.09 | 33.46 | 76.83 | 9
sonnet | 50.82 | 28.54 | 77.55 | 9
gemini flash | 47.47 | 23.81 | 77.87 | 9
gpt3-instruct | 45.03 | 21.58 | 75.79 | 9
haiku | 43.92 | 31.17 | 73.57 | 9
o1 | 38.40 | 18.67 | 65.19 | 9
gpt o3 | 33.19 | 4.77 | 63.35 | 9

Agricultural Sciences

4 models ranked

Model | Score | Coverage | Match | Studies
gpt4 | 43.18 | 24.30 | 53.48 | 2
sonnet | 38.52 | 29.08 | 47.40 | 2
haiku | 32.56 | 28.30 | 43.32 | 2
gemini | 29.32 | 28.98 | 44.80 | 2

Business Administration

4 models ranked

Model | Score | Coverage | Match | Studies
gpt4 | 53.87 | 35.27 | 73.22 | 2
sonnet | 44.64 | 15.63 | 68.58 | 2
haiku | 44.00 | 18.51 | 62.30 | 2
gemini | 43.12 | 29.16 | 60.10 | 2

// Scoring methodology

How we score

The method follows the original studies closely enough to preserve the research question, not just the surface language.

01 ::

Original study reconstruction

We begin by reconstructing the original survey using the same set of attributes reported in the source study.
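As a sketch, rebuilding the design space from reported attributes can look like the following; the attributes and levels here are hypothetical stand-ins for whatever the source study publishes.

```python
# Hypothetical attributes; a real reconstruction uses exactly the
# set reported in the original study.
from itertools import product

attributes = {
    "price":    ["$10", "$20", "$30"],
    "efficacy": ["70%", "90%"],
    "duration": ["1 year", "5 years"],
}

# Full factorial of candidate profiles; published designs are often a
# fractional, balanced subset of this grid.
profiles = [dict(zip(attributes, levels)) for levels in product(*attributes.values())]
print(len(profiles))  # 12 candidate profiles
print(profiles[0])    # {'price': '$10', 'efficacy': '70%', 'duration': '1 year'}
```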

02 ::

Demographic profile matching

We recreate the demographic profile of the original respondents as closely as possible from the published paper.

03 ::

Matched respondent simulation

The model is prompted to answer as if it were a participant with the corresponding demographic characteristics.
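A minimal sketch of how such a prompt can be assembled, with hypothetical persona fields and wording; the actual templates are part of the documented protocol.

```python
# Hypothetical persona fields and phrasing, for illustration only.
persona = {"age": 34, "gender": "female", "region": "Midwest", "income": "$40-60k"}
task = "Option A: $10, 70% efficacy, 1 year.\nOption B: $20, 90% efficacy, 5 years."

prompt = (
    f"You are a {persona['age']}-year-old {persona['gender']} survey respondent "
    f"from the {persona['region']} with a household income of {persona['income']}.\n"
    f"Choose the option you would actually pick.\n{task}\nAnswer with A or B."
)
print(prompt)
```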

04 ::

Human model parity

We estimate the same statistical model used in the original research on the AI-generated responses.
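For instance, in a two-alternative stated-choice design, a conditional logit reduces to a binary logit on attribute differences, so a self-contained sketch on synthetic responses looks like this:

```python
# Synthetic stand-in data; in practice the responses come from the
# simulated respondents in the previous step.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n_tasks = 500
true_beta = np.array([1.2, -0.8])        # hypothetical effects: efficacy (+), price (-)

x_diff = rng.normal(size=(n_tasks, 2))   # attribute differences, option A minus option B
p_choose_a = 1 / (1 + np.exp(-(x_diff @ true_beta)))
chose_a = (rng.uniform(size=n_tasks) < p_choose_a).astype(int)

fit = sm.Logit(chose_a, x_diff).fit(disp=0)
print(fit.params)      # coefficient estimates carried into the comparison step
print(fit.conf_int())  # 95% confidence intervals
```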

05 ::

Comparison to human data

Estimated parameters are compared back to the original human data to assess how well the model replicates real decision-making.
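A sketch of two of the comparisons involved, effect direction (Match) and confidence-interval overlap (Coverage), on made-up estimates; how these combine into the composite score is part of the published methodology.

```python
import numpy as np

def direction_match(human_beta, sim_beta):
    """Fraction of coefficients with the same estimated sign."""
    return np.mean(np.sign(human_beta) == np.sign(sim_beta))

def ci_overlap(human_ci, sim_ci):
    """Fraction of coefficients whose 95% CIs overlap at all."""
    lo = np.maximum(human_ci[:, 0], sim_ci[:, 0])
    hi = np.minimum(human_ci[:, 1], sim_ci[:, 1])
    return np.mean(lo <= hi)

# Made-up estimates for illustration.
human_beta = np.array([1.1, -0.9])
sim_beta   = np.array([1.4, -0.3])
human_ci   = np.array([[0.8, 1.4], [-1.2, -0.6]])
sim_ci     = np.array([[1.0, 1.8], [-0.7,  0.1]])
print(direction_match(human_beta, sim_beta))  # 1.0: both directions agree
print(ci_overlap(human_ci, sim_ci))           # 1.0: both intervals overlap
```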

06 ::

Transparent publication

Results are summarized with enough methodological detail for research, legal, and procurement stakeholders to understand what is being claimed.

Frequently asked questions

// Leaderboard details

Questions about the leaderboard? Contact us at hello@subconscious.ai

01 ::

Is this the full interactive leaderboard?

This page presents the benchmark method and the highest-signal summary. We can share deeper methodological detail directly with qualified partners.

02 ::

Can we audit the methodology?

Yes. The benchmark is intentionally designed to be explainable to research, legal, procurement, and executive stakeholders.

03 ::

Why stated preference studies?

They are a rigorous way to compare how humans make trade-offs across defined attributes, which makes them useful for testing simulated decision-makers.
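As a toy illustration with made-up coefficients: such studies estimate a utility over attributes, and choice probabilities follow a logit over the utility difference.

```python
# Made-up coefficients, illustrating the trade-off structure only.
import math

beta = {"price": -0.10, "efficacy": 0.06}   # per-unit effects on utility
option_a = {"price": 10, "efficacy": 70}
option_b = {"price": 20, "efficacy": 90}

def utility(opt):
    return sum(beta[k] * opt[k] for k in beta)

p_a = 1 / (1 + math.exp(utility(option_b) - utility(option_a)))
print(round(p_a, 2))  # ~0.45: B's higher efficacy narrowly outweighs its higher price
```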

04 ::

What does a good score mean?

A good score means the model is preserving important structural properties of human choice in that domain. It does not mean the model is universally reliable without context.