Behavioral fidelity leaderboard
This benchmark shows which models actually hold up against published studies of human choice.
Why this benchmark exists
Most benchmarks reward style, speed, or outright gaming. We care about whether a system can reproduce the structure of human choice.
That means comparing simulated outcomes to published studies and scoring the same decision models researchers used in the original work.
The point is not to crown a winner for a week. The point is to show what these systems can actually support.
If you cannot measure behavioral fidelity, you cannot responsibly deploy a behavioral system.
| Signal | Value | Interpretation |
|---|---|---|
| Replicated study runs | 417 | Published and reconstructed experiments currently represented in the benchmark dataset |
| Domains covered | 5 | Public Health, Consumer Research, Economics, Agricultural Sciences, and Business Administration |
| Top weighted Spearman correlation | 0.55 | Best current cross-domain rank-order alignment in the dataset |
| Top weighted parameter sign agreement | 75.4% | Best current match on coefficient direction across benchmarked studies |
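The two headline metrics above, weighted Spearman correlation and weighted parameter sign agreement, can be sketched in a few lines. This is a minimal illustration, not the benchmark's actual implementation; the function names and the idea of weighting by per-study sample size are assumptions.

```python
import numpy as np

def _ranks(x):
    # Ranks via double argsort (no tie handling; enough for a sketch).
    order = np.argsort(x)
    ranks = np.empty_like(order, dtype=float)
    ranks[order] = np.arange(len(x), dtype=float)
    return ranks

def _weighted_pearson(x, y, w):
    # Pearson correlation with observation weights normalized to sum to 1.
    w = np.asarray(w, dtype=float) / np.sum(w)
    mx, my = np.sum(w * x), np.sum(w * y)
    cov = np.sum(w * (x - mx) * (y - my))
    sx = np.sqrt(np.sum(w * (x - mx) ** 2))
    sy = np.sqrt(np.sum(w * (y - my) ** 2))
    return cov / (sx * sy)

def weighted_spearman(human, simulated, weights):
    # Spearman is Pearson on ranks; weights might be per-study sample sizes.
    return _weighted_pearson(_ranks(human), _ranks(simulated), weights)

def sign_agreement(human_coefs, sim_coefs, weights):
    # Share of coefficients whose direction matches, as a percentage.
    human_coefs = np.asarray(human_coefs)
    sim_coefs = np.asarray(sim_coefs)
    w = np.asarray(weights, dtype=float)
    agree = np.sign(human_coefs) == np.sign(sim_coefs)
    return 100.0 * np.sum(w * agree) / np.sum(w)
```

A rank correlation of 1.0 means the simulated estimates order the attributes exactly as the human data does; sign agreement only asks whether each coefficient points the same way.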
Transparent measurement
Benchmark methodology
The leaderboard is built by replicating published studies of human choice.
We reconstruct the original survey, match the respondent profile, estimate the same models, and compare the results back to human data.
Every score is attached to a documented protocol.
Live rankings
Rows are domain-level entries ranked together, so the same model may appear more than once.
| Model | Score | Coverage | Match | Studies |
|---|---|---|---|---|
| sonnet | 56.78 | 40.07 | 82.63 | 19 |
| gpt4 | 55.62 | 46.02 | 78.21 | 19 |
| sonnet | 54.63 | 45.79 | 70.95 | 23 |
| gemini flash | 54.63 | 23.84 | 81.59 | 19 |
| gpt4 | 53.87 | 35.27 | 73.22 | 2 |
| haiku | 53.62 | 39.92 | 76.60 | 19 |
| gpt3-instruct | 52.79 | 41.50 | 75.97 | 18 |
| gpt4 | 52.67 | 34.92 | 78.74 | 9 |
| gemini | 51.09 | 33.46 | 76.83 | 9 |
| sonnet | 50.82 | 28.54 | 77.55 | 9 |
Domain breakdowns
Each table preserves the domain-level rankings carried over from the legacy leaderboard.
Public Health
8 models ranked
| Model | Score | Coverage | Match | Studies |
|---|---|---|---|---|
| sonnet | 54.63 | 45.79 | 70.95 | 23 |
| haiku | 50.30 | 44.93 | 65.83 | 23 |
| gpt4 | 50.23 | 43.12 | 67.59 | 23 |
| gemini flash | 49.48 | 25.37 | 68.93 | 22 |
| o1 | 48.49 | 25.67 | 68.72 | 20 |
| gemini | 39.71 | 40.47 | 55.42 | 23 |
| gpt3-instruct | 38.94 | 41.12 | 52.27 | 22 |
| gpt o3 | 37.51 | 21.05 | 55.21 | 22 |
Consumer Research
8 models ranked
| Model | Score | Coverage | Match | Studies |
|---|---|---|---|---|
| sonnet | 56.78 | 40.07 | 82.63 | 19 |
| gpt4 | 55.62 | 46.02 | 78.21 | 19 |
| gemini flash | 54.63 | 23.84 | 81.59 | 19 |
| haiku | 53.62 | 39.92 | 76.60 | 19 |
| gpt3-instruct | 52.79 | 41.50 | 75.97 | 18 |
| o1 | 50.65 | 25.33 | 75.20 | 19 |
| gemini | 50.01 | 37.12 | 74.81 | 19 |
| gpt o3 | 33.11 | 17.91 | 56.59 | 19 |
Economics
8 models ranked
| Model | Score | Coverage | Match | Studies |
|---|---|---|---|---|
| gpt4 | 52.67 | 34.92 | 78.74 | 9 |
| gemini | 51.09 | 33.46 | 76.83 | 9 |
| sonnet | 50.82 | 28.54 | 77.55 | 9 |
| gemini flash | 47.47 | 23.81 | 77.87 | 9 |
| gpt3-instruct | 45.03 | 21.58 | 75.79 | 9 |
| haiku | 43.92 | 31.17 | 73.57 | 9 |
| o1 | 38.40 | 18.67 | 65.19 | 9 |
| gpt o3 | 33.19 | 4.77 | 63.35 | 9 |
Agricultural Sciences
4 models ranked
| Model | Score | Coverage | Match | Studies |
|---|---|---|---|---|
| gpt4 | 43.18 | 24.30 | 53.48 | 2 |
| sonnet | 38.52 | 29.08 | 47.40 | 2 |
| haiku | 32.56 | 28.30 | 43.32 | 2 |
| gemini | 29.32 | 28.98 | 44.80 | 2 |
Business Administration
4 models ranked
| Model | Score | Coverage | Match | Studies |
|---|---|---|---|---|
| gpt4 | 53.87 | 35.27 | 73.22 | 2 |
| sonnet | 44.64 | 15.63 | 68.58 | 2 |
| haiku | 44.00 | 18.51 | 62.30 | 2 |
| gemini | 43.12 | 29.16 | 60.10 | 2 |
How we score
The method follows the original studies closely enough to preserve the research question, not just the surface language.
Original study reconstruction
We begin by reconstructing the original survey using the same set of attributes reported in the source study.
Demographic profile matching
We recreate the demographic profile of the original respondents as closely as possible from the published paper.
Matched respondent simulation
The model is prompted to answer as if it were a participant with the corresponding demographic characteristics.
Human model parity
We estimate the same statistical model used in the original research on the AI-generated responses.
Comparison to human data
Estimated parameters are compared back to the original human data to assess how well the model replicates real decision-making.
Transparent publication
Results are summarized with enough methodological detail for research, legal, and procurement stakeholders to understand what is being claimed.
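The steps above can be compressed into a small sketch. Everything here is illustrative: `simulate_fn` stands in for the LLM answering as a demographically matched respondent, and ordinary least squares stands in for whatever model the source study actually estimated (often a discrete choice model in practice).

```python
import numpy as np

def replicate_study(design_matrix, human_coefs, simulate_fn):
    """Illustrative replication loop (not the production pipeline).

    design_matrix : attribute levels shown to respondents, reconstructed
                    from the source study
    human_coefs   : coefficients published in the original paper
    simulate_fn   : stand-in for the LLM answering as a matched respondent
    """
    # 1. Simulate matched respondents on the reconstructed survey.
    simulated_responses = simulate_fn(design_matrix)
    # 2. Estimate "the same model" on the simulated data; OLS here is a
    #    stand-in for the study's original specification.
    sim_coefs, *_ = np.linalg.lstsq(design_matrix, simulated_responses,
                                    rcond=None)
    # 3. Compare back to human data: coefficient direction agreement.
    sign_match_pct = 100.0 * np.mean(np.sign(sim_coefs) == np.sign(human_coefs))
    return sim_coefs, sign_match_pct
```

The point of the sketch is the shape of the loop: reconstruct, simulate, refit, compare, with the human estimates treated as ground truth throughout.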
Frequently asked questions
Questions about the leaderboard? Contact us at hello@subconscious.ai
Is this the full interactive leaderboard?
This page presents the benchmark method and the highest-signal summary. We can share deeper methodological detail directly with qualified partners.
Can we audit the methodology?
Yes. The benchmark is intentionally designed to be explainable to research, legal, procurement, and executive stakeholders.
Why stated preference studies?
They are a rigorous way to compare how humans make trade-offs across defined attributes, which makes them useful for testing simulated decision-makers.
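As a concrete illustration, a stated-preference task presents respondents with alternatives defined by explicit attributes and asks them to choose. The attributes and values below are made up for the example, not drawn from any benchmarked study.

```python
# A minimal stated-preference (discrete choice) task: each alternative
# is a bundle of defined attributes, and the respondent picks one.
choice_set = [
    {"price": 4.99, "organic": True,  "brand": "A"},
    {"price": 3.49, "organic": False, "brand": "B"},
]

def render_task(alternatives):
    # Render the choice set as text, e.g. to prompt a simulated respondent.
    lines = ["Which product would you choose?"]
    for i, alt in enumerate(alternatives, 1):
        attrs = ", ".join(f"{k}={v}" for k, v in alt.items())
        lines.append(f"Option {i}: {attrs}")
    return "\n".join(lines)
```

Because every attribute is explicit, the trade-offs a respondent makes (here, price versus organic versus brand) can be recovered as model coefficients and compared between humans and simulations.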
What does a good score mean?
A good score means the model is preserving important structural properties of human choice in that domain. It does not mean the model is universally reliable without context.