Behavioral fidelity leaderboard
This benchmark shows which models actually hold up against published studies of human choice.
Why this benchmark exists
Most benchmarks reward style, speed, or outright gaming. We care about whether a system can reproduce the structure of human choice.
That means comparing simulated outcomes to published studies and scoring the same decision models researchers used in the original work.
The point is not to crown a winner for a week. The point is to show what these systems can actually support.
If you cannot measure behavioral fidelity, you cannot responsibly deploy a behavioral system.
| Signal | Value | Interpretation |
|---|---|---|
| Replicated study runs | 417 | Published and reconstructed experiments currently represented in the benchmark dataset |
| Domains covered | 5 | Public Health, Consumer Research, Economics, Agricultural Sciences, and Business Administration |
| Top weighted Spearman correlation | 0.55 | Best current cross-domain rank-order alignment in the dataset |
| Top weighted parameter sign agreement | 75.4% | Best current match on coefficient direction across benchmarked studies |
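The two headline metrics above, weighted Spearman correlation and weighted parameter sign agreement, can be sketched in a few lines. This is a minimal illustration, not the benchmark's actual implementation; the function names and the idea of weighting by per-study sample size are assumptions.

```python
import numpy as np

def _ranks(x):
    # Ranks via double argsort (no tie handling; enough for a sketch).
    order = np.argsort(x)
    ranks = np.empty_like(order, dtype=float)
    ranks[order] = np.arange(len(x), dtype=float)
    return ranks

def _weighted_pearson(x, y, w):
    # Pearson correlation with observation weights normalized to sum to 1.
    w = np.asarray(w, dtype=float) / np.sum(w)
    mx, my = np.sum(w * x), np.sum(w * y)
    cov = np.sum(w * (x - mx) * (y - my))
    sx = np.sqrt(np.sum(w * (x - mx) ** 2))
    sy = np.sqrt(np.sum(w * (y - my) ** 2))
    return cov / (sx * sy)

def weighted_spearman(human, simulated, weights):
    # Spearman is Pearson on ranks; weights might be per-study sample sizes.
    return _weighted_pearson(_ranks(human), _ranks(simulated), weights)

def sign_agreement(human_coefs, sim_coefs, weights):
    # Share of coefficients whose direction matches, as a percentage.
    human_coefs = np.asarray(human_coefs)
    sim_coefs = np.asarray(sim_coefs)
    w = np.asarray(weights, dtype=float)
    agree = np.sign(human_coefs) == np.sign(sim_coefs)
    return 100.0 * np.sum(w * agree) / np.sum(w)
```

A rank correlation of 1.0 means the simulated estimates order the attributes exactly as the human data does; sign agreement only asks whether each coefficient points the same way.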
Transparent measurement
Benchmark methodology
The leaderboard is built by replicating published studies of human choice.
We reconstruct the original survey, match the respondent profile, estimate the same models, and compare the results back to human data.
Every score is attached to a documented protocol.
Live rankings
Rows are domain-level entries ranked together, so the same model may appear more than once.
| Model | Score | Coverage | Match | Studies |
|---|---|---|---|---|
| sonnet | 56.78 | 40.07 | 82.63 | 19 |
| gpt4 | 55.62 | 46.02 | 78.21 | 19 |
| sonnet | 54.63 | 45.79 | 70.95 | 23 |
| gemini flash | 54.63 | 23.84 | 81.59 | 19 |
| gpt4 | 53.87 | 35.27 | 73.22 | 2 |
| haiku | 53.62 | 39.92 | 76.60 | 19 |
| gpt3-instruct | 52.79 | 41.50 | 75.97 | 18 |
| gpt4 | 52.67 | 34.92 | 78.74 | 9 |
| gemini | 51.09 | 33.46 | 76.83 | 9 |
| sonnet | 50.82 | 28.54 | 77.55 | 9 |
Domain breakdowns
Each table preserves the domain-level rankings carried over from the legacy leaderboard.
Public Health
8 models ranked
| Model | Score | Coverage | Match | Studies |
|---|---|---|---|---|
| sonnet | 54.63 | 45.79 | 70.95 | 23 |
| haiku | 50.30 | 44.93 | 65.83 | 23 |
| gpt4 | 50.23 | 43.12 | 67.59 | 23 |
| gemini flash | 49.48 | 25.37 | 68.93 | 22 |
| o1 | 48.49 | 25.67 | 68.72 | 20 |
| gemini | 39.71 | 40.47 | 55.42 | 23 |
| gpt3-instruct | 38.94 | 41.12 | 52.27 | 22 |
| gpt o3 | 37.51 | 21.05 | 55.21 | 22 |
Consumer Research
8 models ranked
| Model | Score | Coverage | Match | Studies |
|---|---|---|---|---|
| sonnet | 56.78 | 40.07 | 82.63 | 19 |
| gpt4 | 55.62 | 46.02 | 78.21 | 19 |
| gemini flash | 54.63 | 23.84 | 81.59 | 19 |
| haiku | 53.62 | 39.92 | 76.60 | 19 |
| gpt3-instruct | 52.79 | 41.50 | 75.97 | 18 |
| o1 | 50.65 | 25.33 | 75.20 | 19 |
| gemini | 50.01 | 37.12 | 74.81 | 19 |
| gpt o3 | 33.11 | 17.91 | 56.59 | 19 |
Economics
8 models ranked
| Model | Score | Coverage | Match | Studies |
|---|---|---|---|---|
| gpt4 | 52.67 | 34.92 | 78.74 | 9 |
| gemini | 51.09 | 33.46 | 76.83 | 9 |
| sonnet | 50.82 | 28.54 | 77.55 | 9 |
| gemini flash | 47.47 | 23.81 | 77.87 | 9 |
| gpt3-instruct | 45.03 | 21.58 | 75.79 | 9 |
| haiku | 43.92 | 31.17 | 73.57 | 9 |
| o1 | 38.40 | 18.67 | 65.19 | 9 |
| gpt o3 | 33.19 | 4.77 | 63.35 | 9 |
Agricultural Sciences
4 models ranked
| Model | Score | Coverage | Match | Studies |
|---|---|---|---|---|
| gpt4 | 43.18 | 24.30 | 53.48 | 2 |
| sonnet | 38.52 | 29.08 | 47.40 | 2 |
| haiku | 32.56 | 28.30 | 43.32 | 2 |
| gemini | 29.32 | 28.98 | 44.80 | 2 |
Business Administration
4 models ranked
| Model | Score | Coverage | Match | Studies |
|---|---|---|---|---|
| gpt4 | 53.87 | 35.27 | 73.22 | 2 |
| sonnet | 44.64 | 15.63 | 68.58 | 2 |
| haiku | 44.00 | 18.51 | 62.30 | 2 |
| gemini | 43.12 | 29.16 | 60.10 | 2 |
How we score
The method follows the original studies closely enough to preserve the research question, not just the surface language.
Original study reconstruction
We begin by reconstructing the original survey using the same set of attributes reported in the source study.
Demographic profile matching
We recreate the demographic profile of the original respondents as closely as possible from the published paper.
Matched respondent simulation
The model is prompted to answer as if it were a participant with the corresponding demographic characteristics.
Human model parity
We estimate the same statistical model used in the original research on the AI-generated responses.
Comparison to human data
Estimated parameters are compared back to the original human data to assess how well the model replicates real decision-making.
Transparent publication
Results are summarized with enough methodological detail for research, legal, and procurement stakeholders to understand what is being claimed.
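The steps above can be compressed into a small sketch. Everything here is illustrative: `simulate_fn` stands in for the LLM answering as a demographically matched respondent, and ordinary least squares stands in for whatever model the source study actually estimated (often a discrete choice model in practice).

```python
import numpy as np

def replicate_study(design_matrix, human_coefs, simulate_fn):
    """Illustrative replication loop (not the production pipeline).

    design_matrix : attribute levels shown to respondents, reconstructed
                    from the source study
    human_coefs   : coefficients published in the original paper
    simulate_fn   : stand-in for the LLM answering as a matched respondent
    """
    # 1. Simulate matched respondents on the reconstructed survey.
    simulated_responses = simulate_fn(design_matrix)
    # 2. Estimate "the same model" on the simulated data; OLS here is a
    #    stand-in for the study's original specification.
    sim_coefs, *_ = np.linalg.lstsq(design_matrix, simulated_responses,
                                    rcond=None)
    # 3. Compare back to human data: coefficient direction agreement.
    sign_match_pct = 100.0 * np.mean(np.sign(sim_coefs) == np.sign(human_coefs))
    return sim_coefs, sign_match_pct
```

The point of the sketch is the shape of the loop: reconstruct, simulate, refit, compare, with the human estimates treated as ground truth throughout.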
Frequently asked questions
Questions about the leaderboard? Contact us at hello@subconscious.ai
Is this the full interactive leaderboard?
This page presents the benchmark method and the highest-signal summary. We can share deeper methodological detail directly with qualified partners.
Can we audit the methodology?
Yes. The benchmark is intentionally designed to be explainable to research, legal, procurement, and executive stakeholders.
Why stated preference studies?
They are a rigorous way to compare how humans make trade-offs across defined attributes, which makes them useful for testing simulated decision-makers.
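As a concrete illustration, a stated-preference task presents respondents with alternatives defined by explicit attributes and asks them to choose. The attributes and values below are made up for the example, not drawn from any benchmarked study.

```python
# A minimal stated-preference (discrete choice) task: each alternative
# is a bundle of defined attributes, and the respondent picks one.
choice_set = [
    {"price": 4.99, "organic": True,  "brand": "A"},
    {"price": 3.49, "organic": False, "brand": "B"},
]

def render_task(alternatives):
    # Render the choice set as text, e.g. to prompt a simulated respondent.
    lines = ["Which product would you choose?"]
    for i, alt in enumerate(alternatives, 1):
        attrs = ", ".join(f"{k}={v}" for k, v in alt.items())
        lines.append(f"Option {i}: {attrs}")
    return "\n".join(lines)
```

Because every attribute is explicit, the trade-offs a respondent makes (here, price versus organic versus brand) can be recovered as model coefficients and compared between humans and simulations.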
What does a good score mean?
A good score means the model is preserving important structural properties of human choice in that domain. It does not mean the model is universally reliable without context.