LLM Model Performance Leaderboard

A comparison of LLMs across research domains, measuring their performance on four statistical metrics.

| LLM Model | Domain | Manhattan Similarity (%) ↓ | Coverage Probability (%) | Directional Effect Matching (%) | Spearman Correlation (%) |
| --- | --- | --- | --- | --- | --- |
| gpt4 | Agricultural Sciences | 51.45 🥇 | 24.30 | 53.48 🥇 | 43.50 🥇 |
| gemini flash | Public Health | 50.51 🥇 | 25.37 | 68.93 🥈 | 53.09 🥈 |
| sonnet | Business Administration | 50.35 🥇 | 15.63 | 68.58 🥈 | 44.00 🥉 |
| o1 | Economics | 50.19 🥇 | 18.67 | 65.19 | 19.56 |
| gemini flash | Consumer Research | 49.45 🥇 | 23.84 | 81.59 🥈 | 63.63 🥇 |
| sonnet | Public Health | 49.23 🥈 | 45.79 🥇 | 70.95 🥇 | 52.57 🥉 |
| haiku | Business Administration | 49.20 🥈 | 18.51 🥉 | 62.30 🥉 | 46.00 🥈 |
| gemini | Business Administration | 48.70 🥉 | 29.16 🥈 | 60.10 | 34.50 |
| sonnet | Agricultural Sciences | 48.60 🥈 | 29.08 🥇 | 47.40 🥈 | 29.00 🥈 |
| gpt3-instruct | Consumer Research | 48.43 🥈 | 41.50 🥈 | 75.97 | 45.28 |

🥇 Top performer
🥈 Second best
🥉 Third best
Showing 1 - 10 of 32 entries

About the Metrics

Manhattan Similarity (%)

We recreate the conjoint survey from a published study and collect responses from an AI model. We then estimate a discrete choice model similar to the one reported in the study. Using both the estimated and the reported parameters, we compute choice probabilities for each option and measure how close the two sets of probabilities are using Manhattan distance. This metric quantifies how closely the AI replicates human decision patterns on a probability scale.
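The description above does not state the exact formula, so the following is only a minimal sketch of one plausible reading. It assumes a multinomial logit model for the choice probabilities and converts the Manhattan (L1) distance between the two probability vectors into a percentage via 1 - L1/2, averaged over choice sets. The function names `choice_probabilities` and `manhattan_similarity`, the model form, and the scaling are all assumptions made for illustration, not the leaderboard's confirmed method.

```python
import math


def choice_probabilities(alternatives, beta):
    """Multinomial logit probabilities for one choice set (assumed model form).

    alternatives: list of attribute vectors, one per option.
    beta: list of part-worth parameters, same length as each attribute vector.
    """
    utilities = [sum(x * b for x, b in zip(alt, beta)) for alt in alternatives]
    max_u = max(utilities)  # subtract the max for numerical stability
    exp_u = [math.exp(u - max_u) for u in utilities]
    total = sum(exp_u)
    return [e / total for e in exp_u]


def manhattan_similarity(choice_sets, beta_llm, beta_reported):
    """Average similarity (in %) between LLM-implied and reported probabilities.

    Assumption: similarity = 1 - L1/2 per choice set, so identical
    probability vectors score 100 and fully disjoint ones score 0.
    """
    scores = []
    for alternatives in choice_sets:
        p = choice_probabilities(alternatives, beta_llm)
        q = choice_probabilities(alternatives, beta_reported)
        l1 = sum(abs(a - b) for a, b in zip(p, q))  # Manhattan distance
        scores.append(1.0 - l1 / 2.0)
    return 100.0 * sum(scores) / len(scores)


# Hypothetical example: two choice sets, two attributes per option.
example_sets = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 1.0], [0.0, 0.0]],
]
print(manhattan_similarity(example_sets, [0.4, -0.2], [0.6, -0.1]))
```

Under this scaling the score is bounded between 0 and 100, which matches the percentage scale used in the tables, but other normalizations of the L1 distance would be equally consistent with the text.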

Coverage Probability (%)

We recreate the conjoint survey from a published study and collect responses from an AI model. Then, we estimate a discrete choice model similar to the one in the study. To assess similarity, we check whether the study's reported parameters fall within the confidence intervals constructed from the estimated parameters. This metric evaluates Human–LLM Equivalence while accounting for sampling differences.
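As a rough illustration of the check described above, the sketch below counts how many reported parameters fall inside an interval built from each estimated parameter and its standard error. The 95% Wald interval (estimate ± 1.96 × standard error) and the function name `coverage_probability` are assumptions; the text does not specify the confidence level or how the interval is constructed.

```python
def coverage_probability(estimates, std_errors, reported, z=1.96):
    """Share (in %) of reported parameters inside the estimated intervals.

    Assumption: a symmetric Wald interval, estimate +/- z * standard error,
    with z = 1.96 for a nominal 95% level.
    """
    covered = 0
    for est, se, rep in zip(estimates, std_errors, reported):
        lower, upper = est - z * se, est + z * se
        if lower <= rep <= upper:
            covered += 1
    return 100.0 * covered / len(reported)


# Hypothetical parameters: two of the three reported values are covered.
print(coverage_probability(
    estimates=[1.0, -0.5, 2.0],
    std_errors=[0.1, 0.2, 0.5],
    reported=[1.1, 0.5, 2.5],
))
```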

Directional Effect Matching (%)

We recreate the conjoint survey from a published study and collect responses from an AI model. After estimating a discrete choice model similar to the study's, we compare the signs (positive or negative) of the estimated parameters with those reported in the study. This metric measures Human–LLM Equivalence in terms of preference alignment.
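The sign comparison above can be sketched as the share of parameters whose estimated and reported signs agree. The function name `directional_effect_matching` is hypothetical, and treating an exact zero as its own sign class is an assumption, since the text does not say how zeros are handled.

```python
def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def directional_effect_matching(estimated, reported):
    """Share (in %) of parameters whose estimated and reported signs agree.

    Assumption: an exact zero only matches another exact zero.
    """
    matches = sum(1 for e, r in zip(estimated, reported) if sign(e) == sign(r))
    return 100.0 * matches / len(reported)


# Hypothetical parameters: three of the four signs agree.
print(directional_effect_matching([0.8, -0.3, 0.1, -1.2], [0.5, 0.2, 0.4, -0.9]))
```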

Spearman Correlation (%)

We recreate the conjoint survey from a published study and collect responses from an AI model. After estimating a discrete choice model similar to the one in the study, we calculate the Spearman correlation between the estimated and reported parameters. This metric quantifies Human–LLM Equivalence based on preference ordering.
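For reference, Spearman correlation is the Pearson correlation of the ranks of the two parameter vectors. The sketch below computes it in plain Python, using average ranks for ties, and multiplies by 100 to match the percentage scale used in the tables. The function names are hypothetical, and how the leaderboard treats ties or negative correlations is not stated in the text.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def spearman_percent(estimated, reported):
    """Spearman rank correlation between two parameter vectors, times 100."""
    return 100.0 * pearson(average_ranks(estimated), average_ranks(reported))


# Hypothetical parameters whose orderings differ in one adjacent swap.
print(spearman_percent([0.2, 0.9, 0.5, -0.1], [0.1, 0.4, 0.8, -0.3]))
```

In practice a library routine such as `scipy.stats.spearmanr` would do the same job; the hand-written version is shown only to make the ranking step explicit.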

Domain-Specific Leaderboards

Detailed performance breakdowns for each research domain.

Public Health

| LLM Model | Manhattan Similarity (%) ↓ | Coverage Probability (%) | Directional Effect Matching (%) | Spearman Correlation (%) |
| --- | --- | --- | --- | --- |
| gemini flash | 50.51 🥇 | 25.37 | 68.93 🥈 | 53.09 🥈 |
| sonnet | 49.23 🥈 | 45.79 🥇 | 70.95 🥇 | 52.57 🥉 |
| haiku | 47.97 🥉 | 44.93 🥈 | 65.83 | 42.48 |
| o1 | 46.22 | 25.67 | 68.72 🥉 | 53.35 🥇 |
| gpt3-instruct | 44.50 | 41.12 | 52.27 | 17.86 |
| gpt4 | 43.17 | 43.12 🥉 | 67.59 | 47.04 |
| gpt o3 | 41.22 | 21.05 | 55.21 | 32.55 |
| gemini | 40.19 | 40.47 | 55.42 | 22.78 |

Consumer Research

| LLM Model | Manhattan Similarity (%) ↓ | Coverage Probability (%) | Directional Effect Matching (%) | Spearman Correlation (%) |
| --- | --- | --- | --- | --- |
| gemini flash | 49.45 🥇 | 23.84 | 81.59 🥈 | 63.63 🥇 |
| gpt3-instruct | 48.43 🥈 | 41.50 🥈 | 75.97 | 45.28 |
| haiku | 48.18 🥉 | 39.92 | 76.60 | 49.79 |
| sonnet | 46.71 | 40.07 🥉 | 82.63 🥇 | 57.74 🥈 |
| o1 | 46.03 | 25.33 | 75.20 | 56.05 🥉 |
| gpt4 | 44.62 | 46.02 🥇 | 78.21 🥉 | 53.63 |
| gemini | 42.92 | 37.12 | 74.81 | 45.21 |
| gpt o3 | 35.48 | 17.91 | 56.59 | 22.47 |

Economics

| LLM Model | Manhattan Similarity (%) ↓ | Coverage Probability (%) | Directional Effect Matching (%) | Spearman Correlation (%) |
| --- | --- | --- | --- | --- |
| o1 | 50.19 🥇 | 18.67 | 65.19 | 19.56 |
| gpt3-instruct | 47.97 🥈 | 21.58 | 75.79 | 34.78 |
| gemini | 47.29 🥉 | 33.46 🥈 | 76.83 | 46.78 🥉 |
| sonnet | 46.54 | 28.54 | 77.55 🥉 | 50.67 🥈 |
| gpt4 | 45.46 | 34.92 🥇 | 78.74 🥇 | 51.56 🥇 |
| gemini flash | 44.31 | 23.81 | 77.87 🥈 | 43.89 |
| haiku | 39.17 | 31.17 🥉 | 73.57 | 31.78 |
| gpt o3 | 37.20 | 4.77 | 63.35 | 27.44 |

Agricultural Sciences

| LLM Model | Manhattan Similarity (%) ↓ | Coverage Probability (%) | Directional Effect Matching (%) | Spearman Correlation (%) |
| --- | --- | --- | --- | --- |
| gpt4 | 51.45 🥇 | 24.30 | 53.48 🥇 | 43.50 🥇 |
| sonnet | 48.60 🥈 | 29.08 🥇 | 47.40 🥈 | 29.00 🥈 |
| haiku | 46.10 🥉 | 28.30 🥉 | 43.32 | 12.50 🥉 |
| gemini | 43.50 | 28.98 🥈 | 44.80 🥉 | 0.00 |

Business Administration

| LLM Model | Manhattan Similarity (%) ↓ | Coverage Probability (%) | Directional Effect Matching (%) | Spearman Correlation (%) |
| --- | --- | --- | --- | --- |
| sonnet | 50.35 🥇 | 15.63 | 68.58 🥈 | 44.00 🥉 |
| haiku | 49.20 🥈 | 18.51 🥉 | 62.30 🥉 | 46.00 🥈 |
| gemini | 48.70 🥉 | 29.16 🥈 | 60.10 | 34.50 |
| gpt4 | 48.00 | 35.27 🥇 | 73.22 🥇 | 59.00 🥇 |
