Killed by LLM

A memorial to the benchmarks that defined—and were defeated by—AI progress

2024

ARC-AGI (2019 - 2024)

Reasoning
Killed by
Saturation

Killed 1 month ago. Abstract reasoning challenge consisting of visual pattern-completion tasks. Each task presents a few example input-output grids and requires producing the correct output grid for a new input. Created by François Chollet as part of a broader investigation into measuring intelligence. It was 5 years and 1 month old.

Defeated by: o3
Original Score
Human Baseline: ~80%
Final Score
o3: 87.5%

MATH (2021 - 2024)

Mathematics
Killed by
Saturation

Killed 4 months ago. A dataset of 12K challenging competition mathematics problems from the AMC, AIME, and other contests. Problems range from pre-algebra to olympiad level and require complex multi-step reasoning. Each problem has a detailed solution that tests mathematical reasoning capabilities. It was 3 years and 6 months old.

Defeated by: o1
Original Score
Average CS PhD: ~40%
Final Score
o1: 94.8%

BIG-Bench-Hard (2022 - 2024)

Multi-task
Killed by
Saturation

Killed 7 months ago. A curated suite of 23 challenging tasks from BIG-Bench on which language models initially performed below the average human level. Selected to measure progress on particularly difficult capabilities. It was 1 year and 8 months old.

Defeated by: Claude 3.5 Sonnet
Original Score
Average Human: 67.7%
Final Score
Claude 3.5 Sonnet: 93.1%

HumanEval (2021 - 2024)

Coding
Killed by
Saturation

Killed 8 months ago. A collection of 164 Python programming problems designed to test language models' coding abilities. Each problem includes a function signature, docstring, and unit tests. Models must generate complete, correct function implementations that pass all test cases. It was 2 years and 10 months old.

Defeated by: GPT-4o
Original Score
Unspecified
Final Score
GPT-4o: 90.2%

IFEval (2023 - 2024)

Instruction Following
Killed by
Saturation

Killed 10 months ago. An evaluation suite testing instruction following through prompts with automatically verifiable constraints, such as response length, required keywords, and output format. Measures the ability to satisfy complex, multi-part instructions precisely. It was 4 months old.

Defeated by: Llama 3.3 70B
Original Score
Unspecified
Final Score
Llama 3.3 70B: 92.1%

2023

GSM8K (2021 - 2023)

Mathematics
Killed by
Saturation

Killed 1 year ago. A collection of 8.5K grade school math word problems requiring step-by-step solutions. Problems test both numerical computation and natural language understanding through multi-step mathematical reasoning. It was 2 years and 1 month old.

Defeated by: GPT-4
Original Score
Unspecified
Final Score
GPT-4: 92.0%

Turing Test (1950 - 2023)

Conversation
Killed by
Saturation

Killed 1 year ago. The original AI benchmark proposed by Alan Turing in 1950. In this 'imitation game', a computer must convince human judges it is human through natural conversation. The test sparked decades of debate about machine intelligence and consciousness. It was 73 years and 5 months old.

Defeated by: GPT-4
Original Score
Interrogator accuracy: >50%
Final Score
Interrogator accuracy: 46%

ARC (AI2) (2018 - 2023)

Reasoning
Killed by
Saturation

Killed 1 year ago. The AI2 Reasoning Challenge (ARC): a collection of grade-school-level, multiple-choice science questions designed to require genuine reasoning rather than simple retrieval or word association. It was 5 years old.

Defeated by: GPT-4
Original Score
Unspecified
Final Score
GPT-4: 96.3%

HellaSwag (2019 - 2023)

Common Sense
Killed by
Saturation

Killed 1 year ago. A challenging dataset of multiple-choice questions about everyday scenarios. Uses adversarial filtering to test models' ability to understand and reason about real-world situations and their likely outcomes. It was 3 years and 10 months old.

Defeated by: GPT-4
Original Score
Human: 95.6%
Final Score
GPT-4: 95.3%

MMLU (2020 - 2023)

Knowledge
Killed by
Saturation

Killed 1 year ago. A comprehensive benchmark covering 57 subjects, including mathematics, history, law, computer science, and more. Questions are drawn from real-world sources like professional exams to test both breadth and depth of knowledge across diverse academic domains. It was 2 years and 6 months old.

Defeated by: GPT-4
Original Score
95th pct Human: 87.0%
Final Score
GPT-4: 87.3%

WinoGrande (2019 - 2023)

Common Sense
Killed by
Saturation

Killed 1 year ago. An enhanced version of WSC with 44K problems testing common-sense reasoning through pronoun resolution. Uses adversarial filtering to ensure problems require real-world understanding. It was 3 years and 8 months old.

Defeated by: GPT-4
Original Score
Human: 94%
Final Score
GPT-4: 87.5%

2022

BIG-Bench (2021 - 2022)

Multi-task
Killed by
Saturation

Killed 2 years ago. A collaborative collection of 204 tasks spanning linguistics, childhood development, math, common-sense reasoning, biology, physics, social bias, and software development. Tests diverse capabilities of language models. It was 10 months old.

Defeated by: PaLM 540B
Original Score
Human: 49.8%
Final Score
PaLM 540B: 61.4%

2019

SuperGLUE (2019 - 2019)

Language
Killed by
Saturation

Killed 5 years ago. A collection of more challenging language understanding tasks, including word sense disambiguation, causal reasoning, and reading comprehension. Designed as a more difficult successor to GLUE. It was 5 months old.

Defeated by: T5
Original Score
Human: 89.8%
Final Score
T5: 89.3%

WSC (2012 - 2019)

Common Sense
Killed by
Saturation

Killed 5 years ago. A collection of carefully crafted sentence pairs with ambiguous pronoun references that resolve differently based on small changes. Designed to test genuine language understanding over statistical patterns. It was 7 years and 3 months old.

Defeated by: RoBERTa (w/ SFT)
Original Score
Human: 96.5%
Final Score
RoBERTa (w/ SFT): 90.1%

GLUE (2018 - 2019)

Language
Killed by
Saturation

Killed 5 years ago. A collection of nine tasks for evaluating natural language understanding, including single-sentence tasks, similarity and paraphrase tasks, and inference tasks. The primary NLU benchmark before SuperGLUE. It was 1 year and 1 month old.

Defeated by: XLNet
Original Score
Human: 87.1%
Final Score
XLNet: 88.4%

TriviaQA (2017 - 2019)

Knowledge
Killed by
Saturation

Killed 5 years ago. A large-scale dataset of 650K question-answer-evidence triples authored by trivia enthusiasts. Requires cross-sentence reasoning and synthesis of information from multiple sources. It was 2 years and 1 month old.

Defeated by: SpanBERT
Original Score
Human: 79.7%
Final Score
SpanBERT: 83.6%

SQuAD v2.0 (2018 - 2019)

Language
Killed by
Saturation

Killed 5 years ago. An extension of SQuAD that adds unanswerable questions. Models must both answer questions when possible and determine when no answer is supported by the passage. It was 11 months old.

Defeated by: BERT
Original Score
Human: 89.5%
Final Score
BERT: 89.5%

SQuAD (2016 - 2019)

Language
Killed by
Saturation

Killed 5 years ago. A reading comprehension dataset of 100,000+ questions posed by crowdworkers on Wikipedia articles. Answers must be text segments from the corresponding reading passage. It was 2 years and 10 months old.

Defeated by: BERT
Original Score
Human: 91.2%
Final Score
BERT: 93.2%

2018

SWAG (2018 - 2018)

Common Sense
Killed by
Saturation

Killed 6 years ago. A dataset of 113K multiple-choice questions about grounded situations. Given a partial description of a situation, models must predict what happens next from 4 choices using common-sense reasoning. It was 5 months old.

Defeated by: BERT
Original Score
Human: 88%
Final Score
BERT: 86%

Inspiration

This website is meant to be a bit of fun, and to help us take a look back and remember the massive amount of progress that has been made — much of which I didn't believe I'd see within my lifetime.

It has also been heavily inspired by Cody Ogden's Killed by Google.

Understanding "Saturation"

For KilledByLLM, "Saturation" means a benchmark can no longer measure the frontier. While these benchmarks are still incredibly useful, valuable tools, they are no longer able to meaningfully contribute to the question of "Can AI do X?"

Data Collection Challenges

This project represents a best-effort attempt to document benchmarks of note that have been enveloped by LLMs. Proper attribution, timing, and scores have been difficult to determine definitively, so there may be some errors.

To illustrate this, let's examine "Qwen2.5-72B-Instruct" on MATH:

  • From Qwen's technical report: 83.1
  • From Stanford's HELM: 79.0
  • From Hugging Face's Open LLM Leaderboard: 38.7

These scores deviate significantly from each other!

Hence we take scores using the following order of precedence (a small sketch of this rule follows the list):

  1. From the authors' paper, technical report, or model card

  2. From succeeding benchmark papers, e.g. the SQuAD 2.0 paper discusses SQuAD 1.1 performance

  3. From third-party sources, e.g. Stanford's HELM

Please raise an issue or PR if you identify any discrepancies!
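
For the curious, that precedence amounts to "use the first source that is available". Here is a minimal, hypothetical Python sketch of the rule; the source labels, function name, and example numbers are illustrative only and are not part of this site's actual tooling.

```python
# Hypothetical sketch of the score-selection rule described above.
# Source labels and example numbers are illustrative, not this site's code.

SOURCE_PRIORITY = [
    "author_report",    # 1. authors' paper / technical report / model card
    "successor_paper",  # 2. a succeeding benchmark's paper
    "third_party",      # 3. third-party sources, e.g. Stanford's HELM
]

def pick_score(reported):
    """Return (source, score) from the highest-priority source present, else None."""
    for source in SOURCE_PRIORITY:
        if source in reported:
            return source, reported[source]
    return None

# Example using the MATH scores quoted above for Qwen2.5-72B-Instruct.
qwen_math = {"author_report": 83.1, "third_party": 79.0}
print(pick_score(qwen_math))  # -> ('author_report', 83.1)
```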

Found an error or have additional data? Contribute on GitHub

P.S. The em dashes on this page were lovingly handwritten by humans.