Lighteval Tasks Explorer

Browse tasks by language or tag, and search the task descriptions.

⭐ Recommended benchmarks are marked with a star icon.


math reasoning
english
The American Invitational Mathematics Examination (AIME) is a prestigious,
invite-only mathematics competition for high-school students who perform in the
top 5% of the AMC 12 mathematics exam. It involves 15 questions of increasing
difficulty, with the answer to every question being a single integer from 0 to
999. The median score is historically between 4 and 6 questions correct (out of
the 15 possible). Two versions of the test are given every year (thirty
questions total).
multiple-choice
english
ARC-AGI tasks are a series of three to five input-output examples followed by a
final task with only the input listed. Each task tests the use of a specific
learned skill based on a minimal number of cognitive priors.
In their native form, tasks are JSON lists of integers. These JSON files can
also be represented visually as a grid of colors using an ARC-AGI task viewer.
A successful submission is a pixel-perfect description (color and position) of
the final task's output; a toy example of the format is sketched below.
100% of tasks in the ARC-AGI-2 dataset were solved by at least 2 people in 2 or
fewer attempts (many were solved by more). ARC-AGI-2 is more difficult for AI.
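For illustration, here is a minimal sketch (in Python) of the task layout described above and a pixel-perfect check; the `train`/`test`/`input`/`output` field names follow the public ARC-AGI JSON format, but the toy grids and the helper function are invented for this example.

```python
# Toy illustration of an ARC-AGI task: grids are lists of lists of integers
# (each integer maps to a colour in the task viewer). The field names follow
# the public ARC-AGI JSON layout; the grid contents here are made up.
toy_task = {
    "train": [
        {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]},  # demonstration pair
        {"input": [[2, 2], [0, 0]], "output": [[0, 0], [2, 2]]},  # another demonstration
    ],
    "test": [
        {"input": [[3, 0], [0, 3]]},  # the solver must produce the missing "output" grid
    ],
}

def is_pixel_perfect(prediction, solution):
    """A submission only counts if every cell (colour and position) matches."""
    return prediction == solution

# If the hidden rule here were "swap the two rows":
print(is_pixel_perfect([[0, 3], [3, 0]], [[0, 3], [3, 0]]))  # True
```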
⭐ Humanity's Last Exam cais/hle
qa reasoning general-knowledge
english
Humanity's Last Exam (HLE) is a global collaborative effort, with questions from
nearly 1,000 subject-expert contributors affiliated with over 500 institutions
across 50 countries, most of them professors, researchers, and graduate degree
holders.
⭐ IFEval google/IFEval
instruction-following
english
A very specific task with no single gold output: instead, we test whether the
response format obeys a set of verifiable rules (a sketch of this kind of check
is shown below).
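A minimal sketch of what "testing whether the format obeys rules" can look like in practice; the two checkers and the sample instructions below are invented for illustration and are not IFEval's actual verifier set.

```python
# Illustrative rule-based checks in the spirit of IFEval: there is no gold
# answer text, only programmatic constraints on the model's output.
# These two checkers are examples, not the benchmark's real rule set.
def has_exact_bullet_count(response: str, n: int) -> bool:
    """Instruction: 'Answer with exactly n bullet points.'"""
    bullets = [line for line in response.splitlines()
               if line.strip().startswith(("-", "*"))]
    return len(bullets) == n

def is_all_lowercase(response: str) -> bool:
    """Instruction: 'Write your whole answer in lowercase.'"""
    return response == response.lower()

response = "- first point\n- second point\n- third point"
print(has_exact_bullet_count(response, 3))  # True
print(is_all_lowercase(response))           # True
```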
⭐ LiveCodeBench lighteval/code_generation_lite
code-generation
english
LiveCodeBench collects problems from periodic contests on the LeetCode, AtCoder,
and Codeforces platforms and uses them to construct a holistic benchmark for
evaluating Code LLMs across a variety of code-related scenarios, continuously
over time.
math reasoning
english
This dataset contains a 500-problem subset of the MATH benchmark, selected by
OpenAI for their "Let's Verify Step by Step" paper.
⭐ MixEval MixEval/MixEval
general-knowledge reasoning qa
english
A ground-truth-based dynamic benchmark derived from off-the-shelf benchmark
mixtures. It produces model rankings that correlate highly with Chatbot Arena
(0.96) while running locally and quickly (6% of the time and cost of running
MMLU), and its queries are stably and effortlessly updated every month to avoid
contamination.
⭐ MMLU-Pro TIGER-Lab/MMLU-Pro
general-knowledge knowledge multiple-choice
english
The MMLU-Pro dataset is a more robust and challenging massive multi-task
understanding dataset, tailored to more rigorously benchmark large language
models' capabilities. It contains 12K complex questions across various
disciplines.
⭐ MuSR TAUR-Lab/MuSR
long-context multiple-choice reasoning
english
MuSR is a benchmark for evaluating multistep reasoning in natural language
narratives. Built using a neurosymbolic synthetic-to-natural generation process,
it features complex, realistic tasks—such as long-form murder mysteries.
⭐ SimpleQA lighteval/SimpleQA
factuality general-knowledge qa
english
SimpleQA is a factuality benchmark that measures the ability of language models
to answer short, fact-seeking questions.
knowledge multilingual multiple-choice
arabic
ACVA multilingual benchmark.
math multilingual reasoning
amharic ewe french hausa igbo kinyarwanda lingala luganda oromo shona sotho swahili twi wolof xhosa yoruba zulu
African MGSM: MGSM for African Languages
knowledge multilingual multiple-choice
amharic ewe french hausa igbo kinyarwanda lingala luganda oromo shona sotho swahili twi wolof xhosa yoruba zulu
African MMLU: African Massive Multitask Language Understanding
classification multilingual nli
amharic ewe french hausa igbo kinyarwanda lingala luganda oromo shona sotho swahili twi wolof xhosa yoruba zulu
African XNLI: African XNLI
biology chemistry geography history knowledge language multiple-choice physics reasoning
english chinese
AGIEval is a human-centric benchmark specifically designed to evaluate the
general abilities of foundation models in tasks pertinent to human cognition and
problem-solving. This benchmark is derived from 20 official, public, and
high-standard admission and qualification exams intended for general human
test-takers, such as general college admission tests (e.g., Chinese College
Entrance Exam (Gaokao) and American SAT), law school admission tests, math
competitions, lawyer qualification tests, and national civil service exams.
nli reasoning
english
The Adversarial Natural Language Inference (ANLI) benchmark is a new large-scale
NLI dataset collected via an iterative, adversarial human-and-model-in-the-loop
procedure. ANLI is much more difficult than its predecessors, including SNLI and
MNLI. It contains three rounds, each with train/dev/test splits.
Arabic MMLU MBZUAI/ArabicMMLU
knowledge multilingual multiple-choice
arabic
Arabic MMLU multilingual benchmark.
multiple-choice
english
7,787 genuine grade-school level, multiple-choice science questions, assembled
to encourage research in advanced question-answering. The dataset is partitioned
into a Challenge Set and an Easy Set, where the former contains only questions
answered incorrectly by both a retrieval-based algorithm and a word
co-occurrence algorithm.
multilingual multiple-choice qa reasoning
arabic
ARCD: Arabic Reading Comprehension Dataset.
math reasoning
english
A small battery of 10 tests that involve asking language models a simple
arithmetic problem in natural language.
math reasoning
english
ASDiv is a dataset for arithmetic reasoning that contains 2,000+ questions
covering addition, subtraction, multiplication, and division.
qa reasoning
english
The bAbI benchmark for measuring understanding and reasoning evaluates reading
comprehension via question answering.
bias multiple-choice qa
english
The Bias Benchmark for Question Answering (BBQ) measures social bias in question
answering in ambiguous and unambiguous contexts.
multilingual multiple-choice reading-comprehension
arabic armenian bengali cyrillic devanagari ethiopic georgian greek gujarati gurmukhi chinese (simplified) chinese (traditional) hangul hebrew japanese khmer kannada lao latin malayalam myanmar odia sinhala tamil telugu thai tibetan
Belebele: A large-scale reading comprehension dataset covering 122 languages.
reasoning
english
Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models.
166 tasks from the BIG-bench benchmark.
language-modeling
english
BLiMP is a challenge set for evaluating what language models (LMs) know
about major grammatical phenomena in English. BLiMP consists of 67
sub-datasets, each containing 1000 minimal pairs isolating specific
contrasts in syntax, morphology, or semantics. The data is automatically
generated according to expert-crafted grammars.
bias generation
english
The Bias in Open-Ended Language Generation Dataset (BOLD) for measuring biases
and toxicity in open-ended language generation.
multilingual multiple-choice reasoning
chinese
C3: A Chinese Challenge Corpus for cross-lingual and cross-modal tasks. A
reading comprehension task that is part of CLUE.
knowledge multilingual multiple-choice
chinese
CMMLU multilingual benchmark.
classification multilingual nli
chinese
Chinese NLI dataset based on the MNLI approach (machine translated).
CMRC 2018 clue/clue
multilingual qa
chinese
CMRC 2018: A span-extraction machine reading comprehension dataset for Chinese.
CommonsenseQA tau/commonsense_qa
commonsense multiple-choice qa
english
CommonsenseQA is a multiple-choice question answering dataset that requires
different types of commonsense knowledge to predict the correct answers. It
contains 12,102 questions with one correct answer and four distractor answers.
The dataset is provided in two major training/validation/testing set splits: the
"Random split", which is the main evaluation split, and the "Question token
split"; see the paper for details.
multilingual multiple-choice reasoning
assamese bengali gujarati hindi kannada malayalam marathi nepali oriya punjabi sanskrit sindhi tamil telugu urdu
IndicCOPA: COPA for Indic Languages. Paper: https://arxiv.org/pdf/2212.05409
IndicCOPA extends COPA to 15 Indic languages, providing a valuable resource for
evaluating common sense reasoning in these languages.
dialog qa
english
CoQA is a large-scale dataset for building Conversational Question Answering
systems. The goal of the CoQA challenge is to measure the ability of machines to
understand a text passage and answer a series of interconnected questions that
appear in a conversation.
dialog medical
english
The COVID-19 Dialogue dataset is a collection of 500+ dialogues between
doctors and patients during the COVID-19 pandemic.
math qa reasoning
english
The DROP dataset is a new question-answering dataset designed to evaluate the
ability of language models to answer complex questions that require reasoning
over multiple sentences.
Emotion Classification dair-ai/emotion
emotion classification multiple-choice
english
This task performs emotion classification, assigning text to one of six emotion
categories: sadness, joy, love, anger, fear, or surprise.
knowledge multilingual multiple-choice
portuguese
ENEM (Exame Nacional do Ensino Médio) is a standardized Brazilian national
secondary education examination. The exam is used both as a university admission
test and as a high school evaluation test.
classification ethics justice morality utilitarianism virtue
english
The Ethics benchmark for evaluating the ability of language models to reason about
ethical issues.
knowledge multilingual multiple-choice
albanian arabic bulgarian croatian french german hungarian italian lithuanian macedonian polish portuguese serbian spanish turkish vietnamese
Exams multilingual benchmark.
Flores200 facebook/flores
multilingual translation
arabic armenian bengali cyrillic devanagari ethiopic georgian greek gujarati gurmukhi chinese (simplified) chinese (traditional) hangul hebrew japanese khmer kannada lao latin malayalam myanmar odia sinhala tamil telugu thai tibetan
Flores200 multilingual benchmark.
knowledge multilingual multiple-choice
amharic arabic bengali chinese czech dutch english french german hebrew hindi indonesian italian japanese korean malay norwegian polish portuguese romanian russian serbian spanish swahili swedish tamil telugu thai turkish ukrainian urdu vietnamese yoruba zulu
Translated MMLU using both professional and non-professional translators.
Contains tags for cultural sensitivity.
classification language-understanding
english
The General Language Understanding Evaluation (GLUE) benchmark is a collection
of resources for training, evaluating, and analyzing natural language
understanding systems.
biology chemistry graduate-level multiple-choice physics qa reasoning science
english
GPQA is a dataset of 448 expert-written multiple-choice questions in biology,
physics, and chemistry, designed to test graduate-level reasoning. The questions
are extremely difficult—PhD-level experts score about 65%, skilled non-experts
34% (even with web access), and GPT-4 around 39%. GPQA aims to support research
on scalable oversight, helping humans evaluate and trust AI systems that may
exceed human expertise.
math reasoning
english
GSM-Plus is an adversarial extension of GSM8K that tests the robustness of LLMs'
mathematical reasoning by introducing varied perturbations to grade-school math
problems.
math reasoning
english
GSM8K is a dataset of 8,000+ high-quality grade-school math word problems that require multi-step arithmetic reasoning.
health medical multiple-choice qa
english spanish
HEAD-QA is a multi-choice HEAlthcare Dataset. The questions come from exams to
access a specialized position in the Spanish healthcare system, and are
challenging even for highly specialized humans. They are designed by the
Ministerio de Sanidad, Consumo y Bienestar Social, which also provides direct
access to the exams of the last 5 years.
HellaSwag Rowan/hellaswag
multiple-choice narrative reasoning
english
HellaSwag is a commonsense inference benchmark designed to challenge language
models with adversarially filtered multiple-choice questions.
multilingual multiple-choice reasoning
thai
Hellaswag Thai is a Thai adaptation of the HellaSwag task. As with the Turkish
version, there is no specific paper, but it has been found to be effective for
evaluating Thai language models on commonsense reasoning tasks.
multilingual multiple-choice reasoning
turkish
Hellaswag Turkish is a Turkish adaptation of the HellaSwag task. While there is
no specific paper for this version, it has been found to work well for
evaluating Turkish language models on commonsense reasoning tasks. We do not
fold these adaptations into a single task because there are quite a few
differences (dataset/subset, dot replacement, etc.) that would make it hard to
read.
Hindi BoolQ ai4bharat/boolq-hi
classification multilingual qa
gujarati hindi malayalam marathi tamil
Hindi BoolQ multilingual benchmark.
classification
english
The IMDB benchmark for sentiment analysis of movie reviews, from: Learning Word
Vectors for Sentiment Analysis.
multilingual qa
assamese bengali gujarati hindi kannada malayalam marathi oriya punjabi tamil telugu
IndicQA: A reading comprehension dataset for 11 Indian languages.
language-modeling
english
LAMBADA is a benchmark for testing language models’ ability to understand broad
narrative context. Each passage requires predicting its final word—easy for
humans given the full passage but impossible from just the last sentence.
Success demands long-range discourse comprehension.
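As a rough illustration of how last-word prediction can be scored with a causal language model, here is a sketch using the Hugging Face `transformers` API; the model choice (gpt2), the example passage, and the greedy single-token comparison are assumptions made for this example, not how Lighteval itself implements the task.

```python
# Sketch: score a LAMBADA-style final-word prediction with a causal LM.
# Assumptions for illustration only: gpt2 as the model, greedy decoding,
# comparison on the first token of the gold word.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

passage = "He put the key in the lock, turned it, and slowly opened the"
target_word = " door"  # the final word the model must recover from context

inputs = tokenizer(passage, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, sequence_length, vocab_size)

predicted_id = int(logits[0, -1].argmax())          # greedy next-token guess
gold_id = tokenizer(target_word)["input_ids"][0]    # first token of the gold word
print(tokenizer.decode([predicted_id]).strip(), predicted_id == gold_id)
```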
classification legal
english
LexGLUE: A Benchmark Dataset for Legal Language Understanding in English
classification legal
bulgarian czech danish german greek english spanish estonian finnish french irish croatian hungarian italian lithuanian latvian maltese dutch polish portuguese romanian slovak slovenian swedish
LEXTREME: A Multi-Lingual and Multi-Task Benchmark for the Legal Domain
qa
english
LogiQA is a machine reading comprehension dataset focused on testing logical
reasoning abilities. It contains 8,678 expert-written multiple-choice questions
covering various types of deductive reasoning. While humans perform strongly,
state-of-the-art models lag far behind, making LogiQA a benchmark for advancing
logical reasoning in NLP systems.
knowledge multilingual multiple-choice
afrikaans chinese english italian javanese portuguese swahili thai vietnamese
M3Exam: Multitask Multilingual Multimodal Evaluation Benchmark. It also has a
multimodal version, which we do not support. Paper:
https://arxiv.org/abs/2306.05179
MathLogicQA Rus ai-forever/MERA
math multilingual qa reasoning
russian
MathLogicQA is a dataset for evaluating mathematical reasoning in language
models. It consists of multiple-choice questions that require logical reasoning
and mathematical problem-solving. This Russian version is part of the MERA
(Multilingual Evaluation of Reasoning Abilities) benchmark.
math qa reasoning
english
A large-scale dataset of math word problems, gathered by using a new
representation language to annotate the AQuA-RAT dataset with fully specified
operational programs. AQuA-RAT provides the questions, options, rationales, and
correct answers.
math multilingual reasoning
bengali chinese english french german japanese russian spanish swahili telugu thai
MGSM multilingual benchmark.
math multilingual reasoning
english spanish french german russian chinese japanese thai swahili bengali telugu
Multilingual Grade School Math Benchmark (MGSM) is a benchmark of grade-school
math problems.
The same 250 problems from GSM8K are each translated by human annotators into 10
languages.
knowledge multilingual qa
arabic english french german hindi italian japanese portuguese spanish
Mintaka multilingual benchmark.
multilingual qa
arabic chinese chinese_hong_kong chinese_traditional danish dutch english finnish french german hebrew hungarian italian japanese khmer korean malay norwegian polish portuguese russian spanish swedish thai turkish vietnamese
MKQA multilingual benchmark.
MLMM ARC Challenge jon-tow/okapi_arc_challenge
multilingual multiple-choice reasoning
arabic bengali catalan chinese croatian danish dutch french german hindi hungarian indonesian italian kannada malayalam marathi nepali romanian russian serbian slovak spanish tamil telugu ukrainian vietnamese
ARC (AI2 Reasoning Challenge) is a dataset for question answering that requires
reasoning. It consists of multiple-choice science questions from 3rd to 9th
grade exams. The dataset is split into two parts: ARC-Easy and ARC-Challenge.
ARC-Easy contains questions that can be answered correctly by both humans and
simple baseline models. ARC-Challenge contains questions that are difficult for
both humans and current AI systems. Similar to MMLU, ARC tasks use PMI
normalization by default, but only for the challenge set (see the sketch below).
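Since PMI normalization comes up here (and again for XCSQA below), a brief sketch of the idea: each answer option is scored by its log-probability given the question minus its log-probability given a neutral, question-free prompt, which down-weights options that are simply common strings. The options and log-probabilities below are invented numbers for illustration.

```python
# Sketch of PMI-style scoring for one multiple-choice question:
#   score(option) = log P(option | question) - log P(option | neutral prompt)
# Options that are likely regardless of the question get penalised.
# All log-probabilities below are made up for illustration.
log_p_given_question = {"gravity": -2.1, "magic": -6.0, "water": -2.0}
log_p_unconditional = {"gravity": -5.0, "magic": -7.0, "water": -2.5}

pmi_scores = {
    option: log_p_given_question[option] - log_p_unconditional[option]
    for option in log_p_given_question
}
best = max(pmi_scores, key=pmi_scores.get)
# "gravity" wins under PMI even though "water" has the highest raw likelihood.
print(pmi_scores, "->", best)
```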
multilingual multiple-choice reasoning
arabic armenian basque bengali catalan chinese croatian danish dutch french german gujarati hindi hungarian icelandic indonesian italian kannada malayalam marathi nepali norwegian portuguese romanian russian serbian slovak spanish swedish tamil telugu ukrainian vietnamese
Hellaswag is a commonsense reasoning task that requires models to complete a
given scenario with the most plausible ending. It tests the model's ability to
understand and reason about everyday situations and human behavior.
MLMM-Hellaswag: Multilingual adaptation of Hellaswag
knowledge multilingual multiple-choice
arabic bengali catalan chinese croatian danish dutch french german hindi hungarian indonesian italian kannada malayalam marathi nepali romanian russian serbian slovak spanish tamil telugu ukrainian vietnamese
MLMM MMLU: Another multilingual version of MMLU
MLMM TruthfulQA jon-tow/okapi_truthfulqa
factuality multilingual qa
arabic armenian basque bengali catalan chinese croatian danish dutch french german gujarati hindi hungarian icelandic indonesian italian kannada malayalam marathi nepali norwegian portuguese romanian russian serbian slovak spanish swedish tamil telugu ukrainian vietnamese
TruthfulQA: Measuring How Models Mimic Human Falsehoods
multilingual qa
arabic chinese german hindi spanish vietnamese
MLQA (MultiLingual Question Answering) is a benchmark dataset for evaluating
cross-lingual question answering performance. It consists of QA instances in 7
languages: English, Arabic, German, Spanish, Hindi, Vietnamese, and Chinese. The
dataset is derived from the SQuAD v1.1 dataset, with questions and contexts
translated by professional translators.
general-knowledge knowledge multiple-choice
english
MMLU is a benchmark of general knowledge and English language understanding.
conversational generation multi-turn
english
MT-Bench is a multi-turn conversational benchmark for evaluating language
models. It consists of 80 high-quality multi-turn questions across 8 common
categories (writing, roleplay, reasoning, math, coding, extraction, STEM,
humanities). Model responses are evaluated by a judge LLM.
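A minimal sketch of the judge step described above, assuming an OpenAI-style chat client; the prompt wording, the judge model name, and the 1-10 scale are illustrative stand-ins rather than MT-Bench's exact judge template.

```python
# Sketch of LLM-as-judge grading in the spirit of MT-Bench. The prompt text,
# judge model, and rating format below are assumptions for illustration.
import re
from openai import OpenAI  # assumes the OpenAI Python client is installed and configured

client = OpenAI()

def judge_answer(question: str, answer: str, judge_model: str = "gpt-4o") -> int:
    prompt = (
        "Rate the assistant's answer to the user's question on a scale of 1 to 10.\n"
        f"Question: {question}\nAnswer: {answer}\n"
        "Reply in the form 'Rating: [[x]]'."
    )
    reply = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content
    match = re.search(r"\[\[(\d+)\]\]", reply)
    return int(match.group(1)) if match else -1  # -1 if the rating could not be parsed
```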
qa reading-comprehension
english
NarrativeQA is a reading comprehension benchmark that tests deep understanding
of full narratives—books and movie scripts—rather than shallow text matching. To
answer its questions, models must integrate information across entire stories.
general-knowledge qa
english
This dataset is a collection of question-answer pairs from the Natural Questions
dataset. See Natural Questions for additional information. This dataset can be
used directly with Sentence Transformers to train embedding models.
math reasoning
english
Numeracy is a benchmark for evaluating the ability of language models to reason about mathematics.
knowledge multilingual multiple-choice
portuguese
OAB Exams: a collection of questions from the Brazilian Bar Association exam.
The exam is required for anyone who wants to practice law in Brazil.
OCNLI clue/clue
classification multilingual nli
chinese
Natively collected Chinese NLI dataset.
OlympiadBench Hothan/OlympiadBench
math reasoning language
english chinese
OlympiadBench is a benchmark for evaluating the performance of language models
on olympiad problems.
OpenAI MMLU openai/MMMLU
knowledge multilingual multiple-choice
arabic bengali chinese french german hindi indonesian italian japanese korean portuguese spanish swahili yoruba
OpenAI MMLU multilingual benchmark.
multilingual multiple-choice reasoning
arabic
OpenBookQA: A Question-Answering Dataset for Open-Book Exams. OpenBookQA is a
question-answering dataset modeled after open-book exams for assessing human
understanding of a subject. It consists of multiple-choice questions that
require combining facts from a given open book with broad common knowledge. The
task tests language models' ability to leverage provided information and apply
common sense reasoning.
multilingual multiple-choice reasoning
spanish
Spanish version of OpenBookQA from BSC Language Technology group
Openbook Rus ai-forever/MERA
multilingual multiple-choice reasoning
russian
The Russian version is part of the MERA (Multilingual Enhanced Russian NLP
Architectures) project.
multiple-choice qa
english
OpenBookQA is a question-answering dataset modeled after open-book exams for
assessing human understanding of a subject. It contains multiple-choice
questions that require combining facts from a given open book with broad common
knowledge. The task tests language models' ability to leverage provided
information and apply common sense reasoning.
Oz Serbian Evals DjMel/oz-eval
knowledge multiple-choice
serbian
The OZ Eval (Serbian: Opšte Znanje Evaluacija, "general knowledge evaluation")
dataset was created to evaluate the general knowledge of LLMs in Serbian. The
data consists of 1k+ high-quality questions and answers that were used as part
of entry exams at the Faculty of Philosophy and the Faculty of Organizational
Sciences, University of Belgrade. These exams test students' general knowledge
and were used in enrollment periods from 2003 to 2024.
multilingual
russian
PARus: Plausible Alternatives for Russian PARus is the Russian adaptation of the
COPA task, part of the Russian SuperGLUE benchmark. It evaluates common sense
reasoning and causal inference abilities in Russian language models.
classification multilingual nli
chinese english french german japanese korean spanish
PAWS-X: A Cross-lingual Adversarial Dataset for Paraphrase Identification. This
dataset contains paraphrase identification pairs in multiple languages. It is
derived from PAWS (Paraphrase Adversaries from Word Scrambling). We treat
paraphrase as entailment and non-paraphrase as contradiction.
commonsense multiple-choice qa
english
PIQA is a benchmark for testing physical commonsense reasoning; its questions
require reasoning about everyday physical interactions.
multilingual multiple-choice qa reasoning
arabic
PIQA: Physical Interaction Question Answering. PIQA is a benchmark for testing
physical commonsense reasoning. This Arabic version is a translation of the
original PIQA dataset, adapted for Arabic-language evaluation. It tests the
ability to reason about physical interactions in everyday situations.
reasoning qa physical-commonsense
english
PROST is a benchmark for testing physical reasoning about objects through space
and time. It includes 18,736 multiple-choice questions covering 10 core physics
concepts, designed to probe models in zero-shot settings. Results show that even
large pretrained models struggle with physical reasoning and are sensitive to
question phrasing, underscoring their limited real-world understanding.
PubMedQA pubmed_qa
biomedical health medical qa
english
PubMedQA is a dataset for biomedical research question answering.
QA4MRE qa4mre
biomedical health qa
english
QA4MRE is a machine reading comprehension benchmark from the CLEF 2011-2013
challenges. It evaluates systems' ability to answer questions requiring deep
understanding of short texts, supported by external background knowledge.
Covering tasks like modality, negation, biomedical reading, and entrance exams,
QA4MRE tests reasoning beyond surface-level text matching.
qa scientific
english
QASPER is a dataset for question answering on scientific research papers. It
consists of 5,049 questions over 1,585 Natural Language Processing papers. Each
question is written by an NLP practitioner who read only the title and abstract
of the corresponding paper, and the question seeks information present in the
full text. The questions are then answered by a separate set of NLP
practitioners who also provide supporting evidence to answers.
RACE High EleutherAI/race
multiple-choice reading-comprehension
english
RACE is a large-scale reading comprehension dataset with more than 28,000
passages and nearly 100,000 questions. The dataset is collected from English
examinations in China, which are designed for middle school and high school
students. The dataset can serve as training and test sets for machine
comprehension.
classification reasoning
english
The Real-world annotated few-shot (RAFT) meta-benchmark of 11 real-world text
classification tasks.
classification multilingual nli
russian
Russian Commitment Bank (RCB) is a large-scale NLI dataset with Russian
sentences, collected from the web and crowdsourcing.
physics chemistry biology reasoning multiple-choice qa
english
The SciQ dataset contains 13,679 crowdsourced science exam questions about
Physics, Chemistry and Biology, among others. The questions are in
multiple-choice format with 4 answer options each. For the majority of the
questions, an additional paragraph with supporting evidence for the correct
answer is provided.
knowledge multiple-choice
serbian
The tasks cover a variety of benchmarks, including standard tasks like ARC (Easy
and Challenge), BoolQ, HellaSwag, OpenBookQA, PIQA, and Winogrande, plus the
custom OZ Eval. MMLU is provided both separated by subject and all in one.
commonsense multiple-choice qa
english
We introduce Social IQa: Social Interaction QA, a new question-answering
benchmark for testing social commonsense intelligence. Contrary to many prior
benchmarks that focus on physical or taxonomic knowledge, Social IQa focuses on
reasoning about people's actions and their social implications. For example,
given an action like "Jesse saw a concert" and a question like "Why did Jesse do
this?", humans can easily infer that Jesse wanted "to see their favorite
performer" or "to enjoy the music", and not "to see what's happening inside" or
"to see if it works". The actions in Social IQa span a wide variety of social
situations, and answer candidates contain both human-curated answers and
adversarially-filtered machine-generated candidates. Social IQa contains over
37,000 QA pairs for evaluating models' abilities to reason about the social
implications of everyday events and situations.
reasoning symbolic
english
SLR-Bench is a large-scale benchmark for scalable logical reasoning with
language models, comprising 19,000 prompts organized into 20 curriculum levels.
qa
english
Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset,
consisting of questions posed by crowdworkers on a set of Wikipedia articles,
where the answer to every question is a segment of text, or span, from the
corresponding reading passage, or the question might be unanswerable.
SQuAD 2.0 combines the 100,000 questions in SQuAD 1.1 with over 50,000
unanswerable questions written adversarially by crowdworkers to look similar to
answerable ones. To do well on SQuAD 2.0, systems must not only answer questions
when possible, but also determine when no answer is supported by the paragraph
and abstain from answering.
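A minimal sketch of exact-match scoring with abstention in the SQuAD 2.0 spirit: both strings are normalized (lowercased, punctuation and articles stripped) before comparison, and an empty prediction acts as the "abstain" signal for unanswerable questions. The normalization here is a simplified stand-in for the official evaluation script.

```python
# Simplified SQuAD-style scoring: normalise both strings, then exact match.
# An empty gold answer marks an unanswerable question; the system should
# "abstain" by predicting an empty string. This is a simplified sketch, not
# the official SQuAD 2.0 evaluation script.
import re
import string

def normalize(text: str) -> str:
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold: str) -> bool:
    return normalize(prediction) == normalize(gold)

print(exact_match("The Eiffel Tower!", "eiffel tower"))  # True
print(exact_match("Paris", ""))                          # False: the system should have abstained
```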
narrative reasoning
english
A Corpus and Cloze Evaluation for Deeper Understanding of
Commonsense Stories
summarization
english
From "Don't Give Me the Details, Just the Summary! Topic-Aware Convolutional
Neural Networks for Extreme Summarization" and "Abstractive Text Summarization
using Sequence-to-sequence RNNs and Beyond".
narrative reasoning
english
The dataset consists of 113k multiple choice questions about grounded situations
(73k training, 20k validation, 20k test). Each question is a video caption from
LSMDC or ActivityNet Captions, with four answer choices about what might happen
next in the scene. The correct answer is the (real) video caption for the next
event in the video; the three incorrect answers are adversarially generated and
human verified, so as to fool machines but not humans. SWAG aims to be a
benchmark for evaluating grounded commonsense NLI and for learning
representations.
Swahili ARC
multilingual multiple-choice reasoning
swahili
Swahili ARC multilingual benchmark.
Thai Exams scb10x/thai_exam
knowledge multilingual multiple-choice
thai
Thai Exams multilingual benchmark.
language-modeling
english
The Pile corpus for measuring language model performance across various domains.
generation safety
english
This dataset is for implicit hate speech detection. All instances were generated
using GPT-3 and the methods described in our paper.
multilingual qa
turkish
TQuAD v2: Turkish Question Answering Dataset version 2.
qa
english
TriviaQA is a reading comprehension dataset containing over 650K
question-answer-evidence triples. TriviaQA includes 95K question-answer pairs
authored by trivia enthusiasts and independently gathered evidence documents,
six per question on average, that provide high quality distant supervision for
answering the questions.
knowledge multiple-choice
turkic
TUMLU-mini is a benchmark for Turkic language understanding, comprising 1,000
prompts organized into 10 subsets.
Turkish ARC malhajar/arc-tr
multilingual multiple-choice reasoning
turkish
Turkish ARC, from the Turkish leaderboard.
language-modeling
english
Demographic Dialectal Variation in Social Media: A Case Study of African-American English
multilingual qa
arabic bengali english finnish indonesian japanese korean russian swahili telugu thai
TyDi QA: A benchmark for information-seeking question answering in typologically diverse languages. Paper: https://arxiv.org/abs/2003.05002
qa
english
This dataset consists of 6,642 question/answer pairs. The questions are supposed
to be answerable by Freebase, a large knowledge graph. The questions are mostly
centered around a single named entity. The questions are popular ones asked on
the web.
language-modeling
english
The WikiText language modeling dataset is a collection of over 100 million
tokens extracted from the set of verified Good and Featured articles on
Wikipedia. The dataset is available under the Creative Commons
Attribution-ShareAlike License.
commonsense multiple-choice
english
WinoGrande is a new collection of 44k problems, inspired by Winograd Schema
Challenge (Levesque, Davis, and Morgenstern 2011), but adjusted to improve the
scale and robustness against the dataset-specific bias. Formulated as a
fill-in-a-blank task with binary options, the goal is to choose the right option
for a given sentence which requires commonsense reasoning.
WorldTree Rus ai-forever/MERA
multilingual
russian
WorldTree is a dataset for multi-hop inference in science question answering. It
provides explanations for elementary science questions by combining facts from a
semi-structured knowledge base. This Russian version is part of the MERA
(Multilingual Evaluation of Reasoning Abilities) benchmark.
multilingual multiple-choice reasoning
arabic chinese dutch english french german hindi italian japanese polish portuguese russian spanish swahili urdu vietnamese
XCODAH multilingual benchmark.
XCOPA
multilingual multiple-choice narrative reasoning
arabic chinese estonian haitian indonesian italian quechua swahili tamil thai turkish vietnamese
COPA (Choice of Plausible Alternatives) tasks involve determining the most
plausible cause or effect for a given premise. These tasks test common sense
reasoning and causal inference abilities. XCOPA: Cross-lingual Choice of
Plausible Alternatives.
commonsense multilingual multiple-choice reasoning
english
XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning The Cross-lingual
Choice of Plausible Alternatives dataset is a benchmark to evaluate the ability
of machine learning models to transfer commonsense reasoning across languages.
multilingual multiple-choice qa reasoning
arabic chinese dutch english french german hindi italian japanese polish portuguese russian spanish swahili urdu vietnamese
XCSQA (Cross-lingual Commonsense QA) is part of the XCSR (Cross-lingual
Commonsense Reasoning) benchmark. It is a multilingual extension of the
CommonsenseQA dataset, covering 16 languages. The task involves answering
multiple-choice questions that require commonsense reasoning. Uses PMI
normalization.
classification multilingual nli
arabic bulgarian chinese english french german greek hindi russian spanish swahili thai turkish urdu vietnamese
NLI (Natural Language Inference) tasks involve determining the logical
relationship between two given sentences: a premise and a hypothesis. The goal
is to classify whether the hypothesis is entailed by, contradicts, or is neutral
with respect to the premise. After our inspection we found the neutral label to
be quite ambiguous and decided to exclude it, but you can easily add it back by
modifying the adapters. The XNLI dataset is a multilingual variant of MultiNLI.
classification multilingual nli
assamese bengali gujarati hindi kannada malayalam marathi oriya punjabi tamil telugu
Another variant of XNLI, with emphasis on Indic languages.
XNLI 2.0
classification multilingual nli
arabic assamese bengali bulgarian chinese english french german greek gujarati hindi kannada marathi punjabi russian sanskrit spanish swahili tamil thai turkish urdu vietnamese
An improvement on XNLI with better translations; in our experience, models tend
to perform better on XNLI 2.0 than on XNLI.
multilingual qa
arabic chinese english german greek hindi romanian russian spanish thai turkish vietnamese
Reading Comprehension (RC) tasks evaluate a model's ability to understand and
extract information from text passages. These tasks typically involve answering
questions based on given contexts, spanning multiple languages and formats;
together, the RC tasks cover about 130 unique languages/scripts. XQuAD
(Cross-lingual Question Answering Dataset) is a SQuAD-like benchmark extending
SQuAD to 11 languages.
multilingual narrative
arabic basque burmese chinese hindi indonesian russian spanish swahili telugu
XStoryCloze multilingual benchmark.
multilingual narrative reasoning
english russian chinese spanish arabic hindi indonesian telugu swahili basque burmese
XStoryCloze consists of the professionally translated version of the English
StoryCloze dataset (Spring 2016 version) to 10 non-English languages. This
dataset is released by Meta AI.
multilingual multiple-choice reasoning
chinese english french japanese portuguese russian
XWinograd multilingual benchmark.
commonsense multilingual reasoning
english french japanese portuguese russian chinese
Multilingual winograd schema challenge as used in Crosslingual Generalization through Multitask Finetuning.