Lighteval Tasks Explorer

Browse tasks by language or tag, and search the task descriptions.

⭐ Recommended benchmarks are marked with a star icon.


math reasoning
english
The American Invitational Mathematics Examination (AIME) is a prestigious,
invite-only mathematics competition for high-school students who perform in the
top 5% of the AMC 12 mathematics exam. It involves 15 questions of increasing
difficulty, with the answer to every question being a single integer from 0 to
999. The median score is historically between 4 and 6 questions correct (out of
the 15 possible). Two versions of the test are given every year (thirty
questions total).
multiple-choice
english
ARC-AGI tasks are a series of three to five input-output examples followed by a
final task with only the input listed. Each task tests the use of a specific
learned skill based on a minimal number of cognitive priors.
In their native form, tasks are JSON lists of integers. These JSON files can
also be represented visually as a grid of colors using an ARC-AGI task viewer.
A successful submission is a pixel-perfect description (color and position) of
the final task's output; a toy example of the format is sketched below.
100% of tasks in the ARC-AGI-2 dataset were solved by at least 2 people in 2 or
fewer attempts (many were solved by more). ARC-AGI-2 is more difficult for AI.
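For illustration, here is a minimal sketch (in Python) of the task layout described above and a pixel-perfect check; the `train`/`test`/`input`/`output` field names follow the public ARC-AGI JSON format, but the toy grids and the helper function are invented for this example.

```python
# Toy illustration of an ARC-AGI task: grids are lists of lists of integers
# (each integer maps to a colour in the task viewer). The field names follow
# the public ARC-AGI JSON layout; the grid contents here are made up.
toy_task = {
    "train": [
        {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]},  # demonstration pair
        {"input": [[2, 2], [0, 0]], "output": [[0, 0], [2, 2]]},  # another demonstration
    ],
    "test": [
        {"input": [[3, 0], [0, 3]]},  # the solver must produce the missing "output" grid
    ],
}

def is_pixel_perfect(prediction, solution):
    """A submission only counts if every cell (colour and position) matches."""
    return prediction == solution

# If the hidden rule here were "swap the two rows":
print(is_pixel_perfect([[0, 3], [3, 0]], [[0, 3], [3, 0]]))  # True
```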
⭐ Humanity's Last Exam cais/hle
qa reasoning general-knowledge
english
Humanity's Last Exam (HLE) is a global collaborative effort, with questions from
nearly 1,000 subject-expert contributors affiliated with over 500 institutions
across 50 countries, most of them professors, researchers, and graduate degree
holders.
⭐ IFEval google/IFEval
instruction-following
english
A very specific task with no single gold output: instead, we test whether the
response format obeys a set of verifiable rules (a sketch of this kind of check
is shown below).
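A minimal sketch of what "testing whether the format obeys rules" can look like in practice; the two checkers and the sample instructions below are invented for illustration and are not IFEval's actual verifier set.

```python
# Illustrative rule-based checks in the spirit of IFEval: there is no gold
# answer text, only programmatic constraints on the model's output.
# These two checkers are examples, not the benchmark's real rule set.
def has_exact_bullet_count(response: str, n: int) -> bool:
    """Instruction: 'Answer with exactly n bullet points.'"""
    bullets = [line for line in response.splitlines()
               if line.strip().startswith(("-", "*"))]
    return len(bullets) == n

def is_all_lowercase(response: str) -> bool:
    """Instruction: 'Write your whole answer in lowercase.'"""
    return response == response.lower()

response = "- first point\n- second point\n- third point"
print(has_exact_bullet_count(response, 3))  # True
print(is_all_lowercase(response))           # True
```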
⭐ LiveCodeBench lighteval/code_generation_lite
code-generation
english
LiveCodeBench collects problems from periodic contests on the LeetCode, AtCoder,
and Codeforces platforms and uses them to construct a holistic benchmark for
evaluating Code LLMs across a variety of code-related scenarios, continuously
over time.
math reasoning
english
This dataset contains a 500-problem subset of the MATH benchmark, selected by
OpenAI for their "Let's Verify Step by Step" paper.
⭐ MixEval MixEval/MixEval
general-knowledge reasoning qa
english
A ground-truth-based dynamic benchmark derived from off-the-shelf benchmark
mixtures. It produces model rankings that correlate highly with Chatbot Arena
(0.96) while running locally and quickly (6% of the time and cost of running
MMLU), and its queries are stably and effortlessly updated every month to avoid
contamination.
⭐ MMLU-Pro TIGER-Lab/MMLU-Pro
general-knowledge knowledge multiple-choice
english
The MMLU-Pro dataset is a more robust and challenging massive multi-task
understanding dataset, tailored to more rigorously benchmark large language
models' capabilities. It contains 12K complex questions across various
disciplines.
⭐ MuSR TAUR-Lab/MuSR
long-context multiple-choice reasoning
english
MuSR is a benchmark for evaluating multistep reasoning in natural language
narratives. Built using a neurosymbolic synthetic-to-natural generation process,
it features complex, realistic tasks—such as long-form murder mysteries.
⭐ SimpleQA lighteval/SimpleQA
factuality general-knowledge qa
english
SimpleQA is a factuality benchmark that measures the ability of language models
to answer short, fact-seeking questions.
knowledge multilingual multiple-choice
arabic
ACVA multilingual benchmark.
math multilingual reasoning
amharic ewe french hausa igbo kinyarwanda lingala luganda oromo shona sotho swahili twi wolof xhosa yoruba zulu
African MGSM: MGSM for African Languages
knowledge multilingual multiple-choice
amharic ewe french hausa igbo kinyarwanda lingala luganda oromo shona sotho swahili twi wolof xhosa yoruba zulu
African MMLU: African Massive Multitask Language Understanding
classification multilingual nli
amharic ewe french hausa igbo kinyarwanda lingala luganda oromo shona sotho swahili twi wolof xhosa yoruba zulu
African XNLI: African XNLI
biology chemistry geography history knowledge language multiple-choice physics reasoning
english chinese
AGIEval is a human-centric benchmark specifically designed to evaluate the
general abilities of foundation models in tasks pertinent to human cognition and
problem-solving. This benchmark is derived from 20 official, public, and
high-standard admission and qualification exams intended for general human
test-takers, such as general college admission tests (e.g., Chinese College
Entrance Exam (Gaokao) and American SAT), law school admission tests, math
competitions, lawyer qualification tests, and national civil service exams.
nli reasoning
english
The Adversarial Natural Language Inference (ANLI) benchmark is a new large-scale
NLI dataset collected via an iterative, adversarial human-and-model-in-the-loop
procedure. ANLI is much more difficult than its predecessors, including SNLI and
MNLI. It contains three rounds, each with train/dev/test splits.
Arabic MMLU MBZUAI/ArabicMMLU
knowledge multilingual multiple-choice
arabic
Arabic MMLU multilingual benchmark.
multiple-choice
english
7,787 genuine grade-school level, multiple-choice science questions, assembled
to encourage research in advanced question-answering. The dataset is partitioned
into a Challenge Set and an Easy Set, where the former contains only questions
answered incorrectly by both a retrieval-based algorithm and a word
co-occurrence algorithm.
multilingual multiple-choice qa reasoning
arabic
ARCD: Arabic Reading Comprehension Dataset.
math reasoning
english
A small battery of 10 tests that involve asking language models a simple
arithmetic problem in natural language.
math reasoning
english
ASDiv is a dataset for arithmetic reasoning that contains 2,000+ questions
covering addition, subtraction, multiplication, and division.
qa reasoning
english
The bAbI benchmark for measuring understanding and reasoning evaluates reading
comprehension via question answering.
bias multiple-choice qa
english
The Bias Benchmark for Question Answering (BBQ) measures social bias in question
answering in ambiguous and unambiguous contexts.
multilingual multiple-choice reading-comprehension
arabic armenian bengali cyrillic devanagari ethiopic georgian greek gujarati gurmukhi chinese (simplified) chinese (traditional) hangul hebrew japanese khmer kannada lao latin malayalam myanmar odia sinhala tamil telugu thai tibetan
Belebele: A large-scale reading comprehension dataset covering 122 languages.
reasoning
english
Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models.
166 tasks from the BIG-bench benchmark.
language-modeling
english
BLiMP is a challenge set for evaluating what language models (LMs) know
about major grammatical phenomena in English. BLiMP consists of 67
sub-datasets, each containing 1000 minimal pairs isolating specific
contrasts in syntax, morphology, or semantics. The data is automatically
generated according to expert-crafted grammars.
bias generation
english
The Bias in Open-Ended Language Generation Dataset (BOLD) for measuring biases
and toxicity in open-ended language generation.
multilingual multiple-choice reasoning
chinese
C3: A Chinese Challenge Corpus for cross-lingual and cross-modal tasks. A
reading comprehension task that is part of CLUE.
knowledge multilingual multiple-choice
chinese
CMMLU multilingual benchmark.
classification multilingual nli
chinese
Chinese NLI dataset based on the MNLI approach (machine translated).
CMRC 2018 clue/clue
multilingual qa
chinese
CMRC 2018: A span-extraction machine reading comprehension dataset for Chinese.
CommonsenseQA tau/commonsense_qa
commonsense multiple-choice qa
english
CommonsenseQA is a multiple-choice question answering dataset that requires
different types of commonsense knowledge to predict the correct answers. It
contains 12,102 questions with one correct answer and four distractor answers.
The dataset is provided in two major training/validation/testing set splits: the
"Random split", which is the main evaluation split, and the "Question token
split"; see the paper for details.
multilingual multiple-choice reasoning
assamese bengali gujarati hindi kannada malayalam marathi nepali oriya punjabi sanskrit sindhi tamil telugu urdu
IndicCOPA: COPA for Indic Languages. Paper: https://arxiv.org/pdf/2212.05409
IndicCOPA extends COPA to 15 Indic languages, providing a valuable resource for
evaluating common sense reasoning in these languages.
dialog qa
english
CoQA is a large-scale dataset for building Conversational Question Answering
systems. The goal of the CoQA challenge is to measure the ability of machines to
understand a text passage and answer a series of interconnected questions that
appear in a conversation.
dialog medical
english
The COVID-19 Dialogue dataset is a collection of 500+ dialogues between
doctors and patients during the COVID-19 pandemic.
math qa reasoning
english
The DROP dataset is a new question-answering dataset designed to evaluate the
ability of language models to answer complex questions that require reasoning
over multiple sentences.
Emotion Classification dair-ai/emotion
emotion classification multiple-choice
english
This task performs emotion classification, assigning text to one of six emotion
categories: sadness, joy, love, anger, fear, or surprise.
knowledge multilingual multiple-choice
portuguese
ENEM (Exame Nacional do Ensino Médio) is a standardized Brazilian national
secondary education examination. The exam is used both as a university admission
test and as a high school evaluation test.
classification ethics justice morality utilitarianism virtue
english
The Ethics benchmark for evaluating the ability of language models to reason about
ethical issues.
knowledge multilingual multiple-choice
albanian arabic bulgarian croatian french german hungarian italian lithuanian macedonian polish portuguese serbian spanish turkish vietnamese
Exams multilingual benchmark.
Flores200 facebook/flores
multilingual translation
arabic armenian bengali cyrillic devanagari ethiopic georgian greek gujarati gurmukhi chinese (simplified) chinese (traditional) hangul hebrew japanese khmer kannada lao latin malayalam myanmar odia sinhala tamil telugu thai tibetan
Flores200 multilingual benchmark.
knowledge multilingual multiple-choice
amharic arabic bengali chinese czech dutch english french german hebrew hindi indonesian italian japanese korean malay norwegian polish portuguese romanian russian serbian spanish swahili swedish tamil telugu thai turkish ukrainian urdu vietnamese yoruba zulu
Translated MMLU using both professional and non-professional translators.
Contains tags for cultural sensitivity.
classification language-understanding
english
The General Language Understanding Evaluation (GLUE) benchmark is a collection
of resources for training, evaluating, and analyzing natural language
understanding systems.
biology chemistry graduate-level multiple-choice physics qa reasoning science
english
GPQA is a dataset of 448 expert-written multiple-choice questions in biology,
physics, and chemistry, designed to test graduate-level reasoning. The questions
are extremely difficult—PhD-level experts score about 65%, skilled non-experts
34% (even with web access), and GPT-4 around 39%. GPQA aims to support research
on scalable oversight, helping humans evaluate and trust AI systems that may
exceed human expertise.
math reasoning
english
GSM-Plus is an adversarial extension of GSM8K that tests the robustness of LLMs'
mathematical reasoning by introducing varied perturbations to grade-school math
problems.
math reasoning
english
GSM8K is a dataset of 8,000+ high-quality grade-school math word problems that require multi-step arithmetic reasoning.
health medical multiple-choice qa
english spanish
HEAD-QA is a multi-choice HEAlthcare Dataset. The questions come from exams to
access a specialized position in the Spanish healthcare system, and are
challenging even for highly specialized humans. They are designed by the
Ministerio de Sanidad, Consumo y Bienestar Social, which also provides direct
access to the exams of the last 5 years.
HellaSwag Rowan/hellaswag
multiple-choice narrative reasoning
english
HellaSwag is a commonsense inference benchmark designed to challenge language
models with adversarially filtered multiple-choice questions.
multilingual multiple-choice reasoning
thai
Hellaswag Thai is a Thai adaptation of the HellaSwag task. As with the Turkish
version, there is no specific paper, but it has been found to be effective for
evaluating Thai language models on commonsense reasoning tasks.
multilingual multiple-choice reasoning
turkish
Hellaswag Turkish is a Turkish adaptation of the HellaSwag task. While there is
no specific paper for this version, it has been found to work well for
evaluating Turkish language models on commonsense reasoning tasks. We do not
fold these adaptations into a single task because there are quite a few
differences (dataset/subset, dot replacement, etc.) that would make it hard to
read.
Hindi BoolQ ai4bharat/boolq-hi
classification multilingual qa
gujarati hindi malayalam marathi tamil
Hindi BoolQ multilingual benchmark.
classification
english
The IMDB benchmark for sentiment analysis of movie reviews, from: Learning Word
Vectors for Sentiment Analysis.
multilingual qa
assamese bengali gujarati hindi kannada malayalam marathi oriya punjabi tamil telugu
IndicQA: A reading comprehension dataset for 11 Indian languages.
language-modeling
english
LAMBADA is a benchmark for testing language models’ ability to understand broad
narrative context. Each passage requires predicting its final word—easy for
humans given the full passage but impossible from just the last sentence.
Success demands long-range discourse comprehension.
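As a rough illustration of how last-word prediction can be scored with a causal language model, here is a sketch using the Hugging Face `transformers` API; the model choice (gpt2), the example passage, and the greedy single-token comparison are assumptions made for this example, not how Lighteval itself implements the task.

```python
# Sketch: score a LAMBADA-style final-word prediction with a causal LM.
# Assumptions for illustration only: gpt2 as the model, greedy decoding,
# comparison on the first token of the gold word.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

passage = "He put the key in the lock, turned it, and slowly opened the"
target_word = " door"  # the final word the model must recover from context

inputs = tokenizer(passage, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, sequence_length, vocab_size)

predicted_id = int(logits[0, -1].argmax())          # greedy next-token guess
gold_id = tokenizer(target_word)["input_ids"][0]    # first token of the gold word
print(tokenizer.decode([predicted_id]).strip(), predicted_id == gold_id)
```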
classification legal
english
LexGLUE: A Benchmark Dataset for Legal Language Understanding in English
classification legal
bulgarian czech danish german greek english spanish estonian finnish french irish croatian hungarian italian lithuanian latvian maltese dutch polish portuguese romanian slovak slovenian swedish
LEXTREME: A Multi-Lingual and Multi-Task Benchmark for the Legal Domain
qa
english
LogiQA is a machine reading comprehension dataset focused on testing logical
reasoning abilities. It contains 8,678 expert-written multiple-choice questions
covering various types of deductive reasoning. While humans perform strongly,
state-of-the-art models lag far behind, making LogiQA a benchmark for advancing
logical reasoning in NLP systems.
knowledge multilingual multiple-choice
afrikaans chinese english italian javanese portuguese swahili thai vietnamese
M3Exam: Multitask Multilingual Multimodal Evaluation Benchmark. It also has a
multimodal version, which we do not support. Paper:
https://arxiv.org/abs/2306.05179
MathLogicQA Rus ai-forever/MERA
math multilingual qa reasoning
russian
MathLogicQA is a dataset for evaluating mathematical reasoning in language
models. It consists of multiple-choice questions that require logical reasoning
and mathematical problem-solving. This Russian version is part of the MERA
(Multilingual Evaluation of Reasoning Abilities) benchmark.
math qa reasoning
english
A large-scale dataset of math word problems, gathered by using a new
representation language to annotate the AQuA-RAT dataset with fully specified
operational programs. AQuA-RAT provides the questions, options, rationales, and
correct answers.
math multilingual reasoning
bengali chinese english french german japanese russian spanish swahili telugu thai
MGSM multilingual benchmark.
math multilingual reasoning
english spanish french german russian chinese japanese thai swahili bengali telugu
Multilingual Grade School Math Benchmark (MGSM) is a benchmark of grade-school
math problems.
The same 250 problems from GSM8K are each translated by human annotators into 10
languages.
knowledge multilingual qa
arabic english french german hindi italian japanese portuguese spanish
Mintaka multilingual benchmark.
multilingual qa
arabic chinese chinese_hong_kong chinese_traditional danish dutch english finnish french german hebrew hungarian italian japanese khmer korean malay norwegian polish portuguese russian spanish swedish thai turkish vietnamese
MKQA multilingual benchmark.
MLMM ARC Challenge jon-tow/okapi_arc_challenge
multilingual multiple-choice reasoning
arabic bengali catalan chinese croatian danish dutch french german hindi hungarian indonesian italian kannada malayalam marathi nepali romanian russian serbian slovak spanish tamil telugu ukrainian vietnamese
ARC (AI2 Reasoning Challenge) is a dataset for question answering that requires
reasoning. It consists of multiple-choice science questions from 3rd to 9th
grade exams. The dataset is split into two parts: ARC-Easy and ARC-Challenge.
ARC-Easy contains questions that can be answered correctly by both humans and
simple baseline models. ARC-Challenge contains questions that are difficult for
both humans and current AI systems. Similar to MMLU, ARC tasks use PMI
normalization by default, but only for the challenge set (see the sketch below).
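Since PMI normalization comes up here (and again for XCSQA below), a brief sketch of the idea: each answer option is scored by its log-probability given the question minus its log-probability given a neutral, question-free prompt, which down-weights options that are simply common strings. The options and log-probabilities below are invented numbers for illustration.

```python
# Sketch of PMI-style scoring for one multiple-choice question:
#   score(option) = log P(option | question) - log P(option | neutral prompt)
# Options that are likely regardless of the question get penalised.
# All log-probabilities below are made up for illustration.
log_p_given_question = {"gravity": -2.1, "magic": -6.0, "water": -2.0}
log_p_unconditional = {"gravity": -5.0, "magic": -7.0, "water": -2.5}

pmi_scores = {
    option: log_p_given_question[option] - log_p_unconditional[option]
    for option in log_p_given_question
}
best = max(pmi_scores, key=pmi_scores.get)
# "gravity" wins under PMI even though "water" has the highest raw likelihood.
print(pmi_scores, "->", best)
```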
multilingual multiple-choice reasoning
arabic armenian basque bengali catalan chinese croatian danish dutch french german gujarati hindi hungarian icelandic indonesian italian kannada malayalam marathi nepali norwegian portuguese romanian russian serbian slovak spanish swedish tamil telugu ukrainian vietnamese
Hellaswag is a commonsense reasoning task that requires models to complete a
given scenario with the most plausible ending. It tests the model's ability to
understand and reason about everyday situations and human behavior.
MLMM-Hellaswag: Multilingual adaptation of Hellaswag
knowledge multilingual multiple-choice
arabic bengali catalan chinese croatian danish dutch french german hindi hungarian indonesian italian kannada malayalam marathi nepali romanian russian serbian slovak spanish tamil telugu ukrainian vietnamese
MLMM MMLU: Another multilingual version of MMLU
MLMM TruthfulQA jon-tow/okapi_truthfulqa
factuality multilingual qa
arabic armenian basque bengali catalan chinese croatian danish dutch french german gujarati hindi hungarian icelandic indonesian italian kannada malayalam marathi nepali norwegian portuguese romanian russian serbian slovak spanish swedish tamil telugu ukrainian vietnamese
TruthfulQA: Measuring How Models Mimic Human Falsehoods
multilingual qa
arabic chinese german hindi spanish vietnamese
MLQA (MultiLingual Question Answering) is a benchmark dataset for evaluating
cross-lingual question answering performance. It consists of QA instances in 7
languages: English, Arabic, German, Spanish, Hindi, Vietnamese, and Chinese. The
dataset is derived from the SQuAD v1.1 dataset, with questions and contexts
translated by professional translators.
general-knowledge knowledge multiple-choice
english
MMLU is a benchmark of general knowledge and English language understanding.
conversational generation multi-turn
english
MT-Bench is a multi-turn conversational benchmark for evaluating language
models. It consists of 80 high-quality multi-turn questions across 8 common
categories (writing, roleplay, reasoning, math, coding, extraction, STEM,
humanities). Model responses are evaluated by a judge LLM.
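A minimal sketch of the judge step described above, assuming an OpenAI-style chat client; the prompt wording, the judge model name, and the 1-10 scale are illustrative stand-ins rather than MT-Bench's exact judge template.

```python
# Sketch of LLM-as-judge grading in the spirit of MT-Bench. The prompt text,
# judge model, and rating format below are assumptions for illustration.
import re
from openai import OpenAI  # assumes the OpenAI Python client is installed and configured

client = OpenAI()

def judge_answer(question: str, answer: str, judge_model: str = "gpt-4o") -> int:
    prompt = (
        "Rate the assistant's answer to the user's question on a scale of 1 to 10.\n"
        f"Question: {question}\nAnswer: {answer}\n"
        "Reply in the form 'Rating: [[x]]'."
    )
    reply = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content
    match = re.search(r"\[\[(\d+)\]\]", reply)
    return int(match.group(1)) if match else -1  # -1 if the rating could not be parsed
```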
qa reading-comprehension
english
NarrativeQA is a reading comprehension benchmark that tests deep understanding
of full narratives—books and movie scripts—rather than shallow text matching. To
answer its questions, models must integrate information across entire stories.
general-knowledge qa
english
This dataset is a collection of question-answer pairs from the Natural Questions
dataset. See Natural Questions for additional information. This dataset can be
used directly with Sentence Transformers to train embedding models.
math reasoning
english
Numeracy is a benchmark for evaluating the ability of language models to reason about mathematics.
knowledge multilingual multiple-choice
portuguese
OAB Exams: a collection of questions from the Brazilian Bar Association exam.
The exam is required for anyone who wants to practice law in Brazil.
OCNLI clue/clue
classification multilingual nli
chinese
Natively collected Chinese NLI dataset.
OlympiadBench Hothan/OlympiadBench
math reasoning language
english chinese
OlympiadBench is a benchmark for evaluating the performance of language models
on olympiad problems.
OpenAI MMLU openai/MMMLU
knowledge multilingual multiple-choice
arabic bengali chinese french german hindi indonesian italian japanese korean portuguese spanish swahili yoruba
OpenAI MMLU multilingual benchmark.
multilingual multiple-choice reasoning
arabic
OpenBookQA: A Question-Answering Dataset for Open-Book Exams. OpenBookQA is a
question-answering dataset modeled after open-book exams for assessing human
understanding of a subject. It consists of multiple-choice questions that
require combining facts from a given open book with broad common knowledge. The
task tests language models' ability to leverage provided information and apply
common sense reasoning.
multilingual multiple-choice reasoning
spanish
Spanish version of OpenBookQA from BSC Language Technology group
Openbook Rus ai-forever/MERA
multilingual multiple-choice reasoning
russian
The Russian version is part of the MERA (Multilingual Enhanced Russian NLP
Architectures) project.
multiple-choice qa
english
OpenBookQA is a question-answering dataset modeled after open-book exams for
assessing human understanding of a subject. It contains multiple-choice
questions that require combining facts from a given open book with broad common
knowledge. The task tests language models' ability to leverage provided
information and apply common sense reasoning.
Oz Serbian Evals DjMel/oz-eval
knowledge multiple-choice
serbian
The OZ Eval (Serbian: Opšte Znanje Evaluacija, "general knowledge evaluation")
dataset was created to evaluate the general knowledge of LLMs in Serbian. The
data consists of 1k+ high-quality questions and answers that were used as part
of entry exams at the Faculty of Philosophy and the Faculty of Organizational
Sciences, University of Belgrade. These exams test students' general knowledge
and were used in enrollment periods from 2003 to 2024.
multilingual
russian
PARus: Plausible Alternatives for Russian PARus is the Russian adaptation of the
COPA task, part of the Russian SuperGLUE benchmark. It evaluates common sense
reasoning and causal inference abilities in Russian language models.
classification multilingual nli
chinese english french german japanese korean spanish
PAWS-X: A Cross-lingual Adversarial Dataset for Paraphrase Identification. This
dataset contains paraphrase identification pairs in multiple languages. It is
derived from PAWS (Paraphrase Adversaries from Word Scrambling). We treat
paraphrase as entailment and non-paraphrase as contradiction.
commonsense multiple-choice qa
english
PIQA is a benchmark for testing physical commonsense reasoning; its questions
require reasoning about everyday physical interactions.
multilingual multiple-choice qa reasoning
arabic
PIQA: Physical Interaction Question Answering. PIQA is a benchmark for testing
physical commonsense reasoning. This Arabic version is a translation of the
original PIQA dataset, adapted for Arabic-language evaluation. It tests the
ability to reason about physical interactions in everyday situations.
reasoning qa physical-commonsense
english
PROST is a benchmark for testing physical reasoning about objects through space
and time. It includes 18,736 multiple-choice questions covering 10 core physics
concepts, designed to probe models in zero-shot settings. Results show that even
large pretrained models struggle with physical reasoning and are sensitive to
question phrasing, underscoring their limited real-world understanding.
PubMedQA pubmed_qa
biomedical health medical qa
english
PubMedQA is a dataset for biomedical research question answering.
QA4MRE qa4mre
biomedical health qa
english
QA4MRE is a machine reading comprehension benchmark from the CLEF 2011-2013
challenges. It evaluates systems' ability to answer questions requiring deep
understanding of short texts, supported by external background knowledge.
Covering tasks like modality, negation, biomedical reading, and entrance exams,
QA4MRE tests reasoning beyond surface-level text matching.
qa scientific
english
QASPER is a dataset for question answering on scientific research papers. It
consists of 5,049 questions over 1,585 Natural Language Processing papers. Each
question is written by an NLP practitioner who read only the title and abstract
of the corresponding paper, and the question seeks information present in the
full text. The questions are then answered by a separate set of NLP
practitioners who also provide supporting evidence to answers.
RACE High EleutherAI/race
multiple-choice reading-comprehension
english
RACE is a large-scale reading comprehension dataset with more than 28,000
passages and nearly 100,000 questions. The dataset is collected from English
examinations in China, which are designed for middle school and high school
students. The dataset can serve as training and test sets for machine
comprehension.
classification reasoning
english
The Real-world annotated few-shot (RAFT) meta-benchmark of 11 real-world text
classification tasks.
classification multilingual nli
russian
Russian Commitment Bank (RCB) is a large-scale NLI dataset with Russian
sentences, collected from the web and crowdsourcing.
physics chemistry biology reasoning multiple-choice qa
english
The SciQ dataset contains 13,679 crowdsourced science exam questions about
Physics, Chemistry and Biology, among others. The questions are in
multiple-choice format with 4 answer options each. For the majority of the
questions, an additional paragraph with supporting evidence for the correct
answer is provided.
knowledge multiple-choice
serbian
The tasks cover a variety of benchmarks, including standard tasks like ARC (Easy
and Challenge), BoolQ, HellaSwag, OpenBookQA, PIQA, and Winogrande, plus the
custom OZ Eval. MMLU is provided both separated by subject and all in one.
commonsense multiple-choice qa
english
We introduce Social IQa: Social Interaction QA, a new question-answering
benchmark for testing social commonsense intelligence. Contrary to many prior
benchmarks that focus on physical or taxonomic knowledge, Social IQa focuses on
reasoning about people's actions and their social implications. For example,
given an action like "Jesse saw a concert" and a question like "Why did Jesse do
this?", humans can easily infer that Jesse wanted "to see their favorite
performer" or "to enjoy the music", and not "to see what's happening inside" or
"to see if it works". The actions in Social IQa span a wide variety of social
situations, and answer candidates contain both human-curated answers and
adversarially-filtered machine-generated candidates. Social IQa contains over
37,000 QA pairs for evaluating models' abilities to reason about the social
implications of everyday events and situations.
reasoning symbolic
english
SLR-Bench is a large-scale benchmark for scalable logical reasoning with
language models, comprising 19,000 prompts organized into 20 curriculum levels.
qa
english
Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset,
consisting of questions posed by crowdworkers on a set of Wikipedia articles,
where the answer to every question is a segment of text, or span, from the
corresponding reading passage, or the question might be unanswerable.
SQuAD 2.0 combines the 100,000 questions in SQuAD 1.1 with over 50,000
unanswerable questions written adversarially by crowdworkers to look similar to
answerable ones. To do well on SQuAD 2.0, systems must not only answer questions
when possible, but also determine when no answer is supported by the paragraph
and abstain from answering.
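A minimal sketch of exact-match scoring with abstention in the SQuAD 2.0 spirit: both strings are normalized (lowercased, punctuation and articles stripped) before comparison, and an empty prediction acts as the "abstain" signal for unanswerable questions. The normalization here is a simplified stand-in for the official evaluation script.

```python
# Simplified SQuAD-style scoring: normalise both strings, then exact match.
# An empty gold answer marks an unanswerable question; the system should
# "abstain" by predicting an empty string. This is a simplified sketch, not
# the official SQuAD 2.0 evaluation script.
import re
import string

def normalize(text: str) -> str:
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold: str) -> bool:
    return normalize(prediction) == normalize(gold)

print(exact_match("The Eiffel Tower!", "eiffel tower"))  # True
print(exact_match("Paris", ""))                          # False: the system should have abstained
```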
narrative reasoning
english
A Corpus and Cloze Evaluation for Deeper Understanding of
Commonsense Stories
summarization
english
From "Don't Give Me the Details, Just the Summary! Topic-Aware Convolutional
Neural Networks for Extreme Summarization" and "Abstractive Text Summarization
using Sequence-to-sequence RNNs and Beyond".
narrative reasoning
english
The dataset consists of 113k multiple choice questions about grounded situations
(73k training, 20k validation, 20k test). Each question is a video caption from
LSMDC or ActivityNet Captions, with four answer choices about what might happen
next in the scene. The correct answer is the (real) video caption for the next
event in the video; the three incorrect answers are adversarially generated and
human verified, so as to fool machines but not humans. SWAG aims to be a
benchmark for evaluating grounded commonsense NLI and for learning
representations.
Swahili ARC
multilingual multiple-choice reasoning
swahili
Swahili ARC multilingual benchmark.
Thai Exams scb10x/thai_exam
knowledge multilingual multiple-choice
thai
Thai Exams multilingual benchmark.
language-modeling
english
The Pile corpus for measuring language model performance across various domains.
generation safety
english
This dataset is for implicit hate speech detection. All instances were generated
using GPT-3 and the methods described in our paper.
multilingual qa
turkish
TQuAD v2: Turkish Question Answering Dataset version 2.
qa
english
TriviaQA is a reading comprehension dataset containing over 650K
question-answer-evidence triples. TriviaQA includes 95K question-answer pairs
authored by trivia enthusiasts and independently gathered evidence documents,
six per question on average, that provide high quality distant supervision for
answering the questions.
knowledge multiple-choice
turkic
TUMLU-mini is a benchmark for Turkic language understanding, comprising 1,000
prompts organized into 10 subsets.
Turkish ARC malhajar/arc-tr
multilingual multiple-choice reasoning
turkish
Turkish ARC, from the Turkish leaderboard.
language-modeling
english
Demographic Dialectal Variation in Social Media: A Case Study of African-American English
multilingual qa
arabic bengali english finnish indonesian japanese korean russian swahili telugu thai
TyDi QA: A benchmark for information-seeking question answering in typologically diverse languages. Paper: https://arxiv.org/abs/2003.05002
qa
english
This dataset consists of 6,642 question/answer pairs. The questions are supposed
to be answerable by Freebase, a large knowledge graph. The questions are mostly
centered around a single named entity. The questions are popular ones asked on
the web.
language-modeling
english
The WikiText language modeling dataset is a collection of over 100 million
tokens extracted from the set of verified Good and Featured articles on
Wikipedia. The dataset is available under the Creative Commons
Attribution-ShareAlike License.
commonsense multiple-choice
english
WinoGrande is a new collection of 44k problems, inspired by Winograd Schema
Challenge (Levesque, Davis, and Morgenstern 2011), but adjusted to improve the
scale and robustness against the dataset-specific bias. Formulated as a
fill-in-a-blank task with binary options, the goal is to choose the right option
for a given sentence which requires commonsense reasoning.
WorldTree Rus ai-forever/MERA
multilingual
russian
WorldTree is a dataset for multi-hop inference in science question answering. It
provides explanations for elementary science questions by combining facts from a
semi-structured knowledge base. This Russian version is part of the MERA
(Multilingual Evaluation of Reasoning Abilities) benchmark.
multilingual multiple-choice reasoning
arabic chinese dutch english french german hindi italian japanese polish portuguese russian spanish swahili urdu vietnamese
XCODAH multilingual benchmark.
XCOPA
multilingual multiple-choice narrative reasoning
arabic chinese estonian haitian indonesian italian quechua swahili tamil thai turkish vietnamese
COPA (Choice of Plausible Alternatives) tasks involve determining the most
plausible cause or effect for a given premise. These tasks test common sense
reasoning and causal inference abilities. XCOPA: Cross-lingual Choice of
Plausible Alternatives.
commonsense multilingual multiple-choice reasoning
english
XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning The Cross-lingual
Choice of Plausible Alternatives dataset is a benchmark to evaluate the ability
of machine learning models to transfer commonsense reasoning across languages.
multilingual multiple-choice qa reasoning
arabic chinese dutch english french german hindi italian japanese polish portuguese russian spanish swahili urdu vietnamese
XCSQA (Cross-lingual Commonsense QA) is part of the XCSR (Cross-lingual
Commonsense Reasoning) benchmark. It is a multilingual extension of the
CommonsenseQA dataset, covering 16 languages. The task involves answering
multiple-choice questions that require commonsense reasoning. Uses PMI
normalization.
classification multilingual nli
arabic bulgarian chinese english french german greek hindi russian spanish swahili thai turkish urdu vietnamese
NLI (Natural Language Inference) tasks involve determining the logical
relationship between two given sentences: a premise and a hypothesis. The goal
is to classify whether the hypothesis is entailed by, contradicts, or is neutral
with respect to the premise. After our inspection we found the neutral label to
be quite ambiguous and decided to exclude it, but you can easily add it back by
modifying the adapters. The XNLI dataset is a multilingual variant of MultiNLI.
classification multilingual nli
assamese bengali gujarati hindi kannada malayalam marathi oriya punjabi tamil telugu
Another variant of XNLI, with emphasis on Indic languages.
XNLI 2.0
classification multilingual nli
arabic assamese bengali bulgarian chinese english french german greek gujarati hindi kannada marathi punjabi russian sanskrit spanish swahili tamil thai turkish urdu vietnamese
An improvement on XNLI with better translations; in our experience, models tend
to perform better on XNLI 2.0 than on XNLI.
multilingual qa
arabic chinese english german greek hindi romanian russian spanish thai turkish vietnamese
Reading Comprehension (RC) tasks evaluate a model's ability to understand and
extract information from text passages. These tasks typically involve answering
questions based on given contexts, spanning multiple languages and formats;
together, the RC tasks cover about 130 unique languages/scripts. XQuAD
(Cross-lingual Question Answering Dataset) is a SQuAD-like benchmark extending
SQuAD to 11 languages.
multilingual narrative
arabic basque burmese chinese hindi indonesian russian spanish swahili telugu
XStoryCloze multilingual benchmark.
multilingual narrative reasoning
english russian chinese spanish arabic hindi indonesian telugu swahili basque burmese
XStoryCloze consists of the professionally translated version of the English
StoryCloze dataset (Spring 2016 version) to 10 non-English languages. This
dataset is released by Meta AI.
multilingual multiple-choice reasoning
chinese english french japanese portuguese russian
XWinograd multilingual benchmark.
commonsense multilingual reasoning
english french japanese portuguese russian chinese
Multilingual winograd schema challenge as used in Crosslingual Generalization through Multitask Finetuning.