Lighteval Tasks Explorer

Browse tasks by language or tag, and search the task descriptions.

177 tasks

⭐ Recommended benchmarks are marked with a star icon.
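Each entry below ends with the task name(s) to pass to the lighteval CLI. As a minimal, version-dependent sketch (the subcommand, the "model_name=" key, the placeholder model, and the suite|task|fewshot|truncation task-string format are assumptions; check the lighteval docs for your installed release), running one of the listed tasks looks roughly like this:

# Minimal sketch of running a listed task; adapt to your lighteval version
# (older releases use "pretrained=..." instead of "model_name=..." and may
# not need the trailing truncation field).
pip install lighteval

lighteval accelerate \
  "model_name=HuggingFaceTB/SmolLM2-1.7B-Instruct" \
  "lighteval|gsm8k|0|0"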


math reasoning
english
The American Invitational Mathematics Examination (AIME) is a prestigious,
invite-only mathematics competition for high-school students who perform in the
top 5% of the AMC 12 mathematics exam. It involves 15 questions of increasing
difficulty, with...
Run using lighteval:
aime24 aime24_gpassk aime25
aime25_gpassk
multiple-choice
english
ARC-AGI tasks are a series of three to five input and output tasks followed by a
final task with only the input listed. Each task tests the utilization of a
specific learned skill based on a minimal number of cognitive priors.
In their native form,...
Run using lighteval:
arc_agi_2
biology chemistry graduate-level multiple-choice physics qa reasoning science
english
GPQA is a dataset of 448 expert-written multiple-choice questions in biology,
physics, and chemistry, designed to test graduate-level reasoning. The questions
are extremely difficult—PhD-level experts score about 65%, skilled non-experts
34% (even...
Run using lighteval:
gpqa
⭐ Humanity's Last Exam cais/hle
qa reasoning general-knowledge
english
Humanity's Last Exam (HLE) is a global collaborative effort, with questions from
nearly 1,000 subject expert contributors affiliated with over 500 institutions
across 50 countries - comprised mostly of professors, researchers, and graduate
degree...
Run using lighteval:
hle
⭐ IFEval google/IFEval
instruction-following
english
A task with no single reference output; instead, we test whether the model's response follows the formatting rules given in the instructions.
Run using lighteval:
ifeval
⭐ LiveCodeBench lighteval/code_generation_lite
code-generation
english
LiveCodeBench collects problems from periodic contests on LeetCode, AtCoder, and
Codeforces platforms and uses them for constructing a holistic benchmark for
evaluating Code LLMs across a variety of code-related scenarios continuously over
time.
Run using lighteval:
lcb
math reasoning
english
This dataset contains a subset of 500 problems from the MATH benchmark that
OpenAI created in their Let's Verify Step by Step paper.
Run using lighteval:
math_500
⭐ MixEval MixEval/MixEval
general-knowledge reasoning qa
english
Ground-truth-based dynamic benchmark derived from off-the-shelf benchmark
mixtures, which evaluates LLMs with a highly capable model ranking (i.e., 0.96
correlation with Chatbot Arena) while running locally and quickly (6% the time
and cost of...
Run using lighteval:
mixeval_easy mixeval_hard
⭐ MMLU-Pro TIGER-Lab/MMLU-Pro
general-knowledge knowledge multiple-choice
english
MMLU-Pro dataset is a more robust and challenging massive multi-task
understanding dataset tailored to more rigorously benchmark large language
models' capabilities. This dataset contains 12K complex questions across various
disciplines.
Run using lighteval:
mmlu_pro
⭐ MuSR TAUR-Lab/MuSR
long-context multiple-choice reasoning
english
MuSR is a benchmark for evaluating multistep reasoning in natural language
narratives. Built using a neurosymbolic synthetic-to-natural generation process,
it features complex, realistic tasks—such as long-form murder mysteries.
Run using lighteval:
musr
⭐ SimpleQA lighteval/SimpleQA
factuality general-knowledge qa
english
A factuality benchmark called SimpleQA that measures the ability of language
models to answer short, fact-seeking questions.
Run using lighteval:
simpleqa
knowledge multilingual multiple-choice
arabic
Acva multilingual benchmark.
Run using lighteval:
acva_ara
math multilingual reasoning
amharic ewe french hausa igbo kinyarwanda lingala luganda oromo shona sotho swahili twi wolof xhosa yoruba zulu
African MGSM: MGSM for African Languages
Run using lighteval:
afri_mgsm_amh afri_mgsm_fra afri_mgsm_swa
afri_mgsm_yor
classification multilingual nli
amharic ewe french hausa igbo kinyarwanda lingala luganda oromo shona sotho swahili twi wolof xhosa yoruba zulu
African XNLI: XNLI for African languages.
Run using lighteval:
afri_xnli_amh_mcf afri_xnli_amh_cf afri_xnli_amh_hybrid
afri_xnli_fra_mcf afri_xnli_fra_cf afri_xnli_fra_hybrid afri_xnli_swa_mcf afri_xnli_swa_cf afri_xnli_swa_hybrid afri_xnli_yor_mcf afri_xnli_yor_cf afri_xnli_yor_hybrid
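Many multilingual entries, like African XNLI above, list three variants per language. Judging by the naming (an assumption, not stated on this page), the _mcf, _cf and _hybrid suffixes select the prompt formulation: multiple-choice letters, cloze-style continuation, or a hybrid of the two. Picking a variant just means passing the full task name, for example:

# Hypothetical run of the French African-XNLI split in its multiple-choice
# (_mcf) formulation, zero-shot. The suite prefix, trailing fields and model
# name are assumptions; adapt them to your lighteval version.
lighteval accelerate \
  "model_name=HuggingFaceTB/SmolLM2-1.7B-Instruct" \
  "lighteval|afri_xnli_fra_mcf|0|0"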
biology chemistry geography history knowledge language multiple-choice physics reasoning
english chinese
AGIEval is a human-centric benchmark specifically designed to evaluate the
general abilities of foundation models in tasks pertinent to human cognition and
problem-solving. This benchmark is derived from 20 official, public, and
high-standard...
Run using lighteval:
agieval
nli reasoning
english
Adversarial Natural Language Inference (ANLI) is a large-scale NLI benchmark dataset collected via an iterative, adversarial human-and-model-in-the-loop procedure. ANLI is much more difficult than its predecessors, including...
Run using lighteval:
anli
Arabic Mmlu MBZUAI/ArabicMMLU
knowledge multilingual multiple-choice
arabic
Arabic Mmlu multilingual benchmark.
Run using lighteval:
mmlu_ara_mcf mmlu_ara_cf mmlu_ara_hybrid
multiple-choice
english
7,787 genuine grade-school level, multiple-choice science questions, assembled
to encourage research in advanced question-answering. The dataset is partitioned
into a Challenge Set and an Easy Set, where the former contains only questions
answered...
Run using lighteval:
arc
multilingual multiple-choice qa reasoning
arabic
ARCD: Arabic Reading Comprehension Dataset.
Run using lighteval:
arcd_ara
math reasoning
english
A small battery of 10 tests that involve asking language models a simple
arithmetic problem in natural language.
Run using lighteval:
arithmetic
math reasoning
english
ASDiv is a dataset for arithmetic reasoning that contains 2,000+ questions
covering addition, subtraction, multiplication, and division.
Run using lighteval:
asdiv
qa reasoning
english
The bAbI benchmark for measuring understanding and reasoning evaluates reading comprehension via question answering.
Run using lighteval:
babi_qa
bias multiple-choice qa
english
The Bias Benchmark for Question Answering (BBQ) measures social bias in question answering in ambiguous and unambiguous contexts.
Run using lighteval:
bbq
multilingual multiple-choice reading-comprehension
arabic armenian bengali cyrillic devanagari ethiopic georgian greek gujarati gurmukhi chinese (simplified) chinese (traditional) hangul hebrew japanese khmer kannada lao latin malayalam myanmar odia sinhala tamil telugu thai tibetan
Belebele: A large-scale reading comprehension dataset covering 122 languages.
Run using lighteval:
belebele_acm_Arab_mcf belebele_arz_Arab_mcf belebele_ceb_Latn_mcf
belebele_fin_Latn_mcf belebele_hin_Deva_mcf belebele_ita_Latn_mcf belebele_khm_Khmr_mcf belebele_lvs_Latn_mcf belebele_npi_Deva_mcf belebele_pol_Latn_mcf belebele_slv_Latn_mcf belebele_swe_Latn_mcf belebele_afr_Latn_mcf belebele_asm_Beng_mcf belebele_ces_Latn_mcf belebele_fra_Latn_mcf belebele_hin_Latn_mcf belebele_jav_Latn_mcf belebele_mal_Mlym_mcf belebele_npi_Latn_mcf belebele_por_Latn_mcf belebele_swh_Latn_mcf belebele_tur_Latn_mcf belebele_yor_Latn_mcf belebele_als_Latn_mcf belebele_azj_Latn_mcf belebele_ckb_Arab_mcf belebele_hrv_Latn_mcf belebele_jpn_Jpan_mcf belebele_kir_Cyrl_mcf belebele_mar_Deva_mcf belebele_snd_Arab_mcf belebele_tam_Taml_mcf belebele_ukr_Cyrl_mcf belebele_zho_Hans_mcf belebele_amh_Ethi_mcf belebele_dan_Latn_mcf belebele_hun_Latn_mcf belebele_kor_Hang_mcf belebele_mkd_Cyrl_mcf belebele_ron_Latn_mcf belebele_som_Latn_mcf belebele_tel_Telu_mcf belebele_urd_Arab_mcf belebele_zho_Hant_mcf belebele_apc_Arab_mcf belebele_ben_Beng_mcf belebele_deu_Latn_mcf belebele_hye_Armn_mcf belebele_kan_Knda_mcf belebele_lao_Laoo_mcf belebele_mlt_Latn_mcf belebele_ory_Orya_mcf belebele_rus_Cyrl_mcf belebele_tgk_Cyrl_mcf belebele_urd_Latn_mcf belebele_zsm_Latn_mcf belebele_arb_Arab_mcf belebele_ben_Latn_mcf belebele_ell_Grek_mcf belebele_guj_Gujr_mcf belebele_kat_Geor_mcf belebele_pan_Guru_mcf belebele_spa_Latn_mcf belebele_tgl_Latn_mcf belebele_uzn_Latn_mcf belebele_arb_Latn_mcf belebele_eng_Latn_mcf belebele_kaz_Cyrl_mcf belebele_lit_Latn_mcf belebele_mya_Mymr_mcf belebele_pbt_Arab_mcf belebele_sin_Latn_mcf belebele_srp_Cyrl_mcf belebele_tha_Thai_mcf belebele_vie_Latn_mcf belebele_ars_Arab_mcf belebele_bul_Cyrl_mcf belebele_est_Latn_mcf belebele_ind_Latn_mcf belebele_nld_Latn_mcf belebele_pes_Arab_mcf belebele_sin_Sinh_mcf belebele_war_Latn_mcf belebele_ary_Arab_mcf belebele_cat_Latn_mcf belebele_eus_Latn_mcf belebele_heb_Hebr_mcf belebele_isl_Latn_mcf belebele_nob_Latn_mcf belebele_plt_Latn_mcf belebele_slk_Latn_mcf belebele_acm_Arab_cf belebele_arz_Arab_cf belebele_ceb_Latn_cf belebele_fin_Latn_cf belebele_hin_Deva_cf belebele_ita_Latn_cf belebele_khm_Khmr_cf belebele_lvs_Latn_cf belebele_npi_Deva_cf belebele_pol_Latn_cf belebele_slv_Latn_cf belebele_swe_Latn_cf belebele_afr_Latn_cf belebele_asm_Beng_cf belebele_ces_Latn_cf belebele_fra_Latn_cf belebele_hin_Latn_cf belebele_jav_Latn_cf belebele_mal_Mlym_cf belebele_npi_Latn_cf belebele_por_Latn_cf belebele_swh_Latn_cf belebele_tur_Latn_cf belebele_yor_Latn_cf belebele_als_Latn_cf belebele_azj_Latn_cf belebele_ckb_Arab_cf belebele_hrv_Latn_cf belebele_jpn_Jpan_cf belebele_kir_Cyrl_cf belebele_mar_Deva_cf belebele_snd_Arab_cf belebele_tam_Taml_cf belebele_ukr_Cyrl_cf belebele_zho_Hans_cf belebele_amh_Ethi_cf belebele_dan_Latn_cf belebele_hun_Latn_cf belebele_kor_Hang_cf belebele_mkd_Cyrl_cf belebele_ron_Latn_cf belebele_som_Latn_cf belebele_tel_Telu_cf belebele_urd_Arab_cf belebele_zho_Hant_cf belebele_apc_Arab_cf belebele_ben_Beng_cf belebele_deu_Latn_cf belebele_hye_Armn_cf belebele_kan_Knda_cf belebele_lao_Laoo_cf belebele_mlt_Latn_cf belebele_ory_Orya_cf belebele_rus_Cyrl_cf belebele_tgk_Cyrl_cf belebele_urd_Latn_cf belebele_zsm_Latn_cf belebele_arb_Arab_cf belebele_ben_Latn_cf belebele_ell_Grek_cf belebele_guj_Gujr_cf belebele_kat_Geor_cf belebele_pan_Guru_cf belebele_spa_Latn_cf belebele_tgl_Latn_cf belebele_uzn_Latn_cf belebele_arb_Latn_cf belebele_eng_Latn_cf belebele_kaz_Cyrl_cf belebele_lit_Latn_cf belebele_mya_Mymr_cf belebele_pbt_Arab_cf belebele_sin_Latn_cf belebele_srp_Cyrl_cf belebele_tha_Thai_cf belebele_vie_Latn_cf 
belebele_ars_Arab_cf belebele_bul_Cyrl_cf belebele_est_Latn_cf belebele_ind_Latn_cf belebele_nld_Latn_cf belebele_pes_Arab_cf belebele_sin_Sinh_cf belebele_war_Latn_cf belebele_ary_Arab_cf belebele_cat_Latn_cf belebele_eus_Latn_cf belebele_heb_Hebr_cf belebele_isl_Latn_cf belebele_nob_Latn_cf belebele_plt_Latn_cf belebele_slk_Latn_cf belebele_acm_Arab_hybrid belebele_arz_Arab_hybrid belebele_ceb_Latn_hybrid belebele_fin_Latn_hybrid belebele_hin_Deva_hybrid belebele_ita_Latn_hybrid belebele_khm_Khmr_hybrid belebele_lvs_Latn_hybrid belebele_npi_Deva_hybrid belebele_pol_Latn_hybrid belebele_slv_Latn_hybrid belebele_swe_Latn_hybrid belebele_afr_Latn_hybrid belebele_asm_Beng_hybrid belebele_ces_Latn_hybrid belebele_fra_Latn_hybrid belebele_hin_Latn_hybrid belebele_jav_Latn_hybrid belebele_mal_Mlym_hybrid belebele_npi_Latn_hybrid belebele_por_Latn_hybrid belebele_swh_Latn_hybrid belebele_tur_Latn_hybrid belebele_yor_Latn_hybrid belebele_als_Latn_hybrid belebele_azj_Latn_hybrid belebele_ckb_Arab_hybrid belebele_hrv_Latn_hybrid belebele_jpn_Jpan_hybrid belebele_kir_Cyrl_hybrid belebele_mar_Deva_hybrid belebele_snd_Arab_hybrid belebele_tam_Taml_hybrid belebele_ukr_Cyrl_hybrid belebele_zho_Hans_hybrid belebele_amh_Ethi_hybrid belebele_dan_Latn_hybrid belebele_hun_Latn_hybrid belebele_kor_Hang_hybrid belebele_mkd_Cyrl_hybrid belebele_ron_Latn_hybrid belebele_som_Latn_hybrid belebele_tel_Telu_hybrid belebele_urd_Arab_hybrid belebele_zho_Hant_hybrid belebele_apc_Arab_hybrid belebele_ben_Beng_hybrid belebele_deu_Latn_hybrid belebele_hye_Armn_hybrid belebele_kan_Knda_hybrid belebele_lao_Laoo_hybrid belebele_mlt_Latn_hybrid belebele_ory_Orya_hybrid belebele_rus_Cyrl_hybrid belebele_tgk_Cyrl_hybrid belebele_urd_Latn_hybrid belebele_zsm_Latn_hybrid belebele_arb_Arab_hybrid belebele_ben_Latn_hybrid belebele_ell_Grek_hybrid belebele_guj_Gujr_hybrid belebele_kat_Geor_hybrid belebele_pan_Guru_hybrid belebele_spa_Latn_hybrid belebele_tgl_Latn_hybrid belebele_uzn_Latn_hybrid belebele_arb_Latn_hybrid belebele_eng_Latn_hybrid belebele_kaz_Cyrl_hybrid belebele_lit_Latn_hybrid belebele_mya_Mymr_hybrid belebele_pbt_Arab_hybrid belebele_sin_Latn_hybrid belebele_srp_Cyrl_hybrid belebele_tha_Thai_hybrid belebele_vie_Latn_hybrid belebele_ars_Arab_hybrid belebele_bul_Cyrl_hybrid belebele_est_Latn_hybrid belebele_ind_Latn_hybrid belebele_nld_Latn_hybrid belebele_pes_Arab_hybrid belebele_sin_Sinh_hybrid belebele_war_Latn_hybrid belebele_ary_Arab_hybrid belebele_cat_Latn_hybrid belebele_eus_Latn_hybrid belebele_heb_Hebr_hybrid belebele_isl_Latn_hybrid belebele_nob_Latn_hybrid belebele_plt_Latn_hybrid belebele_slk_Latn_hybrid
reasoning
english
Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models. 166 tasks from the BIG-bench benchmark.
Run using lighteval:
bigbench
language-modeling
english
BLiMP is a challenge set for evaluating what language models (LMs) know
about major grammatical phenomena in English. BLiMP consists of 67
sub-datasets, each containing 1000 minimal pairs isolating specific
contrasts in syntax, morphology, or...
Run using lighteval:
blimp
bias generation
english
The Bias in Open-Ended Language Generation Dataset (BOLD) for measuring biases
and toxicity in open-ended language generation.
Run using lighteval:
bold
qa
english
The BoolQ benchmark for binary (yes/no) question answering.
Run using lighteval:
boolq boolq:contrastset
multilingual multiple-choice reasoning
chinese
C3: A Chinese Challenge Corpus for Cross-lingual and Cross-modal Tasks. Reading comprehension task, part of CLUE.
Run using lighteval:
c3_zho_mcf c3_zho_cf c3_zho_hybrid
knowledge multilingual multiple-choice
chinese
Ceval multilingual benchmark.
Run using lighteval:
ceval_zho_mcf ceval_zho_cf ceval_zho_hybrid
knowledge multilingual qa
russian
Chegeka multilingual benchmark.
Run using lighteval:
chegeka_rus
multilingual qa
chinese
ChineseSquad is a reading comprehension dataset for Chinese.
Run using lighteval:
chinese_squad_zho
math multilingual reasoning
chinese
Cmath multilingual benchmark.
Run using lighteval:
cmath_zho
knowledge multilingual multiple-choice
chinese
Cmmlu multilingual benchmark.
Run using lighteval:
cmmlu_zho_mcf cmmlu_zho_cf cmmlu_zho_hybrid
classification multilingual nli
chinese
Native Chinese NLI dataset based on MNLI approach (Machine Translated)
Run using lighteval:
cmnli_zho_mcf cmnli_zho_cf cmnli_zho_hybrid
Cmrc2018 clue/clue
multilingual qa
chinese
CMRC 2018: A span-extraction machine reading comprehension dataset for Chinese.
Run using lighteval:
cmrc2018_zho
Commonsenseqa tau/commonsense_qa
commonsense multiple-choice qa
english
CommonsenseQA is a new multiple-choice question answering dataset that requires
different types of commonsense knowledge to predict the correct answers. It
contains 12,102 questions with one correct answer and four distractor answers.
Run using lighteval:
commonsenseqa
multilingual multiple-choice reasoning
assamese bengali gujarati hindi kannada malayalam marathi nepali oriya punjabi sanskrit sindhi tamil telugu urdu
IndicCOPA: COPA for Indic Languages Paper: https://arxiv.org/pdf/2212.05409
IndicCOPA extends COPA to 15 Indic languages, providing a valuable resource for
evaluating common sense reasoning in these languages.
Run using lighteval:
indicxcopa_asm_mcf indicxcopa_asm_cf indicxcopa_asm_hybrid
indicxcopa_ben_mcf indicxcopa_ben_cf indicxcopa_ben_hybrid indicxcopa_guj_mcf indicxcopa_guj_cf indicxcopa_guj_hybrid indicxcopa_hin_mcf indicxcopa_hin_cf indicxcopa_hin_hybrid indicxcopa_kan_mcf indicxcopa_kan_cf indicxcopa_kan_hybrid indicxcopa_mal_mcf indicxcopa_mal_cf indicxcopa_mal_hybrid indicxcopa_mar_mcf indicxcopa_mar_cf indicxcopa_mar_hybrid indicxcopa_nep_mcf indicxcopa_nep_cf indicxcopa_nep_hybrid indicxcopa_ori_mcf indicxcopa_ori_cf indicxcopa_ori_hybrid indicxcopa_pan_mcf indicxcopa_pan_cf indicxcopa_pan_hybrid indicxcopa_san_mcf indicxcopa_san_cf indicxcopa_san_hybrid indicxcopa_snd_mcf indicxcopa_snd_cf indicxcopa_snd_hybrid indicxcopa_tam_mcf indicxcopa_tam_cf indicxcopa_tam_hybrid indicxcopa_tel_mcf indicxcopa_tel_cf indicxcopa_tel_hybrid indicxcopa_urd_mcf indicxcopa_urd_cf indicxcopa_urd_hybrid
dialog qa
english
CoQA is a large-scale dataset for building Conversational Question Answering
systems. The goal of the CoQA challenge is to measure the ability of machines to
understand a text passage and answer a series of interconnected questions that
appear in a...
Run using lighteval:
coqa
dialog medical
english
The COVID-19 Dialogue dataset is a collection of 500+ dialogues between
doctors and patients during the COVID-19 pandemic.
Run using lighteval:
covid_dialogue
math qa reasoning
english
The DROP dataset is a new question-answering dataset designed to evaluate the
ability of language models to answer complex questions that require reasoning
over multiple sentences.
Run using lighteval:
drop
reasoning
english
Scenario testing hierarchical reasoning through the Dyck formal languages.
Run using lighteval:
dyck_language
Emotion Classification dair-ai/emotion
emotion classification multiple-choice
english
This task performs emotion classification, assigning text to one of six emotion categories: sadness, joy, love, anger, fear, or surprise.
Run using lighteval:
emotion_classification
knowledge multilingual multiple-choice
portuguese
ENEM (Exame Nacional do Ensino Médio) is a standardized Brazilian national
secondary education examination. The exam is used both as a university admission
test and as a high school evaluation test.
Run using lighteval:
enem_por_mcf enem_por_cf enem_por_hybrid
classification reasoning
english
Simple entity matching benchmark.
Run using lighteval:
entity_matching entity_matching=Fodors_Zagats
classification ethics justice morality utilitarianism virtue
english
The Ethics benchmark for evaluating the ability of language models to reason about
ethical issues.
Run using lighteval:
ethics
knowledge multilingual multiple-choice
albanian arabic bulgarian croatian french german hungarian italian lithuanian macedonian polish portuguese serbian spanish turkish vietnamese
Exams multilingual benchmark.
Run using lighteval:
exams_ara_mcf exams_ara_cf exams_ara_hybrid
exams_bul_mcf exams_bul_cf exams_bul_hybrid exams_hrv_mcf exams_hrv_cf exams_hrv_hybrid exams_hun_mcf exams_hun_cf exams_hun_hybrid exams_ita_mcf exams_ita_cf exams_ita_hybrid exams_srp_mcf exams_srp_cf exams_srp_hybrid exams_fra_mcf exams_fra_cf exams_fra_hybrid exams_deu_mcf exams_deu_cf exams_deu_hybrid exams_spa_mcf exams_spa_cf exams_spa_hybrid exams_lit_mcf exams_lit_cf exams_lit_hybrid exams_sqi_mcf exams_sqi_cf exams_sqi_hybrid exams_mkd_mcf exams_mkd_cf exams_mkd_hybrid exams_tur_mcf exams_tur_cf exams_tur_hybrid exams_pol_mcf:professional exams_pol_cf:professional exams_pol_hybrid:professional exams_por_mcf exams_por_cf exams_por_hybrid exams_vie_mcf exams_vie_cf exams_vie_hybrid
multilingual qa
portuguese
FaQuAD: A Portuguese Reading Comprehension Dataset
Run using lighteval:
faquad_por
Filipino Evals filbench/filbench-eval
knowledge multilingual multiple-choice
filipino
Collection of benchmarks for Filipino language.
Run using lighteval:
balita_tgl_mcf balita_tgl_hybrid belebele_fil_mcf
belebele_ceb_mcf belebele_fil_cf belebele_ceb_cf belebele_fil_hybrid belebele_ceb_hybrid cebuaner_ceb_mcf cebuaner_ceb_hybrid readability_ceb_mcf readability_ceb_hybrid dengue_filipino_fil firecs_fil_mcf firecs_fil_hybrid global_mmlu_all_tgl_mcf global_mmlu_ca_tgl_mcf global_mmlu_cs_tgl_mcf global_mmlu_unk_tgl_mcf global_mmlu_all_tgl_cf global_mmlu_ca_tgl_cf global_mmlu_cs_tgl_cf global_mmlu_unk_tgl_cf global_mmlu_all_tgl_hybrid global_mmlu_ca_tgl_hybrid global_mmlu_cs_tgl_hybrid global_mmlu_unk_tgl_hybrid include_tgl_mcf include_tgl_hybrid kalahi_tgl_hybrid kalahi_tgl_mcf newsphnli_fil_mcf newsphnli_fil_cf newsphnli_fil_hybrid ntrex128_fil sib200_tgl_mcf sib200_tgl_hybrid sib200_ceb_mcf sib200_ceb_hybrid stingraybench_semantic_appropriateness_tgl_mcf stingraybench_semantic_appropriateness_tgl_hybrid stingraybench_correctness_tgl_mcf stingraybench_correctness_tgl_hybrid tatoeba_ceb tatoeba_tgl tico19_tgl tlunifiedner_tgl_mcf tlunifiedner_tgl_hybrid universalner_ceb_mcf universalner_ceb_hybrid universalner_tgl_mcf universalner_tgl_hybrid
Flores200 facebook/flores
multilingual translation
arabic armenian bengali cyrillic devanagari ethiopic georgian greek gujarati gurmukhi chinese (simplified) chinese (traditional) hangul hebrew japanese khmer kannada lao latin malayalam myanmar odia sinhala tamil telugu thai tibetan
Flores200 multilingual benchmark.
Run using lighteval:
flores200
multilingual qa
french
FQuAD v2: French Question Answering Dataset version 2.
Run using lighteval:
fquadv2_fra
French Boolq manu/french_boolq
classification multilingual qa
french
French Boolq multilingual benchmark.
Run using lighteval:
community_boolq_fra
French TriviaQA manu/french-trivia
multilingual qa
french
French TriviaQA multilingual benchmark.
Run using lighteval:
community_triviaqa_fra
multilingual qa
german
GermanQuAD: High-quality German QA dataset with 13,722 questions.
Run using lighteval:
germanquad_deu
knowledge multilingual multiple-choice
amharic arabic bengali chinese czech dutch english french german hebrew hindi indonesian italian japanese korean malay norwegian polish portuguese romanian russian serbian spanish swahili swedish tamil telugu thai turkish ukrainian urdu vietnamese yoruba zulu
Translated MMLU using both professional and non-professional translators.
Contains tags for cultural sensitivity.
Run using lighteval:
global_mmlu_all_amh_mcf global_mmlu_all_ara_mcf global_mmlu_all_ben_mcf
global_mmlu_all_zho_mcf global_mmlu_all_ces_mcf global_mmlu_all_deu_mcf global_mmlu_all_eng_mcf global_mmlu_all_spa_mcf global_mmlu_all_fra_mcf global_mmlu_all_heb_mcf global_mmlu_all_hin_mcf global_mmlu_all_ind_mcf global_mmlu_all_ita_mcf global_mmlu_all_jpn_mcf global_mmlu_all_kor_mcf global_mmlu_all_msa_mcf global_mmlu_all_nld_mcf global_mmlu_all_nor_mcf global_mmlu_all_pol_mcf global_mmlu_all_por_mcf global_mmlu_all_ron_mcf global_mmlu_all_rus_mcf global_mmlu_all_srp_mcf global_mmlu_all_swe_mcf global_mmlu_all_swa_mcf global_mmlu_all_tam_mcf global_mmlu_all_tel_mcf global_mmlu_all_tha_mcf global_mmlu_all_tur_mcf global_mmlu_all_ukr_mcf global_mmlu_all_urd_mcf global_mmlu_all_vie_mcf global_mmlu_all_yor_mcf global_mmlu_all_zul_mcf
classification language-understanding
english
The General Language Understanding Evaluation (GLUE) benchmark is a collection
of resources for training, evaluating, and analyzing natural language
understanding systems.
Run using lighteval:
glue super_glue
math reasoning
english
GSM-Plus is an adversarial extension of GSM8K that tests the robustness of LLMs'
mathematical reasoning by introducing varied perturbations to grade-school math
problems.
Run using lighteval:
gsm_plus
math reasoning
english
GSM8K is a dataset of 8,000+ high-quality, linguistically diverse grade-school math word problems that require multi-step reasoning.
Run using lighteval:
gsm8k
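Math tasks like GSM8K are often run with in-context examples and alongside related tasks; lighteval accepts a comma-separated list of task strings with a per-task few-shot count. A sketch under the same syntax caveats as the example near the top of this page:

# Sketch: GSM8K with 5 in-context examples plus MATH-500 zero-shot, writing
# results to a local directory. The output flag and task-string fields are
# assumptions; verify with lighteval accelerate --help.
lighteval accelerate \
  "model_name=HuggingFaceTB/SmolLM2-1.7B-Instruct" \
  "lighteval|gsm8k|5|0,lighteval|math_500|0|0" \
  --output-dir ./eval_results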
health medical multiple-choice qa
english spanish
HEAD-QA is a multi-choice HEAlthcare Dataset. The questions come from exams to
access a specialized position in the Spanish healthcare system, and are
challenging even for highly specialized humans. They are designed by the
Ministerio de Sanidad,...
Run using lighteval:
headqa
Hellaswag Rowan/hellaswag
multiple-choice narrative reasoning
english
HellaSwag is a commonsense inference benchmark designed to challenge language
models with adversarially filtered multiple-choice questions.
Run using lighteval:
hellaswag
multilingual multiple-choice reasoning
hindi
Hellaswag Hin multilingual benchmark.
Run using lighteval:
community_hellaswag_hin_mcf community_hellaswag_hin_cf community_hellaswag_hin_hybrid
multilingual multiple-choice reasoning
telugu
Hellaswag Tel multilingual benchmark.
Run using lighteval:
community_hellaswag_tel_mcf community_hellaswag_tel_cf community_hellaswag_tel_hybrid
multilingual multiple-choice reasoning
thai
Hellaswag Thai This is a Thai adaptation of the Hellaswag task. Similar to the
Turkish version, there's no specific paper, but it has been found to be
effective for evaluating Thai language models on commonsense reasoning tasks.
Run using lighteval:
community_hellaswag_tha_mcf community_hellaswag_tha_cf community_hellaswag_tha_hybrid
multilingual multiple-choice reasoning
turkish
Hellaswag Turkish This is a Turkish adaptation of the Hellaswag task. While
there's no specific paper for this version, it has been found to work well for
evaluating Turkish language models on commonsense reasoning tasks. We don't
handle them in...
Run using lighteval:
community_hellaswag_tur_mcf community_hellaswag_tur_cf community_hellaswag_tur_hybrid
multilingual multiple-choice reasoning
hindi
Hindi Arc multilingual benchmark.
Run using lighteval:
community_arc_hin_mcf community_arc_hin_cf community_arc_hin_hybrid
Hindi Boolq ai4bharat/boolq-hi
classification multilingual qa
gujarati hindi malayalam marathi tamil
Hindi Boolq multilingual benchmark.
Run using lighteval:
community_boolq_hin community_boolq_guj community_boolq_mal
community_boolq_mar community_boolq_tam
multilingual qa
assamese bengali gujarati hindi kannada malayalam marathi oriya punjabi tamil telugu
IndicQA: A reading comprehension dataset for 11 Indian languages.
Run using lighteval:
indicqa_asm indicqa_ben indicqa_guj
indicqa_hin indicqa_kan indicqa_mal indicqa_mar indicqa_ori indicqa_pan indicqa_tam indicqa_tel
multilingual qa
swahili
KenSwQuAD: A question answering dataset for Kenyan Swahili.
Run using lighteval:
kenswquad_swa
language-modeling
english
LAMBADA is a benchmark for testing language models’ ability to understand broad
narrative context. Each passage requires predicting its final word—easy for
humans given the full passage but impossible from just the last sentence.
Success demands...
Run using lighteval:
lambada
legal
english
Measures fine-grained legal reasoning through reverse entailment.
Run using lighteval:
legalsupport
classification legal
english
LexGLUE: A Benchmark Dataset for Legal Language Understanding in English
Run using lighteval:
lexglue
classification legal
bulgarian czech danish german greek english spanish estonian finnish french irish croatian hungarian italian lithuanian latvian maltese dutch polish portuguese romanian slovak slovenian swedish
LEXTREME: A Multi-Lingual and Multi-Task Benchmark for the Legal Domain
Run using lighteval:
lextreme
qa
english
LogiQA is a machine reading comprehension dataset focused on testing logical
reasoning abilities. It contains 8,678 expert-written multiple-choice questions
covering various types of deductive reasoning. While humans perform...
Run using lighteval:
logiqa
legal qa
english
Questions from law school admission tests.
Run using lighteval:
lsat_qa
knowledge multilingual multiple-choice
afrikaans chinese english italian javanese portuguese swahili thai vietnamese
M3Exam: Multitask Multilingual Multimodal Evaluation Benchmark. It also contains...
Run using lighteval:
m3exams_afr_mcf m3exams_afr_cf m3exams_afr_hybrid
m3exams_zho_mcf m3exams_zho_cf m3exams_zho_hybrid m3exams_eng_mcf m3exams_eng_cf m3exams_eng_hybrid m3exams_ita_mcf m3exams_ita_cf m3exams_ita_hybrid m3exams_jav_mcf m3exams_jav_cf m3exams_jav_hybrid m3exams_por_mcf m3exams_por_cf m3exams_por_hybrid m3exams_swa_mcf m3exams_swa_cf m3exams_swa_hybrid m3exams_tha_mcf m3exams_tha_cf m3exams_tha_hybrid m3exams_vie_mcf m3exams_vie_cf m3exams_vie_hybrid
Mathlogicqa Rus ai-forever/MERA
math multilingual qa reasoning
russian
MathLogicQA is a dataset for evaluating mathematical reasoning in language
models. It consists of multiple-choice questions that require logical reasoning
and mathematical problem-solving. This Russian version is part of the MERA
(Multilingual...
Run using lighteval:
mathlogic_qa_rus_cf mathlogic_qa_rus_mcf mathlogic_qa_rus_hybrid
math qa reasoning
english
A large-scale dataset of math word problems. The dataset is gathered by using a new representation language to annotate the AQuA-RAT dataset with fully-specified operational programs. AQuA-RAT has provided the questions,
options, rationale, and...
Run using lighteval:
mathqa
dialog health medical
english
A collection of medical dialogue datasets.
Run using lighteval:
med_dialog
knowledge multilingual multiple-choice
french german hindi italian portuguese spanish thai
Meta MMLU: A multilingual version of MMLU (using google translation)
Run using lighteval:
meta_mmlu_deu_mcf meta_mmlu_deu_cf meta_mmlu_deu_hybrid
meta_mmlu_spa_mcf meta_mmlu_spa_cf meta_mmlu_spa_hybrid meta_mmlu_fra_mcf meta_mmlu_fra_cf meta_mmlu_fra_hybrid meta_mmlu_hin_mcf meta_mmlu_hin_cf meta_mmlu_hin_hybrid meta_mmlu_ita_mcf meta_mmlu_ita_cf meta_mmlu_ita_hybrid meta_mmlu_por_mcf meta_mmlu_por_cf meta_mmlu_por_hybrid meta_mmlu_tha_mcf meta_mmlu_tha_cf meta_mmlu_tha_hybrid
math multilingual reasoning
english spanish french german russian chinese japanese thai swahili bengali telugu
Multilingual Grade School Math Benchmark (MGSM) is a benchmark of grade-school
math problems.
The same 250 problems from GSM8K are each translated by human annotators into 10
languages.
Run using lighteval:
mgsm
math multilingual reasoning
bengali chinese english french german japanese russian spanish swahili telugu thai
Mgsm multilingual benchmark.
Run using lighteval:
mgsm_eng mgsm_spa mgsm_fra
mgsm_deu mgsm_rus mgsm_zho mgsm_jpn mgsm_tha mgsm_swa mgsm_ben mgsm_tel
knowledge multilingual qa
arabic english french german hindi italian japanese portuguese spanish
Mintaka multilingual benchmark.
Run using lighteval:
mintaka_ara mintaka_deu mintaka_eng
mintaka_spa mintaka_fra mintaka_hin mintaka_ita mintaka_jpn mintaka_por
multilingual qa
arabic chinese chinese_hong_kong chinese_traditional danish dutch english finnish french german hebrew hungarian italian japanese khmer korean malay norwegian polish portuguese russian spanish swedish thai turkish vietnamese
Mkqa multilingual benchmark.
Run using lighteval:
mkqa_ara mkqa_dan mkqa_deu
mkqa_eng mkqa_spa mkqa_fin mkqa_fra mkqa_heb mkqa_hun mkqa_ita mkqa_jpn mkqa_kor mkqa_khm mkqa_msa mkqa_nld mkqa_nor mkqa_pol mkqa_por mkqa_rus mkqa_swe mkqa_tha mkqa_tur mkqa_vie mkqa_zho
Mlmm Arc Challenge jon-tow/okapi_arc_challenge
multilingual multiple-choice reasoning
arabic bengali catalan chinese croatian danish dutch french german hindi hungarian indonesian italian kannada malayalam marathi nepali romanian russian serbian slovak spanish tamil telugu ukrainian vietnamese
ARC (AI2 Reasoning Challenge) is a dataset for question answering that requires
reasoning. It consists of multiple-choice science questions from 3rd to 9th
grade exams. The dataset is split into two parts: ARC-Easy and ARC-Challenge.
ARC-Easy...
Run using lighteval:
mlmm_arc_rus_mcf:challenge mlmm_arc_rus_cf:challenge mlmm_arc_rus_hybrid:challenge
mlmm_arc_deu_mcf:challenge mlmm_arc_deu_cf:challenge mlmm_arc_deu_hybrid:challenge mlmm_arc_zho_mcf:challenge mlmm_arc_zho_cf:challenge mlmm_arc_zho_hybrid:challenge mlmm_arc_fra_mcf:challenge mlmm_arc_fra_cf:challenge mlmm_arc_fra_hybrid:challenge mlmm_arc_spa_mcf:challenge mlmm_arc_spa_cf:challenge mlmm_arc_spa_hybrid:challenge mlmm_arc_ita_mcf:challenge mlmm_arc_ita_cf:challenge mlmm_arc_ita_hybrid:challenge mlmm_arc_nld_mcf:challenge mlmm_arc_nld_cf:challenge mlmm_arc_nld_hybrid:challenge mlmm_arc_vie_mcf:challenge mlmm_arc_vie_cf:challenge mlmm_arc_vie_hybrid:challenge mlmm_arc_ind_mcf:challenge mlmm_arc_ind_cf:challenge mlmm_arc_ind_hybrid:challenge mlmm_arc_ara_mcf:challenge mlmm_arc_ara_cf:challenge mlmm_arc_ara_hybrid:challenge mlmm_arc_hun_mcf:challenge mlmm_arc_hun_cf:challenge mlmm_arc_hun_hybrid:challenge mlmm_arc_ron_mcf:challenge mlmm_arc_ron_cf:challenge mlmm_arc_ron_hybrid:challenge mlmm_arc_dan_mcf:challenge mlmm_arc_dan_cf:challenge mlmm_arc_dan_hybrid:challenge mlmm_arc_slk_mcf:challenge mlmm_arc_slk_cf:challenge mlmm_arc_slk_hybrid:challenge mlmm_arc_ukr_mcf:challenge mlmm_arc_ukr_cf:challenge mlmm_arc_ukr_hybrid:challenge mlmm_arc_cat_mcf:challenge mlmm_arc_cat_cf:challenge mlmm_arc_cat_hybrid:challenge mlmm_arc_srp_mcf:challenge mlmm_arc_srp_cf:challenge mlmm_arc_srp_hybrid:challenge mlmm_arc_hrv_mcf:challenge mlmm_arc_hrv_cf:challenge mlmm_arc_hrv_hybrid:challenge mlmm_arc_hin_mcf:challenge mlmm_arc_hin_cf:challenge mlmm_arc_hin_hybrid:challenge mlmm_arc_ben_mcf:challenge mlmm_arc_ben_cf:challenge mlmm_arc_ben_hybrid:challenge mlmm_arc_tam_mcf:challenge mlmm_arc_tam_cf:challenge mlmm_arc_tam_hybrid:challenge mlmm_arc_nep_mcf:challenge mlmm_arc_nep_cf:challenge mlmm_arc_nep_hybrid:challenge mlmm_arc_mal_mcf:challenge mlmm_arc_mal_cf:challenge mlmm_arc_mal_hybrid:challenge mlmm_arc_mar_mcf:challenge mlmm_arc_mar_cf:challenge mlmm_arc_mar_hybrid:challenge mlmm_arc_tel_mcf:challenge mlmm_arc_tel_cf:challenge mlmm_arc_tel_hybrid:challenge mlmm_arc_kan_mcf:challenge mlmm_arc_kan_cf:challenge mlmm_arc_kan_hybrid:challenge
multilingual multiple-choice reasoning
arabic armenian basque bengali catalan chinese croatian danish dutch french german gujarati hindi hungarian icelandic indonesian italian kannada malayalam marathi nepali norwegian portuguese romanian russian serbian slovak spanish swedish tamil telugu ukrainian vietnamese
Hellaswag is a commonsense reasoning task that requires models to complete a
given scenario with the most plausible ending. It tests the model's ability to
understand and reason about everyday situations and human behavior.
MLMM-Hellaswag:...
Run using lighteval:
mlmm_hellaswag_ara_mcf mlmm_hellaswag_ara_cf mlmm_hellaswag_ara_hybrid
mlmm_hellaswag_ben_mcf mlmm_hellaswag_ben_cf mlmm_hellaswag_ben_hybrid mlmm_hellaswag_cat_mcf mlmm_hellaswag_cat_cf mlmm_hellaswag_cat_hybrid mlmm_hellaswag_dan_mcf mlmm_hellaswag_dan_cf mlmm_hellaswag_dan_hybrid mlmm_hellaswag_deu_mcf mlmm_hellaswag_deu_cf mlmm_hellaswag_deu_hybrid mlmm_hellaswag_spa_mcf mlmm_hellaswag_spa_cf mlmm_hellaswag_spa_hybrid mlmm_hellaswag_eus_mcf mlmm_hellaswag_eus_cf mlmm_hellaswag_eus_hybrid mlmm_hellaswag_fra_mcf mlmm_hellaswag_fra_cf mlmm_hellaswag_fra_hybrid mlmm_hellaswag_guj_mcf mlmm_hellaswag_guj_cf mlmm_hellaswag_guj_hybrid mlmm_hellaswag_hin_mcf mlmm_hellaswag_hin_cf mlmm_hellaswag_hin_hybrid mlmm_hellaswag_hrv_mcf mlmm_hellaswag_hrv_cf mlmm_hellaswag_hrv_hybrid mlmm_hellaswag_hun_mcf mlmm_hellaswag_hun_cf mlmm_hellaswag_hun_hybrid mlmm_hellaswag_hye_mcf mlmm_hellaswag_hye_cf mlmm_hellaswag_hye_hybrid mlmm_hellaswag_ind_mcf mlmm_hellaswag_ind_cf mlmm_hellaswag_ind_hybrid mlmm_hellaswag_isl_mcf mlmm_hellaswag_isl_cf mlmm_hellaswag_isl_hybrid mlmm_hellaswag_ita_mcf mlmm_hellaswag_ita_cf mlmm_hellaswag_ita_hybrid mlmm_hellaswag_kan_mcf mlmm_hellaswag_kan_cf mlmm_hellaswag_kan_hybrid mlmm_hellaswag_mal_mcf mlmm_hellaswag_mal_cf mlmm_hellaswag_mal_hybrid mlmm_hellaswag_mar_mcf mlmm_hellaswag_mar_cf mlmm_hellaswag_mar_hybrid mlmm_hellaswag_nor_mcf mlmm_hellaswag_nor_cf mlmm_hellaswag_nor_hybrid mlmm_hellaswag_nep_mcf mlmm_hellaswag_nep_cf mlmm_hellaswag_nep_hybrid mlmm_hellaswag_nld_mcf mlmm_hellaswag_nld_cf mlmm_hellaswag_nld_hybrid mlmm_hellaswag_por_mcf mlmm_hellaswag_por_cf mlmm_hellaswag_por_hybrid mlmm_hellaswag_ron_mcf mlmm_hellaswag_ron_cf mlmm_hellaswag_ron_hybrid mlmm_hellaswag_rus_mcf mlmm_hellaswag_rus_cf mlmm_hellaswag_rus_hybrid mlmm_hellaswag_slk_mcf mlmm_hellaswag_slk_cf mlmm_hellaswag_slk_hybrid mlmm_hellaswag_srp_mcf mlmm_hellaswag_srp_cf mlmm_hellaswag_srp_hybrid mlmm_hellaswag_swe_mcf mlmm_hellaswag_swe_cf mlmm_hellaswag_swe_hybrid mlmm_hellaswag_tam_mcf mlmm_hellaswag_tam_cf mlmm_hellaswag_tam_hybrid mlmm_hellaswag_tel_mcf mlmm_hellaswag_tel_cf mlmm_hellaswag_tel_hybrid mlmm_hellaswag_ukr_mcf mlmm_hellaswag_ukr_cf mlmm_hellaswag_ukr_hybrid mlmm_hellaswag_vie_mcf mlmm_hellaswag_vie_cf mlmm_hellaswag_vie_hybrid mlmm_hellaswag_zho_mcf mlmm_hellaswag_zho_cf mlmm_hellaswag_zho_hybrid
knowledge multilingual multiple-choice
arabic bengali catalan chinese croatian danish dutch french german hindi hungarian indonesian italian kannada malayalam marathi nepali romanian russian serbian slovak spanish tamil telugu ukrainian vietnamese
MLMM MMLU: Another multilingual version of MMLU
Run using lighteval:
mlmm_mmlu_rus_mcf mlmm_mmlu_rus_cf mlmm_mmlu_rus_hybrid
mlmm_mmlu_deu_mcf mlmm_mmlu_deu_cf mlmm_mmlu_deu_hybrid mlmm_mmlu_zho_mcf mlmm_mmlu_zho_cf mlmm_mmlu_zho_hybrid mlmm_mmlu_fra_mcf mlmm_mmlu_fra_cf mlmm_mmlu_fra_hybrid mlmm_mmlu_spa_mcf mlmm_mmlu_spa_cf mlmm_mmlu_spa_hybrid mlmm_mmlu_ita_mcf mlmm_mmlu_ita_cf mlmm_mmlu_ita_hybrid mlmm_mmlu_nld_mcf mlmm_mmlu_nld_cf mlmm_mmlu_nld_hybrid mlmm_mmlu_vie_mcf mlmm_mmlu_vie_cf mlmm_mmlu_vie_hybrid mlmm_mmlu_ind_mcf mlmm_mmlu_ind_cf mlmm_mmlu_ind_hybrid mlmm_mmlu_ara_mcf mlmm_mmlu_ara_cf mlmm_mmlu_ara_hybrid mlmm_mmlu_hun_mcf mlmm_mmlu_hun_cf mlmm_mmlu_hun_hybrid mlmm_mmlu_ron_mcf mlmm_mmlu_ron_cf mlmm_mmlu_ron_hybrid mlmm_mmlu_dan_mcf mlmm_mmlu_dan_cf mlmm_mmlu_dan_hybrid mlmm_mmlu_slk_mcf mlmm_mmlu_slk_cf mlmm_mmlu_slk_hybrid mlmm_mmlu_ukr_mcf mlmm_mmlu_ukr_cf mlmm_mmlu_ukr_hybrid mlmm_mmlu_cat_mcf mlmm_mmlu_cat_cf mlmm_mmlu_cat_hybrid mlmm_mmlu_srp_mcf mlmm_mmlu_srp_cf mlmm_mmlu_srp_hybrid mlmm_mmlu_hrv_mcf mlmm_mmlu_hrv_cf mlmm_mmlu_hrv_hybrid mlmm_mmlu_hin_mcf mlmm_mmlu_hin_cf mlmm_mmlu_hin_hybrid mlmm_mmlu_ben_mcf mlmm_mmlu_ben_cf mlmm_mmlu_ben_hybrid mlmm_mmlu_tam_mcf mlmm_mmlu_tam_cf mlmm_mmlu_tam_hybrid mlmm_mmlu_nep_mcf mlmm_mmlu_nep_cf mlmm_mmlu_nep_hybrid mlmm_mmlu_mal_mcf mlmm_mmlu_mal_cf mlmm_mmlu_mal_hybrid mlmm_mmlu_mar_mcf mlmm_mmlu_mar_cf mlmm_mmlu_mar_hybrid mlmm_mmlu_tel_mcf mlmm_mmlu_tel_cf mlmm_mmlu_tel_hybrid mlmm_mmlu_kan_mcf mlmm_mmlu_kan_cf mlmm_mmlu_kan_hybrid
Mlmm Truthfulqa jon-tow/okapi_truthfulqa
factuality multilingual qa
arabic armenian basque bengali catalan chinese croatian danish dutch french german gujarati hindi hungarian icelandic indonesian italian kannada malayalam marathi nepali norwegian portuguese romanian russian serbian slovak spanish swedish tamil telugu ukrainian vietnamese
TruthfulQA: Measuring How Models Mimic Human Falsehoods
Run using lighteval:
mlmm_truthfulqa_ara_mcf mlmm_truthfulqa_ara_cf mlmm_truthfulqa_ara_hybrid
mlmm_truthfulqa_ben_mcf mlmm_truthfulqa_ben_cf mlmm_truthfulqa_ben_hybrid mlmm_truthfulqa_cat_mcf mlmm_truthfulqa_cat_cf mlmm_truthfulqa_cat_hybrid mlmm_truthfulqa_dan_mcf mlmm_truthfulqa_dan_cf mlmm_truthfulqa_dan_hybrid mlmm_truthfulqa_deu_mcf mlmm_truthfulqa_deu_cf mlmm_truthfulqa_deu_hybrid mlmm_truthfulqa_spa_mcf mlmm_truthfulqa_spa_cf mlmm_truthfulqa_spa_hybrid mlmm_truthfulqa_eus_mcf mlmm_truthfulqa_eus_cf mlmm_truthfulqa_eus_hybrid mlmm_truthfulqa_fra_mcf mlmm_truthfulqa_fra_cf mlmm_truthfulqa_fra_hybrid mlmm_truthfulqa_guj_mcf mlmm_truthfulqa_guj_cf mlmm_truthfulqa_guj_hybrid mlmm_truthfulqa_hin_mcf mlmm_truthfulqa_hin_cf mlmm_truthfulqa_hin_hybrid mlmm_truthfulqa_hrv_mcf mlmm_truthfulqa_hrv_cf mlmm_truthfulqa_hrv_hybrid mlmm_truthfulqa_hun_mcf mlmm_truthfulqa_hun_cf mlmm_truthfulqa_hun_hybrid mlmm_truthfulqa_hye_mcf mlmm_truthfulqa_hye_cf mlmm_truthfulqa_hye_hybrid mlmm_truthfulqa_ind_mcf mlmm_truthfulqa_ind_cf mlmm_truthfulqa_ind_hybrid mlmm_truthfulqa_isl_mcf mlmm_truthfulqa_isl_cf mlmm_truthfulqa_isl_hybrid mlmm_truthfulqa_ita_mcf mlmm_truthfulqa_ita_cf mlmm_truthfulqa_ita_hybrid mlmm_truthfulqa_kan_mcf mlmm_truthfulqa_kan_cf mlmm_truthfulqa_kan_hybrid mlmm_truthfulqa_mal_mcf mlmm_truthfulqa_mal_cf mlmm_truthfulqa_mal_hybrid mlmm_truthfulqa_mar_mcf mlmm_truthfulqa_mar_cf mlmm_truthfulqa_mar_hybrid mlmm_truthfulqa_nor_mcf mlmm_truthfulqa_nor_cf mlmm_truthfulqa_nor_hybrid mlmm_truthfulqa_nep_mcf mlmm_truthfulqa_nep_cf mlmm_truthfulqa_nep_hybrid mlmm_truthfulqa_nld_mcf mlmm_truthfulqa_nld_cf mlmm_truthfulqa_nld_hybrid mlmm_truthfulqa_por_mcf mlmm_truthfulqa_por_cf mlmm_truthfulqa_por_hybrid mlmm_truthfulqa_ron_mcf mlmm_truthfulqa_ron_cf mlmm_truthfulqa_ron_hybrid mlmm_truthfulqa_rus_mcf mlmm_truthfulqa_rus_cf mlmm_truthfulqa_rus_hybrid mlmm_truthfulqa_slk_mcf mlmm_truthfulqa_slk_cf mlmm_truthfulqa_slk_hybrid mlmm_truthfulqa_srp_mcf mlmm_truthfulqa_srp_cf mlmm_truthfulqa_srp_hybrid mlmm_truthfulqa_swe_mcf mlmm_truthfulqa_swe_cf mlmm_truthfulqa_swe_hybrid mlmm_truthfulqa_tam_mcf mlmm_truthfulqa_tam_cf mlmm_truthfulqa_tam_hybrid mlmm_truthfulqa_tel_mcf mlmm_truthfulqa_tel_cf mlmm_truthfulqa_tel_hybrid mlmm_truthfulqa_ukr_mcf mlmm_truthfulqa_ukr_cf mlmm_truthfulqa_ukr_hybrid mlmm_truthfulqa_vie_mcf mlmm_truthfulqa_vie_cf mlmm_truthfulqa_vie_hybrid mlmm_truthfulqa_zho_mcf mlmm_truthfulqa_zho_cf mlmm_truthfulqa_zho_hybrid
multilingual qa
arabic chinese german hindi spanish vietnamese
MLQA (MultiLingual Question Answering) is a benchmark dataset for evaluating
cross-lingual question answering performance. It consists of QA instances in 7
languages: English, Arabic, German, Spanish, Hindi, Vietnamese, and Chinese. The
dataset is...
Run using lighteval:
mlqa_ara mlqa_deu mlqa_spa
mlqa_zho mlqa_hin mlqa_vie
general-knowledge knowledge multiple-choice
english
MMLU is a benchmark of general knowledge and English language understanding.
Run using lighteval:
mmlu
general-knowledge knowledge multiple-choice
english
MMLU-Redux is a subset of 5,700 manually re-annotated questions across 57 MMLU subjects.
Run using lighteval:
mmlu_redux_2
Mmmu Pro MMMU/MMMU_pro
general-knowledge knowledge multimodal multiple-choice
english
MMMU-Pro is a more robust, multi-discipline multimodal understanding benchmark derived from MMMU.
Run using lighteval:
mmmu_pro
conversational generation multi-turn
english
MT-Bench is a multi-turn conversational benchmark for evaluating language
models. It consists of 80 high-quality multi-turn questions across 8 common
categories (writing, roleplay, reasoning, math, coding, extraction, STEM,
humanities). Model...
Run using lighteval:
mt_bench
qa reading-comprehension
english
NarrativeQA is a reading comprehension benchmark that tests deep understanding
of full narratives—books and movie scripts—rather than shallow text matching. To
answer its questions, models must integrate information across entire stories.
Run using lighteval:
narrativeqa
general-knowledge qa
english
This dataset is a collection of question-answer pairs from the Natural Questions
dataset. See Natural Questions for additional information. This dataset can be
used directly with Sentence Transformers to train embedding models.
Run using lighteval:
natural_questions
math reasoning
english
Numeracy is a benchmark for evaluating the ability of language models to reason about mathematics.
Run using lighteval:
numeracy
knowledge multilingual multiple-choice
portuguese
OAB Exams: A collection of questions from the Brazilian Bar Association exam. The exam is required for anyone who wants to practice law in Brazil.
Run using lighteval:
oab_exams_por_mcf oab_exams_por_cf oab_exams_por_hybrid
Ocnli clue/clue
classification multilingual nli
chinese
Native Chinese NLI dataset.
Run using lighteval:
ocnli_zho_mcf ocnli_zho_cf ocnli_zho_hybrid
OlympiadBench Hothan/OlympiadBench
math reasoning language
english chinese
OlympiadBench is a benchmark for evaluating the performance of language models
on olympiad problems.
Run using lighteval:
olympiad_bench
Openai Mmlu openai/MMMLU
knowledge multilingual multiple-choice
arabic bengali chinese french german hindi indonesian italian japanese korean portuguese spanish swahili yoruba
Openai Mmlu multilingual benchmark.
Run using lighteval:
openai_mmlu_ara_mcf openai_mmlu_ara_cf openai_mmlu_ara_hybrid
openai_mmlu_ben_mcf openai_mmlu_ben_cf openai_mmlu_ben_hybrid openai_mmlu_deu_mcf openai_mmlu_deu_cf openai_mmlu_deu_hybrid openai_mmlu_spa_mcf openai_mmlu_spa_cf openai_mmlu_spa_hybrid openai_mmlu_fra_mcf openai_mmlu_fra_cf openai_mmlu_fra_hybrid openai_mmlu_hin_mcf openai_mmlu_hin_cf openai_mmlu_hin_hybrid openai_mmlu_ind_mcf openai_mmlu_ind_cf openai_mmlu_ind_hybrid openai_mmlu_ita_mcf openai_mmlu_ita_cf openai_mmlu_ita_hybrid openai_mmlu_jpn_mcf openai_mmlu_jpn_cf openai_mmlu_jpn_hybrid openai_mmlu_kor_mcf openai_mmlu_kor_cf openai_mmlu_kor_hybrid openai_mmlu_por_mcf openai_mmlu_por_cf openai_mmlu_por_hybrid openai_mmlu_swa_mcf openai_mmlu_swa_cf openai_mmlu_swa_hybrid openai_mmlu_yor_mcf openai_mmlu_yor_cf openai_mmlu_yor_hybrid openai_mmlu_zho_mcf openai_mmlu_zho_cf openai_mmlu_zho_hybrid
multilingual multiple-choice reasoning
arabic
OpenBookQA: A Question-Answering Dataset for Open-Book Exams OpenBookQA is a
question-answering dataset modeled after open-book exams for assessing human
understanding of a subject. It consists of multiple-choice questions that
require combining...
Run using lighteval:
alghafa_openbookqa_ara_mcf alghafa_openbookqa_ara_cf alghafa_openbookqa_ara_hybrid
multilingual multiple-choice reasoning
spanish
Spanish version of OpenBookQA from BSC Language Technology group
Run using lighteval:
openbookqa_spa_mcf openbookqa_spa_cf openbookqa_spa_hybrid
Openbook Rus ai-forever/MERA
multilingual multiple-choice reasoning
russian
The Russian version is part of the MERA (Multilingual Enhanced Russian NLP
Architectures) project.
Run using lighteval:
mera_openbookqa_rus_mcf mera_openbookqa_rus_cf mera_openbookqa_rus_hybrid
multiple-choice qa
english
OpenBookQA is a question-answering dataset modeled after open-book exams for
assessing human understanding of a subject. It contains multiple-choice
questions that require combining facts from a given open book with broad common
knowledge. The task...
Run using lighteval:
openbookqa
Oz Serbian Evals DjMel/oz-eval
knowledge multiple-choice
serbian
OZ Eval (sr. Opšte Znanje Evaluacija) dataset was created for the purposes of
evaluating General Knowledge of LLM models in Serbian language. Data consists
of 1k+ high-quality questions and answers which were used as part of entry exams
at the...
Run using lighteval:
serbian_evals:oz_task
multilingual
russian
PARus: Plausible Alternatives for Russian PARus is the Russian adaptation of the
COPA task, part of the Russian SuperGLUE benchmark. It evaluates common sense
reasoning and causal inference abilities in Russian language models.
Run using lighteval:
parus_rus_mcf parus_rus_cf parus_rus_hybrid
classification multilingual nli
chinese english french german japanese korean spanish
PAWS-X: A Cross-lingual Adversarial Dataset for Paraphrase Identification This
dataset contains paraphrase identification pairs in multiple languages. It's
derived from PAWS (Paraphrase Adversaries from Word Scrambling), and we treat
paraphrase as...
Run using lighteval:
pawsx_deu_mcf pawsx_deu_cf pawsx_deu_hybrid
pawsx_eng_mcf pawsx_eng_cf pawsx_eng_hybrid pawsx_spa_mcf pawsx_spa_cf pawsx_spa_hybrid pawsx_fra_mcf pawsx_fra_cf pawsx_fra_hybrid pawsx_jpn_mcf pawsx_jpn_cf pawsx_jpn_hybrid pawsx_kor_mcf pawsx_kor_cf pawsx_kor_hybrid pawsx_zho_mcf pawsx_zho_cf pawsx_zho_hybrid
commonsense multiple-choice qa
english
PIQA is a benchmark for testing physical commonsense reasoning. It contains questions that require everyday physical knowledge to answer.
Run using lighteval:
piqa
multilingual multiple-choice qa reasoning
arabic
PIQA: Physical Interaction Question Answering PIQA is a benchmark for testing
physical commonsense reasoning. This Arabic version is a translation of the
original PIQA dataset, adapted for Arabic language evaluation. It tests the
ability to reason...
Run using lighteval:
alghafa_piqa_ara_mcf alghafa_piqa_ara_cf alghafa_piqa_ara_hybrid
reasoning qa physical-commonsense
english
PROST is a benchmark for testing physical reasoning about objects through space
and time. It includes 18,736 multiple-choice questions covering 10 core physics
concepts, designed to probe models in zero-shot settings. Results show that even
large...
Run using lighteval:
prost
Pubmedqa pubmed_qa
biomedical health medical qa
english
PubMedQA is a dataset for biomedical research question answering.
Run using lighteval:
pubmedqa
Qa4Mre qa4mre
biomedical health qa
english
QA4MRE is a machine reading comprehension benchmark from the CLEF 2011-2013
challenges. It evaluates systems' ability to answer questions requiring deep
understanding of short texts, supported by external background knowledge.
Covering tasks like...
Run using lighteval:
qa4mre
qa scientific
english
QASPER is a dataset for question answering on scientific research papers. It
consists of 5,049 questions over 1,585 Natural Language Processing papers. Each
question is written by an NLP practitioner who read only the title and abstract
of the...
Run using lighteval:
qasper
dialog qa
english
The QuAC benchmark for question answering in the context of dialogues.
Run using lighteval:
quac
Race High EleutherAI/race
multiple-choice reading-comprehension
english
RACE is a large-scale reading comprehension dataset with more than 28,000
passages and nearly 100,000 questions. The dataset is collected from English
examinations in China, which are designed for middle school and high school
students. The dataset...
Run using lighteval:
race:high
classification reasoning
english
The Real-world annotated few-shot (RAFT) meta-benchmark of 11 real-world text
classification tasks.
Run using lighteval:
raft
classification multilingual nli
russian
Russian Commitment Bank (RCB) is a large-scale NLI dataset with Russian
sentences, collected from the web and crowdsourcing.
Run using lighteval:
rcb_rus_mcf rcb_rus_cf rcb_rus_hybrid
Real Toxicity Prompts allenai/real-toxicity-prompts
generation safety
english
The RealToxicityPrompts dataset for measuring toxicity in prompted model generations
Run using lighteval:
real_toxicity_prompts
physics chemistry biology reasoning multiple-choice qa
english
The SciQ dataset contains 13,679 crowdsourced science exam questions about
Physics, Chemistry and Biology, among others. The questions are in
multiple-choice format with 4 answer options each. For the majority of the
questions, an additional...
Run using lighteval:
sciq
knowledge multiple-choice
serbian
The tasks cover a variety of benchmarks, including standard tasks like ARC (Easy and Challenge), BoolQ, HellaSwag, OpenBookQA, PIQA, Winogrande, and a custom OZ Eval. MMLU is included both split by subject and all in one.
Run using lighteval:
serbian_evals
commonsense multiple-choice qa
english
We introduce Social IQa: Social Interaction QA, a new question-answering
benchmark for testing social commonsense intelligence. Contrary to many prior
benchmarks that focus on physical or taxonomic knowledge, Social IQa focuses on
reasoning about...
Run using lighteval:
siqa
reasoning symbolic
english
SLR-Bench is a large-scale benchmark for scalable logical reasoning with
language models, comprising 19,000 prompts organized into 20 curriculum levels.
Run using lighteval:
slr_bench_all slr_bench_basic slr_bench_easy
slr_bench_medium slr_bench_hard
multilingual qa
spanish
SQuAD-es: Spanish translation of the Stanford Question Answering Dataset
Run using lighteval:
squad_spa
multilingual qa
italian
SQuAD-it: Italian translation of the SQuAD dataset.
Run using lighteval:
squad_ita
qa
english
Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset,
consisting of questions posed by crowdworkers on a set of Wikipedia articles,
where the answer to every question is a segment of text, or span, from the
corresponding...
Run using lighteval:
squad_v2
narrative reasoning
english
A Corpus and Cloze Evaluation for Deeper Understanding of
Commonsense Stories
Run using lighteval:
storycloze
summarization
english
Covers two summarization papers: "Don't Give Me the Details, Just the Summary! Topic-Aware Convolutional Neural Networks for Extreme Summarization" and "Abstractive Text Summarization using Sequence-to-sequence RNNs and Beyond".
Run using lighteval:
summarization
narrative reasoning
english
The dataset consists of 113k multiple choice questions about grounded situations
(73k training, 20k validation, 20k test). Each question is a video caption from
LSMDC or ActivityNet Captions, with four answer choices about what might happen
next in...
Run using lighteval:
swag
Swahili Arc
multilingual multiple-choice reasoning
swahili
Swahili Arc multilingual benchmark.
Run using lighteval:
community_arc_swa_mcf community_arc_swa_cf community_arc_swa_hybrid
Thai Exams scb10x/thai_exam
knowledge multilingual multiple-choice
thai
Thai Exams multilingual benchmark.
Run using lighteval:
thai_exams_tha_mcf thai_exams_tha_cf thai_exams_tha_hybrid
language-modeling
english
The Pile corpus for measuring language model performance across various domains.
Run using lighteval:
the_pile
generation safety
english
This dataset is for implicit hate speech detection. All instances were generated using GPT-3 and the methods described in the ToxiGen paper.
Run using lighteval:
toxigen
multilingual qa
turkish
TQuAD v2: Turkish Question Answering Dataset version 2.
Run using lighteval:
tquadv2_tur
qa
english
TriviaQA is a reading comprehension dataset containing over 650K question-answer-evidence triples. TriviaQA includes 95K question-answer pairs
authored by trivia enthusiasts and independently gathered evidence documents,
six per question on...
Run using lighteval:
triviaqa
knowledge multiple-choice
turkic
TUMLU-mini is a benchmark for Turkic language understanding, comprising 1,000
prompts organized into 10 subsets.
Run using lighteval:
tumlu
Turkish Arc malhajar/arc-tr
multilingual multiple-choice reasoning
turkish
Turkish ARC, from the Turkish leaderboard.
Run using lighteval:
community_arc_tur_mcf community_arc_tur_cf community_arc_tur_hybrid
knowledge multilingual multiple-choice
turkish
Turkish Mmlu multilingual benchmark.
Run using lighteval:
mmlu_tur_mcf mmlu_tur_cf mmlu_tur_hybrid
language-modeling
english
Demographic Dialectal Variation in Social Media: A Case Study of African-American English
Run using lighteval:
twitterAAE
multilingual qa
arabic bengali english finnish indonesian japanese korean russian swahili telugu thai
TyDi QA (and other reading-comprehension QA tasks): A benchmark for information-seeking question answering in typologically diverse languages. https://arxiv.org/abs/2003.05002
Run using lighteval:
tydiqa_eng tydiqa_ara tydiqa_ben
tydiqa_fin tydiqa_ind tydiqa_jpn tydiqa_kor tydiqa_swa tydiqa_rus tydiqa_tel tydiqa_tha
language-modeling reasoning
english
Benchmark where we ask the model to unscramble a word that was scrambled either by anagramming or by random letter insertion.
Run using lighteval:
unscramble
qa
english
This dataset consists of 6,642 question/answer pairs. The questions are supposed
to be answerable by Freebase, a large knowledge graph. The questions are mostly
centered around a single named entity. The questions are popular ones asked on
the web.
Run using lighteval:
webqs
language-modeling
english
The WikiText language modeling dataset is a collection of over 100 million
tokens extracted from the set of verified Good and Featured articles on
Wikipedia. The dataset is available under the Creative Commons
Attribution-ShareAlike License.
Run using lighteval:
wikitext:103:document_level
commonsense multiple-choice
english
WinoGrande is a new collection of 44k problems, inspired by Winograd Schema
Challenge (Levesque, Davis, and Morgenstern 2011), but adjusted to improve the
scale and robustness against the dataset-specific bias. Formulated as a
fill-in-a-blank task...
Run using lighteval:
winogrande
Worldtree Rus ai-forever/MERA
multilingual
russian
WorldTree is a dataset for multi-hop inference in science question answering. It
provides explanations for elementary science questions by combining facts from a
semi-structured knowledge base. This Russian version is part of the MERA
(Multilingual...
Run using lighteval:
mera_worldtree_rus_mcf mera_worldtree_rus_cf mera_worldtree_rus_hybrid
multilingual multiple-choice reasoning
arabic chinese dutch english french german hindi italian japanese polish portuguese russian spanish swahili urdu vietnamese
Xcodah multilingual benchmark.
Run using lighteval:
xcodah_ara_mcf xcodah_ara_cf xcodah_ara_hybrid
xcodah_deu_mcf xcodah_deu_cf xcodah_deu_hybrid xcodah_eng_mcf xcodah_eng_cf xcodah_eng_hybrid xcodah_spa_mcf xcodah_spa_cf xcodah_spa_hybrid xcodah_fra_mcf xcodah_fra_cf xcodah_fra_hybrid xcodah_hin_mcf xcodah_hin_cf xcodah_hin_hybrid xcodah_ita_mcf xcodah_ita_cf xcodah_ita_hybrid xcodah_jpn_mcf xcodah_jpn_cf xcodah_jpn_hybrid xcodah_nld_mcf xcodah_nld_cf xcodah_nld_hybrid xcodah_pol_mcf xcodah_pol_cf xcodah_pol_hybrid xcodah_por_mcf xcodah_por_cf xcodah_por_hybrid xcodah_rus_mcf xcodah_rus_cf xcodah_rus_hybrid xcodah_swa_mcf xcodah_swa_cf xcodah_swa_hybrid xcodah_urd_mcf xcodah_urd_cf xcodah_urd_hybrid xcodah_vie_mcf xcodah_vie_cf xcodah_vie_hybrid xcodah_zho_mcf xcodah_zho_cf xcodah_zho_hybrid
commonsense multilingual multiple-choice reasoning
english
XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning. The Cross-lingual
Choice of Plausible Alternatives dataset is a benchmark to evaluate the ability
of machine learning models to transfer commonsense reasoning across languages.
Run using lighteval:
xcopa
Xcopa
multilingual multiple-choice narrative reasoning
arabic chinese estonian haitian indonesian italian quechua swahili tamil thai turkish vietnamese
COPA (Choice of Plausible Alternatives) tasks involve determining the most
plausible cause or effect for a given premise. These tasks test common sense
reasoning and causal inference abilities. XCOPA: Cross-lingual Choice of
Plausible Alternatives.
Run using lighteval:
xcopa_ara_mcf xcopa_ara_cf xcopa_ara_hybrid
xcopa_est_mcf xcopa_est_cf xcopa_est_hybrid xcopa_ind_mcf xcopa_ind_cf xcopa_ind_hybrid xcopa_ita_mcf xcopa_ita_cf xcopa_ita_hybrid xcopa_swa_mcf xcopa_swa_cf xcopa_swa_hybrid xcopa_tam_mcf xcopa_tam_cf xcopa_tam_hybrid xcopa_tha_mcf xcopa_tha_cf xcopa_tha_hybrid xcopa_tur_mcf xcopa_tur_cf xcopa_tur_hybrid xcopa_vie_mcf xcopa_vie_cf xcopa_vie_hybrid xcopa_zho_mcf xcopa_zho_cf xcopa_zho_hybrid xcopa_hti_mcf xcopa_hti_cf xcopa_hti_hybrid xcopa_que_mcf xcopa_que_cf xcopa_que_hybrid
multilingual multiple-choice qa reasoning
arabic chinese dutch english french german hindi italian japanese polish portuguese russian spanish swahili urdu vietnamese
XCSQA (Cross-lingual Commonsense QA) is part of the XCSR (Cross-lingual
Commonsense Reasoning) benchmark. It is a multilingual extension of the
CommonsenseQA dataset, covering 16 languages. The task involves answering
multiple-choice questions that...
Run using lighteval:
xcsqa_ara_mcf xcsqa_ara_cf xcsqa_ara_hybrid
xcsqa_deu_mcf xcsqa_deu_cf xcsqa_deu_hybrid xcsqa_eng_mcf xcsqa_eng_cf xcsqa_eng_hybrid xcsqa_spa_mcf xcsqa_spa_cf xcsqa_spa_hybrid xcsqa_fra_mcf xcsqa_fra_cf xcsqa_fra_hybrid xcsqa_hin_mcf xcsqa_hin_cf xcsqa_hin_hybrid xcsqa_ita_mcf xcsqa_ita_cf xcsqa_ita_hybrid xcsqa_jpn_mcf xcsqa_jpn_cf xcsqa_jpn_hybrid xcsqa_nld_mcf xcsqa_nld_cf xcsqa_nld_hybrid xcsqa_pol_mcf xcsqa_pol_cf xcsqa_pol_hybrid xcsqa_por_mcf xcsqa_por_cf xcsqa_por_hybrid xcsqa_rus_mcf xcsqa_rus_cf xcsqa_rus_hybrid xcsqa_swa_mcf xcsqa_swa_cf xcsqa_swa_hybrid xcsqa_urd_mcf xcsqa_urd_cf xcsqa_urd_hybrid xcsqa_vie_mcf xcsqa_vie_cf xcsqa_vie_hybrid xcsqa_zho_mcf xcsqa_zho_cf xcsqa_zho_hybrid
classification multilingual nli
arabic bulgarian chinese english french german greek hindi russian spanish swahili thai turkish urdu vietnamese
NLI (Natural Language Inference) tasks involve determining the logical
relationship between two given sentences: a premise and a hypothesis. The goal
is to classify whether the hypothesis is entailed by, contradicts, or is neutral
with respect to...
Run using lighteval:
xnli_ara_mcf xnli_ara_cf xnli_ara_hybrid
xnli_eng_mcf xnli_eng_cf xnli_eng_hybrid xnli_fra_mcf xnli_fra_cf xnli_fra_hybrid xnli_spa_mcf xnli_spa_cf xnli_spa_hybrid xnli_bul_mcf xnli_bul_cf xnli_bul_hybrid xnli_deu_mcf xnli_deu_cf xnli_deu_hybrid xnli_ell_mcf xnli_ell_cf xnli_ell_hybrid xnli_hin_mcf xnli_hin_cf xnli_hin_hybrid xnli_rus_mcf xnli_rus_cf xnli_rus_hybrid xnli_swa_mcf xnli_swa_cf xnli_swa_hybrid xnli_tha_mcf xnli_tha_cf xnli_tha_hybrid xnli_tur_mcf xnli_tur_cf xnli_tur_hybrid xnli_urd_mcf xnli_urd_cf xnli_urd_hybrid xnli_vie_mcf xnli_vie_cf xnli_vie_hybrid xnli_zho_mcf xnli_zho_cf xnli_zho_hybrid
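Each multilingual benchmark in this section exposes one task per language and per prompt formulation; the _mcf, _cf and _hybrid suffixes appear to correspond to multiple-choice, cloze, and hybrid formulations. If your lighteval version accepts a comma-separated list of task specifiers (an assumption worth checking), several language variants can be grouped into a single run, as in this sketch:

    # Minimal sketch: group several per-language XNLI variants into one tasks argument.
    # Assumptions (verify against your lighteval version):
    #   - task specifiers are "suite|task|num_fewshot|truncate_fewshot" in the "lighteval" suite
    #   - multiple specifiers can be joined with commas in a single tasks string
    LANGS = ["ara", "eng", "fra", "swa"]   # ISO 639-3 codes, matching the task IDs listed above
    FORMULATION = "mcf"                    # "mcf", "cf" or "hybrid", per the catalog naming

    tasks = ",".join(f"lighteval|xnli_{lang}_{FORMULATION}|0|0" for lang in LANGS)
    print(tasks)
    # lighteval|xnli_ara_mcf|0|0,lighteval|xnli_eng_mcf|0|0,lighteval|xnli_fra_mcf|0|0,lighteval|xnli_swa_mcf|0|0

The same pattern applies to xcodah, xcsqa, xcopa, xstory_cloze and xwinograd below, using the language codes each benchmark actually provides.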
classification multilingual nli
assamese bengali gujarati hindi kannada malayalam marathi oriya punjabi tamil telugu
Another variant of XNLI, with emphasis on Indic languages.
Run using lighteval:
indicnxnli_asm_mcf indicnxnli_asm_cf indicnxnli_asm_hybrid
Show 30 more
indicnxnli_ben_mcf indicnxnli_ben_cf indicnxnli_ben_hybrid indicnxnli_guj_mcf indicnxnli_guj_cf indicnxnli_guj_hybrid indicnxnli_hin_mcf indicnxnli_hin_cf indicnxnli_hin_hybrid indicnxnli_kan_mcf indicnxnli_kan_cf indicnxnli_kan_hybrid indicnxnli_mal_mcf indicnxnli_mal_cf indicnxnli_mal_hybrid indicnxnli_mar_mcf indicnxnli_mar_cf indicnxnli_mar_hybrid indicnxnli_ori_mcf indicnxnli_ori_cf indicnxnli_ori_hybrid indicnxnli_pan_mcf indicnxnli_pan_cf indicnxnli_pan_hybrid indicnxnli_tam_mcf indicnxnli_tam_cf indicnxnli_tam_hybrid indicnxnli_tel_mcf indicnxnli_tel_cf indicnxnli_tel_hybrid
Xnli2
classification multilingual nli
arabic assamese bengali bulgarian chinese english french german greek gujarati hindi kannada marathi punjabi russian sanskrit spanish swahili tamil thai turkish urdu vietnamese
Improvement on XNLI with better translations; in our experience, models tend to
perform better on XNLI 2.0 than on XNLI.
Run using lighteval:
xnli2.0_eng_mcf xnli2.0_eng_cf xnli2.0_eng_hybrid
xnli2.0_fra_mcf xnli2.0_fra_cf xnli2.0_fra_hybrid xnli2.0_pan_mcf xnli2.0_pan_cf xnli2.0_pan_hybrid xnli2.0_guj_mcf xnli2.0_guj_cf xnli2.0_guj_hybrid xnli2.0_kan_mcf xnli2.0_kan_cf xnli2.0_kan_hybrid xnli2.0_asm_mcf xnli2.0_asm_cf xnli2.0_asm_hybrid xnli2.0_ben_mcf xnli2.0_ben_cf xnli2.0_ben_hybrid xnli2.0_mar_mcf xnli2.0_mar_cf xnli2.0_mar_hybrid xnli2.0_san_mcf xnli2.0_san_cf xnli2.0_san_hybrid xnli2.0_tam_mcf xnli2.0_tam_cf xnli2.0_tam_hybrid xnli2.0_deu_mcf xnli2.0_deu_cf xnli2.0_deu_hybrid xnli2.0_urd_mcf xnli2.0_urd_cf xnli2.0_urd_hybrid xnli2.0_vie_mcf xnli2.0_vie_cf xnli2.0_vie_hybrid xnli2.0_tur_mcf xnli2.0_tur_cf xnli2.0_tur_hybrid xnli2.0_tha_mcf xnli2.0_tha_cf xnli2.0_tha_hybrid xnli2.0_swa_mcf xnli2.0_swa_cf xnli2.0_swa_hybrid xnli2.0_spa_mcf xnli2.0_spa_cf xnli2.0_spa_hybrid xnli2.0_rus_mcf xnli2.0_rus_cf xnli2.0_rus_hybrid xnli2.0_hin_mcf xnli2.0_hin_cf xnli2.0_hin_hybrid xnli2.0_ell_mcf xnli2.0_ell_cf xnli2.0_ell_hybrid xnli2.0_zho_mcf xnli2.0_zho_cf xnli2.0_zho_hybrid xnli2.0_bul_mcf xnli2.0_bul_cf xnli2.0_bul_hybrid xnli2.0_ara_mcf xnli2.0_ara_cf xnli2.0_ara_hybrid
multilingual qa
arabic chinese english german greek hindi romanian russian spanish thai turkish vietnamese
Reading Comprehension (RC) tasks evaluate a model's ability to understand and
extract information from text passages. These tasks typically involve answering
questions based on given contexts, spanning multiple languages and formats.
Run using lighteval:
xquad_ara xquad_deu xquad_ell
xquad_eng xquad_spa xquad_hin xquad_ron xquad_rus xquad_tha xquad_tur xquad_vie xquad_zho
multilingual narrative
arabic basque burmese chinese hindi indonesian russian spanish swahili telugu
XStoryCloze multilingual benchmark, with per-language variants of the StoryCloze story-ending task.
Run using lighteval:
xstory_cloze_rus_mcf xstory_cloze_rus_cf xstory_cloze_rus_hybrid
xstory_cloze_zho_mcf xstory_cloze_zho_cf xstory_cloze_zho_hybrid xstory_cloze_spa_mcf xstory_cloze_spa_cf xstory_cloze_spa_hybrid xstory_cloze_ara_mcf xstory_cloze_ara_cf xstory_cloze_ara_hybrid xstory_cloze_hin_mcf xstory_cloze_hin_cf xstory_cloze_hin_hybrid xstory_cloze_ind_mcf xstory_cloze_ind_cf xstory_cloze_ind_hybrid xstory_cloze_tel_mcf xstory_cloze_tel_cf xstory_cloze_tel_hybrid xstory_cloze_swa_mcf xstory_cloze_swa_cf xstory_cloze_swa_hybrid xstory_cloze_eus_mcf xstory_cloze_eus_cf xstory_cloze_eus_hybrid xstory_cloze_mya_mcf xstory_cloze_mya_cf xstory_cloze_mya_hybrid
multilingual narrative reasoning
arabic basque burmese chinese english hindi indonesian russian spanish swahili telugu
XStoryCloze consists of the English StoryCloze dataset (Spring 2016 version)
professionally translated into 10 non-English languages. The dataset is released
by Meta AI.
Run using lighteval:
xstory_cloze
commonsense multilingual reasoning
chinese english french japanese portuguese russian
Multilingual Winograd schema challenge, as used in Crosslingual Generalization through Multitask Finetuning.
Run using lighteval:
xwinograd
multilingual multiple-choice reasoning
chinese english french japanese portuguese russian
XWinograd multilingual benchmark, with per-language variants of the multilingual Winograd schema challenge.
Run using lighteval:
xwinograd_eng_mcf xwinograd_eng_cf xwinograd_eng_hybrid
xwinograd_fra_mcf xwinograd_fra_cf xwinograd_fra_hybrid xwinograd_jpn_mcf xwinograd_jpn_cf xwinograd_jpn_hybrid xwinograd_por_mcf xwinograd_por_cf xwinograd_por_hybrid xwinograd_rus_mcf xwinograd_rus_cf xwinograd_rus_hybrid xwinograd_zho_mcf xwinograd_zho_cf xwinograd_zho_hybrid