Starting kit

A starter kit with an end-to-end submission flow can be found here: Starting kit

Please join us on Discord for discussions and up-to-date announcements: Discord

Open evaluation datasets

Dataset         Dimension     Source
--------------  ------------  -------------------------------------------------
CommonsenseQA   Knowledge     https://www.tau-nlp.sites.tau.ac.il/commonsenseqa
BIG-Bench Hard  Reasoning     https://github.com/suzgunmirac/BIG-Bench-Hard
GSM8K           Math          https://github.com/openai/grade-school-math
LongBench       Long-Context  https://github.com/THUDM/LongBench
HumanEval       Programming   https://github.com/openai/human-eval
TruthfulQA      Knowledge     https://github.com/sylinrl/TruthfulQA
CHID            Language      https://github.com/chujiezheng/ChID-Dataset