Starting kit
A starting kit with an end-to-end submission flow is available here: Starting kit
Please join us on Discord for discussions and up-to-date announcements: Discord
Open evaluation datasets
| Dataset | Dimension | Source |
| --- | --- | --- |
| CommonsenseQA | Knowledge | https://www.tau-nlp.sites.tau.ac.il/commonsenseqa |
| BIG-Bench Hard | Reasoning | https://github.com/suzgunmirac/BIG-Bench-Hard |
| GSM8K | Math | https://github.com/openai/grade-school-math |
| LongBench | Long-Context | https://github.com/THUDM/LongBench |
| HumanEval | Programming | https://github.com/openai/human-eval |
| TruthfulQA | Knowledge | https://github.com/sylinrl/TruthfulQA |
| CHID | Language | https://github.com/chujiezheng/ChID-Dataset |