Tracks

1. Compression Challenge: teams are tasked with developing their own compression methods to compress three pre-trained LLMs individually: Phi-2, Llama-3-8B, and Qwen-7B. Each submitted model must be capable of running on a smartphone with 12 GB of DRAM. Each model will be evaluated on a subset of the OpenCompass benchmark, which comprehensively assesses LLMs across multiple fundamental dimensions. The final submission score for this track is the average of the scores of the three compressed models.

2. Training from Scratch Challenge: teams are challenged to train language models from scratch without using any pre-trained LLMs. There are no constraints on model architectures, training procedures, or training duration, as long as the final models can run on a smartphone with 12 GB of memory. Participants are free to design their own architectures or use existing LLM architectures for this track. However, there is a restriction on training data: only the C4 and Alpaca datasets are permitted for training and fine-tuning.

Please note that quantization methods are not allowed, since 8-bit and 4-bit quantization of LLMs is already a well-established technique. Participants must submit models in FP16 or FP32 format.
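As a quick sanity check against the 12 GB budget, the FP16 weight footprint alone can be estimated from the parameter count (activations, KV cache, and runtime buffers add further overhead). Below is a minimal sketch in Python, using approximate public parameter counts for the three Track 1 base models:

```python
def fp16_weight_gib(n_params: float) -> float:
    """FP16 weight footprint in GiB: 2 bytes per parameter."""
    return n_params * 2 / 1024**3

# Approximate parameter counts for the Track 1 base models.
for name, n_params in [("Phi-2", 2.7e9), ("Llama-3-8B", 8.0e9), ("Qwen-7B", 7.7e9)]:
    print(f"{name}: ~{fp16_weight_gib(n_params):.1f} GiB of FP16 weights")
```

Note that the uncompressed Llama-3-8B and Qwen-7B weights alone already take roughly 15 GiB in FP16, before any runtime overhead, which is exactly why compression beyond plain quantization is needed to fit the 12 GB device.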

Submission

To participate in the challenge, you need to submit your entry by filling out a form. The form asks for a link to the GitHub repository where your code is hosted. Additionally, you should add our edge-llms-challenge account as a collaborator on your repository. This will allow us to access and review your code.

Here are the steps to complete the pre-submission and the final submission:

  1. Create a GitHub repository for your project.
  2. Add your code to the repository, including the source code for pre-training or compression, model definition files, configuration files, and a CSV file with your locally evaluated results.
  3. Include a .txt file containing Google Drive links for downloading the model checkpoints and the compiled model used to build the Android app.
  4. Fill out the pre-submission form or the final submission form with the required information, including the link to your GitHub repository.
  5. Add the edge-llms-challenge account as a collaborator on your repository and make your repository private.
  6. Submit the form to complete your entry.
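The official CSV layout will come from the starting kit; purely as a hypothetical illustration of step 2, locally evaluated scores could be written out as follows (the write_results_csv helper and the task/score column names are assumptions, not the required format):

```python
import csv
from typing import Dict

def write_results_csv(scores: Dict[str, float], path: str = "local_results.csv") -> None:
    """Write locally evaluated task scores to a two-column CSV (hypothetical layout)."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["task", "score"])
        for task, score in sorted(scores.items()):
            writer.writerow([task, f"{score:.2f}"])

# Usage: write_results_csv({"GSM8K": <your score>, "HumanEval": <your score>, ...})
```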

Please refer to this link for more details about the submission.

By following these steps, you will have successfully submitted your entry for the challenge. Good luck!

Evaluation

The evaluation process in our competition will include two stages.

The first stage: the submitted models will be evaluated on a diverse subset of the OpenCompass benchmark, including CommonsenseQA, BIG-Bench Hard, GSM8K, LongBench, HumanEval, TruthfulQA, and CHID, along with a set of secret holdout tasks to discourage overfitting. A self-contained code base will be provided so that participants can easily evaluate their models on these seven tasks via the OpenCompass benchmark.
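Before running the full OpenCompass-based evaluation, it can help to smoke-test that a submitted FP16 checkpoint loads and generates at all. Here is a minimal sketch using Hugging Face transformers, assuming your model is saved in a local HF-format directory (the checkpoint path is a placeholder):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

ckpt = "path/to/your/fp16_checkpoint"  # placeholder: your compressed or from-scratch model
tok = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(ckpt, torch_dtype=torch.float16)

# One short generation to confirm the checkpoint is loadable and coherent.
prompt = "Question: What is 17 + 25?\nAnswer:"
inputs = tok(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=32)
print(tok.decode(out[0], skip_special_tokens=True))
```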

The top 15 teams with the highest scores will be displayed on the leaderboard. Participants are required to submit their source code, evaluation log files, and models to the organizers (see the Submission section for details).

Participating teams are encouraged to submit their models for a preliminary review on the 25th of August to identify potential bugs and format issues, ensuring that the final submissions are in the correct format. After the preliminary review deadline, each team may make up to three submissions per track. The best-performing model will be used to rank the team on the leaderboard and will be selected for the final model evaluation.

The second stage: after the competition closes on October 25th, 2024, we will contact the top 3 teams with the highest scores in each track and ask them to submit all code and data necessary to reproduce their results. We will then replicate their entire pipeline to ensure it is fully reproducible and that the final model runs on a smartphone with 12 GB of RAM with the same results. If the top-scoring model cannot be reproduced under these conditions or does not fit on the smartphone, we will move on to the next highest-scoring submission in that track, until a reproducible and high-performing submission is selected.

Our evaluation will leverage an extensive array of rigorously curated datasets across multiple fundamental dimensions: language comprehension, knowledge precision, logical deduction, mathematical problem-solving, programming proficiency, extended text analysis, and intelligent agent engagement. Details of the evaluation tasks we chose are shown below:

Evaluation datasets

| Dataset | Dimension | Source |
| --- | --- | --- |
| CommonsenseQA | Knowledge | https://www.tau-nlp.sites.tau.ac.il/commonsenseqa |
| BIG-Bench Hard | Reasoning | https://github.com/suzgunmirac/BIG-Bench-Hard |
| GSM8K | Math | https://github.com/openai/grade-school-math |
| LongBench | Long-Context | https://github.com/THUDM/LongBench |
| HumanEval | Programming | https://github.com/openai/human-eval |
| TruthfulQA | Knowledge | https://github.com/sylinrl/TruthfulQA |
| CHID | Language | https://github.com/chujiezheng/ChID-Dataset |

Metrics

To comprehensively evaluate submissions, we will employ a set of rigorously curated metrics, which include:

Baseline, code, and materials

The “starting kit” will be released before June 25th, 2024, to provide a starting point for people who are interested in our challenge. It will clarify in detail what a submission looks like and how it will be evaluated and submitted, and it will include an end-to-end submission flow, exemplified with a simple baseline.

Training datasets

This competition will not evaluate submissions based on the analysis of data. Our competition features two tracks: (1) post-training LLM compression for edge devices; and (2) training edge LLMs from scratch. Participants may use only the C4 dataset and the (Chinese) Alpaca dataset for both tracks. The C4 dataset is a colossal, cleaned version of Common Crawl’s web crawl corpus, mainly intended for pre-training language models and word representations. Language models such as MPT-7B and T5 were pre-trained on the C4 dataset. C4 is large enough to keep the competition interesting and to support conclusive, statistically significant results.
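For reference, both permitted corpora are commonly accessed through the Hugging Face Hub; a minimal loading sketch with the datasets library is shown below. The dataset IDs allenai/c4 and tatsu-lab/alpaca are assumptions about which mirrors participants would use (substitute the appropriate ID for the Chinese Alpaca variant):

```python
from datasets import load_dataset

# C4 is several hundred GB, so stream it instead of downloading it up front.
c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)
print(next(iter(c4))["text"][:200])

# Alpaca is a small (~52k example) instruction-tuning set.
alpaca = load_dataset("tatsu-lab/alpaca", split="train")
print(alpaca[0]["instruction"])
```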