Tracks
1. Compression Challenge: teams are tasked with developing their own compression methods to compress three pre-trained LLMs individually: Phi-2, Llama-3-8B, and Qwen-7B. Each submitted model must be capable of running on a smartphone with 12 GB of DRAM. Each model will be evaluated on a subset of the OpenCompass benchmark, which comprehensively assesses LLMs across multiple fundamental dimensions. For each task, the final submission score is the average score of the three compressed models.
2. Training from Scratch Challenge: teams are challenged to train language models from scratch without utilizing any pre-trained LLMs. There are no constraints on model architectures, training procedures, or duration, as long as the final models can run on a smartphone with 12 GB of memory. Participants are free to design their own architectures or use existing LLM architectures for this task. However, there is a restriction on the training data: only the C4 and Alpaca datasets are permitted for training and fine-tuning.
Please note that quantization methods are not allowed, since 8-bit and 4-bit quantization of LLMs are already well-established techniques. Participants must submit models in FP16 or FP32 format.
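For illustration only, here is a minimal sketch of preparing a checkpoint in the required half-precision format with Hugging Face transformers. It is not part of the official rules; the model identifier and output directory are placeholders.

```python
# Minimal sketch (not an official requirement beyond the FP16/FP32 rule):
# load a base model in half precision and save a (to-be-)compressed checkpoint.
# The model id and output path below are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/phi-2"  # Phi-2; Llama-3-8B and Qwen-7B would be handled the same way
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# ... apply your own compression method (pruning, distillation, ...) here ...

model.save_pretrained("phi-2-compressed-fp16")      # weights remain in FP16
tokenizer.save_pretrained("phi-2-compressed-fp16")
```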
Submission
To participate in the challenge, you need to submit your entry by filling out a form. The form asks for a link to the GitHub repository where your code is hosted. Additionally, you should add our edge-llms-challenge account as a collaborator on your repository so that we can access and review your code.
Here are the steps to complete the pre-submission and the final submission:
- Create a GitHub repository for your project.
- Add your code to the repository, including the source code for pre-training or compression, model definition files, configuration files, and a CSV file containing results evaluated locally.
- Include a .txt file containing Google Drive links to download the model checkpoints and the compiled model for building the Android app.
- Fill out the pre-submission form or the final submission form with the required information, including the link to your GitHub repository.
- Add edge-llm-challenge as a collaborator on your repository and make your repository private.
- Submit the form to complete your entry.
Please refer to this link for more details about the submission.
By following these steps, you will have successfully submitted your entry for the challenge. Good luck!
Evaluation
The evaluation process in our competition will include two stages.
The first stage: the submitted models will be evaluated on a diverse subset of the OpenCompass benchmark, including CSQA (CommonsenseQA), BIG-Bench Hard, GSM8K, LongBench, HumanEval, TruthfulQA, and CHID, along with a set of secret holdout tasks to avoid overfitting. A self-contained codebase will be provided so that participants can easily evaluate their models on the seven diverse tasks via the OpenCompass benchmark.
- Each submission will be ranked individually on each task, with the top 10 submissions per task receiving scores from 10 (rank 1) down to 1 (rank 10). For the first track, each task score is the average score of the three compressed models; for the second track, it is the score of the single submitted model.
- We will measure the models' throughput on a smartphone platform provided by the sponsor, and submissions will be ranked by throughput. To emphasize the importance of inference speed, the score for the throughput task will be doubled, ranging from 20 to 2 for the top 10 submissions. To help participants improve the speed of their models, we will provide an easy-to-run pipeline to measure throughput on a GPU, where the throughput values roughly scale to those on the smartphone platform.
- We will measure the inference memory usage of all models. Submissions whose models exceed 12 GB of memory usage during inference will be disqualified.
- The final rank of a submission is determined by the sum of its scores across all evaluation tasks (100 points in total): the seven diverse tasks shown in the table below (maximum 70 points), one secret holdout task (maximum 10 points), and the throughput measurement (maximum 20 points). A worked scoring example follows this list.
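To make the arithmetic concrete, the sketch below computes a total score under our reading of the rules above. It is not the organizers' official scoring script, and the ranks used are purely hypothetical.

```python
# Scoring sketch: per-task ranks 1..10 earn 10..1 points, the throughput rank is
# worth double (20..2), and unranked entries earn 0. All ranks below are made up.

def task_points(rank, weight=1):
    """Points for finishing at `rank` (1 = best); 0 if outside the top 10."""
    return weight * (11 - rank) if 1 <= rank <= 10 else 0

# Hypothetical ranks for one submission on the seven public tasks,
# the secret holdout task, and the throughput measurement.
public_ranks = {"CSQA": 2, "BBH": 5, "GSM8K": 1, "LongBench": 11,
                "HumanEval": 3, "TruthfulQA": 4, "CHID": 7}
holdout_rank = 6
throughput_rank = 2

total = (sum(task_points(r) for r in public_ranks.values())  # max 70
         + task_points(holdout_rank)                          # max 10
         + task_points(throughput_rank, weight=2))            # max 20
print(total)  # 44 + 5 + 18 = 67 out of a possible 100
```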
The top 15 teams with the highest scores will be displayed on the leaderboard. Participants are required to submit their source code, evaluation log files, and models to the organizers (see the Submission section for details).
Participating teams are encouraged to submit their models for a preliminary review on the 25th of August to identify potential bugs and format issues, ensuring that the final submissions are in the correct format. After the preliminary-review deadline, each team can make three submissions to each track. The best-performing model will be used to rank the team on the leaderboard and will be selected for the final model evaluation.
The second stage: after the competition closes on October 25th, 2024, we will contact the top 3 teams with the highest scores in both tracks and request that they submit all code and data necessary to reproduce their results. We will then replicate their entire process to ensure it is fully repeatable and that the final model runs on a smartphone with 12 GB RAM with the same results. If the top-scoring model cannot be reproduced under these conditions or does not fit on the smartphone, we will move on to the next highest-scoring submission in that category, until a reproducible and high-performing submission is selected.
Our evaluation will leverage an extensive array of rigorously curated datasets across multiple fundamental dimensions: language comprehension, knowledge precision, logical deduction, mathematical problem-solving, programming proficiency, extended text analysis, and intelligent agent engagement. Details of the evaluation tasks we chose are shown below:
Evaluation datasets
| Dataset | Dimension | Source |
|---|---|---|
| CommonsenseQA | Knowledge | https://www.tau-nlp.sites.tau.ac.il/commonsenseqa |
| BIG-Bench Hard | Reasoning | https://github.com/suzgunmirac/BIG-Bench-Hard |
| GSM8K | Math | https://github.com/openai/grade-school-math |
| LongBench | Long-Context | https://github.com/THUDM/LongBench |
| HumanEval | Programming | https://github.com/openai/human-eval |
| TruthfulQA | Knowledge | https://github.com/sylinrl/TruthfulQA |
| CHID | Language | https://github.com/chujiezheng/ChID-Dataset |
Metrics
To comprehensively evaluate submissions, we will employ a set of rigorously curated metrics, which include:
- Performance Score: One of our primary scoring metrics is the performance score on the selected evaluation tasks. Each submission will be ranked individually on each task by performance score, with the top 10 submissions per task receiving scores from 10 to 1. The final score of a submission is the sum of its scores across all evaluation tasks.
- Memory Requirement: The memory footprint during inference is a crucial metric for real-life edge devices. To qualify, the peak memory usage of every model during inference must stay below 12 GB.
- Throughput: The throughput of an LLM refers to the rate at which the model processes input data and generates output tokens, usually measured in tokens per second. Throughput is a critical metric for evaluating the efficiency and performance of LLMs, especially in real-time or near-real-time applications such as live chatbots, real-time translation, or speech recognition, where processing speed is crucial. This value will be measured on a smartphone with 12 GB DRAM (a local measurement sketch follows this list).
- Parameter count: Submissions should also report the model size, expressed as the parameter count. This metric is used for information only, not for ranking.
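For reference, below is a rough sketch of how the efficiency metrics (parameter count, tokens-per-second throughput, and peak memory) could be measured locally on a GPU with PyTorch and transformers. This is not the organizers' measurement pipeline, GPU numbers are only a proxy for the on-device measurements, and the model and prompt are placeholders.

```python
# Rough local measurement of parameter count, throughput, and peak GPU memory.
# NOT the official pipeline; GPU figures only approximate smartphone behavior.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/phi-2"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16).cuda().eval()

# Parameter count (reported for information only, not for ranking).
print("parameters:", sum(p.numel() for p in model.parameters()))

inputs = tokenizer("Edge LLMs should be", return_tensors="pt").to("cuda")
torch.cuda.reset_peak_memory_stats()

start = time.time()
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=128, do_sample=False)
elapsed = time.time() - start

new_tokens = output.shape[1] - inputs["input_ids"].shape[1]
print(f"throughput: {new_tokens / elapsed:.1f} tokens/s")
print(f"peak GPU memory: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB")
```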
Baseline, code, and materials
The “starting kit” will be released before June 25th, 2024, to provide a starting point for people interested in our challenge. It will clarify in detail what a submission looks like and how it will be evaluated and submitted, and it will include an end-to-end submission flow, exemplified with a simple baseline:
- Loading a large language model chosen from Phi-2, Llama3-8B-Instruct, or Qwen-7B-Instruct.
- Evaluating the pruned model on the OpenCompass benchmark. We will provide an easy-to-run pipeline for participants to evaluate their models on the subset of the OpenCompass benchmark, as well as a tool for measuring the throughput and GPU memory usage of an LLM.
- Compiling a model into a binary model library that can run on Android devices. We will provide a tool that helps participants easily deploy their LLMs on the smartphone platform.
Training datasets
This competition will not evaluate submissions on the analysis of data. It features two tracks: (1) post-training LLM compression for edge devices; and (2) training edge LLMs from scratch. Participants may use only the C4 dataset and the (Chinese) Alpaca dataset for both tracks. The C4 dataset is a colossal, cleaned version of Common Crawl’s web crawl corpus, mainly intended for pre-training language models and word representations; language models such as MPT-7B and T5 were pre-trained on C4. C4 is large enough to keep the competition interesting and to support conclusive, statistically significant results. A minimal data-loading sketch follows the links below.
- The C4 dataset can be accessed through: C4 dataset.
- The Chinese version of Alpaca can be accessed through: Alpaca dataset.
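As a convenience, here is a minimal sketch of streaming the permitted data with the Hugging Face datasets library. The allenai/c4 identifier is the standard Hub copy of C4; the Chinese Alpaca identifier is left as a placeholder, since participants should use the exact release linked above.

```python
# Minimal sketch of loading the permitted training data with `datasets`.
# "allenai/c4" (config "en") is the standard Hub copy of C4; the Chinese Alpaca
# identifier below is a placeholder -- use the official release linked above.
from datasets import load_dataset

# C4 is hundreds of GB, so stream it instead of downloading everything up front.
c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)
print(next(iter(c4))["text"][:200])

# (Chinese) Alpaca -- replace the placeholder with the dataset you actually use:
# alpaca = load_dataset("<chinese-alpaca-dataset-id>", split="train")
```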