|
--- |
|
license: openrail++ |
|
language: |
|
- en |
|
tags: |
|
- text-to-code |
|
- multilingual-code-generation |
|
--- |
|
|
|
|
![HKUST CodeUp](assets/Logo.jpg) |
|
# CodeUp: A Multilingual Code Generation Llama2 Model with Parameter-Efficient Instruction-Tuning on a Single RTX 3090 |
|
|
|
## Description |
|
In recent years, large language models (LLMs) have shown exceptional capabilities across a wide range of applications thanks to their emergent abilities. To align LLMs with human preferences, instruction tuning and reinforcement learning from human feedback (RLHF) have been proposed for chat-based LLMs (e.g., ChatGPT, GPT-4). However, these LLMs (with the exception of Codex) primarily target the general domain and are not specifically designed for code. Although Codex provides an alternative, it is a closed-source model developed by OpenAI. Hence, it is imperative to develop open-source, instruction-following LLMs for the code domain.
|
However, the sheer scale of LLMs' parameters ($\ge$7B) and training datasets demands vast computational resources, which significantly impedes training and inference on consumer hardware.
|
|
|
To address these challenges, we adopt the powerful foundation model `Llama 2`, construct high-quality instruction-following data for code generation tasks, and propose CodeUp, an instruction-following multilingual code generation Llama 2 model. Meanwhile, to fit an academic budget and consumer hardware (e.g., a single RTX 3090), we build on `Alpaca-LoRA` and equip `CodeUp` with advanced parameter-efficient fine-tuning (PEFT) methods (e.g., [LoRA](https://arxiv.org/abs/2106.09685)), which enable efficient adaptation of pre-trained language models (PLMs, also known as foundation models) to various downstream applications without fine-tuning all of the model's parameters. The overall training recipe is as follows.
|
|
|
![Training Framework](assets/Framework.jpg) |
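
As a rough illustration of this recipe, the sketch below shows LoRA-based instruction tuning of Llama 2 with the Hugging Face `transformers` and `peft` libraries. The model ID, hyperparameters, and data path are illustrative placeholders, not the exact CodeUp configuration; see the GitHub repository for the actual training script.

```python
# Minimal LoRA instruction-tuning sketch (illustrative, not the exact CodeUp script).
# Assumes transformers, peft, datasets, bitsandbytes, and accelerate are installed.
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import (AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base_model = "meta-llama/Llama-2-7b-hf"  # placeholder base checkpoint

tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token  # LLaMA has no pad token by default

# Load the frozen base model in 8-bit so a 7B model fits on a single RTX 3090.
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# Attach low-rank adapters to the attention projections; only these are trained.
lora_config = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all parameters

# Alpaca-format instruction data: instruction / input / output fields.
data = load_dataset("json", data_files="data/code_alpaca_20k.json")["train"]

def tokenize(example):
    prompt = (f"### Instruction:\n{example['instruction']}\n\n"
              f"### Input:\n{example['input']}\n\n"
              f"### Response:\n{example['output']}{tokenizer.eos_token}")
    return tokenizer(prompt, truncation=True, max_length=512)

data = data.map(tokenize, remove_columns=data.column_names)

trainer = Trainer(
    model=model,
    train_dataset=data,
    args=TrainingArguments(
        output_dir="codeup-lora",
        per_device_train_batch_size=4,
        gradient_accumulation_steps=32,
        num_train_epochs=3,
        learning_rate=3e-4,
        fp16=True,
        logging_steps=10,
    ),
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
model.save_pretrained("codeup-lora")  # saves only the LoRA adapter weights
```

Because only the adapter weights are updated, the saved checkpoint is a small fraction of the full model's size and can be loaded alongside the base Llama 2 weights at inference time.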
|
|
|
|
|
## NL2Code Data Release |
|
Recently, using much larger and more powerful LLMs (e.g., ChatGPT, GPT-4) to self-generate instruction-following data via careful prompt design has attracted significant attention. However, many approaches primarily focus on the general domain and lack code-specific considerations. To this end, [Code Alpaca](https://github.com/sahil280114/codealpaca) follows the Self-Instruct paper [3] and the [Stanford Alpaca repo](https://github.com/tatsu-lab/stanford_alpaca), with some code-related modifications, to produce 20K instruction-following examples for code generation tasks in `data/code_alpaca_20k.json`. This `JSON` file follows the `alpaca_data.json` format: a list of dictionaries, each containing the following fields (a short loading sketch follows the list):
|
|
|
- `instruction`: `str`, describes the task the model should perform. Each of the 20K instructions is unique. |
|
- `input`: `str`, optional context or input for the task. For example, when the instruction is "Amend the following SQL query to select distinct elements", the input is the SQL query. Around 40% of the examples have an input. |
|
- `output`: `str`, the answer to the instruction as generated by `text-davinci-003`. |
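
For illustration, the snippet below loads and inspects this file; the printed record is a hypothetical example in the same schema (the SQL input/output shown are made up), not an actual entry from the dataset.

```python
import json

# Load the Alpaca-format instruction data: a list of dicts with
# "instruction", "input", and "output" keys.
with open("data/code_alpaca_20k.json") as f:
    examples = json.load(f)

print(len(examples))                      # roughly 20K records
print(json.dumps(examples[0], indent=2))
# A record has this shape (hypothetical values for illustration):
# {
#   "instruction": "Amend the following SQL query to select distinct elements",
#   "input": "SELECT name FROM employees",
#   "output": "SELECT DISTINCT name FROM employees"
# }
```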
|
|
|
### High-quality Data Filter |
|
However, after carefully checking the LLM-self-generated data, we observe three critical problems whose ambiguous and irrelevant noise may hinder instruction learning:
|
|
|
1. When the `instruction` does not specify the programming language (PL) of implementation, the `output` may be written in any of several PLs, e.g., Python, C++, or JavaScript.

2. It is ambiguous which programming language the `output` is implemented in.

3. Both the `instruction` and the `output` are irrelevant to the code domain.
|
|
|
Hence, we filter out the ambiguous and irrelevant data with rigorously designed rules to obtain high-quality instruction data. Specifically, to solve (1), we set Python as the default implementation PL and use the [Guesslang](https://guesslang.readthedocs.io/en/latest/) package to detect the PL of the source code in `output`; if Python is detected, the example is retained, otherwise it is filtered out. For (2) and (3), we simply delete those examples. After filtering, about 5K low-quality instruction examples are removed. To supplement the high-quality instruction data, we further integrate about 4.5K additional examples from `data/new_codealpaca.json` under the same filter rules.
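
A minimal sketch of this filtering rule is shown below, assuming the data is in the Alpaca format described above and that the `guesslang` package is installed; the exact edge-case handling in CodeUp may differ.

```python
# Sketch of the Python-only filtering rule (illustrative, not the exact CodeUp script).
import json
from guesslang import Guess  # pip install guesslang

guess = Guess()

def keep_example(example: dict) -> bool:
    """Keep an example only if its output code is detected as Python."""
    return guess.language_name(example["output"]) == "Python"

with open("data/code_alpaca_20k.json") as f:
    raw = json.load(f)

filtered = [ex for ex in raw if keep_example(ex)]
print(f"Kept {len(filtered)} of {len(raw)} examples")
```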
|
|
|
In this way, we obtain 19K high-quality instruction examples for code generation. The radar charts below show the number of instructions per PL before and after filtering.
|
|
|
|
|
|
![PL Data Filtering](assets/PL_Filter.jpg)
|
|
|
|
|
## Training & Inference |
|
Detailed instructions can be found at [https://github.com/juyongjiang/CodeUp](https://github.com/juyongjiang/CodeUp). |
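
For inference, a minimal sketch of loading a LoRA adapter on top of the base Llama 2 model with `peft` is shown below; the adapter path is a placeholder, so refer to the repository for the released checkpoints and the exact prompt template.

```python
# Minimal inference sketch: base Llama 2 + a CodeUp LoRA adapter via peft.
# The adapter path below is a placeholder, not a confirmed model ID.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model = "meta-llama/Llama-2-7b-hf"
adapter_path = "path/to/codeup-lora"  # placeholder: local dir or Hub ID of the adapter

tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(
    base_model, torch_dtype=torch.float16, device_map="auto"
)
model = PeftModel.from_pretrained(model, adapter_path)
model.eval()

prompt = ("### Instruction:\nWrite a Python function that returns "
          "the n-th Fibonacci number.\n\n### Response:\n")
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=256, temperature=0.2, do_sample=True)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```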
|
|