CodeUp: A Multilingual Code Generation Llama2 Model with Parameter-Efficient Instruction-Tuning on a Single RTX 3090

Description

In recent years, large language models (LLMs) have shown exceptional capabilities across a wide range of applications, owing to their emergent abilities. To align them with human preferences, instruction tuning and reinforcement learning from human feedback (RLHF) have been proposed for chat-based LLMs (e.g., ChatGPT, GPT-4). However, these LLMs (except for Codex) primarily target the general domain and are not specifically designed for code. Although Codex provides an alternative, it is a closed-source model developed by OpenAI. Hence, it is imperative to develop open-source instruction-following LLMs for the code domain. However, the large number of parameters in these LLMs ($\ge$7B) and the size of their training datasets demand vast computational resources, which significantly impedes training and inference on consumer hardware.

To address these challenges, in this project we adopt the latest powerful foundation model, Llama 2, construct high-quality instruction-following data for code generation tasks, and propose an instruction-following multilingual code generation Llama2 model, CodeUp. Meanwhile, to fit an academic budget and consumer hardware (e.g., a single RTX 3090), we build on Alpaca-LoRA and equip CodeUp with advanced parameter-efficient fine-tuning (PEFT) methods (e.g., LoRA), which enable efficient adaptation of pre-trained language models (PLMs, also known as foundation models) to various downstream applications without fine-tuning all of the model's parameters. The overall training recipe is as follows.
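As a side illustration of why LoRA fits on consumer hardware (this is not the training recipe itself), the sketch below counts the trainable parameters LoRA adds to a weight matrix. For a frozen weight of shape d_out x d_in, LoRA trains two low-rank factors with rank r, so it adds r * (d_out + d_in) parameters instead of d_out * d_in. The hidden size, layer count, rank, and number of adapted matrices per layer are illustrative assumptions, not values taken from the CodeUp repo.

```python
def lora_trainable_params(d_out, d_in, rank):
    """Parameters in one LoRA adapter: B is (d_out x rank), A is (rank x d_in)."""
    return rank * (d_out + d_in)


def full_params(d_out, d_in):
    """Parameters in the frozen weight matrix the adapter is attached to."""
    return d_out * d_in


# Illustrative assumptions (NOT taken from the CodeUp repo):
HIDDEN = 4096            # hidden size of a 7B-scale Llama model
LAYERS = 32              # number of transformer layers at that scale
RANK = 8                 # assumed LoRA rank
ADAPTED_PER_LAYER = 2    # assume two square attention projections get adapters

lora_total = LAYERS * ADAPTED_PER_LAYER * lora_trainable_params(HIDDEN, HIDDEN, RANK)
full_total = LAYERS * ADAPTED_PER_LAYER * full_params(HIDDEN, HIDDEN)
ratio = lora_total / full_total

print(f"LoRA trainable parameters: {lora_total:,}")
print(f"Parameters in the adapted matrices: {full_total:,}")
print(f"Trainable fraction of those matrices: {ratio:.4%}")
```

Under these assumptions only a small fraction of the adapted weights is trained, which is what keeps optimizer state and gradients within the memory of a single GPU.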

NL2Code Data Release

Recently, exploiting much larger and more powerful LLMs (e.g., ChatGPT, GPT-4) to self-generate instruction-following data through careful prompt design has attracted significant attention. However, many approaches primarily focus on the general domain and lack code-specific considerations. To this end, Code Alpaca follows the Self-Instruct paper [3] and the Stanford Alpaca repo, with some code-related modifications, to construct 20K instruction-following examples for code generation tasks, stored in data/code_alpaca_20k.json. This JSON file follows the alpaca_data.json format: it is a list of dictionaries, each containing the following fields:

instruction: str, describes the task the model should perform. Each of the 20K instructions is unique.
input: str, optional context or input for the task. For example, when the instruction is "Amend the following SQL query to select distinct elements", the input is the SQL query. Around 40% of the examples have an input.
output: str, the answer to the instruction as generated by text-davinci-003.
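The record format described above can be checked with a short script. This is a minimal sketch: the helper names (load_records, validate_records, fraction_with_input) and the sample records are hypothetical and only mirror the three documented fields.

```python
import json

REQUIRED_FIELDS = ("instruction", "input", "output")


def load_records(path):
    """Load a JSON file that holds a list of instruction dictionaries."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def validate_records(records):
    """Keep only records in which all three documented fields are strings."""
    return [
        rec for rec in records
        if all(isinstance(rec.get(field), str) for field in REQUIRED_FIELDS)
    ]


def fraction_with_input(records):
    """Share of records whose optional `input` field is non-empty."""
    if not records:
        return 0.0
    return sum(1 for rec in records if rec["input"].strip()) / len(records)


# Hypothetical in-memory sample standing in for data/code_alpaca_20k.json.
sample = [
    {
        "instruction": "Amend the following SQL query to select distinct elements",
        "input": "SELECT * FROM employee;",
        "output": "SELECT DISTINCT * FROM employee;",
    },
    {
        "instruction": "Write a Python function that adds two numbers.",
        "input": "",
        "output": "def add(a, b):\n    return a + b",
    },
    {"instruction": "A malformed record without an output field.", "input": ""},
]

records = validate_records(sample)
print(len(records), "valid records;", fraction_with_input(records), "have an input")
```

With the real file, `load_records("data/code_alpaca_20k.json")` would replace the in-memory sample.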

High-quality Data Filter

However, after carefully checking the LLM-self-generated data, we observe three critical problems that may hinder LLMs' instruction learning by introducing ambiguous and irrelevant noise:

1) When the instruction doesn't specify the programming language (PL) of the implementation, the output appears in diverse languages, e.g., Python, C++, and JavaScript.
2) It is ambiguous which programming language the output is implemented in.
3) Both the instruction and the output are irrelevant to the code domain.

Hence, we filter out the ambiguous and irrelevant data with carefully designed rules to obtain high-quality instruction data. Specifically, to solve 1), we set Python as the default PL of implementation and use the Guesslang package to detect the PL of the source code in the output. If Python is detected, the prompt is retained; otherwise, it is filtered out. For 2) and 3), we delete the affected prompts. In total, about 5K low-quality instruction examples are filtered out. To supplement the high-quality instruction data, we further integrate the data/new_codealpaca.json data (about 4.5K) under the same filter rules.
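The filter rule can be sketched as follows. This is an illustration under stated assumptions, not the project's actual filtering script: the language detector is passed in as a function, and toy_detector is a hypothetical keyword-based stand-in used here so the example runs without Guesslang installed.

```python
def filter_python_records(records, detect_language):
    """Split records into (kept, dropped) based on the detected language of `output`.

    `detect_language` is any callable mapping source text to a language name
    (or None when no language is recognized). Only "Python" outputs are kept.
    """
    kept, dropped = [], []
    for rec in records:
        if detect_language(rec["output"]) == "Python":
            kept.append(rec)
        else:
            dropped.append(rec)
    return kept, dropped


def toy_detector(code):
    """Hypothetical stand-in detector based on keywords; this is NOT Guesslang."""
    if "def " in code or "import " in code or "print(" in code:
        return "Python"
    if "#include" in code:
        return "C++"
    if "console.log" in code or "function " in code:
        return "JavaScript"
    return None


# Hypothetical sample records covering the three problem cases.
sample = [
    {"instruction": "Reverse a string.", "input": "",
     "output": "def reverse(s):\n    return s[::-1]"},
    {"instruction": "Print hello.", "input": "",
     "output": "#include <iostream>\nint main() { std::cout << \"hello\"; }"},
    {"instruction": "Name the capital of France.", "input": "",
     "output": "The capital of France is Paris."},
]

kept, dropped = filter_python_records(sample, toy_detector)
print(f"kept {len(kept)}, dropped {len(dropped)}")
```

With Guesslang installed, a detector such as `Guess().language_name` could be passed in place of toy_detector; treat that call as an assumption to verify against the Guesslang documentation.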

This way, we obtain 19K high-quality instruction examples for code generation. The following is a radar visualization of the number of instructions per PL before and after filtering.

Training & Inference

Detailed instructions can be found at https://github.com/juyongjiang/CodeUp.

Citation

If you use the data or code in this repo, please cite the repo.

@misc{codeup,
  author = {Juyong Jiang and Sunghun Kim},
  title = {CodeUp: A Multilingual Code Generation Llama2 Model with Parameter-Efficient Instruction-Tuning},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/juyongjiang/CodeUp}},
}

Naturally, you should also cite the original LLaMA V1 [1] and V2 [2] papers, the Self-Instruct paper [3], the LoRA paper [4], and the Stanford Alpaca, Alpaca-LoRA, and Code Alpaca repos, as well as PEFT.
