---
language:
- multilingual
tags:
- code-generation
- transformers
license: mit
---

<div align="center">
<img src="https://raw.githubusercontent.com/Anditty/OASIS/refs/heads/main/Group.svg" width="60%" alt="Kwaipilot" />
</div>

<hr>

# Kwaipilot KwaiCoder-DS-V2-Lite-Base

## 1. Model Details

**Introduction**

KwaiCoder-DS-V2-Lite-Base is built on DeepSeek-V2-Lite-Base, which has 16B total parameters and 2.4B activated parameters. It supports both English and Chinese and underwent continued pretraining on 800B tokens of high-quality code, math, and Chinese-English text data. The training mixture consists of 70% code data, 20% math data, and 10% text data (including a large amount of code-related text). The resulting base model achieved SOTA (state-of-the-art) results on multiple benchmarks.
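
For illustration only, the 70% / 20% / 10% mixture above can be read as per-domain sampling weights. The sketch below shows one simple way to draw document domains with those weights; the domain labels and the sampling scheme are illustrative assumptions, not Kwaipilot's actual data pipeline.

```python
import random

# Stated continued-pretraining mixture: 70% code, 20% math, 10% text.
# Purely illustrative; not the actual training pipeline.
MIXTURE = {"code": 0.7, "math": 0.2, "text": 0.1}

def sample_domains(n: int, seed: int = 0) -> list[str]:
    """Draw the domain of the next n training documents according to the mixture."""
    rng = random.Random(seed)
    domains, weights = zip(*MIXTURE.items())
    return rng.choices(domains, weights=weights, k=n)

draws = sample_domains(10_000)
print({d: draws.count(d) / len(draws) for d in MIXTURE})  # roughly {'code': 0.7, 'math': 0.2, 'text': 0.1}
```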

**Performance**

| Model | Size | HumanEval | HumanEval+ | MBPP | MBPP+ | BigCodeBench (Full) | BigCodeBench (Hard) | MATH | GSM8K |
|-------|------|-----------|------------|------|-------|---------------------|---------------------|------|-------|
| Qwen2.5-Coder | 1.5B | 43.9 | 36.6 | 69.2 | 58.6 | 34.6 | 9.5 | 30.9 | 65.8 |
| CodeGemma | 2B | 31.1 | 16.5 | 51.1 | 43.1 | 23.9 | 7.4 | - | - |
| CodeLlama | 7B | 33.5 | 26.2 | 55.3 | 46.8 | 28.7 | 5.4 | 12.1 | 31.2 |
| Qwen2.5-Coder | 7B | 46.3 | 37.8 | 66.2 | 53.1 | 38.4 | 12.2 | **46.6** | **83.9** |
| OpenCoder | 8B | 66.5 | 63.4 | 79.9 | **70.4** | 40.5 | 9.5 | - | - |
| Yi-Coder | 9B | 53.7 | 46.3 | 48.4 | 40.7 | 42.9 | 14.2 | - | - |
| StarCoder2 | 15B | 46.3 | 37.8 | 66.2 | 53.1 | 38.4 | 12.2 | 10.3 | 23.4 |
| DeepSeek-Coder-V2-Lite | 16B | 40.9 | 34.1 | 71.9 | 59.4 | 30.6 | 8.1 | 39.0 | 67.1 |
| **KwaiCoder-DS-V2-Lite** | 16B | **75.0** | **68.9** | **81.2** | 67.7 | **49.4** | **18.2** | 40.48 | 81.5 |
| CodeLlama | 34B | 51.8 | 43.9 | 69.3 | 56.3 | 45.3 | 16.2 | 21.2 | 58.2 |

KwaiCoder-DS-V2-Lite-Base achieved Pass@1 scores of 75.0% and 68.9% on the HumanEval and HumanEval+ test sets, respectively. Compared to DeepSeek-V2-Lite-Base at the same parameter scale, this is a relative improvement of 83.37% and 102.05%, respectively. It also surpassed the current best base model of comparable size (OpenCoder-8B), reaching SOTA levels.
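
For context on the metric: Pass@1 on HumanEval-style benchmarks is usually reported with the unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021). The sketch below implements that estimator for reference; this card does not specify the exact evaluation harness used.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total samples generated per problem
    c: number of samples that pass the unit tests
    k: the k in pass@k
    """
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples per problem, 150 passing -> pass@1 = c / n = 0.75
print(pass_at_k(200, 150, 1))  # 0.75
```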

On the MBPP and MBPP+ test sets, KwaiCoder-DS-V2-Lite-Base also outperformed DeepSeek-V2-Lite-Base at the same parameter scale. Moreover, with only 2.4B activated parameters, it scored nearly 5 percentage points higher on average than the 7B Qwen2.5-Coder.

On the full BigCodeBench-Complete set (Full), KwaiCoder-DS-V2-Lite-Base achieved a 6% improvement over DeepSeek-Coder-33B, reaching SOTA levels. On the Hard subset, it also clearly outperformed the 70B-parameter CodeLlama model and the 7B Qwen2.5-Coder.

In terms of mathematical ability, with only 2.4B activated parameters, KwaiCoder-DS-V2-Lite-Base surpassed DeepSeek-V2-Lite-Base at the same parameter scale on the MATH and GSM8K test sets (relative improvements of 3.79% and 21.46%, respectively) and outperformed the much larger CodeLlama-34B (relative improvements of 90.95% and 40.03%, respectively). Although it has not yet exceeded Qwen2.5-Coder-7B, it has already surpassed Qwen2.5-Coder-3B, which has more activated parameters, reaching SOTA levels for its activated-parameter scale.

## 2. Usage

**Code Completion**

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "Kwaipilot/KwaiCoder-DS-V2-Lite-Base"

# Load the tokenizer and model (bfloat16 weights, placed automatically on available devices).
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)

# Complete the code that follows the prompt.
text = "#write a quick sort algorithm"
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=80)

# Print only the newly generated continuation.
print(tokenizer.decode(outputs[0], skip_special_tokens=True)[len(text):])
```
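
The example above relies on the default generation settings. If you prefer sampled completions, standard `transformers` generation arguments can be passed to `generate`. The values below are illustrative, not settings recommended by the model authors, and the snippet reuses `model`, `tokenizer`, and `inputs` from the completion example.

```python
# Reuses `model`, `tokenizer`, and `inputs` from the completion example above.
# Illustrative sampling settings; tune temperature/top_p for your use case.
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.2,
    top_p=0.95,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```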

**Code Insertion**

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "Kwaipilot/KwaiCoder-DS-V2-Lite-Base"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)

# Fill-in-the-middle prompt: the model generates the code that belongs at <|fim▁hole|>.
text = """<|fim▁begin|>def find_longest_substring(s):
    seen = {}
    max_length = 0
    start = 0
<|fim▁hole|>
        if char in seen and seen[char] >= start:
            start = seen[char] + 1
        seen[char] = end
        max_length = max(max_length, end - start + 1)
    return max_length<|fim▁end|>"""

inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=80)

# Print only the generated infill.
print(tokenizer.decode(outputs[0], skip_special_tokens=True)[len(text):])
```
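
To build such prompts programmatically, a small helper can wrap a prefix and suffix with the same FIM markers used above. `build_fim_prompt` is a hypothetical convenience function, not part of the model or tokenizer API.

```python
def build_fim_prompt(prefix: str, suffix: str) -> str:
    # Hypothetical helper: wraps prefix/suffix with the FIM markers
    # shown in the code-insertion example above.
    return f"<|fim▁begin|>{prefix}<|fim▁hole|>{suffix}<|fim▁end|>"

prompt = build_fim_prompt(
    "def add(a, b):\n",
    "\n    return result\n",
)
# Pass `prompt` to the tokenizer and model exactly as in the example above.
```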

## 3. License

This code repository is licensed under the MIT License. The use of KwaiCoder-DS-V2-Lite-Base models is subject to the Model License.

## 4. BibTeX

```bibtex
@misc{kwaicoder,
  title  = {KwaiCoder: Comprehensive improvement of code and mathematical abilities},
  author = {Kwaipilot team},
  year   = {2024}
}
```