---
language:
- multilingual
tags:
- code-generation
- transformers
license: mit
---

<div align="center">
<img src="https://raw.githubusercontent.com/Anditty/OASIS/refs/heads/main/Group.svg" width="60%" alt="Kwaipilot" />
</div>

<hr>

# Kwaipilot KwaiCoder-DS-V2-Lite-Base

## 1. Model Details

**Introduction**

KwaiCoder-DS-V2-Lite-Base is built on DeepSeek-V2-Lite-Base, which has 16B total parameters and 2.4B activated parameters. It supports both English and Chinese and underwent continued pretraining on 800B tokens of high-quality code, math, and Chinese-English text data. The training mixture consists of 70% code data, 20% math data, and 10% text data (including a large amount of code-related text). The resulting base model achieved SOTA (state-of-the-art) results on multiple benchmarks.
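
For illustration only, the 70% / 20% / 10% mixture above can be read as per-domain sampling weights. The sketch below shows one simple way to draw document domains with those weights; the domain labels and the sampling scheme are illustrative assumptions, not Kwaipilot's actual data pipeline.

```python
import random

# Stated continued-pretraining mixture: 70% code, 20% math, 10% text.
# Purely illustrative; not the actual training pipeline.
MIXTURE = {"code": 0.7, "math": 0.2, "text": 0.1}

def sample_domains(n: int, seed: int = 0) -> list[str]:
    """Draw the domain of the next n training documents according to the mixture."""
    rng = random.Random(seed)
    domains, weights = zip(*MIXTURE.items())
    return rng.choices(domains, weights=weights, k=n)

draws = sample_domains(10_000)
print({d: draws.count(d) / len(draws) for d in MIXTURE})  # roughly {'code': 0.7, 'math': 0.2, 'text': 0.1}
```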

**Performance**

| Model | Size | HumanEval | HumanEval+ | MBPP | MBPP+ | BigCodeBench (Full) | BigCodeBench (Hard) | MATH | GSM8K |
|-------|------|-----------|------------|------|-------|---------------------|---------------------|------|-------|
| Qwen2.5-Coder | 1.5B | 43.9 | 36.6 | 69.2 | 58.6 | 34.6 | 9.5 | 30.9 | 65.8 |
| CodeGemma | 2B | 31.1 | 16.5 | 51.1 | 43.1 | 23.9 | 7.4 | - | - |
| CodeLlama | 7B | 33.5 | 26.2 | 55.3 | 46.8 | 28.7 | 5.4 | 12.1 | 31.2 |
| Qwen2.5-Coder | 7B | 46.3 | 37.8 | 66.2 | 53.1 | 38.4 | 12.2 | **46.6** | **83.9** |
| OpenCoder | 8B | 66.5 | 63.4 | 79.9 | **70.4** | 40.5 | 9.5 | - | - |
| Yi-Coder | 9B | 53.7 | 46.3 | 48.4 | 40.7 | 42.9 | 14.2 | - | - |
| StarCoder2 | 15B | 46.3 | 37.8 | 66.2 | 53.1 | 38.4 | 12.2 | 10.3 | 23.4 |
| DeepSeek-Coder-V2-Lite | 16B | 40.9 | 34.1 | 71.9 | 59.4 | 30.6 | 8.1 | 39.0 | 67.1 |
| **KwaiCoder-DS-V2-Lite** | 16B | **75.0** | **68.9** | **81.2** | 67.7 | **49.4** | **18.2** | 40.48 | 81.5 |
| CodeLlama | 34B | 51.8 | 43.9 | 69.3 | 56.3 | 45.3 | 16.2 | 21.2 | 58.2 |

KwaiCoder-DS-V2-Lite-Base achieved Pass@1 scores of 75.0% and 68.9% on the HumanEval and HumanEval+ test sets, respectively. Compared to DeepSeek-V2-Lite-Base at the same parameter scale, this is a relative improvement of 83.37% and 102.05%, respectively. It also surpassed the current best base model of comparable size (OpenCoder-8B), reaching SOTA levels.
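
For context on the metric: Pass@1 on HumanEval-style benchmarks is usually reported with the unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021). The sketch below implements that estimator for reference; this card does not specify the exact evaluation harness used.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total samples generated per problem
    c: number of samples that pass the unit tests
    k: the k in pass@k
    """
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples per problem, 150 passing -> pass@1 = c / n = 0.75
print(pass_at_k(200, 150, 1))  # 0.75
```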

On the MBPP and MBPP+ test sets, KwaiCoder-DS-V2-Lite-Base also outperformed DeepSeek-V2-Lite-Base at the same parameter scale. Moreover, with only 2.4B activated parameters, it scored nearly 5 percentage points higher on average than the 7B Qwen2.5-Coder.

On the full BigCodeBench-Complete set (Full), KwaiCoder-DS-V2-Lite-Base achieved a 6% improvement over DeepSeek-Coder-33B, reaching SOTA levels. On the Hard subset, it also clearly outperformed the 70B-parameter CodeLlama model and the 7B Qwen2.5-Coder.

In terms of mathematical ability, with only 2.4B activated parameters, KwaiCoder-DS-V2-Lite-Base surpassed DeepSeek-V2-Lite-Base at the same parameter scale on the MATH and GSM8K test sets (relative improvements of 3.79% and 21.46%, respectively) and outperformed the much larger CodeLlama-34B (relative improvements of 90.95% and 40.03%, respectively). Although it has not yet exceeded Qwen2.5-Coder-7B, it has already surpassed Qwen2.5-Coder-3B, which has more activated parameters, reaching SOTA levels for its activated-parameter scale.

## 2. Usage

**Code Completion**

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "Kwaipilot/KwaiCoder-DS-V2-Lite-Base"

# Load the tokenizer and model (bfloat16 weights, placed automatically on available devices).
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)

# Complete the code that follows the prompt.
text = "#write a quick sort algorithm"
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=80)

# Print only the newly generated continuation.
print(tokenizer.decode(outputs[0], skip_special_tokens=True)[len(text):])
```
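
The example above relies on the default generation settings. If you prefer sampled completions, standard `transformers` generation arguments can be passed to `generate`. The values below are illustrative, not settings recommended by the model authors, and the snippet reuses `model`, `tokenizer`, and `inputs` from the completion example.

```python
# Reuses `model`, `tokenizer`, and `inputs` from the completion example above.
# Illustrative sampling settings; tune temperature/top_p for your use case.
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.2,
    top_p=0.95,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```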

**Code Insertion**

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "Kwaipilot/KwaiCoder-DS-V2-Lite-Base"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)

# Fill-in-the-middle prompt: the model generates the code that belongs at <|fim▁hole|>.
text = """<|fim▁begin|>def find_longest_substring(s):
    seen = {}
    max_length = 0
    start = 0
<|fim▁hole|>
        if char in seen and seen[char] >= start:
            start = seen[char] + 1
        seen[char] = end
        max_length = max(max_length, end - start + 1)
    return max_length<|fim▁end|>"""

inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=80)

# Print only the generated infill.
print(tokenizer.decode(outputs[0], skip_special_tokens=True)[len(text):])
```
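
To build such prompts programmatically, a small helper can wrap a prefix and suffix with the same FIM markers used above. `build_fim_prompt` is a hypothetical convenience function, not part of the model or tokenizer API.

```python
def build_fim_prompt(prefix: str, suffix: str) -> str:
    # Hypothetical helper: wraps prefix/suffix with the FIM markers
    # shown in the code-insertion example above.
    return f"<|fim▁begin|>{prefix}<|fim▁hole|>{suffix}<|fim▁end|>"

prompt = build_fim_prompt(
    "def add(a, b):\n",
    "\n    return result\n",
)
# Pass `prompt` to the tokenizer and model exactly as in the example above.
```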

## 3. License

This code repository is licensed under the MIT License. The use of KwaiCoder-DS-V2-Lite-Base models is subject to the Model License.

## 4. BibTeX

```bibtex
@misc{kwaicoder,
  title  = {KwaiCoder: Comprehensive improvement of code and mathematical abilities},
  author = {Kwaipilot team},
  year   = {2024}
}
```