---
license: apache-2.0
datasets:
- BAAI/COIG-PC
language:
- zh
library_name: transformers
pipeline_tag: text-generation
---

# Model Card for bwx-13B-HF

<!-- Provide a quick summary of what the model is/does. -->

This is an experimental model that can be used to build new LLMs for the Chinese language.

## Model Details

### Model Description

<!-- Provide a longer summary of what this model is. -->

- **Developed by:** yjf9966
- **Model type:** LLaMA with an extended Chinese tokenizer (vocabulary size 49,964)
- **Language(s) (NLP):** Chinese
- **License:** Apache-2.0
- **Finetuned from model:** [Chinese-LLaMA-Alpaca](https://github.com/ymcui/Chinese-LLaMA-Alpaca)

### Model Sources

<!-- Provide the basic links for the model. -->

- **Repository:** https://huggingface.co/BlueWhaleX/bwx-13B-HF

## Uses

<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->

You can use the raw model for Chinese text generation, but it is mostly intended to be fine-tuned on a downstream task.

Note that this model is primarily aimed at instruction-following use: a prompt describing a task (optionally with accompanying input) is completed with a generated response, as in question answering or other Chinese text-generation tasks.

## Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->

Even though the training data used for this model could be characterized as fairly neutral, the model can still produce biased predictions.
It also inherits some of the biases of its training dataset and of the base model it was fine-tuned from.

### Recommendations

<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->

Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. More information is needed for further recommendations.

## How to Get Started with the Model

Use the code below to get started with the model.

```python
import torch
from transformers import LlamaForCausalLM, LlamaTokenizer

base_model_name = "BlueWhaleX/bwx-13B-hf"
load_type = torch.float16
device = torch.device(0) if torch.cuda.is_available() else torch.device('cpu')

# Sampling settings for generation.
generation_config = dict(
    temperature=0.2,
    top_k=40,
    top_p=0.9,
    do_sample=True,
    num_beams=1,
    repetition_penalty=1.3,
    max_new_tokens=400,
)

# Alpaca-style prompt template the model was trained with.
prompt_input = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n\n{instruction}\n\n### Response:\n\n"
)

def generate_prompt(instruction, input_text=None):
    # Append the optional input to the instruction, then apply the template.
    if input_text:
        instruction = instruction + '\n' + input_text
    return prompt_input.format_map({'instruction': instruction})

tokenizer = LlamaTokenizer.from_pretrained(base_model_name)
model = LlamaForCausalLM.from_pretrained(
    base_model_name,
    load_in_8bit=False,
    torch_dtype=load_type,
    low_cpu_mem_usage=True,
    device_map='auto',
)

# The extended Chinese tokenizer can be larger than the base embedding
# matrix; resize the embeddings if the two sizes disagree.
model_vocab_size = model.get_input_embeddings().weight.size(0)
tokenizer_vocab_size = len(tokenizer)
if model_vocab_size != tokenizer_vocab_size:
    model.resize_token_embeddings(tokenizer_vocab_size)
model.eval()

raw_input_text = input("Input:")
input_text = generate_prompt(instruction=raw_input_text)
inputs = tokenizer(input_text, return_tensors="pt")
generation_output = model.generate(
    input_ids=inputs["input_ids"].to(device),
    attention_mask=inputs['attention_mask'].to(device),
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.pad_token_id,
    **generation_config,
)

# Decode and keep only the text after the response marker.
output = tokenizer.decode(generation_output[0], skip_special_tokens=True)
response = output.split("### Response:")[1].strip()
print("Response: ", response)
```

## Training Details

### Training Data

<!-- This should link to a Data Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

[BAAI/COIG-PC](https://huggingface.co/datasets/BAAI/COIG-PC)

### Training Procedure

<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->

#### Preprocessing

The dataset was split into 80% for training and 20% for testing.
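
As a rough illustration, such a split could be reproduced with the 🤗 Datasets library. This is a minimal sketch, assuming BAAI/COIG-PC loads as a single `train` split with its default configuration; the seed is illustrative and not necessarily the one used for this model.

```python
from datasets import load_dataset

# Load the COIG-PC corpus (the exact subset/config used for this model is
# an assumption; COIG-PC ships many per-task files).
dataset = load_dataset("BAAI/COIG-PC", split="train")

# 80/20 train/test split as described above; the seed is illustrative.
splits = dataset.train_test_split(test_size=0.2, seed=42)
train_data, test_data = splits["train"], splits["test"]
print(len(train_data), len(test_data))
```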

#### Training Hyperparameters

- **Training regime:** fp16 mixed precision; learning rate 1e-4; LoRA rank 8; LoRA alpha 32 <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
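
The hyperparameters above suggest LoRA fine-tuning. Below is a minimal sketch with the `peft` library; `lora_dropout` and `target_modules` are assumptions (common choices for LLaMA) and are not stated in this card.

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import LlamaForCausalLM

# Base model to adapt; fp16 to match the mixed-precision regime above.
base = LlamaForCausalLM.from_pretrained(
    "BlueWhaleX/bwx-13B-hf", torch_dtype=torch.float16
)

lora_config = LoraConfig(
    r=8,                                   # LoRA rank from the card
    lora_alpha=32,                         # LoRA alpha from the card
    lora_dropout=0.05,                     # assumed; not stated in the card
    target_modules=["q_proj", "v_proj"],   # assumed; common for LLaMA
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()

# The card's learning rate (1e-4) would then be passed to the optimizer,
# e.g. TrainingArguments(learning_rate=1e-4, fp16=True, ...).
```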

## Evaluation

### Testing Data

<!-- This should link to a Data Card if possible. -->

The held-out 20% split of the BAAI/COIG-PC dataset.
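
One simple sanity check on the held-out split is average token-level loss, reported as perplexity. This is only a sketch of such a check, not the evaluation procedure used for the released model; it reuses `model`, `tokenizer`, and `test_data` from the snippets above and assumes each example exposes an `output` text field.

```python
import math
import torch

model.eval()
losses = []
for example in test_data.select(range(100)):  # small sample for speed
    enc = tokenizer(
        example["output"],  # assumed field name in COIG-PC
        return_tensors="pt",
        truncation=True,
        max_length=512,
    ).to(model.device)
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    losses.append(out.loss.item())

print("perplexity:", math.exp(sum(losses) / len(losses)))
```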

## Citation

<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->

```
@software{bwx-13B-HF,
  title={An Enhanced Chinese Language Model Based on Chinese-Alpaca},
  url={https://huggingface.co/BlueWhaleX/bwx-13B-HF},
  year={2023}
}
```