File size: 2,998 Bytes
404c98c 5a32041 b6dd780 404c98c a0ec501 5f19b5d a0ec501 f303995 a0ec501 c962a35 5a32041 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 |
---
license: mit
library_name: peft
base_model: TinyPixel/Llama-2-7B-bf16-sharded
---
# Chinkara 7B
_Chinkara_ is a Large Language Model trained on [timdettmers/openassistant-guanaco](https://huggingface.co/datasets/timdettmers/openassistant-guanaco) dataset based on Meta's brand new LLaMa-2 with 7 billion parameters using QLoRa Technique, optimized for small consumer size GPUs.

## Information
For more information about the model please visit [prp-e/chinkara](https://github.com/prp-e/chinkara) on Github.
## Inference Guide
[](https://colab.research.google.com/github/prp-e/chinkara/blob/main/inference-7b.ipynb)
_NOTE: This part is for the time you want to load and infere the model on your local machine. You still need 8GB of VRAM on your GPU. The recommended GPU is at least a 2080!_
### Installing libraries
```
pip install -U bitsandbytes
pip install -U git+https://github.com/huggingface/transformers.git
pip install -U git+https://github.com/huggingface/peft.git
pip install -U git+https://github.com/huggingface/accelerate.git
pip install -U datasets
pip install -U einops
```
### Loading the model
```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
model_name = "TinyPixel/Llama-2-7B-bf16-sharded"
adapters_name = 'MaralGPT/chinkara-7b'
model = AutoModelForCausalLM.from_pretrained(
model_name,
load_in_4bit=True,
torch_dtype=torch.bfloat16,
device_map="auto",
max_memory= {i: '24000MB' for i in range(torch.cuda.device_count())},
quantization_config=BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4'
),
)
model = PeftModel.from_pretrained(model, adapters_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
```
### Setting the model up
```python
from peft import LoraConfig, get_peft_model
model = PeftModel.from_pretrained(model, adapters_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
```
### Prompt and inference
```python
prompt = "What is the answer to life, universe and everything?"
prompt = f"###Human: {prompt} ###Assistant:"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
outputs = model.generate(inputs=inputs.input_ids, max_new_tokens=50, temperature=0.5, repetition_penalty=1.0)
answer = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(answer)
```
## Training procedure
The following `bitsandbytes` quantization config was used during training:
- load_in_8bit: False
- load_in_4bit: True
- llm_int8_threshold: 6.0
- llm_int8_skip_modules: None
- llm_int8_enable_fp32_cpu_offload: False
- llm_int8_has_fp16_weight: False
- bnb_4bit_quant_type: nf4
- bnb_4bit_use_double_quant: False
- bnb_4bit_compute_dtype: float16
### Framework versions
- PEFT 0.5.0.dev0 |