---
license: apache-2.0
pipeline_tag: text-generation
---
<p align="center" style="font-size:34px;"><b>Buddhi-128K-Chat</b></p>
# Buddhi-128K-Chat (7B) vLLM Inference: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/11_8W8FpKK-856QdRVJLyzbu9g-DMxNfg?usp=sharing)
## Model Description
Buddhi-128K-Chat is a general-purpose chat model with a 128K-token context window. It is fine-tuned from Mistral 7B Instruct and extended to handle up to 128,000 tokens of context using the YaRN (Yet another RoPE extensioN) technique. The longer window lets Buddhi keep track of context across long documents or conversations, which suits tasks that depend on extensive context retention, such as comprehensive document summarization, detailed narrative generation, and intricate question answering.
## Dataset Creation
## Architecture
### Hardware requirements:
> For 128k Context Length
> - 80GB VRAM - A100 Preferred
> For 32k Context Length
> - 40GB VRAM - A100 Preferred
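
As a rough illustration of why memory use grows with context length, the sketch below estimates the size of the key/value cache alone. It assumes Buddhi keeps the published Mistral 7B shape (32 layers, 8 key/value heads, head dimension 128) and stores the cache in bfloat16. These values are assumptions and are not stated in this model card, and `kv_cache_bytes` is a hypothetical helper. Model weights, activations, and runtime overhead come on top of this, so the totals quoted above are larger.

```python
def kv_cache_bytes(num_tokens, num_layers=32, num_kv_heads=8,
                   head_dim=128, bytes_per_value=2):
    """Estimate KV-cache size in bytes for a single sequence.

    Defaults are ASSUMED from the Mistral 7B architecture with a
    bfloat16 cache (2 bytes per value); adjust them if the real
    configuration differs.
    """
    # One key vector and one value vector per layer and per KV head.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return num_tokens * per_token


if __name__ == "__main__":
    for n in (32_000, 75_000, 128_000):
        gib = kv_cache_bytes(n) / 2**30
        print(f"{n:>7} tokens: {gib:.1f} GiB of KV cache")
```

The cache grows linearly with the number of tokens, which is why lowering the maximum context length is the simplest way to fit the model on a smaller GPU.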
### vLLM - For Faster Inference
#### Installation
```
!pip install vllm
!pip install flash_attn # If Flash Attention 2 is supported by your System
```
See the [Flash Attention 2](https://github.com/Dao-AILab/flash-attention) GitHub repository for installation instructions.
**Implementation**:
> Note: Running the model at the full 128K context length requires roughly 70GB of VRAM. For experimentation, we limit the context length to 75K instead of 128K, which makes it suitable for testing the model on 30-35GB of VRAM.
```python
from vllm import LLM, SamplingParams

llm = LLM(
    model='aiplanet/buddhi-128k-chat-7b',
    trust_remote_code=True,
    dtype='bfloat16',
    gpu_memory_utilization=1,
    max_model_len=75000  # reduced from 128K to fit in less VRAM
)

prompts = [
    """<s> [INST] Please tell me a joke. [/INST] """,
    """<s> [INST] What is Machine Learning? [/INST] """
]

sampling_params = SamplingParams(
    temperature=0.8,
    top_p=0.95,
    max_tokens=1000
)

outputs = llm.generate(prompts, sampling_params)

# Print the completion generated for each prompt
for output in outputs:
    generated_text = output.outputs[0].text
    print(generated_text)
    print("\n\n")

# The attached Colab notebook contains two more experiments: Long Essay and Entire Book
```
For sample output, see the Colab notebook: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/11_8W8FpKK-856QdRVJLyzbu9g-DMxNfg?usp=sharing)
### Transformers - Basic Implementation
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit NF4 quantization (requires bitsandbytes) to reduce memory use
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

model_name = "aiplanet/Buddhi-128K-Chat"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="sequential",
    trust_remote_code=True
)

# The tokenizer is loaded from the model name
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    trust_remote_code=True
)

prompt = "<s> [INST] Please tell me a small joke. [/INST] "
tokens = tokenizer(prompt, return_tensors="pt").to("cuda")

outputs = model.generate(
    **tokens,
    max_new_tokens=100,
    do_sample=True,
    top_p=0.95,
    temperature=0.8,
)

# Decode only the newly generated tokens, skipping the prompt tokens
new_tokens = outputs[0][tokens["input_ids"].shape[1]:]
decoded_output = tokenizer.decode(new_tokens, skip_special_tokens=True)
print(f"Output:\n{decoded_output}")
```
Output
```
Output:
Why don't scientists trust atoms?
Because they make up everything.
```
## Prompt Template for Buddhi-128K-Chat
To leverage instruction fine-tuning, surround your prompt with the [INST] and [/INST] tokens. The very first instruction should begin with a beginning-of-sentence (BOS) token id; subsequent instructions should not. The assistant's generation ends with the end-of-sentence (EOS) token id.
```
"<s>[INST] What is your favourite condiment? [/INST]"
"Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!</s> "
"[INST] Do you have mayonnaise recipes? [/INST]"
```
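A multi-turn prompt in this format can be assembled with a small helper. This is a minimal sketch: `build_prompt` is a hypothetical function, not part of the model's tooling, and it assumes the strings shown above are simply concatenated in order. If your tokenizer already prepends a BOS token, pass `bos=""` to avoid adding it twice.

```python
def build_prompt(turns, bos="<s>", eos="</s>"):
    """Build a prompt string from a list of (user, assistant) turns.

    Use None as the assistant reply for the final turn that the
    model should answer. The spacing follows the template shown in
    this section and is an assumption about how the pieces join.
    """
    parts = [bos]  # BOS appears once, before the first instruction
    for user, assistant in turns:
        parts.append(f"[INST] {user} [/INST]")
        if assistant is not None:
            # Completed assistant replies are closed with EOS
            parts.append(f"{assistant}{eos} ")
    return "".join(parts)


if __name__ == "__main__":
    history = [
        ("What is your favourite condiment?", "Fresh lemon juice."),
        ("Do you have mayonnaise recipes?", None),
    ]
    print(build_prompt(history))
```

Keeping the conversation as a list of turns makes it easy to append each new reply and rebuild the prompt for the next request.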
## Get in Touch
You can schedule a 1:1 meeting with our DevRel & Community Team to get started with AI Planet Open Source LLMs and GenAI Stack. Schedule the call here: [https://calendly.com/jaintarun](https://calendly.com/jaintarun)
Stay tuned for more updates and be a part of the coding evolution. Join us on this exciting journey as we make AI accessible to all at AI Planet!
### Framework versions
- Transformers 4.39.2
- PyTorch 2.2.1+cu121
- Datasets 2.18.0
- Accelerate 0.27.2
- flash_attn 2.5.6
### Citation
```
@misc {Chaitanya890_lucifertrj,
	author = { Chaitanya Singhal and Tarun Jain },
	title = { Buddhi-128k-Chat by AI Planet },
	year = 2024,
	url = { https://huggingface.co/aiplanet/Buddhi-128K-Chat },
	publisher = { Hugging Face }
}
```