---
tags:
- llama
- adapter-transformers
- llama-2
datasets:
- timdettmers/openassistant-guanaco
license: apache-2.0
pipeline_tag: text-generation
---
# OpenAssistant Bottleneck QAdapter for Llama-2 7B
A QAdapter sequential bottleneck adapter for the Llama-2 7B (`meta-llama/Llama-2-7b-hf`) model, trained for instruction tuning on the [timdettmers/openassistant-guanaco](https://huggingface.co/datasets/timdettmers/openassistant-guanaco/) dataset.
**This adapter was created for usage with the [Adapters](https://github.com/Adapter-Hub/adapters) library.**
## Usage
First, install `adapters`:
```
pip install -U adapters
```
Now, the model and adapter can be loaded and activated like this:
```python
import adapters
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-hf"
adapter_id = "AdapterHub/llama2-7b-qadapter-seq-openassistant"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
    torch_dtype=torch.bfloat16,
)
adapters.init(model)
adapter_name = model.load_adapter(adapter_id, source="hf", set_active=True)

tokenizer = AutoTokenizer.from_pretrained(model_id)
```
### Inference
Inference can be done with the standard generation methods built into the Transformers library.
First, we add some helper code to prompt the model in the expected format:
```python
from transformers import StoppingCriteria

STOP_MARKER = "### Human:"


# Stop as soon as the model starts to generate "### Human:".
class EosListStoppingCriteria(StoppingCriteria):
    def __init__(self, eos_sequence=(12968, 29901)):
        # Token IDs that the generated sequence must end with to trigger the stop.
        self.eos_sequence = list(eos_sequence)

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        # Compare the last len(eos_sequence) tokens of each sequence in the batch.
        last_ids = input_ids[:, -len(self.eos_sequence):].tolist()
        return self.eos_sequence in last_ids


def prompt_model(model, text: str):
    # Relies on the `tokenizer` loaded above.
    batch = tokenizer(f"### Human: {text} ### Assistant:", return_tensors="pt")
    batch = batch.to(model.device)
    with torch.cuda.amp.autocast():
        output_tokens = model.generate(**batch, stopping_criteria=[EosListStoppingCriteria()])
    # Skip the prompt when decoding.
    decoded = tokenizer.decode(output_tokens[0, batch["input_ids"].shape[1]:], skip_special_tokens=True)
    # Drop the trailing stop marker if generation ended on it.
    return decoded[: -len(STOP_MARKER)] if decoded.endswith(STOP_MARKER) else decoded
```
Now, to prompt the model:
```python
prompt_model(model, "Please explain NLP in simple terms.")
```
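The stopping check and the final cleanup in the helper code can be tried without loading the model. The following is a minimal sketch that uses plain Python lists in place of tensors. The function names `ends_with_stop_sequence` and `strip_stop_marker` are illustrative and not part of Transformers or Adapters, and every token ID other than the stop sequence `12968, 29901` is made up.

```python
STOP_IDS = (12968, 29901)   # stop sequence taken from the helper code
STOP_MARKER = "### Human:"  # text that signals the start of a new turn


def ends_with_stop_sequence(batch_ids, stop_ids=STOP_IDS):
    """Return True if any sequence in the batch ends with stop_ids."""
    n = len(stop_ids)
    return any(tuple(seq[-n:]) == tuple(stop_ids) for seq in batch_ids)


def strip_stop_marker(decoded, marker=STOP_MARKER):
    """Remove a trailing stop marker from decoded text, if present."""
    return decoded[: -len(marker)] if decoded.endswith(marker) else decoded


# Hypothetical token IDs; only the tail of each sequence matters.
print(ends_with_stop_sequence([[1, 835, 12968, 29901]]))  # True
print(ends_with_stop_sequence([[1, 835, 12968]]))         # False
print(repr(strip_stop_marker("NLP helps computers work with text. ### Human:")))
```

Checking only the tail of each sequence keeps the test cheap, since it runs once per generated token.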
## Architecture & Training
**Training was run with the code in [this notebook](https://github.com/adapter-hub/adapters/blob/main/notebooks/QLoRA_Llama_Finetuning.ipynb)**.
The adapter uses the sequential bottleneck architecture described in [Houlsby et al. (2019)](https://arxiv.org/pdf/1902.00751.pdf) and available in Adapters as `double_seq_bn`.
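For intuition, a bottleneck adapter layer projects the hidden state down to a small dimension, applies a nonlinearity, projects back up, and adds the result to its input. The sketch below shows that computation in plain Python on toy dimensions. It is not the Adapters implementation; the sizes, the weights, and the choice of ReLU as the nonlinearity are assumptions made for illustration.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def relu(vector):
    """Element-wise ReLU (assumed nonlinearity for this sketch)."""
    return [max(0.0, x) for x in vector]


def bottleneck_adapter(hidden, w_down, w_up):
    """Down-project, apply a nonlinearity, up-project, then add the residual."""
    bottleneck = relu(matvec(w_down, hidden))
    update = matvec(w_up, bottleneck)
    return [h + u for h, u in zip(hidden, update)]


# Toy sizes (assumed): hidden size 4, bottleneck size 2.
hidden = [1.0, -2.0, 0.5, 3.0]
w_down = [[0.5, 0.0, 0.0, 0.0],   # shape: 2 x 4
          [0.0, 0.0, 0.0, 1.0]]
w_up = [[1.0, 0.0],               # shape: 4 x 2
        [0.0, 0.0],
        [0.0, 0.0],
        [0.0, -1.0]]

print(bottleneck_adapter(hidden, w_down, w_up))  # [1.5, -2.0, 0.5, 0.0]
```

With an all-zero up-projection the layer returns its input unchanged, so the residual connection lets an adapter start out close to the identity function.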
The adapter is trained similarly to the Guanaco models proposed in the QLoRA paper (Dettmers et al., 2023):
- Dataset: [timdettmers/openassistant-guanaco](https://huggingface.co/datasets/timdettmers/openassistant-guanaco)
- Quantization: 4-bit QLoRA
- Batch size: 16, LR: 2e-4, max steps: 1875
- Sequence length: 512
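
As a rough check on the scale of training implied by these settings, the snippet below multiplies them out. It assumes that the batch size of 16 is the effective batch size per optimizer step and that every sequence holds at most 512 tokens, so the token count is an upper bound rather than a measured figure.

```python
batch_size = 16
max_steps = 1875
max_seq_len = 512

# Each optimizer step processes one batch of sequences.
sequences_seen = batch_size * max_steps
# Upper bound: every sequence filled to the maximum length.
max_tokens_seen = sequences_seen * max_seq_len

print(f"Sequences processed: {sequences_seen:,}")              # 30,000
print(f"Upper bound on tokens processed: {max_tokens_seen:,}")  # 15,360,000
```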