---
license: mit
datasets:
- OpenAssistant/oasst1
language:
- en
tags:
- sft
pipeline_tag: text-generation
widget:
- text: >-
    <|prompter|>What is a meme, and what's the history behind this word?<|endoftext|><|assistant|>
- text: <|prompter|>What's the Earth total population<|endoftext|><|assistant|>
- text: <|prompter|>Write a story about future of AI development<|endoftext|><|assistant|>
---

# Load Merged Model (Recommended, identical configuration to a fully fine-tuned model)

```
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

repo_id = "jordiclive/falcon-40b-lora-sft-stage2-1.1k"
dtype = torch.bfloat16

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype=dtype,
    trust_remote_code=True,
)
```

## Model Details

- **Developed** as part of the OpenAssistant project
- **Model type:** Causal decoder-only transformer language model
- **Fine-tuning method:** LoRA (PEFT)
- **Language:** English, German, Spanish, French (and limited capabilities in Italian, Portuguese, Polish, Dutch, Romanian, Czech, Swedish)
- **Finetuned from:** [tiiuae/falcon-40b](https://huggingface.co/tiiuae/falcon-40b)
- **Weights & Biases:** [Training log 1](https://wandb.ai/open-assistant/public-sft/runs/q0q9lce4), [Training log 2](https://wandb.ai/open-assistant/public-sft/runs/qqok9ru2?workspace=user-jordanclive)

# LoRA Adapter for Falcon 40B trained on oasst-top1

This repo contains a **Falcon 40B** LoRA fine-tuned model and the corresponding low-rank adapter, trained on datasets from the OpenAssistant project.

This version of the weights was trained with the following hyperparameters:

SFT 1

- Epochs: 2
- Batch size: 128
- Max Length: 2048
- Learning rate: 1e-4
- Lora _r_: 64
- Lora Alpha: 16
- Lora target modules: ["dense_4h_to_h", "dense", "query_key_value", "dense_h_to_4h"]

SFT 2

- Epochs: 10
- Batch size: 128

The model was trained with flash attention, gradient checkpointing, and DeepSpeed stage 3 on 8 x A100 80GB GPUs.

Datasets:

SFT 1:

```
- oa_leet10k:
    val_split: 0.05
    max_val_set: 250
- cmu_wiki_qa:
    val_split: 0.05
- joke:
    val_split: 0.05
- webgpt:
    val_split: 0.05
    max_val_set: 250
- alpaca_gpt4:
    val_split: 0.025
    max_val_set: 250
- gpteacher_roleplay:
    val_split: 0.05
- wizardlm_70k:
    val_split: 0.05
    max_val_set: 500
- poem_instructions:
    val_split: 0.025
- tell_a_joke:
    val_split: 0.05
    max_val_set: 250
- gpt4all:
    val_split: 0.01
    max_val_set: 1000
- minimath:
    val_split: 0.05
- humaneval_mbpp_codegen_qa:
    val_split: 0.05
- humaneval_mbpp_testgen_qa:
    val_split: 0.05
- dolly15k:
    val_split: 0.05
    max_val_set: 300
- recipes:
    val_split: 0.05
- code_alpaca:
    val_split: 0.05
    max_val_set: 250
- vicuna:
    fraction: 0.5
    val_split: 0.025
    max_val_set: 250
- oa_wiki_qa_bart_10000row:
    val_split: 0.05
    max_val_set: 250
- grade_school_math_instructions:
    val_split: 0.05
```

SFT 2:

```
- oasst_export:
    lang: "bg,ca,cs,da,de,en,es,fr,hr,hu,it,nl,pl,pt,ro,ru,sl,sr,sv,uk"  # sft-8.0
    input_file_path: 2023-05-06_OASST_labels.jsonl.gz
    val_split: 0.05
    top_k: 1
- lima:
    val_split: 0.05
    max_val_set: 50
```

## Prompting

Two special tokens are used to mark the beginning of user and assistant turns: `<|prompter|>` and `<|assistant|>`. Each turn ends with a `<|endoftext|>` token.

Input prompt example:

```
<|prompter|>What is a meme, and what's the history behind this word?<|endoftext|><|assistant|>
```

The input ends with the `<|assistant|>` token to signal that the model should start generating the assistant reply.
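For multi-turn conversations the same pattern simply repeats: each `<|prompter|>` or `<|assistant|>` turn is terminated with the end-of-text token, and the prompt finishes with `<|assistant|>`. The helper below is a minimal, illustrative sketch of that formatting; `build_prompt` is not part of this repo.

```
# Illustrative helper (not part of this repo): build a multi-turn prompt from the
# special tokens described above. Each turn ends with <|endoftext|>, and the prompt
# finishes with <|assistant|> so the model generates the next reply.
def build_prompt(turns, eos_token="<|endoftext|>"):
    # turns: list of (role, text) pairs, where role is "prompter" or "assistant"
    prompt = "".join(f"<|{role}|>{text}{eos_token}" for role, text in turns)
    return prompt + "<|assistant|>"


prompt = build_prompt(
    [
        ("prompter", "What is a meme, and what's the history behind this word?"),
        ("assistant", "A meme is an idea or behaviour that spreads from person to person within a culture."),
        ("prompter", "Where does the word itself come from?"),
    ]
)
```

A prompt built this way is tokenized and passed to `model.generate` directly; the single-turn `generate` helper below applies `format_system_prompt` itself, so it expects a raw question rather than a pre-formatted string.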
# Example Inference code (Prompt Template)

```
device = "cuda" if torch.cuda.is_available() else "cpu"  # device used for generation

model = model.to(device)
if dtype == torch.float16:
    model = model.half()

# Choose generation parameters
generation_config = GenerationConfig(
    temperature=0.1,
    top_p=0.75,
    top_k=40,
    num_beams=4,
)


def format_system_prompt(prompt, eos_token=tokenizer.eos_token):
    # OpenAssistant prompt format: <|prompter|>{prompt}<|endoftext|><|assistant|>
    return "{}{}{}{}".format("<|prompter|>", prompt, eos_token, "<|assistant|>")


def generate(prompt, generation_config=generation_config, max_new_tokens=2048, device=device):
    prompt = format_system_prompt(prompt, eos_token=tokenizer.eos_token)  # OpenAssistant prompt format expected
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
    with torch.no_grad():
        generation_output = model.generate(
            input_ids=input_ids,
            generation_config=generation_config,
            return_dict_in_generate=True,
            output_scores=True,
            max_new_tokens=max_new_tokens,
            eos_token_id=tokenizer.eos_token_id,
        )
    s = generation_output.sequences[0]
    output = tokenizer.decode(s)
    print("Text generated:")
    print(output)
    return output
```

## LoRA weights

If you want to use the LoRA weights on their own, the embeddings for the extra special tokens also need to be added to the base model:

```
import torch
import transformers
from huggingface_hub import hf_hub_download
from peft import PeftModel

base_model_id = "tiiuae/falcon-40b"


def add_embeddings(model, embed_path, tokenizer):
    # Replace the input embedding matrix so it includes the trained rows for the extra special tokens.
    old_embeddings = model.get_input_embeddings()
    old_num_tokens, old_embedding_dim = old_embeddings.weight.size()
    new_embeddings = torch.nn.Embedding(old_num_tokens, old_embedding_dim)
    new_embeddings.to(old_embeddings.weight.device, dtype=old_embeddings.weight.dtype)
    model._init_weights(new_embeddings)
    embed_weights = torch.load(embed_path, map_location=old_embeddings.weight.device)
    vocab_size = tokenizer.vocab_size
    new_embeddings.weight.data[:vocab_size, :] = old_embeddings.weight.data[:vocab_size, :]
    new_embeddings.weight.data[vocab_size : vocab_size + embed_weights.shape[0], :] = embed_weights.to(
        new_embeddings.weight.dtype
    ).to(new_embeddings.weight.device)
    model.set_input_embeddings(new_embeddings)
    model.tie_weights()


def load_peft_model(model, peft_model_path, tokenizer):
    # Download the trained special-token embeddings, resize the base model, then attach the LoRA adapter.
    embed_weights = hf_hub_download(peft_model_path, "extra_embeddings.pt")
    model.resize_token_embeddings(tokenizer.vocab_size + torch.load(embed_weights).shape[0])
    model.config.eos_token_id = tokenizer.eos_token_id
    model.config.bos_token_id = tokenizer.bos_token_id
    model.config.pad_token_id = tokenizer.pad_token_id
    model = PeftModel.from_pretrained(
        model,
        model_id=peft_model_path,
        torch_dtype=model.dtype,
    )
    model.eos_token_id = tokenizer.eos_token_id
    add_embeddings(model, embed_weights, tokenizer)
    return model


def load_lora_model(base_model_id, tokenizer, device, dtype):
    model = transformers.AutoModelForCausalLM.from_pretrained(
        base_model_id,
        torch_dtype=dtype,
        trust_remote_code=True,
    )
    model = load_peft_model(model, repo_id, tokenizer)  # repo_id is the adapter repo defined above
    model = model.to(device)
    return model


model = load_lora_model(base_model_id=base_model_id, tokenizer=tokenizer, device=device, dtype=dtype)
```
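If you would rather end up with a single set of standalone weights (like the merged checkpoint referenced at the top of this card), the adapter can be folded back into the base model after loading. The snippet below is only a sketch: it assumes the `model` returned by `load_lora_model` above, uses a placeholder output directory, and relies on PEFT's `merge_and_unload()` to bake the LoRA deltas into the base weights.

```
# Sketch: merge the LoRA adapter into the base weights and save a standalone checkpoint.
# Assumes `model` is the PeftModel returned by load_lora_model above; the output
# directory name is only an example.
merged_model = model.merge_and_unload()  # fold the LoRA deltas into the base weights
merged_model.save_pretrained("falcon-40b-oasst-merged")
tokenizer.save_pretrained("falcon-40b-oasst-merged")
```

The merged directory then loads like any other `AutoModelForCausalLM` checkpoint, as in the first snippet of this card.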