---
license: llama3
language:
- en
library_name: transformers
pipeline_tag: text-generation
tags:
- Text Generation
- Transformers
- llama
- llama-3
- 8B
- nvidia
- facebook
- meta
- LLM
- fine-tuned
- insurance
- research
- pytorch
- instruct
- chatqa-1.5
- chatqa
- finetune
- gpt4
- conversational
- text-generation-inference
- Inference Endpoints
datasets:
- InsuranceQA
base_model: "nvidia/Llama3-ChatQA-1.5-8B"
finetuned: "Raj-Maharajwala/Open-Insurance-LLM-Llama3-8B"
quantized: "Raj-Maharajwala/Open-Insurance-LLM-Llama3-8B-GGUF"
---

# Open-Insurance-LLM-Llama3-8B-GGUF

This model is a GGUF-quantized version of an insurance domain-specific language model based on nvidia/Llama3-ChatQA-1.5-8B, fine-tuned for insurance-related queries and conversations.

## Model Details

- **Model Type:** Quantized Language Model (GGUF format)
- **Base Model:** nvidia/Llama3-ChatQA-1.5-8B
- **Finetuned Model:** Raj-Maharajwala/Open-Insurance-LLM-Llama3-8B
- **Quantized Model:** Raj-Maharajwala/Open-Insurance-LLM-Llama3-8B-GGUF
- **Model Architecture:** Llama
- **Quantization:** 8-bit (Q8_0), 5-bit (Q5_K_M), 4-bit (Q4_K_M), 16-bit
- **Finetuned Dataset:** InsuranceQA
- **Developer:** Raj Maharajwala
- **License:** llama3
- **Language:** English

## Setup Instructions

### Environment Setup

#### For Windows

```bash
python -m venv .venv_open_insurance_llm
.\.venv_open_insurance_llm\Scripts\activate
```

#### For Mac/Linux

```bash
python3 -m venv .venv_open_insurance_llm
source .venv_open_insurance_llm/bin/activate
```

### Installation

#### For Mac Users (Metal Support)

```bash
export FORCE_CMAKE=1
CMAKE_ARGS="-DGGML_METAL=on" pip install --upgrade --force-reinstall llama-cpp-python==0.3.2 --no-cache-dir
```

#### For Windows Users (CPU Support)

```bash
pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cpu
```

### Dependencies

Then install the remaining dependencies from `inference_requirements.txt` (attached under `Files and Versions`):

```bash
pip install -r inference_requirements.txt
```
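Before running the full chat loop below, you can sanity-check the installation with a minimal, self-contained snippet. This is only a sketch, not one of the released scripts: it reuses the repo's `open-insurance-llm-q4_k_m.gguf` file and the sampling settings from the `ModelConfig` shown in the next section, and it abbreviates the system message for brevity.

```python
# Minimal sanity check (illustrative sketch, not an official script):
# downloads the 4-bit GGUF file and generates a single answer.
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

# Download the quantized model into a local gguf_dir folder
model_path = hf_hub_download(
    "Raj-Maharajwala/Open-Insurance-LLM-Llama3-8B-GGUF",
    filename="open-insurance-llm-q4_k_m.gguf",
    local_dir="gguf_dir",
)

llm = Llama(model_path=model_path, n_ctx=2048, n_gpu_layers=-1, verbose=False)

# ChatQA-style prompt layout: "System: ...\n\nUser: ...\n\nAssistant:"
# (system message abbreviated here; the full version is in the inference script below)
prompt = (
    "System: This is a chat between a user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions.\n\n"
    "User: What is the difference between a deductible and a premium?\n\n"
    "Assistant:"
)

out = llm(prompt, max_tokens=256, temperature=0.1, top_k=15, top_p=0.2, repeat_penalty=1.2)
print(out["choices"][0]["text"].strip())
```

If this prints a sensible answer, the environment is ready for the full inference loop below.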
## Inference Loop

```python
# Attached under `Files and Versions` (inference_open-insurance-llm-gguf.py)
import os
import time
from pathlib import Path
from dataclasses import dataclass
from typing import List, Dict, Tuple

from llama_cpp import Llama
from rich.console import Console
from huggingface_hub import hf_hub_download


@dataclass
class ModelConfig:
    # Optimized parameters for coherent responses and efficient performance on devices like MacBook Air M2
    model_name: str = "Raj-Maharajwala/Open-Insurance-LLM-Llama3-8B-GGUF"
    model_file: str = "open-insurance-llm-q4_k_m.gguf"
    # model_file: str = "open-insurance-llm-q8_0.gguf"    # 8-bit quantization; higher precision, better quality, increased resource usage
    # model_file: str = "open-insurance-llm-q5_k_m.gguf"  # 5-bit quantization; balance between performance and resource efficiency
    max_tokens: int = 1000        # Maximum number of tokens to generate in a single output
    temperature: float = 0.1      # Controls randomness in output; lower values produce more coherent responses (scales the token probability distribution)
    top_k: int = 15               # After temperature scaling, consider only the 15 most probable tokens during sampling
    top_p: float = 0.2            # After the top-k filter, use nucleus sampling to keep tokens within a cumulative probability of 20%
    repeat_penalty: float = 1.2   # Penalize repeated tokens to reduce redundancy
    num_beams: int = 4            # Number of beams for beam search; higher values improve quality at the cost of speed
    n_gpu_layers: int = -2        # Number of layers to offload to GPU; -1 for full GPU utilization, -2 for automatic configuration
    n_ctx: int = 2048             # Context window size; Llama 3 models support up to 8192 tokens of context
    n_batch: int = 256            # Number of tokens to process simultaneously; adjust based on available hardware (suggested 512)
    verbose: bool = False         # True enables verbose logging for debugging purposes
    use_mmap: bool = False        # Memory-map the model to reduce RAM usage; set to True on memory-limited systems
    use_mlock: bool = True        # Lock model into RAM to prevent swapping; improves performance on systems with sufficient RAM
    offload_kqv: bool = True      # Offload key, query, value matrices to GPU to accelerate inference


class InsuranceLLM:
    def __init__(self, config: ModelConfig):
        self.config = config
        self.llm_ctx = None
        self.console = Console()
        self.conversation_history: List[Dict[str, str]] = []

        self.system_message = (
            "This is a chat between a user and an artificial intelligence assistant. "
            "The assistant gives helpful, detailed, and polite answers to the user's questions based on the context. "
            "The assistant should also indicate when the answer cannot be found in the context. "
            "You are an expert from the Insurance domain with extensive insurance knowledge and "
            "professional writer skills, especially about insurance policies. "
            "Your name is OpenInsuranceLLM, and you were developed by Raj Maharajwala. "
            "You are willing to help answer the user's query with a detailed explanation. "
            "In your explanation, leverage your deep insurance expertise, such as relevant insurance policies, "
            "complex coverage plans, or other pertinent insurance concepts. Use precise insurance terminology while "
            "still aiming to make the explanation clear and accessible to a general audience."
        )

    def download_model(self) -> str:
        try:
            with self.console.status("[bold green]Downloading model..."):
                model_path = hf_hub_download(
                    self.config.model_name,
                    filename=self.config.model_file,
                    local_dir=os.path.join(os.getcwd(), 'gguf_dir')
                )
            return model_path
        except Exception as e:
            self.console.print(f"[red]Error downloading model: {str(e)}[/red]")
            raise

    def load_model(self) -> None:
        try:
            quantized_path = os.path.join(os.getcwd(), "gguf_dir")
            directory = Path(quantized_path)

            # Use a local copy if present; otherwise download it from the Hub
            try:
                model_path = str(list(directory.glob(self.config.model_file))[0])
            except IndexError:
                model_path = self.download_model()

            with self.console.status("[bold green]Loading model..."):
                self.llm_ctx = Llama(
                    model_path=model_path,
                    n_gpu_layers=self.config.n_gpu_layers,
                    n_ctx=self.config.n_ctx,
                    n_batch=self.config.n_batch,
                    num_beams=self.config.num_beams,
                    verbose=self.config.verbose,
                    use_mlock=self.config.use_mlock,
                    use_mmap=self.config.use_mmap,
                    offload_kqv=self.config.offload_kqv
                )
        except Exception as e:
            self.console.print(f"[red]Error loading model: {str(e)}[/red]")
            raise

    def build_conversation_prompt(self, new_question: str, context: str = "") -> str:
        prompt = f"System: {self.system_message}\n\n"

        # Add conversation history
        for exchange in self.conversation_history:
            prompt += f"User: {exchange['user']}\n\n"
            prompt += f"Assistant: {exchange['assistant']}\n\n"

        # Add the new question
        if context:
            prompt += f"User: Context: {context}\nQuestion: {new_question}\n\n"
        else:
            prompt += f"User: {new_question}\n\n"

        prompt += "Assistant:"
        return prompt

    def generate_response(self, prompt: str) -> Tuple[str, int, float]:
        if not self.llm_ctx:
            raise RuntimeError("Model not loaded. Call load_model() first.")

        self.console.print("[bold cyan]Assistant: [/bold cyan]", end="")
        complete_response = ""
        token_count = 0
        start_time = time.time()

        try:
            for chunk in self.llm_ctx.create_completion(
                prompt,
                max_tokens=self.config.max_tokens,
                top_k=self.config.top_k,
                top_p=self.config.top_p,
                temperature=self.config.temperature,
                repeat_penalty=self.config.repeat_penalty,
                stream=True
            ):
                text_chunk = chunk["choices"][0]["text"]
                complete_response += text_chunk
                token_count += 1
                print(text_chunk, end="", flush=True)

            elapsed_time = time.time() - start_time
            print()
            return complete_response, token_count, elapsed_time
        except Exception as e:
            self.console.print(f"\n[red]Error generating response: {str(e)}[/red]")
            return "I encountered an error while generating a response. Please try again or ask a different question.", 0, 0.0

    def run_chat(self):
        try:
            self.load_model()
            self.console.print("\n[bold green]Welcome to Open-Insurance-LLM![/bold green]")
            self.console.print("Enter your questions (type '/bye', 'exit', or 'quit' to end the session)\n")
            self.console.print("Optional: You can provide context by typing 'context:' followed by your context, then 'question:' followed by your question\n")
            self.console.print("Your conversation history will be maintained for context-aware responses.\n")

            total_tokens = 0

            while True:
                try:
                    user_input = self.console.input("[bold cyan]User:[/bold cyan] ").strip()

                    if user_input.lower() in ["exit", "/bye", "quit"]:
                        self.console.print(f"\n[dim]Total tokens: {total_tokens}[/dim]")
                        self.console.print("\n[bold green]Thank you for using OpenInsuranceLLM![/bold green]")
                        break

                    # Reset conversation with command
                    if user_input.lower() == "/reset":
                        self.conversation_history = []
                        self.console.print("[yellow]Conversation history has been reset.[/yellow]")
                        continue

                    # Parse the optional "context: ... question: ..." input convention
                    context = ""
                    question = user_input
                    if "context:" in user_input.lower() and "question:" in user_input.lower():
                        parts = user_input.split("question:", 1)
                        context = parts[0].replace("context:", "").strip()
                        question = parts[1].strip()

                    prompt = self.build_conversation_prompt(question, context)
                    response, tokens, elapsed_time = self.generate_response(prompt)

                    # Add to conversation history
                    self.conversation_history.append({
                        "user": question,
                        "assistant": response
                    })

                    # Update total tokens
                    total_tokens += tokens

                    # Print metrics
                    tokens_per_sec = tokens / elapsed_time if elapsed_time > 0 else 0
                    self.console.print(
                        f"[dim]Tokens: {tokens} || " +
                        f"Time: {elapsed_time:.2f}s || " +
                        f"Speed: {tokens_per_sec:.2f} tokens/sec[/dim]"
                    )
                    print()  # Add a blank line after each response

                except KeyboardInterrupt:
                    self.console.print("\n[yellow]Input interrupted. Type '/bye', 'exit', or 'quit' to quit.[/yellow]")
                    continue
                except Exception as e:
                    self.console.print(f"\n[red]Error processing input: {str(e)}[/red]")
                    continue

        except Exception as e:
            self.console.print(f"\n[red]Fatal error: {str(e)}[/red]")
        finally:
            if self.llm_ctx:
                del self.llm_ctx


def main():
    try:
        config = ModelConfig()
        llm = InsuranceLLM(config)
        llm.run_chat()
    except KeyboardInterrupt:
        print("\nProgram interrupted by user")
    except Exception as e:
        print(f"\nApplication error: {str(e)}")


if __name__ == "__main__":
    main()
```

Run the script with:

```bash
python3 inference_open-insurance-llm-gguf.py
```
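For reference, the prompt string that `build_conversation_prompt()` assembles follows the same `System:` / `User:` / `Assistant:` layout as the base ChatQA model, with a trailing `Assistant:` cue for generation. The snippet below is an illustration only; the context and question strings are invented for the example, and the system message is elided.

```python
# Illustrative only: the shape of the prompt produced by build_conversation_prompt()
# when the user types "context: ... question: ..." at the chat prompt.
system_message = "This is a chat between a user and an artificial intelligence assistant. ..."  # full text in the script above
context = "The policy carries a $500 deductible for collision coverage."   # made-up example context
question = "How much do I pay out of pocket for a $2,000 repair?"          # made-up example question

prompt = (
    f"System: {system_message}\n\n"
    f"User: Context: {context}\nQuestion: {question}\n\n"
    "Assistant:"
)
print(prompt)
```

Previous user/assistant turns are inserted as additional `User:` / `Assistant:` blocks between the system message and the new question, which is how the chat loop keeps responses context-aware.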
### Nvidia Llama 3 - ChatQA Paper

arXiv: [https://arxiv.org/pdf/2401.10225](https://arxiv.org/pdf/2401.10225)

## Use Cases

This model is specifically designed for:

- Insurance policy understanding and explanation
- Claims processing assistance
- Coverage analysis
- Insurance terminology clarification
- Policy comparison and recommendations
- Risk assessment queries
- Insurance compliance questions
## Limitations

- The model's knowledge is limited to its training data cutoff
- Should not be used as a replacement for professional insurance advice
- May occasionally generate plausible-sounding but incorrect information

## Bias and Ethics

This model should be used with awareness that:

- It may reflect biases present in insurance industry training data
- Output should be verified by insurance professionals for critical decisions
- It should not be used as the sole basis for insurance decisions
- The model's responses should be treated as informational, not as legal or professional advice

## Citation and Attribution

If you use the base model or the quantized model in your research or applications, please cite:

```bibtex
@misc{maharajwala2024openinsurance,
  author = {Raj Maharajwala},
  title = {Open-Insurance-LLM-Llama3-8B-GGUF},
  year = {2024},
  publisher = {HuggingFace},
  linkedin = {https://www.linkedin.com/in/raj6800/},
  url = {https://huggingface.co/Raj-Maharajwala/Open-Insurance-LLM-Llama3-8B-GGUF}
}
```