metadata
library_name: pytorch
license: llama3
pipeline_tag: text-generation
tags:
  - llm
  - generative_ai
  - quantized
  - android

Llama-v3-8B-Chat: Optimized for Mobile Deployment

A state-of-the-art large language model useful for a variety of language understanding and generation tasks.

Llama 3 is a family of LLMs. The "Chat" suffix indicates that the model is optimized for chatbot-like dialogue. Most of the model is quantized to w4a16 (4-bit weights and 16-bit activations), and a few layers are quantized to w8a16 (8-bit weights and 16-bit activations), which makes it suitable for on-device deployment. For the prompt and output lengths specified below, the time to first token is the latency of Llama-PromptProcessor-Quantized, and the average time per additional token is the latency of Llama-TokenGenerator-Quantized.
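
As a rough illustration of what the weight bit-width means for storage, the sketch below computes the weight-only footprint of a model at several precisions. It assumes a nominal 8e9 parameters (the "8B" figure listed under Model Stats) and ignores activations, the KV cache, and any packaging overhead. `weight_bytes` is a hypothetical helper written for this example, not part of any toolkit.

```python
def weight_bytes(num_params: float, bits_per_weight: int) -> float:
    """Bytes needed to store num_params weights at the given bit-width."""
    return num_params * bits_per_weight / 8


NUM_PARAMS = 8e9  # assumption: nominal "8B" parameter count
GB = 1e9          # decimal gigabytes

for bits in (4, 8, 16):
    size_gb = weight_bytes(NUM_PARAMS, bits) / GB
    print(f"{bits}-bit weights: {size_gb:.1f} GB")
# prints:
# 4-bit weights: 4.0 GB
# 8-bit weights: 8.0 GB
# 16-bit weights: 16.0 GB
```

At 4 bits per weight the floor is about 4 GB, so the 4.8GB listed under Model Stats is plausibly explained by the few 8-bit layers plus non-weight data, although this README does not break that figure down.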

This is based on the implementation of Llama-v3-8B-Chat found here. More details on model performance across various devices can be found here.

Model Details

  • Model Type: Text generation
  • Model Stats:
    • Context length: 4096
    • Number of parameters: 8B
    • Model size: 4.8GB
    • Precision: w4a16 + w8a16 (few layers)
    • Num of key-value heads: 8
    • Model-1 (Prompt Processor): Llama-PromptProcessor-Quantized
    • Prompt processor input: 128 tokens + position embeddings + attention mask + KV cache inputs
    • Prompt processor output: 128 output tokens + KV cache outputs
    • Model-2 (Token Generator): Llama-TokenGenerator-Quantized
    • Token generator input: 1 input token + position embeddings + attention mask + KV cache inputs
    • Token generator output: 1 output token + KV cache outputs
    • Use: Start the conversation with the prompt processor, then use the token generator for each subsequent iteration.
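
The two-model flow described in the list above can be sketched as a simple loop. This is a toy illustration under stated assumptions, not the deployed implementation: `prompt_processor`, `token_generator`, and `toy_next_token` are hypothetical stand-ins for the quantized models, the KV cache is modeled as a plain list of token ids, and position embeddings, attention masks, and prompt padding are omitted. Only the block size of 128 and the context length of 4096 come from the stats above.

```python
from typing import List, Tuple

PROMPT_BLOCK = 128     # prompt processor input length (from Model Stats)
CONTEXT_LENGTH = 4096  # context length (from Model Stats)

KVCache = List[int]  # toy stand-in; a real cache holds key/value tensors


def toy_next_token(kv: KVCache) -> int:
    """Hypothetical deterministic 'model': derives a token id from the cache."""
    return (sum(kv) + 1) % 1000


def prompt_processor(block: List[int], kv: KVCache) -> Tuple[int, KVCache]:
    """Stand-in for Llama-PromptProcessor-Quantized: consumes up to 128 tokens."""
    assert len(block) <= PROMPT_BLOCK
    kv = kv + block
    return toy_next_token(kv), kv


def token_generator(token: int, kv: KVCache) -> Tuple[int, KVCache]:
    """Stand-in for Llama-TokenGenerator-Quantized: consumes one token."""
    kv = kv + [token]
    return toy_next_token(kv), kv


def generate(prompt: List[int], max_new_tokens: int) -> List[int]:
    if not prompt or len(prompt) > CONTEXT_LENGTH:
        raise ValueError("prompt must contain 1 to CONTEXT_LENGTH tokens")
    kv: KVCache = []
    next_token = -1
    # Stage 1: feed the prompt in blocks of up to 128 tokens.
    for start in range(0, len(prompt), PROMPT_BLOCK):
        next_token, kv = prompt_processor(prompt[start:start + PROMPT_BLOCK], kv)
    # Stage 2: emit one token per iteration, reusing the KV cache.
    output: List[int] = []
    while len(output) < max_new_tokens and len(kv) < CONTEXT_LENGTH:
        output.append(next_token)
        next_token, kv = token_generator(next_token, kv)
    return output


print(generate([1, 2, 3], 2))  # prints [7, 14] with the toy model above
```

The point of the split is that the prompt is processed in large blocks once, while each later step only processes a single new token against the cached keys and values.
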

| Model | Device | Chipset | Target Runtime | Response Rate (tokens per second) | Time To First Token (range, seconds) |
|---|---|---|---|---|---|
| Llama-v3-8B-Chat | Snapdragon 8 Elite QRD | Snapdragon® 8 Elite | QNN | 66.14 | 0.028 - 0.92 |
| Llama-v3-8B-Chat | Samsung Galaxy S24 | Snapdragon® 8 Gen 3 | QNN | 66.14 | 0.028 - 0.92 |
| Llama-v3-8B-Chat | Samsung Galaxy S23 Ultra | Snapdragon® 8 Gen 2 | QNN | 66.14 | 0.028 - 0.92 |
| Llama-v3-8B-Chat | Snapdragon X Elite CRD | Snapdragon® X Elite | QNN | 66.14 | 0.028 - 0.92 |
| Llama-v3-8B-Chat | QCS8550 (Proxy) | QCS8550 Proxy | QNN | 66.14 | 0.028 - 0.92 |
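
To relate the two latency columns above to an end-to-end response time, a back-of-the-envelope estimate is the time to first token plus one token-generator step for each additional token. The sketch below applies that to the listed figures (66.14 tokens per second, 0.028 to 0.92 seconds). `estimate_response_seconds` is a hypothetical helper, and the formula assumes a constant per-token rate, which real devices may not sustain.

```python
def estimate_response_seconds(
    ttft_s: float, tokens_per_second: float, output_tokens: int
) -> float:
    """Time to first token plus (output_tokens - 1) further generation steps."""
    if output_tokens < 1:
        raise ValueError("output_tokens must be at least 1")
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return ttft_s + (output_tokens - 1) / tokens_per_second


RATE = 66.14                      # response rate from the table
TTFT_MIN, TTFT_MAX = 0.028, 0.92  # time-to-first-token range from the table

low = estimate_response_seconds(TTFT_MIN, RATE, 30)
high = estimate_response_seconds(TTFT_MAX, RATE, 30)
print(f"30 output tokens: about {low:.2f} s to {high:.2f} s")
```

For a 30-token response this gives roughly 0.47 to 1.36 seconds, depending on where the prompt falls within the time-to-first-token range.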

Deploying Llama 3 on-device

Please follow this tutorial to compile QNN binaries and generate bundle assets to run ChatApp on Windows and on Android powered by QNN-Genie.

Sample prompts and responses generated on-device

  1. --prompt "where is California?"
------- Response Summary --------
Prompt: where is California?
Response: California is a state located on the West Coast of

  2. --prompt "what is 2+3?" --max-output-tokens 30
-------- Response Summary --------
Prompt: what is 2+3?
Response: 2 + 3 = 5

  3. --prompt "what is superposition in Quantum Physics?" --max-output-tokens 30
Prompt: what is superposition in Quantum Physics?
Response: Superposition is a fundamental concept in quantum mechanics, which is a branch of physics that studies the behavior of matter and energy at a very

License

  • The license for the original implementation of Llama-v3-8B-Chat can be found here.
  • The license for the compiled assets for on-device deployment can be found here.

Usage and Limitations

The model may not be used for, or in connection with, any of the following applications:

  • Accessing essential private and public services and benefits;
  • Administration of justice and democratic processes;
  • Assessing or recognizing the emotional state of a person;
  • Biometric and biometrics-based systems, including categorization of persons based on sensitive characteristics;
  • Education and vocational training;
  • Employment and workers management;
  • Exploitation of the vulnerabilities of persons resulting in harmful behavior;
  • General purpose social scoring;
  • Law enforcement;
  • Management and operation of critical infrastructure;
  • Migration, asylum and border control management;
  • Predictive policing;
  • Real-time remote biometric identification in public spaces;
  • Recommender systems of social media platforms;
  • Scraping of facial images (from the internet or otherwise); and/or
  • Subliminal manipulation.