---
inference: false
license: apache-2.0
language:
- zh
- en
pipeline_tag: text-generation
quantized_by: yuuko-eth
tags:
- nlp
- chinese
- mistral
- traditional_chinese
- MediaTek-Research/Breeze-7B-FC-v1_0
---

# Breeze-7B-FC-v1_0-GGUF

- Original model: [Breeze-7B-FC-v1_0](https://huggingface.co/MediaTek-Research/Breeze-7B-FC-v1_0)

A conversion of [Breeze-7B-FC-v1_0](https://huggingface.co/MediaTek-Research/Breeze-7B-FC-v1_0) into different quantization levels via [llama.cpp](https://github.com/ggerganov/llama.cpp).

| Name | Quant method | Bits | Size | Use case |
| ---- | ---- | ---- | ---- | ---- |
| [Breeze-7B-FC-v1_0-q4_0.gguf](https://huggingface.co/yuuko-eth/Breeze-7B-FC-v1_0-GGUF/blob/main/Breeze-7B-FC-v1_0-q4_0.gguf) | Q4_0 | 4 | 4.3 GB | medium quality |
| [Breeze-7B-FC-v1_0-q4_k_m.gguf](https://huggingface.co/yuuko-eth/Breeze-7B-FC-v1_0-GGUF/blob/main/Breeze-7B-FC-v1_0-q4_k_m.gguf) | Q4_K_M | 4 | 4.54 GB | medium, balanced quality - recommended |
| [Breeze-7B-FC-v1_0-q5_0.gguf](https://huggingface.co/yuuko-eth/Breeze-7B-FC-v1_0-GGUF/blob/main/Breeze-7B-FC-v1_0-q5_0.gguf) | Q5_0 | 5 | 5.2 GB | large, low quality loss - recommended |
| [Breeze-7B-FC-v1_0-q5_1.gguf](https://huggingface.co/yuuko-eth/Breeze-7B-FC-v1_0-GGUF/blob/main/Breeze-7B-FC-v1_0-q5_1.gguf) | Q5_1 | 5 | 5.6 GB | large, very low quality loss - recommended |
| [Breeze-7B-FC-v1_0-q5_k_m.gguf](https://huggingface.co/yuuko-eth/Breeze-7B-FC-v1_0-GGUF/blob/main/Breeze-7B-FC-v1_0-q5_k_m.gguf) | Q5_K_M | 5 | 5.32 GB | large, very low quality loss - recommended |
| [Breeze-7B-FC-v1_0-q6_k.gguf](https://huggingface.co/yuuko-eth/Breeze-7B-FC-v1_0-GGUF/blob/main/Breeze-7B-FC-v1_0-q6_k.gguf) | Q6_K | 6 | 6.11 GB | very large, extremely low quality loss |
| [Breeze-7B-FC-v1_0-q8_0.gguf](https://huggingface.co/yuuko-eth/Breeze-7B-FC-v1_0-GGUF/blob/main/Breeze-7B-FC-v1_0-q8_0.gguf) | Q8_0 | 8 | 8.0 GB | very large, nearly no quality loss |
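If you want to use one of these files outside of Python (for example with llama.cpp, ollama, or LM Studio), you can fetch a single quantized file from this repo. Below is a minimal sketch using the `huggingface_hub` library; the chosen file name is just an example, any row from the table above works.

```python
# Minimal sketch: download one quantized file from this repo via huggingface_hub.
# The file name is only an example; pick any entry from the table above.
from huggingface_hub import hf_hub_download

local_path = hf_hub_download(
    repo_id="yuuko-eth/Breeze-7B-FC-v1_0-GGUF",
    filename="Breeze-7B-FC-v1_0-q4_k_m.gguf",
)
print(local_path)  # path of the downloaded .gguf file in the local HF cache
```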
## Description

This repo contains GGUF format model files for [Breeze-7B-FC-v1_0](https://huggingface.co/MediaTek-Research/Breeze-7B-FC-v1_0).

### About GGUF

GGUF is a format introduced by the llama.cpp team on August 21st 2023. It is a replacement for GGML, which is no longer supported by llama.cpp.

Here is an incomplete list of clients and libraries that are known to support GGUF:

* [llama.cpp](https://github.com/ggerganov/llama.cpp). The source project for GGUF. Offers a CLI and a server option.
* [ollama](https://github.com/ollama/ollama). Get up and running with large language models, with a GUI.
* [mistral.rs](https://github.com/EricLBuehler/mistral.rs). Blazingly fast LLM inference in Rust.
* [text-generation-webui](https://github.com/oobabooga/text-generation-webui), the most widely used web UI, with many features and powerful extensions. Supports GPU acceleration.
* [KoboldCpp](https://github.com/LostRuins/koboldcpp), a fully featured web UI, with GPU acceleration across all platforms and GPU architectures. Especially good for storytelling.
* [GPT4All](https://gpt4all.io/index.html), a free and open-source locally running GUI, supporting Windows, Linux and macOS, with full GPU acceleration.
* [LM Studio](https://lmstudio.ai/), an easy-to-use and powerful local GUI for Windows and macOS (Silicon), with GPU acceleration. A Linux version is available, in beta as of 27/11/2023.
* [LoLLMS Web UI](https://github.com/ParisNeo/lollms-webui), a great web UI with many interesting and unique features, including a full model library for easy model selection.
* [Faraday.dev](https://faraday.dev/), an attractive and easy-to-use character-based chat GUI for Windows and macOS (both Silicon and Intel), with GPU acceleration.
* [llama-cpp-python](https://github.com/abetlen/llama-cpp-python), a Python library with GPU acceleration, LangChain support, and an OpenAI-compatible API server.
* [candle](https://github.com/huggingface/candle), a Rust ML framework with a focus on performance, including GPU support, and ease of use.
* [ctransformers](https://github.com/marella/ctransformers), a Python library with GPU acceleration, LangChain support, and an OpenAI-compatible API server. Note that, as of the time of writing (November 27th 2023), ctransformers has not been updated in a long time and does not support many recent models.

## How to use these models locally with Python

1. Install [ctransformers](https://github.com/marella/ctransformers)

Run one of the following commands, according to your system:

```
# Base ctransformers with no GPU acceleration
pip install ctransformers

# Or with CUDA GPU acceleration
pip install ctransformers[cuda]

# Or with AMD ROCm GPU acceleration (Linux only)
CT_HIPBLAS=1 pip install ctransformers --no-binary ctransformers

# Or with Metal GPU acceleration for macOS systems only
CT_METAL=1 pip install ctransformers --no-binary ctransformers
```

2. Simple example

```python
from ctransformers import AutoModelForCausalLM

# Set gpu_layers to the number of layers to offload to GPU.
# Set it to 0 if no GPU acceleration is available on your system.
llm = AutoModelForCausalLM.from_pretrained(
    "yuuko-eth/Breeze-7B-FC-v1_0-GGUF",
    model_file="Breeze-7B-FC-v1_0-q6_k.gguf",
    model_type="mistral",
    context_length=8192,
    gpu_layers=99)

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("MediaTek-Research/Breeze-7B-Instruct-v1_0")

gen_kwargs = dict(
    max_new_tokens=1024,
    repetition_penalty=1.1,
    stop=["[INST]"],
    temperature=0.0,
    top_p=0.0,
    top_k=1,
)

chat = [
    {"role": "system", "content": "You are a helpful AI assistant built by MediaTek Research. The user you are helping speaks Traditional Chinese and comes from Taiwan."},
    {"role": "user", "content": "請介紹五樣台灣小吃"}
]

for text in llm(tokenizer.apply_chat_template(chat, tokenize=False), stream=True, **gen_kwargs):
    print(text, end="", flush=True)

# 以下推薦五樣台灣的小吃:
#
# 1. 蚵仔煎 (Oyster omelette) - 蚵仔煎是一種以蛋、麵皮和蚵仔為主要食材的傳統美食。它通常在油鍋中煎至金黃色,外酥內嫩,並帶有一股獨特的香氣。蚵仔煎是一道非常受歡迎的小吃,經常可以在夜市或小吃店找到。
# 2. 牛肉麵 (Beef noodle soup) - 牛肉麵是台灣的經典美食之一,它以軟嫩的牛肉和濃郁的湯頭聞名。不同地區的牛肉麵可能有不同的口味和配料,但通常都會包含麵條、牛肉、蔬菜和調味料。牛肉麵在全台灣都有不少知名店家,例如林東芳牛肉麵、牛大哥牛肉麵等。
# 3. 鹹酥雞 (Fried chicken) - 鹹酥雞是一種以雞肉為主要食材的快餐。它通常會經過油炸處理,然後搭配多種蔬菜和調味料。鹹酥雞的口味因地區而異,但通常都會有辣、甜、鹹等不同風味。鹹酥雞經常可以在夜市或路邊攤找到,例如鼎王鹹酥雞、鹹酥G去等知名店家。
# 4. 珍珠奶茶 (Bubble tea) - 珍珠奶茶是一種以紅茶為基底的飲品,加入珍珠(Q彈的小湯圓)和鮮奶。它起源於台灣,並迅速成為全球流行的飲料。珍珠奶茶在全台灣都有不少知名品牌,例如茶湯會、五桐號等。
# 5. 臭豆腐 (Stinky tofu) - 臭豆腐是一種以發酵豆腐為原料製作的傳統小吃。它具有強烈的氣味,但味道獨特且深受台灣人喜愛。臭豆腐通常會搭配多種調味料和配料,例如辣椒醬、蒜泥、酸菜等。臭豆腐在全台灣都有不少知名店家,例如阿宗麵線、大勇街臭豆腐等。
```
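The "Instruction following" and "Function Calling" examples below call an `_inference(prompt, llm, params)` helper and a `params` object that are not defined in this card. A minimal sketch, assuming the ctransformers `llm` loaded above and generation settings mirroring `gen_kwargs`, could look like this:

```python
# Minimal sketch of the helper assumed by the examples below (not part of the
# original card): run the ctransformers `llm` on a raw prompt string and
# return the generated text. The settings mirror `gen_kwargs` above.
params = dict(
    max_new_tokens=1024,
    repetition_penalty=1.1,
    stop=["[INST]"],
    temperature=0.0,
    top_p=0.0,
    top_k=1,
)

def _inference(prompt, llm, params):
    return llm(prompt, **params)
```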
**Instruction following**

```python
from mtkresearch.llm.prompt import MRPromptV2

sys_prompt = ('You are a helpful AI assistant built by MediaTek Research. '
              'The user you are helping speaks Traditional Chinese and comes from Taiwan.')

prompt_engine = MRPromptV2()

conversations = [
    {"role": "system", "content": sys_prompt},
    {"role": "user", "content": "請問什麼是深度學習?"},
]

prompt = prompt_engine.get_prompt(conversations)

output_str = _inference(prompt, llm, params)
result = prompt_engine.parse_generated_str(output_str)

print(result)
# {'role': 'assistant',
#  'content': '深度學習(Deep Learning)是一種機器學習方法,它模仿人類大腦的神經網路結構來
#              處理複雜的數據和任務。在深度學習中,模型由多層人工神經元組成,每個神經元之間有
#              權重連接,並通過非線性轉換進行計算。這些層與層之間的相互作用使模型能夠學習複雜
#              的函數關係或模式,從而解決各種問題,如圖像識別、自然語言理解、語音辨識等。深度
#              學習通常需要大量的數據和強大的計算能力,因此經常使用圖形處理器(GPU)或特殊的
#              加速器來執行。'}
```

**Function Calling**

```python
import json
from mtkresearch.llm.prompt import MRPromptV2

functions = [
    {
        "name": "get_current_weather",
        "description": "Get the current weather in a given location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "The city and state, e.g. San Francisco, CA"
                },
                "unit": {
                    "type": "string",
                    "enum": ["celsius", "fahrenheit"]
                }
            },
            "required": ["location"]
        }
    }
]

def fake_get_current_weather(location, unit=None):
    return {'temperature': 30}

mapping = {
    'get_current_weather': fake_get_current_weather
}

prompt_engine = MRPromptV2()

# stage 1: query
conversations = [
    {"role": "user", "content": "請問台北目前溫度是攝氏幾度?"},
]

prompt = prompt_engine.get_prompt(conversations, functions=functions)

output_str = _inference(prompt, llm, params)
result = prompt_engine.parse_generated_str(output_str)

print(result)
# {'role': 'assistant',
#  'tool_calls': [
#     {'id': 'call_U9bYCBRAbF639uUqfwehwSbw', 'type': 'function',
#      'function': {'name': 'get_current_weather', 'arguments': '{"location": "台北, 台灣", "unit": "celsius"}'}}]}

# stage 2: execute called functions
conversations.append(result)

tool_call = result['tool_calls'][0]
func_name = tool_call['function']['name']
func = mapping[func_name]
arguments = json.loads(tool_call['function']['arguments'])
called_result = func(**arguments)

# stage 3: put executed results
conversations.append(
    {
        'role': 'tool',
        'tool_call_id': tool_call['id'],
        'name': func_name,
        'content': json.dumps(called_result)
    }
)

prompt = prompt_engine.get_prompt(conversations, functions=functions)

output_str2 = _inference(prompt, llm, params)
result2 = prompt_engine.parse_generated_str(output_str2)

print(result2)
# {'role': 'assistant', 'content': '台北目前的溫度是攝氏30度'}
```

3. Example function calling via `llama.cpp` server:

![Function calling example](https://ik.imagekit.io/project2062/hadaly/Screenshot%202024-10-01%20at%2018.10.19_lh-dgptFf.png?updatedAt=1727777532528)
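If you prefer the `llama.cpp` server route, the sketch below is a hypothetical client for a locally running server started with something like `llama-server -m Breeze-7B-FC-v1_0-q4_k_m.gguf -c 8192 --port 8080`; it uses the server's OpenAI-compatible chat endpoint, and exact flags and endpoint behaviour may differ between llama.cpp versions.

```python
# Hypothetical client sketch for a locally running llama.cpp server
# (assumes the server exposes its OpenAI-compatible API on port 8080).
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "messages": [
            {"role": "system",
             "content": "You are a helpful AI assistant built by MediaTek Research. "
                        "The user you are helping speaks Traditional Chinese and comes from Taiwan."},
            {"role": "user", "content": "請介紹五樣台灣小吃"},
        ],
        "temperature": 0.0,
        "max_tokens": 512,
    },
    timeout=300,
)
print(resp.json()["choices"][0]["message"]["content"])
```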