---
license: apache-2.0
datasets:
- cerebras/SlimPajama-627B
- bigcode/starcoderdata
- OpenAssistant/oasst_top1_2023-08-25
language:
- en
quantized_by:
- kirp
pipeline_tag: text-generation
---
# 🔥 Good news
You can download the model from [PY007](https://huggingface.co/PY007/TinyLlama-1.1B-Chat-v0.2-GGUF) without any change to llama.cpp.

Here is a [demo](https://huggingface.co/spaces/kirp/tinyllama-chat).

# Pay attention
To use this model, you need to change the RoPE call in `llama.cpp/llama.cpp` from mode 0 to mode 2.

Change lines 2568 and 2572 from
```
struct ggml_tensor * Kcur = ggml_rope_custom_inplace(ctx0, ggml_reshape_3d(ctx0, tmpk, n_embd_head, n_head_kv, N), n_past, n_embd_head, 0, 0, freq_base, freq_scale);
struct ggml_tensor * Qcur = ggml_rope_custom_inplace(ctx0, ggml_reshape_3d(ctx0, tmpq, n_embd_head, n_head, N),    n_past, n_embd_head, 0, 0, freq_base, freq_scale);
```
to
```
struct ggml_tensor * Kcur = ggml_rope_custom_inplace(ctx0, ggml_reshape_3d(ctx0, tmpk, n_embd_head, n_head_kv, N), n_past, n_embd_head, 2, 0, freq_base, freq_scale);
struct ggml_tensor * Qcur = ggml_rope_custom_inplace(ctx0, ggml_reshape_3d(ctx0, tmpq, n_embd_head, n_head, N),    n_past, n_embd_head, 2, 0, freq_base, freq_scale);
```
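
For intuition about what the mode switch changes, here is a minimal pure-Python sketch. It assumes (this is not stated above) that ggml's mode 0 rotates adjacent pairs of dimensions, while mode 2 is the GPT-NeoX-style variant that pairs dimension `i` with `i + n_dims/2`. The function names are hypothetical and this is not llama.cpp's implementation.

```python
import math


def rope_angles(pos, n_dims, freq_base=10000.0):
    """Rotation angle for each of the n_dims/2 pairs at token position pos."""
    return [pos * freq_base ** (-2.0 * i / n_dims) for i in range(n_dims // 2)]


def rope_mode0(x, pos, freq_base=10000.0):
    """Assumed mode 0: rotate adjacent pairs (x[2i], x[2i+1])."""
    out = list(x)
    for i, theta in enumerate(rope_angles(pos, len(x), freq_base)):
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i] = a * c - b * s
        out[2 * i + 1] = a * s + b * c
    return out


def rope_mode2(x, pos, freq_base=10000.0):
    """Assumed mode 2 (NeoX-style): rotate split-half pairs (x[i], x[i + n/2])."""
    half = len(x) // 2
    out = list(x)
    for i, theta in enumerate(rope_angles(pos, len(x), freq_base)):
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[i], x[i + half]
        out[i] = a * c - b * s
        out[i + half] = a * s + b * c
    return out


head = [1.0, 2.0, 3.0, 4.0]
print(rope_mode0(head, pos=1))
print(rope_mode2(head, pos=1))
```

Both variants apply the same angles and differ only in which dimensions are paired. Under that assumption, a checkpoint whose weights expect one pairing would see scrambled positions with the other, which is consistent with the need for the patch above.
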
# TinyLlama-1.1B Chat v0.2 GGUF

- Model creator: [PY007](https://huggingface.co/PY007)
- Original model: [TinyLlama 1.1B Chat v0.2](https://huggingface.co/PY007/TinyLlama-1.1B-Chat-v0.2)
- Github: [TinyLlama](https://github.com/jzhang38/TinyLlama)

## Description

This repo contains GGUF format model files for [PY007's TinyLlama 1.1B Chat v0.2](https://huggingface.co/PY007/TinyLlama-1.1B-Chat-v0.2).

<!-- README_GGUF.md-about-gguf start -->

### About GGUF

GGUF is a new format introduced by the llama.cpp team on August 21st 2023. It is a replacement for GGML, which is no longer supported by llama.cpp.

The key benefit of GGUF is that it is an extensible, future-proof format which stores more information about the model as metadata. It also includes significantly improved tokenization code, including for the first time full support for special tokens. This should improve performance, especially with models that use new special tokens and implement custom prompt templates.

Here is a list of clients and libraries that are known to support GGUF:
* [llama.cpp](https://github.com/ggerganov/llama.cpp).
* [text-generation-webui](https://github.com/oobabooga/text-generation-webui), the most widely used web UI, with many features and powerful extensions.
* [KoboldCpp](https://github.com/LostRuins/koboldcpp), a fully featured web UI, with full GPU accel across multiple platforms and GPU architectures. Especially good for storytelling.
* [LM Studio](https://lmstudio.ai/), an easy-to-use and powerful local GUI with GPU acceleration on both Windows (NVIDIA and AMD) and macOS.
* [LoLLMS Web UI](https://github.com/ParisNeo/lollms-webui), a great web UI with many interesting and unique features, including a full model library for easy model selection.
* [ctransformers](https://github.com/marella/ctransformers), a Python library with GPU accel, LangChain support, and OpenAI-compatible API server.
* [llama-cpp-python](https://github.com/abetlen/llama-cpp-python), a Python library with GPU accel, LangChain support, and OpenAI-compatible API server.
* [candle](https://github.com/huggingface/candle), a Rust ML framework with a focus on performance, including GPU support, and ease of use.

<!-- README_GGUF.md-about-gguf end -->

<!-- prompt-template start -->

## Prompt template: TinyLlama chat

```
<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant\n
```

Example:

```
<|im_start|>user
Explain huggingface.<|im_end|>
<|im_start|>assistant
Hugging Face is a platform for building and hosting open-source applications. It provides a simple interface for developers to build, deploy, and host any application on the web. Hugging Face offers a wide range of services, including:

1. API Gateway: This service allows developers to create REST APIs that can be accessed by other Hugging Face services.

2. Functions: This service provides functions that can be used for processing data and making predictions.

3. Transformers: These are a set of algorithms that allow developers to process large amounts of text data and generate new content.

4. Datasets: Hugging Face provides datasets that can be used to train models, evaluate them, and make predictions.

5. CLI: This service provides a command-line interface for developers to build, deploy, and manage their applications.

6. Documentation: This service provides documentation for the different services and features available on Hugging Face's platform.

7. Community: The Hugging Face community is made up of developers, data scientists, and other experts who can provide support and resources for using and building on Hugging Face's platforms.<|im_end|>
```
<!-- prompt-template end -->
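
If you build prompts programmatically, the template above can be filled in with a small helper. This is a minimal single-turn sketch; `build_prompt` is a hypothetical name, not part of any library.

```python
def build_prompt(user_message: str) -> str:
    """Wrap a single user message in the TinyLlama chat template."""
    return (
        "<|im_start|>user\n"
        f"{user_message}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )


print(build_prompt("Explain huggingface."))
```

The returned string ends right after the assistant header, so the model's generation continues as the assistant's reply and should stop at `<|im_end|>`.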

<!-- compatibility_gguf start -->

## Compatibility

These quantised GGUF files are compatible with llama.cpp from August 21st 2023 onwards, as of commit [6381d4e110bd0ec02843a60bbeb8b6fc37a9ace9](https://github.com/ggerganov/llama.cpp/commit/6381d4e110bd0ec02843a60bbeb8b6fc37a9ace9)

They are now also compatible with many third-party UIs and libraries. Please see the list in the "About GGUF" section above.

## Explanation of quantisation methods
<details>
  <summary>Click to see details</summary>

The new methods available are:
* GGML_TYPE_Q2_K - "type-1" 2-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Block scales and mins are quantized with 4 bits. This ends up effectively using 2.5625 bits per weight (bpw).
* GGML_TYPE_Q3_K - "type-0" 3-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Scales are quantized with 6 bits. This ends up using 3.4375 bpw.
* GGML_TYPE_Q4_K - "type-1" 4-bit quantization in super-blocks containing 8 blocks, each block having 32 weights. Scales and mins are quantized with 6 bits. This ends up using 4.5 bpw.
* GGML_TYPE_Q5_K - "type-1" 5-bit quantization. Same super-block structure as GGML_TYPE_Q4_K, resulting in 5.5 bpw.
* GGML_TYPE_Q6_K - "type-0" 6-bit quantization. Super-blocks with 16 blocks, each block having 16 weights. Scales are quantized with 8 bits. This ends up using 6.5625 bpw.

Refer to the Provided Files table below to see what files use which methods, and how.
</details>
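
As a rough check on the bits-per-weight figures, the sketch below adds up the storage in one super-block. It assumes (this is not stated above) that each super-block also stores one fp16 scale, plus one fp16 min for "type-1" formats. `bits_per_weight` is a hypothetical helper, and Q2_K is left out because this simple accounting does not reproduce its quoted figure.

```python
def bits_per_weight(weight_bits, n_blocks, block_size, scale_bits, has_mins):
    """Average storage per weight in one super-block.

    Assumption: one fp16 super-block scale, plus one fp16 super-block min
    when the format is "type-1" (has_mins=True).
    """
    n_weights = n_blocks * block_size
    fields = 2 if has_mins else 1          # scale only, or scale + min
    block_meta = n_blocks * scale_bits * fields
    super_meta = 16 * fields               # fp16 values for the super-block
    return (n_weights * weight_bits + block_meta + super_meta) / n_weights


formats = {
    "Q3_K": bits_per_weight(3, 16, 16, 6, has_mins=False),
    "Q4_K": bits_per_weight(4, 8, 32, 6, has_mins=True),
    "Q5_K": bits_per_weight(5, 8, 32, 6, has_mins=True),
    "Q6_K": bits_per_weight(6, 16, 16, 8, has_mins=False),
}
for name, bpw in formats.items():
    print(f"{name}: {bpw} bpw")
```

For example, Q4_K has 8 × 32 = 256 weights at 4 bits (1024 bits), 8 × 6 × 2 = 96 bits of block scales and mins, and 32 bits of assumed fp16 values, giving 1152 / 256 = 4.5 bpw.
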
<!-- compatibility_gguf end -->

<!-- README_GGUF.md-how-to-run start -->

## Example `llama.cpp` command

For compatibility with older versions of llama.cpp, or for any third-party libraries or clients that haven't yet updated for GGUF, please use GGML files instead.

```
./main -m ./models/ggml-model-q4_k_m.gguf \
        -n 512 --color --temp 0 -e \
        -p "<|im_start|>user\nExplain huggingface.<|im_end|>\n<|im_start|>assistant\n"
```
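
If you prefer to launch the same command from a script, here is a minimal Python sketch that builds the argument list. `build_main_args` is a hypothetical helper, and the paths are the same placeholders as in the command above. The `\n` sequences are kept as literal backslash-n because the `-e` flag asks `main` to process escapes itself.

```python
import subprocess


def build_main_args(model_path, user_message, n_predict=512):
    """Return the argv list for the llama.cpp `main` example shown above."""
    # Literal "\n" (backslash + n): with -e, main converts these to newlines.
    prompt = (
        "<|im_start|>user\\n"
        f"{user_message}<|im_end|>\\n"
        "<|im_start|>assistant\\n"
    )
    return [
        "./main", "-m", model_path,
        "-n", str(n_predict), "--color", "--temp", "0", "-e",
        "-p", prompt,
    ]


args = build_main_args("./models/ggml-model-q4_k_m.gguf", "Explain huggingface.")
print(args)
# Uncomment to run; assumes ./main and the model file exist at these paths.
# subprocess.run(args, check=True)
```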