---
tags:
- mteb
- sentence-transformers
- transformers
- Qwen2
- sentence-similarity
- llama-cpp
license: apache-2.0
---
## This version
This is a **32-bit GGUF (`f32`)** conversion of **[`Alibaba-NLP/gte-Qwen2-7B-instruct`](https://huggingface.co/Alibaba-NLP/gte-Qwen2-7B-instruct)**, produced from the original 32-bit safetensors weights using `llama-quantize` built from [`llama.cpp`](https://github.com/ggerganov/llama.cpp). Because both the source and the target are 32-bit, the conversion is lossless in this case.
Custom conversion script settings:
```json
"gte-Qwen2-7B-instruct": {
"model_name": "gte-Qwen2-7B-instruct",
"hq_quant_type": "f32",
"final_quant_type": "",
"produce_final_quant": false,
"parts_num": 4,
"max_shard_size_gb": 4,
"numexpr_max_thread": 8
}
```
Please refer to the [original model card](https://huggingface.co/Alibaba-NLP/gte-Qwen2-7B-instruct) for more details on the unquantized model, including its metrics. Quantized versions typically score slightly worse than the original; since this `f32` conversion is lossless, no degradation from quantization is expected here.
## gte-Qwen2-7B-instruct
**gte-Qwen2-7B-instruct** is the latest model in the gte (General Text Embedding) model family. It ranks **No.1** in both English and Chinese evaluations on the [Massive Text Embedding Benchmark (MTEB)](https://huggingface.co/spaces/mteb/leaderboard) (as of June 16, 2024).
Recently, the [**Qwen team**](https://huggingface.co/Qwen) released the Qwen2 series models, and we have trained the **gte-Qwen2-7B-instruct** model based on the [Qwen2-7B](https://huggingface.co/Qwen/Qwen2-7B) LLM. Compared to the [gte-Qwen1.5-7B-instruct](https://huggingface.co/Alibaba-NLP/gte-Qwen1.5-7B-instruct) model, the **gte-Qwen2-7B-instruct** model uses the same training data and training strategies during the finetuning stage; the only difference is that the base model has been upgraded to Qwen2-7B. Given the improvements of the Qwen2 series over the Qwen1.5 series, we can expect consistent performance gains in the embedding models as well.
The model incorporates several key advancements:
- Integration of bidirectional attention mechanisms, enriching its contextual understanding.
- Instruction tuning, applied solely on the query side for streamlined efficiency.
- Comprehensive training across a vast, multilingual text corpus spanning diverse domains and scenarios. This training leverages both weakly supervised and supervised data, ensuring the model's applicability across numerous languages and a wide array of downstream tasks.
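Because instructions are applied only on the query side, queries are wrapped in a task description before embedding while documents are embedded unchanged. The sketch below illustrates this; the `Instruct: ...\nQuery: ...` template and the example task string are assumptions taken from the upstream model card's usage example, so check the original card for the authoritative format.

```python
def get_detailed_instruct(task_description: str, query: str) -> str:
    """Wrap a query in a task instruction (assumed template, see lead-in)."""
    return f"Instruct: {task_description}\nQuery: {query}"


# Hypothetical task description, used for illustration only.
task = "Given a web search query, retrieve relevant passages that answer the query"

# Queries get the instruction prefix ...
queries = [get_detailed_instruct(task, "how much protein should a female eat")]

# ... while documents are embedded as plain text, without any prefix.
documents = ["As a general guideline, the average protein requirement varies by age."]

print(queries[0])
```

Only the query strings change; the document side of the index never needs to be re-embedded when the task description is adjusted.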
## Model Information
### Overview
- Model Type: GTE (General Text Embeddings)
- Model Size: 7B
- Embedding Dimension: 3584
- Context Window: 131072
### Supported languages
- North America: English
- Western Europe: German, French, Spanish, Portuguese, Italian, Dutch
- Eastern & Central Europe: Russian, Czech, Polish
- Middle East: Arabic, Persian, Hebrew, Turkish
- Eastern Asia: Chinese, Japanese, Korean
- South-Eastern Asia: Vietnamese, Thai, Indonesian, Malay, Lao, Burmese, Cebuano, Khmer, Tagalog
- Southern Asia: Hindi, Bengali, Urdu
- [[source](https://qwenlm.github.io/blog/qwen2/)]
### Details
```
llama_model_loader: - kv 0: general.architecture str = qwen2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = gte-Qwen2-7B-instruct
llama_model_loader: - kv 3: general.finetune str = instruct
llama_model_loader: - kv 4: general.basename str = gte-Qwen2
llama_model_loader: - kv 5: general.size_label str = 7B
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: general.tags arr[str,5] = ["mteb", "sentence-transformers", "tr...
llama_model_loader: - kv 8: qwen2.block_count u32 = 28
llama_model_loader: - kv 9: qwen2.context_length u32 = 131072
llama_model_loader: - kv 10: qwen2.embedding_length u32 = 3584
llama_model_loader: - kv 11: qwen2.feed_forward_length u32 = 18944
llama_model_loader: - kv 12: qwen2.attention.head_count u32 = 28
llama_model_loader: - kv 13: qwen2.attention.head_count_kv u32 = 4
llama_model_loader: - kv 14: qwen2.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 15: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 16: general.file_type u32 = 0
llama_model_loader: - kv 17: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 18: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 19: tokenizer.ggml.tokens arr[str,151646] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 20: tokenizer.ggml.token_type arr[i32,151646] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 21: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 22: tokenizer.ggml.eos_token_id u32 = 151643
llama_model_loader: - kv 23: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 24: tokenizer.ggml.bos_token_id u32 = 151643
llama_model_loader: - kv 25: tokenizer.ggml.add_eos_token bool = true
llama_model_loader: - kv 26: tokenizer.chat_template str = {% for message in messages %}{{'<|im_...
llama_model_loader: - kv 27: general.quantization_version u32 = 2
llama_model_loader: - kv 28: split.no u16 = 0
llama_model_loader: - kv 29: split.count u16 = 8
llama_model_loader: - kv 30: split.tensors.count i32 = 339
llama_model_loader: - type f32: 339 tensors
llm_load_vocab: special tokens cache size = 3
llm_load_vocab: token to piece cache size = 0.9308 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = qwen2
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 151646
llm_load_print_meta: n_merges = 151387
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 3584
llm_load_print_meta: n_layer = 28
llm_load_print_meta: n_head = 28
llm_load_print_meta: n_head_kv = 4
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 7
llm_load_print_meta: n_embd_k_gqa = 512
llm_load_print_meta: n_embd_v_gqa = 512
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 18944
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = all F32
llm_load_print_meta: model params = 7.61 B
llm_load_print_meta: model size = 28.36 GiB (32.00 BPW)
llm_load_print_meta: general.name = gte-Qwen2-7B-instruct
llm_load_print_meta: BOS token = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token = 151643 '<|endoftext|>'
llm_load_print_meta: EOT token = 151645 '<|im_end|>'
llm_load_print_meta: PAD token = 151643 '<|endoftext|>'
llm_load_print_meta: LF token = 148848 'ÄĬ'
llm_load_print_meta: EOG token = 151643 '<|endoftext|>'
llm_load_print_meta: EOG token = 151645 '<|im_end|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: CPU_Mapped model buffer size = 3795.37 MiB
llm_load_tensors: CPU_Mapped model buffer size = 3612.20 MiB
llm_load_tensors: CPU_Mapped model buffer size = 3668.20 MiB
llm_load_tensors: CPU_Mapped model buffer size = 3703.16 MiB
llm_load_tensors: CPU_Mapped model buffer size = 3556.17 MiB
llm_load_tensors: CPU_Mapped model buffer size = 3556.19 MiB
llm_load_tensors: CPU_Mapped model buffer size = 3556.18 MiB
llm_load_tensors: CPU_Mapped model buffer size = 3592.38 MiB
........................................................................................
llama_new_context_with_model: n_seq_max = 1
llama_new_context_with_model: n_ctx = 131072
llama_new_context_with_model: n_ctx_per_seq = 131072
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 7168.00 MiB
llama_new_context_with_model: KV self size = 7168.00 MiB, K (f16): 3584.00 MiB, V (f16): 3584.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.01 MiB
llama_new_context_with_model: CPU compute buffer size = 7452.01 MiB
llama_new_context_with_model: graph nodes = 986
llama_new_context_with_model: graph splits = 1
```
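The memory figures in the log above can be cross-checked with simple arithmetic. The following is a minimal sketch that recomputes the KV cache size and the weight size from values copied out of the log; the helper name `kv_cache_mib` is made up for illustration and is not part of `llama.cpp`.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3

# Values copied from the load log above.
N_CTX = 131072          # n_ctx
N_LAYER = 28            # n_layer
N_EMBD_K_GQA = 512      # n_embd_k_gqa (= n_head_kv * n_embd_head_k = 4 * 128)
N_EMBD_V_GQA = 512      # n_embd_v_gqa
MODEL_PARAMS = 7.61e9   # "model params = 7.61 B" (rounded in the log)
SHARD_MIB = [3795.37, 3612.20, 3668.20, 3703.16,
             3556.17, 3556.19, 3556.18, 3592.38]  # CPU_Mapped buffer sizes


def kv_cache_mib(n_ctx: int, n_layer: int, n_embd_gqa: int, bytes_per_elem: int) -> float:
    """Size of one half (K or V) of the KV cache in MiB."""
    return n_ctx * n_layer * n_embd_gqa * bytes_per_elem / MIB


k_mib = kv_cache_mib(N_CTX, N_LAYER, N_EMBD_K_GQA, 2)  # K stored as f16 (2 bytes)
v_mib = kv_cache_mib(N_CTX, N_LAYER, N_EMBD_V_GQA, 2)  # V stored as f16 (2 bytes)

weights_gib = MODEL_PARAMS * 4 / GIB   # f32 weights: 4 bytes per parameter
shards_gib = sum(SHARD_MIB) / 1024     # total of the eight mapped shards

print(f"K cache: {k_mib:.2f} MiB, V cache: {v_mib:.2f} MiB, total: {k_mib + v_mib:.2f} MiB")
print(f"weights from param count: {weights_gib:.2f} GiB, from shard sizes: {shards_gib:.2f} GiB")
```

The K and V halves come out at 3584 MiB each, matching the `KV self size` line. The shard total also agrees with the reported 28.36 GiB model size, up to the rounding of the parameter count. The full 131072-token context therefore adds roughly 7 GiB of KV cache on top of the weights; a smaller `--ctx-size` reduces this proportionally.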
## Usage
### Sentence Transformers
### Transformers
## Inference
### Using `llama.cpp` to get embeddings on CPU and/or GPU
First, [build](https://github.com/ggerganov/llama.cpp/blob/master/docs/build.md) or [install](https://github.com/ggerganov/llama.cpp/blob/master/docs/install.md) the **`llama-server`** binary from [`llama.cpp`](https://github.com/ggerganov/llama.cpp), preferably with GPU support.
### CLI
### Server
```bash
# using remote HF repo address (with model file(s) to be downloaded and cached locally)
$ llama-server --hf-repo mirekphd/gte-Qwen2-7B-instruct-F32 --hf-file gte-Qwen2-7B-instruct-F32-00001-of-00008.gguf --n-gpu-layers 0 --ctx-size 131072 --embeddings
# using a previously downloaded local model file(s)
$ llama-server --model <path-to-hf-models>/mirekphd/gte-Qwen2-7B-instruct-F32/gte-Qwen2-7B-instruct-F32-00001-of-00008.gguf --n-gpu-layers 0 --ctx-size 131072 --embeddings
```
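Once the server is running, embeddings can be requested over HTTP. The following is a minimal client sketch using only the Python standard library. It assumes the server listens on `http://127.0.0.1:8080` and exposes the OpenAI-compatible `/v1/embeddings` endpoint. Depending on the `llama.cpp` build, that endpoint may require a pooling mode other than `none` (for example, starting the server with `--pooling last`); treat this as an assumption to verify against your build. The function names are illustrative.

```python
import json
import math
import urllib.request

SERVER_URL = "http://127.0.0.1:8080"  # assumption: default llama-server host and port


def build_payload(texts):
    """Encode a list of input strings as an OpenAI-style embeddings request body."""
    return json.dumps({"input": list(texts)}).encode("utf-8")


def parse_embeddings(response_body):
    """Extract embedding vectors from the response, ordered by their input index."""
    data = json.loads(response_body)["data"]
    return [item["embedding"] for item in sorted(data, key=lambda d: d["index"])]


def embed(texts, server_url=SERVER_URL):
    """Send texts to a running llama-server and return one vector per text."""
    request = urllib.request.Request(
        f"{server_url}/v1/embeddings",
        data=build_payload(texts),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return parse_embeddings(response.read())


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

With the server up, `vectors = embed(["first text", "second text"])` followed by `cosine_similarity(vectors[0], vectors[1])` gives a similarity score for the pair; each vector should have 3584 dimensions, matching the embedding dimension listed above.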
## Evaluation
### MTEB & C-MTEB
## Cloud API Services
## Citation
If you find our paper or models helpful, please consider citing:
```
@article{li2023towards,
title={Towards general text embeddings with multi-stage contrastive learning},
author={Li, Zehan and Zhang, Xin and Zhang, Yanzhao and Long, Dingkun and Xie, Pengjun and Zhang, Meishan},
journal={arXiv preprint arXiv:2308.03281},
year={2023}
}
```