---
tags:
- mteb
- sentence-transformers
- transformers
- sentence-similarity
language:
- en
- zh
license: apache-2.0
---
# Conan-Embedding-v2
## What's New?
- **Performance**
Conan-Embedding-v2 has now achieved SOTA performance on the MTEB leaderboard for both Chinese and English.
- **Cross-lingual Retrieval between Chinese and English**
Conan-Embedding-v2 supports cross-lingual retrieval between Chinese and English samples.
- **Longer Context Support**
Conan-Embedding-v2 now supports a context length of 32,768 tokens.
- **Conan 1.4B Large Model Trained from Scratch**
Both the vocabulary and the 1.4B language model are trained from scratch, yielding a pre-trained model and vocabulary tailored to the embedding scenario and delivering stronger performance.
The Conan-1.4B base model will be open-sourced, so the community can train their own embedding models on top of it.
## Performance
Performance of Conan-Embedding-v2 on MTEB for Chinese and English
![MTEB Result](./src/mteb_res_v2.png)
**English**
| Model | Class. Acc. (12) | Clust. V-Meas. (11) | PairClass. AP (3) | Rerank. MAP (4) | Retr. nDCG@10 (15) | STS Spear. (12) | Summ. Spear. (1) | Avg. (56) |
|:-----------------------:|:----------------:|:------------------:|:----------------:|:--------------:|:--------------------:|:---------------:|:--------------:|:---------:|
| bge-multilingual-gemma2 | 88.08 | 54.65 | 85.97 | 59.72 | 59.24 | 83.88 | 31.20 | 69.88 |
| e5-mistral-7b-instruct | 79.89 | 51.44 | 88.42 | 49.78 | 57.62 | 84.32 | **36.57** | 67.98 |
| gte-Qwen2-7B-instruct | 86.58 | 56.92 | 85.90 | **61.42** | 59.11 | 83.06 | 31.35 | 69.95 |
| stella-en-1.5B-v5 | 87.63 | 57.69 | 88.07 | 61.21 | 61.01 | 84.51 | 31.49 | 71.19 |
| bge-en-icl | 88.95 | 57.89 | 88.14 | 59.86 | 62.16 | 84.24 | 30.77 | 71.67 |
| NV-Embed-v2 | **90.37** | 58.46 | 88.67 | 60.65 | 62.65 | 84.31 | 30.70 | 72.31 |
| **Conan-embedding-v2** | 90.15 | **60.86** | **93.47** | 60.89 | **66.40** | **85.73** | 28.08 | **74.22** |
**Chinese**
| Model | Class. Acc. (9) | Clust. V-Meas. (4) | PairClass. AP (2) | Rerank. MAP (4) | Retr. nDCG@10 (8) | STS Spear. (8) | Avg. (35) |
|:-----------------------:|:--------------:|:----------------:|:---------------:|:-------------:|:------------------:|:-------------:|:---------:|
| e5-mistral-7b-instruct | 72.96 | 52.30 | 72.19 | 61.86 | 61.75 | 48.34 | 59.92 |
| gte-Qwen2-1.5B-instruct | 72.53 | 54.61 | 86.91 | 68.21 | 71.86 | 60.05 | 67.12 |
| bge-multilingual-gemma2 | 75.31 | 59.30 | 86.67 | 68.28 | 73.73 | 55.19 | 67.64 |
| gte-Qwen2-7B-instruct | 75.77 | 66.06 | 87.48 | 68.92 | 75.71 | 65.20 | 71.62 |
| xiaobu-embedding-v2 | 76.53 | 65.17 | 91.87 | 72.58 | 76.50 | 64.18 | 72.36 |
| Conan-embedding-v1 | **76.77** | 66.33 | 91.66 | 72.76 | 76.67 | 63.67 | 72.50 |
| **Conan-embedding-v2** | 76.47 | **68.84** | **92.44** | **74.41** | **78.31** | **65.48** | **74.24** |
## Model Details
### Model Structure
**Conan-Embedding-v2 Structure:**
```
SentenceTransformer(
(0): Transformer({
'max_seq_length': 32768,
'do_lower_case': False
}) with Transformer model: ConanEmbedModel,
(1): Pooling({
'word_embedding_dimension': 3584,
'pooling_mode_cls_token': False,
'pooling_mode_mean_tokens': True,
'pooling_mode_max_tokens': False,
'pooling_mode_mean_sqrt_len_tokens': False,
'pooling_mode_weightedmean_tokens': False,
'pooling_mode_lasttoken': False,
'include_prompt': True
}),
(2): Dense({
'in_features': 3584,
'out_features': 3584,
'bias': True,
'activation_function': 'torch.nn.modules.linear.Identity'
})
)
```
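The Pooling module above applies attention-mask-aware mean pooling over token embeddings (`pooling_mode_mean_tokens: True`). A minimal PyTorch sketch of that step, for illustration only; the actual implementation is the standard sentence-transformers Pooling module:
```python
import torch

def mean_pool(token_embeddings: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Attention-mask-aware mean pooling, matching pooling_mode_mean_tokens=True above."""
    mask = attention_mask.unsqueeze(-1).to(token_embeddings.dtype)  # (batch, seq, 1)
    summed = (token_embeddings * mask).sum(dim=1)                   # (batch, 3584)
    counts = mask.sum(dim=1).clamp(min=1e-9)                        # guard against empty masks
    return summed / counts
```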
**Key Specifications of Conan-1.4B (Transformer):**
- Number of Parameters (excluding the Dense layer): 1.48B
- Vocabulary Size: 150,000
- Number of Layers: 8
- Hidden Dimension: 3584
- Attention Heads (GQA): 32 query heads and 8 key/value heads (head dimension 3584 / 32 = 112, so every 4 query heads share one KV head)
- FFN Intermediate Dimension: 8192
- Maximum Context Window: 32,768 tokens
For more model details, please refer to `model/modeling_conan.py` and `config.json`, or stay tuned for the upcoming open-source release of the Conan-1.4B base model.
### Tokenizer
We trained the tokenizer on a large-scale multilingual corpus, building a standard BBPE (byte-level byte pair encoding) tokenizer with a vocabulary size of 150,000.
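For illustration only, a byte-level BPE tokenizer of this kind can be trained with the Hugging Face `tokenizers` library as sketched below. This is not our training code, and the corpus file names and special tokens are placeholders:
```python
# Illustrative sketch: training a byte-level BPE (BBPE) tokenizer with a
# 150k vocabulary using the Hugging Face `tokenizers` library.
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

# Placeholder corpus files and special tokens, not the actual training setup.
trainer = trainers.BpeTrainer(vocab_size=150_000, special_tokens=["<pad>", "<eos>"])
tokenizer.train(files=["corpus_en.txt", "corpus_zh.txt"], trainer=trainer)
tokenizer.save("bbpe_tokenizer.json")
```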
## Technical Report
We will soon release our technical report.
## Using Conan-Embedding-v2
Use `/model/conan_api_client.py` to access our test API. A sample call is as follows:
```python
import os

from conan_api_client import ConanClient

# Read API credentials from environment variables.
AK = os.getenv("CONAN_AK")
SK = os.getenv("CONAN_SK")

client = ConanClient(ak=AK, sk=SK, url="https://ai.om.qq.com/api/conan/v2")
res = client.embed("Hello!")
print(res)
```
This is a temporary access method; please contact us to obtain an access token.
In the future, we will provide high-performance, cost-effective, and reliable embedding services on Tencent Cloud.
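For local inference, a minimal sentence-transformers sketch is shown below. The repository id and the `trust_remote_code` flag are assumptions (the Transformer module wraps the custom ConanEmbedModel, which requires remote code); adjust them to the actual checkpoint location:
```python
from sentence_transformers import SentenceTransformer

# trust_remote_code is assumed to be required for the custom ConanEmbedModel.
model = SentenceTransformer("TencentBAC/Conan-embedding-v2", trust_remote_code=True)

# Mean pooling (see Model Structure) yields 3584-dimensional sentence embeddings.
sentences = ["What is the capital of China?", "中国的首都是北京。"]
embeddings = model.encode(sentences, normalize_embeddings=True)
print(embeddings.shape)  # (2, 3584)

# After L2 normalization, the dot product equals cosine similarity,
# which supports the cross-lingual (zh-en) retrieval use case.
print(float(embeddings[0] @ embeddings[1]))
```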
---
**About**
Created by the Tencent BAC Group. All rights reserved.