---
tags:
- mteb
- sentence-transformers
- transformers
- sentence-similarity
language:
- en
- zh
license: apache-2.0
---

# Conan-Embedding-v2

## What's New?

- **Performance**

  Conan-Embedding-v2 has achieved state-of-the-art (SOTA) performance on the MTEB leaderboards for both Chinese and English.

- **Cross-Lingual Retrieval between Chinese and English**

  Conan-Embedding-v2 supports cross-lingual retrieval between Chinese and English samples.

- **Longer Context Support**

  Conan-Embedding-v2 now supports a context length of 32,768 tokens.

- **Conan-1.4B Large Model Trained from Scratch**

  We trained the vocabulary and large language model from scratch, so both the pre-trained model and its vocabulary are tailored to the embedding scenario, delivering stronger performance.

  The Conan-1.4B base model will be open-sourced, so the community can train their own embedding models on top of it.

## Performance

Performance of Conan-Embedding-v2 on the Chinese and English MTEB benchmarks. Numbers in parentheses are the number of tasks in each category.



**English**

| Model | Class. Acc. (12) | Clust. V-Meas. (11) | PairClass. AP (3) | Rerank. MAP (4) | Retri. nDCG@10 (15) | STS Spear. (12) | Summ. Spear. (1) | Avg. (56) |
|:-----------------------:|:----------------:|:------------------:|:----------------:|:--------------:|:--------------------:|:---------------:|:--------------:|:---------:|
| bge-multilingual-gemma2 | 88.08 | 54.65 | 85.97 | 59.72 | 59.24 | 83.88 | 31.20 | 69.88 |
| e5-mistral-7b-instruct | 79.89 | 51.44 | 88.42 | 49.78 | 57.62 | 84.32 | **36.57** | 67.98 |
| gte-Qwen2-7B-instruct | 86.58 | 56.92 | 85.90 | **61.42** | 59.11 | 83.06 | 31.35 | 69.95 |
| stella-en-1.5B-v5 | 87.63 | 57.69 | 88.07 | 61.21 | 61.01 | 84.51 | 31.49 | 71.19 |
| bge-en-icl | 88.95 | 57.89 | 88.14 | 59.86 | 62.16 | 84.24 | 30.77 | 71.67 |
| NV-Embed-v2 | **90.37** | 58.46 | 88.67 | 60.65 | 62.65 | 84.31 | 30.70 | 72.31 |
| **Conan-embedding-v2** | 90.15 | **60.86** | **93.47** | 60.89 | **66.40** | **85.73** | 28.08 | **74.22** |

**Chinese**

| Model | Class. Acc. (9) | Clust. V-Meas. (4) | PairClass. AP (2) | Rerank. MAP (4) | Retri. nDCG@10 (8) | STS Spear. (8) | Avg. (35) |
|:-----------------------:|:--------------:|:----------------:|:---------------:|:-------------:|:------------------:|:-------------:|:---------:|
| e5-mistral-7b-instruct | 72.96 | 52.30 | 72.19 | 61.86 | 61.75 | 48.34 | 59.92 |
| gte-Qwen2-1.5B-instruct | 72.53 | 54.61 | 86.91 | 68.21 | 71.86 | 60.05 | 67.12 |
| bge-multilingual-gemma2 | 75.31 | 59.30 | 86.67 | 68.28 | 73.73 | 55.19 | 67.64 |
| gte-Qwen2-7B-instruct | 75.77 | 66.06 | 87.48 | 68.92 | 75.71 | 65.20 | 71.62 |
| xiaobu-embedding-v2 | 76.53 | 65.17 | 91.87 | 72.58 | 76.50 | 64.18 | 72.36 |
| Conan-embedding-v1 | **76.77** | 66.33 | 91.66 | 72.76 | 76.67 | 63.67 | 72.50 |
| **Conan-embedding-v2** | 76.47 | **68.84** | **92.44** | **74.41** | **78.31** | **65.48** | **74.24** |

## Model Details

### Model Structure

**Conan-Embedding-v2 structure:**

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 32768, 'do_lower_case': False}) with Transformer model: ConanEmbedModel
  (1): Pooling({'word_embedding_dimension': 3584, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Dense({'in_features': 3584, 'out_features': 3584, 'bias': True, 'activation_function': 'torch.nn.modules.linear.Identity'})
)
```
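
The Pooling module above is configured for plain mean pooling: token embeddings are averaged over non-padding positions to produce one 3584-dimensional sentence vector. A minimal sketch of that step, using NumPy with random data standing in for real hidden states:

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token embeddings, counting only non-padding positions."""
    mask = attention_mask[..., None].astype(token_embeddings.dtype)  # (batch, seq, 1)
    summed = (token_embeddings * mask).sum(axis=1)                   # (batch, dim)
    counts = mask.sum(axis=1).clip(min=1e-9)                         # avoid divide-by-zero
    return summed / counts

# Toy example: batch of 1, sequence of 3 tokens (last one is padding), dim 3584.
emb = np.random.randn(1, 3, 3584)
mask = np.array([[1, 1, 0]])
pooled = mean_pool(emb, mask)
print(pooled.shape)  # (1, 3584)
```

The Dense layer then maps this pooled vector through a 3584-to-3584 linear projection with an identity activation.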

**Key specifications of Conan-1.4B (the Transformer module):**

- Number of parameters (excluding the Dense layer): 1.48B
- Vocabulary size: 150,000
- Number of layers: 8
- Hidden dimension: 3584
- Number of attention heads (GQA): 32 for Q and 8 for KV
- FFN intermediate dimension: 8192
- Maximum context window: 32,768 tokens

For more model details, please refer to `model/modeling_conan.py` and `config.json`, or stay tuned for the upcoming open-source release of the Conan-1.4B base model.
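
For orientation, the specifications above would map onto a `config.json` roughly like the following. This is a hypothetical sketch assuming conventional Hugging Face field names; the actual `ConanEmbedModel` config in the repository is authoritative and its keys may differ:

```json
{
  "hidden_size": 3584,
  "num_hidden_layers": 8,
  "num_attention_heads": 32,
  "num_key_value_heads": 8,
  "intermediate_size": 8192,
  "vocab_size": 150000,
  "max_position_embeddings": 32768
}
```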

### Tokenizer

We trained the tokenizer on a large-scale multilingual dataset to build a standard BBPE (byte-level byte-pair encoding) tokenizer with a vocabulary size of 150,000.
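
Byte-level BPE starts from raw UTF-8 bytes, so any Chinese or English string is representable with no unknown tokens, and then repeatedly merges the most frequent adjacent pair into a new vocabulary entry. A toy sketch of that merge loop (an illustration of the algorithm only, not the actual Conan tokenizer or its training data):

```python
from collections import Counter

def most_frequent_pair(sequences):
    """Count adjacent symbol pairs across all byte sequences."""
    pairs = Counter()
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(seq, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged

# Start from raw UTF-8 bytes of a tiny toy corpus.
corpus = ["hello", "hell", "help"]
sequences = [[bytes([b]) for b in w.encode("utf-8")] for w in corpus]

for _ in range(3):  # learn 3 merges
    pair = most_frequent_pair(sequences)
    if pair is None:
        break
    sequences = [merge_pair(seq, pair) for seq in sequences]

print([[s.decode() for s in seq] for seq in sequences])
```

A production tokenizer repeats this until the vocabulary reaches its target size (here, 150,000 entries).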

## Technical Report

We will release our technical report soon.

## Using Conan-Embedding-v2

Use `model/conan_api_client.py` to access our test API. A sample call looks like this:

```python
import os

from conan_api_client import ConanClient

# Credentials are issued on request; see below.
AK = os.getenv("CONAN_AK")
SK = os.getenv("CONAN_SK")

client = ConanClient(ak=AK, sk=SK, url="https://ai.om.qq.com/api/conan/v2")
res = client.embed("Hello!")
print(res)
```

This is a temporary calling solution. Please contact us to obtain an access token.

In the future, we will provide high-performance, cost-effective, and reliable embedding services on Tencent Cloud.
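
Once embeddings come back from the client, downstream retrieval or similarity tasks typically score pairs with cosine similarity. A minimal sketch in NumPy, with random vectors standing in for real `client.embed` outputs:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
query = rng.standard_normal(3584)  # stand-in for an embedding of a query
doc = rng.standard_normal(3584)    # stand-in for an embedding of a document

score = cosine_similarity(query, doc)
print(round(score, 4))  # a value in [-1, 1]; higher means more similar
```

For cross-lingual retrieval, the same scoring applies: embed the Chinese query and the English documents, then rank documents by cosine similarity.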

---

**About**

Created by the Tencent BAC Group. All rights reserved.