---
tags:
- mteb
- sentence-transformers
- transformers
- sentence-similarity
language:
- en
- zh
license: apache-2.0
---
# Conan-Embedding-v2
## What's New?
- **Performance**
Conan-Embedding-v2 has now achieved SOTA performance on the MTEB leaderboard for both Chinese and English.
- **Cross-lingual Retrieval between Chinese and English**
Conan-Embedding-v2 supports cross-lingual retrieval between Chinese and English samples.
- **Longer Context Support**
Conan-Embedding-v2 now supports a context length of 32,768 tokens.
- **Conan 1.4B Large Model Trained from Scratch**
Both the vocabulary and the 1.4B language model are trained from scratch, yielding a pre-trained model and vocabulary tailored to the embedding scenario and delivering stronger performance.
The Conan-1.4B base model will be open-sourced, so the community can train their own embedding models on top of it.
## Performance
Performance of Conan-Embedding-v2 on MTEB for Chinese and English
![MTEB Result](./src/mteb_res_v2.png)
**English**
| Model | Class. Acc. (12) | Clust. V-Meas. (11) | PairClass. AP (3) | Rerank. MAP (4) | Retr. nDCG@10 (15) | STS Spear. (12) | Summ. Spear. (1) | Avg. (56) |
|:-----------------------:|:----------------:|:------------------:|:----------------:|:--------------:|:--------------------:|:---------------:|:--------------:|:---------:|
| bge-multilingual-gemma2 | 88.08 | 54.65 | 85.97 | 59.72 | 59.24 | 83.88 | 31.20 | 69.88 |
| e5-mistral-7b-instruct | 79.89 | 51.44 | 88.42 | 49.78 | 57.62 | 84.32 | **36.57** | 67.98 |
| gte-Qwen2-7B-instruct | 86.58 | 56.92 | 85.90 | **61.42** | 59.11 | 83.06 | 31.35 | 69.95 |
| stella-en-1.5B-v5 | 87.63 | 57.69 | 88.07 | 61.21 | 61.01 | 84.51 | 31.49 | 71.19 |
| bge-en-icl | 88.95 | 57.89 | 88.14 | 59.86 | 62.16 | 84.24 | 30.77 | 71.67 |
| NV-Embed-v2 | **90.37** | 58.46 | 88.67 | 60.65 | 62.65 | 84.31 | 30.70 | 72.31 |
| **Conan-embedding-v2** | 90.15 | **60.86** | **93.47** | 60.89 | **66.40** | **85.73** | 28.08 | **74.22** |
**Chinese**
| Model | Class. Acc. (9) | Clust. V-Meas. (4) | PairClass. AP (2) | Rerank. MAP (4) | Retr. nDCG@10 (8) | STS Spear. (8) | Avg. (35) |
|:-----------------------:|:--------------:|:----------------:|:---------------:|:-------------:|:------------------:|:-------------:|:---------:|
| e5-mistral-7b-instruct | 72.96 | 52.30 | 72.19 | 61.86 | 61.75 | 48.34 | 59.92 |
| gte-Qwen2-1.5B-instruct | 72.53 | 54.61 | 86.91 | 68.21 | 71.86 | 60.05 | 67.12 |
| bge-multilingual-gemma2 | 75.31 | 59.30 | 86.67 | 68.28 | 73.73 | 55.19 | 67.64 |
| gte-Qwen2-7B-instruct | 75.77 | 66.06 | 87.48 | 68.92 | 75.71 | 65.20 | 71.62 |
| xiaobu-embedding-v2 | 76.53 | 65.17 | 91.87 | 72.58 | 76.50 | 64.18 | 72.36 |
| Conan-embedding-v1 | **76.77** | 66.33 | 91.66 | 72.76 | 76.67 | 63.67 | 72.50 |
| **Conan-embedding-v2** | 76.47 | **68.84** | **92.44** | **74.41** | **78.31** | **65.48** | **74.24** |
## Model Details
### Model Structure
**Conan-Embedding-v2 Structure:**
```
SentenceTransformer(
(0): Transformer({
'max_seq_length': 32768,
'do_lower_case': False
}) with Transformer model: ConanEmbedModel,
(1): Pooling({
'word_embedding_dimension': 3584,
'pooling_mode_cls_token': False,
'pooling_mode_mean_tokens': True,
'pooling_mode_max_tokens': False,
'pooling_mode_mean_sqrt_len_tokens': False,
'pooling_mode_weightedmean_tokens': False,
'pooling_mode_lasttoken': False,
'include_prompt': True
}),
(2): Dense({
'in_features': 3584,
'out_features': 3584,
'bias': True,
'activation_function': 'torch.nn.modules.linear.Identity'
})
)
```
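The Pooling module above applies attention-mask-aware mean pooling over token embeddings (`pooling_mode_mean_tokens: True`). A minimal PyTorch sketch of that step, for illustration only; the actual implementation is the standard sentence-transformers Pooling module:
```python
import torch

def mean_pool(token_embeddings: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Attention-mask-aware mean pooling, matching pooling_mode_mean_tokens=True above."""
    mask = attention_mask.unsqueeze(-1).to(token_embeddings.dtype)  # (batch, seq, 1)
    summed = (token_embeddings * mask).sum(dim=1)                   # (batch, 3584)
    counts = mask.sum(dim=1).clamp(min=1e-9)                        # guard against empty masks
    return summed / counts
```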
**Key Specifications of Conan-1.4B (Transformer):**
- Number of Parameters (excluding the Dense layer): 1.48B
- Vocabulary Size: 150,000
- Number of Layers: 8
- Hidden Dimension: 3584
- Attention Heads (GQA): 32 query heads and 8 key/value heads (head dimension 3584 / 32 = 112, so every 4 query heads share one KV head)
- FFN Intermediate Dimension: 8192
- Maximum Context Window: 32,768 tokens
For more model details, please refer to `model/modeling_conan.py` and `config.json`, or stay tuned for the upcoming open-source release of the Conan-1.4B base model.
### Tokenizer
We trained the tokenizer on a large-scale multilingual corpus, building a standard BBPE (byte-level byte pair encoding) tokenizer with a vocabulary size of 150,000.
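For illustration only, a byte-level BPE tokenizer of this kind can be trained with the Hugging Face `tokenizers` library as sketched below. This is not our training code, and the corpus file names and special tokens are placeholders:
```python
# Illustrative sketch: training a byte-level BPE (BBPE) tokenizer with a
# 150k vocabulary using the Hugging Face `tokenizers` library.
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

# Placeholder corpus files and special tokens, not the actual training setup.
trainer = trainers.BpeTrainer(vocab_size=150_000, special_tokens=["<pad>", "<eos>"])
tokenizer.train(files=["corpus_en.txt", "corpus_zh.txt"], trainer=trainer)
tokenizer.save("bbpe_tokenizer.json")
```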
## Technical Report
We will soon release our technical report.
## Using Conan-Embedding-v2
Use `/model/conan_api_client.py` to access our test API. A sample call is as follows:
```python
import os

from conan_api_client import ConanClient

# Read API credentials from environment variables.
AK = os.getenv("CONAN_AK")
SK = os.getenv("CONAN_SK")

client = ConanClient(ak=AK, sk=SK, url="https://ai.om.qq.com/api/conan/v2")
res = client.embed("Hello!")
print(res)
```
This is a temporary access method; please contact us to obtain an access token.
In the future, we will provide high-performance, cost-effective, and reliable embedding services on Tencent Cloud.
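For local inference, a minimal sentence-transformers sketch is shown below. The repository id and the `trust_remote_code` flag are assumptions (the Transformer module wraps the custom ConanEmbedModel, which requires remote code); adjust them to the actual checkpoint location:
```python
from sentence_transformers import SentenceTransformer

# trust_remote_code is assumed to be required for the custom ConanEmbedModel.
model = SentenceTransformer("TencentBAC/Conan-embedding-v2", trust_remote_code=True)

# Mean pooling (see Model Structure) yields 3584-dimensional sentence embeddings.
sentences = ["What is the capital of China?", "中国的首都是北京。"]
embeddings = model.encode(sentences, normalize_embeddings=True)
print(embeddings.shape)  # (2, 3584)

# After L2 normalization, the dot product equals cosine similarity,
# which supports the cross-lingual (zh-en) retrieval use case.
print(float(embeddings[0] @ embeddings[1]))
```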
---
**About**
Created by the Tencent BAC Group. All rights reserved.