Update README.md
README.md CHANGED
@@ -6,7 +6,7 @@ tags:
   - transformers
 ---
 
-## INF
+## <u>INF</u> <u>W</u>ord-level <u>S</u>parse <u>E</u>mbedding (INF-WSE)
 
 **INF-WSE** is a series of word-level sparse embedding models developed by [INFLY TECH](https://www.infly.cn/en).
 These models are optimized to generate sparse, high-dimensional text embeddings that excel in capturing the most
@@ -29,7 +29,7 @@ relevant information for search and retrieval, particularly in Chinese text.
 
 ### Transformers
 
-#### Infer
+#### Infer embeddings
 ```python
 import torch
 from transformers import AutoTokenizer, AutoModel
@@ -58,31 +58,10 @@ print(scores.tolist())
 
 #### Convert embeddings to lexical weights
 ```python
-import torch
-from transformers import AutoTokenizer, AutoModel
 from collections import OrderedDict
-
-queries = ['电脑一体机由什么构成?', '什么是掌上电脑?']
-documents = [
-    '电脑一体机,是由一台显示器、一个电脑键盘和一个鼠标组成的电脑。',
-    '掌上电脑是一种运行在嵌入式操作系统和内嵌式应用软件之上的、小巧、轻便、易带、实用、价廉的手持式计算设备。',
-]
-input_texts = queries + documents
-
-tokenizer = AutoTokenizer.from_pretrained("infly/inf-wse-v1-base-zh", trust_remote_code=True, use_fast=False)
-model = AutoModel.from_pretrained("infly/inf-wse-v1-base-zh", trust_remote_code=True)
-model.eval()
-
-max_length = 512
-
-input_batch = tokenizer(input_texts, padding=True, max_length=max_length, truncation=True, return_tensors="pt")
-
-with torch.no_grad():
-    embeddings = model(input_batch['input_ids'], input_batch['attention_mask'], return_sparse=False)
-
 def convert_embeddings_to_weights(embeddings, tokenizer):
     values, indices = torch.sort(embeddings, dim=-1, descending=True)
-
+
     token2weight = []
     for i in range(embeddings.size(0)):
         token2weight.append(OrderedDict())
@@ -97,14 +76,14 @@ def convert_embeddings_to_weights(embeddings, tokenizer):
     return token2weight
 
 token2weight = convert_embeddings_to_weights(embeddings, tokenizer)
-print(token2weight[0])
-
-# OrderedDict([('一体机', 3.3438382148742676), ('由', 2.493837356567383), ('电脑', 2.0291812419891357), ('构成', 1.986171841621399), ('什么', 1.0218793153762817)])
+print(token2weight[1])
+# OrderedDict([('掌上', 3.4572525024414062), ('电脑', 2.6253132820129395), ('是', 2.0787220001220703), ('什么', 1.2899624109268188)])
 ```
 
 ## Evaluation
 
 ### C-MTEB Retrieval task
+
 ([Chinese Massive Text Embedding Benchmark](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB))
 
 Metric: nDCG@10
@@ -114,3 +93,5 @@ Metric: nDCG@10
 | [BM25-zh](https://github.com/castorini/pyserini) | - | 25.39 | 13.70 | **86.66** | 13.68 | 11.49 | 15.48 | 6.56 | 29.53 | 25.98 |
 | [bge-m3-sparse](https://huggingface.co/BAAI/bge-m3) | 512 | 29.94 | **24.50** | 76.16 | 22.12 | 17.62 | 27.52 | 9.78 | **37.69** | 24.12 |
 | **inf-wse-v1-base-zh** | 512 | **32.83** | 20.51 | 76.40 | **36.77** | **19.97** | **28.61** | **13.32** | 36.81 | **30.25** |
+
+All results, except for BM25, are measured by building the sparse index via [Qdrant](https://github.com/qdrant/qdrant).
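
The hunks above truncate the body of the `#### Infer embeddings` snippet. Below is a minimal end-to-end sketch assembled from the calls that are visible in this diff (the `infly/inf-wse-v1-base-zh` model id, `trust_remote_code`, `use_fast=False`, and the `return_sparse=False` forward call); the final scoring step is an assumption, a plain query-document dot product, since the original scoring lines are elided here.

```python
# Minimal sketch, not the verbatim README snippet: model loading, tokenization and the
# return_sparse=False forward call are copied from lines visible in this diff; the
# dot-product scoring at the end is an assumption and may differ from the elided lines.
import torch
from transformers import AutoTokenizer, AutoModel

queries = ['电脑一体机由什么构成?', '什么是掌上电脑?']
documents = [
    '电脑一体机,是由一台显示器、一个电脑键盘和一个鼠标组成的电脑。',
    '掌上电脑是一种运行在嵌入式操作系统和内嵌式应用软件之上的、小巧、轻便、易带、实用、价廉的手持式计算设备。',
]
input_texts = queries + documents

tokenizer = AutoTokenizer.from_pretrained("infly/inf-wse-v1-base-zh", trust_remote_code=True, use_fast=False)
model = AutoModel.from_pretrained("infly/inf-wse-v1-base-zh", trust_remote_code=True)
model.eval()

input_batch = tokenizer(input_texts, padding=True, max_length=512, truncation=True, return_tensors="pt")
with torch.no_grad():
    # return_sparse=False yields a dense [batch, vocab_size] tensor of per-token weights
    embeddings = model(input_batch['input_ids'], input_batch['attention_mask'], return_sparse=False)

# Assumed scoring: relevance as the overlap between query and document token weights.
scores = embeddings[:len(queries)] @ embeddings[len(queries):].T
print(scores.tolist())
```

Feeding the resulting `embeddings` tensor to `convert_embeddings_to_weights` (shown in the diff) then yields the per-token lexical weights printed in the example output above.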
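
The added evaluation note states that all non-BM25 results were measured by building the sparse index with Qdrant. As a rough illustration only (the README itself does not show this code), the sketch below reuses `embeddings`, `queries`, and `documents` from the previous sketch and pushes the non-zero token weights into a Qdrant sparse index; the collection name `inf_wse_demo` and vector name `text` are invented for the example, and the sparse-vector API follows the documented `qdrant-client` interface (1.7+), which may differ in other client versions.

```python
# Rough illustration only (not from the README): index the document-side token weights
# from the previous sketch in Qdrant and query them with a query-side vector.
import torch
from qdrant_client import QdrantClient, models

def to_sparse(vec: torch.Tensor) -> models.SparseVector:
    # Keep only non-zero dimensions: index = token id, value = token weight.
    idx = vec.nonzero(as_tuple=True)[0]
    return models.SparseVector(indices=idx.tolist(), values=vec[idx].tolist())

client = QdrantClient(":memory:")  # in-process instance, enough for a demo
client.create_collection(
    collection_name="inf_wse_demo",   # hypothetical name for this example
    vectors_config={},                # sparse vectors only, no dense vectors
    sparse_vectors_config={"text": models.SparseVectorParams()},
)

# `embeddings` rows 0..len(queries)-1 are the queries, the remaining rows the documents.
client.upsert(
    collection_name="inf_wse_demo",
    points=[
        models.PointStruct(id=i,
                           vector={"text": to_sparse(embeddings[len(queries) + i])},
                           payload={"text": doc})
        for i, doc in enumerate(documents)
    ],
)

hits = client.search(
    collection_name="inf_wse_demo",
    query_vector=models.NamedSparseVector(name="text", vector=to_sparse(embeddings[0])),
    limit=2,
)
for hit in hits:
    print(hit.score, hit.payload["text"])
```

Storing only the non-zero (token id, weight) pairs is what makes the index sparse; scoring is then a dot product over the token ids shared by query and document.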