Commit 5765f25 (committed by exaggerated) · 1 Parent(s): 81bf11c

Upload 10 files

README.md CHANGED
@@ -1,3 +1,124 @@
  ---
+ pipeline_tag: sentence-similarity
  license: apache-2.0
+ tags:
+ - text2vec
+ - feature-extraction
+ - sentence-similarity
+ - transformers
  ---
+ # shibing624/text2vec-base-chinese
+ This is a CoSENT (Cosine Sentence) model: shibing624/text2vec-base-chinese.
+
+ It maps sentences to a 768-dimensional dense vector space and can be used for tasks
+ like sentence embeddings, text matching or semantic search.
+
+
+ ## Evaluation
+ For an automated evaluation of this model, see the *Evaluation Benchmark*: [text2vec](https://github.com/shibing624/text2vec)
+
+ - Chinese text matching task:
+
+ | Model Name | ATEC | BQ | LCQMC | PAWSX | STS-B | Avg | QPS |
+ | :---- | :-: | :-: | :-: | :-: | :-: | :-: | :-: |
+ | w2v-light-tencent-chinese | 20.00 | 31.49 | 59.46 | 2.57 | 55.78 | 33.86 | 10283 |
+ | paraphrase-multilingual-MiniLM-L12-v2 | 18.42 | 38.52 | 63.96 | 10.14 | 78.90 | 41.99 | 2371 |
+ | text2vec-base-chinese | 31.93 | 42.67 | 70.16 | 17.21 | 79.30 | **48.25** | 2572 |
+
+
+ ## Usage (text2vec)
+ Using this model becomes easy when you have [text2vec](https://github.com/shibing624/text2vec) installed:
+
+ ```
+ pip install -U text2vec
+ ```
+
+ Then you can use the model like this:
+
+ ```python
+ from text2vec import SentenceModel
+ sentences = ['如何更换花呗绑定银行卡', '花呗更改绑定银行卡']
+
+ model = SentenceModel('shibing624/text2vec-base-chinese')
+ embeddings = model.encode(sentences)
+ print(embeddings)
+ ```
+
+ ## Usage (HuggingFace Transformers)
+ Without [text2vec](https://github.com/shibing624/text2vec), you can use the model like this:
+
+ First, you pass your input through the transformer model, then you have to apply the right pooling operation on top of the contextualized word embeddings.
+
+ Install transformers:
+ ```
+ pip install transformers
+ ```
+
+ Then load the model and predict:
+ ```python
+ from transformers import BertTokenizer, BertModel
+ import torch
+
+ # Mean Pooling - Take attention mask into account for correct averaging
+ def mean_pooling(model_output, attention_mask):
+     token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
+     input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
+     return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
+
+ # Load model from HuggingFace Hub
+ tokenizer = BertTokenizer.from_pretrained('shibing624/text2vec-base-chinese')
+ model = BertModel.from_pretrained('shibing624/text2vec-base-chinese')
+ sentences = ['如何更换花呗绑定银行卡', '花呗更改绑定银行卡']
+ # Tokenize sentences
+ encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
+
+ # Compute token embeddings
+ with torch.no_grad():
+     model_output = model(**encoded_input)
+ # Perform pooling. In this case, mean pooling.
+ sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
+ print("Sentence embeddings:")
+ print(sentence_embeddings)
+ ```
+
+ ## Usage (sentence-transformers)
+ [sentence-transformers](https://github.com/UKPLab/sentence-transformers) is a popular library to compute dense vector representations for sentences.
+
+ Install sentence-transformers:
+ ```
+ pip install -U sentence-transformers
+ ```
+
+ Then load the model and predict:
+
+ ```python
+ from sentence_transformers import SentenceTransformer
+
+ m = SentenceTransformer("shibing624/text2vec-base-chinese")
+ sentences = ['如何更换花呗绑定银行卡', '花呗更改绑定银行卡']
+
+ sentence_embeddings = m.encode(sentences)
+ print("Sentence embeddings:")
+ print(sentence_embeddings)
+ ```
+
+
+ ## Full Model Architecture
+ ```
+ CoSENT(
+   (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: BertModel
+   (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_mean_tokens': True})
+ )
+ ```
+ ## Citing & Authors
+ This model was trained by [text2vec](https://github.com/shibing624/text2vec).
+
+ If you find this model helpful, feel free to cite:
+ ```bibtex
+ @software{text2vec,
+   author = {Xu Ming},
+   title = {text2vec: A Tool for Text to Vector},
+   year = {2022},
+   url = {https://github.com/shibing624/text2vec},
+ }
+ ```
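Note on the README: it says the embeddings can be used for text matching and semantic search, but every usage example stops at printing the vectors. A minimal sketch of the usual next step, comparing two embeddings by cosine similarity, might look like the following. The function name `cosine_similarity` and the 4-dimensional toy vectors are assumptions for illustration only; real embeddings from this model would have 768 dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen for this sketch when a vector is all zeros
    return dot / (norm_u * norm_v)


# Toy stand-ins for sentence embeddings (hypothetical values, not model output).
emb_a = [0.1, 0.3, -0.2, 0.7]
emb_b = [0.1, 0.25, -0.1, 0.8]
emb_c = [-0.6, 0.1, 0.9, -0.2]

sim_ab = cosine_similarity(emb_a, emb_b)
sim_ac = cosine_similarity(emb_a, emb_c)
print(f"a vs b: {sim_ab:.3f}")
print(f"a vs c: {sim_ac:.3f}")
```

With real output, each row of `sentence_embeddings` would be converted to a list (or the same formula applied with tensor operations) before the comparison.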
config.json ADDED
@@ -0,0 +1,32 @@
+ {
+   "_name_or_path": "hfl/chinese-macbert-base",
+   "architectures": [
+     "BertModel"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "classifier_dropout": null,
+   "directionality": "bidi",
+   "gradient_checkpointing": false,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 768,
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "layer_norm_eps": 1e-12,
+   "max_position_embeddings": 512,
+   "model_type": "bert",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "pad_token_id": 0,
+   "pooler_fc_size": 768,
+   "pooler_num_attention_heads": 12,
+   "pooler_num_fc_layers": 3,
+   "pooler_size_per_head": 128,
+   "pooler_type": "first_token_transform",
+   "position_embedding_type": "absolute",
+   "torch_dtype": "float32",
+   "transformers_version": "4.12.3",
+   "type_vocab_size": 2,
+   "use_cache": true,
+   "vocab_size": 21128
+ }
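Note on config.json: a few quantities that follow directly from these fields can be derived with plain arithmetic. The sketch below embeds a subset of the config as a string so it runs without the repository; the variable names `head_dim`, `ffn_ratio`, and `token_embedding_params` are illustrative, not fields of the file.

```python
import json

# Subset of the config.json added in this commit, inlined for a self-contained example.
config_text = """
{
  "hidden_size": 768,
  "intermediate_size": 3072,
  "max_position_embeddings": 512,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "vocab_size": 21128
}
"""
config = json.loads(config_text)

# In BERT-style attention the hidden size is split evenly across the heads.
if config["hidden_size"] % config["num_attention_heads"] != 0:
    raise ValueError("hidden_size must be divisible by num_attention_heads")
head_dim = config["hidden_size"] // config["num_attention_heads"]

# Width of the feed-forward layer relative to the hidden size.
ffn_ratio = config["intermediate_size"] // config["hidden_size"]

# Parameter count of the token embedding matrix alone (vocab_size x hidden_size).
token_embedding_params = config["vocab_size"] * config["hidden_size"]

print("per-head dimension:", head_dim)
print("feed-forward expansion:", ffn_ratio)
print("token embedding parameters:", token_embedding_params)
```

The `hidden_size` of 768 matches the 768-dimensional sentence vectors described in the README, since mean pooling averages token vectors without changing their width.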
logs.txt ADDED
@@ -0,0 +1,19 @@
+ Epoch:0 Valid| corr: 0.794410
+ Epoch:0 Valid| corr: 0.691819
+ Epoch:1 Valid| corr: 0.722749
+ Epoch:2 Valid| corr: 0.735054
+ Epoch:3 Valid| corr: 0.738295
+ Epoch:4 Valid| corr: 0.739411
+ Test | corr: 0.679971
+ Epoch:0 Valid| corr: 0.817416
+ Epoch:1 Valid| corr: 0.832376
+ Epoch:2 Valid| corr: 0.842308
+ Epoch:3 Valid| corr: 0.843520
+ Epoch:4 Valid| corr: 0.841837
+ Test | corr: 0.793495
+ Epoch:0 Valid| corr: 0.814648
+ Epoch:1 Valid| corr: 0.831609
+ Epoch:2 Valid| corr: 0.841678
+ Epoch:3 Valid| corr: 0.842387
+ Epoch:4 Valid| corr: 0.841435
+ Test | corr: 0.794840
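Note on logs.txt: the file appears to hold several training runs back to back. A small parser can summarize them, as sketched below. It assumes, without confirmation from the commit, that each `Test` line closes one run; the helper name `summarize_runs` and the regular expression are hypothetical and are fitted only to the line shapes visible in this diff.

```python
import re

LOG_TEXT = """\
Epoch:0 Valid| corr: 0.794410
Epoch:0 Valid| corr: 0.691819
Epoch:1 Valid| corr: 0.722749
Epoch:2 Valid| corr: 0.735054
Epoch:3 Valid| corr: 0.738295
Epoch:4 Valid| corr: 0.739411
Test | corr: 0.679971
Epoch:0 Valid| corr: 0.817416
Epoch:1 Valid| corr: 0.832376
Epoch:2 Valid| corr: 0.842308
Epoch:3 Valid| corr: 0.843520
Epoch:4 Valid| corr: 0.841837
Test | corr: 0.793495
Epoch:0 Valid| corr: 0.814648
Epoch:1 Valid| corr: 0.831609
Epoch:2 Valid| corr: 0.841678
Epoch:3 Valid| corr: 0.842387
Epoch:4 Valid| corr: 0.841435
Test | corr: 0.794840
"""

# Matches either "Epoch:<n> Valid| corr: <x>" or "Test | corr: <x>".
LINE_RE = re.compile(
    r"^(?:Epoch:(?P<epoch>\d+)\s+Valid|Test)\s*\|\s*corr:\s*(?P<corr>\d+\.\d+)$"
)


def summarize_runs(text):
    """Group validation scores into runs, assuming a Test line ends each run."""
    runs, valid_scores = [], []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip anything that is not a score line
        corr = float(match.group("corr"))
        if match.group("epoch") is not None:
            valid_scores.append(corr)
        else:
            best = max(valid_scores) if valid_scores else None
            runs.append({"best_valid": best, "test": corr})
            valid_scores = []
    return runs


runs = summarize_runs(LOG_TEXT)
for i, run in enumerate(runs, start=1):
    print(f"run {i}: best valid corr={run['best_valid']}, test corr={run['test']}")
```

The log does not say which dataset or correlation measure each run uses, so the summary only reports the numbers as logged.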
modules.json ADDED
@@ -0,0 +1,14 @@
+ [
+   {
+     "idx": 0,
+     "name": "0",
+     "path": "",
+     "type": "sentence_transformers.models.Transformer"
+   },
+   {
+     "idx": 1,
+     "name": "1",
+     "path": "1_Pooling",
+     "type": "sentence_transformers.models.Pooling"
+   }
+ ]
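Note on modules.json: the two entries line up with the "Full Model Architecture" section of the README, a Transformer module followed by a Pooling module. The sketch below reads the list in `idx` order and prints the stage names. The helper `pipeline_stages` is hypothetical, and treating an empty `path` as the repository root is an assumption about how the file is interpreted, shown here as `.`.

```python
import json

# Contents of the modules.json added in this commit, inlined for a self-contained example.
MODULES_JSON = """
[
  {"idx": 0, "name": "0", "path": "", "type": "sentence_transformers.models.Transformer"},
  {"idx": 1, "name": "1", "path": "1_Pooling", "type": "sentence_transformers.models.Pooling"}
]
"""


def pipeline_stages(modules_text):
    """Return (class name, path) pairs ordered by the idx field."""
    modules = sorted(json.loads(modules_text), key=lambda m: m["idx"])
    # Keep only the last dotted component of the type, e.g. "Pooling".
    return [(m["type"].rsplit(".", 1)[-1], m["path"] or ".") for m in modules]


stages = pipeline_stages(MODULES_JSON)
print(" -> ".join(name for name, _ in stages))
```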
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:54ff3a857e3efa0b8114eb5e7a9e7e2b6230b4ddb083254a751e44772bb99075
+ size 409154033
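Note on pytorch_model.bin: what the diff shows is a Git LFS pointer file, not the weights themselves. Each line is a key followed by a space and a value, so it can be parsed in a few lines, as sketched below. The function name `parse_lfs_pointer` and the returned dictionary keys are assumptions made for this example.

```python
POINTER_TEXT = """\
version https://git-lfs.github.com/spec/v1
oid sha256:54ff3a857e3efa0b8114eb5e7a9e7e2b6230b4ddb083254a751e44772bb99075
size 409154033
"""


def parse_lfs_pointer(text):
    """Split a Git LFS pointer file into its version, hash, and size fields."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid value has the form "<hash algorithm>:<hex digest>".
    algorithm, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algorithm": algorithm,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


pointer = parse_lfs_pointer(POINTER_TEXT)
size_mib = pointer["size_bytes"] / 2**20
print(pointer["hash_algorithm"], pointer["digest"][:12], f"{size_mib:.1f} MiB")
```

The `size` field is in bytes, so the pointer alone is enough to estimate the download size before fetching the real file.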
rust_model.ot ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ef6c0545c58ffb71777d1880df4fd5b18d54a38f8314e278cad3adb2e10d0f72
+ size 409136819
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
+ {
+   "max_seq_length": 128,
+   "do_lower_case": false
+ }
special_tokens_map.json ADDED
@@ -0,0 +1 @@
+ {"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]"}
tokenizer_config.json ADDED
@@ -0,0 +1 @@
+ {"do_lower_case": true, "do_basic_tokenize": true, "never_split": null, "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]", "tokenize_chinese_chars": true, "strip_accents": null, "name_or_path": "hfl/chinese-macbert-base", "tokenizer_class": "BertTokenizer"}
vocab.txt ADDED
The diff for this file is too large to render. See raw diff