mssongit committed · verified
Commit 13ff335 · 1 Parent(s): ead155c

Upload 12 files
1_Pooling/config.json ADDED
@@ -0,0 +1,7 @@
+ {
+   "word_embedding_dimension": 768,
+   "pooling_mode_cls_token": false,
+   "pooling_mode_mean_tokens": true,
+   "pooling_mode_max_tokens": false,
+   "pooling_mode_mean_sqrt_len_tokens": false
+ }
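This pooling config enables mean pooling only (CLS, max, and sqrt-length pooling are all disabled), so a sentence embedding is the attention-masked average of the encoder's 768-dim token vectors. A minimal sketch of that operation; the function name `mean_pool` and the tensor shapes are illustrative, not part of this commit:

```python
import torch

def mean_pool(token_embeddings: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average each sentence's token vectors, ignoring padding positions."""
    # token_embeddings: (batch, seq_len, 768); attention_mask: (batch, seq_len) of 0/1
    mask = attention_mask.unsqueeze(-1).type_as(token_embeddings)  # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(dim=1)                  # sum over real tokens only
    count = mask.sum(dim=1).clamp(min=1e-9)                        # number of real tokens
    return summed / count                                          # (batch, 768) sentence embeddings
```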
README.md CHANGED
@@ -1,24 +1,16 @@
  ---
+ library_name: sentence-transformers
  pipeline_tag: sentence-similarity
  tags:
  - sentence-transformers
  - feature-extraction
  - sentence-similarity
  - transformers
- datasets:
- - kornlu
- language:
- - ko
- license: cc-by-4.0
- ---
-
- # bi-matrix/gmatrix-embedding
-
- This model is built from the [KF-DeBERTa](https://huggingface.co/kakaobank/kf-deberta-base) model and the KorSTS and KorNLI datasets, trained with the [continue-learning](https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/sts/training_stsbenchmark_continue_training.py) recipe introduced in the official sentence-transformers documentation, as follows:
- 1. 10 epochs of multi-task learning, using MultipleNegativesRankingLoss on the NLI dataset after negative sampling and CosineSimilarityLoss on the STS dataset
- 2. 4 additional epochs of multi-task training with the learning rate reduced to 1e-06
-
- ---
+
+ ---
+
+ # {MODEL_NAME}
+
  This is a [sentence-transformers](https://www.SBERT.net) model: It maps sentences & paragraphs to a 768 dimensional dense vector space and can be used for tasks like clustering or semantic search.

  <!--- Describe your model here -->
@@ -37,7 +29,7 @@ Then you can use the model like this:
  from sentence_transformers import SentenceTransformer
  sentences = ["This is an example sentence", "Each sentence is converted"]

- model = SentenceTransformer("bi-matrix/gmatrix-embedding")
+ model = SentenceTransformer('{MODEL_NAME}')
  embeddings = model.encode(sentences)
  print(embeddings)
  ```
@@ -63,8 +55,8 @@ def mean_pooling(model_output, attention_mask):
  sentences = ['This is an example sentence', 'Each sentence is converted']

  # Load model from HuggingFace Hub
- tokenizer = AutoTokenizer.from_pretrained("bi-matrix/gmatrix-embedding")
- model = AutoModel.from_pretrained("bi-matrix/gmatrix-embedding")
+ tokenizer = AutoTokenizer.from_pretrained('{MODEL_NAME}')
+ model = AutoModel.from_pretrained('{MODEL_NAME}')

  # Tokenize sentences
  encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
@@ -81,102 +73,69 @@ print(sentence_embeddings)
  ```


- ## Evaluation Results
-
- <!--- Describe how your model was evaluated -->
-
- Results on the KorSTS evaluation dataset:
-
- - Cosine Pearson: 85.77
- - Cosine Spearman: 86.30
- - Manhattan Pearson: 84.84
- - Manhattan Spearman: 85.33
- - Euclidean Pearson: 84.82
- - Euclidean Spearman: 85.29
- - Dot Pearson: 83.19
- - Dot Spearman: 83.19
-
- <br>
-
- |model|cosine_pearson|cosine_spearman|euclidean_pearson|euclidean_spearman|manhattan_pearson|manhattan_spearman|dot_pearson|dot_spearman|
- |:-------------------------|-----------------:|------------------:|--------------------:|---------------------:|--------------------:|---------------------:|--------------:|---------------:|
- |[**gmatrix-embedding**](https://huggingface.co/bi-matrix/gmatrix-embedding)|**85.77**|**86.30**|**84.82**|**85.29**|**84.84**|**85.33**|**83.19**|**83.19**|
- |[kf-deberta-multitask](https://huggingface.co/upskyy/kf-deberta-multitask)|85.75|86.25|84.79|85.25|84.80|85.27|82.93|82.86|
- |[ko-sroberta-multitask](https://huggingface.co/jhgan/ko-sroberta-multitask)|84.77|85.60|83.71|84.40|83.70|84.38|82.42|82.33|
- |[ko-sbert-multitask](https://huggingface.co/jhgan/ko-sbert-multitask)|84.13|84.71|82.42|82.66|82.41|82.69|80.05|79.69|
- |[ko-sroberta-base-nli](https://huggingface.co/jhgan/ko-sroberta-nli)|82.83|83.85|82.87|83.29|82.88|83.28|80.34|79.69|
- |[ko-sbert-nli](https://huggingface.co/jhgan/ko-sbert-multitask)|82.24|83.16|82.19|82.31|82.18|82.30|79.30|78.78|
- |[ko-sroberta-sts](https://huggingface.co/jhgan/ko-sroberta-sts)|81.84|81.82|81.15|81.25|81.14|81.25|79.09|78.54|
- |[ko-sbert-sts](https://huggingface.co/jhgan/ko-sbert-sts)|81.55|81.23|79.94|79.79|79.90|79.75|76.02|75.31|
-
- <br>

+ ## Evaluation Results

  <!--- Describe how your model was evaluated -->

- Results measured on the G-MATRIX Embedding dataset.
- Three human annotators rated the similarity of each sentence pair from 0 to 5, and the scores were averaged; each model's embeddings were then used to
-
- compute cosine similarity, Euclidean distance, Manhattan distance, and dot product, from which Pearson and Spearman correlation coefficients were calculated.
-
- - Cosine Pearson: 75.86
- - Cosine Spearman: 65.75
- - Manhattan Pearson: 72.65
- - Manhattan Spearman: 65.20
- - Euclidean Pearson: 72.48
- - Euclidean Spearman: 65.32
- - Dot Pearson: 64.71
- - Dot Spearman: 53.90
-
- <br>
+ For an automated evaluation of this model, see the *Sentence Embeddings Benchmark*: [https://seb.sbert.net](https://seb.sbert.net?model_name={MODEL_NAME})

- |model|cosine_pearson|cosine_spearman|euclidean_pearson|euclidean_spearman|manhattan_pearson|manhattan_spearman|dot_pearson|dot_spearman|
- |:-------------------------|-----------------:|------------------:|--------------------:|---------------------:|--------------------:|---------------------:|--------------:|---------------:|
- |[**gmatrix-embedding**](https://huggingface.co/bi-matrix/gmatrix-embedding)|**75.86**|**65.75**|**72.65**|**65.20**|**72.48**|**65.32**|**64.71**|**53.90**|
- |[ko-sroberta-multitask](https://huggingface.co/jhgan/ko-sroberta-multitask)|71.78|63.16|70.80|63.47|70.89|63.72|53.57|44.23|
- |[bge-m3](https://huggingface.co/BAAI/bge-m3)|64.15|60.65|61.88|60.68|61.88|60.19|64.16|60.71|

- <br>
-
-
-
- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/6350f6750b94548566da3279/CcK0QL3oQAz7sJOCtH6PB.png)
-
- <br>
+ ## Training
+ The model was trained with the parameters:

- ## G-MATRIX Embedding labeling criteria (following the STS data construction for KLUE-RoBERTa)
- 1. Judge the degree of similarity between the two sentences on a scale of 0 to 5
- 2. Differences in spelling, spacing, periods, or commas are not considered
- 3. Compare the intent of each sentence and the meaning its expressions convey
- 4. Compare whether the two sentences are similar in meaning, rather than checking whether they share common words
- 5. A score of 0 means no semantic similarity; 5 means the sentences are semantically equivalent
+ **DataLoader**:

+ `sentence_transformers.datasets.NoDuplicatesDataLoader.NoDuplicatesDataLoader` of length 4442 with parameters:
+ ```
+ {'batch_size': 128}
+ ```

+ **Loss**:

- ## Training
- The model was trained with the parameters:
+ `sentence_transformers.losses.MultipleNegativesRankingLoss.MultipleNegativesRankingLoss` with parameters:
+ ```
+ {'scale': 20.0, 'similarity_fct': 'cos_sim'}
+ ```

  **DataLoader**:

- `torch.utils.data.dataloader.DataLoader` of length 329 with parameters:
+ `torch.utils.data.dataloader.DataLoader` of length 719 with parameters:
  ```
- {'batch_size': 32, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}
+ {'batch_size': 8, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}
  ```

  **Loss**:

  `sentence_transformers.losses.CosineSimilarityLoss.CosineSimilarityLoss`

+ Parameters of the fit()-Method:
+ ```
+ {
+     "epochs": 4,
+     "evaluation_steps": 1000,
+     "evaluator": "sentence_transformers.evaluation.EmbeddingSimilarityEvaluator.EmbeddingSimilarityEvaluator",
+     "max_grad_norm": 1.0,
+     "optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
+     "optimizer_params": {
+         "lr": 1e-06
+     },
+     "scheduler": "WarmupLinear",
+     "steps_per_epoch": null,
+     "warmup_steps": 288,
+     "weight_decay": 0.01
+ }
+ ```
+

  ## Full Model Architecture
  ```
  SentenceTransformer(
-   (0): Transformer({'max_seq_length': 128, 'do_lower_case': True}) with Transformer model: DeBERTaV2Model
-   (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
+   (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: DebertaV2Model
+   (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  )
  ```

  ## Citing & Authors

- <!--- Describe where people can find more information -->
- [MINSANG SONG] at [BI-Matrix](https://www.bimatrix.co.kr/)
+ <!--- Describe where people can find more information -->
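The updated Training section above pairs a NoDuplicatesDataLoader (batch size 128) with MultipleNegativesRankingLoss and a plain DataLoader (batch size 8) with CosineSimilarityLoss, trained by fit() for 4 epochs at lr 1e-06 with 288 warmup steps. A rough sentence-transformers 2.x reconstruction of that setup, using tiny synthetic stand-ins for the NLI/STS data so the sketch is self-contained; the real run used the Korean NLI/STS data referenced in the model card:

```python
from sentence_transformers import SentenceTransformer, InputExample, losses, datasets
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator
from torch.utils.data import DataLoader

# Base checkpoint named in config.json ("_name_or_path").
model = SentenceTransformer("upskyy/kf-deberta-multitask")

# Synthetic stand-ins: (anchor, positive, negative) triplets and scored sentence pairs.
nli_examples = [InputExample(texts=[f"anchor {i}", f"positive {i}", f"negative {i}"]) for i in range(256)]
sts_examples = [InputExample(texts=[f"sentence {i}", f"sentence {i}, rephrased"], label=1.0) for i in range(32)]

nli_loader = datasets.NoDuplicatesDataLoader(nli_examples, batch_size=128)
nli_loss = losses.MultipleNegativesRankingLoss(model, scale=20.0)

sts_loader = DataLoader(sts_examples, shuffle=True, batch_size=8)
sts_loss = losses.CosineSimilarityLoss(model)

dev_evaluator = EmbeddingSimilarityEvaluator(
    ["A plane is taking off.", "A man is playing a flute."],
    ["An air plane is taking off.", "A man is playing a guitar."],
    [0.95, 0.35],
    name="sts-dev",
)

model.fit(
    train_objectives=[(nli_loader, nli_loss), (sts_loader, sts_loss)],  # multi-task: MNR + cosine STS
    evaluator=dev_evaluator,
    epochs=4,
    evaluation_steps=1000,
    warmup_steps=288,
    optimizer_params={"lr": 1e-06},
    weight_decay=0.01,
    max_grad_norm=1.0,
)
```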
 
config.json ADDED
@@ -0,0 +1,37 @@
+ {
+   "_name_or_path": "upskyy/kf-deberta-multitask",
+   "architectures": [
+     "DebertaV2Model"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "conv_act": "gelu",
+   "conv_kernel_size": 0,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 768,
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "layer_norm_eps": 1e-07,
+   "max_position_embeddings": 512,
+   "max_relative_positions": -1,
+   "model_type": "deberta-v2",
+   "norm_rel_ebd": "layer_norm",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "pad_token_id": 0,
+   "pooler_dropout": 0,
+   "pooler_hidden_act": "gelu",
+   "pooler_hidden_size": 768,
+   "pos_att_type": [
+     "p2c",
+     "c2p"
+   ],
+   "position_biased_input": false,
+   "position_buckets": 256,
+   "relative_attention": true,
+   "share_att_key": true,
+   "torch_dtype": "float32",
+   "transformers_version": "4.41.0",
+   "type_vocab_size": 0,
+   "vocab_size": 130000
+ }
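config.json pins the backbone to a 12-layer, 768-hidden DeBERTa-v2 encoder with a 130,000-token vocabulary, inherited from upskyy/kf-deberta-multitask. A small sanity-check sketch, assuming the commit has been cloned locally (the "." path below is a placeholder for that checkout, not a published model id):

```python
from transformers import AutoConfig, AutoModel

repo_dir = "."  # placeholder: local clone of this repository

config = AutoConfig.from_pretrained(repo_dir)
assert config.model_type == "deberta-v2"
assert (config.hidden_size, config.num_hidden_layers, config.num_attention_heads) == (768, 12, 12)
assert config.vocab_size == 130000

# Instantiates DebertaV2Model and loads the weights from model.safetensors.
model = AutoModel.from_pretrained(repo_dir)
print(sum(p.numel() for p in model.parameters()))  # ~185M float32 params, matching the 741 MB safetensors file
```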
config_sentence_transformers.json ADDED
@@ -0,0 +1,9 @@
+ {
+   "__version__": {
+     "sentence_transformers": "2.2.2",
+     "transformers": "4.36.1",
+     "pytorch": "1.11.0"
+   },
+   "prompts": {},
+   "default_prompt_name": null
+ }
eval/similarity_evaluation_sts-dev_results.csv ADDED
@@ -0,0 +1,15 @@
+ epoch,steps,cosine_pearson,cosine_spearman,euclidean_pearson,euclidean_spearman,manhattan_pearson,manhattan_spearman,dot_pearson,dot_spearman
+ 0,1000,0.87424182724437,0.873113883627151,0.8626342456499717,0.8683649687856948,0.862234101221366,0.8679090218497976,0.8420911984535485,0.8388683728199267
+ 0,-1,0.8704544232547093,0.8711604107071151,0.8618454435894635,0.8664964585911143,0.8613477153921905,0.8661491861079217,0.8390755643453752,0.8367505848740203
+ 1,1000,0.8653159391248947,0.8658128143759094,0.8583427277521668,0.8636325864952359,0.8577919927964012,0.8632203032907328,0.8320704397644403,0.8285392001745013
+ 1,-1,0.8664721939672407,0.867188827810382,0.8574375558592068,0.8622162330750675,0.8572350229431882,0.8620521245434254,0.8304817808652534,0.8276482440404608
+ 2,1000,0.8714413815604427,0.8728773176225486,0.8606519719208473,0.8661686113883175,0.8597764457766773,0.8650461871211873,0.8402147325095083,0.8390394174225602
+ 2,-1,0.8705943005027337,0.8702975560007158,0.8607474787071386,0.866365162116999,0.8600589682539279,0.8655968405285576,0.8314075606364041,0.8294845620402314
+ 3,1000,0.8732898338039496,0.8727113406361378,0.8618326659582233,0.867660640894365,0.8609977907811955,0.866602343006294,0.836684284412931,0.8344851740499123
+ 3,-1,0.8706550381317323,0.8703280791541039,0.8594185149543394,0.8655157123583306,0.858653349453231,0.8647258505997637,0.8309840916492293,0.8295422738240362
+ 4,1000,0.8704857775744986,0.8704915027579151,0.8587016084316031,0.8650513081211857,0.8579561230558775,0.8640968092375928,0.8350023458985515,0.8333924241815882
+ 4,-1,0.8711526078122271,0.8711521971781401,0.8595390375115806,0.8659608852653925,0.858776265520779,0.8649677155364356,0.8349352221295189,0.8333007621962971
+ 0,-1,0.8763115872556583,0.8761023572835317,0.8669522940489013,0.8720087690155365,0.8663943699721328,0.8712943539266835,0.8412022905614112,0.8384612426442756
+ 1,-1,0.8763205288075023,0.8760076739421485,0.8662215668559664,0.8714828451613065,0.8656997872065376,0.8709107415595765,0.8417178065990268,0.8393132450146403
+ 2,-1,0.8764778415959709,0.8761420637174745,0.8666452983853821,0.8718069663837321,0.8661081380378363,0.8711184783592194,0.8417928269640543,0.8392738072648638
+ 3,-1,0.8761662401985245,0.8758175528104466,0.8664189381023497,0.8715279651845382,0.865886230191413,0.8709402326833412,0.8416127232242585,0.838971926272541
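Each row of this CSV is one EmbeddingSimilarityEvaluator pass during fit(): `steps` is 1000 (matching `evaluation_steps`) or -1 for the end-of-epoch pass, and the remaining columns are Pearson/Spearman correlations for cosine, Euclidean, Manhattan, and dot-product similarities against the gold scores. The trailing rows that restart at epoch 0 look like a second fit() run appended to the same file, consistent with the continue-training recipe. A minimal sketch of how such a file is written, with a toy dev set standing in for the real STS dev split:

```python
import os

from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

model = SentenceTransformer("upskyy/kf-deberta-multitask")  # stand-in for the trained model

# Toy dev pairs with gold similarity scores scaled to [0, 1].
evaluator = EmbeddingSimilarityEvaluator(
    sentences1=["A plane is taking off.", "A man is playing a flute."],
    sentences2=["An air plane is taking off.", "A man is playing a guitar."],
    scores=[0.95, 0.35],
    name="sts-dev",
)

os.makedirs("eval", exist_ok=True)
# Appends one row to eval/similarity_evaluation_sts-dev_results.csv.
evaluator(model, output_path="eval", epoch=0, steps=1000)
```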
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:905ee97ae34c07044a3af264c045fc33c4663025153db50bfe0049a765b41957
+ size 741185640
modules.json ADDED
@@ -0,0 +1,14 @@
+ [
+   {
+     "idx": 0,
+     "name": "0",
+     "path": "",
+     "type": "sentence_transformers.models.Transformer"
+   },
+   {
+     "idx": 1,
+     "name": "1",
+     "path": "1_Pooling",
+     "type": "sentence_transformers.models.Pooling"
+   }
+ ]
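modules.json is the manifest SentenceTransformer uses to assemble the pipeline: module 0 is the transformer encoder stored at the repository root, module 1 the pooling layer configured under 1_Pooling/. Building an equivalent two-module stack by hand looks roughly like this (the base checkpoint name is taken from config.json; the other values mirror the committed configs):

```python
from sentence_transformers import SentenceTransformer, models

# Module 0: DeBERTa-v2 encoder; sentence_bert_config.json supplies max_seq_length=128.
word_embedding_model = models.Transformer("upskyy/kf-deberta-multitask", max_seq_length=128)

# Module 1: mean pooling over the 768-dim token embeddings, as in 1_Pooling/config.json.
pooling_model = models.Pooling(
    word_embedding_model.get_word_embedding_dimension(),
    pooling_mode_mean_tokens=True,
    pooling_mode_cls_token=False,
    pooling_mode_max_tokens=False,
)

model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
print(model)  # SentenceTransformer((0): Transformer(...), (1): Pooling(...))
```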
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
+ {
+   "max_seq_length": 128,
+   "do_lower_case": false
+ }
special_tokens_map.json ADDED
@@ -0,0 +1,51 @@
+ {
+   "bos_token": {
+     "content": "[CLS]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "cls_token": {
+     "content": "[CLS]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "eos_token": {
+     "content": "[SEP]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "mask_token": {
+     "content": "[MASK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "[PAD]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "sep_token": {
+     "content": "[SEP]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "[UNK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,66 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "[PAD]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "[UNK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "[CLS]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "3": {
+       "content": "[SEP]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "4": {
+       "content": "[MASK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "bos_token": "[CLS]",
+   "clean_up_tokenization_spaces": true,
+   "cls_token": "[CLS]",
+   "do_basic_tokenize": true,
+   "do_lower_case": false,
+   "eos_token": "[SEP]",
+   "mask_token": "[MASK]",
+   "max_length": 128,
+   "model_max_length": 512,
+   "never_split": null,
+   "pad_to_multiple_of": null,
+   "pad_token": "[PAD]",
+   "pad_token_type_id": 0,
+   "padding_side": "right",
+   "sep_token": "[SEP]",
+   "stride": 0,
+   "strip_accents": null,
+   "tokenize_chinese_chars": true,
+   "tokenizer_class": "BertTokenizer",
+   "truncation_side": "right",
+   "truncation_strategy": "longest_first",
+   "unk_token": "[UNK]"
+ }
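The tokenizer is a WordPiece BertTokenizer with the usual [CLS]/[SEP]/[PAD]/[UNK]/[MASK] specials, `model_max_length` 512, and no lowercasing, while sentence_bert_config.json caps the effective sequence length at 128. A short usage sketch, again loading from a local clone of the repository (the "." path is a placeholder):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(".")  # placeholder: local clone of this repository

encoded = tokenizer(
    ["두 문장의 의미가 유사한지 비교합니다."],
    padding=True,
    truncation=True,
    max_length=128,   # matches max_seq_length in sentence_bert_config.json
    return_tensors="pt",
)
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"][0]))  # [CLS] ... [SEP]
```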
vocab.txt ADDED
The diff for this file is too large to render. See raw diff