mssongit committed · verified
Commit 13ff335 · 1 Parent(s): ead155c

Upload 12 files
1_Pooling/config.json ADDED
@@ -0,0 +1,7 @@
+ {
+   "word_embedding_dimension": 768,
+   "pooling_mode_cls_token": false,
+   "pooling_mode_mean_tokens": true,
+   "pooling_mode_max_tokens": false,
+   "pooling_mode_mean_sqrt_len_tokens": false
+ }
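This pooling config enables mean pooling only (CLS, max, and sqrt-length pooling are all disabled), so a sentence embedding is the attention-masked average of the encoder's 768-dim token vectors. A minimal sketch of that operation; the function name `mean_pool` and the tensor shapes are illustrative, not part of this commit:

```python
import torch

def mean_pool(token_embeddings: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average each sentence's token vectors, ignoring padding positions."""
    # token_embeddings: (batch, seq_len, 768); attention_mask: (batch, seq_len) of 0/1
    mask = attention_mask.unsqueeze(-1).type_as(token_embeddings)  # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(dim=1)                  # sum over real tokens only
    count = mask.sum(dim=1).clamp(min=1e-9)                        # number of real tokens
    return summed / count                                          # (batch, 768) sentence embeddings
```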
README.md CHANGED
@@ -1,24 +1,16 @@
  ---
+ library_name: sentence-transformers
  pipeline_tag: sentence-similarity
  tags:
  - sentence-transformers
  - feature-extraction
  - sentence-similarity
  - transformers
- datasets:
- - kornlu
- language:
- - ko
- license: cc-by-4.0
- ---
-
- # bi-matrix/gmatrix-embedding
-
- This model is built from the [KF-DeBERTa](https://huggingface.co/kakaobank/kf-deberta-base) model and the KorSTS and KorNLI datasets, trained with the [continue-learning](https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/sts/training_stsbenchmark_continue_training.py) recipe introduced in the official sentence-transformers documentation, as follows:
- 1. 10 epochs of multi-task learning, using MultipleNegativesRankingLoss on the NLI dataset after negative sampling and CosineSimilarityLoss on the STS dataset
- 2. 4 additional epochs of multi-task training with the learning rate reduced to 1e-06
-
- ---
+
+ ---
+
+ # {MODEL_NAME}
+
  This is a [sentence-transformers](https://www.SBERT.net) model: It maps sentences & paragraphs to a 768 dimensional dense vector space and can be used for tasks like clustering or semantic search.

  <!--- Describe your model here -->
@@ -37,7 +29,7 @@ Then you can use the model like this:
  from sentence_transformers import SentenceTransformer
  sentences = ["This is an example sentence", "Each sentence is converted"]

- model = SentenceTransformer("bi-matrix/gmatrix-embedding")
+ model = SentenceTransformer('{MODEL_NAME}')
  embeddings = model.encode(sentences)
  print(embeddings)
  ```
@@ -63,8 +55,8 @@ def mean_pooling(model_output, attention_mask):
  sentences = ['This is an example sentence', 'Each sentence is converted']

  # Load model from HuggingFace Hub
- tokenizer = AutoTokenizer.from_pretrained("bi-matrix/gmatrix-embedding")
- model = AutoModel.from_pretrained("bi-matrix/gmatrix-embedding")
+ tokenizer = AutoTokenizer.from_pretrained('{MODEL_NAME}')
+ model = AutoModel.from_pretrained('{MODEL_NAME}')

  # Tokenize sentences
  encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
@@ -81,102 +73,69 @@ print(sentence_embeddings)
  ```


- ## Evaluation Results
-
- <!--- Describe how your model was evaluated -->
-
- Results on the KorSTS evaluation dataset:
-
- - Cosine Pearson: 85.77
- - Cosine Spearman: 86.30
- - Manhattan Pearson: 84.84
- - Manhattan Spearman: 85.33
- - Euclidean Pearson: 84.82
- - Euclidean Spearman: 85.29
- - Dot Pearson: 83.19
- - Dot Spearman: 83.19
-
- <br>
-
- |model|cosine_pearson|cosine_spearman|euclidean_pearson|euclidean_spearman|manhattan_pearson|manhattan_spearman|dot_pearson|dot_spearman|
- |:-------------------------|-----------------:|------------------:|--------------------:|---------------------:|--------------------:|---------------------:|--------------:|---------------:|
- |[**gmatrix-embedding**](https://huggingface.co/bi-matrix/gmatrix-embedding)|**85.77**|**86.30**|**84.82**|**85.29**|**84.84**|**85.33**|**83.19**|**83.19**|
- |[kf-deberta-multitask](https://huggingface.co/upskyy/kf-deberta-multitask)|85.75|86.25|84.79|85.25|84.80|85.27|82.93|82.86|
- |[ko-sroberta-multitask](https://huggingface.co/jhgan/ko-sroberta-multitask)|84.77|85.60|83.71|84.40|83.70|84.38|82.42|82.33|
- |[ko-sbert-multitask](https://huggingface.co/jhgan/ko-sbert-multitask)|84.13|84.71|82.42|82.66|82.41|82.69|80.05|79.69|
- |[ko-sroberta-base-nli](https://huggingface.co/jhgan/ko-sroberta-nli)|82.83|83.85|82.87|83.29|82.88|83.28|80.34|79.69|
- |[ko-sbert-nli](https://huggingface.co/jhgan/ko-sbert-multitask)|82.24|83.16|82.19|82.31|82.18|82.30|79.30|78.78|
- |[ko-sroberta-sts](https://huggingface.co/jhgan/ko-sroberta-sts)|81.84|81.82|81.15|81.25|81.14|81.25|79.09|78.54|
- |[ko-sbert-sts](https://huggingface.co/jhgan/ko-sbert-sts)|81.55|81.23|79.94|79.79|79.90|79.75|76.02|75.31|
-
- <br>

+ ## Evaluation Results

  <!--- Describe how your model was evaluated -->

- Results measured on the G-MATRIX Embedding dataset.
- Three human annotators rated the similarity of each sentence pair from 0 to 5, and the scores were averaged; each model's embeddings were then used to
-
- compute cosine similarity, Euclidean distance, Manhattan distance, and dot product, from which Pearson and Spearman correlation coefficients were calculated.
-
- - Cosine Pearson: 75.86
- - Cosine Spearman: 65.75
- - Manhattan Pearson: 72.65
- - Manhattan Spearman: 65.20
- - Euclidean Pearson: 72.48
- - Euclidean Spearman: 65.32
- - Dot Pearson: 64.71
- - Dot Spearman: 53.90
-
- <br>
+ For an automated evaluation of this model, see the *Sentence Embeddings Benchmark*: [https://seb.sbert.net](https://seb.sbert.net?model_name={MODEL_NAME})

- |model|cosine_pearson|cosine_spearman|euclidean_pearson|euclidean_spearman|manhattan_pearson|manhattan_spearman|dot_pearson|dot_spearman|
- |:-------------------------|-----------------:|------------------:|--------------------:|---------------------:|--------------------:|---------------------:|--------------:|---------------:|
- |[**gmatrix-embedding**](https://huggingface.co/bi-matrix/gmatrix-embedding)|**75.86**|**65.75**|**72.65**|**65.20**|**72.48**|**65.32**|**64.71**|**53.90**|
- |[ko-sroberta-multitask](https://huggingface.co/jhgan/ko-sroberta-multitask)|71.78|63.16|70.80|63.47|70.89|63.72|53.57|44.23|
- |[bge-m3](https://huggingface.co/BAAI/bge-m3)|64.15|60.65|61.88|60.68|61.88|60.19|64.16|60.71|

- <br>
-
-
-
- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/6350f6750b94548566da3279/CcK0QL3oQAz7sJOCtH6PB.png)
-
- <br>
+ ## Training
+ The model was trained with the parameters:

- ## G-MATRIX Embedding labeling criteria (following the STS data construction for KLUE-RoBERTa)
- 1. Judge the degree of similarity between the two sentences on a scale of 0 to 5
- 2. Differences in spelling, spacing, periods, or commas are not considered
- 3. Compare the intent of each sentence and the meaning its expressions convey
- 4. Compare whether the two sentences are similar in meaning, rather than checking whether they share common words
- 5. A score of 0 means no semantic similarity; 5 means the sentences are semantically equivalent
+ **DataLoader**:

+ `sentence_transformers.datasets.NoDuplicatesDataLoader.NoDuplicatesDataLoader` of length 4442 with parameters:
+ ```
+ {'batch_size': 128}
+ ```

+ **Loss**:

- ## Training
- The model was trained with the parameters:
+ `sentence_transformers.losses.MultipleNegativesRankingLoss.MultipleNegativesRankingLoss` with parameters:
+ ```
+ {'scale': 20.0, 'similarity_fct': 'cos_sim'}
+ ```

  **DataLoader**:

- `torch.utils.data.dataloader.DataLoader` of length 329 with parameters:
+ `torch.utils.data.dataloader.DataLoader` of length 719 with parameters:
  ```
- {'batch_size': 32, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}
+ {'batch_size': 8, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}
  ```

  **Loss**:

  `sentence_transformers.losses.CosineSimilarityLoss.CosineSimilarityLoss`

+ Parameters of the fit()-Method:
+ ```
+ {
+     "epochs": 4,
+     "evaluation_steps": 1000,
+     "evaluator": "sentence_transformers.evaluation.EmbeddingSimilarityEvaluator.EmbeddingSimilarityEvaluator",
+     "max_grad_norm": 1.0,
+     "optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
+     "optimizer_params": {
+         "lr": 1e-06
+     },
+     "scheduler": "WarmupLinear",
+     "steps_per_epoch": null,
+     "warmup_steps": 288,
+     "weight_decay": 0.01
+ }
+ ```
+

  ## Full Model Architecture
  ```
  SentenceTransformer(
-   (0): Transformer({'max_seq_length': 128, 'do_lower_case': True}) with Transformer model: DeBERTaV2Model
-   (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
+   (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: DebertaV2Model
+   (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  )
  ```

  ## Citing & Authors

- <!--- Describe where people can find more information -->
- [MINSANG SONG] at [BI-Matrix](https://www.bimatrix.co.kr/)
+ <!--- Describe where people can find more information -->
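The updated Training section above pairs a NoDuplicatesDataLoader (batch size 128) with MultipleNegativesRankingLoss and a plain DataLoader (batch size 8) with CosineSimilarityLoss, trained by fit() for 4 epochs at lr 1e-06 with 288 warmup steps. A rough sentence-transformers 2.x reconstruction of that setup, using tiny synthetic stand-ins for the NLI/STS data so the sketch is self-contained; the real run used the Korean NLI/STS data referenced in the model card:

```python
from sentence_transformers import SentenceTransformer, InputExample, losses, datasets
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator
from torch.utils.data import DataLoader

# Base checkpoint named in config.json ("_name_or_path").
model = SentenceTransformer("upskyy/kf-deberta-multitask")

# Synthetic stand-ins: (anchor, positive, negative) triplets and scored sentence pairs.
nli_examples = [InputExample(texts=[f"anchor {i}", f"positive {i}", f"negative {i}"]) for i in range(256)]
sts_examples = [InputExample(texts=[f"sentence {i}", f"sentence {i}, rephrased"], label=1.0) for i in range(32)]

nli_loader = datasets.NoDuplicatesDataLoader(nli_examples, batch_size=128)
nli_loss = losses.MultipleNegativesRankingLoss(model, scale=20.0)

sts_loader = DataLoader(sts_examples, shuffle=True, batch_size=8)
sts_loss = losses.CosineSimilarityLoss(model)

dev_evaluator = EmbeddingSimilarityEvaluator(
    ["A plane is taking off.", "A man is playing a flute."],
    ["An air plane is taking off.", "A man is playing a guitar."],
    [0.95, 0.35],
    name="sts-dev",
)

model.fit(
    train_objectives=[(nli_loader, nli_loss), (sts_loader, sts_loss)],  # multi-task: MNR + cosine STS
    evaluator=dev_evaluator,
    epochs=4,
    evaluation_steps=1000,
    warmup_steps=288,
    optimizer_params={"lr": 1e-06},
    weight_decay=0.01,
    max_grad_norm=1.0,
)
```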
 
config.json ADDED
@@ -0,0 +1,37 @@
+ {
+   "_name_or_path": "upskyy/kf-deberta-multitask",
+   "architectures": [
+     "DebertaV2Model"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "conv_act": "gelu",
+   "conv_kernel_size": 0,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 768,
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "layer_norm_eps": 1e-07,
+   "max_position_embeddings": 512,
+   "max_relative_positions": -1,
+   "model_type": "deberta-v2",
+   "norm_rel_ebd": "layer_norm",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "pad_token_id": 0,
+   "pooler_dropout": 0,
+   "pooler_hidden_act": "gelu",
+   "pooler_hidden_size": 768,
+   "pos_att_type": [
+     "p2c",
+     "c2p"
+   ],
+   "position_biased_input": false,
+   "position_buckets": 256,
+   "relative_attention": true,
+   "share_att_key": true,
+   "torch_dtype": "float32",
+   "transformers_version": "4.41.0",
+   "type_vocab_size": 0,
+   "vocab_size": 130000
+ }
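config.json pins the backbone to a 12-layer, 768-hidden DeBERTa-v2 encoder with a 130,000-token vocabulary, inherited from upskyy/kf-deberta-multitask. A small sanity-check sketch, assuming the commit has been cloned locally (the "." path below is a placeholder for that checkout, not a published model id):

```python
from transformers import AutoConfig, AutoModel

repo_dir = "."  # placeholder: local clone of this repository

config = AutoConfig.from_pretrained(repo_dir)
assert config.model_type == "deberta-v2"
assert (config.hidden_size, config.num_hidden_layers, config.num_attention_heads) == (768, 12, 12)
assert config.vocab_size == 130000

# Instantiates DebertaV2Model and loads the weights from model.safetensors.
model = AutoModel.from_pretrained(repo_dir)
print(sum(p.numel() for p in model.parameters()))  # ~185M float32 params, matching the 741 MB safetensors file
```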
config_sentence_transformers.json ADDED
@@ -0,0 +1,9 @@
+ {
+   "__version__": {
+     "sentence_transformers": "2.2.2",
+     "transformers": "4.36.1",
+     "pytorch": "1.11.0"
+   },
+   "prompts": {},
+   "default_prompt_name": null
+ }
eval/similarity_evaluation_sts-dev_results.csv ADDED
@@ -0,0 +1,15 @@
+ epoch,steps,cosine_pearson,cosine_spearman,euclidean_pearson,euclidean_spearman,manhattan_pearson,manhattan_spearman,dot_pearson,dot_spearman
+ 0,1000,0.87424182724437,0.873113883627151,0.8626342456499717,0.8683649687856948,0.862234101221366,0.8679090218497976,0.8420911984535485,0.8388683728199267
+ 0,-1,0.8704544232547093,0.8711604107071151,0.8618454435894635,0.8664964585911143,0.8613477153921905,0.8661491861079217,0.8390755643453752,0.8367505848740203
+ 1,1000,0.8653159391248947,0.8658128143759094,0.8583427277521668,0.8636325864952359,0.8577919927964012,0.8632203032907328,0.8320704397644403,0.8285392001745013
+ 1,-1,0.8664721939672407,0.867188827810382,0.8574375558592068,0.8622162330750675,0.8572350229431882,0.8620521245434254,0.8304817808652534,0.8276482440404608
+ 2,1000,0.8714413815604427,0.8728773176225486,0.8606519719208473,0.8661686113883175,0.8597764457766773,0.8650461871211873,0.8402147325095083,0.8390394174225602
+ 2,-1,0.8705943005027337,0.8702975560007158,0.8607474787071386,0.866365162116999,0.8600589682539279,0.8655968405285576,0.8314075606364041,0.8294845620402314
+ 3,1000,0.8732898338039496,0.8727113406361378,0.8618326659582233,0.867660640894365,0.8609977907811955,0.866602343006294,0.836684284412931,0.8344851740499123
+ 3,-1,0.8706550381317323,0.8703280791541039,0.8594185149543394,0.8655157123583306,0.858653349453231,0.8647258505997637,0.8309840916492293,0.8295422738240362
+ 4,1000,0.8704857775744986,0.8704915027579151,0.8587016084316031,0.8650513081211857,0.8579561230558775,0.8640968092375928,0.8350023458985515,0.8333924241815882
+ 4,-1,0.8711526078122271,0.8711521971781401,0.8595390375115806,0.8659608852653925,0.858776265520779,0.8649677155364356,0.8349352221295189,0.8333007621962971
+ 0,-1,0.8763115872556583,0.8761023572835317,0.8669522940489013,0.8720087690155365,0.8663943699721328,0.8712943539266835,0.8412022905614112,0.8384612426442756
+ 1,-1,0.8763205288075023,0.8760076739421485,0.8662215668559664,0.8714828451613065,0.8656997872065376,0.8709107415595765,0.8417178065990268,0.8393132450146403
+ 2,-1,0.8764778415959709,0.8761420637174745,0.8666452983853821,0.8718069663837321,0.8661081380378363,0.8711184783592194,0.8417928269640543,0.8392738072648638
+ 3,-1,0.8761662401985245,0.8758175528104466,0.8664189381023497,0.8715279651845382,0.865886230191413,0.8709402326833412,0.8416127232242585,0.838971926272541
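Each row of this CSV is one EmbeddingSimilarityEvaluator pass during fit(): `steps` is 1000 (matching `evaluation_steps`) or -1 for the end-of-epoch pass, and the remaining columns are Pearson/Spearman correlations for cosine, Euclidean, Manhattan, and dot-product similarities against the gold scores. The trailing rows that restart at epoch 0 look like a second fit() run appended to the same file, consistent with the continue-training recipe. A minimal sketch of how such a file is written, with a toy dev set standing in for the real STS dev split:

```python
import os

from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

model = SentenceTransformer("upskyy/kf-deberta-multitask")  # stand-in for the trained model

# Toy dev pairs with gold similarity scores scaled to [0, 1].
evaluator = EmbeddingSimilarityEvaluator(
    sentences1=["A plane is taking off.", "A man is playing a flute."],
    sentences2=["An air plane is taking off.", "A man is playing a guitar."],
    scores=[0.95, 0.35],
    name="sts-dev",
)

os.makedirs("eval", exist_ok=True)
# Appends one row to eval/similarity_evaluation_sts-dev_results.csv.
evaluator(model, output_path="eval", epoch=0, steps=1000)
```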
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:905ee97ae34c07044a3af264c045fc33c4663025153db50bfe0049a765b41957
+ size 741185640
modules.json ADDED
@@ -0,0 +1,14 @@
+ [
+   {
+     "idx": 0,
+     "name": "0",
+     "path": "",
+     "type": "sentence_transformers.models.Transformer"
+   },
+   {
+     "idx": 1,
+     "name": "1",
+     "path": "1_Pooling",
+     "type": "sentence_transformers.models.Pooling"
+   }
+ ]
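modules.json is the manifest SentenceTransformer uses to assemble the pipeline: module 0 is the transformer encoder stored at the repository root, module 1 the pooling layer configured under 1_Pooling/. Building an equivalent two-module stack by hand looks roughly like this (the base checkpoint name is taken from config.json; the other values mirror the committed configs):

```python
from sentence_transformers import SentenceTransformer, models

# Module 0: DeBERTa-v2 encoder; sentence_bert_config.json supplies max_seq_length=128.
word_embedding_model = models.Transformer("upskyy/kf-deberta-multitask", max_seq_length=128)

# Module 1: mean pooling over the 768-dim token embeddings, as in 1_Pooling/config.json.
pooling_model = models.Pooling(
    word_embedding_model.get_word_embedding_dimension(),
    pooling_mode_mean_tokens=True,
    pooling_mode_cls_token=False,
    pooling_mode_max_tokens=False,
)

model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
print(model)  # SentenceTransformer((0): Transformer(...), (1): Pooling(...))
```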
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
+ {
+   "max_seq_length": 128,
+   "do_lower_case": false
+ }
special_tokens_map.json ADDED
@@ -0,0 +1,51 @@
+ {
+   "bos_token": {
+     "content": "[CLS]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "cls_token": {
+     "content": "[CLS]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "eos_token": {
+     "content": "[SEP]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "mask_token": {
+     "content": "[MASK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "[PAD]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "sep_token": {
+     "content": "[SEP]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "[UNK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,66 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "[PAD]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "[UNK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "[CLS]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "3": {
+       "content": "[SEP]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "4": {
+       "content": "[MASK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "bos_token": "[CLS]",
+   "clean_up_tokenization_spaces": true,
+   "cls_token": "[CLS]",
+   "do_basic_tokenize": true,
+   "do_lower_case": false,
+   "eos_token": "[SEP]",
+   "mask_token": "[MASK]",
+   "max_length": 128,
+   "model_max_length": 512,
+   "never_split": null,
+   "pad_to_multiple_of": null,
+   "pad_token": "[PAD]",
+   "pad_token_type_id": 0,
+   "padding_side": "right",
+   "sep_token": "[SEP]",
+   "stride": 0,
+   "strip_accents": null,
+   "tokenize_chinese_chars": true,
+   "tokenizer_class": "BertTokenizer",
+   "truncation_side": "right",
+   "truncation_strategy": "longest_first",
+   "unk_token": "[UNK]"
+ }
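The tokenizer is a WordPiece BertTokenizer with the usual [CLS]/[SEP]/[PAD]/[UNK]/[MASK] specials, `model_max_length` 512, and no lowercasing, while sentence_bert_config.json caps the effective sequence length at 128. A short usage sketch, again loading from a local clone of the repository (the "." path is a placeholder):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(".")  # placeholder: local clone of this repository

encoded = tokenizer(
    ["두 문장의 의미가 유사한지 비교합니다."],
    padding=True,
    truncation=True,
    max_length=128,   # matches max_seq_length in sentence_bert_config.json
    return_tensors="pt",
)
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"][0]))  # [CLS] ... [SEP]
```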
vocab.txt ADDED
The diff for this file is too large to render. See raw diff