---
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
datasets:
- kornlu
language:
- ko
license: cc-by-4.0
---

# bi-matrix/gmatrix-embedding

This model is based on [KF-DeBERTa](https://huggingface.co/kakaobank/kf-deberta-base) and the KorSTS and KorNLI datasets, and was trained with the [continue-learning](https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/sts/training_stsbenchmark_continue_training.py) approach introduced in the official sentence-transformers examples, as follows:

1. Multi-task learning for 10 epochs, using MultipleNegativesRankingLoss after negative sampling on the NLI dataset and CosineSimilarityLoss on the STS dataset.
2. An additional 4 epochs of multi-task training with the learning rate reduced to 1e-06.

---

This is a [sentence-transformers](https://www.SBERT.net) model: it maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for tasks like clustering or semantic search.

## Usage (Sentence-Transformers)

Using this model becomes easy when you have [sentence-transformers](https://www.SBERT.net) installed:

```
pip install -U sentence-transformers
```

Then you can use the model like this:

```python
from sentence_transformers import SentenceTransformer

sentences = ["This is an example sentence", "Each sentence is converted"]

model = SentenceTransformer("bi-matrix/gmatrix-embedding")
embeddings = model.encode(sentences)
print(embeddings)
```

## Usage (HuggingFace Transformers)

Without [sentence-transformers](https://www.SBERT.net), you can use the model like this: first, pass your input through the transformer model, then apply the right pooling operation on top of the contextualized word embeddings.

```python
from transformers import AutoTokenizer, AutoModel
import torch


# Mean pooling - take the attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Sentences we want sentence embeddings for
sentences = ['This is an example sentence', 'Each sentence is converted']

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained("bi-matrix/gmatrix-embedding")
model = AutoModel.from_pretrained("bi-matrix/gmatrix-embedding")

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)
```
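Either way, the resulting embeddings can be compared directly for semantic similarity. Below is a minimal sketch that scores a sentence pair with cosine similarity via `sentence_transformers.util.cos_sim`; the Korean example sentences are illustrative placeholders only:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("bi-matrix/gmatrix-embedding")

# Illustrative sentence pair (not from the training data)
sentences = ["오늘 날씨가 참 좋다.", "오늘은 날씨가 매우 맑다."]
embeddings = model.encode(sentences, convert_to_tensor=True)

# Cosine similarity between the two embeddings (closer to 1 means more similar)
score = util.cos_sim(embeddings[0], embeddings[1])
print(float(score))
```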
## Evaluation Results

Results on the KorSTS evaluation dataset:

- Cosine Pearson: 85.77
- Cosine Spearman: 86.30
- Manhattan Pearson: 84.84
- Manhattan Spearman: 85.33
- Euclidean Pearson: 84.82
- Euclidean Spearman: 85.29
- Dot Pearson: 83.19
- Dot Spearman: 83.19

|model|cosine_pearson|cosine_spearman|euclidean_pearson|euclidean_spearman|manhattan_pearson|manhattan_spearman|dot_pearson|dot_spearman|
|:-------------------------|-----------------:|------------------:|--------------------:|---------------------:|--------------------:|---------------------:|--------------:|---------------:|
|[**gmatrix-embedding**](https://huggingface.co/bi-matrix/gmatrix-embedding)|**85.77**|**86.30**|**84.82**|**85.29**|**84.84**|**85.33**|**83.19**|**83.19**|
|[kf-deberta-multitask](https://huggingface.co/upskyy/kf-deberta-multitask)|85.75|86.25|84.79|85.25|84.80|85.27|82.93|82.86|
|[ko-sroberta-multitask](https://huggingface.co/jhgan/ko-sroberta-multitask)|84.77|85.60|83.71|84.40|83.70|84.38|82.42|82.33|
|[ko-sbert-multitask](https://huggingface.co/jhgan/ko-sbert-multitask)|84.13|84.71|82.42|82.66|82.41|82.69|80.05|79.69|
|[ko-sroberta-base-nli](https://huggingface.co/jhgan/ko-sroberta-nli)|82.83|83.85|82.87|83.29|82.88|83.28|80.34|79.69|
|[ko-sbert-nli](https://huggingface.co/jhgan/ko-sbert-nli)|82.24|83.16|82.19|82.31|82.18|82.30|79.30|78.78|
|[ko-sroberta-sts](https://huggingface.co/jhgan/ko-sroberta-sts)|81.84|81.82|81.15|81.25|81.14|81.25|79.09|78.54|
|[ko-sbert-sts](https://huggingface.co/jhgan/ko-sbert-sts)|81.55|81.23|79.94|79.79|79.90|79.75|76.02|75.31|
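The correlations above can be reproduced with sentence-transformers' `EmbeddingSimilarityEvaluator`. The following is only a sketch: it assumes a local copy of the KorSTS test split as a tab-separated file (here named `sts-test.tsv`) with `score`, `sentence1`, and `sentence2` columns; the file name and column handling are assumptions.

```python
import csv

from sentence_transformers import InputExample, SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

model = SentenceTransformer("bi-matrix/gmatrix-embedding")

# Hypothetical local path to the KorSTS test split (tab-separated, gold scores in [0, 5])
examples = []
with open("sts-test.tsv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE):
        examples.append(
            InputExample(
                texts=[row["sentence1"], row["sentence2"]],
                label=float(row["score"]) / 5.0,  # normalize gold scores to [0, 1]
            )
        )

evaluator = EmbeddingSimilarityEvaluator.from_input_examples(examples, name="korsts-test")
# Computes Pearson/Spearman for cosine, Euclidean, Manhattan, and dot-product similarities;
# the exact return format depends on the sentence-transformers version.
print(evaluator(model))
```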
Results on the G-MATRIX Embedding evaluation set. Three human annotators rated the similarity of each sentence pair on a 0-5 scale, and their scores were averaged. For each model, cosine similarity, Euclidean distance, Manhattan distance, and dot product were computed from the embeddings, and Pearson and Spearman correlations against the averaged human scores were calculated (a sketch of this computation follows the comparison table below).

- Cosine Pearson: 75.86
- Cosine Spearman: 65.75
- Manhattan Pearson: 72.65
- Manhattan Spearman: 65.20
- Euclidean Pearson: 72.48
- Euclidean Spearman: 65.32
- Dot Pearson: 64.71
- Dot Spearman: 53.90
|model|cosine_pearson|cosine_spearman|euclidean_pearson|euclidean_spearman|manhattan_pearson|manhattan_spearman|dot_pearson|dot_spearman|
|:-------------------------|-----------------:|------------------:|--------------------:|---------------------:|--------------------:|---------------------:|--------------:|---------------:|
|[**gmatrix-embedding**](https://huggingface.co/bi-matrix/gmatrix-embedding)|**75.86**|**65.75**|**72.65**|**65.20**|**72.48**|**65.32**|**64.71**|**53.90**|
|[ko-sroberta-multitask](https://huggingface.co/jhgan/ko-sroberta-multitask)|71.78|63.16|70.80|63.47|70.89|63.72|53.57|44.23|
|[bge-m3](https://huggingface.co/BAAI/bge-m3)|64.15|60.65|61.88|60.68|61.88|60.19|64.16|60.71|
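For reference, the kind of computation behind these numbers can be sketched as follows. The sentence pairs and human scores below are illustrative placeholders, not the actual G-MATRIX evaluation data:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("bi-matrix/gmatrix-embedding")

# Placeholder sentence pairs with averaged human scores on a 0-5 scale
pairs = [
    ("오늘 날씨가 좋다.", "오늘은 날씨가 맑다."),
    ("그는 책을 읽고 있다.", "그녀는 노래를 부른다."),
    ("고양이가 소파에서 잔다.", "고양이가 소파 위에서 자고 있다."),
    ("주식 시장이 급락했다.", "점심으로 김밥을 먹었다."),
]
human_scores = np.array([4.2, 0.5, 4.8, 0.0])

emb1 = model.encode([a for a, _ in pairs])
emb2 = model.encode([b for _, b in pairs])

# Pairwise measures used in the tables above; distances are negated so that
# larger values always mean "more similar" before correlating with human scores
cosine = np.sum(emb1 * emb2, axis=1) / (
    np.linalg.norm(emb1, axis=1) * np.linalg.norm(emb2, axis=1)
)
euclidean = -np.linalg.norm(emb1 - emb2, axis=1)
manhattan = -np.sum(np.abs(emb1 - emb2), axis=1)
dot = np.sum(emb1 * emb2, axis=1)

for name, preds in [("cosine", cosine), ("euclidean", euclidean),
                    ("manhattan", manhattan), ("dot", dot)]:
    print(f"{name}: pearson={pearsonr(human_scores, preds)[0]:.4f}, "
          f"spearman={spearmanr(human_scores, preds)[0]:.4f}")
```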
![image/png](https://cdn-uploads.huggingface.co/production/uploads/6350f6750b94548566da3279/CcK0QL3oQAz7sJOCtH6PB.png)
## G-MATRIX Embedding Labeling Criteria (following the STS data construction for KLUE-RoBERTa)

1. Rate the degree of similarity between the two sentences on a 0-5 scale.
2. Differences in spelling, spacing, periods, or commas are not considered.
3. Compare the intent of the sentences and the meaning conveyed by their expressions.
4. Judge whether the meanings of the two sentences are similar, rather than whether they share common words.
5. A score of 0 means no semantic similarity; 5 means the sentences are semantically equivalent.

## Training

The model was trained with the following parameters:

**DataLoader**:

`torch.utils.data.dataloader.DataLoader` of length 329 with parameters:

```
{'batch_size': 32, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}
```

**Loss**:

`sentence_transformers.losses.CosineSimilarityLoss.CosineSimilarityLoss` (the NLI objective additionally used `MultipleNegativesRankingLoss`; a sketch of the full multi-task setup appears at the end of this card).

## Full Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': True}) with Transformer model: DeBERTaV2Model
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
```

## Citing & Authors

MINSANG SONG at [BI-Matrix](https://www.bimatrix.co.kr/)
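For reference, the multi-task continue-learning recipe described at the top of this card can be sketched with the classic `SentenceTransformer.fit` API. This is only a sketch under stated assumptions: the placeholder examples stand in for KorNLI triplets and KorSTS pairs, and `warmup_steps` and other unstated hyperparameters are guesses, not the values actually used.

```python
from torch.utils.data import DataLoader

from sentence_transformers import InputExample, SentenceTransformer, losses, models

# Rebuild the architecture described above: KF-DeBERTa encoder + mean pooling
word_embedding_model = models.Transformer("kakaobank/kf-deberta-base", max_seq_length=128)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(), pooling_mode="mean")
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# Assumed to be built elsewhere from KorNLI / KorSTS; these examples are placeholders
nli_examples = [InputExample(texts=["문장", "유사한 문장", "모순되는 문장"])]  # (anchor, entailment, contradiction)
sts_examples = [InputExample(texts=["문장 1", "문장 2"], label=4.0 / 5.0)]      # gold score normalized to [0, 1]

nli_loader = DataLoader(nli_examples, shuffle=True, batch_size=32)
sts_loader = DataLoader(sts_examples, shuffle=True, batch_size=32)

# Multi-task objectives: in-batch negatives on NLI, cosine-similarity regression on STS
train_objectives = [
    (nli_loader, losses.MultipleNegativesRankingLoss(model)),
    (sts_loader, losses.CosineSimilarityLoss(model)),
]

# Stage 1: 10 epochs of multi-task training
model.fit(train_objectives=train_objectives, epochs=10, warmup_steps=100)

# Stage 2: 4 additional epochs with the learning rate lowered to 1e-06
model.fit(train_objectives=train_objectives, epochs=4, warmup_steps=100,
          optimizer_params={"lr": 1e-06})
```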