https://github.com/BM-K/Sentence-Embedding-is-all-you-need

Korean-Sentence-Embedding

๐Ÿญ Korean sentence embedding repository. You can download the pre-trained models and inference right away, also it provides environments where individuals can train models.

Quick tour

import torch
from transformers import AutoModel, AutoTokenizer

def cal_score(a, b):
    if len(a.shape) == 1: a = a.unsqueeze(0)
    if len(b.shape) == 1: b = b.unsqueeze(0)

    a_norm = a / a.norm(dim=1)[:, None]
    b_norm = b / b.norm(dim=1)[:, None]
    return torch.mm(a_norm, b_norm.transpose(0, 1)) * 100

model = AutoModel.from_pretrained('BM-K/KoSimCSE-bert') 
AutoTokenizer.from_pretrained('BM-K/KoSimCSE-bert')

sentences = ['์น˜ํƒ€๊ฐ€ ๋“คํŒ์„ ๊ฐ€๋กœ ์งˆ๋Ÿฌ ๋จน์ด๋ฅผ ์ซ“๋Š”๋‹ค.',
             '์น˜ํƒ€ ํ•œ ๋งˆ๋ฆฌ๊ฐ€ ๋จน์ด ๋’ค์—์„œ ๋‹ฌ๋ฆฌ๊ณ  ์žˆ๋‹ค.',
             '์›์ˆญ์ด ํ•œ ๋งˆ๋ฆฌ๊ฐ€ ๋“œ๋Ÿผ์„ ์—ฐ์ฃผํ•œ๋‹ค.']

inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
embeddings, _ = model(**inputs, return_dict=False)

score01 = cal_score(embeddings[0][0], embeddings[1][0])
score02 = cal_score(embeddings[0][0], embeddings[2][0])

Performance

  • Semantic Textual Similarity test set results
Model AVG Cosine Pearson Cosine Spearman Euclidean Pearson Euclidean Spearman Manhattan Pearson Manhattan Spearman Dot Pearson Dot Spearman
KoSBERTโ€ SKT 77.40 78.81 78.47 77.68 77.78 77.71 77.83 75.75 75.22
KoSBERT 80.39 82.13 82.25 80.67 80.75 80.69 80.78 77.96 77.90
KoSRoBERTa 81.64 81.20 82.20 81.79 82.34 81.59 82.20 80.62 81.25
KoSentenceBART 77.14 79.71 78.74 78.42 78.02 78.40 78.00 74.24 72.15
KoSentenceT5 77.83 80.87 79.74 80.24 79.36 80.19 79.27 72.81 70.17
KoSimCSE-BERTโ€ SKT 81.32 82.12 82.56 81.84 81.63 81.99 81.74 79.55 79.19
KoSimCSE-BERT 83.37 83.22 83.58 83.24 83.60 83.15 83.54 83.13 83.49
KoSimCSE-RoBERTa 83.65 83.60 83.77 83.54 83.76 83.55 83.77 83.55 83.64
KoSimCSE-BERT-multitask 85.71 85.29 86.02 85.63 86.01 85.57 85.97 85.26 85.93
KoSimCSE-RoBERTa-multitask 85.77 85.08 86.12 85.84 86.12 85.83 86.12 85.03 85.99
Downloads last month
569
Safetensors
Model size
111M params
Tensor type
I64
ยท
F32
ยท
Inference Examples
This model does not have enough activity to be deployed to Inference API (serverless) yet. Increase its social visibility and check back later, or deploy to Inference Endpoints (dedicated) instead.