๋ณธ ๋ชจ๋ธ์€ multi-task loss (MultipleNegativeLoss -> AnglELoss) ๋กœ, KlueNLI ๋ฐ KlueSTS ๋ฐ์ดํ„ฐ๋กœ ํ•™์Šต๋˜์—ˆ์Šต๋‹ˆ๋‹ค. ํ•™์Šต ์ฝ”๋“œ๋Š” ๋‹ค์Œ Github hyperlink์—์„œ ๋ณด์‹ค ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

Usage (Huggingface inference API)

import requests

API_URL = "https://api-inference.huggingface.co/models/sorryhyun/sentence-embedding-klue-large"
headers = {"Authorization": "your_HF_token"}

def query(payload):
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()


output = query({
    "inputs": {
        "source_sentence": "์ข‹์•„์š”, ์ถ”์ฒœ, ์•Œ๋ฆผ์„ค์ •๊นŒ์ง€",
        "sentences": [
            "์ข‹์•„์š” ๋ˆŒ๋Ÿฌ์ฃผ์„ธ์š”!!",
            "์ข‹์•„์š”, ์ถ”์ฒœ ๋“ฑ ์œ ํˆฌ๋ฒ„๋“ค์ด ์ข‹์•„ํ•ด์š”",
            "์•Œ๋ฆผ์„ค์ •์„ ๋ˆŒ๋Ÿฌ์ฃผ์‹œ๋ฉด ๊ฐ์‚ฌ๋“œ๋ฆฌ๊ฒ ์Šต๋‹ˆ๋‹ค."
        ]
    },
})
if __name__ == '__main__':
  print(output)

Usage (HuggingFace Transformers)

from transformers import AutoTokenizer, AutoModel, DataCollatorWithPadding
import torch
from torch.utils.data import DataLoader
from tqdm import tqdm
from datasets import Dataset

device = torch.device('cpu')
batch_size=1

# Sentences we want sentence embeddings for
sentences = ['This is an example sentence', 'Each sentence is converted']

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('sorryhyun/sentence-embedding-klue-large')
collator = DataCollatorWithPadding(tokenizer=tokenizer)
model = AutoModel.from_pretrained('sorryhyun/sentence-embedding-klue-large').to(device)

tokenized_data = tokenizer(sentences, padding=True, truncation=True)
tokenized_data = Dataset.from_dict(tokenized_data)
dataloader = DataLoader(tokenized_data, batch_size=batch_size, pin_memory=True, collate_fn=collator)
all_outputs = torch.zeros((len(tokenized_data), 1024)).to(device)
start_idx = 0

# I used mean-pool method for sentence representation
with torch.no_grad():
  for inputs in tqdm(dataloader):
    inputs = {k: v.to(device) for k, v in inputs.items()}
    representations, _ = model(**inputs, return_dict=False)
    attention_mask = inputs["attention_mask"]
    input_mask_expanded = (attention_mask.unsqueeze(-1).expand(representations.size()).to(representations.dtype))
    summed = torch.sum(representations * input_mask_expanded, 1)
    sum_mask = input_mask_expanded.sum(1)
    sum_mask = torch.clamp(sum_mask, min=1e-9)
    end_idx = start_idx + representations.shape[0]
    all_outputs[start_idx:end_idx] = (summed / sum_mask)
    start_idx = end_idx

Evaluation Results

Organization Backbone Model KlueSTS average KorSTS average
team-lucid DeBERTa-base 54.15 29.72
monologg Electra-base 66.97 40.98
LMkor Electra-base 70.98 43.09
deliciouscat DeBERTa-base - 67.65
BM-K Roberta-base 82.93 85.77
Klue Roberta-large 86.71 71.70
Klue (Hyperparameter searched) Roberta-large 86.21 75.54

๊ธฐ์กด ํ•œ๊ตญ์–ด ๋ฌธ์žฅ ์ž„๋ฒ ๋”ฉ ๋ชจ๋ธ์€ mnli, snli ๋“ฑ ์˜์–ด ๋ฐ์ดํ„ฐ์…‹์„ ๊ธฐ๊ณ„๋ฒˆ์—ญํ•˜์—ฌ ํ•™์Šต๋œ ์ ์„ ์ฐธ๊ณ ์‚ผ์•„ Klue ๋ฐ์ดํ„ฐ์…‹์œผ๋กœ ๋Œ€์‹  ํ•™์Šตํ•ด ๋ณด์•˜์Šต๋‹ˆ๋‹ค.

๊ทธ ๊ฒฐ๊ณผ, Klue-Roberta-large ๋ชจ๋ธ ๊ธฐ๋ฐ˜์œผ๋กœ ํ•™์Šตํ–ˆ์„ ๊ฒฝ์šฐ KlueSTS ๋ฐ KorSTS ํ…Œ์ŠคํŠธ์…‹์— ๋ชจ๋‘์— ๋Œ€ํ•ด ์ค€์ˆ˜ํ•œ ์„ฑ๋Šฅ์„ ๋ณด์—ฌ, ์ข€ ๋” elaborateํ•œ representation์„ ํ˜•์„ฑํ•˜๋Š” ๊ฒƒ์œผ๋กœ ์‚ฌ๋ฃŒํ–ˆ์Šต๋‹ˆ๋‹ค.

๋‹ค๋งŒ ํ‰๊ฐ€ ์ˆ˜์น˜๋Š” ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ ์„ธํŒ…, ์‹œ๋“œ ๋„˜๋ฒ„ ๋“ฑ์œผ๋กœ ํฌ๊ฒŒ ๋‹ฌ๋ผ์งˆ ์ˆ˜ ์žˆ์œผ๋ฏ€๋กœ ์ฐธ๊ณ ํ•˜์‹œ๊ธธ ๋ฐ”๋ž๋‹ˆ๋‹ค.

Training

NegativeRank loss -> simcse loss ๋กœ ํ•™์Šตํ–ˆ์Šต๋‹ˆ๋‹ค.

Downloads last month
47
Safetensors
Model size
337M params
Tensor type
F32
ยท
Inference Examples
This model does not have enough activity to be deployed to Inference API (serverless) yet. Increase its social visibility and check back later, or deploy to Inference Endpoints (dedicated) instead.

Dataset used to train sorryhyun/sentence-embedding-klue-large