---
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
license: mit
datasets:
- squad
- eli5
- sentence-transformers/embedding-training-data
- KennethTM/gooaq_pairs_danish
- sentence-transformers/gooaq
- KennethTM/squad_pairs_danish
- KennethTM/eli5_question_answer_danish
language:
- da
library_name: sentence-transformers
widget:
- source_sentence: 'Kører der cykler på vejen?'
  sentences:
  - 'I Danmark er cykler et almindeligt transportmiddel, og de har lige så stor ret til at bruge vejene som bilister. Cyklister skal dog følge færdselsreglerne og vise hensyn til andre trafikanter.'
  - 'Solen skinner, og himlen er blå. Der er ingen vind, og temperaturen er perfekt. Det er den perfekte dag til at tage en tur på landet og nyde den friske luft.'
---

# Note

*This is an updated version of [KennethTM/MiniLM-L6-danish-encoder](https://huggingface.co/KennethTM/MiniLM-L6-danish-encoder). It is trained on more data (the [GooAQ dataset](https://huggingface.co/datasets/sentence-transformers/gooaq) translated to [Danish](https://huggingface.co/datasets/KennethTM/gooaq_pairs_danish)) and is otherwise the same.*

# MiniLM-L6-danish-encoder

This is a lightweight (~22 M parameters) [sentence-transformers](https://www.SBERT.net) model for Danish NLP. It maps sentences and paragraphs to a 384-dimensional dense vector space and can be used for tasks like clustering or semantic search.

The maximum sequence length is 512 tokens.
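
Text beyond that limit is typically truncated before encoding, so long documents may need to be split first. The following is a minimal sketch of one way to do that, not something the model card prescribes: `chunk_words` is a hypothetical helper, the `max_words` and `overlap` defaults are assumptions, and counting whitespace-separated words is only a rough proxy for the tokenizer's token count (a word often becomes several tokens).

```python
def chunk_words(text, max_words=200, overlap=50):
    """Split text into overlapping windows of at most max_words words.

    Word counts only approximate token counts, so max_words is kept
    well below the 512-token limit (an assumed safety margin).
    """
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")
    words = text.split()
    step = max_words - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this window already reaches the end of the text
    return chunks
```

Each chunk can then be embedded separately, for example by scoring a query against every chunk and keeping the best score per document.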

The model was not pre-trained from scratch but adapted from the English model [sentence-transformers/all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) with a [Danish tokenizer](https://huggingface.co/KennethTM/bert-base-uncased-danish).

It was trained on ELI5 and SQuAD data machine-translated from English to Danish, plus the translated GooAQ pairs mentioned in the note at the top.

# Usage (Sentence-Transformers)

Using this model is straightforward once you have [sentence-transformers](https://www.SBERT.net) installed:

```
pip install -U sentence-transformers
```

Then you can use the model like this:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

# Given a query (Danish: "Are bicycles ridden on the road?")
query = ['Kører der cykler på vejen?']

# And two passages: the first is about cycling in Denmark, the second about the weather
passage = ['I Danmark er cykler et almindeligt transportmiddel, og de har lige så stor ret til at bruge vejene som bilister. Cyklister skal dog følge færdselsreglerne og vise hensyn til andre trafikanter.',
           'Solen skinner, og himlen er blå. Der er ingen vind, og temperaturen er perfekt. Det er den perfekte dag til at tage en tur på landet og nyde den friske luft.']

# Compute embeddings
model = SentenceTransformer("KennethTM/MiniLM-L6-danish-encoder-v2")
query_embeddings = model.encode(query)
passage_embeddings = model.encode(passage)

# To find the most relevant passage for the query (closer to 1 means more similar).
# The result has one row per query and one column per passage.
cosine_scores = cos_sim(query_embeddings, passage_embeddings)
print(cosine_scores)
```
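
To make the score matrix concrete, here is a dependency-free sketch of the cosine similarity calculation and of ranking passages by it. The three-dimensional toy vectors stand in for the real 384-dimensional embeddings, and `cosine` and `rank_passages` are hypothetical helpers written for illustration, not part of sentence-transformers.

```python
import math

def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_passages(query_vec, passage_vecs):
    """Return (passage index, score) pairs, most similar first."""
    scores = [(i, cosine(query_vec, p)) for i, p in enumerate(passage_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Toy vectors (assumed values): passage 0 points roughly the same way
# as the query, passage 1 is orthogonal to it.
query_vec = [1.0, 0.0, 1.0]
passage_vecs = [[0.9, 0.1, 0.8], [0.0, 1.0, 0.0]]

ranking = rank_passages(query_vec, passage_vecs)
print(ranking[0][0])  # index of the best-matching passage
```

With the real output from the example above, `cosine_scores[0].argmax().item()` gives the index of the best-matching passage for the first query.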

# Usage (HuggingFace Transformers)

Without [sentence-transformers](https://www.SBERT.net), you can use the model like this: first pass your input through the transformer model, then apply the right pooling operation on top of the contextualized word embeddings.

```python
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

# Mean pooling: take the attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained("KennethTM/MiniLM-L6-danish-encoder-v2")
model = AutoModel.from_pretrained("KennethTM/MiniLM-L6-danish-encoder-v2")

# Given a query (Danish: "Are bicycles ridden on the road?")
query = ['Kører der cykler på vejen?']

# And two passages: the first is about cycling in Denmark, the second about the weather
passage = ['I Danmark er cykler et almindeligt transportmiddel, og de har lige så stor ret til at bruge vejene som bilister. Cyklister skal dog følge færdselsreglerne og vise hensyn til andre trafikanter.',
           'Solen skinner, og himlen er blå. Der er ingen vind, og temperaturen er perfekt. Det er den perfekte dag til at tage en tur på landet og nyde den friske luft.']

# Tokenize sentences
query_encoded = tokenizer(query, padding=True, truncation=True, return_tensors='pt')
passage_encoded = tokenizer(passage, padding=True, truncation=True, return_tensors='pt')

# Compute embeddings
with torch.no_grad():
    query_features = model(**query_encoded)
    passage_features = model(**passage_encoded)

# Perform pooling
query_embeddings = mean_pooling(query_features, query_encoded['attention_mask'])
passage_embeddings = mean_pooling(passage_features, passage_encoded['attention_mask'])

# To find the most relevant passage for the query (closer to 1 means more similar).
# The single query embedding is broadcast against both passages, giving one score per passage.
cosine_scores = F.cosine_similarity(query_embeddings, passage_embeddings)
print(cosine_scores)
```
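
The role of the attention mask in `mean_pooling` can be seen in a small dependency-free sketch for a single sequence. This is an illustration under assumed toy values, not the model's actual output: `masked_mean` is a hypothetical helper, and the two-dimensional vectors stand in for real token embeddings.

```python
def masked_mean(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1, ignoring padding."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            total = [t + x for t, x in zip(total, vec)]
            count += 1
    # Same guard against division by zero as torch.clamp(..., min=1e-9)
    return [t / max(count, 1e-9) for t in total]

# Two real tokens followed by one padding position (assumed toy values).
tokens = [[1.0, 3.0], [3.0, 5.0], [9.0, 9.0]]
mask = [1, 1, 0]

pooled = masked_mean(tokens, mask)                        # padding excluded
naive = [sum(col) / len(tokens) for col in zip(*tokens)]  # padding included
print(pooled, naive)
```

The naive average is pulled toward the padding vector, which is why the pooling step divides by the number of unmasked tokens rather than by the padded sequence length.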