The SwissBERT model was fine-tuned with SimCSE (Gao et al., EMNLP 2021) to produce sentence embeddings, using ~1 million Swiss news articles published in 2022 and obtained from Swissdox@LiRI. Following the Sentence Transformers approach (Reimers and Gurevych, 2019), the average of the last hidden states (pooler_type=avg) is used as the sentence representation.
The fine-tuning script can be accessed here.
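For orientation, the SimCSE objective can be sketched as a contrastive loss with in-batch negatives. This is a minimal illustration and not the fine-tuning script itself: it assumes the unsupervised variant of SimCSE, in which each sentence is encoded twice with different dropout masks, and the function name `simcse_loss` and the random toy tensors are hypothetical stand-ins for real model outputs.

```python
import torch
import torch.nn.functional as F


def simcse_loss(z1: torch.Tensor, z2: torch.Tensor, temp: float = 0.05) -> torch.Tensor:
    """Contrastive loss with in-batch negatives (illustrative sketch).

    z1, z2: (batch, dim) embeddings of the same sentences from two forward passes.
    """
    # Cosine similarity between every z1[i] and every z2[j], scaled by the temperature
    sim = F.cosine_similarity(z1.unsqueeze(1), z2.unsqueeze(0), dim=-1) / temp
    # The positive for row i is column i; all other columns act as negatives
    labels = torch.arange(sim.size(0), device=sim.device)
    return F.cross_entropy(sim, labels)


# Toy check with random vectors standing in for model outputs
torch.manual_seed(0)
z1 = torch.randn(4, 8)
z2 = z1 + 0.01 * torch.randn(4, 8)  # slightly perturbed "second pass"
loss_matched = simcse_loss(z1, z2)
loss_shuffled = simcse_loss(z1, z2.flip(0))  # positives deliberately misaligned
print(loss_matched.item(), loss_shuffled.item())
```

The loss is small when each sentence is closest to its own second encoding and grows when the pairs are misaligned, which is the behaviour the toy check prints.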
Model Details
Model Description
- Developed by: Juri Grosjean
- Model type: XMOD
- Language(s) (NLP): de_CH, fr_CH, it_CH, rm_CH
- License: [More Information Needed]
- Finetuned from model: SwissBERT
Use
import torch
from transformers import AutoModel, AutoTokenizer

### German example

# Load swissBERT for sentence embeddings model
model_name = "jgrosjean-mathesis/swissbert-for-sentence-embeddings"
model = AutoModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

def generate_sentence_embedding(sentence, language):

    # Set adapter to specified language
    if "de" in language:
        model.set_default_language("de_CH")
    if "fr" in language:
        model.set_default_language("fr_CH")
    if "it" in language:
        model.set_default_language("it_CH")
    if "rm" in language:
        model.set_default_language("rm_CH")

    # Tokenize input sentence
    inputs = tokenizer(sentence, padding=True, truncation=True, return_tensors="pt", max_length=512)

    # Take tokenized input and pass it through the model
    with torch.no_grad():
        outputs = model(**inputs)

    # Extract average sentence embeddings from the last hidden layer
    embedding = outputs.last_hidden_state.mean(dim=1)

    return embedding

sentence_embedding = generate_sentence_embedding("Wir feiern am 1. August den Schweizer Nationalfeiertag.", language="de")
print(sentence_embedding)
Output:
tensor([[ 5.6306e-02, -2.8375e-01, -4.1495e-02, 7.4393e-02, -3.1552e-01,
1.5213e-01, -1.0258e-01, 2.2790e-01, -3.5968e-02, 3.1769e-01,
1.9354e-01, 1.9748e-02, -1.5236e-01, -2.2657e-01, 1.3345e-02,
...]])
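The function above averages over every token position, which is fine when a single sentence is embedded. If several sentences of different lengths are passed in one batch, padding positions also enter the mean. A common variant weights the average by the attention mask. The sketch below illustrates that variant on dummy tensors; the helper name `masked_mean_pool` is hypothetical and not part of the model or of `transformers`.

```python
import torch


def masked_mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    # (batch, seq_len) -> (batch, seq_len, 1) so the mask broadcasts over the hidden dimension
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    # Sum only the real (non-padding) token vectors
    summed = (last_hidden_state * mask).sum(dim=1)
    # Number of real tokens per sentence; clamp guards against division by zero
    counts = mask.sum(dim=1).clamp(min=1e-9)
    return summed / counts


# Dummy tensors standing in for model outputs: 2 sentences, 3 positions, hidden size 2
hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],
                       [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]])
mask = torch.tensor([[1, 1, 0],   # last position of the first sentence is padding
                     [1, 1, 1]])
pooled = masked_mean_pool(hidden, mask)
print(pooled)  # the first row averages only its two real tokens
```

Inside `generate_sentence_embedding`, this would correspond to calling `masked_mean_pool(outputs.last_hidden_state, inputs["attention_mask"])` in place of the plain `.mean(dim=1)`.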
Semantic Textual Similarity
from sklearn.metrics.pairwise import cosine_similarity
# Define two sentences
sentence_1 = ["Der Zug kommt um 9 Uhr in Zürich an."]
sentence_2 = ["Le train arrive à Lausanne à 9h."]
# Compute embedding for both
embedding_1 = generate_sentence_embedding(sentence_1, language="de")
embedding_2 = generate_sentence_embedding(sentence_2, language="fr")
# Compute cosine similarity
cosine_score = cosine_similarity(embedding_1, embedding_2)
# Output the score
print("The cosine score for", sentence_1, "and", sentence_2, "is", cosine_score)
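To make the score easier to interpret, the following sketch computes cosine similarity by hand on small toy vectors. It is only an illustration of the formula (dot product divided by the product of the vector norms); the helper name `cosine` and the example vectors are hypothetical.

```python
import math


def cosine(u, v):
    # Dot product of the two vectors
    dot = sum(a * b for a, b in zip(u, v))
    # Euclidean norms of each vector
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


same_direction = cosine([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])  # parallel vectors, close to 1
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])                # unrelated directions, 0
opposite = cosine([1.0, 0.0], [-1.0, 0.0])                 # opposite directions, -1
print(same_direction, orthogonal, opposite)
```

Note that scikit-learn's `cosine_similarity` returns a 2D array with one row per sample in the first argument and one column per sample in the second, so for a single sentence pair the score itself is `cosine_score[0][0]`.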
Bias, Risks, and Limitations
This model has been trained on news articles only. Hence, it might not perform as well on other types of text.
Training Details
Training Data
[More Information Needed]
Training Procedure
Preprocessing [optional]
[More Information Needed]
Training Hyperparameters
- Training regime: python3 train_simcse_multilingual.py \
    --seed 54699 \
    --model_name_or_path zurichNLP/swissbert \
    --train_file /srv/scratch2/grosjean/Masterarbeit/data_subsets \
    --output_dir /srv/scratch2/grosjean/Masterarbeit/model \
    --overwrite_output_dir \
    --save_strategy no \
    --do_train \
    --num_train_epochs 1 \
    --learning_rate 1e-5 \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 128 \
    --max_seq_length 512 \
    --overwrite_cache \
    --pooler_type avg \
    --pad_to_max_length \
    --temp 0.05 \
    --fp16
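As a rough aid for reading these flags, the arithmetic below shows how the per-device batch size and gradient accumulation combine into the number of examples per optimizer step. Two values are assumptions not stated in the command: a single device, and about one million training examples (taken from the "~1 million" articles mentioned above, assuming one example per article).

```python
import math

per_device_train_batch_size = 4
gradient_accumulation_steps = 128
num_devices = 1  # assumption: the command does not state the number of GPUs

# Examples contributing to each optimizer update
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_devices

num_examples = 1_000_000  # assumption: rough figure, one example per article
num_train_epochs = 1

# Approximate number of optimizer updates over the whole run
optimizer_steps = math.ceil(num_examples * num_train_epochs / effective_batch_size)

print(effective_batch_size)  # 512
print(optimizer_steps)       # 1954
```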
Evaluation
Testing Data, Factors & Metrics
Testing Data
[More Information Needed]
Factors
[More Information Needed]
Metrics
[More Information Needed]
Results
[More Information Needed]
Summary
Environmental Impact
Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).
- Hardware Type: [More Information Needed]
- Hours used: [More Information Needed]
- Cloud Provider: [More Information Needed]
- Compute Region: [More Information Needed]
- Carbon Emitted: [More Information Needed]
Technical Specifications [optional]
Model Architecture and Objective
[More Information Needed]
Citation [optional]
BibTeX:
[More Information Needed]
APA:
[More Information Needed]
Glossary [optional]
[More Information Needed]
More Information [optional]
[More Information Needed]
Model Card Authors [optional]
[More Information Needed]
Model Card Contact
[More Information Needed]