
SwissBERT was fine-tuned via SimCSE (Gao et al., EMNLP 2021) to produce sentence embeddings, using ~1 million Swiss news articles published in 2022 and obtained from Swissdox@LiRI. Following the Sentence Transformers approach (Reimers and Gurevych, 2019), the average of the last hidden states (pooler_type=avg) is used as the sentence representation.

The fine-tuning script can be accessed here.
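
As a rough illustration of the kind of objective SimCSE optimizes, the sketch below implements a contrastive (InfoNCE) loss with in-batch negatives. It is not taken from the fine-tuning script: it assumes the unsupervised SimCSE variant, in which each sentence is encoded twice with different dropout masks, and this card does not state which variant was used. The function name `simcse_loss` and the random toy tensors are hypothetical; `temp=0.05` mirrors the `--temp 0.05` flag listed under Training Hyperparameters.

```python
import torch
import torch.nn.functional as F


def simcse_loss(z1: torch.Tensor, z2: torch.Tensor, temp: float = 0.05) -> torch.Tensor:
    """Contrastive loss with in-batch negatives (hypothetical helper).

    z1, z2: (batch, dim) embeddings of the same sentences from two forward passes.
    """
    # Cosine similarity between every z1[i] and every z2[j], scaled by the temperature
    sim = F.cosine_similarity(z1.unsqueeze(1), z2.unsqueeze(0), dim=-1) / temp
    # For row i, the matching (positive) embedding sits in column i
    labels = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(sim, labels)


# Toy check: random vectors stand in for sentence embeddings
torch.manual_seed(0)
z1 = torch.randn(8, 16)
z2 = z1 + 0.01 * torch.randn(8, 16)  # nearly identical second "view"
loss_aligned = simcse_loss(z1, z2)
loss_random = simcse_loss(z1, torch.randn(8, 16))  # unrelated second "view"
print(loss_aligned.item(), loss_random.item())
```

The loss is low when the two views of each sentence are closer to each other than to the other sentences in the batch, which is the behaviour the fine-tuning is meant to encourage.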


Model Details

Model Description

  • Developed by: Juri Grosjean
  • Model type: XMOD
  • Language(s) (NLP): de_CH, fr_CH, it_CH, rm_CH
  • License: [More Information Needed]
  • Finetuned from model: SwissBERT

Use

import torch

from transformers import AutoModel, AutoTokenizer

### German example

# Load the SwissBERT sentence-embedding model
model_name = "jgrosjean-mathesis/swissbert-for-sentence-embeddings"
model = AutoModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

def generate_sentence_embedding(sentence, language):

    # Set adapter to specified language
    if "de" in language:
        model.set_default_language("de_CH")
    elif "fr" in language:
        model.set_default_language("fr_CH")
    elif "it" in language:
        model.set_default_language("it_CH")
    elif "rm" in language:
        model.set_default_language("rm_CH")
    
    # Tokenize input sentence
    inputs = tokenizer(sentence, padding=True, truncation=True, return_tensors="pt", max_length=512)

    # Take tokenized input and pass it through the model
    with torch.no_grad():
        outputs = model(**inputs)

    # Extract average sentence embeddings from the last hidden layer
    embedding = outputs.last_hidden_state.mean(dim=1)

    return embedding

sentence_embedding = generate_sentence_embedding("Wir feiern am 1. August den Schweizer Nationalfeiertag.", language="de")
print(sentence_embedding)

Output:

tensor([[ 5.6306e-02, -2.8375e-01, -4.1495e-02,  7.4393e-02, -3.1552e-01,
          1.5213e-01, -1.0258e-01,  2.2790e-01, -3.5968e-02,  3.1769e-01,
          1.9354e-01,  1.9748e-02, -1.5236e-01, -2.2657e-01,  1.3345e-02,
        ...]])
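
The `mean(dim=1)` call above averages over every token position. That is fine for a single sentence, but when several sentences of different lengths are passed in one batch, the padding positions would be averaged in as well. The following is a minimal sketch of a mask-aware mean, assuming the usual `attention_mask` convention (1 for real tokens, 0 for padding). The helper name `masked_mean_pool` is hypothetical, and the dummy tensors stand in for model outputs so the snippet runs without downloading the model.

```python
import torch


def masked_mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token vectors per sentence, skipping padding positions."""
    # (batch, seq_len) -> (batch, seq_len, 1) so the mask broadcasts over the hidden dimension
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)
    # Number of real tokens per sentence; the clamp guards against division by zero
    counts = mask.sum(dim=1).clamp(min=1e-9)
    return summed / counts


# Dummy batch: 2 sentences, 3 positions, hidden size 2.
# The last position of the second sentence is padding.
hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
                       [[1.0, 1.0], [3.0, 3.0], [9.0, 9.0]]])
mask = torch.tensor([[1, 1, 1],
                     [1, 1, 0]])
pooled = masked_mean_pool(hidden, mask)
print(pooled)  # tensor([[3., 4.], [2., 2.]])
```

With the model, this would correspond to calling `masked_mean_pool(outputs.last_hidden_state, inputs["attention_mask"])` in place of the plain mean.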

Semantic Textual Similarity

from sklearn.metrics.pairwise import cosine_similarity

# Define two sentences
sentence_1 = ["Der Zug kommt um 9 Uhr in Zürich an."]
sentence_2 = ["Le train arrive à Lausanne à 9h."]

# Compute embeddings for both sentences
embedding_1 = generate_sentence_embedding(sentence_1, language="de")
embedding_2 = generate_sentence_embedding(sentence_2, language="fr")

# Compute cosine similarity (returns a 1x1 array)
cosine_score = cosine_similarity(embedding_1, embedding_2)

# Output the score
print("The cosine score for", sentence_1, "and", sentence_2, "is", cosine_score)
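
When more than two sentences need to be compared, all pairwise scores can be computed at once. The sketch below does this in plain PyTorch by L2-normalizing the rows and taking a matrix product. The helper name `cosine_similarity_matrix` is hypothetical, and small dummy vectors are used in place of real sentence embeddings.

```python
import torch
import torch.nn.functional as F


def cosine_similarity_matrix(embeddings: torch.Tensor) -> torch.Tensor:
    """Return an (n, n) matrix of cosine similarities between the row vectors."""
    # After L2 normalization, a dot product equals the cosine similarity
    normalized = F.normalize(embeddings, p=2, dim=1)
    return normalized @ normalized.T


# Dummy 2-dimensional vectors standing in for sentence embeddings
embeddings = torch.tensor([[1.0, 0.0],
                           [2.0, 0.0],
                           [0.0, 3.0]])
scores = cosine_similarity_matrix(embeddings)
print(scores)
```

Rows 0 and 1 point in the same direction and score 1.0 against each other, while row 2 is orthogonal to both and scores 0.0. Embeddings returned by `generate_sentence_embedding` could be stacked with `torch.cat([embedding_1, embedding_2])` before being passed in.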

Bias, Risks, and Limitations

This model has been trained on news articles only, so it may perform less well on other types of text.

Training Details

Training Data

[More Information Needed]

Training Procedure

Preprocessing [optional]

[More Information Needed]

Training Hyperparameters

  • Training regime: python3 train_simcse_multilingual.py
    --seed 54699
    --model_name_or_path zurichNLP/swissbert
    --train_file /srv/scratch2/grosjean/Masterarbeit/data_subsets
    --output_dir /srv/scratch2/grosjean/Masterarbeit/model
    --overwrite_output_dir
    --save_strategy no
    --do_train
    --num_train_epochs 1
    --learning_rate 1e-5
    --per_device_train_batch_size 4
    --gradient_accumulation_steps 128
    --max_seq_length 512
    --overwrite_cache
    --pooler_type avg
    --pad_to_max_length
    --temp 0.05
    --fp16
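
The batch-size flags above combine into an effective batch size per optimizer step. The short calculation below works this out and gives a rough step count for one epoch. Two inputs are assumptions that this card does not state: a single training device, and exactly 1,000,000 training examples (one per article, rounding the "~1 million" figure).

```python
import math

per_device_train_batch_size = 4    # from --per_device_train_batch_size
gradient_accumulation_steps = 128  # from --gradient_accumulation_steps
num_devices = 1                    # assumption: not stated in this card
num_examples = 1_000_000           # assumption: "~1 million" articles, one example each

# Examples contributing to each optimizer update
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_devices

# Approximate number of optimizer updates in one epoch
optimizer_steps = math.ceil(num_examples / effective_batch_size)

print(effective_batch_size)  # 512
print(optimizer_steps)       # 1954
```

Under these assumptions, each update sees 512 examples, so the in-batch contrast happens within the 4 sentences of a forward pass while the gradient is averaged over 128 such passes.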

[More Information Needed]

Evaluation

Testing Data, Factors & Metrics

Testing Data

[More Information Needed]

Factors

[More Information Needed]

Metrics

[More Information Needed]

Results

[More Information Needed]

Summary

Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

  • Hardware Type: [More Information Needed]
  • Hours used: [More Information Needed]
  • Cloud Provider: [More Information Needed]
  • Compute Region: [More Information Needed]
  • Carbon Emitted: [More Information Needed]

Technical Specifications [optional]

Model Architecture and Objective

[More Information Needed]

Citation [optional]

BibTeX:

[More Information Needed]

APA:

[More Information Needed]

Glossary [optional]

[More Information Needed]

More Information [optional]

[More Information Needed]

Model Card Authors [optional]

[More Information Needed]

Model Card Contact

[More Information Needed]