---
{}
---

This model is [SwissBERT](https://huggingface.co/ZurichNLP/swissbert) fine-tuned via [SimCSE](http://dx.doi.org/10.18653/v1/2021.emnlp-main.552) (Gao et al., EMNLP 2021) for sentence embeddings, using ~1 million Swiss news articles published in 2022 from [Swissdox@LiRI](https://t.uzh.ch/1hI). Following the [Sentence Transformers](https://huggingface.co/sentence-transformers) approach (Reimers and Gurevych, 2019), the average of the last hidden states (`pooler_type=avg`) is used as the sentence representation.

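
The averaging step can be illustrated with a minimal sketch. It assumes PyTorch and uses a toy tensor in place of real model outputs; the helper name `mean_pool` is hypothetical and not taken from the fine-tuning script. The sketch excludes padded positions through the attention mask, which matters when sentences of different lengths are embedded in one batch. Whether the training script masks padding in the same way is not stated in this card.

```python
import torch


def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token vectors per sentence, ignoring padded positions (hypothetical helper)."""
    # (batch, seq_len) -> (batch, seq_len, 1) so the mask broadcasts over the hidden dimension
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)
    # Clamp to avoid division by zero if a row contains only padding
    counts = mask.sum(dim=1).clamp(min=1.0)
    return summed / counts


# Toy input: 2 sentences, 3 token positions, hidden size 2
hidden = torch.tensor([
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],  # last position is padding
    [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]],
])
mask = torch.tensor([[1, 1, 0], [1, 1, 1]])
pooled = mean_pool(hidden, mask)
print(pooled)
```

For a single unpadded sentence the mask is all ones, so this reduces to a plain mean over `last_hidden_state`.
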
The fine-tuning script can be accessed [here](Link).

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6564ab8d113e2baa55830af0/1Ac61s_mlgW5aL6OnS6ay.png)

## Model Details

### Model Description

- **Developed by:** [Juri Grosjean](https://huggingface.co/jgrosjean)
- **Model type:** [XMOD](https://huggingface.co/facebook/xmod-base)
- **Language(s) (NLP):** de_CH, fr_CH, it_CH, rm_CH
- **License:** [More Information Needed]
- **Finetuned from model:** [SwissBERT](https://huggingface.co/ZurichNLP/swissbert)

## Use

```python
import torch
from transformers import AutoModel, AutoTokenizer


### German example

def generate_sentence_embedding(sentence, model_name="jgrosjean-mathesis/swissbert-for-sentence-embeddings"):
    # Load the fine-tuned SwissBERT model and its tokenizer
    model = AutoModel.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    # Select the Swiss Standard German language adapter
    model.set_default_language("de_CH")

    # Tokenize input sentence
    inputs = tokenizer(sentence, padding=True, truncation=True, return_tensors="pt", max_length=512)

    # Set the model to evaluation mode
    model.eval()

    # Take tokenized input and pass it through the model
    with torch.no_grad():
        outputs = model(**inputs)

    # Extract average sentence embeddings from the last hidden layer
    # (a single sentence has no padding, so a plain mean over tokens is sufficient)
    embedding = outputs.last_hidden_state.mean(dim=1)

    return embedding


sentence_embedding = generate_sentence_embedding("Wir feiern am 1. August den Schweizer Nationalfeiertag.")
print(sentence_embedding)
```

Output:

```
tensor([[ 5.6306e-02, -2.8375e-01, -4.1495e-02,  7.4393e-02, -3.1552e-01,
          1.5213e-01, -1.0258e-01,  2.2790e-01, -3.5968e-02,  3.1769e-01,
          1.9354e-01,  1.9748e-02, -1.5236e-01, -2.2657e-01,  1.3345e-02,
         ...]])
```

## Bias, Risks, and Limitations

This model has been trained on news articles only, so it may perform worse on other kinds of text.

Although the model is multilingual, it has not been fine-tuned for cross-lingual transfer. It is intended for computing sentence embeddings that are compared within a single language.

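
Comparing embeddings within one language is commonly done with cosine similarity. The sketch below shows one way such a comparison could look, assuming PyTorch; the helper `pairwise_cosine` is hypothetical, and toy vectors stand in for embeddings produced by the model for sentences of the same language.

```python
import torch
import torch.nn.functional as F


def pairwise_cosine(embeddings: torch.Tensor) -> torch.Tensor:
    """Return the (n, n) matrix of cosine similarities between n embeddings (hypothetical helper)."""
    # Scale every row to unit length so that the dot product equals the cosine
    normed = F.normalize(embeddings, p=2, dim=1)
    return normed @ normed.T


# Toy vectors standing in for sentence embeddings of three same-language sentences
toy = torch.tensor([[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]])
sims = pairwise_cosine(toy)
print(sims)
```

Rows 0 and 1 point in the same direction and score 1.0 against each other, while row 2 is orthogonal to both and scores 0.0.
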
## Training Details

### Training Data

[More Information Needed]

### Training Procedure

#### Preprocessing [optional]

[More Information Needed]

#### Training Hyperparameters

- **Training regime:** python3 train_simcse_multilingual.py \
  --seed 54699 \
  --model_name_or_path zurichNLP/swissbert \
  --train_file /srv/scratch2/grosjean/Masterarbeit/data_subsets \
  --output_dir /srv/scratch2/grosjean/Masterarbeit/model \
  --overwrite_output_dir \
  --save_strategy no \
  --do_train \
  --num_train_epochs 1 \
  --learning_rate 1e-5 \
  --per_device_train_batch_size 4 \
  --gradient_accumulation_steps 128 \
  --max_seq_length 512 \
  --overwrite_cache \
  --pooler_type avg \
  --pad_to_max_length \
  --temp 0.05 \
  --fp16

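
To make the role of `--temp 0.05` more concrete, here is a minimal sketch of the contrastive objective described by Gao et al. for unsupervised SimCSE: each sentence is encoded twice with different dropout masks, the two encodings form a positive pair, and the other sentences in the batch act as negatives. This is an illustration of the general objective under the assumption that the unsupervised variant was used; it is not the code of `train_simcse_multilingual.py`, and the function name `simcse_loss` and the toy embeddings are hypothetical.

```python
import torch
import torch.nn.functional as F


def simcse_loss(z1: torch.Tensor, z2: torch.Tensor, temp: float = 0.05) -> torch.Tensor:
    """InfoNCE loss with in-batch negatives; z1[i] and z2[i] form the positive pair."""
    # (batch, 1, dim) vs (1, batch, dim) -> (batch, batch) cosine similarity matrix
    sim = F.cosine_similarity(z1.unsqueeze(1), z2.unsqueeze(0), dim=-1) / temp
    # The matching view of sentence i sits on the diagonal
    labels = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(sim, labels)


# Toy embeddings standing in for two dropout-perturbed encodings of 4 sentences
z = torch.eye(4)
aligned_loss = simcse_loss(z, z)                          # positives on the diagonal
shuffled_loss = simcse_loss(z, z.roll(shifts=1, dims=0))  # positives deliberately mismatched

# Sequences per optimizer step on one device, derived from the flags above:
# per_device_train_batch_size * gradient_accumulation_steps
effective_batch_per_device = 4 * 128
print(aligned_loss.item(), shuffled_loss.item(), effective_batch_per_device)
```

A low temperature such as 0.05 sharpens the softmax over similarities, so the loss is close to zero when positives are aligned and large when they are mismatched.
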
## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

[More Information Needed]

#### Factors

[More Information Needed]

#### Metrics

[More Information Needed]

### Results

[More Information Needed]

#### Summary

## Environmental Impact

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- **Hardware Type:** [More Information Needed]
- **Hours used:** [More Information Needed]
- **Cloud Provider:** [More Information Needed]
- **Compute Region:** [More Information Needed]
- **Carbon Emitted:** [More Information Needed]

## Technical Specifications [optional]

### Model Architecture and Objective

[More Information Needed]

### Compute Infrastructure

[More Information Needed]

#### Hardware

[More Information Needed]

#### Software

[More Information Needed]

## Citation [optional]

**BibTeX:**

[More Information Needed]

**APA:**

[More Information Needed]

## Glossary [optional]

[More Information Needed]

## More Information [optional]

[More Information Needed]

## Model Card Authors [optional]

[More Information Needed]

## Model Card Contact

[More Information Needed]