---
# For reference on model card metadata, see the spec: https://github.com/huggingface/hub-docs/blob/main/modelcard.md?plain=1
# Doc / guide: https://huggingface.co/docs/hub/model-cards
{}
---
<!-- Provide a quick summary of what the model is/does. -->
The [SwissBERT](https://huggingface.co/ZurichNLP/swissbert) model was fine-tuned via [SimCSE](http://dx.doi.org/10.18653/v1/2021.emnlp-main.552) (Gao et al., EMNLP 2021) for sentence embeddings, using ~1 million Swiss news articles published in 2022 from [Swissdox@LiRI](https://t.uzh.ch/1hI). Following the [Sentence Transformers](https://huggingface.co/sentence-transformers) approach (Reimers and Gurevych, 2019), the average of the last hidden states (`pooler_type=avg`) is used as the sentence representation.
The fine-tuning script can be accessed [here](Link).
![image/png](https://cdn-uploads.huggingface.co/production/uploads/6564ab8d113e2baa55830af0/zUUu7WLJdkM2hrIE5ev8L.png)
## Model Details
### Model Description
<!-- Provide a longer summary of what this model is. -->
- **Developed by:** [Juri Grosjean](https://huggingface.co/jgrosjean)
- **Model type:** [XMOD](https://huggingface.co/facebook/xmod-base)
- **Language(s) (NLP):** de_CH, fr_CH, it_CH, rm_CH
- **License:** [More Information Needed]
- **Finetuned from model:** [SwissBERT](https://huggingface.co/ZurichNLP/swissbert)
## Use
<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
```python
import torch
from transformers import AutoModel, AutoTokenizer

# Load the swissBERT for sentence embeddings model
model_name = "jgrosjean-mathesis/swissbert-for-sentence-embeddings"
model = AutoModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

def generate_sentence_embedding(sentence, language):

    # Set the language adapter to the specified language
    if "de" in language:
        model.set_default_language("de_CH")
    if "fr" in language:
        model.set_default_language("fr_CH")
    if "it" in language:
        model.set_default_language("it_CH")
    if "rm" in language:
        model.set_default_language("rm_CH")

    # Tokenize the input sentence
    inputs = tokenizer(sentence, padding=True, truncation=True, return_tensors="pt", max_length=512)

    # Pass the tokenized input through the model
    with torch.no_grad():
        outputs = model(**inputs)

    # Average the last hidden states to obtain the sentence embedding
    embedding = outputs.last_hidden_state.mean(dim=1)

    return embedding

# German example
sentence_embedding = generate_sentence_embedding("Wir feiern am 1. August den Schweizer Nationalfeiertag.", language="de")
print(sentence_embedding)
```
Output:
```
tensor([[ 5.6306e-02, -2.8375e-01, -4.1495e-02, 7.4393e-02, -3.1552e-01,
1.5213e-01, -1.0258e-01, 2.2790e-01, -3.5968e-02, 3.1769e-01,
1.9354e-01, 1.9748e-02, -1.5236e-01, -2.2657e-01, 1.3345e-02,
...]])
```
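Note that `generate_sentence_embedding` above averages over every token position, including padding. For a single sentence this is harmless, but for batched inputs a mask-aware mean (in the spirit of the Sentence Transformers pooling) keeps padding tokens from diluting the embeddings. The batched helper below is a minimal sketch, not part of the original script; it assumes `language` is one of `"de"`, `"fr"`, `"it"`, `"rm"`:

```python
def generate_sentence_embeddings_batched(sentences, language):
    # Hypothetical batched variant of the helper above
    model.set_default_language(f"{language}_CH")
    inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt", max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)

    # Zero out padding positions before averaging
    mask = inputs["attention_mask"].unsqueeze(-1)            # (batch, seq_len, 1)
    summed = (outputs.last_hidden_state * mask).sum(dim=1)   # (batch, dim)
    return summed / mask.sum(dim=1)                          # divide by true token counts
```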
### Semantic Textual Similarity
```python
from sklearn.metrics.pairwise import cosine_similarity

# Define two sentences
sentence_1 = ["Der Zug kommt um 9 Uhr in Zürich an."]
sentence_2 = ["Le train arrive à Lausanne à 9h."]

# Compute embeddings for both
embedding_1 = generate_sentence_embedding(sentence_1, language="de")
embedding_2 = generate_sentence_embedding(sentence_2, language="fr")

# Compute the cosine similarity
cosine_score = cosine_similarity(embedding_1, embedding_2)

# Output the score
print("The cosine score for", sentence_1, "and", sentence_2, "is", cosine_score)
```
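`cosine_similarity` returns a 2-D similarity matrix (here 1×1). If you prefer to avoid the scikit-learn dependency, the same score can be computed with PyTorch alone; a small alternative sketch, not part of the original card:

```python
import torch.nn.functional as F

# Equivalent scalar score computed directly on the 1 x dim embedding tensors
cosine_score = F.cosine_similarity(embedding_1, embedding_2).item()
print(cosine_score)
```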
## Bias, Risks, and Limitations
<!-- This section is meant to convey both technical and sociotechnical limitations. -->
This model has been trained on news articles only. Hence, it may perform less well on other text genres.
## Training Details
### Training Data
<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
[More Information Needed]
### Training Procedure
<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
#### Preprocessing [optional]
[More Information Needed]
#### Training Hyperparameters
- **Training regime:** fp16 mixed precision
- **Training command:**

```bash
python3 train_simcse_multilingual.py \
    --seed 54699 \
    --model_name_or_path zurichNLP/swissbert \
    --train_file /srv/scratch2/grosjean/Masterarbeit/data_subsets \
    --output_dir /srv/scratch2/grosjean/Masterarbeit/model \
    --overwrite_output_dir \
    --save_strategy no \
    --do_train \
    --num_train_epochs 1 \
    --learning_rate 1e-5 \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 128 \
    --max_seq_length 512 \
    --overwrite_cache \
    --pooler_type avg \
    --pad_to_max_length \
    --temp 0.05 \
    --fp16
```
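For reference, the SimCSE objective behind this command (with `--temp 0.05` and average pooling) is an in-batch contrastive loss. A minimal sketch following the unsupervised variant of Gao et al. (2021), not the exact training script:

```python
import torch
import torch.nn.functional as F

def simcse_loss(emb_a, emb_b, temp=0.05):
    """In-batch contrastive loss. emb_a and emb_b are (batch, dim) embeddings
    of the same sentences encoded twice with different dropout masks."""
    # Pairwise cosine similarities, scaled by the temperature
    sim = F.cosine_similarity(emb_a.unsqueeze(1), emb_b.unsqueeze(0), dim=-1) / temp
    # Each sentence's positive is its own second encoding (the diagonal)
    labels = torch.arange(sim.size(0))
    return F.cross_entropy(sim, labels)
```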
## Evaluation
<!-- This section describes the evaluation protocols and provides the results. -->
### Testing Data, Factors & Metrics
#### Testing Data
<!-- This should link to a Dataset Card if possible. -->
[More Information Needed]
#### Factors
<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
[More Information Needed]
#### Metrics
<!-- These are the evaluation metrics being used, ideally with a description of why. -->
[More Information Needed]
### Results
[More Information Needed]
#### Summary
## Environmental Impact
<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
- **Hardware Type:** [More Information Needed]
- **Hours used:** [More Information Needed]
- **Cloud Provider:** [More Information Needed]
- **Compute Region:** [More Information Needed]
- **Carbon Emitted:** [More Information Needed]
## Technical Specifications [optional]
### Model Architecture and Objective
[More Information Needed]
## Citation [optional]
<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
**BibTeX:**
[More Information Needed]
**APA:**
[More Information Needed]
## Glossary [optional]
<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
[More Information Needed]
## More Information [optional]
[More Information Needed]
## Model Card Authors [optional]
[More Information Needed]
## Model Card Contact
[More Information Needed]