Update README.md

README.md
@@ -113,7 +113,7 @@ German, French, Italian and Romansh documents in the [Swissdox@LiRI database](ht

This model was fine-tuned via unsupervised [SimCSE](http://dx.doi.org/10.18653/v1/2021.emnlp-main.552). The same sequence is passed to the encoder twice and the distance between the two resulting embeddings is minimized. Because of dropout, the sequence is encoded at slightly different positions in the vector space on each pass.

-The fine-tuning script can be accessed [here](
+The fine-tuning script can be accessed [here](https://github.com/jgrosjean-mathesis/swissbert-for-sentence-embeddings/tree/main).

#### Training Hyperparameters

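The changed line above only adds the link to the fine-tuning script; the paragraph itself describes the unsupervised SimCSE objective only in words. The sketch below illustrates that general technique, not the linked script: the `encode` callable, the pooling behind it, and the temperature value are assumptions.

```python
import torch
import torch.nn.functional as F

def simcse_loss(encode, batch, temperature: float = 0.05):
    """Unsupervised SimCSE sketch: the same batch is encoded twice with
    dropout active, so each sentence yields two slightly different
    embeddings. The two views of a sentence form a positive pair; all
    other in-batch sentences act as negatives (InfoNCE over cosine
    similarity). `encode` is assumed to return a (batch_size, dim) tensor
    with the encoder in train mode, i.e. with dropout enabled."""
    z1 = encode(batch)  # first pass
    z2 = encode(batch)  # second pass, different dropout mask
    sim = F.cosine_similarity(z1.unsqueeze(1), z2.unsqueeze(0), dim=-1)  # (B, B)
    labels = torch.arange(sim.size(0), device=sim.device)  # positives on the diagonal
    return F.cross_entropy(sim / temperature, labels)
```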
@@ -143,14 +143,14 @@ The two evaluation tasks make use of the [20 Minuten dataset](https://www.zora.u

Embeddings are computed for the summary and the content of each document. Subsequently, the embeddings are matched by minimizing the cosine distance between each summary and content embedding pair.

-The performance is measured via accuracy, i.e. the ratio of correct vs. incorrect matches.
+The performance is measured via accuracy, i.e. the proportion of summaries matched to their correct content. The script can be found [here](https://github.com/jgrosjean-mathesis/swissbert-for-sentence-embeddings/tree/main).


#### Evaluation via Text Classification

<!-- These are the evaluation metrics being used, ideally with a description of why. -->

-Articles with the topic tags "movies/tv series", "corona" and "football" (or related) are filtered from the corpus and split into training data (80%) and test data (20%). Subsequently, embeddings are set up for the train and test data. The test data is then classified using the training data via a k-nearest neighbor approach.
+Articles with the topic tags "movies/tv series", "corona" and "football" (or related) are filtered from the corpus and split into training data (80%) and test data (20%). Subsequently, embeddings are computed for the training and test data. The test data is then classified using the training data via a k-nearest-neighbor approach. The script can be found [here](https://github.com/jgrosjean-mathesis/swissbert-for-sentence-embeddings/tree/main).

Note: For French and Italian, the training data remains in German, while the test data consists of translations. This provides insight into the model's cross-lingual transfer abilities.
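To make the document-matching metric concrete, here is a minimal sketch, assuming the summary and content embeddings are stored as two row-aligned arrays (row i of each belonging to the same article); it is an illustration, not the evaluation script linked in the diff.

```python
import numpy as np

def matching_accuracy(summary_emb: np.ndarray, content_emb: np.ndarray) -> float:
    """Match each summary to the content with the highest cosine similarity
    and report the share of summaries assigned to their own article.
    Both arrays are assumed to have shape (n_documents, dim)."""
    s = summary_emb / np.linalg.norm(summary_emb, axis=1, keepdims=True)
    c = content_emb / np.linalg.norm(content_emb, axis=1, keepdims=True)
    sim = s @ c.T                   # (n, n) cosine similarity matrix
    predicted = sim.argmax(axis=1)  # best-matching content for each summary
    return float((predicted == np.arange(len(s))).mean())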
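Likewise, the text-classification evaluation can be sketched as below. The value of k, the cosine metric, and the stratified random 80/20 split are assumptions for illustration; the cross-lingual variant described in the note (German training data, translated French or Italian test data) would use the pre-translated test set instead of a random split.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def knn_topic_accuracy(embeddings: np.ndarray, labels: np.ndarray, k: int = 5) -> float:
    """80/20 split of article embeddings, k-nearest-neighbor classifier
    fitted on the training embeddings, accuracy reported on the held-out
    test embeddings."""
    X_train, X_test, y_train, y_test = train_test_split(
        embeddings, labels, test_size=0.2, stratify=labels, random_state=0
    )
    clf = KNeighborsClassifier(n_neighbors=k, metric="cosine")
    clf.fit(X_train, y_train)
    return float(clf.score(X_test, y_test))
```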