jgrosjean committed
Commit b81e4bc
1 Parent(s): c38a19a

Update README.md

Files changed (1)
  1. README.md +3 -3
README.md CHANGED
@@ -113,7 +113,7 @@ German, French, Italian and Romansh documents in the [Swissdox@LiRI database](ht
 
 This model was fine-tuned via unsupervised [SimCSE](http://dx.doi.org/10.18653/v1/2021.emnlp-main.552). The same sequence is passed to the encoder twice, and the distance between the two resulting embeddings is minimized. Because of dropout, the sequence is encoded at slightly different positions in the vector space on each pass.
 
-The fine-tuning script can be accessed [here](Link).
+The fine-tuning script can be accessed [here](https://github.com/jgrosjean-mathesis/swissbert-for-sentence-embeddings/tree/main).
 
 #### Training Hyperparameters
 
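For readers skimming the diff, here is a minimal sketch of the unsupervised SimCSE objective described in the hunk above. It is not the linked fine-tuning script: the base checkpoint, mean pooling, and the temperature value are assumptions.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

# Assumption: the SwissBERT base checkpoint; the linked script may differ.
tokenizer = AutoTokenizer.from_pretrained("ZurichNLP/swissbert")
model = AutoModel.from_pretrained("ZurichNLP/swissbert")
model.set_default_language("de_CH")  # SwissBERT uses per-language adapters
model.train()  # keep dropout active so two passes give two different embeddings

def embed(sentences):
    """Mean-pooled sentence embeddings (pooling strategy is an assumption)."""
    inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    hidden = model(**inputs).last_hidden_state     # (batch, seq_len, dim)
    mask = inputs["attention_mask"].unsqueeze(-1)  # (batch, seq_len, 1)
    return (hidden * mask).sum(1) / mask.sum(1)

def simcse_loss(sentences, temperature=0.05):
    # Encode the same batch twice; dropout noise makes the two views differ.
    z1, z2 = embed(sentences), embed(sentences)
    # Pairwise cosine similarities; matching pairs lie on the diagonal.
    sim = F.cosine_similarity(z1.unsqueeze(1), z2.unsqueeze(0), dim=-1) / temperature
    labels = torch.arange(len(sentences))
    # Cross-entropy pulls the two embeddings of each sequence together and
    # pushes the other in-batch sentences apart.
    return F.cross_entropy(sim, labels)
```

The `torch.arange` labels treat every other sentence in the batch as a negative, which is what lets the contrastive objective work without any labeled pairs.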
@@ -143,14 +143,14 @@ The two evaluation tasks make use of the [20 Minuten dataset](https://www.zora.u
 
 Embeddings are computed for the summary and content of each document. Subsequently, the embeddings are matched by maximizing the cosine similarity between each summary and content embedding pair.
 
-The performance is measured via accuracy, i.e. the proportion of correct matches.
+The performance is measured via accuracy, i.e. the proportion of correct matches. The script can be found [here](https://github.com/jgrosjean-mathesis/swissbert-for-sentence-embeddings/tree/main).
 
 
 #### Evaluation via Text Classification
 
 <!-- These are the evaluation metrics being used, ideally with a description of why. -->
 
-Articles with the topic tags "movies/tv series", "corona" and "football" (or related) are filtered from the corpus and split into training data (80%) and test data (20%). Subsequently, embeddings are computed for the training and test data. The test data is then classified using the training data via a k-nearest-neighbor approach.
+Articles with the topic tags "movies/tv series", "corona" and "football" (or related) are filtered from the corpus and split into training data (80%) and test data (20%). Subsequently, embeddings are computed for the training and test data. The test data is then classified using the training data via a k-nearest-neighbor approach. The script can be found [here](https://github.com/jgrosjean-mathesis/swissbert-for-sentence-embeddings/tree/main).
 
 Note: For French and Italian, the training data remains in German, while the test data consists of translations. This provides insight into the model's cross-lingual transfer abilities.
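A minimal sketch of the summary–content matching step from the document retrieval evaluation above, assuming precomputed embeddings. The random arrays only stand in for real SwissBERT embeddings; this is not the linked evaluation script.

```python
import numpy as np

def match_by_cosine(summary_embs: np.ndarray, content_embs: np.ndarray) -> np.ndarray:
    """For each summary, return the index of the most cosine-similar content."""
    # L2-normalize so that a plain dot product equals cosine similarity.
    s = summary_embs / np.linalg.norm(summary_embs, axis=1, keepdims=True)
    c = content_embs / np.linalg.norm(content_embs, axis=1, keepdims=True)
    return (s @ c.T).argmax(axis=1)

# Stand-in embeddings: summary i belongs to content i by construction.
rng = np.random.default_rng(0)
content_embs = rng.normal(size=(100, 768))
summary_embs = content_embs + 0.1 * rng.normal(size=(100, 768))

preds = match_by_cosine(summary_embs, content_embs)
accuracy = (preds == np.arange(len(preds))).mean()  # proportion of correct matches
print(f"matching accuracy: {accuracy:.2%}")
```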
 
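Likewise, a sketch of the text classification evaluation under the same caveat: the value of k, the cosine metric, and the random stand-in data are assumptions, not details from the linked script.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-ins for SwissBERT embeddings of articles with the three topic tags.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 768))
y = rng.integers(0, 3, size=600)  # 0: movies/tv series, 1: corona, 2: football

# 80/20 split as described above. For the French and Italian settings,
# X_test would instead hold embeddings of translated articles while
# X_train stays German.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# k=5 and cosine distance are assumptions; any reasonable k behaves similarly.
knn = KNeighborsClassifier(n_neighbors=5, metric="cosine")
knn.fit(X_train, y_train)
print(f"classification accuracy: {knn.score(X_test, y_test):.2%}")
```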