Fairseq
Galician
Catalan
Commit 9a3cde0 (verified) by AudreyVM · 1 Parent(s): 2417e26

Update README.md

  1. README.md +1 -1
README.md CHANGED
@@ -62,7 +62,7 @@ the the Spanish side of the Projecte Aina Spanish-Catalan corpus using the GL-ES

  All datasets are deduplicated and filtered to remove any sentence pairs with a cosine similarity of less than 0.75.
  This is done using sentence embeddings calculated using [LaBSE](https://huggingface.co/sentence-transformers/LaBSE).
- The filtered datasets are then concatenated to form a final corpus of **10.017.995** and before training the punctuation is normalized using a
+ The filtered datasets are then concatenated to form the final training corpus and before training the punctuation is normalized using a
  modified version of the join-single-file.py script from [SoftCatalà](https://github.com/Softcatala/nmt-models/blob/master/data-processing-tools/join-single-file.py)

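
For readers of this diff, the deduplication and similarity filtering that the README context lines describe could look roughly like the sketch below. It is not the authors' pipeline: the function names `cosine` and `dedup_and_filter` and the `embed` callable are hypothetical, and the only detail taken from the README is the 0.75 cosine-similarity threshold. In practice `embed` might wrap `SentenceTransformer("sentence-transformers/LaBSE").encode`, but that wiring is an assumption here.

```python
import math
from typing import Callable, Iterable, List, Sequence, Tuple

Vector = Sequence[float]
Pair = Tuple[str, str]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def dedup_and_filter(
    pairs: Iterable[Pair],
    embed: Callable[[List[str]], Sequence[Vector]],
    threshold: float = 0.75,  # threshold stated in the README
) -> List[Pair]:
    """Drop exact duplicate pairs, then drop pairs scoring below the threshold.

    `embed` is a hypothetical callable that maps a list of sentences to one
    vector per sentence (for example, a LaBSE encoder).
    """
    seen = set()
    kept: List[Pair] = []
    for src, tgt in pairs:
        if (src, tgt) in seen:
            continue  # exact duplicate of an earlier pair
        seen.add((src, tgt))
        src_vec, tgt_vec = embed([src, tgt])
        if cosine(src_vec, tgt_vec) >= threshold:
            kept.append((src, tgt))
    return kept
```

Embedding one pair at a time keeps the sketch short; a real run over millions of pairs would batch the `embed` calls.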