Update README.md
README.md CHANGED
@@ -62,7 +62,7 @@ the the Spanish side of the Projecte Aina Spanish-Catalan corpus using the GL-ES
 
 All datasets are deduplicated and filtered to remove any sentence pairs with a cosine similarity of less than 0.75.
 This is done using sentence embeddings calculated using [LaBSE](https://huggingface.co/sentence-transformers/LaBSE).
-The filtered datasets are then concatenated to form
+The filtered datasets are then concatenated to form the final training corpus and before training the punctuation is normalized using a
 modified version of the join-single-file.py script from [SoftCatalà](https://github.com/Softcatala/nmt-models/blob/master/data-processing-tools/join-single-file.py)
 
 
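
For readers of this change, the deduplication and similarity filter described in the README context can be sketched roughly as follows. This is a minimal illustration, not the authors' script: the function names `cosine_similarity` and `dedup_and_filter` are hypothetical, exact-match deduplication on the sentence pair is an assumption (the README does not say how duplicates are detected), and the `embed` callable stands in for a LaBSE encoder. In practice it would presumably wrap something like `SentenceTransformer("sentence-transformers/LaBSE").encode`, which is not called here so that the sketch runs without downloading a model.

```python
import math
from typing import Callable, Iterable, List, Sequence, Tuple

Vector = Sequence[float]
Pair = Tuple[str, str]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def dedup_and_filter(
    pairs: Iterable[Pair],
    embed: Callable[[str], Vector],  # hypothetical stand-in for a LaBSE encoder
    threshold: float = 0.75,         # threshold stated in the README
) -> List[Pair]:
    """Drop exact duplicate pairs, then drop pairs scoring below threshold."""
    seen = set()
    kept: List[Pair] = []
    for src, tgt in pairs:
        if (src, tgt) in seen:  # assumption: duplicates are exact matches
            continue
        seen.add((src, tgt))
        if cosine_similarity(embed(src), embed(tgt)) >= threshold:
            kept.append((src, tgt))
    return kept


# Toy two-dimensional "embeddings" purely for demonstration.
toy_vectors = {"hola": [1.0, 0.0], "ola": [0.9, 0.1], "adeus": [0.0, 1.0]}
toy_pairs = [("hola", "ola"), ("hola", "ola"), ("hola", "adeus")]
filtered = dedup_and_filter(toy_pairs, toy_vectors.__getitem__)
```

With the toy vectors, the duplicate pair is collapsed and the dissimilar pair is discarded, leaving a single pair. Per the README, each dataset filtered this way is then concatenated with the others to form the training corpus.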