gonzalez-agirre
committed on
Commit 4810598 (1 parent: 74834e9)
Update README.md
README.md
CHANGED
@@ -113,12 +113,10 @@ Once the model has been successfully initialized, we continue its pre-training i
 | Dataset | Language | Tokens (pre-epoch) | Epochs |
 |---------------------|----------|--------------------|--------------|
 | Wikipedia | en | 2169.97M | 1.428144485 |
-| Lyrics | en | 100.60M | 0.7140722425 |
 | C4_es | es | 53709.80M | 0.1049686196 |
 | Biomedical | es | 455.03M | 0.7140722425 |
 | Legal | es | 995.70M | 0.7140722425 |
 | Wikipedia | es | 693.60M | 1.428144485 |
-| Lyrics | es | 125.93M | 0.7140722425 |
 | Gutenberg | es | 53.18M | 0.7140722425 |
 | C4_ca | ca | 2826.00M | 2.142216727 |
 | Biomedical | ca | 11.80M | 1.428144485 |
@@ -127,7 +125,6 @@ Once the model has been successfully initialized, we continue its pre-training i
 | CaWaC | ca | 57.79M | 2.142216727 |
 | Wikipedia | ca | 228.01M | 3.570361212 |
 | Vilaweb | ca | 50.34M | 2.142216727 |
-| Lyrics | ca | 0.50M | 2.142216727 |
 
 The resulting dataset has the following language distribution:
 
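
As a rough illustration of how a language distribution could be derived from the table in this diff, the sketch below multiplies each dataset's token count by its epoch value and sums per language. It rests on two assumptions that the diff does not confirm: that the `Epochs` column acts as a sampling multiplier on the token count, and that the rows visible here are representative. The README rows between the two hunks are not shown in this diff, so the Catalan total is partial and the shares will not match the README's own figures. The names `ROWS`, `effective_tokens_by_language`, and `language_shares` are hypothetical.

```python
# Rows visible on the new side of this diff:
# (dataset, language, tokens in millions, epochs).
# README rows that fall between the two hunks are NOT included.
ROWS = [
    ("Wikipedia", "en", 2169.97, 1.428144485),
    ("C4_es", "es", 53709.80, 0.1049686196),
    ("Biomedical", "es", 455.03, 0.7140722425),
    ("Legal", "es", 995.70, 0.7140722425),
    ("Wikipedia", "es", 693.60, 1.428144485),
    ("Gutenberg", "es", 53.18, 0.7140722425),
    ("C4_ca", "ca", 2826.00, 2.142216727),
    ("Biomedical", "ca", 11.80, 1.428144485),
    ("CaWaC", "ca", 57.79, 2.142216727),
    ("Wikipedia", "ca", 228.01, 3.570361212),
    ("Vilaweb", "ca", 50.34, 2.142216727),
]


def effective_tokens_by_language(rows):
    """Sum tokens * epochs per language (assumed meaning of 'Epochs')."""
    totals = {}
    for _dataset, lang, tokens_m, epochs in rows:
        totals[lang] = totals.get(lang, 0.0) + tokens_m * epochs
    return totals


def language_shares(rows):
    """Return each language's fraction of the total effective tokens."""
    totals = effective_tokens_by_language(rows)
    grand_total = sum(totals.values())
    return {lang: total / grand_total for lang, total in totals.items()}


if __name__ == "__main__":
    for lang, share in sorted(language_shares(ROWS).items()):
        print(f"{lang}: {share:.1%}")
```

Removing the three `Lyrics` rows, as this commit does, changes both the per-language totals and the grand total, which is why the distribution has to be recomputed rather than adjusted row by row.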