- fullstop
- truecase
- capitalization
---

# Model
This model restores punctuation, predicts full stops (sentence boundaries), and predicts true-casing (capitalization)
for text in the 6 most popular Romance languages:

* Spanish
* French
* Portuguese
* Catalan
* Italian
* Romanian

Together, these languages cover approximately 97% of native speakers of the Romance language family.

This model predicts the following punctuation tokens:

* .
* ,
* ?
* ¿
* ACRONYM

Though rare in these languages relative to English, the special token `ACRONYM` allows fully punctuating tokens such as "`pm`" → "`p.m.`".

# Usage
The model is released as a `SentencePiece` tokenizer and an `ONNX` graph.

The easiest way to run inference is with the `punctuators` package:
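A minimal sketch of such an invocation follows. The class name `PunctCapSegModelONNX` and the pretrained identifier `"pcs_romance"` are assumptions based on the `punctuators` package's general API; check the package documentation for the identifiers that match this model.

```python
# Hypothetical usage sketch for the `punctuators` package.
# The class name and pretrained identifier ("pcs_romance") are
# assumptions, not confirmed by this model card.
from typing import List

from punctuators.models import PunctCapSegModelONNX

m: PunctCapSegModelONNX = PunctCapSegModelONNX.from_pretrained("pcs_romance")

input_texts: List[str] = [
    "hola mundo cómo estás",             # Spanish, no punctuation or casing
    "bonjour tout le monde ça va bien",  # French
]

# One list of punctuated, true-cased sentences per input text
results: List[List[str]] = m.infer(input_texts)
for text, segments in zip(input_texts, results):
    print(f"Input: {text}")
    for segment in segments:
        print(f"  Output: {segment}")
```
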
# Training Parameters
This model was trained by concatenating between 1 and 14 random sentences.
The concatenation points became sentence-boundary targets,
the text was lower-cased to produce true-case targets,
and punctuation was removed to create punctuation targets.

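That target-construction scheme can be sketched as follows. This is a simplified illustration, not the actual training code: the helper, the word-level (rather than subtoken) targets, and the ASCII-plus-inverted-marks punctuation set are all assumptions made for the example.

```python
import random
import string

# Punctuation set for this sketch: ASCII plus Spanish inverted marks.
PUNCT = string.punctuation + "¿¡"

def make_example(sentences, min_concat=1, max_concat=14, rng=None):
    """Build one training example from clean, punctuated sentences.

    Concatenation points become full-stop (sentence boundary) targets,
    lower-casing produces true-case targets, and stripped punctuation
    becomes the punctuation targets.
    """
    rng = rng or random.Random(0)
    k = rng.randint(min_concat, min(max_concat, len(sentences)))
    chosen = rng.sample(sentences, k)  # language-homogeneous sample

    inputs, punct_targets, case_targets, fullstop_targets = [], [], [], []
    for sentence in chosen:
        words = sentence.split()
        for i, word in enumerate(words):
            bare = word.strip(PUNCT)
            if not bare:
                continue
            pre = word[: len(word) - len(word.lstrip(PUNCT))]
            post = word[len(word.rstrip(PUNCT)):]
            inputs.append(bare.lower())                     # model input
            punct_targets.append((pre + post) or "<null>")  # punctuation target
            case_targets.append(bare)                       # true-case target
            fullstop_targets.append(i == len(words) - 1)    # sentence boundary
    return inputs, punct_targets, case_targets, fullstop_targets

sents = ["Hola, mundo.", "¿Cómo estás?"]
x, punct, case, fullstop = make_example(sents, min_concat=2, max_concat=2)
print(x)         # lower-cased, punctuation-free inputs
print(punct)     # punctuation to restore per token
print(case)      # true-cased forms
print(fullstop)  # sentence-boundary flags
```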
Batches were built by randomly sampling from each language.
Each example is language-homogeneous (i.e., only sentences from the same language are concatenated),
but batches were multilingual; neither language tags nor language-specific paths are used in the graph.

The maximum length during training was 256 subtokens.
The `punctuators` package can nonetheless punctuate inputs of any length;
behind the scenes, it splits the input into overlapping subsegments of 256 tokens and combines the results.

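The sliding-window idea can be sketched like this. The window size matches the 256-token limit above, but the helper itself and the overlap of 32 tokens are illustrative assumptions, not the package's actual implementation.

```python
def overlapping_windows(tokens, max_len=256, overlap=32):
    """Split a token sequence into overlapping windows of at most
    `max_len` tokens; consecutive windows share `overlap` tokens."""
    if overlap >= max_len:
        raise ValueError("overlap must be smaller than max_len")
    if len(tokens) <= max_len:
        return [tokens]
    stride = max_len - overlap
    windows = []
    for start in range(0, len(tokens), stride):
        windows.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break
    return windows

toks = list(range(600))
wins = overlapping_windows(toks)
print([len(w) for w in wins])  # -> [256, 256, 152] for a 600-token input
```

After each window is punctuated independently, the overlapping regions let the combiner reconcile predictions near window edges, where context is weakest.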
If you use the raw ONNX graph, note that while the model will accept sequences of up to 512 tokens, only 256 positional embeddings have been trained.

# Training Data
For all languages except Catalan, this model was trained with ~10M lines of text per language from StatMT's [News Crawl](https://data.statmt.org/news-crawl/).

Catalan is not included in StatMT's News Crawl.
For completeness of the Romance language family, ~500k lines of `OpenSubtitles` were used for Catalan.

# Metrics