1-800-BAD-CODE committed on
Commit 044610d
1 Parent(s): 8864f7c

Update README.md

Files changed (1): README.md +53 -1
README.md CHANGED
tags:
  - fullstop
  - truecase
  - capitalization
---

# Model
This model restores punctuation, predicts full stops (sentence boundaries), and predicts true-casing (capitalization)
for text in the six most widely spoken Romance languages:

* Spanish
* French
* Portuguese
* Catalan
* Italian
* Romanian

Together, these languages cover approximately 97% of native speakers of the Romance language family.

This model predicts the following punctuation tokens:

* .
* ,
* ?
* ¿
* ACRONYM

Though acronyms are rarer in these languages than in English, the special token `ACRONYM` allows fully punctuating tokens such as "`pm`" → "`p.m.`".

# Usage
The model is released as a `SentencePiece` tokenizer and an `ONNX` graph.

The easiest way to run inference is with the `punctuators` package:
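
A minimal sketch, assuming the package is installed via `pip install punctuators` and that this model is registered under a pretrained name such as `pcs_romance` (the exact identifier is an assumption; check the `punctuators` documentation):

```python
from punctuators.models import PunctCapSegModelONNX

# Download and load the ONNX graph and SentencePiece tokenizer.
# "pcs_romance" is an assumed pretrained name for this model.
m = PunctCapSegModelONNX.from_pretrained("pcs_romance")

# Lower-cased, unpunctuated input in any of the six target languages.
input_texts = ["hola mundo cómo estás estamos bien"]

# For each input, the model returns a list of punctuated,
# true-cased sentences (one element per detected sentence).
results = m.infer(input_texts)
for text, outputs in zip(input_texts, results):
    print(f"Input: {text}")
    print(f"Outputs: {outputs}")
```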

# Training Parameters
This model was trained by concatenating between 1 and 14 random sentences.
The concatenation points became sentence-boundary targets,
text was lower-cased to produce true-case targets,
and punctuation was removed to create punctuation targets.
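
A hypothetical sketch of that target construction, working on whitespace-delimited words rather than the subword tokens the real pipeline uses (all names here are illustrative):

```python
import random
import string

def make_training_example(corpus_sentences, max_concat=14):
    # Concatenate between 1 and 14 random sentences from one language.
    chosen = random.sample(corpus_sentences, random.randint(1, max_concat))

    input_words, punct_targets, case_targets, boundary_targets = [], [], [], []
    for sent in chosen:
        words = sent.split()
        for i, word in enumerate(words):
            bare = word.rstrip(string.punctuation)
            if not bare:
                continue
            punct_targets.append(word[len(bare):] or None)  # trailing punctuation target
            case_targets.append(bare)                       # true-case target
            input_words.append(bare.lower())                # lower-cased model input
            boundary_targets.append(i == len(words) - 1)    # sentence-boundary target

    return " ".join(input_words), punct_targets, case_targets, boundary_targets
```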

Batches were built by randomly sampling from each language.
Each example is language-homogeneous (i.e., only sentences from the same language are concatenated).
Batches were multilingual; the graph uses neither language tags nor language-specific paths.

The maximum length during training was 256 subtokens.
The `punctuators` package can punctuate inputs of any length.
This is accomplished behind the scenes by splitting the input into overlapping subsegments of 256 tokens and combining the results.
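
The overlapping-window idea looks roughly like the sketch below (not the package's actual implementation; the overlap size is arbitrary):

```python
def split_with_overlap(token_ids, max_len=256, overlap=32):
    # Step forward by max_len - overlap so consecutive windows
    # share `overlap` tokens; predictions in the shared region can
    # then be reconciled when the windows are recombined.
    step = max_len - overlap
    return [
        token_ids[start:start + max_len]
        for start in range(0, max(len(token_ids) - overlap, 1), step)
    ]
```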

If you use the raw ONNX graph, note that while the model will accept sequences up to 512 tokens, only 256 positional embeddings have been trained.
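
Running the raw graph might look like the following sketch, assuming `onnxruntime` and `sentencepiece` are installed; the file names are assumptions, the graph's full input set should be checked with `session.get_inputs()`, and interpreting the outputs depends on the graph's prediction heads:

```python
import numpy as np
import onnxruntime as ort
import sentencepiece as spm

# Assumed file names for the released tokenizer and graph.
sp = spm.SentencePieceProcessor(model_file="sp.model")
session = ort.InferenceSession("model.onnx")

ids = sp.encode("hola mundo cómo estás", out_type=int)
ids = ids[:256]  # only 256 positional embeddings are trained

# Query the input name rather than assuming it.
input_name = session.get_inputs()[0].name
outputs = session.run(None, {input_name: np.array([ids], dtype=np.int64)})
```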

# Training Data
For all languages except Catalan, this model was trained with ~10M lines of text per language from StatMT's [News Crawl](https://data.statmt.org/news-crawl/).

Catalan is not included in StatMT's News Crawl.
For completeness of the Romance language family, ~500k lines of `OpenSubtitles` data were used for Catalan.

# Metrics