- fullstop
- truecase
- capitalization
---

# Model
This model restores punctuation, predicts full stops (sentence boundaries), and predicts true-casing (capitalization)
for text in the 6 most popular Romance languages:

* Spanish
* French
* Portuguese
* Catalan
* Italian
* Romanian

Together, these languages cover approximately 97% of native speakers of the Romance language family.

This model predicts the following punctuation tokens:

* .
* ,
* ?
* ¿
* ACRONYM

Though rare in these languages relative to English, the special token `ACRONYM` allows fully punctuating tokens such as "`pm`" → "`p.m.`".

# Usage
The model is released as a `SentencePiece` tokenizer and an `ONNX` graph.

The easiest way to run inference is with the `punctuators` package:
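A minimal sketch of such an invocation follows. The class name `PunctCapSegModelONNX` and the pretrained identifier `"pcs_romance"` are assumptions based on the `punctuators` package's general API; check the package documentation for the identifiers that match this model.

```python
# Hypothetical usage sketch for the `punctuators` package.
# The class name and pretrained identifier ("pcs_romance") are
# assumptions, not confirmed by this model card.
from typing import List

from punctuators.models import PunctCapSegModelONNX

m: PunctCapSegModelONNX = PunctCapSegModelONNX.from_pretrained("pcs_romance")

input_texts: List[str] = [
    "hola mundo cómo estás",             # Spanish, no punctuation or casing
    "bonjour tout le monde ça va bien",  # French
]

# One list of punctuated, true-cased sentences per input text
results: List[List[str]] = m.infer(input_texts)
for text, segments in zip(input_texts, results):
    print(f"Input: {text}")
    for segment in segments:
        print(f"  Output: {segment}")
```
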
# Training Parameters
This model was trained by concatenating between 1 and 14 random sentences.
The concatenation points became sentence-boundary targets,
the text was lower-cased to produce true-case targets,
and punctuation was removed to create punctuation targets.

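That target-construction scheme can be sketched as follows. This is a simplified illustration, not the actual training code: the helper, the word-level (rather than subtoken) targets, and the ASCII-plus-inverted-marks punctuation set are all assumptions made for the example.

```python
import random
import string

# Punctuation set for this sketch: ASCII plus Spanish inverted marks.
PUNCT = string.punctuation + "¿¡"

def make_example(sentences, min_concat=1, max_concat=14, rng=None):
    """Build one training example from clean, punctuated sentences.

    Concatenation points become full-stop (sentence boundary) targets,
    lower-casing produces true-case targets, and stripped punctuation
    becomes the punctuation targets.
    """
    rng = rng or random.Random(0)
    k = rng.randint(min_concat, min(max_concat, len(sentences)))
    chosen = rng.sample(sentences, k)  # language-homogeneous sample

    inputs, punct_targets, case_targets, fullstop_targets = [], [], [], []
    for sentence in chosen:
        words = sentence.split()
        for i, word in enumerate(words):
            bare = word.strip(PUNCT)
            if not bare:
                continue
            pre = word[: len(word) - len(word.lstrip(PUNCT))]
            post = word[len(word.rstrip(PUNCT)):]
            inputs.append(bare.lower())                     # model input
            punct_targets.append((pre + post) or "<null>")  # punctuation target
            case_targets.append(bare)                       # true-case target
            fullstop_targets.append(i == len(words) - 1)    # sentence boundary
    return inputs, punct_targets, case_targets, fullstop_targets

sents = ["Hola, mundo.", "¿Cómo estás?"]
x, punct, case, fullstop = make_example(sents, min_concat=2, max_concat=2)
print(x)         # lower-cased, punctuation-free inputs
print(punct)     # punctuation to restore per token
print(case)      # true-cased forms
print(fullstop)  # sentence-boundary flags
```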
Batches were built by randomly sampling from each language.
Each example is language-homogeneous (i.e., only sentences from the same language are concatenated),
but batches were multilingual; neither language tags nor language-specific paths are used in the graph.

The maximum length during training was 256 subtokens.
The `punctuators` package can nonetheless punctuate inputs of any length;
behind the scenes, it splits the input into overlapping subsegments of 256 tokens and combines the results.

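The sliding-window idea can be sketched like this. The window size matches the 256-token limit above, but the helper itself and the overlap of 32 tokens are illustrative assumptions, not the package's actual implementation.

```python
def overlapping_windows(tokens, max_len=256, overlap=32):
    """Split a token sequence into overlapping windows of at most
    `max_len` tokens; consecutive windows share `overlap` tokens."""
    if overlap >= max_len:
        raise ValueError("overlap must be smaller than max_len")
    if len(tokens) <= max_len:
        return [tokens]
    stride = max_len - overlap
    windows = []
    for start in range(0, len(tokens), stride):
        windows.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break
    return windows

toks = list(range(600))
wins = overlapping_windows(toks)
print([len(w) for w in wins])  # -> [256, 256, 152] for a 600-token input
```

After each window is punctuated independently, the overlapping regions let the combiner reconcile predictions near window edges, where context is weakest.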
If you use the raw ONNX graph, note that while the model will accept sequences of up to 512 tokens, only 256 positional embeddings have been trained.

# Training Data
For all languages except Catalan, this model was trained with ~10M lines of text per language from StatMT's [News Crawl](https://data.statmt.org/news-crawl/).

Catalan is not included in StatMT's News Crawl.
For completeness of the Romance language family, ~500k lines of `OpenSubtitles` were used for Catalan.

# Metrics