1-800-BAD-CODE committed Update README.md
Commit 76f7f82 • Parent: 874abfd

README.md CHANGED
@@ -66,10 +66,15 @@ from punctuators.models import PunctCapSegModelONNX
 # This will download the ONNX and SPE models. To clean up, delete this model from your HF cache directory.
 m = PunctCapSegModelONNX.from_pretrained("pcs_romance")
 
-# Define some input texts to punctuate
+# Define some input texts to punctuate, at least one per language
 input_texts: List[str] = [
     "este modelo fue entrenado en un gpu a100 en realidad no se que dice esta frase lo traduje con nmt",
     "hola amigo cómo estás es un día lluvioso hoy",
+    "hola amic com va avui ha estat un dia plujós el català prediu massa puntuació per com s'ha entrenat",
+    "ciao amico come va oggi è stata una giornata piovosa",
+    "olá amigo como tá indo estava chuvoso hoje",
+    "salut l'ami comment ça va il pleuvait aujourd'hui",
+    "salut prietene cum stă treaba azi a fost ploios",
 ]
 results: List[List[str]] = m.infer(input_texts)
 for input_text, output_texts in zip(input_texts, results):
@@ -119,20 +124,20 @@ This is accomplished behind the scenes by splitting the input into overlapping s
 If you use the raw ONNX graph, note that while the model will accept sequences up to 512 tokens, only 256 positional embeddings have been trained.
 
 # Contact
-Contact me at [email protected] with requests or issues, or on the community tab.
+Contact me at [email protected] with requests or issues, or just let me know on the community tab.
 
 # Metrics
 Test sets were generated with 3,000 lines of held-out data per language (OpenSubtitles for Catalan, News Crawl for all others).
-Examples were derived by concatenating 10 sentences per example, removing all punctuation, and lower-casing all
+Examples were derived by concatenating 10 sentences per example, removing all punctuation, and lower-casing all letters.
 
-Since punctuation is
+Since punctuation is subjective (e.g., see "hello friend how's it going" in the above examples), punctuation metrics can be misleading.
 
-Also, keep in mind that the data is noisy. Catalan is especially noisy, since it's OpenSubtitles (note how Catalan has a
+Also, keep in mind that the data is noisy. Catalan is especially noisy, since it's OpenSubtitles (note how Catalan has 50 instances of "¿", which should not appear).
 
 Note that we call the label "¿" "pre-punctuation" since it is unique in that it appears before words, and thus
 we predict it separately from the other punctuation tokens.
 
-Generally,
+Generally, periods are easy, commas are harder, question marks are hard, and acronyms are rare and noisy.
 
 Expand any of the following tabs to see metrics for that language.
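The 512-token / 256-embedding note refers to inputs longer than the trained context, which the README says are handled behind the scenes by splitting the input into overlapping segments. A minimal sketch of that windowing idea in plain Python — the function name and the `max_len`/`overlap` values are assumptions for illustration, not the library's actual implementation:

```python
from typing import List

def overlapping_segments(tokens: List[str], max_len: int = 256, overlap: int = 32) -> List[List[str]]:
    """Split a long token sequence into overlapping windows of at most max_len tokens.

    Hypothetical sketch: the real library's window size and overlap may differ.
    """
    if len(tokens) <= max_len:
        return [tokens]
    step = max_len - overlap  # each new window re-covers `overlap` tokens
    segments = []
    for start in range(0, len(tokens), step):
        segments.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # last window already reaches the end of the input
    return segments
```

The overlap lets predictions near each window boundary be taken from a window where those tokens have full left or right context, instead of sitting at an edge.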
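The test-set derivation described in the Metrics section — concatenate 10 sentences per example, remove all punctuation, lower-case all letters — can be sketched as below; the exact punctuation set stripped here is an assumption:

```python
import re
from typing import List

# Assumed punctuation inventory; the actual training labels may differ.
PUNCT = re.compile(r"[.,?¿!:;]")

def make_example(sentences: List[str]) -> str:
    """Concatenate sentences, strip punctuation, and lower-case everything."""
    text = " ".join(sentences)
    return re.sub(r"\s+", " ", PUNCT.sub("", text)).lower().strip()

def make_test_set(sentences: List[str], per_example: int = 10) -> List[str]:
    """Group held-out sentences into unpunctuated, lower-cased test examples."""
    return [
        make_example(sentences[i:i + per_example])
        for i in range(0, len(sentences), per_example)
    ]
```

The model's targets for each example are then exactly the punctuation, casing, and sentence boundaries that were removed.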
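Because "¿" is predicted before a word while every other label attaches after one, stitching predictions back into text has to merge two label streams. A toy illustration of that merge — the label format here is made up for clarity and is not the model's real output:

```python
from typing import List

def apply_punctuation(tokens: List[str], pre_labels: List[str], post_labels: List[str]) -> str:
    """Attach pre-punctuation ('¿' or '') before each token and ordinary punctuation after it.

    Hypothetical sketch: assumes one pre-label and one post-label per token.
    """
    out = []
    for tok, pre, post in zip(tokens, pre_labels, post_labels):
        out.append(f"{pre}{tok}{post}")
    return " ".join(out)
```

Keeping the pre-punctuation stream separate is what lets a single per-token prediction head stay "everything after the token" for all other labels.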