---
license: apache-2.0
language:
- es
- ca
- fr
- pt
- it
- ro
library_name: generic
tags:
- text2text-generation
- punctuation
- fullstop
- truecase
- capitalization
widget:
  - text: "hola amigo cómo estás es un día lluvioso hoy"
  - text: "este modelo fue entrenado en un gpu a100 en realidad no se que dice esta frase lo traduje con nmt"
---

# Model
This model restores punctuation, predicts full stops (sentence boundaries), and predicts true-casing (capitalization) 
for text in the 6 most popular Romance languages:

* Spanish
* French
* Portuguese
* Catalan
* Italian
* Romanian

Together, these languages cover approximately 97% of native speakers of the Romance language family.

The model comprises a SentencePiece tokenizer, a Transformer encoder, and MLP prediction heads.

This model predicts the following punctuation per input subtoken:

* .
* ,
* ?
* ¿
* ACRONYM

Though rare in these languages (relative to English), the special token `ACRONYM` allows fully punctuating tokens such as "`pm`" → "`p.m.`".
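As a rough illustration of what the `ACRONYM` label means for a single token, the sketch below places a period after every character. The function name `apply_acronym` is hypothetical and is not part of the released model or the `punctuators` package; the actual post-processing may differ.

```python
def apply_acronym(token: str) -> str:
    """Fully punctuate a token by placing a period after every character.

    Hypothetical helper for illustration only.
    """
    return "".join(f"{ch}." for ch in token)


print(apply_acronym("pm"))  # p.m.
```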

**Widget notes** If you use the widget, it'll take a minute to load the model since a "generic" library is used.
Further, the widget does not respect multi-line output, so fullstop predictions are annotated with "\n".

# Usage
The model is released as a `SentencePiece` tokenizer and an `ONNX` graph.

The easy way to use this model is to install [punctuators](https://github.com/1-800-BAD-CODE/punctuators):

```bash
pip install punctuators
```

If this package is broken, please let me know in the community tab (I update it for each model and break it a lot!).

<details open>

  <summary>Example Usage</summary>

```python
from typing import List

from punctuators.models import PunctCapSegModelONNX

# Instantiate this model
# This will download the ONNX and SPE models. To clean up, delete this model from your HF cache directory.
m = PunctCapSegModelONNX.from_pretrained("pcs_romance")

# Define some input texts to punctuate, at least one per language
input_texts: List[str] = [
    "este modelo fue entrenado en un gpu a100 en realidad no se que dice esta frase lo traduje con nmt",
    "hola amigo cómo estás es un día lluvioso hoy",
    "hola amic com va avui ha estat un dia plujós el català prediu massa puntuació per com s'ha entrenat",
    "ciao amico come va oggi è stata una giornata piovosa",
    "olá amigo como tá indo estava chuvoso hoje",
    "salut l'ami comment ça va il pleuvait aujourd'hui",
    "salut prietene cum stă treaba azi a fost ploios",
]
results: List[List[str]] = m.infer(input_texts)
for input_text, output_texts in zip(input_texts, results):
    print(f"Input: {input_text}")
    print("Outputs:")
    for text in output_texts:
        print(f"\t{text}")
    print()

```

Exact output may vary based on the model version; here is the current output: 

</details>

<details open>

  <summary>Expected Output</summary>

```text
Input: este modelo fue entrenado en un gpu a100 en realidad no se que dice esta frase lo traduje con nmt
Outputs:
	Este modelo fue entrenado en un GPU A100.
	En realidad, no se que dice esta frase lo traduje con NMT.

Input: hola amigo cómo estás es un día lluvioso hoy
Outputs:
	Hola, amigo.
	¿Cómo estás?
	Es un día lluvioso hoy.

Input: hola amic com va avui ha estat un dia plujós el català prediu massa puntuació per com s'ha entrenat
Outputs:
	Hola, amic.
	Com va avui?
	Ha estat un dia plujós.
	El català prediu massa puntuació per com s'ha entrenat.

Input: ciao amico come va oggi è stata una giornata piovosa
Outputs:
	Ciao amico, come va?
	Oggi è stata una giornata piovosa.

Input: olá amigo como tá indo estava chuvoso hoje
Outputs:
	Olá, amigo, como tá indo?
	Estava chuvoso hoje.

Input: salut l'ami comment ça va il pleuvait aujourd'hui
Outputs:
	Salut l'ami.
	Comment ça va?
	Il pleuvait aujourd'hui.

Input: salut prietene cum stă treaba azi a fost ploios
Outputs:
	Salut prietene, cum stă treaba azi?
	A fost ploios.
```

</details>

If you prefer your output to not be broken into separate sentences, you can disable sentence boundary detection
in the API call:

```python
input_texts: List[str] = [
    "hola amigo cómo estás es un día lluvioso hoy",
]
results: List[str] = m.infer(input_texts, apply_sbd=False)
print(results[0])
```

Instead of a `List[List[str]]` (a list of output sentences for each input), we get a `List[str]` (one punctuated
string per input, with all of its sentences kept together):

```text
Hola, amigo. ¿Cómo estás? Es un día lluvioso hoy.
```


# Training Data
For all languages except Catalan, this model was trained with ~10M lines of text per language from StatMT's [News Crawl](https://data.statmt.org/news-crawl/).

Catalan is not included in StatMT's News Crawl. 
For completeness of the Romance language family, ~500k lines of `OpenSubtitles` were used for Catalan.
As a result, Catalan performance may be sub-par, and the model may over-predict punctuation and sentence breaks, which is typical of models trained on OpenSubtitles.

# Training Parameters
This model was trained by concatenating between 1 and 14 random sentences. 
The concatenation points became sentence boundary targets, 
text was lower-cased to produce true-case targets,
and punctuation was removed to create punctuation targets.
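The target construction described above could be sketched roughly as follows. This is a simplified, word-level approximation under stated assumptions: the real training pipeline works on subtokens, also produces `ACRONYM` targets, and is not shown in this card. The names `Example`, `sample_sentences`, and `build_example` are hypothetical.

```python
import random
from typing import List, NamedTuple

POST_PUNCT = {".", ",", "?"}  # punctuation predicted after a word


class Example(NamedTuple):
    text: str               # model input: lower-cased, punctuation removed
    upper_mask: List[bool]  # per character of `text`: was it upper-case?
    punct: List[str]        # per word: trailing punctuation, "" for none
    pre_punct: List[bool]   # per word: was it preceded by "¿"?
    fullstop: List[bool]    # per word: is it the last word of a sentence?


def sample_sentences(corpus: List[str], rng: random.Random,
                     max_sentences: int = 14) -> List[str]:
    """Pick between 1 and `max_sentences` random sentences from one language."""
    n = rng.randint(1, min(max_sentences, len(corpus)))
    return rng.sample(corpus, n)


def build_example(sentences: List[str]) -> Example:
    """Concatenate sentences and derive all targets from the original text."""
    words, upper_mask, punct, pre_punct, fullstop = [], [], [], [], []
    for sentence in sentences:
        first_word = len(words)
        for token in sentence.split():
            has_pre = token.startswith("¿")
            core = token.lstrip("¿")
            post = core[-1] if core and core[-1] in POST_PUNCT else ""
            if post:
                core = core[:-1]
            if not core:
                continue  # token was punctuation only
            if words:
                upper_mask.append(False)  # the space between words
            upper_mask.extend(ch.isupper() for ch in core)
            words.append(core.lower())
            punct.append(post)
            pre_punct.append(has_pre)
            fullstop.append(False)
        if len(words) > first_word:
            fullstop[-1] = True  # concatenation point = sentence boundary
    return Example(" ".join(words), upper_mask, punct, pre_punct, fullstop)


example = build_example(["Hola, amigo.", "¿Cómo estás?"])
print(example.text)  # hola amigo cómo estás
```

Deriving every target from the same original text keeps the input and the labels aligned by construction, which is why no manual annotation is needed.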

Batches were built by randomly sampling from each language. 
Each example is language-homogeneous (i.e., we only concatenate sentences from the same language).
Batches were multilingual. Neither language tags nor language-specific paths are utilized in the graph.

The maximum length during training was 256 subtokens. 
The `punctuators` package can punctuate inputs of any length.
This is accomplished behind the scenes by splitting the input into overlapping subsegments of 256 tokens, and combining the results.
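One way such overlapping segmentation could work is sketched below. This is not the `punctuators` implementation: the overlap size of 16, the function names, and the merge rule (keep each token's prediction from the window where it sits farthest from a window edge) are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_with_overlap(num_tokens: int, max_len: int = 256,
                       overlap: int = 16) -> List[Tuple[int, int]]:
    """Return [start, stop) spans of at most `max_len` tokens that overlap."""
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")
    if num_tokens <= 0:
        return []
    spans = []
    stride = max_len - overlap
    start = 0
    while True:
        stop = min(start + max_len, num_tokens)
        spans.append((start, stop))
        if stop >= num_tokens:
            return spans
        start += stride


def merge_predictions(spans: Sequence[Tuple[int, int]],
                      window_preds: Sequence[Sequence[T]],
                      num_tokens: int) -> List[T]:
    """Combine per-window predictions into one prediction per token.

    For tokens covered by two windows, keep the prediction from the window
    in which the token is farthest from an edge (it has the most context).
    """
    merged: List[T] = [None] * num_tokens  # type: ignore[list-item]
    best_margin = [-1] * num_tokens
    for (start, stop), preds in zip(spans, window_preds):
        last = stop - start - 1
        for offset, pred in enumerate(preds):
            margin = min(offset, last - offset)
            pos = start + offset
            if margin > best_margin[pos]:
                best_margin[pos] = margin
                merged[pos] = pred
    return merged


print(split_with_overlap(600))  # [(0, 256), (240, 496), (480, 600)]
```

Each span would be run through the model independently, and `merge_predictions` would then stitch the per-token outputs back into a single sequence.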

If you use the raw ONNX graph, note that while the model will accept sequences up to 512 tokens, only 256 positional embeddings have been trained.

# Contact
Contact me at [email protected] with requests or issues, or just let me know on the community tab.

# Metrics
Test sets were generated with 3,000 lines of held-out data per language (OpenSubtitles for Catalan, News Crawl for all others).
Examples were derived by concatenating 10 sentences per example, removing all punctuation, and lower-casing all letters.

Since punctuation is subjective (e.g., see "hello friend how's it going" in the above examples) punctuation metrics can be misleading.

Also, keep in mind that the data is noisy. Catalan is especially noisy, since it comes from OpenSubtitles (note how the Catalan test set has 50 instances of "¿", which should not appear).

Note that we call the label "¿" "pre-punctuation" since it is unique in that it appears before words, and thus
we predict it separately from the other punctuation tokens.

Generally, periods are easy, commas are harder, question marks are hard, and acronyms are rare and noisy.
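When reading the reports, it may help to recall how the summary rows relate to the per-label rows. The sketch below assumes the standard definitions (F1 as the harmonic mean of precision and recall, macro average as the unweighted mean over labels, weighted average as the support-weighted mean) and plugs in the Spanish pre-punctuation rows. Because the inputs are already rounded to two decimals, the results only approximately match the reported values.

```python
from typing import Dict, Tuple


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, support) copied from the Spanish pre-punctuation report
rows: Dict[str, Tuple[float, float, int]] = {
    "<NULL>": (99.92, 99.97, 572069),
    "¿": (81.93, 60.46, 1095),
}

total_support = sum(support for _, _, support in rows.values())

# Macro average: every label counts equally, so the rare "¿" pulls it down.
macro_precision = sum(p for p, _, _ in rows.values()) / len(rows)

# Weighted average: dominated by the very frequent <NULL> label.
weighted_precision = sum(p * s for p, _, s in rows.values()) / total_support

print(f"F1 for '¿':         {f1_score(81.93, 60.46):.2f}")
print(f"macro precision:    {macro_precision:.2f}")
print(f"weighted precision: {weighted_precision:.2f}")
```

The gap between the macro and weighted averages shows why the rare labels (`¿`, `?`, `ACRONYM`) are better judged from their own rows than from the summary rows.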

Expand any of the following tabs to see metrics for that language.


<details>

  <summary>Spanish metrics</summary>

```text
Pre-punctuation report: 
    label                                                precision    recall       f1           support   
    <NULL> (label_id: 0)                                    99.92      99.97      99.95     572069
    ¿ (label_id: 1)                                         81.93      60.46      69.57       1095
    -------------------
    micro avg                                               99.90      99.90      99.90     573164
    macro avg                                               90.93      80.22      84.76     573164
    weighted avg                                            99.89      99.90      99.89     573164
    
Punctuation report:
    label                                                precision    recall       f1           support   
    <NULL> (label_id: 0)                                    98.70      98.44      98.57     517310
    <ACRONYM> (label_id: 1)                                 39.68      86.21      54.35         58
    . (label_id: 2)                                         87.72      90.41      89.04      29267
    , (label_id: 3)                                         73.17      74.68      73.92      25422
    ? (label_id: 4)                                         69.49      59.26      63.97       1107
    -------------------
    micro avg                                               96.90      96.90      96.90     573164
    macro avg                                               73.75      81.80      75.97     573164
    weighted avg                                            96.94      96.90      96.92     573164
    
True-casing report:
    label                                                precision    recall       f1           support   
    LOWER (label_id: 0)                                     99.85      99.73      99.79    2164982
    UPPER (label_id: 1)                                     92.01      95.32      93.64      69437
    -------------------
    micro avg                                               99.60      99.60      99.60    2234419
    macro avg                                               95.93      97.53      96.71    2234419
    weighted avg                                            99.61      99.60      99.60    2234419

Fullstop report:
    label                                                precision    recall       f1           support   
    NOSTOP (label_id: 0)                                   100.00      99.98      99.99     543228
    FULLSTOP (label_id: 1)                                  99.66      99.93      99.80      32931
    -------------------
    micro avg                                               99.98      99.98      99.98     576159
    macro avg                                               99.83      99.96      99.89     576159
    weighted avg                                            99.98      99.98      99.98     576159
```

</details>


<details>

  <summary>Portuguese metrics</summary>

```text
Pre-punctuation report:
    label                                                precision    recall       f1           support   
    <NULL> (label_id: 0)                                   100.00     100.00     100.00     539822
    ¿ (label_id: 1)                                          0.00       0.00       0.00          0
    -------------------
    micro avg                                              100.00     100.00     100.00     539822
    macro avg                                              100.00     100.00     100.00     539822
    weighted avg                                           100.00     100.00     100.00     539822

Punctuation report:
    label                                                precision    recall       f1           support   
    <NULL> (label_id: 0)                                    98.77      98.27      98.52     481148
    <ACRONYM> (label_id: 1)                                  0.00       0.00       0.00          0
    . (label_id: 2)                                         87.63      90.63      89.11      29090
    , (label_id: 3)                                         74.44      78.69      76.50      28549
    ? (label_id: 4)                                         66.30      52.27      58.45       1035
    -------------------
    micro avg                                               96.74      96.74      96.74     539822
    macro avg                                               81.79      79.96      80.65     539822
    weighted avg                                            96.82      96.74      96.77     539822

True-casing report:
    label                                                precision    recall       f1           support   
    LOWER (label_id: 0)                                     99.90      99.82      99.86    2082598
    UPPER (label_id: 1)                                     94.75      97.08      95.90      70555
    -------------------
    micro avg                                               99.73      99.73      99.73    2153153
    macro avg                                               97.32      98.45      97.88    2153153
    weighted avg                                            99.73      99.73      99.73    2153153

Fullstop report:
    label                                                precision    recall       f1           support   
    NOSTOP (label_id: 0)                                   100.00      99.98      99.99     509905
    FULLSTOP (label_id: 1)                                  99.72      99.98      99.85      32909
    -------------------
    micro avg                                               99.98      99.98      99.98     542814
    macro avg                                               99.86      99.98      99.92     542814
    weighted avg                                            99.98      99.98      99.98     542814

```

</details>


<details>

  <summary>Romanian metrics</summary>

```text
Pre-punctuation report:
    label                                                precision    recall       f1           support   
    <NULL> (label_id: 0)                                   100.00     100.00     100.00     580702
    ¿ (label_id: 1)                                          0.00       0.00       0.00          0
    -------------------
    micro avg                                              100.00     100.00     100.00     580702
    macro avg                                              100.00     100.00     100.00     580702
    weighted avg                                           100.00     100.00     100.00     580702

Punctuation report:
    label                                                precision    recall       f1           support   
    <NULL> (label_id: 0)                                    98.56      98.47      98.51     520647
    <ACRONYM> (label_id: 1)                                 52.00      79.89      63.00        179
    . (label_id: 2)                                         87.29      89.37      88.32      29852
    , (label_id: 3)                                         75.26      74.69      74.97      29218
    ? (label_id: 4)                                         60.73      55.46      57.98        806
    -------------------
    micro avg                                               96.74      96.74      96.74     580702
    macro avg                                               74.77      79.57      76.56     580702
    weighted avg                                            96.74      96.74      96.74     580702

Truecasing report:
    label                                                precision    recall       f1           support   
    LOWER (label_id: 0)                                     99.84      99.75      99.79    2047297
    UPPER (label_id: 1)                                     93.56      95.65      94.59      77424
    -------------------
    micro avg                                               99.60      99.60      99.60    2124721
    macro avg                                               96.70      97.70      97.19    2124721
    weighted avg                                            99.61      99.60      99.60    2124721

Fullstop report:
    label                                                precision    recall       f1           support   
    NOSTOP (label_id: 0)                                   100.00      99.96      99.98     550858
    FULLSTOP (label_id: 1)                                  99.26      99.94      99.60      32833
    -------------------
    micro avg                                               99.95      99.95      99.95     583691
    macro avg                                               99.63      99.95      99.79     583691
    weighted avg                                            99.96      99.95      99.96     583691

```
</details>

<details>

  <summary>Italian metrics</summary>

```text
Pre-punctuation report:
    label                                                precision    recall       f1           support   
    <NULL> (label_id: 0)                                   100.00     100.00     100.00     577636
    ¿ (label_id: 1)                                          0.00       0.00       0.00          0
    -------------------
    micro avg                                              100.00     100.00     100.00     577636
    macro avg                                              100.00     100.00     100.00     577636
    weighted avg                                           100.00     100.00     100.00     577636

Punctuation report: 
    label                                                precision    recall       f1           support   
    <NULL> (label_id: 0)                                    98.10      97.73      97.91     522727
    <ACRONYM> (label_id: 1)                                 41.76      48.72      44.97         78
    . (label_id: 2)                                         81.71      86.70      84.13      28881
    , (label_id: 3)                                         61.72      63.24      62.47      24703
    ? (label_id: 4)                                         62.55      41.78      50.10       1247
    -------------------
    micro avg                                               95.58      95.58      95.58     577636
    macro avg                                               69.17      67.63      67.92     577636
    weighted avg                                            95.64      95.58      95.60     577636

Truecasing report:
    label                                                precision    recall       f1           support   
    LOWER (label_id: 0)                                     99.76      99.70      99.73    2160781
    UPPER (label_id: 1)                                     91.18      92.76      91.96      72471
    -------------------
    micro avg                                               99.47      99.47      99.47    2233252
    macro avg                                               95.47      96.23      95.85    2233252
    weighted avg                                            99.48      99.47      99.48    2233252

Fullstop report:
    label                                                precision    recall       f1           support   
    NOSTOP (label_id: 0)                                    99.99      99.98      99.99     547875
    FULLSTOP (label_id: 1)                                  99.72      99.91      99.82      32742
    -------------------
    micro avg                                               99.98      99.98      99.98     580617
    macro avg                                               99.86      99.95      99.90     580617
    weighted avg                                            99.98      99.98      99.98     580617
```
</details>

<details>

  <summary>French metrics</summary>

```text
Pre-punctuation report:
    label                                                precision    recall       f1           support   
    <NULL> (label_id: 0)                                   100.00     100.00     100.00     614010
    ¿ (label_id: 1)                                          0.00       0.00       0.00          0
    -------------------
    micro avg                                              100.00     100.00     100.00     614010
    macro avg                                              100.00     100.00     100.00     614010
    weighted avg                                           100.00     100.00     100.00     614010

Punctuation report:
    label                                                precision    recall       f1           support   
    <NULL> (label_id: 0)                                    98.72      98.57      98.65     556366
    <ACRONYM> (label_id: 1)                                 38.46      71.43      50.00         49
    . (label_id: 2)                                         86.41      88.56      87.47      28969
    , (label_id: 3)                                         72.15      72.80      72.47      27183
    ? (label_id: 4)                                         75.81      67.78      71.57       1443
    -------------------
    micro avg                                               96.88      96.88      96.88     614010
    macro avg                                               74.31      79.83      76.03     614010
    weighted avg                                            96.91      96.88      96.89     614010

Truecasing report:
    label                                                precision    recall       f1           support   
    LOWER (label_id: 0)                                     99.84      99.80      99.82    2127174
    UPPER (label_id: 1)                                     93.72      94.73      94.22      66496
    -------------------
    micro avg                                               99.65      99.65      99.65    2193670
    macro avg                                               96.78      97.27      97.02    2193670
    weighted avg                                            99.65      99.65      99.65    2193670

Fullstop report:
    label                                                precision    recall       f1           support   
    NOSTOP (label_id: 0)                                    99.99      99.94      99.97     584331
    FULLSTOP (label_id: 1)                                  98.92      99.90      99.41      32661
    -------------------
    micro avg                                               99.94      99.94      99.94     616992
    macro avg                                               99.46      99.92      99.69     616992
    weighted avg                                            99.94      99.94      99.94     616992

```
</details>

<details>

  <summary>Catalan metrics</summary>

```text
Pre-punctuation report:
    label                                                precision    recall       f1           support   
    <NULL> (label_id: 0)                                    99.97     100.00      99.98     143817
    ¿ (label_id: 1)                                          0.00       0.00       0.00         50
    -------------------
    micro avg                                               99.97      99.97      99.97     143867
    macro avg                                               49.98      50.00      49.99     143867
    weighted avg                                            99.93      99.97      99.95     143867

Punctuation report:
    label                                                precision    recall       f1           support   
    <NULL> (label_id: 0)                                    97.61      97.73      97.67     119040
    <ACRONYM> (label_id: 1)                                  0.00       0.00       0.00         28
    . (label_id: 2)                                         74.02      79.46      76.65      15282
    , (label_id: 3)                                         60.88      50.75      55.36       5836
    ? (label_id: 4)                                         64.94      60.28      62.52       3681
    -------------------
    micro avg                                               92.90      92.90      92.90     143867
    macro avg                                               59.49      57.64      58.44     143867
    weighted avg                                            92.76      92.90      92.80     143867

Truecasing report:
    label                                                precision    recall       f1           support   
    LOWER (label_id: 0)                                     99.81      99.83      99.82     422395
    UPPER (label_id: 1)                                     97.09      96.81      96.95      24854
    -------------------
    micro avg                                               99.66      99.66      99.66     447249
    macro avg                                               98.45      98.32      98.39     447249
    weighted avg                                            99.66      99.66      99.66     447249

Fullstop report:
    label                                                precision    recall       f1           support   
    NOSTOP (label_id: 0)                                    99.93      99.63      99.78     123867
    FULLSTOP (label_id: 1)                                  97.97      99.59      98.77      22000
    -------------------
    micro avg                                               99.63      99.63      99.63     145867
    macro avg                                               98.95      99.61      99.28     145867
    weighted avg                                            99.63      99.63      99.63     145867

```
</details>