	---
	license: apache-2.0
	language:
	- es
	- ca
	- fr
	- pt
	- it
	- ro
	library_name: generic
	tags:
	- text2text-generation
	- punctuation
	- fullstop
	- truecase
	- capitalization
	widget:
	- text: "hola amigo cómo estás es un día lluvioso hoy"
	- text: "este modelo fue entrenado en un gpu a100 en realidad no se que dice esta frase lo traduje con nmt"
	---

	# Model
	This model restores punctuation, predicts full stops (sentence boundaries), and predicts true-casing (capitalization)
	for text in the six most widely spoken Romance languages:

	* Spanish
	* French
	* Portuguese
	* Catalan
	* Italian
	* Romanian

	Together, these languages cover approximately 97% of native speakers of the Romance language family.

	The model comprises a SentencePiece tokenizer, a Transformer encoder, and MLP prediction heads.

	This model predicts the following punctuation per input subtoken:

	* .
	* ,
	* ?
	* ¿
	* ACRONYM

	Though rare in these languages (relative to English), the special token `ACRONYM` allows fully punctuating tokens such as "`pm`" → "`p.m.`".
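
As a rough illustration of what the `ACRONYM` label means for a single token, the sketch below appends a period after every character. The function name `apply_acronym` is hypothetical and is not part of `punctuators`; the package's actual post-processing may differ.

```python
def apply_acronym(token: str) -> str:
    """Insert a period after each character of a token labeled ACRONYM."""
    return "".join(f"{ch}." for ch in token)


print(apply_acronym("pm"))  # p.m.
```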

	Widget notes: If you use the widget, it will take a minute to load the model because a "generic" library is used.
	Further, the widget does not respect multi-line output, so fullstop predictions are annotated with "\n".

	# Usage
	The model is released as a `SentencePiece` tokenizer and an `ONNX` graph.

	The easy way to use this model is to install [punctuators](https://github.com/1-800-BAD-CODE/punctuators):

	```bash
	pip install punctuators
	```

	If this package is broken, please let me know in the community tab (I update it for each model and break it a lot!).

	<details open>

	<summary>Example Usage</summary>

	```python
	from typing import List

	from punctuators.models import PunctCapSegModelONNX

	# Instantiate this model
	# This will download the ONNX and SPE models. To clean up, delete this model from your HF cache directory.
	m = PunctCapSegModelONNX.from_pretrained("pcs_romance")

	# Define some input texts to punctuate, at least one per language
	input_texts: List[str] = [
	    "este modelo fue entrenado en un gpu a100 en realidad no se que dice esta frase lo traduje con nmt",
	    "hola amigo cómo estás es un día lluvioso hoy",
	    "hola amic com va avui ha estat un dia plujós el català prediu massa puntuació per com s'ha entrenat",
	    "ciao amico come va oggi è stata una giornata piovosa",
	    "olá amigo como tá indo estava chuvoso hoje",
	    "salut l'ami comment ça va il pleuvait aujourd'hui",
	    "salut prietene cum stă treaba azi a fost ploios",
	]
	# Each input yields a list of output sentences
	results: List[List[str]] = m.infer(input_texts)
	for input_text, output_texts in zip(input_texts, results):
	    print(f"Input: {input_text}")
	    print("Outputs:")
	    for text in output_texts:
	        print(f"\t{text}")
	    print()

	```

	Exact output may vary based on the model version; here is the current output:

	</details>

	<details open>

	<summary>Expected Output</summary>

	```text
	Input: este modelo fue entrenado en un gpu a100 en realidad no se que dice esta frase lo traduje con nmt
	Outputs:
		Este modelo fue entrenado en un GPU A100.
		En realidad, no se que dice esta frase lo traduje con NMT.

	Input: hola amigo cómo estás es un día lluvioso hoy
	Outputs:
		Hola, amigo.
		¿Cómo estás?
		Es un día lluvioso hoy.

	Input: hola amic com va avui ha estat un dia plujós el català prediu massa puntuació per com s'ha entrenat
	Outputs:
		Hola, amic.
		Com va avui?
		Ha estat un dia plujós.
		El català prediu massa puntuació per com s'ha entrenat.

	Input: ciao amico come va oggi è stata una giornata piovosa
	Outputs:
		Ciao amico, come va?
		Oggi è stata una giornata piovosa.

	Input: olá amigo como tá indo estava chuvoso hoje
	Outputs:
		Olá, amigo, como tá indo?
		Estava chuvoso hoje.

	Input: salut l'ami comment ça va il pleuvait aujourd'hui
	Outputs:
		Salut l'ami.
		Comment ça va?
		Il pleuvait aujourd'hui.

	Input: salut prietene cum stă treaba azi a fost ploios
	Outputs:
		Salut prietene, cum stă treaba azi?
		A fost ploios.
	```

	</details>

	If you prefer your output to not be broken into separate sentences, you can disable sentence boundary detection
	in the API call:

	```python
	input_texts: List[str] = [
	    "hola amigo cómo estás es un día lluvioso hoy",
	]
	results: List[str] = m.infer(input_texts, apply_sbd=False)
	print(results[0])
	```

	Instead of a `List[List[str]]` (a list of output sentences for each input), we get a `List[str]` (one output
	string per input, with all of its sentences kept together):

	```text
	Hola, amigo. ¿Cómo estás? Es un día lluvioso hoy.
	```
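
If you already have the sentence-split output and only later want one string per input, joining the sentences is enough. This is a minimal sketch in plain Python, assuming a single space is an acceptable separator; `split_results` is hand-copied from the expected output above rather than produced by calling the model.

```python
from typing import List

# Sentence-split output for one input, copied from the expected output above.
split_results: List[List[str]] = [
    ["Hola, amigo.", "¿Cómo estás?", "Es un día lluvioso hoy."],
]

# Collapse each list of sentences into a single string per input.
joined: List[str] = [" ".join(sentences) for sentences in split_results]
print(joined[0])
```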


	# Training Data
	For all languages except Catalan, this model was trained with ~10M lines of text per language from StatMT's [News Crawl](https://data.statmt.org/news-crawl/).

	Catalan is not included in StatMT's News Crawl.
	For completeness of the Romance language family, ~500k lines of `OpenSubtitles` were used for Catalan.
	As a result, Catalan performance may be sub-par, and the model may over-predict punctuation and sentence breaks, which is typical of OpenSubtitles data.

	# Training Parameters
	This model was trained by concatenating between 1 and 14 random sentences.
	The concatenation points became sentence boundary targets,
	text was lower-cased to produce true-case targets,
	and punctuation was removed to create punctuation targets.
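
The target construction described above can be sketched as follows. This is an illustrative, word-level simplification, not the actual training code: the real model works on subtokens, and the function name `make_example`, the per-character casing targets, and the omission of `ACRONYM` handling are all assumptions made for the sketch.

```python
from typing import List, Tuple

# Post-token punctuation labels handled by this sketch; ACRONYM is left out for brevity.
POST_PUNCT = {".", ",", "?"}


def make_example(sentences: List[str]) -> Tuple[List[str], List[str], List[List[int]], List[int]]:
    """Turn punctuated, cased sentences into model inputs and per-word targets."""
    words: List[str] = []
    punct_targets: List[str] = []
    case_targets: List[List[int]] = []
    boundary_targets: List[int] = []
    for sentence in sentences:
        tokens = sentence.split()
        for i, token in enumerate(tokens):
            # Trailing punctuation becomes this word's punctuation target.
            label = "<NULL>"
            if token[-1] in POST_PUNCT:
                label, token = token[-1], token[:-1]
            # "¿" is a separate pre-punctuation target in the model; this sketch just drops it.
            token = token.lstrip("¿")
            # One casing target per character: 1 for upper-case, 0 otherwise.
            case_targets.append([int(ch.isupper()) for ch in token])
            # The model input is the lower-cased, unpunctuated word.
            words.append(token.lower())
            punct_targets.append(label)
            # The last word of each sentence is a sentence-boundary target.
            boundary_targets.append(int(i == len(tokens) - 1))
    return words, punct_targets, case_targets, boundary_targets


words, punct, case, boundaries = make_example(["Hola, amigo.", "¿Cómo estás?"])
print(words)
print(punct)
print(boundaries)
```

In training, the list passed to such a function would hold between 1 and 14 randomly chosen sentences from a single language.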

	Batches were built by randomly sampling from each language.
	Each example is language-homogeneous (i.e., we only concatenate sentences from the same language).
	Batches were multilingual. Neither language tags nor language-specific paths are used in the graph.

	The maximum length during training was 256 subtokens.
	The `punctuators` package can punctuate inputs of any length.
	This is accomplished behind the scenes by splitting the input into overlapping subsegments of 256 tokens, and combining the results.
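
The splitting step can be pictured as a sliding window over the token IDs. The sketch below is not the `punctuators` implementation: the function name `split_with_overlap` and the overlap of 16 tokens are assumptions chosen for illustration, and the step that merges predictions from overlapping regions is not shown.

```python
from typing import List


def split_with_overlap(token_ids: List[int], max_len: int = 256, overlap: int = 16) -> List[List[int]]:
    """Split a token sequence into windows of at most max_len tokens that share `overlap` tokens."""
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must be non-negative and smaller than max_len")
    stride = max_len - overlap
    segments: List[List[int]] = []
    start = 0
    while True:
        segments.append(token_ids[start:start + max_len])
        # Stop once the current window reaches the end of the input.
        if start + max_len >= len(token_ids):
            break
        start += stride
    return segments


segments = split_with_overlap(list(range(600)))
print([len(s) for s in segments])
```

Each window after the first starts `overlap` tokens before the previous one ended, so every token near a window edge also appears with more context in a neighboring window.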

	If you use the raw ONNX graph, note that while the model will accept sequences up to 512 tokens, only 256 positional embeddings have been trained.

	# Contact
	Contact me at [email protected] with requests or issues, or just let me know on the community tab.

	# Metrics
	Test sets were generated with 3,000 lines of held-out data per language (OpenSubtitles for Catalan, News Crawl for all others).
	Examples were derived by concatenating 10 sentences per example, removing all punctuation, and lower-casing all letters.

	Since punctuation is subjective (e.g., see "hello friend how's it going" in the above examples), punctuation metrics can be misleading.

	Also, keep in mind that the data is noisy. Catalan is especially noisy, since it comes from OpenSubtitles (note how Catalan has 50 instances of "¿", which should not appear).

	Note that we call the label "¿" "pre-punctuation" since it is unique in that it appears before words, and thus
	we predict it separately from the other punctuation tokens.

	Generally, periods are easy, commas are harder, question marks are hard, and acronyms are rare and noisy.
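
When reading the reports, it can help to recall how the columns relate. The sketch below recomputes an F1 score and a macro average from the Spanish pre-punctuation figures using the standard definitions. The helper names are hypothetical, and small differences from the reported values are expected because the inputs are already rounded to two decimals.

```python
from typing import List


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_avg(values: List[float]) -> float:
    """Unweighted mean over classes."""
    return sum(values) / len(values)


# Precision and recall for "¿" in the Spanish pre-punctuation report.
inv_q_f1 = f1_score(81.93, 60.46)
# Macro-averaged precision over the <NULL> and "¿" classes.
macro_precision = macro_avg([99.92, 81.93])

print(f"{inv_q_f1:.2f}")
print(f"{macro_precision:.2f}")
```

Because the macro average weights every class equally, a rare and hard class such as "¿" pulls it well below the micro and weighted averages, which are dominated by the `<NULL>` class.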

	Expand any of the following tabs to see metrics for that language.


	<details>

	<summary>Spanish metrics</summary>

	```text
	Pre-punctuation report:
	label                    precision    recall        f1   support
	<NULL> (label_id: 0)         99.92     99.97     99.95    572069
	¿ (label_id: 1)              81.93     60.46     69.57      1095
	-------------------
	micro avg                    99.90     99.90     99.90    573164
	macro avg                    90.93     80.22     84.76    573164
	weighted avg                 99.89     99.90     99.89    573164

	Punctuation report:
	label                    precision    recall        f1   support
	<NULL> (label_id: 0)         98.70     98.44     98.57    517310
	<ACRONYM> (label_id: 1)      39.68     86.21     54.35        58
	. (label_id: 2)              87.72     90.41     89.04     29267
	, (label_id: 3)              73.17     74.68     73.92     25422
	? (label_id: 4)              69.49     59.26     63.97      1107
	-------------------
	micro avg                    96.90     96.90     96.90    573164
	macro avg                    73.75     81.80     75.97    573164
	weighted avg                 96.94     96.90     96.92    573164

	True-casing report:
	label                    precision    recall        f1   support
	LOWER (label_id: 0)          99.85     99.73     99.79   2164982
	UPPER (label_id: 1)          92.01     95.32     93.64     69437
	-------------------
	micro avg                    99.60     99.60     99.60   2234419
	macro avg                    95.93     97.53     96.71   2234419
	weighted avg                 99.61     99.60     99.60   2234419

	Fullstop report:
	label                    precision    recall        f1   support
	NOSTOP (label_id: 0)        100.00     99.98     99.99    543228
	FULLSTOP (label_id: 1)       99.66     99.93     99.80     32931
	-------------------
	micro avg                    99.98     99.98     99.98    576159
	macro avg                    99.83     99.96     99.89    576159
	weighted avg                 99.98     99.98     99.98    576159
	```

	</details>


	<details>

	<summary>Portuguese metrics</summary>

	```text
	Pre-punctuation report:
	label                    precision    recall        f1   support
	<NULL> (label_id: 0)        100.00    100.00    100.00    539822
	¿ (label_id: 1)               0.00      0.00      0.00         0
	-------------------
	micro avg                   100.00    100.00    100.00    539822
	macro avg                   100.00    100.00    100.00    539822
	weighted avg                100.00    100.00    100.00    539822

	Punctuation report:
	label                    precision    recall        f1   support
	<NULL> (label_id: 0)         98.77     98.27     98.52    481148
	<ACRONYM> (label_id: 1)       0.00      0.00      0.00         0
	. (label_id: 2)              87.63     90.63     89.11     29090
	, (label_id: 3)              74.44     78.69     76.50     28549
	? (label_id: 4)              66.30     52.27     58.45      1035
	-------------------
	micro avg                    96.74     96.74     96.74    539822
	macro avg                    81.79     79.96     80.65    539822
	weighted avg                 96.82     96.74     96.77    539822

	True-casing report:
	label                    precision    recall        f1   support
	LOWER (label_id: 0)          99.90     99.82     99.86   2082598
	UPPER (label_id: 1)          94.75     97.08     95.90     70555
	-------------------
	micro avg                    99.73     99.73     99.73   2153153
	macro avg                    97.32     98.45     97.88   2153153
	weighted avg                 99.73     99.73     99.73   2153153

	Fullstop report:
	label                    precision    recall        f1   support
	NOSTOP (label_id: 0)        100.00     99.98     99.99    509905
	FULLSTOP (label_id: 1)       99.72     99.98     99.85     32909
	-------------------
	micro avg                    99.98     99.98     99.98    542814
	macro avg                    99.86     99.98     99.92    542814
	weighted avg                 99.98     99.98     99.98    542814
	```

	</details>


	<details>

	<summary>Romanian metrics</summary>

	```text
	Pre-punctuation report:
	label                    precision    recall        f1   support
	<NULL> (label_id: 0)        100.00    100.00    100.00    580702
	¿ (label_id: 1)               0.00      0.00      0.00         0
	-------------------
	micro avg                   100.00    100.00    100.00    580702
	macro avg                   100.00    100.00    100.00    580702
	weighted avg                100.00    100.00    100.00    580702

	Punctuation report:
	label                    precision    recall        f1   support
	<NULL> (label_id: 0)         98.56     98.47     98.51    520647
	<ACRONYM> (label_id: 1)      52.00     79.89     63.00       179
	. (label_id: 2)              87.29     89.37     88.32     29852
	, (label_id: 3)              75.26     74.69     74.97     29218
	? (label_id: 4)              60.73     55.46     57.98       806
	-------------------
	micro avg                    96.74     96.74     96.74    580702
	macro avg                    74.77     79.57     76.56    580702
	weighted avg                 96.74     96.74     96.74    580702

	Truecasing report:
	label                    precision    recall        f1   support
	LOWER (label_id: 0)          99.84     99.75     99.79   2047297
	UPPER (label_id: 1)          93.56     95.65     94.59     77424
	-------------------
	micro avg                    99.60     99.60     99.60   2124721
	macro avg                    96.70     97.70     97.19   2124721
	weighted avg                 99.61     99.60     99.60   2124721

	Fullstop report:
	label                    precision    recall        f1   support
	NOSTOP (label_id: 0)        100.00     99.96     99.98    550858
	FULLSTOP (label_id: 1)       99.26     99.94     99.60     32833
	-------------------
	micro avg                    99.95     99.95     99.95    583691
	macro avg                    99.63     99.95     99.79    583691
	weighted avg                 99.96     99.95     99.96    583691
	```
	</details>

	<details>

	<summary>Italian metrics</summary>

	```text
	Pre-punctuation report:
	label                    precision    recall        f1   support
	<NULL> (label_id: 0)        100.00    100.00    100.00    577636
	¿ (label_id: 1)               0.00      0.00      0.00         0
	-------------------
	micro avg                   100.00    100.00    100.00    577636
	macro avg                   100.00    100.00    100.00    577636
	weighted avg                100.00    100.00    100.00    577636

	Punctuation report:
	label                    precision    recall        f1   support
	<NULL> (label_id: 0)         98.10     97.73     97.91    522727
	<ACRONYM> (label_id: 1)      41.76     48.72     44.97        78
	. (label_id: 2)              81.71     86.70     84.13     28881
	, (label_id: 3)              61.72     63.24     62.47     24703
	? (label_id: 4)              62.55     41.78     50.10      1247
	-------------------
	micro avg                    95.58     95.58     95.58    577636
	macro avg                    69.17     67.63     67.92    577636
	weighted avg                 95.64     95.58     95.60    577636

	Truecasing report:
	label                    precision    recall        f1   support
	LOWER (label_id: 0)          99.76     99.70     99.73   2160781
	UPPER (label_id: 1)          91.18     92.76     91.96     72471
	-------------------
	micro avg                    99.47     99.47     99.47   2233252
	macro avg                    95.47     96.23     95.85   2233252
	weighted avg                 99.48     99.47     99.48   2233252

	Fullstop report:
	label                    precision    recall        f1   support
	NOSTOP (label_id: 0)         99.99     99.98     99.99    547875
	FULLSTOP (label_id: 1)       99.72     99.91     99.82     32742
	-------------------
	micro avg                    99.98     99.98     99.98    580617
	macro avg                    99.86     99.95     99.90    580617
	weighted avg                 99.98     99.98     99.98    580617
	```
	</details>

	<details>

	<summary>French metrics</summary>

	```text
	Pre-punctuation report:
	label                    precision    recall        f1   support
	<NULL> (label_id: 0)        100.00    100.00    100.00    614010
	¿ (label_id: 1)               0.00      0.00      0.00         0
	-------------------
	micro avg                   100.00    100.00    100.00    614010
	macro avg                   100.00    100.00    100.00    614010
	weighted avg                100.00    100.00    100.00    614010

	Punctuation report:
	label                    precision    recall        f1   support
	<NULL> (label_id: 0)         98.72     98.57     98.65    556366
	<ACRONYM> (label_id: 1)      38.46     71.43     50.00        49
	. (label_id: 2)              86.41     88.56     87.47     28969
	, (label_id: 3)              72.15     72.80     72.47     27183
	? (label_id: 4)              75.81     67.78     71.57      1443
	-------------------
	micro avg                    96.88     96.88     96.88    614010
	macro avg                    74.31     79.83     76.03    614010
	weighted avg                 96.91     96.88     96.89    614010

	Truecasing report:
	label                    precision    recall        f1   support
	LOWER (label_id: 0)          99.84     99.80     99.82   2127174
	UPPER (label_id: 1)          93.72     94.73     94.22     66496
	-------------------
	micro avg                    99.65     99.65     99.65   2193670
	macro avg                    96.78     97.27     97.02   2193670
	weighted avg                 99.65     99.65     99.65   2193670

	Fullstop report:
	label                    precision    recall        f1   support
	NOSTOP (label_id: 0)         99.99     99.94     99.97    584331
	FULLSTOP (label_id: 1)       98.92     99.90     99.41     32661
	-------------------
	micro avg                    99.94     99.94     99.94    616992
	macro avg                    99.46     99.92     99.69    616992
	weighted avg                 99.94     99.94     99.94    616992
	```
	</details>

	<details>

	<summary>Catalan metrics</summary>

	```text
	Pre-punctuation report:
	label                    precision    recall        f1   support
	<NULL> (label_id: 0)         99.97    100.00     99.98    143817
	¿ (label_id: 1)               0.00      0.00      0.00        50
	-------------------
	micro avg                    99.97     99.97     99.97    143867
	macro avg                    49.98     50.00     49.99    143867
	weighted avg                 99.93     99.97     99.95    143867

	Punctuation report:
	label                    precision    recall        f1   support
	<NULL> (label_id: 0)         97.61     97.73     97.67    119040
	<ACRONYM> (label_id: 1)       0.00      0.00      0.00        28
	. (label_id: 2)              74.02     79.46     76.65     15282
	, (label_id: 3)              60.88     50.75     55.36      5836
	? (label_id: 4)              64.94     60.28     62.52      3681
	-------------------
	micro avg                    92.90     92.90     92.90    143867
	macro avg                    59.49     57.64     58.44    143867
	weighted avg                 92.76     92.90     92.80    143867

	Truecasing report:
	label                    precision    recall        f1   support
	LOWER (label_id: 0)          99.81     99.83     99.82    422395
	UPPER (label_id: 1)          97.09     96.81     96.95     24854
	-------------------
	micro avg                    99.66     99.66     99.66    447249
	macro avg                    98.45     98.32     98.39    447249
	weighted avg                 99.66     99.66     99.66    447249

	Fullstop report:
	label                    precision    recall        f1   support
	NOSTOP (label_id: 0)         99.93     99.63     99.78    123867
	FULLSTOP (label_id: 1)       97.97     99.59     98.77     22000
	-------------------
	micro avg                    99.63     99.63     99.63    145867
	macro avg                    98.95     99.61     99.28    145867
	weighted avg                 99.63     99.63     99.63    145867
	```
	</details>