|
---
license: apache-2.0
library_name: generic
tags:
- text2text-generation
- punctuation
- sentence-boundary-detection
- truecasing
language:
- af
- am
- ar
- bg
- bn
- de
- el
- en
- es
- et
- fa
- fi
- fr
- gu
- hi
- hr
- hu
- id
- is
- it
- ja
- kk
- kn
- ko
- ky
- lt
- lv
- mk
- ml
- mr
- nl
- or
- pa
- pl
- ps
- pt
- ro
- ru
- rw
- so
- sr
- sw
- ta
- te
- tr
- uk
- zh
---
|
|
|
# Model Overview |
|
This is a fine-tuned `xlm-roberta` model that restores punctuation, true-cases (capitalizes), and detects sentence boundaries (full stops) in 47 languages.
|
|
|
|
|
|
|
## Post-Punctuation Tokens |
|
This model predicts the following set of punctuation tokens after each subword:
|
|
|
| Token | Description | Relevant Languages |
| ---: | :---------- | :----------- |
| \<NULL\> | No punctuation | All |
| \<ACRONYM\> | Every character in this subword is followed by a period (e.g., `us` → `U.S.`) | Primarily English, some European |
| . | Latin full stop | Many |
| , | Latin comma | Many |
| ? | Latin question mark | Many |
| ？ | Full-width question mark | Chinese, Japanese |
| ， | Full-width comma | Chinese, Japanese |
| 。 | Full-width full stop | Chinese, Japanese |
| 、 | Ideographic comma | Chinese, Japanese |
| ・ | Middle dot | Japanese |
| । | Danda | Hindi, Bengali, Oriya |
| ؟ | Arabic question mark | Arabic |
| ; | Greek question mark | Greek |
| ። | Ethiopic full stop | Amharic |
| ፣ | Ethiopic comma | Amharic |
| ፧ | Ethiopic question mark | Amharic |
|
|
|
|
|
## Pre-Punctuation Tokens |
|
This model predicts the following set of punctuation tokens before each subword: |
|
|
|
| Token | Description | Relevant Languages |
| ---: | :---------- | :----------- |
| \<NULL\> | No punctuation | All |
| ¿ | Inverted question mark | Spanish |
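
To make the two label inventories concrete, here is a minimal sketch of how per-subword predictions might be applied to reconstruct punctuated, true-cased, segmented text. It is an illustration only: the function name and input layout are assumptions (and whole words stand in for subwords, for readability), not the model's actual inference API.

```python
from typing import List

NULL = "<NULL>"
ACRONYM = "<ACRONYM>"

def apply_predictions(
    subwords: List[str],          # token pieces (whole words here, for readability)
    pre_labels: List[str],        # pre-punctuation label per subword (table above)
    post_labels: List[str],       # post-punctuation label per subword
    cap_masks: List[List[bool]],  # per-character true-casing decisions
    full_stops: List[bool],       # sentence boundary predicted after this subword?
) -> List[str]:
    sentences, current = [], []
    for sw, pre, post, caps, stop in zip(subwords, pre_labels, post_labels,
                                         cap_masks, full_stops):
        # True-casing: upper-case the characters the casing head selected.
        chars = [c.upper() if up else c for c, up in zip(sw, caps)]
        if post == ACRONYM:
            # <ACRONYM>: every character is followed by a period, e.g. "us" -> "U.S."
            word = "".join(f"{c}." for c in chars)
        else:
            word = "".join(chars) + ("" if post == NULL else post)
        if pre != NULL:
            word = pre + word  # e.g. prepend "¿" for Spanish questions
        current.append(word)
        if stop:
            # Full-stop head: close the sentence at this subword.
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

print(apply_predictions(
    subwords=["hola", "mundo", "cómo", "estás"],
    pre_labels=[NULL, NULL, "¿", NULL],
    post_labels=[NULL, ".", NULL, "?"],
    cap_masks=[[True, False, False, False], [False] * 5,
               [True, False, False, False], [False] * 5],
    full_stops=[False, True, False, True],
))
# -> ['Hola mundo.', '¿Cómo estás?']
```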
|
|
|
|
|
|
|
# Training Details |
|
This model was trained in the NeMo framework. |
|
|
|
## Training Data |
|
This model was trained with News Crawl data from WMT. |
|
|
|
1M lines of text per language were used, except for a few low-resource languages, for which fewer lines may have been used.
|
|
|
Languages were chosen based on whether the News Crawl corpus contained enough data of reliable quality, as judged by the author.
|
|
|
# Limitations |
|
This model was trained on news data, and may not perform well on conversational or informal data. |
|
|
|
Further, this model is unlikely to be of production quality.
It was trained with "only" 1M lines per language, and the dev sets may have been noisy due to the nature of web-scraped news data.
This is also a base-sized model with many languages and many tasks, so capacity may be limited.
|
|
|
|
|
# Evaluation |
|
In these metrics, keep in mind that

1. The data is noisy.
2. Sentence boundaries and true-casing are conditioned on predicted punctuation, which is the most difficult task and is sometimes incorrect.
   When conditioning on reference punctuation, true-casing and SBD are practically 100% for most languages.
3. Punctuation can be subjective. E.g.,
|
|
|
`Hola mundo, ¿cómo estás?` |
|
|
|
or |
|
|
|
`Hola mundo. ¿Cómo estás?` |
|
|
|
When the sentences are longer and more realistic, these ambiguities abound and affect all three metrics.
|
|
|
## Test Data and Example Generation |
|
Each test example was generated using the following procedure: |
|
|
|
1. Concatenate 10 random sentences
2. Lower-case the concatenated sentence
3. Remove all punctuation
|
|
|
The data is a held-out portion of News Crawl, which has been deduplicated.
3,000 lines of data per language were used, generating 3,000 unique examples of 10 sentences each.
The last 9 sentences of each example were randomly sampled from the 3,000 and may be duplicated.
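
A minimal sketch of this procedure follows, assuming a small illustrative sentence pool; the punctuation-stripping step is simplified to Latin marks plus the Spanish inverted marks, whereas the real procedure must cover all of the target punctuation tabulated above.

```python
import random
import string

def make_example(pool, i, sents_per_example=10):
    # Example i begins with sentence i; the remaining sentences are sampled
    # with replacement, so they may repeat across (or within) examples.
    sentences = [pool[i]] + random.choices(pool, k=sents_per_example - 1)
    text = " ".join(sentences)  # 1. concatenate the sentences
    text = text.lower()         # 2. lower-case the concatenation
    # 3. remove punctuation (simplified to Latin marks plus inverted marks)
    return text.translate(str.maketrans("", "", string.punctuation + "¿¡"))

pool = ["Hola mundo.", "¿Cómo estás?", "Muy bien."]
print(make_example(pool, 0, sents_per_example=3))
# e.g. -> "hola mundo muy bien cómo estás" (the sampled tail varies per run)
```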
|
|
|
Examples longer than the model's maximum length were truncated.
The number of affected sentences can be estimated from the "full stop" support: with 3,000 examples of 10 sentences each, we expect 30,000 full-stop targets in total, so any shortfall in the reported support indicates truncation.
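
For example, a hypothetical helper could turn a reported support figure into a rough count of truncated boundaries, assuming each surviving sentence contributes exactly one full-stop target (the numbers below are made up):

```python
def truncated_boundaries(full_stop_support, n_examples=3000, sents_per_example=10):
    # Expected full-stop targets minus observed support = boundaries lost.
    return n_examples * sents_per_example - full_stop_support

print(truncated_boundaries(29_200))  # hypothetical support -> 800 lost targets
```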
|
|
|
## Selected Language Evaluation Reports |
|
|
|
|
|
|