1-800-BAD-CODE
/

xlm-roberta_punctuation_fullstop_truecase

 ---
 license: apache-2.0
+library_name: generic
+tags:
+  - text2text-generation
+  - punctuation
+  - sentence-boundary-detection
+  - truecasing
+language:
+  - af
+  - am
+  - ar
+  - bg
+  - bn
+  - de
+  - el
+  - en
+  - es
+  - et
+  - fa
+  - fi
+  - fr
+  - gu
+  - hi
+  - hr
+  - hu
+  - id
+  - is
+  - it
+  - ja
+  - kk
+  - kn
+  - ko
+  - ky
+  - lt
+  - lv
+  - mk
+  - ml
+  - mr
+  - nl
+  - or
+  - pa
+  - pl
+  - ps
+  - pt
+  - ro
+  - ru
+  - rw
+  - so
+  - sr
+  - sw
+  - ta
+  - te
+  - tr
+  - uk
+  - zh
 ---
+# Model Overview
+This is a fine-tuned `xlm-roberta` model that restores punctuation, true-cases (capitalizes),
+and detects sentence boundaries (full stops) in 47 languages.
+## Post-Punctuation Tokens
+This model predicts the following set of "post" punctuation tokens:
+| Token  | Description | Relevant Languages |
+| ---: | :---------- | :----------- |
+| \<NULL\>    | No punctuation | All |
+| \<ACRONYM\>    | Every character in this subword is followed by a period | Primarily English, some European |
+| .    | Latin full stop | Many |
+| ,    | Latin comma | Many |
+| ?    | Latin question mark | Many |
+| ？    | Full-width question mark | Chinese, Japanese |
+| ，    | Full-width comma | Chinese, Japanese |
+| 。    | Full-width full stop | Chinese, Japanese |
+| 、    | Ideographic comma | Chinese, Japanese |
+| ・    | Middle dot | Japanese |
+| ।    | Danda | Hindi, Bengali, Oriya |
+| ؟    | Arabic question mark | Arabic |
+| ;    | Greek question mark | Greek |
+| ።    | Ethiopic full stop | Amharic |
+| ፣    | Ethiopic comma | Amharic |
+| ፧    | Ethiopic question mark | Amharic |
+## Pre-Punctuation Tokens
+This model predicts the following set of "pre" punctuation tokens:
+| Token  | Description | Relevant Languages |
+| ---: | :---------- | :----------- |
+| ¿    | Inverted question mark | Spanish |
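
As a rough illustration of how the pre and post labels listed above might be attached to a subword once predicted, consider the sketch below. The function name `apply_punctuation` and the use of `None` for "no pre-punctuation" are assumptions made for this example; the model's actual decoding code is not described in this card.

```python
from typing import Optional

NULL = "<NULL>"
ACRONYM = "<ACRONYM>"

def apply_punctuation(subword: str, post: str = NULL, pre: Optional[str] = None) -> str:
    """Attach predicted pre/post punctuation labels to one subword (illustrative only)."""
    if post == ACRONYM:
        # Every character in the subword is followed by a period: "us" -> "u.s."
        body = "".join(ch + "." for ch in subword)
    elif post == NULL:
        # No punctuation follows this subword.
        body = subword
    else:
        # Any other post token is appended as-is.
        body = subword + post
    # A pre token, if present, is placed in front of the subword.
    return (pre or "") + body

print(apply_punctuation("us", post=ACRONYM))  # u.s.
print(apply_punctuation("estás", post="?"))   # estás?
print(apply_punctuation("cómo", pre="¿"))     # ¿cómo
```

Treating `<ACRONYM>` as a per-character expansion follows the table's description; how casing interacts with it is left out of this sketch.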
+# Training Details
+This model was trained in the NeMo framework.
+## Training Data
+This model was trained with News Crawl data from WMT.
+1M lines of text were used for each language, except for a few low-resource languages, which may have used fewer.
+Languages were chosen based on whether the News Crawl corpus contained enough reliable-quality data as judged by the author.
+# Limitations
+This model was trained on news data, and may not perform well on conversational or informal data.
+Further, this model is unlikely to be of production quality.
+It was trained with "only" 1M lines per language, and the dev sets may have been noisy due to the nature of web-scraped news data.
+This is also a base-sized model with many languages and many tasks, so capacity may be limited.
+# Evaluation
+When reading these metrics, keep in mind that:
+1. The data is noisy
+2. Sentence boundaries and true-casing are conditioned on predicted punctuation, which is the most difficult task and is sometimes incorrect.
+   When conditioning on reference punctuation, true-casing and sentence boundary detection (SBD) are practically 100% for most languages.
+3. Punctuation can be subjective. E.g.,
+   `Hola mundo, ¿cómo estás?`
+   or
+   `Hola mundo. ¿Cómo estás?`
+   In longer, more realistic sentences, such ambiguities abound and affect all three metrics.
+## Test Data and Example Generation
+Each test example was generated using the following procedure:
+1. Concatenate 10 random sentences
+2. Lower-case the concatenated sentence
+3. Remove all punctuation
+The data is a held-out portion of News Crawl, which has been deduplicated.
+3,000 lines of data per language were used, generating 3,000 unique examples of 10 sentences each.
+The last 9 sentences of each example were randomly sampled from the 3,000 and may be duplicated.
+Examples longer than the model's maximum length were truncated.
+The number of affected sentences can be estimated from the "full stop" support: with 3,000
+examples and 10 sentences per example, we expect 30,000 full stop targets in total.
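
The procedure above can be sketched as follows. This is a minimal reconstruction under stated assumptions, not the author's script. The helper names (`strip_punctuation`, `make_example`, `estimate_truncated_sentences`) are hypothetical, "punctuation" is taken to mean any character in a Unicode `P*` category, and each example is assumed to start with its own held-out line followed by randomly drawn ones.

```python
import random
import unicodedata

def strip_punctuation(text: str) -> str:
    """Drop every character whose Unicode category starts with 'P' (assumed definition)."""
    return "".join(ch for ch in text if not unicodedata.category(ch).startswith("P"))

def make_example(lines, index, sentences_per_example=10, rng=None):
    """Return (model_input, reference) for the held-out line at `index`."""
    rng = rng or random.Random()
    # The first sentence is the example's own line; the rest are drawn at
    # random from the same pool and may repeat across examples.
    picked = [lines[index]] + [rng.choice(lines) for _ in range(sentences_per_example - 1)]
    reference = " ".join(picked)
    # Lower-case the concatenation, then remove all punctuation.
    model_input = strip_punctuation(reference.lower())
    return model_input, reference

def estimate_truncated_sentences(full_stop_support, num_examples=3000, sentences_per_example=10):
    """Expected full stop targets minus the support actually reported."""
    return num_examples * sentences_per_example - full_stop_support
```

With the defaults, `estimate_truncated_sentences` returns 0 when a report's full stop support is 30,000. A smaller support would suggest that roughly that many sentences were cut off by truncation.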
+## Selected Language Evaluation Reports