1-800-BAD-CODE committed
Commit 2d25caf · 1 Parent(s): 47bbd43

Update README.md

Files changed (1): README.md (+144 −0)
---
license: apache-2.0
library_name: generic
tags:
- text2text-generation
- punctuation
- sentence-boundary-detection
- truecasing
language:
- af
- am
- ar
- bg
- bn
- de
- el
- en
- es
- et
- fa
- fi
- fr
- gu
- hi
- hr
- hu
- id
- is
- it
- ja
- kk
- kn
- ko
- ky
- lt
- lv
- mk
- ml
- mr
- nl
- or
- pa
- pl
- ps
- pt
- ro
- ru
- rw
- so
- sr
- sw
- ta
- te
- tr
- uk
- zh
---

# Model Overview
This is a fine-tuned `xlm-roberta` model that restores punctuation, true-cases (capitalizes),
and detects sentence boundaries (full stops) in 47 languages.
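
This revision of the card stops short of a usage example, so here is a minimal sketch of how the three predictions compose at inference time. The label lists are hard-coded stand-ins for the model's per-token outputs (they are not produced by any published API), and the conditioning order follows the Evaluation notes below: punctuation first, then casing and sentence boundaries.

```python
# Illustrative composition of the model's three outputs on pre-cleaned text.
# The label lists below are hypothetical stand-ins for per-token predictions.
words = ["hello", "world", "how", "are", "you"]

post_punct  = ["", ".", "", "", "?"]             # predicted "post" punctuation
upper_first = [True, False, True, False, False]  # true-casing: capitalize first char
full_stop   = [False, True, False, False, True]  # sentence-boundary (full stop) flags

# Apply casing and punctuation per token, then split on predicted full stops.
tokens = []
for word, punct, cap in zip(words, post_punct, upper_first):
    word = word[0].upper() + word[1:] if cap else word
    tokens.append(word + punct)

sentences, current = [], []
for token, boundary in zip(tokens, full_stop):
    current.append(token)
    if boundary:
        sentences.append(" ".join(current))
        current = []

print(sentences)  # ['Hello world.', 'How are you?']
```
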
## Post-Punctuation Tokens
This model predicts the following set of "post" punctuation tokens:

| Token | Description | Relevant Languages |
| ---: | :---------- | :----------- |
| \<NULL\> | No punctuation | All |
| \<ACRONYM\> | Every character in this subword is followed by a period | Primarily English, some European |
| . | Latin full stop | Many |
| , | Latin comma | Many |
| ? | Latin question mark | Many |
| ？ | Full-width question mark | Chinese, Japanese |
| ， | Full-width comma | Chinese, Japanese |
| 。 | Full-width full stop | Chinese, Japanese |
| 、 | Ideographic comma | Chinese, Japanese |
| ・ | Middle dot | Japanese |
| । | Danda | Hindi, Bengali, Oriya |
| ؟ | Arabic question mark | Arabic |
| ; | Greek question mark | Greek |
| ። | Ethiopic full stop | Amharic |
| ፣ | Ethiopic comma | Amharic |
| ፧ | Ethiopic question mark | Amharic |

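The `\<ACRONYM\>` label is the only post-token that rewrites the subword rather than appending a single character. Below is a small sketch of how a decoder might apply these labels, with hard-coded labels standing in for real predictions (not a published API):

```python
def apply_post_label(subword: str, label: str) -> str:
    """Apply one predicted post-punctuation label to one subword."""
    if label == "<NULL>":
        return subword  # no punctuation after this subword
    if label == "<ACRONYM>":
        # every character in this subword is followed by a period
        return "".join(ch + "." for ch in subword)
    return subword + label  # ordinary trailing punctuation mark

# Hypothetical per-word predictions for "i was born in the usa".
words  = "i was born in the usa".split()
labels = ["<NULL>", "<NULL>", "<NULL>", "<NULL>", "<NULL>", "<ACRONYM>"]
print(" ".join(apply_post_label(w, l) for w, l in zip(words, labels)))
# -> "i was born in the u.s.a."  (true-casing would then yield "I ... U.S.A.")
```
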
## Pre-Punctuation Tokens
This model predicts the following set of "pre" punctuation tokens:

| Token | Description | Relevant Languages |
| ---: | :---------- | :----------- |
| ¿ | Inverted question mark | Spanish |

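Pre-tokens are prepended rather than appended, so a decoder combines both streams around each subword. A tiny sketch under the same assumptions as above (hypothetical labels, not a published API):

```python
def apply_labels(subword: str, pre: str, post: str) -> str:
    """Wrap one subword in its predicted pre- and post-punctuation tokens."""
    pre = "" if pre == "<NULL>" else pre
    post = "" if post == "<NULL>" else post
    return pre + subword + post

# Hypothetical predictions for the Spanish input "como estas".
print(apply_labels("como", "¿", "<NULL>"),
      apply_labels("estas", "<NULL>", "?"))
# -> ¿como estas?  (true-casing would then yield "¿Como estas?")
```
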

# Training Details
This model was trained in the NeMo framework.

## Training Data
This model was trained with News Crawl data from WMT.

1M lines of text were used for each language, except for a few low-resource languages, which may have used less.

Languages were chosen based on whether the News Crawl corpus contained enough data of reliable quality, as judged by the author.

# Limitations
This model was trained on news data, and may not perform well on conversational or informal data.

Further, this model is unlikely to be of production quality.
It was trained with "only" 1M lines per language, and the dev sets may have been noisy due to the nature of web-scraped news data.
This is also a base-sized model with many languages and many tasks, so capacity may be limited.

# Evaluation
When interpreting these metrics, keep in mind that:

1. The data is noisy.
2. Sentence boundaries and true-casing are conditioned on predicted punctuation, which is the most difficult task and sometimes incorrect.
   When conditioning on reference punctuation, true-casing and SBD are practically 100% for most languages.
3. Punctuation can be subjective, e.g.,

   `Hola mundo, ¿cómo estás?`

   or

   `Hola mundo. ¿Cómo estás?`

   When the sentences are longer and more practical, these ambiguities abound and affect all three metrics.

## Test Data and Example Generation
Each test example was generated using the following procedure (see the sketch after this list):

1. Concatenate 10 random sentences
2. Lower-case the concatenated sentence
3. Remove all punctuation

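A minimal sketch of this procedure, assuming a list of raw sentences and using a simple Unicode-category test for "punctuation" (the exact normalization the author used is not specified in this card):

```python
import random
import unicodedata

def make_example(sentences: list[str], n: int = 10) -> str:
    """Build one test example: concatenate, lower-case, strip punctuation."""
    picked = random.sample(sentences, n)  # 1. concatenate 10 random sentences
    text = " ".join(picked).lower()       # 2. lower-case
    return "".join(                       # 3. remove all punctuation
        ch for ch in text
        if not unicodedata.category(ch).startswith("P")
    )

# Toy corpus for illustration; the real inputs are held-out News Crawl lines.
corpus = ["Hello world.", "How are you?", "¿Cómo estás?", "Fine, thanks!",
          "Good.", "See you.", "Bye.", "Ok.", "Yes.", "No."]
print(make_example(corpus))
```
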
The data is a held-out portion of News Crawl, which has been deduplicated.
3,000 lines of data per language were used, generating 3,000 unique examples of 10 sentences each.
Each example begins with a unique line; the remaining 9 sentences were randomly sampled from the 3,000 and may be duplicated.

Examples longer than the model's maximum length were truncated.
The number of affected sentences can be estimated from the "full stop" support: with 3,000 examples
and 10 sentences per example, we expect 30,000 full-stop targets in total; any shortfall in the
reported support reflects sentences lost to truncation.

## Selected Language Evaluation Reports