Commit `46d08de` (parent `7487210`) by 1-800-BAD-CODE: "typos"

README.md (changed)
Next, the input sequence is encoded with a base-sized Transformer.

2. **Post-punctuation**:
The encoded sequence is then fed into a classification network to predict "post" punctuation tokens.
Post-punctuation tokens are those that may appear after a word: essentially, most common punctuation.
Post-punctuation is predicted once per subword; further discussion is below.
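As a minimal sketch of this step (the head's shape and `NUM_PUNCT` are illustrative assumptions, not the model's actual code), a per-subword classification head over the 512-dimensional encodings could look like:

```python
import numpy as np

rng = np.random.default_rng(0)

T = 8            # subword time steps
ENC_DIM = 512    # encoder hidden size (from the README)
NUM_PUNCT = 10   # assumed size of the post-punctuation token set

# Encoded sequence produced by the base Transformer.
encodings = rng.standard_normal((T, ENC_DIM))

# A linear classification head: one punctuation logit vector per subword.
W = rng.standard_normal((ENC_DIM, NUM_PUNCT))
b = np.zeros(NUM_PUNCT)
logits = encodings @ W + b            # shape (T, NUM_PUNCT)
predictions = logits.argmax(axis=-1)  # one "post" token ID per subword
print(predictions.shape)  # (8,)
```

One prediction per subword means a word split into several subwords gets its punctuation decision at each piece; the README discusses this below.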
3. **Re-encoding**
All subsequent tasks (true-casing, sentence boundary detection, and "pre" punctuation) depend on the "post" punctuation.
Therefore, we must condition all further predictions on the post-punctuation tokens.
For this task, the predicted punctuation tokens are fed into an embedding layer, where embeddings represent each possible punctuation token.
Each time step is mapped to a 4-dimensional embedding, which is concatenated to the 512-dimensional encoding.
The concatenated joint representation is re-encoded to confer global context to each time step, incorporating the punctuation predictions into all subsequent tasks.
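The embed-and-concatenate operation can be sketched as follows (a shape illustration under assumed sizes, not the model's actual code; only the 4- and 512-dimensional sizes come from the README):

```python
import numpy as np

rng = np.random.default_rng(0)

T = 8            # subword time steps
ENC_DIM = 512    # encoder hidden size (from the README)
EMB_DIM = 4      # punctuation embedding size (from the README)
NUM_PUNCT = 10   # assumed size of the post-punctuation vocabulary

# Encoded sequence and the predicted post-punctuation token IDs.
encodings = rng.standard_normal((T, ENC_DIM))
punct_ids = rng.integers(0, NUM_PUNCT, size=T)

# Embedding table: one 4-dim vector per possible punctuation token.
punct_embedding = rng.standard_normal((NUM_PUNCT, EMB_DIM))

# Look up each time step's embedding and concatenate it to the encoding.
joint = np.concatenate([encodings, punct_embedding[punct_ids]], axis=-1)
print(joint.shape)  # (8, 516)
```

The resulting joint representation is what gets re-encoded, so every later head sees both the text encoding and the punctuation decision at each time step.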
4. **Pre-punctuation**
After the re-encoding, another classification network predicts "pre" punctuation, or punctuation tokens that may appear before a word.
In practice, this means the inverted question mark for Spanish and Asturian, `¿`.
Note that a `¿` can only appear if a `?` is predicted, hence the conditioning.
Therefore, we shift the binary sentence boundary decisions to the right by one.
Concatenating this with the re-encoded text, each time step contains whether it is the first word of a sentence, as predicted by the SBD head.

7. **True-case prediction**
Armed with the knowledge of punctuation and sentence boundaries, a classification network predicts true-casing.
Since true-casing should be done on a per-character basis, the classification network makes `N` predictions per token, where `N` is the length of the subword.
(In practice, `N` is the longest possible subword, and the extra predictions are ignored.)
This scheme captures acronyms, e.g., "NATO", as well as bi-capitalized words, e.g., "MacDonald".
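The per-character prediction scheme can be sketched like this (sizes and the linear head are illustrative assumptions; only "`N` predictions per token, extras ignored" comes from the README):

```python
import numpy as np

rng = np.random.default_rng(1)

T = 5         # subword time steps
N_MAX = 12    # assumed longest possible subword, in characters
HIDDEN = 516  # assumed re-encoded joint representation size (512 + 4)

hidden = rng.standard_normal((T, HIDDEN))

# A linear head producing N_MAX per-character casing logits per subword.
W = rng.standard_normal((HIDDEN, N_MAX))
char_logits = hidden @ W  # shape (T, N_MAX)

# For a subword of length L, only the first L predictions are used;
# the remaining N_MAX - L predictions are ignored.
subword_lens = [3, 1, 4, 2, 5]
used = [char_logits[i, :L] for i, L in enumerate(subword_lens)]
print([u.shape[0] for u in used])  # [3, 1, 4, 2, 5]
```

Because each character gets its own upper/lower decision, mixed-case patterns like "NATO" or "MacDonald" fall out naturally.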

## Post-Punctuation Tokens
This model predicts the following set of "post" punctuation tokens:

| Token | Description | Relevant Languages |
| ---: | :---------- | :----------- |
| . | Latin full stop | Many |
| , | Latin comma | Many |

## Pre-Punctuation Tokens
This model predicts the following set of "pre" punctuation tokens:

| Token | Description | Relevant Languages |
| ---: | :---------- | :----------- |
| ¿ | Inverted question mark | Spanish |
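To see how pre- and post-tokens combine in the output, here is a hypothetical helper (not part of the model's API) that attaches per-word predictions to text; note the `¿`/`?` pairing the conditioning enforces:

```python
def apply_punctuation(words, pre, post):
    """Attach predicted pre/post punctuation tokens to each word.
    Illustrative helper only; `pre`/`post` hold one token (or "") per word."""
    return " ".join(f"{pr}{w}{po}" for w, pr, po in zip(words, pre, post))

print(apply_punctuation(["cómo", "estás"], ["¿", ""], ["", "?"]))
# ¿cómo estás?
```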