1-800-BAD-CODE committed on
Commit 46d08de · 1 Parent(s): 7487210
Files changed (1)
  1. README.md +6 -6
README.md CHANGED
@@ -178,17 +178,17 @@ Next, the input sequence is encoded with a base-sized Transformer, consisting of
  2. **Post-punctuation**:
  The encoded sequence is then fed into a classification network to predict "post" punctuation tokens.
  Post punctuation tokens are tokens that may appear after a word, i.e., most normal punctuation.
- Post punctation is predicted once per subword - further discussion is below.
+ Post punctuation is predicted once per subword - further discussion is below.

  3. **Re-encoding**
  All subsequent tasks (true-casing, sentence boundary detection, and "pre" punctuation) are dependent on "post" punctuation.
  Therefore, we must condition all further predictions on the post punctuation tokens.
  For this task, predicted punctuation tokens are fed into an embedding layer, where embeddings represent each possible punctuation token.
  Each time step is mapped to a 4-dimensional embedding, which is concatenated to the 512-dimensional encoding.
- The concatenated joint representation is re-encoded to confer global context to each time step to incorporate puncuation predictions into subsequent tasks.
+ The concatenated joint representation is re-encoded to confer global context to each time step, incorporating punctuation predictions into subsequent tasks.

  4. **Pre-punctuation**
- After the re-encoding, another classification network predicts "pre" punctuation, or punctation tokens that may appear before a word.
+ After the re-encoding, another classification network predicts "pre" punctuation, or punctuation tokens that may appear before a word.
  In practice, this means the inverted question mark for Spanish and Asturian, `¿`.
  Note that a `¿` can only appear if a `?` is predicted, hence the conditioning.
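
To make step 3 above more concrete, here is a minimal sketch of a punctuation-conditioned re-encoder, assuming PyTorch. The class name `PunctConditionedReEncoder`, the layer count, and the attention head count are illustrative assumptions; only the 4-dimensional punctuation embeddings and 512-dimensional encodings come from the README text.

```python
import torch
import torch.nn as nn

class PunctConditionedReEncoder(nn.Module):
    """Sketch of step 3: condition later heads on predicted post-punctuation."""

    def __init__(self, num_post_tokens: int, enc_dim: int = 512, punct_dim: int = 4):
        super().__init__()
        # One small (4-dim) embedding per possible "post" punctuation token.
        self.punct_emb = nn.Embedding(num_post_tokens, punct_dim)
        # A small Transformer re-encodes the concatenation so every time step
        # sees the punctuation predicted at every other time step.
        layer = nn.TransformerEncoderLayer(
            d_model=enc_dim + punct_dim, nhead=4, batch_first=True
        )
        self.re_encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, encodings: torch.Tensor, post_ids: torch.Tensor) -> torch.Tensor:
        # encodings: [batch, seq_len, 512] subword encodings from the base encoder
        # post_ids:  [batch, seq_len]      argmax of the post-punctuation head
        punct = self.punct_emb(post_ids)               # [batch, seq_len, 4]
        joint = torch.cat([encodings, punct], dim=-1)  # [batch, seq_len, 516]
        return self.re_encoder(joint)                  # punctuation-aware re-encoding
```

In this sketch, the "pre" punctuation head of step 4 would be a classifier over this re-encoded output, which is how a `¿` prediction can be conditioned on a predicted `?`.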
 
@@ -204,7 +204,7 @@ Therefore, we shift the binary sentence boundary decisions to the right by one:
  This is concatenated with the re-encoded text, so each time step encodes whether it is the first word of a sentence, as predicted by the SBD head.

  7. **True-case prediction**
- Armed with the knowledge of punctation and sentence boundaries, a classification network predicts true-casing.
+ Armed with the knowledge of punctuation and sentence boundaries, a classification network predicts true-casing.
  Since true-casing should be done on a per-character basis, the classification network makes `N` predictions per subword token, where `N` is the length of the subword.
  (In practice, `N` is the length of the longest possible subword, and the extra predictions are ignored.)
  This scheme captures acronyms, e.g., "NATO", as well as bi-capitalized words, e.g., "MacDonald".
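
Likewise, here is a minimal sketch of the sentence-boundary shift mentioned in the hunk header and the per-character true-casing head of step 7, again assuming PyTorch; `shift_sbd_right`, `TrueCaseHead`, and `max_subword_len` are illustrative names and values, not the model's actual interface.

```python
import torch
import torch.nn as nn

def shift_sbd_right(sbd: torch.Tensor) -> torch.Tensor:
    # sbd: [batch, seq_len] binary "this token ends a sentence" decisions.
    # Shifting right by one turns them into "this token starts a sentence" flags;
    # the very first token is assumed to start a sentence.
    first = torch.ones_like(sbd[:, :1])
    return torch.cat([first, sbd[:, :-1]], dim=1)

class TrueCaseHead(nn.Module):
    """Sketch of step 7: N char-level upper/lower logits per subword token."""

    def __init__(self, enc_dim: int, max_subword_len: int = 16):
        super().__init__()
        # One binary logit per character position of each subword; positions past
        # a subword's actual length are predicted but ignored, as described above.
        self.proj = nn.Linear(enc_dim, max_subword_len)

    def forward(self, re_encoded: torch.Tensor) -> torch.Tensor:
        # re_encoded: [batch, seq_len, enc_dim] (re-encoded text plus shifted SBD flag)
        # returns:    [batch, seq_len, max_subword_len] per-character casing logits
        return self.proj(re_encoded)
```

Only the first `len(subword)` of the `max_subword_len` logits would be used for a given token; the remainder are the "extra predictions" the README says are ignored.
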
@@ -213,7 +213,7 @@ This scheme captures acronyms, e.g., "NATO", as well as bi-capitalized words, e.
  ## Post-Punctuation Tokens
  This model predicts the following set of "post" punctuation tokens:

- | Token | Description | Relavant Languages |
+ | Token | Description | Relevant Languages |
  | ---: | :---------- | :----------- |
  | . | Latin full stop | Many |
  | , | Latin comma | Many |
@@ -234,7 +234,7 @@ This model predicts the following set of "post" punctuation tokens:
  ## Pre-Punctuation Tokens
  This model predicts the following set of "pre" punctuation tokens:

- | Token | Description | Relavant Languages |
+ | Token | Description | Relevant Languages |
  | ---: | :---------- | :----------- |
  | ¿ | Inverted question mark | Spanish |
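
To make the two token tables concrete, here is a toy, plain-Python illustration of how per-word "pre" and "post" predictions would be applied to text; the `"_"` null marker and the `apply_punctuation` function are assumptions for illustration only, and true-casing is omitted.

```python
def apply_punctuation(words, pre_tokens, post_tokens, null="_"):
    # Attach each word's predicted "pre" token before it and "post" token after it.
    out = []
    for word, pre, post in zip(words, pre_tokens, post_tokens):
        pre = "" if pre == null else pre
        post = "" if post == null else post
        out.append(f"{pre}{word}{post}")
    return " ".join(out)

# "como estas" gets ¿ ... ? because the pre-token ¿ is conditioned on a predicted ?.
print(apply_punctuation(
    words=["hola", "como", "estas"],
    pre_tokens=["_", "¿", "_"],
    post_tokens=[",", "_", "?"],
))  # -> hola, ¿como estas?
```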
 
 