Commit `46d08de` (parent `7487210`) by 1-800-BAD-CODE: "typos"

README.md (changed)
Next, the input sequence is encoded with a base-sized Transformer.

2. **Post-punctuation**:
The encoded sequence is then fed into a classification network to predict "post" punctuation tokens.
Post-punctuation tokens are those that may appear after a word: essentially, most common punctuation.
Post-punctuation is predicted once per subword; further discussion is below.
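As a minimal sketch of this step (the head's shape and `NUM_PUNCT` are illustrative assumptions, not the model's actual code), a per-subword classification head over the 512-dimensional encodings could look like:

```python
import numpy as np

rng = np.random.default_rng(0)

T = 8            # subword time steps
ENC_DIM = 512    # encoder hidden size (from the README)
NUM_PUNCT = 10   # assumed size of the post-punctuation token set

# Encoded sequence produced by the base Transformer.
encodings = rng.standard_normal((T, ENC_DIM))

# A linear classification head: one punctuation logit vector per subword.
W = rng.standard_normal((ENC_DIM, NUM_PUNCT))
b = np.zeros(NUM_PUNCT)
logits = encodings @ W + b            # shape (T, NUM_PUNCT)
predictions = logits.argmax(axis=-1)  # one "post" token ID per subword
print(predictions.shape)  # (8,)
```

One prediction per subword means a word split into several subwords gets its punctuation decision at each piece; the README discusses this below.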
3. **Re-encoding**
All subsequent tasks (true-casing, sentence boundary detection, and "pre" punctuation) depend on the "post" punctuation.
Therefore, we must condition all further predictions on the post-punctuation tokens.
For this task, the predicted punctuation tokens are fed into an embedding layer, where embeddings represent each possible punctuation token.
Each time step is mapped to a 4-dimensional embedding, which is concatenated to the 512-dimensional encoding.
The concatenated joint representation is re-encoded to confer global context to each time step, incorporating the punctuation predictions into all subsequent tasks.
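The embed-and-concatenate operation can be sketched as follows (a shape illustration under assumed sizes, not the model's actual code; only the 4- and 512-dimensional sizes come from the README):

```python
import numpy as np

rng = np.random.default_rng(0)

T = 8            # subword time steps
ENC_DIM = 512    # encoder hidden size (from the README)
EMB_DIM = 4      # punctuation embedding size (from the README)
NUM_PUNCT = 10   # assumed size of the post-punctuation vocabulary

# Encoded sequence and the predicted post-punctuation token IDs.
encodings = rng.standard_normal((T, ENC_DIM))
punct_ids = rng.integers(0, NUM_PUNCT, size=T)

# Embedding table: one 4-dim vector per possible punctuation token.
punct_embedding = rng.standard_normal((NUM_PUNCT, EMB_DIM))

# Look up each time step's embedding and concatenate it to the encoding.
joint = np.concatenate([encodings, punct_embedding[punct_ids]], axis=-1)
print(joint.shape)  # (8, 516)
```

The resulting joint representation is what gets re-encoded, so every later head sees both the text encoding and the punctuation decision at each time step.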
4. **Pre-punctuation**
After the re-encoding, another classification network predicts "pre" punctuation, or punctuation tokens that may appear before a word.
In practice, this means the inverted question mark for Spanish and Asturian, `¿`.
Note that a `¿` can only appear if a `?` is predicted, hence the conditioning.
Therefore, we shift the binary sentence boundary decisions to the right by one.
Concatenating this with the re-encoded text, each time step contains whether it is the first word of a sentence, as predicted by the SBD head.

7. **True-case prediction**
Armed with the knowledge of punctuation and sentence boundaries, a classification network predicts true-casing.
Since true-casing should be done on a per-character basis, the classification network makes `N` predictions per token, where `N` is the length of the subword.
(In practice, `N` is the longest possible subword, and the extra predictions are ignored.)
This scheme captures acronyms, e.g., "NATO", as well as bi-capitalized words, e.g., "MacDonald".
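The per-character prediction scheme can be sketched like this (sizes and the linear head are illustrative assumptions; only "`N` predictions per token, extras ignored" comes from the README):

```python
import numpy as np

rng = np.random.default_rng(1)

T = 5         # subword time steps
N_MAX = 12    # assumed longest possible subword, in characters
HIDDEN = 516  # assumed re-encoded joint representation size (512 + 4)

hidden = rng.standard_normal((T, HIDDEN))

# A linear head producing N_MAX per-character casing logits per subword.
W = rng.standard_normal((HIDDEN, N_MAX))
char_logits = hidden @ W  # shape (T, N_MAX)

# For a subword of length L, only the first L predictions are used;
# the remaining N_MAX - L predictions are ignored.
subword_lens = [3, 1, 4, 2, 5]
used = [char_logits[i, :L] for i, L in enumerate(subword_lens)]
print([u.shape[0] for u in used])  # [3, 1, 4, 2, 5]
```

Because each character gets its own upper/lower decision, mixed-case patterns like "NATO" or "MacDonald" fall out naturally.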

## Post-Punctuation Tokens
This model predicts the following set of "post" punctuation tokens:

| Token | Description | Relevant Languages |
| ---: | :---------- | :----------- |
| . | Latin full stop | Many |
| , | Latin comma | Many |

## Pre-Punctuation Tokens
This model predicts the following set of "pre" punctuation tokens:

| Token | Description | Relevant Languages |
| ---: | :---------- | :----------- |
| ¿ | Inverted question mark | Spanish |
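To see how pre- and post-tokens combine in the output, here is a hypothetical helper (not part of the model's API) that attaches per-word predictions to text; note the `¿`/`?` pairing the conditioning enforces:

```python
def apply_punctuation(words, pre, post):
    """Attach predicted pre/post punctuation tokens to each word.
    Illustrative helper only; `pre`/`post` hold one token (or "") per word."""
    return " ".join(f"{pr}{w}{po}" for w, pr, po in zip(words, pre, post))

print(apply_punctuation(["cómo", "estás"], ["¿", ""], ["", "?"]))
# ¿cómo estás?
```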