1-800-BAD-CODE
/

xlm-roberta_punctuation_fullstop_truecase



1-800-BAD-CODE committed on May 8, 2023

Commit 7978b81 · 1 Parent(s): 7acbcf4

Update README.md

Files changed (1)

README.md +41 -0

@@ -61,6 +61,47 @@ This is a fine-tuned `xlm-roberta` model that restores punctuation, true-cases (
 and detects sentence boundaries (full stops) in 47 languages.
+## Tokenizer
+Instead of the hacky wrapper used by FairSeq and strangely ported (not fixed) by HuggingFace, the xlm-roberta SentencePiece model itself was adjusted
+so that it encodes text to the expected token IDs without any remapping. Per HF's comments:
+```python
+# Original fairseq vocab and spm vocab must be "aligned":
+# Vocab    |    0    |    1    |   2    |    3    |  4  |  5  |  6  |   7   |   8   |  9
+# -------- | ------- | ------- | ------ | ------- | --- | --- | --- | ----- | ----- | ----
+# fairseq  | '<s>'   | '<pad>' | '</s>' | '<unk>' | ',' | '.' | '▁' | 's'   | '▁de' | '-'
+# spm      | '<unk>' | '<s>'   | '</s>' | ','     | '.' | '▁' | 's' | '▁de' | '-'   | '▁a'
+```
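For illustration, the remapping that such a wrapper has to perform can be sketched in plain Python. This is a toy sketch, not the FairSeq or HuggingFace code: the ten-piece vocab is copied from the table above, and `wrapper_token_to_id` and the constant names are hypothetical.

```python
# Toy spm vocab: the first ten pieces from the "spm" row of the table.
SPM_PIECES = ["<unk>", "<s>", "</s>", ",", ".", "▁", "s", "▁de", "-", "▁a"]

# fairseq reserves ids 0-3 for its own special tokens.
FAIRSEQ_SPECIALS = {"<s>": 0, "<pad>": 1, "</s>": 2, "<unk>": 3}

# spm has three leading specials, fairseq has four, so regular pieces shift by one.
FAIRSEQ_OFFSET = 1


def wrapper_token_to_id(token: str) -> int:
    """Map a token to its fairseq id by patching the spm id (hypothetical helper)."""
    if token in FAIRSEQ_SPECIALS:
        return FAIRSEQ_SPECIALS[token]
    if token in SPM_PIECES:
        return SPM_PIECES.index(token) + FAIRSEQ_OFFSET
    return FAIRSEQ_SPECIALS["<unk>"]


# The "fairseq" row of the table, in id order.
fairseq_row = ["<s>", "<pad>", "</s>", "<unk>", ",", ".", "▁", "s", "▁de", "-"]
ids = [wrapper_token_to_id(t) for t in fairseq_row]
print(ids)
```

Every lookup needs the special-token table and the offset, which is the bookkeeping that editing the SP model is meant to remove.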
+The SP model was un-hacked with the following snippet
+(SentencePiece experts, let me know if there is a problem here):
+```python
+from sentencepiece import SentencePieceProcessor
+from sentencepiece.sentencepiece_model_pb2 import ModelProto
+
+# Load the original XLM-R SentencePiece model as a protobuf message.
+m = ModelProto()
+with open("/path/to/xlmroberta/sentencepiece.bpe.model", "rb") as f:
+    m.ParseFromString(f.read())
+
+# Replace spm's three leading specials (<unk>, <s>, </s>) with fairseq's four,
+# keep every other piece in its original order, and append <mask> at the end.
+pieces = list(m.pieces)
+pieces = (
+    [
+        ModelProto.SentencePiece(piece="<s>", type=ModelProto.SentencePiece.Type.CONTROL),
+        ModelProto.SentencePiece(piece="<pad>", type=ModelProto.SentencePiece.Type.CONTROL),
+        ModelProto.SentencePiece(piece="</s>", type=ModelProto.SentencePiece.Type.CONTROL),
+        ModelProto.SentencePiece(piece="<unk>", type=ModelProto.SentencePiece.Type.UNKNOWN),
+    ]
+    + pieces[3:]
+    + [ModelProto.SentencePiece(piece="<mask>", type=ModelProto.SentencePiece.Type.USER_DEFINED)]
+)
+
+# Overwrite the piece list in place and write out the modified model.
+del m.pieces[:]
+m.pieces.extend(pieces)
+with open("/path/to/new/sp.model", "wb") as f:
+    f.write(m.SerializeToString())
+```
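The effect of that list surgery can be checked on a toy vocab without loading any model. The sketch below uses plain strings in place of `ModelProto.SentencePiece` entries and the ten spm pieces from the HF comment table; it is an illustration of the reordering under those assumptions, not a test of the real model file.

```python
# Toy stand-in for list(m.pieces): the first ten spm pieces, in spm id order.
spm_pieces = ["<unk>", "<s>", "</s>", ",", ".", "▁", "s", "▁de", "-", "▁a"]

# Same surgery as the snippet above: fairseq's four specials, the original
# pieces minus spm's three specials, then <mask> at the end.
new_pieces = ["<s>", "<pad>", "</s>", "<unk>"] + spm_pieces[3:] + ["<mask>"]

# After the edit, a piece's id is simply its position in the list.
new_ids = {piece: i for i, piece in enumerate(new_pieces)}

# The "fairseq" row of the table, in id order.
fairseq_row = ["<s>", "<pad>", "</s>", "<unk>", ",", ".", "▁", "s", "▁de", "-"]
matches = [new_ids[piece] == i for i, piece in enumerate(fairseq_row)]
print(all(matches), new_ids["<mask>"])
```

With the pieces stored in this order, the ids match the fairseq row directly and no offset has to be applied at lookup time.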
 ## Post-Punctuation Tokens
 This model predicts the following set of punctuation tokens after each subtoken: