Commit 7978b81 by 1-800-BAD-CODE · 1 Parent(s): 7acbcf4

Update README.md

Files changed (1): README.md (+41 -0)
@@ -61,6 +61,47 @@ This is a fine-tuned `xlm-roberta` model that restores punctuation, true-cases (
  and detects sentence boundaries (full stops) in 47 languages.

+ ## Tokenizer
+
+ Instead of relying on the hacky wrapper used by FairSeq and strangely ported (not fixed) by HuggingFace, the `xlm-roberta` SentencePiece model itself was adjusted to encode
+ the text correctly. Per HF's comments,
+
+ ```python
+ # Original fairseq vocab and spm vocab must be "aligned":
+ # Vocab    |    0    |    1    |   2    |    3    |  4  |  5  |  6  |   7   |   8   |  9
+ # -------- | ------- | ------- | ------ | ------- | --- | --- | --- | ----- | ----- | ----
+ # fairseq  | '<s>'   | '<pad>' | '</s>' | '<unk>' | ',' | '.' | '▁' | 's'   | '▁de' | '-'
+ # spm      | '<unk>' | '<s>'   | '</s>' | ','     | '.' | '▁' | 's' | '▁de' | '-'   | '▁a'
+ ```
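
To make the misalignment in the table concrete, here is a minimal sketch in plain Python that uses only the ten pieces shown. The helper name `fairseq_id_from_spm`, the `SPECIAL_IDS` dict, and the `OFFSET` constant are illustrative assumptions, not library API. The sketch mimics the kind of id translation a wrapper has to perform when the SentencePiece model is left untouched.

```python
# Toy vocabularies copied from the table above (first ten entries only).
FAIRSEQ = ["<s>", "<pad>", "</s>", "<unk>", ",", ".", "▁", "s", "▁de", "-"]
SPM = ["<unk>", "<s>", "</s>", ",", ".", "▁", "s", "▁de", "-", "▁a"]

# Assumption for illustration: special tokens get fixed fairseq ids,
# and every ordinary piece is shifted by one relative to its spm id.
SPECIAL_IDS = {"<s>": 0, "<pad>": 1, "</s>": 2, "<unk>": 3}
OFFSET = 1


def fairseq_id_from_spm(piece: str) -> int:
    """Hypothetical helper: map a piece to its fairseq id via its spm id."""
    if piece in SPECIAL_IDS:
        return SPECIAL_IDS[piece]
    return SPM.index(piece) + OFFSET


# Every piece in the fairseq row maps back to its own column index.
for piece in FAIRSEQ:
    assert fairseq_id_from_spm(piece) == FAIRSEQ.index(piece)
```

Baking the fairseq layout into the SentencePiece model removes the need for this kind of translation layer at encode time.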
+
+ The SP model was un-hacked with the following snippet
+ (SentencePiece experts, let me know if there is a problem here):
+
+ ```python
+ from sentencepiece import SentencePieceProcessor
+ from sentencepiece.sentencepiece_model_pb2 import ModelProto
+
+ m = ModelProto()
+ m.ParseFromString(open("/path/to/xlmroberta/sentencepiece.bpe.model", "rb").read())
+
+ pieces = list(m.pieces)
+ # Drop the first three spm pieces (<unk>, <s>, </s>), prepend the four
+ # special tokens in fairseq order, and append <mask> at the end.
+ pieces = (
+     [
+         ModelProto.SentencePiece(piece="<s>", type=ModelProto.SentencePiece.Type.CONTROL),
+         ModelProto.SentencePiece(piece="<pad>", type=ModelProto.SentencePiece.Type.CONTROL),
+         ModelProto.SentencePiece(piece="</s>", type=ModelProto.SentencePiece.Type.CONTROL),
+         ModelProto.SentencePiece(piece="<unk>", type=ModelProto.SentencePiece.Type.UNKNOWN),
+     ]
+     + pieces[3:]
+     + [ModelProto.SentencePiece(piece="<mask>", type=ModelProto.SentencePiece.Type.USER_DEFINED)]
+ )
+ # Replace the model's piece list in place.
+ del m.pieces[:]
+ m.pieces.extend(pieces)
+
+ with open("/path/to/new/sp.model", "wb") as f:
+     f.write(m.SerializeToString())
+ ```
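
As a quick sanity check on the slicing in the snippet, here is a toy sketch that works on plain strings rather than `ModelProto` pieces. The function name `relayout` is hypothetical, and the ten-piece list is taken from the alignment table in HF's comment, not from the real vocabulary.

```python
def relayout(spm_pieces):
    """Mirror the snippet's slicing on plain strings (illustration only)."""
    return ["<s>", "<pad>", "</s>", "<unk>"] + spm_pieces[3:] + ["<mask>"]


# First ten spm pieces and the fairseq row, as listed in HF's comment.
spm = ["<unk>", "<s>", "</s>", ",", ".", "▁", "s", "▁de", "-", "▁a"]
fairseq = ["<s>", "<pad>", "</s>", "<unk>", ",", ".", "▁", "s", "▁de", "-"]

new_pieces = relayout(spm)

# The re-laid-out list starts with exactly the fairseq ordering.
assert new_pieces[:10] == fairseq
# Three pieces dropped, four prepended, one appended: net growth of two.
assert len(new_pieces) == len(spm) + 2
```

In this toy layout `<mask>` ends up at index `len(spm) + 1`, the last position of the new list.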
+

  ## Post-Punctuation Tokens
  This model predicts the following set of punctuation tokens after each subtoken: