1-800-BAD-CODE committed · commit 7978b81 · parent 7acbcf4
Update README.md

README.md (changed)
This is a fine-tuned `xlm-roberta` model that restores punctuation, true-cases, and detects sentence boundaries (full stops) in 47 languages.

## Tokenizer

Instead of the hacky wrapper used by fairseq and strangely ported (not fixed) by HuggingFace, the `xlm-roberta` SentencePiece model was adjusted to correctly encode the text. Per HF's comments:

```python
# Original fairseq vocab and spm vocab must be "aligned":
# Vocab    |    0    |    1    |   2    |    3    |  4  |  5  |  6  |   7   |   8   |  9
# -------- | ------- | ------- | ------ | ------- | --- | --- | --- | ----- | ----- | ----
# fairseq  | '<s>'   | '<pad>' | '</s>' | '<unk>' | ',' | '.' | '▁' | 's'   | '▁de' | '-'
# spm      | '<unk>' | '<s>'   | '</s>' | ','     | '.' | '▁' | 's' | '▁de' | '-'   | '▁a'
```
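
The table above is the whole hack: the wrapper keeps the raw spm model and remaps ids at runtime, pinning the four fairseq specials at ids 0-3 and shifting every ordinary SentencePiece id up by one. A rough sketch of that remapping (illustrative only; the names below are mine, not the actual `transformers` code):

```python
# Illustrative sketch of the runtime remapping the HF wrapper performs.
# These names are placeholders, not the transformers implementation.
FAIRSEQ_SPECIALS = {"<s>": 0, "<pad>": 1, "</s>": 2, "<unk>": 3}
FAIRSEQ_OFFSET = 1  # ordinary spm pieces land one id higher in the model's vocab


def token_to_model_id(sp, token: str) -> int:
    """Map a token to the id the fairseq-trained checkpoint expects."""
    if token in FAIRSEQ_SPECIALS:
        return FAIRSEQ_SPECIALS[token]
    # e.g. ',' is spm id 3 but model id 4, '.' is 4 -> 5, and so on
    return sp.piece_to_id(token) + FAIRSEQ_OFFSET
```

Rewriting the `.model` file itself bakes that alignment into the tokenizer, so no wrapper remapping is needed.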

The SP model was un-hacked with the following snippet (SentencePiece experts, let me know if there is a problem here):

```python
from sentencepiece import SentencePieceProcessor
from sentencepiece.sentencepiece_model_pb2 import ModelProto

m = ModelProto()
m.ParseFromString(open("/path/to/xlmroberta/sentencepiece.bpe.model", "rb").read())

pieces = list(m.pieces)
pieces = (
    # Pin the fairseq special tokens at ids 0-3, in fairseq order
    [
        ModelProto.SentencePiece(piece="<s>", type=ModelProto.SentencePiece.Type.CONTROL),
        ModelProto.SentencePiece(piece="<pad>", type=ModelProto.SentencePiece.Type.CONTROL),
        ModelProto.SentencePiece(piece="</s>", type=ModelProto.SentencePiece.Type.CONTROL),
        ModelProto.SentencePiece(piece="<unk>", type=ModelProto.SentencePiece.Type.UNKNOWN),
    ]
    # Drop spm's original '<unk>', '<s>', '</s>' (ids 0-2); ordinary pieces shift up by one
    + pieces[3:]
    # Append the '<mask>' token at the end of the vocabulary
    + [ModelProto.SentencePiece(piece="<mask>", type=ModelProto.SentencePiece.Type.USER_DEFINED)]
)
del m.pieces[:]
m.pieces.extend(pieces)

with open("/path/to/new/sp.model", "wb") as f:
    f.write(m.SerializeToString())
```
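
As a quick sanity check (a minimal sketch, assuming the rewritten model was saved to the path used above), the new `sp.model` should hand back the fairseq-aligned ids directly, with no offset or special-token table:

```python
from sentencepiece import SentencePieceProcessor

# Load the rewritten model and confirm it matches the fairseq row of the table above.
sp = SentencePieceProcessor(model_file="/path/to/new/sp.model")

assert sp.piece_to_id("<s>") == 0
assert sp.piece_to_id("<pad>") == 1
assert sp.piece_to_id("</s>") == 2
assert sp.piece_to_id("<unk>") == 3
assert sp.piece_to_id(",") == 4
assert sp.piece_to_id(".") == 5

# The first ten pieces should read: <s>, <pad>, </s>, <unk>, ',', '.', '▁', 's', '▁de', '-'
print([sp.id_to_piece(i) for i in range(10)])
```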
## Post-Punctuation Tokens
This model predicts the following set of punctuation tokens after each subtoken: