jed351 committed on
Commit 8f88c67 · 1 Parent(s): 3f5f000

Update README.md

  1. README.md +13 -1
README.md CHANGED
@@ -5,4 +5,16 @@ It is simply a base model in which the embeddings and tokenizer were patched wit
  I used this repo to identify missing Cantonese characters
  https://github.com/ayaka14732/bert-tokenizer-cantonese
 
- My forked and modified version: https://github.com/jedcheng/bert-tokenizer-cantonese
+ My forked and modified version: https://github.com/jedcheng/bert-tokenizer-cantonese
+
+
+ Once the missing characters are identified, the Hugging Face library provides a very high-level API to modify the tokenizer and embeddings.
+
+ """
+ Download your model from the Hugging Face library
+
+ tokenizer.add_tokens("your new tokens")
+ model.resize_token_embeddings(len(tokenizer))
+
+
+ """
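
The two steps described in the added README text (find the missing characters, then patch the tokenizer and embeddings) can be sketched roughly as follows. This is only an illustration, not the method used in either linked repo: `find_missing_chars`, `patch_tokenizer_and_model`, and the toy vocabulary and corpus are hypothetical names and data introduced here. The only library calls assumed are `tokenizer.add_tokens(...)` and `model.resize_token_embeddings(...)`, the same two calls the README quotes.

```python
from typing import Iterable, List, Set


def find_missing_chars(vocab: Set[str], corpus: Iterable[str]) -> List[str]:
    """Return the non-whitespace characters in the corpus that are absent from the vocab."""
    seen = set()
    for line in corpus:
        for ch in line:
            if not ch.isspace():
                seen.add(ch)
    return sorted(seen - vocab)


def patch_tokenizer_and_model(tokenizer, model, new_tokens: List[str]) -> int:
    """Add tokens to the tokenizer, then resize the model's embeddings to match.

    `tokenizer` and `model` are assumed to be a Hugging Face tokenizer and
    model that were loaded elsewhere; nothing is downloaded here.
    """
    num_added = tokenizer.add_tokens(new_tokens)  # number of tokens actually added
    model.resize_token_embeddings(len(tokenizer))  # grow the embedding matrix
    return num_added


# Hypothetical toy data, standing in for a real vocabulary and Cantonese corpus.
toy_vocab = {"我", "你", "好", "[UNK]"}
toy_corpus = ["你好", "我哋好"]
missing = find_missing_chars(toy_vocab, toy_corpus)
print(missing)
```

With a real tokenizer and model loaded from the Hugging Face library, the resulting list would be passed as `patch_tokenizer_and_model(tokenizer, model, missing)`. The embedding rows created for the new tokens are freshly initialized, so they generally need further training before they carry useful meaning.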