jed351 committed
Commit f6a57de · 1 Parent(s): 351b0c8

Update README.md

Files changed (1)
  1. README.md +6 -5
README.md CHANGED
@@ -7,16 +7,17 @@ It is simply a base model in which the embeddings and tokenizer were patched wit
 
 
 
-I used this repo to identify missing Cantonese characters
-https://github.com/ayaka14732/bert-tokenizer-cantonese
+I used this [repo](https://github.com/ayaka14732/bert-tokenizer-cantonese) to identify missing Cantonese characters
 
-My forked and modified version: https://github.com/jedcheng/bert-tokenizer-cantonese
+[My forked and modified version](https://github.com/jedcheng/bert-tokenizer-cantonese)
 
-After identifying the missing characters, the huggingface library provides very high level API to modify the tokenizer and embeddings.
+After identifying the missing characters, the Huggingface library provides very high level API to modify the tokenizer and embeddings.
 
 ```
-Download your model from the Huggingface library
+Download a tokenizer and a model from the Huggingface library. Then:
 
 tokenizer.add_tokens("your new tokens")
 model.resize_token_embeddings(len(tokenizer))
+
+tokenizer.push_to_hub("your model name")
 ```
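The workflow this README change describes (find the characters the vocabulary lacks, add them, then resize the embeddings) could be sketched roughly as below. This is a minimal illustration, not the author's script: `find_missing_chars` and `patch_vocab` are hypothetical helper names, and the character scan is a plain set lookup rather than the logic of the linked repo. The only library calls assumed are `add_tokens` and `resize_token_embeddings`, as shown in the README snippet.

```python
from typing import Iterable, List, Set


def find_missing_chars(vocab: Set[str], corpus: Iterable[str]) -> List[str]:
    """Return the non-whitespace characters in the corpus that have no vocabulary entry."""
    missing = set()
    for line in corpus:
        for ch in line:
            if not ch.isspace() and ch not in vocab:
                missing.add(ch)
    return sorted(missing)


def patch_vocab(tokenizer, model, new_tokens: List[str]) -> int:
    """Add new_tokens to the tokenizer, then grow the embedding matrix to match.

    Assumes `tokenizer` and `model` expose the two methods used in the README
    snippet (hypothetical wrapper; any objects with these methods will work).
    """
    num_added = tokenizer.add_tokens(new_tokens)
    # The embedding matrix needs one row per token, including the new ones.
    model.resize_token_embeddings(len(tokenizer))
    return num_added
```

With a real checkpoint, `vocab` would presumably come from `set(tokenizer.get_vocab())`, and the patched tokenizer could then be uploaded with `push_to_hub` as in the README. The rows added to the embedding matrix are freshly initialised, so the new characters only become useful after further training.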