# gpt2-tiny-zh-hk

This model has not been trained on any Cantonese material.

It is simply a base model in which the embeddings and tokenizer were patched with Cantonese characters.

I used this repo to identify the missing Cantonese characters: https://github.com/ayaka14732/bert-tokenizer-cantonese

My forked and modified version: https://github.com/jedcheng/bert-tokenizer-cantonese
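
The repos above do the actual identification work; as a rough illustration of the idea only (not their code), one could compare the characters in a Cantonese text sample against the tokenizer's vocabulary. The checkpoint name and file path below are placeholders, not the ones used for this model:

```python
from transformers import AutoTokenizer

# Placeholder checkpoint: substitute the tokenizer you want to patch
tokenizer = AutoTokenizer.from_pretrained("your-base-model")
vocab = tokenizer.get_vocab()

# Collect characters from a Cantonese text sample that are absent from the vocabulary
with open("cantonese_sample.txt", encoding="utf-8") as f:
    text = f.read()

missing = sorted({ch for ch in text if not ch.isspace() and ch not in vocab})
print(missing)
```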

After identifying the missing characters, you can use the high-level API of the Hugging Face `transformers` library to add them to the tokenizer and resize the model's embeddings:

""" Download your model from the Huggingface library

tokenizer.add_tokens("your new tokens") model.resize_token_embeddings(len(tokenizer))

"""