# gpt2-tiny-zh-hk

This model has not been trained on any Cantonese material.

It is simply a base model in which the embeddings and tokenizer were patched with Cantonese characters.

I used this repo to identify the missing Cantonese characters: https://github.com/ayaka14732/bert-tokenizer-cantonese

My forked and modified version: https://github.com/jedcheng/bert-tokenizer-cantonese
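
The repos above do the actual identification work; as a rough illustration of the idea only (not their code), one could compare the characters in a Cantonese text sample against the tokenizer's vocabulary. The checkpoint name and file path below are placeholders, not the ones used for this model:

```python
from transformers import AutoTokenizer

# Placeholder checkpoint: substitute the tokenizer you want to patch
tokenizer = AutoTokenizer.from_pretrained("your-base-model")
vocab = tokenizer.get_vocab()

# Collect characters from a Cantonese text sample that are absent from the vocabulary
with open("cantonese_sample.txt", encoding="utf-8") as f:
    text = f.read()

missing = sorted({ch for ch in text if not ch.isspace() and ch not in vocab})
print(missing)
```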

After identifying the missing characters, you can use the high-level API of the Hugging Face `transformers` library to add them to the tokenizer and resize the model's embeddings:

""" Download your model from the Huggingface library

tokenizer.add_tokens("your new tokens") model.resize_token_embeddings(len(tokenizer))

"""