This model has not been trained on any Cantonese material.
It is simply a base model whose tokenizer and embedding matrix were patched to include Cantonese characters.
I used this repo to identify the missing Cantonese characters: https://github.com/ayaka14732/bert-tokenizer-cantonese
My forked and modified version: https://github.com/jedcheng/bert-tokenizer-cantonese
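The basic idea is to scan Cantonese text and collect the characters the tokenizer cannot represent. Below is a minimal sketch of that idea only, not the linked repo's exact code; the `bert-base-chinese` checkpoint and the sample sentence are placeholder assumptions.

```python
from transformers import AutoTokenizer

# Sketch: collect characters that the tokenizer maps to [UNK].
# "bert-base-chinese" is an assumed example checkpoint, not necessarily the one used here.
tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")

def find_missing_chars(corpus: str) -> set:
    """Return the characters in `corpus` that are absent from the tokenizer vocabulary."""
    return {
        char
        for char in set(corpus)
        if not char.isspace()
        and tokenizer.convert_tokens_to_ids(char) == tokenizer.unk_token_id
    }

print(find_missing_chars("你哋喺邊度食晏？"))  # prints the characters missing from the vocab, if any
```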
Once the missing characters have been identified, the Hugging Face library provides a very high-level API for adding them to the tokenizer and resizing the embeddings:
""" Download your model from the Huggingface library
tokenizer.add_tokens("your new tokens") model.resize_token_embeddings(len(tokenizer))
"""