---
metrics:
- accuracy
model-index:
- name: gpt2-lihkg
  results:
  - task:
      name: Causal Language Modeling
      type: text-generation
    dataset:
      name: lihkg_data
      type: lihkg_data
    metrics:
    - name: Perplexity
      type: Perplexity
      value: 30.93
license: openrail
---
# gpt2-base-lihkg

**Please be aware that the training data might contain inappropriate content. This model is intended for research purposes only.**

The base model can be found [here](https://huggingface.co/jed351/gpt2-base-zh-hk). It was obtained by patching a [GPT2 Chinese model](https://huggingface.co/ckiplab/gpt2-base-chinese) and its tokenizer with Cantonese characters. Refer to the base model card for information on the patching process.
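
As a rough illustration of what such patching involves (not the exact procedure used, which is documented on the base model card), one way to add missing Cantonese characters to a tokenizer and grow the embedding matrix to match is sketched below; the character list is a made-up example:

```python
# Illustrative sketch only: the real patching process is described on the base model card.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ckiplab/gpt2-base-chinese")
model = AutoModelForCausalLM.from_pretrained("ckiplab/gpt2-base-chinese")

# Hypothetical handful of Cantonese characters missing from the original vocabulary.
cantonese_chars = ["嘅", "咗", "喺", "嘢", "哋"]
num_added = tokenizer.add_tokens(cantonese_chars)

# Give the new token ids embedding rows; these rows start untrained,
# which is why further pre-training on Cantonese text is needed.
if num_added > 0:
    model.resize_token_embeddings(len(tokenizer))
```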
The training data was obtained by scraping LIHKG, an online forum in Hong Kong.
The scraper can be found [here](https://github.com/ayaka14732/lihkg-scraper).
Please also check out the [Bart model](https://huggingface.co/Ayaka/bart-base-cantonese) created by its author.
### Limitations

The model was trained on ~10 GB of data scraped from LIHKG.
The data may contain violent and offensive language, and so may the text generated by this model.
Please do not use it for anything other than research or entertainment.

Comments on LIHKG also tend to be very short.
As a result, the model rarely generates more than a single line, and on many occasions it may not produce any new tokens at all.
### Training procedure

Please refer to the [language-modeling example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/language-modeling) provided by Hugging Face.

The model was trained for 7 epochs on 1 NVIDIA Quadro RTX 6000 at the Research Computing Services of Imperial College London.
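
For orientation, the sketch below shows the general shape of causal language model fine-tuning with the `Trainer` API. It is a simplified stand-in for the `run_clm.py` example script: the file name, batch size and sequence length are assumptions rather than the settings actually used (only the 7 epochs come from this card).

```python
# Simplified sketch of the fine-tuning setup; the actual training used the
# Hugging Face run_clm.py example script. File names and hyperparameters
# below (other than the 7 epochs) are illustrative assumptions.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("jed351/gpt2-base-zh-hk")
model = AutoModelForCausalLM.from_pretrained("jed351/gpt2-base-zh-hk")

# "lihkg.txt" is a placeholder for the scraped corpus, one comment per line.
raw = load_dataset("text", data_files={"train": "lihkg.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])

args = TrainingArguments(
    output_dir="gpt2-lihkg",
    num_train_epochs=7,             # from this card
    per_device_train_batch_size=8,  # assumed value
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    # mlm=False gives plain causal-LM labels (inputs shifted by one).
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
# Given an eval_dataset, perplexity is exp(eval_loss); this card reports 30.93.
```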
### How to use it?

```python
from transformers import AutoTokenizer
from transformers import TextGenerationPipeline, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("jed351/gpt2_base_zh-hk-lihkg")
model = AutoModelForCausalLM.from_pretrained("jed351/gpt2_base_zh-hk-lihkg")

# Try messing around with the generation parameters.
generator = TextGenerationPipeline(model, tokenizer,
                                   max_new_tokens=200,
                                   no_repeat_ngram_size=3)  # pass device=0 here if you have a GPU

input_string = "your input"
output = generator(input_string)

# The tokenizer decodes with spaces between characters; strip them.
string = output[0]['generated_text'].replace(' ', '')
print(string)
```
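
As noted in the limitations above, generations are often very short or empty. If that happens, one option is to pass different generation settings in the pipeline call. The values below are only suggestions to experiment with, not a recommended configuration, and `min_new_tokens` requires a reasonably recent `transformers` release.

```python
# Illustrative settings to coax longer output; tune (or remove) to taste.
output = generator(
    input_string,
    do_sample=True,       # sample instead of greedy decoding
    top_k=50,             # assumed value
    temperature=0.9,      # assumed value
    min_new_tokens=20,    # force at least 20 new tokens; needs a recent transformers version
)
print(output[0]['generated_text'].replace(' ', ''))
```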
### Framework versions

- Transformers 4.26.0.dev0
- Pytorch 1.13.1
- Datasets 2.8.0
- Tokenizers 0.13.2