jed351 / gpt2_base_zh-hk-lihkg

jed351 committed on Feb 14, 2023

Commit 80155a7 · 1 Parent(s): 15dc901

Create README.md

Files changed (1)

README.md +70 -0

README.md ADDED

	@@ -0,0 +1,70 @@

+---
+metrics:
+- accuracy
+model-index:
+- name: gpt2-lihkg
+  results:
+  - task:
+      name: Causal Language Modeling
+      type: text-generation
+    dataset:
+      name: lihkg_data
+      type: lihkg_data
+    metrics:
+    - name: Perplexity
+      type: Perplexity
+      value: 30.93
+license: openrail
+---
+# gpt2-shikoto
+**Please be aware that the training data might contain inappropriate content. This model is intended for research purposes only.**
+The base model is [jed351/gpt2-base-zh-hk](https://huggingface.co/jed351/gpt2-base-zh-hk), which was obtained by
+patching a [GPT2 Chinese model](https://huggingface.co/ckiplab/gpt2-base-chinese) and its tokenizer with Cantonese characters.
+Refer to the base model's card for details of the patching process.
+The training data was obtained by scraping LIHKG, an online forum in Hong Kong.
+The scraping tool can be found [here](https://github.com/ayaka14732/lihkg-scraper).
+Please also check out the [BART model](https://huggingface.co/Ayaka/bart-base-cantonese) created by the tool's author.
+## Training procedure
+Please refer to the [script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/language-modeling)
+provided by Hugging Face.
+The model was trained for 400,000 steps with a batch size of 5 on two NVIDIA Quadro RTX 6000 GPUs for around 40 hours, using the Research Computing Services of Imperial College London.
+### How to use it?
+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer, TextGenerationPipeline
+
+tokenizer = AutoTokenizer.from_pretrained("jed351/gpt2_base_zh-hk-lihkg")
+model = AutoModelForCausalLM.from_pretrained("jed351/gpt2_base_zh-hk-lihkg")
+
+# Try adjusting the generation parameters.
+generator = TextGenerationPipeline(
+    model,
+    tokenizer,
+    max_new_tokens=200,
+    no_repeat_ngram_size=3,
+    # device=0,  # uncomment if you have a GPU
+)
+
+input_string = "your input"
+output = generator(input_string)
+# Strip the spaces from the decoded output.
+string = output[0]["generated_text"].replace(" ", "")
+print(string)
+```
+### Framework versions
+- Transformers 4.26.0.dev0
+- PyTorch 1.13.1
+- Datasets 2.8.0
+- Tokenizers 0.13.2
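
The front matter in the README above reports a perplexity of 30.93. To relate that figure to a loss value, here is a minimal sketch, assuming the number was computed the way the referenced Hugging Face `run_clm.py` example does, as the exponential of the mean evaluation cross-entropy loss (in nats). The helper names `perplexity_from_loss` and `loss_from_perplexity` are illustrative and not part of the model card or of any library.

```python
import math


def perplexity_from_loss(mean_cross_entropy: float) -> float:
    """Convert a mean per-token cross-entropy (in nats) to perplexity."""
    return math.exp(mean_cross_entropy)


def loss_from_perplexity(perplexity: float) -> float:
    """Inverse of perplexity_from_loss: recover the mean cross-entropy."""
    return math.log(perplexity)


# Value reported in the model card's front matter.
reported_perplexity = 30.93

# Assumption: the reported value is exp(eval_loss), so the implied
# evaluation loss is its natural logarithm (roughly 3.43 nats per token).
implied_eval_loss = loss_from_perplexity(reported_perplexity)
print(f"implied eval loss: {implied_eval_loss:.4f}")
```

Under that assumption, a lower evaluation loss on the same held-out data translates directly into a lower perplexity, so the two numbers can be used interchangeably when comparing checkpoints.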