jed351 committed
Commit 80155a7
1 Parent(s): 15dc901

Create README.md

Files changed (1):
1. README.md +70 -0

README.md ADDED
---
metrics:
- accuracy
model-index:
- name: gpt2-lihkg
  results:
  - task:
      name: Causal Language Modeling
      type: text-generation
    dataset:
      name: lihkg_data
      type: lihkg_data
    metrics:
    - name: Perplexity
      type: Perplexity
      value: 30.93
license: openrail
---

# gpt2-lihkg

**Please be aware that the training data might contain inappropriate content. This model is intended for research purposes only.**

The base model can be found [here](https://huggingface.co/jed351/gpt2-base-zh-hk); it was obtained by
patching a [GPT2 Chinese model](https://huggingface.co/ckiplab/gpt2-base-chinese) and its tokenizer with Cantonese characters.
Refer to the base model card for details on the patching process.
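
As a rough illustration of what patching the tokenizer means in practice, the sketch below adds new character tokens and resizes the embedding matrix to match. This is only a hedged example, not the exact procedure used for the base model, and the character list is a placeholder; see the base model card for the real process.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Start from the original Chinese GPT-2 checkpoint and its tokenizer.
tokenizer = AutoTokenizer.from_pretrained("ckiplab/gpt2-base-chinese")
model = AutoModelForCausalLM.from_pretrained("ckiplab/gpt2-base-chinese")

# Placeholder list of Cantonese-specific characters missing from the vocabulary.
cantonese_chars = ["嘅", "咗", "喺", "哋", "嚟", "乜"]  # illustrative only

# Register the new tokens and enlarge the embedding table accordingly.
num_added = tokenizer.add_tokens(cantonese_chars)
model.resize_token_embeddings(len(tokenizer))
print(f"added {num_added} tokens; vocabulary size is now {len(tokenizer)}")
```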

The training data was obtained by scraping LIHKG, an online forum in Hong Kong,
using [this scraper](https://github.com/ayaka14732/lihkg-scraper).
Please also check out the [Bart model](https://huggingface.co/Ayaka/bart-base-cantonese) created by the scraper's author.

## Training procedure

Please refer to the [script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/language-modeling)
provided by Hugging Face.

The model was trained for 400,000 steps with a batch size of 5 on two NVIDIA Quadro RTX 6000 GPUs for around 40 hours at the Research Computing Services of Imperial College London.
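
The full training configuration is not given in this card. Below is a minimal, hedged sketch of an equivalent Trainer-based causal language modeling run using the figures stated above (400,000 steps, per-device batch size 5); the data file `lihkg_data.txt`, the sequence length, and the learning rate are placeholders rather than the actual values.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("jed351/gpt2-base-zh-hk")
model = AutoModelForCausalLM.from_pretrained("jed351/gpt2-base-zh-hk")

# Placeholder: one scraped LIHKG post per line in a local text file.
raw = load_dataset("text", data_files={"train": "lihkg_data.txt"})

def tokenize(batch):
    # The sequence length of 512 is a placeholder, not taken from the card.
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = raw["train"].map(tokenize, batched=True, remove_columns=["text"])

args = TrainingArguments(
    output_dir="gpt2-lihkg",
    max_steps=400_000,              # stated in the card
    per_device_train_batch_size=5,  # stated in the card (two GPUs were used)
    learning_rate=5e-5,             # placeholder
    save_steps=50_000,
    logging_steps=1_000,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```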

### How to use it?

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, TextGenerationPipeline

tokenizer = AutoTokenizer.from_pretrained("jed351/gpt2_base_zh-hk-lihkg")
model = AutoModelForCausalLM.from_pretrained("jed351/gpt2_base_zh-hk-lihkg")

# Try experimenting with the generation parameters.
generator = TextGenerationPipeline(model, tokenizer,
                                   max_new_tokens=200,
                                   no_repeat_ngram_size=3)  # add device=0 if you have a GPU

input_string = "your input"
output = generator(input_string)

# The tokenizer inserts spaces between characters; remove them from the generated text.
string = output[0]['generated_text'].replace(' ', '')
print(string)
```
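
The model card reports a perplexity of 30.93 on lihkg_data. As a hedged illustration only (this is not the exact evaluation setup, and the input text is a placeholder), perplexity on a piece of text can be computed from the model's causal language modeling loss:

```python
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("jed351/gpt2_base_zh-hk-lihkg")
model = AutoModelForCausalLM.from_pretrained("jed351/gpt2_base_zh-hk-lihkg")
model.eval()

text = "your held-out Cantonese text"  # placeholder
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # With labels supplied, the model returns the mean cross-entropy loss
    # over the shifted tokens; perplexity is its exponential.
    loss = model(**inputs, labels=inputs["input_ids"]).loss

print(f"perplexity: {math.exp(loss.item()):.2f}")
```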

### Framework versions

- Transformers 4.26.0.dev0
- Pytorch 1.13.1
- Datasets 2.8.0
- Tokenizers 0.13.2