Update README.md

The main limitation of this model is that it can only generate text in the style of H G Wells.

I created my own dataset to train this model. I chose 14 novels written by H G Wells. Most of the novels in the dataset are of the science fiction genre, and the dataset contains more than 1 million tokens.

The evaluation results are encouraging: the model is able to generate text in the style of H G Wells, and most of the generated text is in the science fiction genre.

The texts included in the corpus are novels written by H G Wells. The novels in the corpus, with their token counts, are:

The Time Machine - 37677
…
The Red Room - 4618

The total number of tokens in the corpus is 1043588.

The corpus was created by downloading and combining 14 novels by the famous author H G Wells from Project Gutenberg. Most of these novels are science fiction, so this model has been trained to generate text of the science fiction genre in the style of H G Wells.
This model was created on 23rd February, 2023.

The corpus consists of 14 novels by H G Wells downloaded from Project Gutenberg. During preprocessing, the text added by Project Gutenberg at the beginning and end of each novel was removed, the entire text of each novel was collapsed into a single line, and that line was broken into 20 parts, producing 20 lines per novel. The lines from all the novels were then combined and stored in a single text file. The text was tokenized with the GPT2Tokenizer from the transformers library, and the resulting file was used to finetune the model.
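
The preprocessing described above can be sketched as follows. This is an illustrative sketch only: the file layout, function names, and the character-based split are assumptions, since the README does not specify exactly how each single line was broken into 20 parts.

```python
from pathlib import Path


def preprocess_novel(text: str, parts: int = 20) -> list[str]:
    """Collapse a novel into one line, then split that line into `parts` pieces.

    The split here is by character count; the original split method is not
    specified in the README, so this is an assumption.
    """
    one_line = " ".join(text.split())      # flatten all whitespace and newlines
    size = len(one_line) // parts + 1      # rough number of characters per piece
    return [one_line[i * size:(i + 1) * size] for i in range(parts)]


def build_corpus(novel_files: list[Path], out_file: Path) -> None:
    """Combine the 20 lines from each novel into a single text file."""
    lines: list[str] = []
    for path in novel_files:
        lines.extend(preprocess_novel(path.read_text(encoding="utf-8")))
    out_file.write_text("\n".join(lines), encoding="utf-8")
```

The resulting text file is what would then be tokenized and fed to the finetuning script.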
The values of the hyperparameters used during finetuning are:
batch_size = 2

learning rate = 5e-4

warmup steps = 1e2
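
For context on the warmup setting, here is a minimal sketch of a linear warmup-then-decay learning-rate schedule with the values above. The README does not state which scheduler or how many total steps were used, so the schedule shape and `total_steps` are assumptions; the shape shown matches the common linear schedule used in transformer finetuning.

```python
def lr_at_step(step: int, base_lr: float = 5e-4, warmup: int = 100,
               total_steps: int = 10_000) -> float:
    """Learning rate at a given step: linear warmup from 0 to base_lr over
    `warmup` steps, then linear decay back to 0 by `total_steps`.
    `total_steps` is a placeholder value, not taken from the README."""
    if step < warmup:
        return base_lr * step / warmup
    return base_lr * max(0.0, (total_steps - step) / (total_steps - warmup))
```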
Training Loss: 2.26
Training Perplexity: 9.57
Validation Loss: 3.84
Validation Perplexity: 46.43
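
As a sanity check, the reported loss and perplexity values are related by perplexity = exp(loss), the standard definition when the loss is the average cross-entropy in nats:

```python
import math


def perplexity(loss: float) -> float:
    """Perplexity is exp(loss) when loss is the mean cross-entropy in nats."""
    return math.exp(loss)

# exp(2.26) ≈ 9.58, matching the reported training perplexity of 9.57;
# exp(3.84) ≈ 46.5, matching the reported validation perplexity of 46.43,
# up to rounding of the logged loss values.
```

The gap between training perplexity (9.57) and validation perplexity (46.43) suggests some overfitting to the 14-novel corpus, which is unsurprising given its size.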

The corpus has been uploaded to HuggingFace. It can be accessed from the following link: https://huggingface.co/datasets/MinzaKhan/HGWells