StKirill committed on
Commit
0649ddf
·
verified ·
1 Parent(s): f447f28

Update README.md

  1. README.md +22 -3
README.md CHANGED
@@ -40,9 +40,28 @@ The original transcripts that I scraped from the Internet were in English. Then
40
 
41
  3. Third, because I wanted to capture a style unique to each character, I threw out common phrases used by all 6 main characters ("You know what!", "Oh my god!", etc.).
42
 
43
- Thus, I collected a corpus of dialogues for the 6 characters that included **XXXX** sentences. You can see the detailed distribution by number of sentences and words for each character in the table below:
44
- !["Number of replicas for every character"](https://huggingface.co/spaces/StKirill/GenerativeChatBot/blob/main/images/NumberOfReplicasInEpisode.png)
45
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
46
 
47
  ### Data analysis
48
 
 
40
 
41
  3. Third, because I wanted to capture a style unique to each character, I threw out common phrases used by all 6 main characters ("You know what!", "Oh my god!", etc.).
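The filtering step in item 3 can be illustrated with a minimal sketch. It assumes the corpus is held as a dictionary mapping each character to a list of sentences; that layout, the `normalize` helper, and `drop_shared_phrases` are hypothetical names for illustration, not the code used in the project.

```python
import re


def normalize(sentence: str) -> str:
    """Lowercase and strip punctuation so "Oh my god!" matches "Oh my God!"."""
    return re.sub(r"[^a-z0-9' ]+", "", sentence.lower()).strip()


def drop_shared_phrases(lines_by_character: dict[str, list[str]]) -> dict[str, list[str]]:
    """Remove every sentence whose normalized form is used by all characters."""
    normalized_sets = [
        {normalize(s) for s in sentences}
        for sentences in lines_by_character.values()
    ]
    # Phrases present in every character's set are treated as "common".
    shared = set.intersection(*normalized_sets) if normalized_sets else set()
    return {
        name: [s for s in sentences if normalize(s) not in shared]
        for name, sentences in lines_by_character.items()
    }


# Toy data (assumption): two characters instead of six.
corpus = {
    "Rachel": ["Oh my god!", "I got off the plane."],
    "Ross": ["Oh my God!", "We were on a break!"],
}
filtered = drop_shared_phrases(corpus)
```

An exact-match intersection is the strictest reading of "used by all 6 main characters"; a frequency threshold would be a looser alternative.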
42
 
43
+ Thus, I collected a corpus of dialogues for the 6 characters containing about 8k sentences per character. The detailed distribution of sentence counts per character is shown in the figure below:
44
+ !["Number of replicas for every character"](./images/NumberOfReplicasOverCharacter.png)
 
45
 
46
  ### Data analysis
47
+ The number of replicas (lines of dialogue) per season is shown below:
48
+ !["Number of replicas for all seasons"](./images/NumberOfReplicasOverSeason.png)
49
+ As can be seen, the mean number of replicas per season is about 6000, with a standard deviation of about 400.
50
+
51
+ The number of replicas per episode is shown below:
52
+ !["Number of replicas for episode"](./images/NumberOfReplicasInEpisode.png)
53
+ As can be seen, the mean number of replicas per episode is about 265, with a standard deviation of about 65.
54
+
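For reference, a mean and standard deviation of this kind can be computed as in the sketch below. The counts are made-up placeholder values, not the real per-episode data, and the use of the sample standard deviation is an assumption, since the README does not say which variant was used.

```python
import statistics

# Hypothetical replica counts for a handful of episodes (not the real data).
replicas_per_episode = [210, 265, 330, 240, 280]

mean_replicas = statistics.mean(replicas_per_episode)
# Sample standard deviation (divides by n - 1); use pstdev for the population version.
std_replicas = statistics.stdev(replicas_per_episode)

print(f"mean={mean_replicas}, std={std_replicas:.1f}")
```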
55
+ # Analysing the styles of the characters
56
+ Before training the language patterns, I investigated the style features of Rachel. Specifically, to identify the speech features, I did the following:
57
+
58
+ * Calculated descriptive statistics: number of words, average number of words per sentence, readability index, proportion of compound words, etc.
59
+
60
+ * Identified the most frequent words for the characters;
61
+
62
+ * Measured the proportion of positive and negative words.
63
+
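The steps listed above can be sketched roughly as follows. This is an illustration under stated assumptions, not the analysis code behind the README: the tokenizer is a simple regular expression, the `POSITIVE` and `NEGATIVE` word sets are tiny placeholder lexicons, and the readability index and compound-word share are left out.

```python
import re
from collections import Counter

# Tiny illustrative lexicons (assumption); a real analysis would use a full sentiment lexicon.
POSITIVE = {"love", "great", "happy"}
NEGATIVE = {"hate", "terrible", "sad"}


def tokenize(text: str) -> list[str]:
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def style_stats(sentences: list[str], top_n: int = 3) -> dict:
    """Compute simple per-character style statistics from a list of sentences."""
    all_tokens = [t for s in sentences for t in tokenize(s)]
    total = len(all_tokens)
    return {
        "n_sentences": len(sentences),
        "n_words": total,
        "avg_words_per_sentence": total / len(sentences) if sentences else 0.0,
        "top_words": Counter(all_tokens).most_common(top_n),
        "positive_share": sum(t in POSITIVE for t in all_tokens) / total if total else 0.0,
        "negative_share": sum(t in NEGATIVE for t in all_tokens) / total if total else 0.0,
    }


# Toy input (assumption): two sentences for one character.
stats = style_stats(["I love this job", "I hate this coffee"])
```

In practice, stop words would probably be removed before ranking the most frequent words, otherwise the top of the list is dominated by function words.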
64
+ From the plot above, we can draw tentative conclusions about the specifics of the characters' speech. For example, Ross and Rachel are the most talkative: they have the largest number of sentences.
65
+
66
+ After this initial analysis of speech, I investigated the vocabulary in more detail and analysed it in terms of the complexity of the words used by the characters. By convention, I treated long words of more than 4 syllables as "difficult" words. The proportion of difficult words for each character can be seen in the graph below:
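To make the "more than 4 syllables" criterion concrete, here is a minimal sketch. It assumes English text and estimates syllables by counting groups of consecutive vowels, which is only a rough heuristic; the README does not say how syllables were actually counted, and both function names are hypothetical.

```python
import re


def count_syllables(word: str) -> int:
    """Rough estimate: count groups of consecutive vowels (heuristic, not exact)."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def difficult_share(sentences: list[str], min_syllables: int = 5) -> float:
    """Share of words with at least `min_syllables` syllables (i.e. more than 4)."""
    words = [w for s in sentences for w in re.findall(r"[A-Za-z']+", s)]
    if not words:
        return 0.0
    difficult = [w for w in words if count_syllables(w) >= min_syllables]
    return len(difficult) / len(words)


# Toy input (assumption): one sentence with two long words out of four.
share = difficult_share(["Paleontology is unbelievably cool"])
```

A pronunciation dictionary would give more reliable syllable counts than the vowel-group heuristic, at the cost of an extra dependency.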
67