Spaces: StKirill/GenerativeChatBot

StKirill committed on Feb 25, 2024

Commit 0649ddf (verified) · 1 Parent(s): f447f28

Update README.md

Files changed (1)

README.md +22 -3

@@ -40,9 +40,28 @@ The original transcripts that I scraped from the Internet were in English. Then
 3. Third, because I wanted to capture a style unique to each character, I threw out common phrases used by all 6 main characters ("You know what!", "Oh my god!", etc.).
-Thus, I collected a corpus of dialogues for the 6 characters that included **XXXX** sentences. You can see the detailed distribution by number of sentences and words for each character in the table below:
-!["Number of replicas for every character"](https://huggingface.co/spaces/StKirill/GenerativeChatBot/blob/main/images/NumberOfReplicasInEpisode.png)
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+Thus, I collected a corpus of dialogues for the 6 characters, with about 8k sentences per character. The distribution of the number of sentences across characters is shown in the figure below:
+!["Number of replicas for every character"](./images/NumberOfReplicasOverCharacter.png)
 ### Data analysis
+The number of replicas (lines of dialogue) in each season is shown below:
+!["Number of replicas for all seasons"](./images/NumberOfReplicasOverSeason.png)
+As can be seen, the mean number of replicas per season is about 6000, with a standard deviation of about 400.
+The number of replicas in each episode is shown below:
+!["Number of replicas for episode"](./images/NumberOfReplicasInEpisode.png)
+As can be seen, the mean number of replicas per episode is about 265, with a standard deviation of about 65.
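A minimal sketch of how per-season and per-episode means and standard deviations like these could be computed. The `replicas` list, the `count_stats` helper, and the use of the population standard deviation are assumptions for illustration; the README does not say how the figures were actually obtained.

```python
from collections import Counter
from statistics import mean, pstdev

# Hypothetical sample: one (season, episode) pair per replica.
# The real corpus is not reproduced here.
replicas = [
    (1, 1), (1, 1), (1, 1), (1, 2),
    (2, 1), (2, 1), (2, 2), (2, 2),
]


def count_stats(keys):
    """Return (mean, population standard deviation) of replica counts per key."""
    counts = Counter(keys)
    values = list(counts.values())
    return mean(values), pstdev(values)


# Group by season only, then by (season, episode).
per_season = count_stats(season for season, _ in replicas)
per_episode = count_stats(replicas)
```

Swapping `pstdev` for `statistics.stdev` would give the sample standard deviation instead; which of the two the reported values correspond to is not stated.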
+# Analysing the styles of the characters
+Before training the language patterns, I investigated the style features of Rachel. Specifically, to identify the speech features, I looked at the following:
+* Descriptive statistics: number of words, average number of words per sentence, readability index, proportion of compound words, etc.;
+* The most frequent words for the characters;
+* The proportion of positive and negative words.
+From the plot above, we can draw tentative conclusions about the specifics of the characters' speech. For example, Ross and Rachel are the most talkative: they have the largest number of sentences.
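The descriptive statistics listed above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code used for the README: the `tokenize` and `style_stats` helpers, the tiny `POSITIVE` and `NEGATIVE` word sets, and the example sentences are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical mini-lexicons; a real analysis would use a full sentiment lexicon.
POSITIVE = {"love", "great", "happy"}
NEGATIVE = {"hate", "bad", "sad"}


def tokenize(sentence):
    """Lowercase a sentence and split it into word tokens."""
    return re.findall(r"[a-z']+", sentence.lower())


def style_stats(sentences, top_n=3):
    """Compute simple style features for one character's sentences."""
    words = [w for s in sentences for w in tokenize(s)]
    total = len(words)
    return {
        "n_words": total,
        "words_per_sentence": total / len(sentences) if sentences else 0.0,
        "top_words": Counter(words).most_common(top_n),
        "positive_share": sum(w in POSITIVE for w in words) / total if total else 0.0,
        "negative_share": sum(w in NEGATIVE for w in words) / total if total else 0.0,
    }


# Made-up example sentences, not taken from the corpus.
stats = style_stats(["I love this great sweater", "I hate Mondays"])
```

Running `style_stats` once per character would yield comparable feature dictionaries; the readability index and the compound-word proportion are left out because the README does not define them.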
+After this initial analysis of speech, I investigated the vocabulary in more detail and analysed it in terms of the complexity of the words used by the characters. By convention, long words of more than 4 syllables were treated as "difficult" words. The proportion of difficult words for each character can be seen in the graph below: