Update README.md
README.md
CHANGED
@@ -40,9 +40,28 @@ The original transcripts that I scraped from the Internet were in English. Then
|
|
40 |
|
41 |
3. Third, because I wanted to capture a style unique to each character, I threw out common phrases used by all 6 main characters ("You know what!", "Oh my god!", etc.).
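A minimal sketch of this filtering step, assuming the corpus is held as a mapping from character name to a list of lines. The `corpus` contents, `normalize`, and `drop_shared_lines` are hypothetical names for illustration, not the code used to build the dataset, and only three characters are shown for brevity:

```python
# Toy data: hypothetical lines, not taken from the actual corpus.
corpus = {
    "Rachel": ["Oh my god!", "I got off the plane."],
    "Ross": ["Oh my god!", "We were on a break!"],
    "Joey": ["Oh my god!", "How you doin'?"],
}

def normalize(line):
    """Lowercase and strip whitespace so trivial variants compare equal."""
    return line.strip().lower()

def drop_shared_lines(corpus):
    """Remove lines that every character in the corpus says at least once."""
    per_character = [{normalize(l) for l in lines} for lines in corpus.values()]
    shared = set.intersection(*per_character)
    return {
        name: [l for l in lines if normalize(l) not in shared]
        for name, lines in corpus.items()
    }

filtered = drop_shared_lines(corpus)
print(filtered["Rachel"])
```

Exact matching after normalisation is the simplest possible rule; a looser criterion (for example, phrases used by most characters rather than all) may suit a real corpus better.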
|
42 |
|
43 |
-
Thus, I collected a corpus of dialogues for the 6 characters that included
|
44 |
-
.
|
42 |
|
43 |
+
Thus, I collected a corpus of dialogues for the 6 characters that included about 8k sentences for every character. You can see the detailed distribution by number of sentences for each character in the table below:
|
44 |
+

|
|
|
45 |
|
46 |
### Data analysis
|
47 |
+
The number of replicas (lines of dialogue) in each season is shown below:
|
48 |
+

|
49 |
+
As can be seen, the mean number of replicas per season is about 6000, with a standard deviation of about 400.
|
50 |
+
|
51 |
+
The number of replicas in each episode is shown below:
|
52 |
+

|
53 |
+
As can be seen, the mean number of replicas per episode is 265, with a standard deviation of about 65.
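Summary figures like these can be computed with Python's standard library. The sketch below uses made-up counts, so `replicas_per_episode` and its values are purely illustrative and do not reproduce the numbers reported above:

```python
from statistics import mean, stdev

# Hypothetical per-episode replica counts, for illustration only.
replicas_per_episode = [240, 300, 210, 270, 280]

avg = mean(replicas_per_episode)
spread = stdev(replicas_per_episode)  # sample standard deviation (divides by n - 1)

print(f"mean = {avg:.1f}, std = {spread:.1f}")
```

`stdev` gives the sample standard deviation; `pstdev` from the same module would treat the episodes as the whole population instead.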
|
54 |
+
|
55 |
+
# Analysing the styles of the characters
|
56 |
+
Before training the language patterns, I investigated the style features of Rachel. Specifically, to identify the speech features, I did the following:
|
57 |
+
|
58 |
+
* Calculated descriptive statistics: number of words, average number of words per sentence, readability index, proportion of compound words, etc.
|
59 |
+
|
60 |
+
* Identified the most frequent words for each character;
|
61 |
+
|
62 |
+
* Measured the proportion of positive and negative words.
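The first two steps can be sketched roughly as follows. The tokenisation rule and the helper names `describe` and `top_words` are assumptions made for this example, not the analysis code behind the plots:

```python
import re
from collections import Counter

WORD = re.compile(r"[A-Za-z']+")  # crude tokeniser: runs of letters and apostrophes

def describe(sentences):
    """Return simple descriptive statistics for a list of sentences."""
    words_per_sentence = [len(WORD.findall(s)) for s in sentences]
    total_words = sum(words_per_sentence)
    return {
        "sentences": len(sentences),
        "words": total_words,
        "avg_words_per_sentence": total_words / len(sentences) if sentences else 0.0,
    }

def top_words(sentences, n=10):
    """Return the n most frequent lowercased words with their counts."""
    counts = Counter(w.lower() for s in sentences for w in WORD.findall(s))
    return counts.most_common(n)

sample = ["We were on a break!", "Pivot!"]  # toy input
stats = describe(sample)
print(stats)
print(top_words(sample, n=3))
```

A real analysis would probably also remove stop words before counting, otherwise the most frequent words are dominated by "I", "you", "the" and similar for every character.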
|
63 |
+
|
64 |
+
From the plot above, we can draw tentative conclusions about the specifics of the characters' speech. For example, Ross and Rachel are the most talkative: they have the largest number of sentences.
|
65 |
+
|
66 |
+
After this initial analysis of speech, I investigated the vocabulary in more detail and analysed it in terms of the complexity of the words used by the characters. By convention, I treated long words of more than 4 syllables as "difficult" words. The proportion of difficult words for each character can be seen in the graph below:
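One way to estimate this proportion is sketched below. It assumes a rough syllable heuristic (counting groups of consecutive vowels), which is not necessarily how the syllables were counted for the graph; `count_syllables` and `difficult_share` are hypothetical helpers:

```python
import re

def count_syllables(word):
    """Rough syllable estimate: number of vowel groups (a, e, i, o, u, y)."""
    return len(re.findall(r"[aeiouy]+", word.lower()))

def difficult_share(text, min_syllables=5):
    """Share of words with more than 4 (i.e. at least 5) estimated syllables."""
    words = re.findall(r"[A-Za-z]+", text)
    if not words:
        return 0.0
    difficult = [w for w in words if count_syllables(w) >= min_syllables]
    return len(difficult) / len(words)

print(difficult_share("paleontological dig"))
```

The vowel-group heuristic miscounts silent vowels and some diphthongs, so a pronunciation dictionary or a dedicated syllable counter would give more reliable numbers on a real corpus.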
|
67 |
|