Update README.md
README.md
CHANGED
@@ -71,5 +71,25 @@ After this initial analysis of speech, I investigated the vocabulary in more det
## Most frequent Rachel words
I analyzed Rachel's most frequent words, excluding the English stopwords from NLTK (`nltk.corpus.stopwords.words("english")`). The result of this analysis is shown below.
Alternatively, the same data can be shown as an image.
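
The frequency count described in this section can be sketched roughly as follows. This is a minimal illustration, not the exact code behind the results: the sample lines, the `most_frequent_words` helper and the small fallback stopword set are all hypothetical, the latter included only so the snippet runs without downloading anything. With NLTK installed and the corpus fetched via `nltk.download("stopwords")`, the fallback can be replaced by `set(nltk.corpus.stopwords.words("english"))`.

```python
import re
from collections import Counter

# Tiny stand-in stopword list (an assumption, not NLTK's full list).
# Swap in set(nltk.corpus.stopwords.words("english")) when NLTK is available.
FALLBACK_STOPWORDS = {
    "i", "a", "the", "you", "to", "and", "it", "is", "that", "of", "my", "oh",
}


def most_frequent_words(lines, stopwords=FALLBACK_STOPWORDS, top_n=10):
    """Count lowercase word tokens across all lines, skipping stopwords."""
    counts = Counter()
    for line in lines:
        # Keep letters and apostrophes so contractions stay in one piece.
        tokens = re.findall(r"[a-z']+", line.lower())
        counts.update(token for token in tokens if token not in stopwords)
    return counts.most_common(top_n)


# Hypothetical lines for illustration, not taken from the real dataset.
sample_lines = [
    "Oh my God, I got the job!",
    "I got off the plane.",
    "It is not that common, Ross!",
]

top_words = most_frequent_words(sample_lines, top_n=3)
print(top_words)
```

The returned list of `(word, count)` pairs can be passed directly to a plotting or word cloud library to produce the image version.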
# Data preparation
We collected Rachel's phrases and split them into two datasets: replicas (lines of dialogue) and phrases. For modelling purposes, we added an extra set of tokens and tags to every replica:

Special tokens `<s>` and `</s>` that denote the beginning and end of the example.

The character's name is written in capital letters.

A special pseudonym NOTFRIEND, which marks the other speaker's line in dialogue pairs of the form "NOTFRIEND's line - HERO's response". We use this pseudonym to separate other people's lines from those of the hero whose style we want to mimic.
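
To make the tagging scheme concrete, here is a small sketch of how a single line might be tagged. The token names `<s>`, `</s>` and `NOTFRIEND` come from the description above, but the exact layout (a `NAME: text` prefix, with no space between the tokens and the text) is an assumption, as are the helper names `tag_line` and `wrap_example` and the sample sentences.

```python
BOS, EOS = "<s>", "</s>"      # begin/end-of-example tokens named in the text
OTHER_SPEAKER = "NOTFRIEND"   # pseudonym for any speaker who is not the hero


def tag_line(speaker, text):
    """Prefix a line with the speaker's name in capitals (layout is assumed)."""
    return f"{speaker.upper()}: {text.strip()}"


def wrap_example(body):
    """Surround one training example with the begin/end tokens."""
    return f"{BOS}{body}{EOS}"


# Hypothetical sentences, used only to show the resulting strings.
hero_example = wrap_example(tag_line("Rachel", "I got off the plane."))
other_line = tag_line(OTHER_SPEAKER, "Did you get off the plane?")

print(hero_example)
print(other_line)
```

Replacing every other speaker's name with one shared pseudonym keeps the model from learning the styles of individual non-hero characters.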

Using the data with additional tokens, I generated two datasets for Rachel in English. Below is a brief description of each of them:

1. Raw monologues - a dataset containing individual lines of one of the characters. This dataset allows the model to get the most information about the style of a particular character.

2. Raw dialogues - a dataset that contains pairs of the form "NOTFRIEND's cue - HERO's response", separated by a line break character `\n`. The dialogue dataset is needed because we want our model to be able to maintain a Friends-style conversation with the user, not just generate text.
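
The split into the two datasets could look roughly like the sketch below. It assumes the transcript is available as an ordered list of `(speaker, text)` turns and that the begin/end tokens wrap the whole example. Both are assumptions made for illustration, and `build_datasets` and the sample turns are hypothetical.

```python
HERO = "RACHEL"
OTHER = "NOTFRIEND"


def build_datasets(transcript, hero=HERO):
    """Split ordered (speaker, text) turns into monologues and dialogue pairs."""
    monologues, dialogues = [], []
    previous = None  # the turn immediately before the current one
    for speaker, text in transcript:
        if speaker.upper() == hero:
            line = f"{hero}: {text}"
            # Every hero line becomes a monologue example.
            monologues.append(f"<s>{line}</s>")
            # Pair the hero's line with the preceding line only when that
            # line was spoken by someone else.
            if previous is not None and previous[0].upper() != hero:
                dialogues.append(f"<s>{OTHER}: {previous[1]}\n{line}</s>")
        previous = (speaker, text)
    return monologues, dialogues


# Hypothetical turns for illustration only.
transcript = [
    ("Ross", "Did you get off the plane?"),
    ("Rachel", "I got off the plane."),
    ("Rachel", "I do love you."),
    ("Monica", "Welcome back!"),
]

monologues, dialogues = build_datasets(transcript)
print(monologues)
print(dialogues)
```

In this sketch, a hero line that directly follows another hero line contributes only to the monologue dataset, since there is no other speaker's cue to pair it with.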