StKirill committed on
Commit
cd153d6
·
verified ·
1 Parent(s): 4b0d648

Update README.md

Files changed (1)
  1. README.md +20 -0
README.md CHANGED
@@ -71,5 +71,25 @@ After this initial analysis of speech, I investigated the vocabulary in more det
71
  ## Most frequent Rachel words
72
  I analyzed Rachel's most frequent words, excluding the English stopwords from NLTK (`nltk.corpus.stopwords.words("english")`). The result of this analysis is shown below.
73
  !["Most frequent Rachel's words"](./images/RachelMostFrequentWords.png)
 
 
74
+ The same data can also be shown as an image:
75
+ !["Most frequent Rachel's words in image"](./images/RachelWords.png)
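The frequency analysis described above can be sketched roughly as follows. This is a minimal illustration, not the author's code: the function name `most_frequent_words`, the tiny inline `STOPWORDS` set (a stand-in for the NLTK English list mentioned in the README), the regex tokenizer, and the sample lines are all assumptions.

```python
import re
from collections import Counter

# Assumption: a tiny stand-in stopword set. The README uses NLTK's English
# stopword list instead, which is much larger.
STOPWORDS = {"i", "you", "the", "a", "an", "and", "to", "it", "is", "of", "that", "in"}


def most_frequent_words(lines, stopwords=STOPWORDS, top_n=10):
    """Count lowercase word tokens across all lines, skipping stopwords."""
    counts = Counter()
    for line in lines:
        # Simple tokenizer (assumption): runs of letters and apostrophes.
        tokens = re.findall(r"[a-z']+", line.lower())
        counts.update(token for token in tokens if token not in stopwords)
    return counts.most_common(top_n)


# Made-up sample lines, used only to exercise the function.
sample_lines = ["Oh my God, it is you!", "Oh, I know that!"]
top_words = most_frequent_words(sample_lines, top_n=3)
print(top_words)
```

Passing the NLTK list as `stopwords` instead of the inline set should give counts comparable to the ones plotted in the README, provided the tokenization is similar.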
76
 
77
 
78
+ # Data preparation
79
+ We collected Rachel's lines and split them into two datasets: replicas (lines spoken in dialogue) and phrases (individual lines). For modelling purposes, we added a set of special tokens and tags to every replica:
80
+
81
+ Special tokens `<s>` and `</s>` that denote the beginning and end of the example.
82
+
83
+ The character's name is written in capital letters.
84
+
85
+ A special pseudonym, NOTFRIEND, which marks the other speaker's replica in dialogue pairs of the form "NOTFRIEND's replica - HERO's response". We use this pseudonym to separate other people's replicas from those of the hero whose style we want to mimic.
86
+
87
+ Using the data with additional tokens, I generated two datasets for Rachel in English. Below is a brief description of each of them:
88
+
89
+ 1. Raw monologues - a dataset containing individual lines of one of the characters. This dataset allows the model to get the most information about the style of a particular character.
90
+
91
+ !["raw monologues"](./images/phrases.png)
92
+
93
+ 2. Raw dialogues - a dataset that contains pairs of the form "NOTFRIEND's cue - HERO's response", separated by a line break character `\n`. The dialogue dataset is needed because we want our model to be able to maintain a Friends-style conversation with the user, not just generate text.
94
+
95
+ !["raw dialogs"](./images/replicas.png)
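The tagging scheme described above could look roughly like the sketch below. The README does not spell out the exact layout of an example, so the `NAME: text` arrangement, the helper names `format_monologue` and `format_dialogue`, and the sample lines are assumptions made for illustration only.

```python
BOS, EOS = "<s>", "</s>"      # special tokens marking the start and end of an example
OTHER_SPEAKER = "NOTFRIEND"   # pseudonym for any speaker other than the hero


def format_monologue(hero, line):
    """Wrap a single hero line: name in capitals, enclosed in <s> ... </s>."""
    return f"{BOS}{hero.upper()}: {line}{EOS}"


def format_dialogue(hero, cue, response):
    """Build a 'NOTFRIEND cue / HERO response' pair separated by a line break."""
    return f"{BOS}{OTHER_SPEAKER}: {cue}\n{hero.upper()}: {response}{EOS}"


# Made-up lines, used only to show the resulting strings.
monologue_example = format_monologue("Rachel", "I got off the plane.")
dialogue_example = format_dialogue("Rachel", "How are you?", "I'm fine!")

print(monologue_example)
print(dialogue_example)
```

Keeping the other speaker behind a single pseudonym means the model only has to learn one hero label per dataset, which matches the stated goal of imitating one character's style.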