Spaces: StKirill/GenerativeChatBot

StKirill committed on Mar 2, 2024

Commit cd153d6 (verified) · 1 parent: 4b0d648

Update README.md

README.md +20 -0

@@ -71,5 +71,25 @@ After this initial analysis of speech, I investigated the vocabulary in more det
 ## Most frequent Rachel words
 I analyzed Rachel's most frequent words, excluding the English stopwords from NLTK (`nltk.corpus.stopwords.words("english")`). The result of this analysis is shown below.
 !["Most frequent Rachel's words"](./images/RachelMostFrequentWords.png)
+The same data can also be shown as an image:
+!["Most frequent Rachel's words in image"](./images/RachelWords.png)
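
The frequency analysis described above can be sketched roughly as follows. This is a minimal illustration, not the code used in this project: the function name `most_frequent_words`, the tokenization regex, the sample lines, and the tiny stand-in stopword set are all assumptions. With NLTK installed and its stopword corpus downloaded, the stopword set could instead be `set(nltk.corpus.stopwords.words("english"))`.

```python
import re
from collections import Counter


def most_frequent_words(lines, stopwords, top_n=10):
    """Return the top_n (word, count) pairs across lines, skipping stopwords."""
    counts = Counter()
    for line in lines:
        # Lowercase and keep alphabetic tokens; apostrophes are allowed ("don't").
        for word in re.findall(r"[a-z']+", line.lower()):
            word = word.strip("'")
            if word and word not in stopwords:
                counts[word] += 1
    return counts.most_common(top_n)


# Hypothetical sample lines and a small stand-in stopword set (assumptions).
sample_lines = ["Oh my God, I know!", "I know, I know.", "Oh, that is so great."]
sample_stopwords = {"i", "my", "that", "is", "so"}

top_words = most_frequent_words(sample_lines, sample_stopwords, top_n=3)
print(top_words)
```

Passing the stopword set in as a parameter keeps the sketch independent of any particular stopword list, so the same function works with the NLTK list or a custom one.
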
+# Data preparation
+We collected Rachel's lines and split them into two datasets: replicas (dialogue pairs) and phrases (standalone lines). For modelling purposes, every example was given an additional set of tokens and tags:
+- Special tokens `<s>` and `</s>` that mark the beginning and end of each example.
+- The character's name, written in capital letters.
+- A special pseudonym, `NOTFRIEND`, that marks the other speaker's line in dialogue pairs of the form "NOTFRIEND's line - HERO's response". The pseudonym separates other people's lines from those of the hero whose style we want to mimic.
+Using the data with additional tokens, I generated two datasets for Rachel in English. Below is a brief description of each of them:
+1. Raw monologues - a dataset containing individual lines of one of the characters. This dataset allows the model to get the most information about the style of a particular character.
+!["raw monologues"](./images/phrases.png)
+2. Raw dialogues - a dataset that contains "NOTFRIEND's cue - HERO's response" pairs, with the cue and the response separated by a newline character (`\n`). The dialogue dataset is needed because we want the model to maintain a Friends-style conversation with the user, not just generate text.
+!["raw dialogs"](./images/replicas.png)
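
As a rough illustration of the data preparation described above, the sketch below builds both datasets from an ordered list of `(speaker, line)` tuples. The README does not spell out the exact layout of an example, so the `NAME: text` form with a colon separator, the helper names (`format_monologue`, `format_dialogue`, `build_datasets`), and the mini-script are assumptions made for illustration only. The tokens `<s>`, `</s>`, the capitalized hero name, the `NOTFRIEND` pseudonym, and the `\n` separator come from the description.

```python
BOS, EOS = "<s>", "</s>"   # start/end-of-example tokens named in the README
OTHER = "NOTFRIEND"        # pseudonym for any speaker other than the hero


def format_monologue(hero, line):
    """Wrap one line of the hero as a standalone example (assumed layout)."""
    return f"{BOS}{hero.upper()}: {line}{EOS}"


def format_dialogue(hero, cue, response):
    """Wrap a (cue, response) pair, joined by a newline (assumed layout)."""
    return f"{BOS}{OTHER}: {cue}\n{hero.upper()}: {response}{EOS}"


def build_datasets(script, hero):
    """Split an ordered list of (speaker, line) tuples into two datasets."""
    hero_name = hero.upper()
    monologues, dialogues = [], []
    previous = None
    for speaker, line in script:
        if speaker.upper() == hero_name:
            monologues.append(format_monologue(hero, line))
            # A dialogue pair needs the preceding line to come from someone else.
            if previous is not None and previous[0].upper() != hero_name:
                dialogues.append(format_dialogue(hero, previous[1], line))
        previous = (speaker, line)
    return monologues, dialogues


# Hypothetical mini-script, for illustration only.
script = [
    ("Monica", "How was the date?"),
    ("Rachel", "Oh, it was fine."),
    ("Rachel", "Can we not talk about it?"),
    ("Ross", "Hi."),
    ("Rachel", "Hey!"),
]
monologues, dialogues = build_datasets(script, "Rachel")
print(monologues[0])
print(dialogues[0])
```

In this sketch every hero line becomes a monologue example, while only hero lines that directly follow another speaker become dialogue pairs, so the two datasets overlap in content but differ in context.
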