Spaces: StKirill/GenerativeChatBot

StKirill committed on Feb 24, 2024

Commit 084ba67 (verified) · 1 Parent(s): 8dbb072

Update README.md

Files changed (1)

README.md +9 -0

@@ -31,6 +31,15 @@ I decided to continue to use TV series, which I settled in the previous project,
 3. This is a popular TV series that many of us have watched and know well. This means I can make assumptions about the data (e.g. Phoebe speaks in simpler words, etc.) and assess the realism of the style of the generated text based on my viewing experience.
+## Data collection
+I scraped the original English-language transcripts from the Internet and then preprocessed the text as follows:
+1. First, I cleaned up minor spelling irregularities specific to the transcripts. For example, when a character drew out a word, the transcript repeated vowels to imitate the long sound ("noooooooooooooooooooooo"). Other phrases repeated question and exclamation marks to convey emotion ("Noooooo!!!!", "WHATooooo????"). Since such "errors" are quite non-standard, I wrote my own set of data cleaning rules and applied it to the transcripts.
+2. Second, I noticed that some lines repeated the same word for comic effect. I removed such repetitions, leaving only one copy of the repeated word.
+3. Third, because I wanted to capture the style unique to each character, I removed common phrases used by all 6 main characters ("You know what!", "Oh my god!", etc.).
+As a result, I collected a corpus of dialogues for the 6 characters containing 14196 sentences. The table below shows the detailed distribution of sentences and words for each character:
 Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
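
The three cleaning steps described in the added "Data collection" section could be sketched roughly as below. This is a minimal illustration, not the author's actual rule set, which the README does not show: the function names, the regular expressions, and the `COMMON_PHRASES` list (seeded only with the two examples the README quotes) are all assumptions.

```python
import re

# Hypothetical phrase list: the README names only these two examples.
COMMON_PHRASES = ["you know what!", "oh my god!"]


def collapse_elongations(text: str) -> str:
    """Step 1: shrink drawn-out letters and repeated '!' / '?' marks."""
    # Three or more of the same letter in a row become a single letter.
    text = re.sub(r"([A-Za-z])\1{2,}", r"\1", text)
    # Runs of the same '!' or '?' become a single mark.
    return re.sub(r"([!?])\1+", r"\1", text)


def collapse_repeated_words(text: str) -> str:
    """Step 2: keep one copy of a word repeated back to back."""
    return re.sub(r"\b(\w+)(?:\s+\1\b)+", r"\1", text, flags=re.IGNORECASE)


def remove_common_phrases(text: str, phrases=COMMON_PHRASES) -> str:
    """Step 3: drop phrases shared by all characters, then tidy whitespace."""
    for phrase in phrases:
        text = re.sub(re.escape(phrase), "", text, flags=re.IGNORECASE)
    return re.sub(r"\s+", " ", text).strip()


def clean_line(text: str) -> str:
    """Apply the three steps in order (assumed order; punctuation is
    collapsed first so that 'Oh my god!!!' still matches 'oh my god!')."""
    text = collapse_elongations(text)
    text = collapse_repeated_words(text)
    return remove_common_phrases(text)
```

Calling `clean_line` on each transcript line before splitting the corpus by character would be one way to use it. A rule this simple also shortens legitimate words with a tripled letter and ignores repetitions separated by commas, so a real rule set would likely need exceptions.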