StKirill committed · Commit 084ba67 · verified · 1 Parent(s): 8dbb072

Update README.md

Files changed (1)
  1. README.md +9 -0
README.md CHANGED
@@ -31,6 +31,15 @@ I decided to continue to use TV series, which I settled in the previous project,
  3. This is a popular TV series that many of us have watched and know well. This means I can make assumptions about the data (e.g. Phoebe speaks in simpler words, etc.) and assess the realism of the style of the generated text based on my viewing experience.
 
+ ## Data collection
+ The original transcripts, which I scraped from the Internet, were in English. I then preprocessed the text as follows:
 
+ 1. First, I cleaned the data of minor graphemic irregularities specific to the transcripts. For example, when a character drew out a word, the transcript might repeat a vowel to imitate the long sound ("noooooooooooooooooooooo"). Other phrases repeated question and exclamation marks to convey emotion in writing ("Noooooo!!!!", "WHATooooo????"). Since such "errors" are quite non-standard, I wrote my own set of data cleaning rules and applied it to the transcripts.
+
+ 2. Second, I noticed that some utterances repeated the same word several times for comic effect. I removed these repetitions as well, keeping only one copy of the repeated word.
+
+ 3. Third, because I wanted to capture the style unique to each character, I removed common phrases used by all 6 main characters ("You know what!", "Oh my god!", etc.).
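
The three cleaning steps added in this hunk could be sketched roughly as below. This is only an illustration under stated assumptions, not the author's rule set, which the README does not show. The function names, the regular expressions, and the two-entry `COMMON_PHRASES` list (taken from the examples quoted in item 3) are all hypothetical.

```python
import re

# Hypothetical phrase list: only the two examples quoted in the README.
# The real list of phrases shared by all six characters is not shown there.
COMMON_PHRASES = ["You know what!", "Oh my god!"]


def collapse_elongation(text: str) -> str:
    """Step 1 (assumed rule): shrink drawn-out letters and punctuation.

    Runs of three or more identical letters become one letter, and runs of
    two or more identical '!' or '?' become one mark. Caveat: a stretched
    word with a genuine double letter ("cooool") would be over-shortened.
    """
    text = re.sub(r"([A-Za-z])\1{2,}", r"\1", text)
    text = re.sub(r"([!?])\1+", r"\1", text)
    return text


def dedupe_words(text: str) -> str:
    """Step 2 (assumed rule): keep one copy of a consecutively repeated word.

    Caveat: legitimate repeats such as "had had" are collapsed too.
    """
    return re.sub(r"\b(\w+)(?:[\s,]+\1\b)+", r"\1", text, flags=re.IGNORECASE)


def drop_common_phrases(text: str, phrases=COMMON_PHRASES) -> str:
    """Step 3 (assumed rule): delete catchphrases shared by all characters."""
    for phrase in phrases:
        text = re.sub(re.escape(phrase), "", text, flags=re.IGNORECASE)
    # Tidy up whitespace left behind by the deletions.
    return re.sub(r"\s{2,}", " ", text).strip()


def clean(text: str) -> str:
    """Apply the three steps in the order the README lists them."""
    return drop_common_phrases(dedupe_words(collapse_elongation(text)))
```

For instance, `clean("Oh my god! Noooooo!!!! no no way")` yields `"No! no way"` under these rules. Mixed-case cases such as "WHATooooo????" would need an extra rule, since this sketch only collapses runs of the same character.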
+
+ Thus, I collected a corpus of dialogues for the 6 characters comprising 14196 sentences. The table below shows the detailed distribution of sentences and words for each character:
 
  Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference