Update README.md
README.md
@@ -31,6 +31,15 @@ I decided to continue to use TV series, which I settled in the previuos project,
3. This is a popular TV series that many of us have watched and know well. This means I can make assumptions about the data (e.g. Phoebe speaks in simpler words, etc.) and assess the realism of the style of the generated text based on my viewing experience.
## Data collection
I scraped the original English-language transcripts from the Internet and then preprocessed the text as follows:
1. First, I cleaned up minor orthographic irregularities specific to the transcripts. For example, when a character draws out a word, the transcript may repeat a vowel to imitate the long sound ("noooooooooooooooooooooo"). Other lines repeat question and exclamation marks to convey emotion in writing ("Noooooo!!!!", "WHATooooo????"). Since such "errors" are quite non-standard, I wrote my own set of cleaning rules and applied it to the transcripts.
2. Second, I noticed that some lines repeated the same word several times for comic effect. I removed these repetitions as well, leaving only one copy of the repeated word.
3. Third, because I wanted to capture the style unique to each character, I removed common phrases used by all 6 main characters ("You know what!", "Oh my god!", etc.).
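The three steps above could be sketched roughly as follows. This is only an illustration and not the actual rule set used for the corpus: the function names, the regular expressions, and the contents of `COMMON_PHRASES` are all assumptions (the text names only two example phrases).

```python
import re

# Hypothetical phrase list: the README gives only these two examples.
COMMON_PHRASES = {"you know what!", "oh my god!"}


def collapse_elongation(text: str) -> str:
    """Step 1: shrink drawn-out letters and repeated ! or ? marks."""
    # Runs of three or more identical letters become one letter.
    # Legitimate double letters ("cool", "really") are left alone.
    text = re.sub(r"([A-Za-z])\1{2,}", r"\1", text)
    # Runs of the same punctuation mark become a single mark.
    text = re.sub(r"([!?])\1+", r"\1", text)
    return text


def collapse_repeated_words(text: str) -> str:
    """Step 2: keep one copy of a word repeated back to back."""
    return re.sub(r"\b(\w+)(?:\s+\1\b)+", r"\1", text, flags=re.IGNORECASE)


def clean_lines(lines):
    """Apply steps 1 and 2, then drop lines that are common phrases (step 3)."""
    cleaned = []
    for line in lines:
        line = collapse_repeated_words(collapse_elongation(line)).strip()
        if line and line.lower() not in COMMON_PHRASES:
            cleaned.append(line)
    return cleaned
```

For instance, `clean_lines(["Noooooo!!!!", "Oh my god!", "I really really like it"])` returns `["No!", "I really like it"]`. The sketch has known gaps: a stretched double letter such as "coooool" collapses to "col", and a suffix like the one in "WHATooooo" leaves a stray "o", so a real rule set would need extra cases for these.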

The resulting corpus of dialogue for the 6 characters contains 14,196 sentences. The table below shows the detailed distribution of sentences and words for each character:
Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference