CIS5190abcd
/

svm

yitingliii committed on Dec 13, 2024

Commit

cfabd2f

1 Parent(s): 6e11957

Create data_cleaning.py

Files changed (1)

data_cleaning.py ADDED

+```python
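+import re
+
+from bs4 import BeautifulSoup
+from nltk.corpus import stopwords
+from nltk.stem import WordNetLemmatizer
+
+
+def clean(df):
+    """Normalize the 'title' column in place and drop duplicate headlines."""
+    # Requires the NLTK 'stopwords' and 'wordnet' corpora to be downloaded.
+    stop_words = set(stopwords.words('english'))
+    lemmatizer = WordNetLemmatizer()
+    cleaned_headlines = []
+    for headline in df['title']:
+        # Strip HTML markup, keeping only the visible text.
+        headline = BeautifulSoup(headline, 'html.parser').get_text()
+        # Drop punctuation and symbols, collapse whitespace, then lowercase.
+        headline = re.sub(r'[^a-zA-Z0-9\s]', '', headline)
+        headline = re.sub(r'\s+', ' ', headline).strip()
+        headline = headline.lower()
+        # Remove stop words and lemmatize the remaining tokens.
+        words = headline.split()
+        words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
+        cleaned_headline = ' '.join(words)
+        cleaned_headlines.append(cleaned_headline)
+    df['title'] = cleaned_headlines
+    # Headlines that became identical after cleaning are dropped here.
+    df.drop_duplicates(subset=['title'], inplace=True)
+    return df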
+```
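
To see what the regex, lowercasing, and stop-word steps in `clean` do to a single headline, here is a minimal standard-library sketch. It is not part of the commit: `normalize_headline` and `SAMPLE_STOP_WORDS` are hypothetical names introduced for illustration, the tiny stop-word set stands in for NLTK's English list, and the HTML-stripping and lemmatization steps are left out so the example runs without extra dependencies.

```python
import re

# Hypothetical stand-in for NLTK's English stop-word list (assumption for illustration).
SAMPLE_STOP_WORDS = {"the", "a", "is", "in", "of"}


def normalize_headline(text, stop_words=SAMPLE_STOP_WORDS):
    """Apply the regex, lowercase, and stop-word steps from clean() to one string."""
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)  # delete punctuation and symbols
    text = re.sub(r'\s+', ' ', text).strip()    # collapse runs of whitespace
    text = text.lower()
    # Keep only tokens that are not in the stop-word set.
    return ' '.join(word for word in text.split() if word not in stop_words)


print(normalize_headline("The  Fed's rate-cut is   HERE!"))  # feds ratecut here
```

Because punctuation is deleted rather than replaced with a space, hyphenated and possessive forms fuse into single tokens (`rate-cut` becomes `ratecut`, `Fed's` becomes `feds`). Whether that is desirable depends on the downstream vectorizer.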