import os
# the PyPI package name for sklearn is 'scikit-learn'
os.system('pip install nltk')
os.system('pip install scikit-learn')
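# note: on Hugging Face Spaces, dependencies are normally declared in a
# requirements.txt instead of being installed with os.system() at runtime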
import streamlit as st
import pandas as pd
import numpy as np
import pickle
from PIL import Image
# preprocessing
import re
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
# fetch the NLTK data used below: 'punkt' for word_tokenize, 'stopwords' for
# the stop word list, and 'wordnet' for the lemmatizer
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
from sklearn.feature_extraction.text import CountVectorizer
# creating page sections
site_header = st.container()
business_context = st.container()
data_desc = st.container()
performance = st.container()
tweet_input = st.container()
model_results = st.container()
sentiment_analysis = st.container()
contact = st.container()
with site_header:
    st.title('Twitter Hate Speech Detection')
with business_context:
    st.header('The Problem of Content Moderation')
    st.write("""
**Human content moderation exploits people by consistently traumatizing and underpaying them.** In 2019, an [article](https://www.theverge.com/2019/6/19/18681845/facebook-moderator-interviews-video-trauma-ptsd-cognizant-tampa) on The Verge exposed the extensive list of horrific working conditions that employees faced at Cognizant, which was Facebook’s primary moderation contractor. Unfortunately, **every major tech company**, including **Twitter**, uses human moderators to some extent, both domestically and overseas.

Hate speech is defined as **abusive or threatening speech that expresses prejudice against a particular group, especially on the basis of race, religion or sexual orientation.** Usually, the difference between hate speech and offensive language comes down to subtle context or diction.
""")
with data_desc:
    understanding, venn = st.columns(2)
    with understanding:
        st.text('')
        st.write("""
The **data** for this project was sourced from a Cornell University [study](https://github.com/t-davidson/hate-speech-and-offensive-language) titled *Automated Hate Speech Detection and the Problem of Offensive Language*.

The `.csv` file has **24,802 rows**, of which **6% of the tweets were labeled as "Hate Speech".**

Each tweet's label was crowdsourced, with the final class determined by majority vote.
""")
with tweet_input:
    st.header('Is Your Tweet Considered Hate Speech?')
    st.write("""*Please note that this prediction is based on how the model was trained, so it may not be an accurate representation.*""")
    # user input here
    user_text = st.text_input('Enter Tweet', max_chars=280) # setting input as user_text
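# Streamlit reruns this script from top to bottom on every widget interaction,
# so the prediction block below executes each time the text input changes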
with model_results:
    st.subheader('Prediction:')
    if user_text:
        # processing user_text:
        # removing punctuation
        user_text = re.sub('[%s]' % re.escape(string.punctuation), '', user_text)
        # tokenizing
        stop_words = set(stopwords.words('english'))
        tokens = nltk.word_tokenize(user_text)
        # removing stop words
        stopwords_removed = [token.lower() for token in tokens if token.lower() not in stop_words]
        # lemmatizing down to the root word
        lemmatizer = WordNetLemmatizer()
        lemmatized_output = []
        for word in stopwords_removed:
            lemmatized_output.append(lemmatizer.lemmatize(word))
        # instantiating the count vectorizer
        count = CountVectorizer(stop_words=stop_words)
        # NOTE: '\U' in a normal string literal is an invalid escape, so these
        # Windows paths must be raw strings; they will also only resolve on a
        # machine where the pickled files actually live
        X_train = pickle.load(open(r"C:\Users\User\Downloads\X_train", 'rb'))
        # joining the tokens back into one document so transform() sees a single
        # sample rather than one sample per token
        X_test = [' '.join(lemmatized_output)]
        X_train_count = count.fit_transform(X_train)
        X_test_count = count.transform(X_test)
        # loading in model
        final_model = pickle.load(open(r"C:\Users\User\Downloads\bayes", 'rb'))
        # applying the model to the single vectorized document
        prediction = final_model.predict(X_test_count)[0]
        if prediction == 0:
            st.subheader('**Not Hate Speech**')
        else:
            st.subheader('**Hate Speech**')
        st.text('')
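# A leaner alternative (sketch only, not wired in): refitting the vectorizer on
# the raw training data at every rerun is expensive. If the CountVectorizer were
# fitted once offline and pickled next to the model, the app could load both
# artifacts and skip fit_transform entirely. The 'vectorizer' file name below is
# hypothetical.
#
#   vectorizer = pickle.load(open(r"C:\Users\User\Downloads\vectorizer", 'rb'))
#   X_test_count = vectorizer.transform([' '.join(lemmatized_output)])
#   prediction = final_model.predict(X_test_count)[0]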