import streamlit as st

# Custom CSS for better styling
st.markdown("""
    <style>
        .main-title {
            font-size: 36px;
            color: #4A90E2;
            font-weight: bold;
            text-align: center;
        }
        .sub-title {
            font-size: 24px;
            color: #4A90E2;
            margin-top: 20px;
        }
        .section {
            background-color: #f9f9f9;
            padding: 15px;
            border-radius: 10px;
            margin-top: 20px;
        }
        .section h2 {
            font-size: 22px;
            color: #4A90E2;
        }
        .section p, .section ul {
            color: #666666;
        }
        .link {
            color: #4A90E2;
            text-decoration: none;
        }
    </style>
""", unsafe_allow_html=True)

# Introduction
st.markdown('<div class="main-title">Detecting Toxic Comments with Spark NLP</div>', unsafe_allow_html=True)

st.markdown("""
<div class="section">
    <p>Welcome to the Spark NLP Toxic Comment Detection Demo App! Discussing things you care about can be difficult. The threat of abuse and harassment online means that many people stop expressing themselves and give up on seeking different opinions. Platforms struggle to facilitate conversations effectively, leading many communities to limit or completely shut down user comments.</p>
    <p>This app demonstrates how to use Spark NLP's MultiClassifierDL to automatically detect toxic comments, including categories like identity hate, insult, obscene, severe toxic, and threat.</p>
</div>
""", unsafe_allow_html=True)

# st.image('images/Toxic-Comments.jpg', caption="Different types of toxic comments detected using Spark NLP", use_column_width='auto')

# About Toxic Comment Classification
st.markdown('<div class="sub-title">About Toxic Comment Classification</div>', unsafe_allow_html=True)
st.markdown("""
<div class="section">
    <p>The Conversation AI team, a research initiative founded by Jigsaw and Google (both part of Alphabet), is working on tools to help improve online conversations. One area of focus is the study of negative online behaviors, such as toxic comments (comments that are rude, disrespectful, or likely to make someone leave a discussion).</p>
    <p>This app uses the Spark NLP MultiClassifierDL model to detect various types of toxicity in comments, categorizing them into classes such as toxic, severe toxic, identity hate, insult, obscene, and threat.</p>
</div>
""", unsafe_allow_html=True)

# Using MulticlassifierDL in Spark NLP
st.markdown('<div class="sub-title">Using MulticlassifierDL in Spark NLP</div>', unsafe_allow_html=True)
st.markdown("""
<div class="section">
    <p>The MultiClassifierDL annotator in Spark NLP uses deep learning to assign multiple labels to a single text. This multi-label approach allows for a more nuanced understanding of the toxicity in comments, providing better tools for moderating online discussions.</p>
    <p>Spark NLP also offers other annotators and models for different NLP tasks. If you are interested in exploring more, please check the <a class="link" href="https://nlp.johnsnowlabs.com/docs/en/annotators#multiclassifierdl" target="_blank" rel="noopener">MultiClassifierDL</a> documentation.</p>
</div>
""", unsafe_allow_html=True)

st.markdown('<div class="sub-title">Example Usage in Python</div>', unsafe_allow_html=True)
st.markdown('<p>Here’s how you can implement toxic comment classification using the MultiClassifierDL annotator in Spark NLP:</p>', unsafe_allow_html=True)

# Setup Instructions
st.markdown('<div class="sub-title">Setup</div>', unsafe_allow_html=True)
st.markdown('<p>To install Spark NLP in Python, use your favorite package manager (conda, pip, etc.). For example:</p>', unsafe_allow_html=True)
st.code("""
pip install spark-nlp
pip install pyspark
""", language="bash")

st.markdown("<p>Then, import Spark NLP and start a Spark session:</p>", unsafe_allow_html=True)
st.code("""
import sparknlp

# Start Spark Session
spark = sparknlp.start()
""", language='python')

# Toxic Comment Classification Example
st.markdown('<div class="sub-title">Example Usage: Toxic Comment Classification with MulticlassifierDL</div>', unsafe_allow_html=True)
st.code('''
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import UniversalSentenceEncoder, MultiClassifierDLModel
from pyspark.ml import Pipeline

# Step 1: Transform raw text into document annotations
document = DocumentAssembler() \\
    .setInputCol("text") \\
    .setOutputCol("document")

# Step 2: Use the Universal Sentence Encoder for embeddings
use = UniversalSentenceEncoder.pretrained() \\
    .setInputCols(["document"]) \\
    .setOutputCol("use_embeddings")

# Step 3: Multi-label classification model
docClassifier = MultiClassifierDLModel.pretrained("multiclassifierdl_use_toxic") \\
    .setInputCols(["use_embeddings"]) \\
    .setOutputCol("category") \\
    .setThreshold(0.5)

# Define the pipeline
pipeline = Pipeline(
    stages=[
        document,
        use,
        docClassifier
    ]
)

# Create a Spark DataFrame with an example sentence
data = spark.createDataFrame(
    [
        ["She should stop sticking her tongue out before someone rubs their ass on it. Filthy bitch!!!"]
    ]
).toDF("text")  # use the column name `text` expected by the pipeline

# Fit and transform to get predictions
result = pipeline.fit(data).transform(data)
result.select("text", "category.result").show(truncate=50)
''', language='python')

st.text("""
+--------------------------------------------------+------------------------+
|                                              text|                  result|
+--------------------------------------------------+------------------------+
|She should stop sticking her tongue out before ...|[toxic, insult, obscene]|
+--------------------------------------------------+------------------------+
""")

st.markdown("""
<p>The code snippet above sets up a Spark NLP pipeline that classifies toxic comments using the MultiClassifierDL annotator. The resulting DataFrame contains the predicted categories for each comment.</p>
""", unsafe_allow_html=True)

# One-liner Alternative
st.markdown('<div class="sub-title">One-liner Alternative</div>', unsafe_allow_html=True)
st.markdown("""
<div class="section">
    <p>In October 2022, John Snow Labs released the open-source <code>johnsnowlabs</code> library, which bundles all of the company's products, open-source and licensed, under one common library. This simplifies the workflow, especially for users working with more than one of the libraries (e.g., Spark NLP + Healthcare NLP). The new library is a wrapper around all of John Snow Labs' libraries and can be installed with pip:</p>
    <p><code>pip install johnsnowlabs</code></p>
</div>
""", unsafe_allow_html=True)

st.markdown('<p>To run toxic comment classification with one line of code, we can simply:</p>', unsafe_allow_html=True)
st.code("""
# Import the NLP module, which contains the Spark NLP and NLU libraries
from johnsnowlabs import nlp

sample_text = ["You are a horrible person!", "I love your new profile picture!", "Go away, no one likes you."]

# predict() returns a pandas DataFrame with the predictions for each sentence
nlp.load('en.classify.toxic').predict(sample_text, output_level='sentence')
""", language='python')

st.image('images/johnsnowlabs-toxic-output.png', use_column_width='auto')

st.markdown("""
<p>This approach demonstrates how to use the <code>johnsnowlabs</code> library to perform toxic comment classification with a single line of code. The resulting DataFrame contains the predictions for each comment.</p>
""", unsafe_allow_html=True)

# Benchmarking
st.markdown('<div class="sub-title">Benchmarking</div>', unsafe_allow_html=True)
st.markdown("""
<div class="section">
    <p>Here are the benchmarking results for the MultiClassifierDL model on the toxic comment classification task:</p>
    <pre>
                  precision    recall  f1-score   support

            0       0.56      0.30      0.39       127
            1       0.71      0.70      0.70       761
            2       0.76      0.72      0.74       824
            3       0.55      0.21      0.31       147
            4       0.79      0.38      0.51        50
            5       0.94      1.00      0.97      1504

    micro avg       0.83      0.80      0.81      3413
    macro avg       0.72      0.55      0.60      3413
 weighted avg       0.81      0.80      0.80      3413
  samples avg       0.84      0.83      0.80      3413

    F1 micro averaging: 0.8113432835820896
    </pre>
</div>
""", unsafe_allow_html=True)

# Additional Resources
st.markdown('<div class="sub-title">Additional Resources</div>', unsafe_allow_html=True)
st.markdown("""
<div class="section">
    <ul>
        <li>Python Docs: <a class="link" href="https://nlp.johnsnowlabs.com/docs/en/annotators#multiclassifierdl" target="_blank" rel="noopener">MultiClassifierDL</a></li>
        <li>Model used: <a class="link" href="https://sparknlp.org/2021/01/21/multiclassifierdl_use_toxic_en.html" target="_blank" rel="noopener">multiclassifierdl_use_toxic</a></li>
        <li><a class="link" href="https://sparknlp.org/" target="_blank">Official Website</a>: Documentation and examples</li>
    </ul>
</div>
""", unsafe_allow_html=True)

st.markdown('<div class="sub-title">Community & Support</div>', unsafe_allow_html=True)
st.markdown("""
<div class="section">
    <ul>
        <li><a class="link" href="https://join.slack.com/t/spark-nlp/shared_invite/zt-198dipu77-L3UWNe_AJ8xqDk0ivmih5Q" target="_blank">Slack</a>: Live discussion with the community and team</li>
        <li><a class="link" href="https://github.com/JohnSnowLabs/spark-nlp" target="_blank">GitHub</a>: Bug reports, feature requests, and contributions</li>
        <li><a class="link" href="https://medium.com/spark-nlp" target="_blank">Medium</a>: Spark NLP articles</li>
        <li><a class="link" href="https://www.youtube.com/channel/UCmFOjlpYEhxf_wJUDuz6xxQ/videos" target="_blank">YouTube</a>: Video tutorials</li>
    </ul>
</div>
""", unsafe_allow_html=True)