import streamlit as st

# Custom CSS for better styling
st.markdown("""

    <style>

        .main-title {

            font-size: 36px;

            color: #4A90E2;

            font-weight: bold;

            text-align: center;

        }

        .sub-title {

            font-size: 24px;

            color: #4A90E2;

            margin-top: 20px;

        }

        .section {

            background-color: #f9f9f9;

            padding: 15px;

            border-radius: 10px;

            margin-top: 20px;

        }

        .section h2 {

            font-size: 22px;

            color: #4A90E2;

        }

        .section p, .section ul {

            color: #666666;

        }

        .link {

            color: #4A90E2;

            text-decoration: none;

        }

        .sidebar-content {

            font-size: 16px;

        }

    </style>

""", unsafe_allow_html=True)

# Introduction
st.markdown('<div class="main-title">Correcting Typos and Spelling Errors with Spark NLP and Python</div>', unsafe_allow_html=True)

st.markdown("""

<div class="section">

    <p>Correcting typos and spelling errors is an essential task in NLP pipelines. Ensuring data correctness can significantly improve the performance of machine learning models. In this article, we will explore how to perform spell checking using rule-based and machine learning-based models in Spark NLP with Python.</p>

</div>

""", unsafe_allow_html=True)

# Background
st.markdown('<div class="sub-title">Background</div>', unsafe_allow_html=True)
st.markdown("""

<div class="section">

    <p>Spell checking identifies words in texts that have spelling errors or are misspelled. Text data from social media or extracted using Optical Character Recognition (OCR) often contains typos, misspellings, or spurious symbols that can impact machine learning models.</p>

    <p>Having spelling errors in data can reduce model performance. For example, if "John" appears as "J0hn", the model treats them as two separate words, complicating the model and reducing its effectiveness. Spell checking and correction can preprocess data to improve model training.</p>

</div>

""", unsafe_allow_html=True)

# Spell Checking in Spark NLP
st.markdown('<div class="sub-title">Spell Checking in Spark NLP</div>', unsafe_allow_html=True)
st.markdown("""

<div class="section">

    <p>Spark NLP provides three approaches for spell checking and correction:</p>

    <ul>

        <li><strong>NorvigSweetingAnnotator:</strong> Based on Peter Norvig’s algorithm with modifications like limiting vowel swapping and using Hamming distance.</li>

        <li><strong>SymmetricDeleteAnnotator:</strong> Based on the SymSpell algorithm.</li>

        <li><strong>ContextSpellCheckerAnnotator:</strong> A deep learning model using contextual information for error detection and correction.</li>

    </ul>

</div>

""", unsafe_allow_html=True)

# Example Code
st.markdown('<div class="sub-title">Example Code</div>', unsafe_allow_html=True)
st.markdown('<p>Here is an example of how to use these models in Spark NLP:</p>', unsafe_allow_html=True)

# Step-by-step code
st.markdown('<div class="sub-title">Setup</div>', unsafe_allow_html=True)
st.markdown('<p>To install Spark NLP in Python, use your favorite package manager (conda, pip, etc.). For example:</p>', unsafe_allow_html=True)
st.code("""
pip install spark-nlp
pip install pyspark
""", language="bash")

st.markdown('<p>Then, import Spark NLP and start a Spark session:</p>', unsafe_allow_html=True)
st.code("""
import sparknlp

# Start Spark Session
spark = sparknlp.start()
""", language='python')

# Step 1: Document Assembler
st.markdown('<div class="sub-title">Step 1: Document Assembler</div>', unsafe_allow_html=True)
st.markdown('<p>Transform raw texts into document annotations:</p>', unsafe_allow_html=True)
st.code("""
from sparknlp.base import DocumentAssembler

documentAssembler = DocumentAssembler()\\
    .setInputCol("text")\\
    .setOutputCol("document")
""", language='python')

# Step 2: Tokenization
st.markdown('<div class="sub-title">Step 2: Tokenization</div>', unsafe_allow_html=True)
st.markdown('<p>Split text into individual tokens:</p>', unsafe_allow_html=True)
st.code("""
from sparknlp.annotator import Tokenizer

tokenizer = Tokenizer()\\
    .setInputCols(["document"])\\
    .setOutputCol("token")
""", language='python')

# Step 3: Spell Checker Models
st.markdown('<div class="sub-title">Step 3: Spell Checker Models</div>', unsafe_allow_html=True)
st.markdown('<p>Choose and load one of the spell checker models:</p>', unsafe_allow_html=True)

st.code("""
from sparknlp.annotator import ContextSpellCheckerModel, NorvigSweetingModel, SymmetricDeleteModel

# One of the spell checker annotators
symspell = SymmetricDeleteModel.pretrained("spellcheck_sd")\\
    .setInputCols(["token"])\\
    .setOutputCol("symspell")

norvig = NorvigSweetingModel.pretrained("spellcheck_norvig")\\
    .setInputCols(["token"])\\
    .setOutputCol("norvig")

context = ContextSpellCheckerModel.pretrained("spellcheck_dl")\\
    .setInputCols(["token"])\\
    .setOutputCol("context")
""", language='python')

# Step 4: Pipeline Definition
st.markdown('<div class="sub-title">Step 4: Pipeline Definition</div>', unsafe_allow_html=True)
st.markdown('<p>Define the pipeline stages:</p>', unsafe_allow_html=True)
st.code("""
from pyspark.ml import Pipeline

# Define the pipeline stages
pipeline = Pipeline().setStages([documentAssembler, tokenizer, symspell, norvig, context])
""", language='python')

# Step 5: Fitting and Transforming
st.markdown('<div class="sub-title">Step 5: Fitting and Transforming</div>', unsafe_allow_html=True)
st.markdown('<p>Fit the pipeline and transform the data:</p>', unsafe_allow_html=True)
st.code("""
# Create an empty DataFrame to fit the pipeline
empty_df = spark.createDataFrame([[""]]).toDF("text")
pipelineModel = pipeline.fit(empty_df)

# Example text for correction
example_df = spark.createDataFrame([["Plaese alliow me tao introdduce myhelf, I am a man of wealth und tiaste"]]).toDF("text")
result = pipelineModel.transform(example_df)
""", language='python')

# Step 6: Displaying Results
st.markdown('<div class="sub-title">Step 6: Displaying Results</div>', unsafe_allow_html=True)
st.markdown('<p>Show the results from the different spell checker models:</p>', unsafe_allow_html=True)
st.code("""
# Show results
result.selectExpr("norvig.result as norvig", "symspell.result as symspell", "context.result as context").show(truncate=False)
""", language='python')

st.markdown("""

<p>The output from the example code will show the corrected text using three different models:</p>

<table>

  <tr>

    <th>norvig</th>

    <th>symspell</th>

    <th>context</th>

  </tr>

  <tr>

    <td>[Please, allow, me, tao, introduce, myself, ,, I, am, a, man, of, wealth, und, taste]</td>

    <td>[Place, allow, me, to, introduce, myself, ,, I, am, a, man, of, wealth, und, taste]</td>

    <td>[Please, allow, me, to, introduce, myself, ,, I, am, a, man, of, wealth, and, taste]</td>

  </tr>

</table>

""", unsafe_allow_html=True)

# One-liner Alternative
st.markdown('<div class="sub-title">One-liner Alternative</div>', unsafe_allow_html=True)
st.markdown("""

<div class="section">

    <p>Introducing the <code>johnsnowlabs</code> library: In October 2022, John Snow Labs released a unified open-source library containing all their products under one roof. This includes Spark NLP, Spark NLP Display, and NLU. Simplify your workflow with:</p>

    <p><code>pip install johnsnowlabs</code></p>

    <p>For spell checking, use one line of code:</p>

    <pre>

    <code class="language-python">
# Import the NLP module which contains Spark NLP and NLU libraries
from johnsnowlabs import nlp

# Use Norvig model
nlp.load("en.spell.norvig").predict("Plaese alliow me tao introdduce myhelf, I am a man of wealth und tiaste", output_level='token')
    </code>

    </pre>

</div>

""", unsafe_allow_html=True)

st.image('images/johnsnowlabs-output.png', use_column_width='auto')

# Conclusion
st.markdown("""

<div class="section">

    <h2>Conclusion</h2>

    <p>We introduced three models for spell checking and correction in Spark NLP: NorvigSweeting, SymmetricDelete, and ContextSpellChecker. These models can be integrated into Spark NLP pipelines for efficient processing of large datasets.</p>

</div>

""", unsafe_allow_html=True)

# References
st.markdown('<div class="sub-title">References</div>', unsafe_allow_html=True)
st.markdown("""

<div class="section">

    <ul>

        <li><a class="link" href="https://nlp.johnsnowlabs.com/docs/en/annotators#norvigsweeting-spellchecker" target="_blank" rel="noopener">NorvigSweeting</a> documentation page</li>

        <li><a class="link" href="https://nlp.johnsnowlabs.com/docs/en/annotators#symmetricdelete-spellchecker" target="_blank" rel="noopener">SymmetricDelete</a> documentation page</li>

        <li><a class="link" href="https://nlp.johnsnowlabs.com/docs/en/annotators#contextspellchecker" target="_blank" rel="noopener">ContextSpellChecker</a> documentation page</li>

        <li><a class="link" href="https://medium.com/spark-nlp/applying-context-aware-spell-checking-in-spark-nlp-3c29c46963bc" target="_blank" rel="noopener nofollow">Applying Context Aware Spell Checking in Spark NLP</a></li>

        <li><a class="link" href="https://towardsdatascience.com/training-a-contextual-spell-checker-for-italian-language-66dda528e4bf" target="_blank" rel="noopener nofollow">Training a Contextual Spell Checker for Italian Language</a></li>

    </ul>

</div>

""", unsafe_allow_html=True)

st.markdown('<div class="sub-title">Community & Support</div>', unsafe_allow_html=True)
st.markdown("""

<div class="section">

    <ul>

        <li><a class="link" href="https://sparknlp.org/" target="_blank">Official Website</a>: Documentation and examples</li>

        <li><a class="link" href="https://join.slack.com/t/spark-nlp/shared_invite/zt-198dipu77-L3UWNe_AJ8xqDk0ivmih5Q" target="_blank">Slack</a>: Live discussion with the community and team</li>

        <li><a class="link" href="https://github.com/JohnSnowLabs/spark-nlp" target="_blank">GitHub</a>: Bug reports, feature requests, and contributions</li>

        <li><a class="link" href="https://medium.com/spark-nlp" target="_blank">Medium</a>: Spark NLP articles</li>

        <li><a class="link" href="https://www.youtube.com/channel/UCmFOjlpYEhxf_wJUDuz6xxQ/videos" target="_blank">YouTube</a>: Video tutorials</li>

    </ul>

</div>

""", unsafe_allow_html=True)