import streamlit as st

# Introduction
st.markdown('
Correcting typos and spelling errors is an essential task in NLP pipelines. Ensuring data correctness can significantly improve the performance of machine learning models. In this article, we will explore how to perform spell checking using rule-based and machine learning-based models in Spark NLP with Python.
Spell checking identifies misspelled words in a text. Text data from social media or extracted using Optical Character Recognition (OCR) often contains typos, misspellings, or spurious symbols that can degrade machine learning models.
Spelling errors in data reduce model performance. For example, if "John" sometimes appears as "J0hn", the model treats them as two separate words, fragmenting the vocabulary and weakening the statistics for both. Spell checking and correction during preprocessing avoids this and improves model training.
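To see concretely how a typo fragments a word's statistics, here is a minimal sketch using a toy token list (the corpus below is illustrative, not from the article's example):

```python
from collections import Counter

# Toy corpus: the same name appears correctly and with an OCR-style typo.
tokens = ["John", "went", "home", "J0hn", "came", "back", "John", "left"]

counts = Counter(tokens)

# The typo splits one word's counts across two vocabulary entries:
print(counts["John"])  # 2
print(counts["J0hn"])  # 1
```

After spell correction, all three occurrences would count toward a single vocabulary entry.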
Spark NLP provides three annotators for spell checking and correction:

- `NorvigSweetingModel`: a rule-based checker based on Peter Norvig's algorithm, which generates single-edit candidates and ranks them by word frequency.
- `SymmetricDeleteModel`: based on the SymSpell algorithm, which matches words through precomputed delete-only variants for fast lookup.
- `ContextSpellCheckerModel`: a deep-learning checker that uses the context surrounding each token to choose corrections.
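The core idea behind the rule-based NorvigSweeting approach can be sketched in plain Python. The pretrained model ranks candidates by corpus frequency; the toy vocabulary and the alphabetical tie-break below are illustrative stand-ins, not the model's actual dictionary or ranking:

```python
import string

def edits1(word):
    """All strings one edit away: deletes, transposes, replaces, inserts."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

# Toy dictionary; the real model uses a large frequency list.
vocab = {"please", "allow", "taste", "introduce"}

def correct(word):
    # Keep known words; otherwise pick a known one-edit candidate
    # (alphabetical tie-break stands in for frequency ranking).
    if word in vocab:
        return word
    candidates = edits1(word) & vocab
    return min(candidates) if candidates else word

print(correct("plaese"))  # -> "please" (transposing 'ae' yields a known word)
```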
Here is an example of how to use these models in Spark NLP:
', unsafe_allow_html=True) # Step-by-step code st.markdown('To install Spark NLP in Python, use your favorite package manager (conda, pip, etc.). For example:
', unsafe_allow_html=True) st.code(""" pip install spark-nlp pip install pyspark """, language="bash") st.markdown('Then, import Spark NLP and start a Spark session:
', unsafe_allow_html=True) st.code(""" import sparknlp # Start Spark Session spark = sparknlp.start() """, language='python') # Step 1: Document Assembler st.markdown('Transform raw text into document annotations:
', unsafe_allow_html=True) st.code(""" from sparknlp.base import DocumentAssembler documentAssembler = DocumentAssembler()\\ .setInputCol("text")\\ .setOutputCol("document") """, language='python') # Step 2: Tokenization st.markdown('Split text into individual tokens:
', unsafe_allow_html=True) st.code(""" from sparknlp.annotator import Tokenizer tokenizer = Tokenizer()\\ .setInputCols(["document"])\\ .setOutputCol("token") """, language='python') # Step 3: Spell Checker Models st.markdown('Choose and load one of the spell checker models:
', unsafe_allow_html=True) st.code(""" from sparknlp.annotator import ContextSpellCheckerModel, NorvigSweetingModel, SymmetricDeleteModel # One of the spell checker annotators symspell = SymmetricDeleteModel.pretrained("spellcheck_sd")\\ .setInputCols(["token"])\\ .setOutputCol("symspell") norvig = NorvigSweetingModel.pretrained("spellcheck_norvig")\\ .setInputCols(["token"])\\ .setOutputCol("norvig") context = ContextSpellCheckerModel.pretrained("spellcheck_dl")\\ .setInputCols(["token"])\\ .setOutputCol("context") """, language='python') # Step 4: Pipeline Definition st.markdown('Define the pipeline stages:
', unsafe_allow_html=True) st.code(""" from pyspark.ml import Pipeline # Define the pipeline stages pipeline = Pipeline().setStages([documentAssembler, tokenizer, symspell, norvig, context]) """, language='python') # Step 5: Fitting and Transforming st.markdown('Fit the pipeline and transform the data:
', unsafe_allow_html=True) st.code(""" # Create an empty DataFrame to fit the pipeline empty_df = spark.createDataFrame([[""]]).toDF("text") pipelineModel = pipeline.fit(empty_df) # Example text for correction example_df = spark.createDataFrame([["Plaese alliow me tao introdduce myhelf, I am a man of wealth und tiaste"]]).toDF("text") result = pipelineModel.transform(example_df) """, language='python') # Step 6: Displaying Results st.markdown('Show the results from the different spell checker models:
', unsafe_allow_html=True) st.code(""" # Show results result.selectExpr("norvig.result as norvig", "symspell.result as symspell", "context.result as context").show(truncate=False) """, language='python') st.markdown("""The output from the example code will show the corrected text using three different models:
| norvig | symspell | context |
|---|---|---|
| [Please, allow, me, tao, introduce, myself, ,, I, am, a, man, of, wealth, und, taste] | [Place, allow, me, to, introduce, myself, ,, I, am, a, man, of, wealth, und, taste] | [Please, allow, me, to, introduce, myself, ,, I, am, a, man, of, wealth, and, taste] |
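The symspell column differs from norvig on some tokens because the SymmetricDelete (SymSpell) approach matches words through precomputed delete-only variants and then ranks candidates by frequency. A minimal sketch of the lookup idea, with an illustrative toy dictionary:

```python
from itertools import combinations

def deletes(word, max_del=1):
    """All variants of `word` with up to `max_del` characters removed."""
    variants = {word}
    for positions in combinations(range(len(word)), max_del):
        chars = [c for i, c in enumerate(word) if i not in positions]
        variants.add("".join(chars))
    return variants

# Index a toy dictionary by its delete-variants (the model precomputes this).
vocab = ["place", "please", "allow", "to"]
index = {}
for w in vocab:
    for v in deletes(w):
        index.setdefault(v, set()).add(w)

def lookup(word):
    # A typo matches a dictionary word when their delete-variants overlap.
    matches = set()
    for v in deletes(word):
        matches |= index.get(v, set())
    return matches

print(sorted(lookup("alliow")))  # deleting one 'i' reaches "allow"
```

At larger edit distances both "place" and "please" match "plaese", and the model's frequency-based ranking breaks the tie, which is one way outputs like "Place" above can arise.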
Introducing the `johnsnowlabs` library: in October 2022, John Snow Labs released a unified open-source library containing all their products under one roof. This includes Spark NLP, Spark NLP Display, and NLU. Simplify your workflow with:
```bash
pip install johnsnowlabs
```
For spell checking, use one line of code:
```python
# Import the NLP module, which contains the Spark NLP and NLU libraries
from johnsnowlabs import nlp

# Use the Norvig model
nlp.load("en.spell.norvig").predict(
    "Plaese alliow me tao introdduce myhelf, I am a man of wealth und tiaste",
    output_level="token",
)
```
We introduced three models for spell checking and correction in Spark NLP: NorvigSweeting, SymmetricDelete, and ContextSpellChecker. These models can be integrated into Spark NLP pipelines for efficient processing of large datasets.