import streamlit as st

# Custom CSS for better styling
st.markdown("""
    <style>
        .main-title {
            font-size: 36px;
            color: #4A90E2;
            font-weight: bold;
            text-align: center;
        }
        .sub-title {
            font-size: 24px;
            color: #4A90E2;
            margin-top: 20px;
        }
        .section {
            background-color: #f9f9f9;
            padding: 15px;
            border-radius: 10px;
            margin-top: 20px;
        }
        .section h2 {
            font-size: 22px;
            color: #4A90E2;
        }
        .section p, .section ul {
            color: #666666;
        }
        .link {
            color: #4A90E2;
            text-decoration: none;
        }
    </style>
""", unsafe_allow_html=True)

# Title
st.markdown('<div class="main-title">Automatic Language Detection Using Spark NLP in Python</div>', unsafe_allow_html=True)

# Introduction
st.markdown("""
<div class="section">
    <p>Language detection, a core task in Natural Language Processing (NLP), automatically identifies the language of a piece of text. It is essential in multilingual applications where the language of the input is not known in advance. Accurate language detection improves downstream NLP tasks such as machine translation, sentiment analysis, and information retrieval.</p>
</div>
""", unsafe_allow_html=True)

# What is Language Detection
st.markdown('<div class="sub-title">What is Language Detection?</div>', unsafe_allow_html=True)
st.markdown("""
<div class="section">
    <p>Language detection models analyze text to determine its language by examining features such as:</p>
    <ul>
        <li><b>Character Set</b>: Identifying language-specific characters and symbols.</li>
        <li><b>Word Frequency</b>: Recognizing common words and their usage patterns in different languages.</li>
        <li><b>N-grams</b>: Analyzing sequences of n consecutive characters or words to detect language-specific spellings, phrases, and structures.</li>
    </ul>
    <p>Models are typically trained on large datasets (e.g., Wikipedia, Tatoeba) using statistical or deep learning methods to recognize these patterns. Once trained, a model predicts the language of new text by comparing its features with those learned during training.</p>
</div>
""", unsafe_allow_html=True)

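st.markdown("""
The n-gram idea described above can be made concrete with a small, self-contained sketch. This is only an illustration of character n-gram matching, not how `LanguageDetectorDL` works internally; the function names (`char_ngrams`, `overlap_score`, `guess_language`) and the two tiny reference texts are assumptions made for the example.

```python
from collections import Counter


def char_ngrams(text, n=3):
    # Lowercase and pad with spaces so word boundaries become part of the n-grams.
    padded = " " + text.lower() + " "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def overlap_score(sample, profile):
    # Fraction of the sample's n-gram occurrences that also appear in the profile.
    total = sum(sample.values())
    if total == 0:
        return 0.0
    shared = sum(count for gram, count in sample.items() if gram in profile)
    return shared / total


# Tiny illustrative reference texts; real models learn from large corpora.
profiles = {
    "en": char_ngrams("the quick brown fox jumps over the lazy dog and the cat"),
    "de": char_ngrams("der schnelle braune fuchs springt über den faulen hund und die katze"),
}


def guess_language(text):
    # Pick the language whose profile shares the most n-grams with the text.
    sample = char_ngrams(text)
    return max(profiles, key=lambda lang: overlap_score(sample, profiles[lang]))


print(guess_language("the dog and the fox"))
print(guess_language("der hund und die katze"))
```

With profiles this small the sketch only separates the two example languages; it is meant to show why short character sequences carry a language signal, not to replace a trained model.
""")
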
# Importance and Use Cases
st.markdown('<div class="sub-title">Importance and Use Cases</div>', unsafe_allow_html=True)
st.markdown("""
<div class="section">
    <p>Accurate language detection is pivotal for many applications, including:</p>
    <ul>
        <li><b>Machine Translation</b>: Identifying the source language so that text can be translated into other languages automatically.</li>
        <li><b>Sentiment Analysis</b>: Analyzing sentiments in multilingual datasets.</li>
        <li><b>Information Retrieval</b>: Enhancing search results by filtering content based on language.</li>
        <li><b>Spam Filtering</b>: Identifying spam content in multiple languages.</li>
        <li><b>Social Media Analysis</b>: Processing and categorizing user-generated content in different languages.</li>
    </ul>
</div>
""", unsafe_allow_html=True)

# Spark NLP's LanguageDetectorDL
st.markdown('<div class="sub-title">Spark NLP\'s LanguageDetectorDL</div>', unsafe_allow_html=True)
st.markdown("""
<div class="section">
    <p>The <code>LanguageDetectorDL</code> annotator from Spark NLP uses pretrained deep learning models to identify the language of a text. It can also handle documents containing mixed languages by analyzing sentence segments and selecting the most probable language.</p>
</div>
""", unsafe_allow_html=True)

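st.markdown("""
One way to picture "analyzing sentence segments and selecting the most probable language" is to average per-sentence scores and take the highest average. The sketch below illustrates that idea in plain Python; it is not Spark NLP's internal implementation, and the function name `coalesce_sentence_scores` and the scores themselves are made up for the example.

```python
from collections import defaultdict


def coalesce_sentence_scores(sentence_scores):
    # sentence_scores: one dict per sentence, mapping language code -> probability.
    # Returns the language with the highest average score and all averages.
    if not sentence_scores:
        raise ValueError("need at least one sentence")
    totals = defaultdict(float)
    for scores in sentence_scores:
        for lang, prob in scores.items():
            totals[lang] += prob
    averaged = {lang: total / len(sentence_scores) for lang, total in totals.items()}
    best = max(averaged, key=averaged.get)
    return best, averaged


# Hypothetical per-sentence outputs for a document that is mostly French.
scores = [
    {"fr": 0.9, "en": 0.1},
    {"fr": 0.8, "en": 0.2},
    {"fr": 0.2, "en": 0.8},
]
best, averaged = coalesce_sentence_scores(scores)
print(best, averaged)
```

Averaging keeps a single off-language sentence from flipping the document-level label, at the cost of hiding which sentences were in another language.
""")
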
# Setup Instructions
st.markdown('<div class="sub-title">Setup</div>', unsafe_allow_html=True)
st.markdown('<p>To install Spark NLP and detect languages in Python, use your preferred package manager (conda, pip, etc.). For example:</p>', unsafe_allow_html=True)
st.code("""
pip install spark-nlp
pip install pyspark
""", language="bash")
st.markdown('<p>For other installation options and environments, refer to the <a href="https://nlp.johnsnowlabs.com/docs/en/install" class="link">official documentation</a>.</p>', unsafe_allow_html=True)

st.markdown("<p>Then, import Spark NLP and start a Spark session:</p>", unsafe_allow_html=True)
st.code("""
import sparknlp

# Start Spark Session
spark = sparknlp.start()
""", language='python')

# Using LanguageDetectorDL
st.markdown('<div class="sub-title">Using LanguageDetectorDL</div>', unsafe_allow_html=True)
st.code("""
# Import necessary modules
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import LanguageDetectorDL

# Step 1: Transform raw text into `document` annotation
document_assembler = (
    DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")
)

# Step 2: Detect the language of the text
language_detector = (
    LanguageDetectorDL.pretrained()
    .setInputCols("document")
    .setOutputCol("language")
)

# Create the NLP pipeline
nlpPipeline = Pipeline(stages=[document_assembler, language_detector])

# Sample texts in different languages
data = spark.createDataFrame([
    ["Spark NLP is an open-source text processing library for advanced natural language processing for the Python, Java and Scala programming languages."],
    ["Spark NLP est une bibliothèque de traitement de texte open source pour le traitement avancé du langage naturel pour les langages de programmation Python, Java et Scala."],
    ["Spark NLP ist eine Open-Source-Textverarbeitungsbibliothek für fortgeschrittene natürliche Sprachverarbeitung für die Programmiersprachen Python, Java und Scala."],
    ["Spark NLP es una biblioteca de procesamiento de texto de código abierto para el procesamiento avanzado de lenguaje natural para los lenguajes de programación Python, Java y Scala."],
    ["Spark NLP é uma biblioteca de processamento de texto de código aberto para processamento avançado de linguagem natural para as linguagens de programação Python, Java e Scala"]
]).toDF("text")

# Transform the data with the pipeline
result = nlpPipeline.fit(data).transform(data)

# Show the results
result.select("text", "language.result").show(truncate=100)
""", language='python')

st.text("""
+----------------------------------------------------------------------------------------------------+------+
|                                                                                                text|result|
+----------------------------------------------------------------------------------------------------+------+
|Spark NLP is an open-source text processing library for advanced natural language processing for ...|  [en]|
|Spark NLP est une bibliothèque de traitement de texte open source pour le traitement avancé du la...|  [fr]|
|Spark NLP ist eine Open-Source-Textverarbeitungsbibliothek für fortgeschrittene natürliche Sprach...|  [de]|
|Spark NLP es una biblioteca de procesamiento de texto de código abierto para el procesamiento ava...|  [es]|
|Spark NLP é uma biblioteca de processamento de texto de código aberto para processamento avançado...|  [pt]|
+----------------------------------------------------------------------------------------------------+------+
""")

# One-Liner Alternative
st.markdown('<div class="sub-title">One-Liner Alternative</div>', unsafe_allow_html=True)
st.markdown("""
<div class="section">
    <p>John Snow Labs has introduced a unified library to simplify workflows across various products, including Spark NLP. Install the library with:</p>
    <pre><code>pip install johnsnowlabs</code></pre>
    <p>Use the following one-liner code for quick language detection:</p>
</div>
""", unsafe_allow_html=True)

st.code("""
# Import the NLP module which contains Spark NLP and NLU libraries
from johnsnowlabs import nlp

# Sample text in Polish
sample_text = "Spark NLP to biblioteka edytorów tekstu typu open source do zaawansowanego przetwarzania języka naturalnego w językach programowania Python, Java i Scala."

# Detect language with one line of code
result = nlp.load('xx.classify.wiki_95').predict(sample_text, output_level='sentence')
""", language='python')

st.markdown("""
<table style="width:100%; border-collapse: collapse; margin-top: 20px;">
  <thead>
    <tr style="background-color: #4A90E2; color: white; text-align: left;">
      <th style="padding: 12px;">Language</th>
      <th style="padding: 12px;">Confidence</th>
      <th style="padding: 12px;">Sentence</th>
    </tr>
  </thead>
  <tbody>
    <tr style="background-color: #f9f9f9;">
      <td style="padding: 12px; border: 1px solid #ddd;">pl</td>
      <td style="padding: 12px; border: 1px solid #ddd;">9.0</td>
      <td style="padding: 12px; border: 1px solid #ddd;">Spark NLP to biblioteka edytorów tekstu typu open source do zaawansowanego przetwarzania języka naturalnego w językach programowania Python, Java i Scala.</td>
    </tr>
  </tbody>
</table>
""", unsafe_allow_html=True)

st.markdown("""
<div class="section">
    <p><b>Benefits of the One-Liner</b></p>
    <p>This approach is convenient for quick implementations and testing. The one-liner uses default configurations, which may suffice for general use cases. For more specialized needs, customizing the pipeline or choosing specific models might be necessary.</p>
</div>
""", unsafe_allow_html=True)

# Notes and Recommendations
st.markdown('<div class="sub-title">Notes and Recommendations</div>', unsafe_allow_html=True)
st.markdown("""
<div class="section">
    <ul>
        <li><b>Customizing Pipelines</b>: While the one-liner is efficient, building a custom pipeline with specific models and configurations allows for greater flexibility and optimization according to the application's requirements.</li>
        <li><b>Handling Mixed Languages</b>: <code>LanguageDetectorDL</code> can effectively manage texts with multiple languages by analyzing sentence segments. Ensure your pipeline is configured to handle such cases appropriately.</li>
        <li><b>Performance Considerations</b>: When working with large datasets, optimizing Spark configurations and resources is crucial for maintaining performance and avoiding bottlenecks.</li>
    </ul>
</div>
""", unsafe_allow_html=True)

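st.markdown("""
For the mixed-language note above, one possible configuration is to split documents into sentences first and ask the detector for one prediction per sentence. The sketch below assumes the `SentenceDetector` annotator and the `setCoalesceSentences` and `setThreshold` setters of `LanguageDetectorDL` behave as described in the Spark NLP API reference; check them against your installed version. The function name `build_sentence_level_pipeline` and the threshold value are illustrative choices.

```python
def build_sentence_level_pipeline():
    # Imports live inside the function so the sketch can be pasted into a
    # module without requiring Spark until the pipeline is actually built.
    from pyspark.ml import Pipeline
    from sparknlp.base import DocumentAssembler
    from sparknlp.annotator import SentenceDetector, LanguageDetectorDL

    document_assembler = (
        DocumentAssembler()
        .setInputCol("text")
        .setOutputCol("document")
    )

    # Split each document into sentences before detection.
    sentence_detector = (
        SentenceDetector()
        .setInputCols(["document"])
        .setOutputCol("sentence")
    )

    language_detector = (
        LanguageDetectorDL.pretrained()
        .setInputCols(["sentence"])
        .setOutputCol("language")
        .setCoalesceSentences(False)  # assumption: emit one label per sentence
        .setThreshold(0.5)            # assumption: minimum score to accept a label
    )

    return Pipeline(stages=[document_assembler, sentence_detector, language_detector])
```

Calling `build_sentence_level_pipeline().fit(data).transform(data)` on a DataFrame with a `text` column should then yield a `language.result` array with one entry per sentence rather than one per document.
""")
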
# Benchmarking Section
st.markdown('<div class="sub-title">Benchmarking</div>', unsafe_allow_html=True)
st.write("")
st.markdown('<p><a href="https://sparknlp.org/2020/12/05/ld_wiki_tatoeba_cnn_375_xx.html" class="link" target="_blank">ld_wiki_tatoeba_cnn_375</a> model evaluated on the Europarl dataset, which the model did not see during training:</p>', unsafe_allow_html=True)
st.text("""
+--------+-----+-------+------------------+
|src_lang|count|correct|         precision|
+--------+-----+-------+------------------+
|      fr| 1000|   1000|               1.0|
|      de| 1000|    999|             0.999|
|      fi| 1000|    999|             0.999|
|      nl| 1000|    998|             0.998|          +-------+--------------------+
|      el| 1000|    997|             0.997|          |summary|           precision|
|      en| 1000|    995|             0.995|          +-------+--------------------+
|      es| 1000|    994|             0.994|          |  count|                  21|
|      it| 1000|    993|             0.993|          |   mean|  0.9758952066282511|
|      sv| 1000|    991|             0.991|          | stddev|0.029434744995013935|
|      da| 1000|    987|             0.987|          |    min|  0.8862144420131292|
|      pl|  914|    901|0.9857768052516411|          |    max|                 1.0|
|      hu|  880|    866|0.9840909090909091|          +-------+--------------------+
|      pt| 1000|    980|              0.98|
|      et|  928|    907|0.9773706896551724|
|      ro|  784|    766|0.9770408163265306|
|      lt| 1000|    976|             0.976|
|      bg| 1000|    965|             0.965|
|      cs| 1000|    945|             0.945|
|      sk| 1000|    944|             0.944|
|      lv|  916|    843|0.9203056768558951|
|      sl|  914|    810|0.8862144420131292|
+--------+-----+-------+------------------+
""")

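st.markdown("""
The summary statistics in the table above can be reproduced from the per-language rows. In this minimal sketch the `precision` column is taken to be `correct / count` for each source language, and the summary is the unweighted mean and sample standard deviation over the 21 languages; the variable names are illustrative.

```python
from statistics import mean, stdev

# (language, sentence count, correct predictions) copied from the table above.
rows = [
    ("fr", 1000, 1000), ("de", 1000, 999), ("fi", 1000, 999), ("nl", 1000, 998),
    ("el", 1000, 997), ("en", 1000, 995), ("es", 1000, 994), ("it", 1000, 993),
    ("sv", 1000, 991), ("da", 1000, 987), ("pl", 914, 901), ("hu", 880, 866),
    ("pt", 1000, 980), ("et", 928, 907), ("ro", 784, 766), ("lt", 1000, 976),
    ("bg", 1000, 965), ("cs", 1000, 945), ("sk", 1000, 944), ("lv", 916, 843),
    ("sl", 914, 810),
]

# Per-language score: share of sentences whose language was predicted correctly.
per_language = {lang: correct / count for lang, count, correct in rows}
values = list(per_language.values())

macro_mean = mean(values)      # unweighted mean over languages
sample_stddev = stdev(values)  # sample standard deviation (n - 1 denominator)
lowest = min(per_language, key=per_language.get)

# Overall share of correct sentences, weighting each language by its count.
total_correct = sum(correct for _, _, correct in rows)
total_count = sum(count for _, count, _ in rows)
overall = total_correct / total_count

print(len(values), macro_mean, sample_stddev, lowest, overall)
```

The unweighted mean treats every language equally, while `overall` weights languages by how many sentences they contribute, so the two numbers differ slightly when counts are uneven.
""")
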
# Conclusion
st.markdown("""
<div class="section">
    <h2>Conclusion</h2>
    <p>Accurate language detection is a foundational step in many NLP workflows. Spark NLP’s <code>LanguageDetectorDL</code> annotator offers a robust solution for detecting languages in diverse text corpora. With its integration into Spark's powerful data processing framework, it enables efficient handling of large-scale multilingual datasets, providing accurate language identification for various applications.</p>
</div>
""", unsafe_allow_html=True)

# References and Additional Information
st.markdown('<div class="sub-title">References and Additional Information</div>', unsafe_allow_html=True)

st.markdown("""
<div class="section">
    <ul>
        <li><a href="https://nlp.johnsnowlabs.com/docs/en/annotators#languagedetectordl" class="link" target="_blank">Documentation: LanguageDetectorDL</a></li>
        <li><a href="https://nlp.johnsnowlabs.com/api/python/reference/autosummary/sparknlp/annotator/language_detector_dl/index.html#sparknlp.annotator.language_detector_dl.LanguageDetectorDL" class="link" target="_blank">Python Docs: LanguageDetectorDL</a></li>
        <li><a href="https://sparknlp.org/2020/12/05/ld_wiki_tatoeba_cnn_375_xx.html" class="link" target="_blank">ld_wiki_tatoeba_cnn_375</a></li>
        <li><a href="https://www.johnsnowlabs.com/how-to-detect-languages-with-python-a-comprehensive-guide/" class="link" target="_blank">Reference Article</a></li>
    </ul>
</div>
""", unsafe_allow_html=True)

st.markdown('<div class="sub-title">Community & Support</div>', unsafe_allow_html=True)
st.markdown("""
<div class="section">
    <ul>
        <li><a class="link" href="https://sparknlp.org/" target="_blank">Official Website</a>: Documentation and examples</li>
        <li><a class="link" href="https://join.slack.com/t/spark-nlp/shared_invite/zt-198dipu77-L3UWNe_AJ8xqDk0ivmih5Q" target="_blank">Slack</a>: Live discussion with the community and team</li>
        <li><a class="link" href="https://github.com/JohnSnowLabs/spark-nlp" target="_blank">GitHub</a>: Bug reports, feature requests, and contributions</li>
        <li><a class="link" href="https://medium.com/spark-nlp" target="_blank">Medium</a>: Spark NLP articles</li>
        <li><a class="link" href="https://www.youtube.com/channel/UCmFOjlpYEhxf_wJUDuz6xxQ/videos" target="_blank">YouTube</a>: Video tutorials</li>
    </ul>
</div>
""", unsafe_allow_html=True)