import streamlit as st

# Page configuration
st.set_page_config(
    layout="wide", 
    initial_sidebar_state="auto"
)

# Custom CSS for better styling
st.markdown("""

    <style>

        .main-title {

            font-size: 36px;

            color: #4A90E2;

            font-weight: bold;

            text-align: center;

        }

        .sub-title {

            font-size: 24px;

            color: #4A90E2;

            margin-top: 20px;

        }

        .section {

            background-color: #f9f9f9;

            padding: 15px;

            border-radius: 10px;

            margin-top: 20px;

        }

        .section h2 {

            font-size: 22px;

            color: #4A90E2;

        }

        .section p, .section ul {

            color: #666666;

        }

        .link {

            color: #4A90E2;

            text-decoration: none;

        }

        .benchmark-table {

            width: 100%;

            border-collapse: collapse;

            margin-top: 20px;

        }

        .benchmark-table th, .benchmark-table td {

            border: 1px solid #ddd;

            padding: 8px;

            text-align: left;

        }

        .benchmark-table th {

            background-color: #4A90E2;

            color: white;

        }

        .benchmark-table td {

            background-color: #f2f2f2;

        }

    </style>

""", unsafe_allow_html=True)

# Title
st.markdown('<div class="main-title">Introduction to XLM-RoBERTa Annotators in Spark NLP</div>', unsafe_allow_html=True)

# Subtitle
st.markdown("""

<div class="section">

    <p>XLM-RoBERTa (Cross-lingual Robustly Optimized BERT Approach) is an advanced multilingual model that extends the capabilities of RoBERTa to over 100 languages. Pre-trained on a massive, diverse corpus, XLM-RoBERTa is designed to handle various NLP tasks in a multilingual context, making it ideal for applications that require cross-lingual understanding. Below, we provide an overview of the XLM-RoBERTa annotators for these tasks:</p>

</div>

""", unsafe_allow_html=True)

st.markdown('<div class="sub-title">XLM-RoBERTa for Token Classification', unsafe_allow_html=True)
st.markdown("""

<div class="section">

    <p><strong>Token Classification</strong> is a crucial NLP task that involves assigning labels to individual tokens (words or subwords) within a sentence. This task is fundamental for applications like Named Entity Recognition (NER), Part-of-Speech (POS) tagging, and other fine-grained text analyses.</p>

    <p>XLM-RoBERTa, with its multilingual capabilities, is particularly suited for token classification in diverse linguistic contexts. Leveraging this model in Spark NLP allows for robust and scalable token classification across multiple languages, making it an invaluable tool for multilingual NLP projects.</p>

    <p>Using XLM-RoBERTa for token classification enables:</p>

    <ul>

        <li><strong>Multilingual NER:</strong> Identify and categorize entities in text across various languages, such as persons (PER), organizations (ORG), locations (LOC), and more.</li>

        <li><strong>Cross-lingual Transfer Learning:</strong> Apply learned models from one language to another, benefiting from XLM-RoBERTa's shared representations across languages.</li>

        <li><strong>Enhanced Text Processing:</strong> Improve text categorization, information extraction, and data retrieval across multilingual datasets.</li>

    </ul>

    <p>Advantages of using XLM-RoBERTa for token classification in Spark NLP include:</p>

    <ul>

        <li><strong>Multilingual Expertise:</strong> XLM-RoBERTa's training on a vast multilingual corpus ensures strong performance across different languages.</li>

        <li><strong>Scalability:</strong> Integrated with Apache Spark, Spark NLP allows processing of large-scale multilingual datasets efficiently.</li>

        <li><strong>Model Flexibility:</strong> Fine-tune XLM-RoBERTa models or use pre-trained ones based on your specific language needs and tasks.</li>

    </ul>

</div>

""", unsafe_allow_html=True)

# General Information about Using Token Classification Models
st.markdown('<div class="sub-title">How to Use XLM-RoBERTa for Token Classification in Spark NLP</div>', unsafe_allow_html=True)
st.markdown("""

<div class="section">

    <p>To harness XLM-RoBERTa for token classification, Spark NLP provides a straightforward pipeline setup. Below is a sample implementation that demonstrates how to use XLM-RoBERTa for Named Entity Recognition (NER). The model's multilingual capabilities ensure that it can be applied effectively across different languages, making it ideal for diverse NLP tasks.</p>

</div>

""", unsafe_allow_html=True)

st.code('''
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
from pyspark.sql.functions import col

# Start a Spark session with Spark NLP
spark = sparknlp.start()

# Assemble raw text into document annotations
documentAssembler = DocumentAssembler() \\
    .setInputCol("text") \\
    .setOutputCol("document")

# Split documents into tokens
tokenizer = Tokenizer() \\
    .setInputCols(["document"]) \\
    .setOutputCol("token")

# Load a pretrained XLM-RoBERTa token classification model
token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_large_finetuned_conll03_english", "xx") \\
    .setInputCols(["document", "token"]) \\
    .setOutputCol("ner")

# Group IOB tags into named-entity chunks
ner_converter = NerConverter() \\
    .setInputCols(["document", "token", "ner"]) \\
    .setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[
    documentAssembler,
    tokenizer,
    token_classifier,
    ner_converter
])

data = spark.createDataFrame([["Spark NLP provides powerful tools for multilingual NLP."]]).toDF("text")
result = pipeline.fit(data).transform(data)

# Explode the recognized chunks and show each chunk with its entity label
result.selectExpr("explode(ner_chunk) as ner_chunk").select(
    col("ner_chunk.result").alias("chunk"),
    col("ner_chunk.metadata.entity").alias("ner_label")
).show(truncate=False)
''', language='python')

# Results Example
st.text("""

+--------------------------+---------+

|chunk                     |ner_label|

+--------------------------+---------+

|Spark NLP                 |ORG      |

+--------------------------+---------+

""")

# Model Info Section
st.markdown('<div class="sub-title">Choosing the Right XLM-RoBERTa Model</div>', unsafe_allow_html=True)
st.markdown("""

<div class="section">

    <p>Spark NLP provides access to various pre-trained XLM-RoBERTa models tailored for token classification tasks. Selecting the appropriate model can significantly impact performance, particularly in multilingual contexts.</p>

    <p>To explore and choose the most suitable XLM-RoBERTa model for your needs, visit the <a class="link" href="https://sparknlp.org/models?annotator=XlmRoBertaForTokenClassification" target="_blank">Spark NLP Models Hub</a>. Here, you will find detailed descriptions of each model, including their specific applications and supported languages.</p>

</div>

""", unsafe_allow_html=True)

st.markdown('<div class="sub-title">References</div>', unsafe_allow_html=True)
st.markdown("""

<div class="section">

    <ul>

        <li><a class="link" href="https://huggingface.co/xlm-roberta-large-finetuned-conll03-english" target="_blank">Hugging Face: xlm-roberta-large-finetuned-conll03-english</a></li>

        <li><a class="link" href="https://arxiv.org/abs/1911.02116" target="_blank">XLM-RoBERTa: A Multilingual Language Model</a></li>

        <li><a class="link" href="https://github.com/facebookresearch/fairseq/tree/main/examples/xlmr" target="_blank">GitHub: XLM-RoBERTa Examples</a></li>

        <li><a class="link" href="https://aclanthology.org/2021.acl-long.330.pdf" target="_blank">ACL: Multilingual Transfer of NER Models</a></li>

        <li><a class="link" href="https://dl.acm.org/doi/pdf/10.1145/3442188.3445922" target="_blank">ACM: Analysis of Multilingual Models</a></li>

        <li><a class="link" href="https://arxiv.org/pdf/2008.03415.pdf" target="_blank">Efficient Multilingual Language Models</a></li>

        <li><a class="link" href="https://mlco2.github.io/impact#compute" target="_blank">ML CO2 Impact Estimator</a></li>

        <li><a class="link" href="https://arxiv.org/abs/1910.09700" target="_blank">Cross-lingual Transfer with XLM-RoBERTa</a></li>

    </ul>

</div>

""", unsafe_allow_html=True)

# Footer
st.markdown("""

<div class="section">

    <ul>

        <li><a class="link" href="https://sparknlp.org/" target="_blank">Official Website</a>: Documentation and examples</li>

        <li><a class="link" href="https://join.slack.com/t/spark-nlp/shared_invite/zt-198dipu77-L3UWNe_AJ8xqDk0ivmih5Q" target="_blank">Slack</a>: Live discussion with the community and team</li>

        <li><a class="link" href="https://github.com/JohnSnowLabs/spark-nlp" target="_blank">GitHub</a>: Bug reports, feature requests, and contributions</li>

        <li><a class="link" href="https://medium.com/spark-nlp" target="_blank">Medium</a>: Spark NLP articles</li>

        <li><a class="link" href="https://www.youtube.com/channel/UCmFOjlpYEhxf_wJUDuz6xxQ/videos" target="_blank">YouTube</a>: Video tutorials</li>

    </ul>

</div>

""", unsafe_allow_html=True)

st.markdown('<div class="sub-title">Quick Links</div>', unsafe_allow_html=True)

st.markdown("""

<div class="section">

    <ul>

        <li><a class="link" href="https://sparknlp.org/docs/en/quickstart" target="_blank">Getting Started</a></li>

        <li><a class="link" href="https://nlp.johnsnowlabs.com/models" target="_blank">Pretrained Models</a></li>

        <li><a class="link" href="https://github.com/JohnSnowLabs/spark-nlp/tree/master/examples/python/annotation/text/english" target="_blank">Example Notebooks</a></li>

        <li><a class="link" href="https://sparknlp.org/docs/en/install" target="_blank">Installation Guide</a></li>

    </ul>

</div>

""", unsafe_allow_html=True)