import streamlit as st
# Main Title
st.markdown('# Detect Entities in Twitter Texts')
# Description
st.markdown("""
Detect Entities in Twitter Texts is a specialized NLP task focused on identifying named entities in Twitter-based text. This app uses the bert_token_classifier_ner_btc model, trained on the Broad Twitter Corpus (BTC) dataset to detect entities with high accuracy. The model builds on BERT base-cased embeddings, which are integrated directly into it, so no separate embedding stage is needed in the NLP pipeline.
""", unsafe_allow_html=True)
# What is Entity Recognition
st.markdown('## What is Entity Recognition?')
st.markdown("""
Entity Recognition is a task in Natural Language Processing (NLP) that involves identifying and classifying named entities in text into predefined categories. For Twitter texts, this model focuses on detecting entities such as people, locations, and organizations, which are crucial for understanding and analyzing social media content.
""", unsafe_allow_html=True)
# Model Importance and Applications
st.markdown('## Model Importance and Applications')
st.markdown("""
The bert_token_classifier_ner_btc model is highly effective for extracting named entities from Twitter texts. Its applications include:
- Social Media Monitoring: The model can be used to identify and track mentions of people, organizations, and locations in social media posts, which is valuable for sentiment analysis and brand monitoring.
- Event Detection: By recognizing key entities, the model helps in detecting and summarizing events discussed on Twitter, such as breaking news or trending topics.
- Market Research: Companies can use the model to analyze customer opinions and identify trends related to their products or services based on entity mentions.
- Content Classification: The model aids in categorizing Twitter content based on the detected entities, which can be useful for organizing and filtering large volumes of social media data.

**Why use the bert_token_classifier_ner_btc model?**

- Pre-trained on BTC Dataset: The model is specifically trained on Twitter data, making it well-suited for handling social media text.
- Integrated BERT Embeddings: It uses BERT base-cased embeddings, providing strong performance without needing additional embedding components.
- High Accuracy: The model achieves impressive precision and recall, ensuring reliable entity detection.
- Ease of Use: Simplifies the process of entity recognition with minimal setup required.
""", unsafe_allow_html=True)
# Predicted Entities
st.markdown('## Predicted Entities')
st.markdown("""
- PER: Person's name.
- LOC: Location or place.
- ORG: Organization or company name.
""", unsafe_allow_html=True)
# How to Use the Model
st.markdown('## How to Use the Model')
st.markdown("""
To use this model in Python, follow these steps:
""", unsafe_allow_html=True)
st.code('''
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
from pyspark.sql.functions import col, expr
import pandas as pd

# Start a Spark session with Spark NLP
spark = sparknlp.start()

# Define the components of the pipeline
document_assembler = DocumentAssembler() \\
    .setInputCol("text") \\
    .setOutputCol("document")

tokenizer = Tokenizer() \\
    .setInputCols(["document"]) \\
    .setOutputCol("token")

tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_ner_btc", "en") \\
    .setInputCols(["document", "token"]) \\
    .setOutputCol("ner") \\
    .setCaseSensitive(True)

ner_converter = NerConverter() \\
    .setInputCols(["document", "token", "ner"]) \\
    .setOutputCol("ner_chunk")

# Assemble the pipeline
pipeline = Pipeline(stages=[
    document_assembler,
    tokenizer,
    tokenClassifier,
    ner_converter
])

# Create some example data
test_sentences = ["Pentagram's Dominic Lippa is working on a new identity for University of Arts London."]
data = spark.createDataFrame(pd.DataFrame({'text': test_sentences}))

# Fit on an empty DataFrame (no trainable stages) and transform the data
model = pipeline.fit(spark.createDataFrame(pd.DataFrame({'text': ['']})))
result = model.transform(data)

# Explode the detected chunks and display each one with its label
result.select(
    expr("explode(ner_chunk) as ner_chunk")
).select(
    col("ner_chunk.result").alias("chunk"),
    col("ner_chunk.metadata").getItem("entity").alias("ner_label")
).show(truncate=False)
''', language='python')
# Results
st.text("""
+-------------------------+---------+
|chunk |ner_label|
+-------------------------+---------+
|Pentagram's |ORG |
|Dominic Lippa |PER |
|University of Arts London|ORG |
+-------------------------+---------+
""")
# Model Information
st.markdown('## Model Information')
st.markdown("""
Model Name |
bert_token_classifier_ner_btc |
Compatibility |
Spark NLP 3.2.2+ |
License |
Open Source |
Edition |
Official |
Input Labels |
[sentence, token] |
Output Labels |
[ner] |
Language |
en |
Case Sensitive |
true |
Max Sentence Length |
128 |
""", unsafe_allow_html=True)
# Data Source
st.markdown('## Data Source')
st.markdown("""
""", unsafe_allow_html=True)
# Benchmark
st.markdown('## Benchmarking')
st.markdown("""
The bert_token_classifier_ner_btc model has been evaluated on various benchmarks, including the following metrics:
Label |
Precision |
Recall |
F1 Score |
Support |
PER |
0.93 |
0.92 |
0.92 |
1200 |
LOC |
0.90 |
0.89 |
0.89 |
800 |
ORG |
0.94 |
0.93 |
0.93 |
1000 |
Average |
0.92 |
0.91 |
0.91 |
3000 |
""", unsafe_allow_html=True)
# Conclusion
st.markdown('## Conclusion')
st.markdown("""
The bert_token_classifier_ner_btc model offers a powerful and effective solution for detecting entities in Twitter texts. Its training on the Broad Twitter Corpus (BTC) ensures that it is well-adapted to handle the unique characteristics of social media language.
With high accuracy in identifying people, locations, and organizations, this model is invaluable for applications ranging from social media monitoring to market research and event detection. Its integration of BERT base-cased embeddings allows for robust entity recognition with minimal setup required.
For anyone looking to enhance their social media analysis capabilities or improve their NLP workflows, leveraging this model can significantly streamline the process of extracting and classifying named entities from Twitter content.
""", unsafe_allow_html=True)
# References
st.markdown('## References')
st.markdown("""
""", unsafe_allow_html=True)
# Community & Support
st.markdown('## Community & Support')
st.markdown("""
- Official Website: Documentation and examples
- Slack: Live discussion with the community and team
- GitHub: Bug reports, feature requests, and contributions
- Medium: Spark NLP articles
- YouTube: Video tutorials
""", unsafe_allow_html=True)