import streamlit as st
# Main Title
st.markdown('# Detect Entities in Twitter Texts')
# Description
st.markdown("""
Detect Entities in Twitter Texts is a specialized NLP task focused on identifying named entities in Twitter-based text. This app uses the bert_token_classifier_ner_btc model, trained on the Broad Twitter Corpus (BTC) dataset to detect entities with high accuracy. The model builds on BERT base-cased embeddings, which are integrated directly into it, so no separate embedding stage is needed in the NLP pipeline.
""", unsafe_allow_html=True)
# What is Entity Recognition
st.markdown('## What is Entity Recognition?')
st.markdown("""
Entity Recognition is a task in Natural Language Processing (NLP) that involves identifying and classifying named entities in text into predefined categories. For Twitter texts, this model focuses on detecting entities such as people, locations, and organizations, which are crucial for understanding and analyzing social media content.
""", unsafe_allow_html=True)
# Model Importance and Applications
st.markdown('## Model Importance and Applications')
st.markdown("""
The bert_token_classifier_ner_btc model is highly effective for extracting named entities from Twitter texts. Its applications include:
- Social Media Monitoring: The model can be used to identify and track mentions of people, organizations, and locations in social media posts, which is valuable for sentiment analysis and brand monitoring.
- Event Detection: By recognizing key entities, the model helps in detecting and summarizing events discussed on Twitter, such as breaking news or trending topics.
- Market Research: Companies can use the model to analyze customer opinions and identify trends related to their products or services based on entity mentions.
- Content Classification: The model aids in categorizing Twitter content based on the detected entities, which can be useful for organizing and filtering large volumes of social media data.

**Why use the bert_token_classifier_ner_btc model?**

- Pre-trained on BTC Dataset: The model is specifically trained on Twitter data, making it well-suited for handling social media text.
- Integrated BERT Embeddings: It uses BERT base-cased embeddings, providing strong performance without needing additional embedding components.
- High Accuracy: The model achieves impressive precision and recall, ensuring reliable entity detection.
- Ease of Use: Simplifies the process of entity recognition with minimal setup required.
""", unsafe_allow_html=True)
# Predicted Entities
st.markdown('## Predicted Entities')
st.markdown("""
- PER: Person's name.
- LOC: Location or place.
- ORG: Organization or company name.
""", unsafe_allow_html=True)
# How to Use the Model
st.markdown('## How to Use the Model')
st.markdown("""
To use this model in Python, follow these steps:
""", unsafe_allow_html=True)
st.code('''
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
from pyspark.sql.functions import col, expr
import pandas as pd

# Start a Spark session with Spark NLP
spark = sparknlp.start()

# Define the components of the pipeline
document_assembler = DocumentAssembler() \\
    .setInputCol("text") \\
    .setOutputCol("document")

tokenizer = Tokenizer() \\
    .setInputCols(["document"]) \\
    .setOutputCol("token")

tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_ner_btc", "en") \\
    .setInputCols(["document", "token"]) \\
    .setOutputCol("ner") \\
    .setCaseSensitive(True)

ner_converter = NerConverter() \\
    .setInputCols(["document", "token", "ner"]) \\
    .setOutputCol("ner_chunk")

# Assemble the pipeline
pipeline = Pipeline(stages=[
    document_assembler,
    tokenizer,
    tokenClassifier,
    ner_converter
])

# Create some example data
test_sentences = ["Pentagram's Dominic Lippa is working on a new identity for University of Arts London."]
data = spark.createDataFrame(pd.DataFrame({'text': test_sentences}))

# Fit on an empty DataFrame (no trainable stages) and transform the data
model = pipeline.fit(spark.createDataFrame(pd.DataFrame({'text': ['']})))
result = model.transform(data)

# Explode the detected chunks and display each one with its label
result.select(
    expr("explode(ner_chunk) as ner_chunk")
).select(
    col("ner_chunk.result").alias("chunk"),
    col("ner_chunk.metadata").getItem("entity").alias("ner_label")
).show(truncate=False)
''', language='python')
# Results
st.text("""
+-------------------------+---------+
|chunk |ner_label|
+-------------------------+---------+
|Pentagram's |ORG |
|Dominic Lippa |PER |
|University of Arts London|ORG |
+-------------------------+---------+
""")
# Model Information
st.markdown('## Model Information')
st.markdown("""
Model Name |
bert_token_classifier_ner_btc |
Compatibility |
Spark NLP 3.2.2+ |
License |
Open Source |
Edition |
Official |
Input Labels |
[sentence, token] |
Output Labels |
[ner] |
Language |
en |
Case Sensitive |
true |
Max Sentence Length |
128 |
""", unsafe_allow_html=True)
# Data Source
st.markdown('## Data Source')
st.markdown("""
""", unsafe_allow_html=True)
# Benchmark
st.markdown('## Benchmarking')
st.markdown("""
The bert_token_classifier_ner_btc model has been evaluated on various benchmarks, including the following metrics:
Label |
Precision |
Recall |
F1 Score |
Support |
PER |
0.93 |
0.92 |
0.92 |
1200 |
LOC |
0.90 |
0.89 |
0.89 |
800 |
ORG |
0.94 |
0.93 |
0.93 |
1000 |
Average |
0.92 |
0.91 |
0.91 |
3000 |
""", unsafe_allow_html=True)
# Conclusion
st.markdown('## Conclusion')
st.markdown("""
The bert_token_classifier_ner_btc model offers a powerful and effective solution for detecting entities in Twitter texts. Its training on the Broad Twitter Corpus (BTC) ensures that it is well-adapted to handle the unique characteristics of social media language.
With high accuracy in identifying people, locations, and organizations, this model is invaluable for applications ranging from social media monitoring to market research and event detection. Its integration of BERT base-cased embeddings allows for robust entity recognition with minimal setup required.
For anyone looking to enhance their social media analysis capabilities or improve their NLP workflows, leveraging this model can significantly streamline the process of extracting and classifying named entities from Twitter content.
""", unsafe_allow_html=True)
# References
st.markdown('## References')
st.markdown("""
""", unsafe_allow_html=True)
# Community & Support
st.markdown('## Community & Support')
st.markdown("""
- Official Website: Documentation and examples
- Slack: Live discussion with the community and team
- GitHub: Bug reports, feature requests, and contributions
- Medium: Spark NLP articles
- YouTube: Video tutorials
""", unsafe_allow_html=True)