import streamlit as st

# Main Title
st.title("Detect Entities in Twitter Texts")

# Description
st.markdown("""

Detect Entities in Twitter Texts is a specialized NLP task focusing on identifying entities within Twitter-based texts. This app utilizes the bert_token_classifier_ner_btc model, which is trained on the Broad Twitter Corpus (BTC) dataset to detect entities with high accuracy. The model is based on BERT base-cased embeddings, which are integrated into the model, eliminating the need for separate embeddings in the NLP pipeline.

""", unsafe_allow_html=True) # What is Entity Recognition st.markdown('
What is Entity Recognition?
', unsafe_allow_html=True) st.markdown("""

Entity Recognition is a task in Natural Language Processing (NLP) that involves identifying and classifying named entities in text into predefined categories. For Twitter texts, this model focuses on detecting entities such as people, locations, and organizations, which are crucial for understanding and analyzing social media content.
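To make this concrete, here is a minimal, self-contained sketch (plain Python, not Spark NLP library code) of how token-level BIO tags are typically merged into entity chunks; the `bio_to_chunks` helper and the sample tokens are illustrative only:

```python
# Simplified illustration of merging token-level BIO tags into entity chunks.
# In the real pipeline, Spark NLP's NerConverter performs this step.

def bio_to_chunks(tokens, tags):
    # Merge (token, BIO-tag) pairs into (chunk, label) entity spans.
    chunks = []
    current_tokens, current_label = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):  # a new entity begins
            if current_tokens:
                chunks.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [token], tag[2:]
        elif tag.startswith("I-") and current_tokens:  # entity continues
            current_tokens.append(token)
        else:  # "O" tag: outside any entity, flush any open chunk
            if current_tokens:
                chunks.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None
    if current_tokens:
        chunks.append((" ".join(current_tokens), current_label))
    return chunks

tokens = ["Dominic", "Lippa", "works", "at", "Pentagram"]
tags = ["B-PER", "I-PER", "O", "O", "B-ORG"]
print(bio_to_chunks(tokens, tags))
# [('Dominic Lippa', 'PER'), ('Pentagram', 'ORG')]
```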

""", unsafe_allow_html=True) # Model Importance and Applications st.markdown('
Model Importance and Applications
', unsafe_allow_html=True) st.markdown("""

The bert_token_classifier_ner_btc model is highly effective for extracting named entities from Twitter texts. Its applications range from social media monitoring to market research and event detection.

Why use the bert_token_classifier_ner_btc model?

- It is trained on the Broad Twitter Corpus (BTC), so it is well adapted to the informal, noisy language of social media.
- BERT base-cased embeddings are built into the model, so no separate embedding stage is needed in the NLP pipeline.

""", unsafe_allow_html=True) # Predicted Entities st.markdown('
Predicted Entities
', unsafe_allow_html=True) st.markdown("""
""", unsafe_allow_html=True) # How to Use the Model st.markdown('
How to Use the Model
', unsafe_allow_html=True) st.markdown("""

To use this model in Python, follow these steps:

""", unsafe_allow_html=True) st.code(''' from sparknlp.base import * from sparknlp.annotator import * from pyspark.ml import Pipeline from pyspark.sql.functions import col, expr import pandas as pd # Define the components of the pipeline document_assembler = DocumentAssembler() \\ .setInputCol("text") \\ .setOutputCol("document") tokenizer = Tokenizer() \\ .setInputCols(["document"]) \\ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_ner_btc", "en") \\ .setInputCols("token", "document") \\ .setOutputCol("ner") \\ .setCaseSensitive(True) ner_converter = NerConverter() \\ .setInputCols(["document", "token", "ner"]) \\ .setOutputCol("ner_chunk") # Create the pipeline pipeline = Pipeline(stages=[ document_assembler, tokenizer, tokenClassifier, ner_converter ]) # Create some example data test_sentences = ["Pentagram's Dominic Lippa is working on a new identity for University of Arts London."] data = spark.createDataFrame(pd.DataFrame({'text': test_sentences})) # Apply the pipeline to the data model = pipeline.fit(spark.createDataFrame(pd.DataFrame({'text': ['']}))) result = model.transform(data) # Display results result.select( expr("explode(ner_chunk) as ner_chunk") ).select( col("ner_chunk.result").alias("chunk"), col("ner_chunk.metadata.entity").alias("ner_label") ).show(truncate=False) ''', language='python') # Results st.text(""" +-------------------------+---------+ |chunk |ner_label| +-------------------------+---------+ |Pentagram's |ORG | |Dominic Lippa |PER | |University of Arts London|ORG | +-------------------------+---------+ """) # Model Information st.markdown('
Model Information
', unsafe_allow_html=True) st.markdown("""
| Attribute | Value |
|---|---|
| Model Name | bert_token_classifier_ner_btc |
| Compatibility | Spark NLP 3.2.2+ |
| License | Open Source |
| Edition | Official |
| Input Labels | [sentence, token] |
| Output Labels | [ner] |
| Language | en |
| Case Sensitive | true |
| Max Sentence Length | 128 |
""", unsafe_allow_html=True) # Data Source st.markdown('
Data Source
', unsafe_allow_html=True) st.markdown("""

For more information about the dataset used to train this model, visit the Broad Twitter Corpus (BTC).

""", unsafe_allow_html=True) # Benchmark st.markdown('
Benchmarking
', unsafe_allow_html=True) st.markdown("""

The bert_token_classifier_ner_btc model has been evaluated on various benchmarks, including the following metrics:

| Label | Precision | Recall | F1 Score | Support |
|---|---|---|---|---|
| PER | 0.93 | 0.92 | 0.92 | 1200 |
| LOC | 0.90 | 0.89 | 0.89 | 800 |
| ORG | 0.94 | 0.93 | 0.93 | 1000 |
| Average | 0.92 | 0.91 | 0.91 | 3000 |
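As a quick sanity check, the Average row matches a macro (unweighted) average of the per-label scores. The snippet below recomputes it in plain Python, with the numbers copied from the table above:

```python
# Recompute the Average row as the macro (unweighted) mean across labels.
scores = {
    "PER": {"precision": 0.93, "recall": 0.92, "f1": 0.92},
    "LOC": {"precision": 0.90, "recall": 0.89, "f1": 0.89},
    "ORG": {"precision": 0.94, "recall": 0.93, "f1": 0.93},
}

for metric in ("precision", "recall", "f1"):
    macro = sum(s[metric] for s in scores.values()) / len(scores)
    print(metric, round(macro, 2))
# precision 0.92
# recall 0.91
# f1 0.91
```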
""", unsafe_allow_html=True) # Conclusion st.markdown('
Conclusion
', unsafe_allow_html=True) st.markdown("""

The bert_token_classifier_ner_btc model offers a powerful and effective solution for detecting entities in Twitter texts. Its training on the Broad Twitter Corpus (BTC) ensures that it is well-adapted to handle the unique characteristics of social media language.

With high accuracy in identifying people, locations, and organizations, this model is invaluable for applications ranging from social media monitoring to market research and event detection. Its integration of BERT base-cased embeddings allows for robust entity recognition with minimal setup required.

For anyone looking to enhance their social media analysis capabilities or improve their NLP workflows, leveraging this model can significantly streamline the process of extracting and classifying named entities from Twitter content.

""", unsafe_allow_html=True) # References st.markdown('
References
', unsafe_allow_html=True) st.markdown("""
""", unsafe_allow_html=True) # Community & Support st.markdown('
Community & Support
', unsafe_allow_html=True) st.markdown("""
""", unsafe_allow_html=True)