import streamlit as st
import pandas as pd

# Main Title
st.markdown("# NER Model for 10 African Languages")

# What is Named Entity Recognition (NER)?
st.markdown("## What is Named Entity Recognition (NER)?")
st.markdown("""
Named Entity Recognition (NER) is a crucial task in Natural Language Processing (NLP) that involves identifying and classifying entities within a text into predefined categories such as names of people, organizations, locations, dates, and more. NER helps in structuring unstructured text, making it easier to analyze and extract meaningful information.

For example, in the sentence "Barack Obama was born in Hawaii," NER would identify "Barack Obama" as a person (PER) and "Hawaii" as a location (LOC).
""")
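st.markdown("""
As a minimal illustration (a hypothetical sketch, not this model's output format), an NER system maps a sentence to a list of labeled spans:
""")
st.code("""
# Hypothetical NER output for the example sentence above
sentence = "Barack Obama was born in Hawaii."
entities = [
    ("Barack Obama", "PER"),  # person
    ("Hawaii", "LOC"),        # location
]
for chunk, label in entities:
    print(f"{chunk} -> {label}")
""", language="python")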
""", unsafe_allow_html=True) # Importance of NER st.markdown('
Importance of NER
', unsafe_allow_html=True) st.markdown("""

NER is essential for various applications, including:

""", unsafe_allow_html=True) # Description st.markdown('
Description
', unsafe_allow_html=True) st.markdown("""

This model is imported from Hugging Face. It’s been trained using xlm_roberta_large fine-tuned model on 10 African languages: Amharic, Hausa, Igbo, Kinyarwanda, Luganda, Nigerian Pidgin, Swahili, Wolof, and Yorùbá.

# Predicted Entities
st.markdown("## Predicted Entities")
st.markdown("""
- **PER** – persons
- **ORG** – organizations
- **LOC** – locations
- **DATE** – dates
""")
# How to use
st.markdown("## How to use")
st.markdown("""
To use this model, follow these steps in Python:
""")
""", unsafe_allow_html=True) st.code(""" from sparknlp.base import * from sparknlp.annotator import * from pyspark.ml import Pipeline # Define the components of the pipeline documentAssembler = DocumentAssembler() \\ .setInputCol("text") \\ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") \\ .setInputCols(["document"]) \\ .setOutputCol("sentence") tokenizer = Tokenizer() \\ .setInputCols(["sentence"]) \\ .setOutputCol("token") tokenClassifier = XlmRoBertaForTokenClassification.pretrained("xlm_roberta_large_token_classifier_masakhaner", "xx") \\ .setInputCols(["sentence",'token']) \\ .setOutputCol("ner") ner_converter = NerConverter() \\ .setInputCols(["sentence", "token", "ner"]) \\ .setOutputCol("ner_chunk") nlpPipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier, ner_converter]) empty_data = spark.createDataFrame([[""]]).toDF("text") model = nlpPipeline.fit(empty_data) text = '''አህመድ ቫንዳ ከ3-10-2000 ጀምሮ በአዲስ አበባ ኖሯል።''' result = model.transform(spark.createDataFrame([[text]]).toDF("text")) # Display the results result.selectExpr("explode(arrays_zip(ner_chunk.result, ner_chunk.metadata)) as entity") .selectExpr("entity['0'] as chunk", "entity['1'].entity as ner_label") .show(truncate=False) """, language="python") # Results import pandas as pd # Create the data for the DataFrame data = { "chunk": ["አህመድ ቫንዳ", "ከ3-10-2000 ጀምሮ", "በአዲስ አበባ"], "ner_label": ["PER", "DATE", "LOC"] } # Creating the DataFrame df = pd.DataFrame(data) df.index += 1 st.dataframe(df) # What Can We Do with This Model? st.markdown('
# What Can We Do with This Model?
st.markdown("## What Can We Do with This Model?")
st.markdown("""
This NER model for 10 African languages enables a variety of applications:

- **Multilingual information extraction** – pull people, organizations, locations, and dates out of African-language text
- **News and media monitoring** – track which entities appear in articles across the supported languages
- **Knowledge base construction** – populate structured databases from unstructured documents
- **Low-resource NLP research** – provide a strong baseline for entity recognition in underrepresented languages
""")
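st.markdown("""
As a small, self-contained example of the kind of analysis this enables, extracted entities can be aggregated by label with pandas (illustrative data only):
""")
st.code("""
import pandas as pd

# Entities extracted from a (hypothetical) batch of documents
extracted = pd.DataFrame({
    "chunk": ["አህመድ ቫንዳ", "ከ3-10-2000 ጀምሮ", "በአዲስ አበባ", "አዲስ አበባ"],
    "ner_label": ["PER", "DATE", "LOC", "LOC"],
})

# Count how often each entity type occurs
print(extracted["ner_label"].value_counts())
""", language="python")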
""", unsafe_allow_html=True) # Model Information st.markdown('
Model Information
', unsafe_allow_html=True) st.markdown("""
""", unsafe_allow_html=True) # Data Source st.markdown('
Data Source
', unsafe_allow_html=True) st.markdown("""

The model was trained using the dataset available at Hugging Face.

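st.markdown("""
For reference, the training data can be inspected with the Hugging Face `datasets` library (a sketch assuming the `masakhaner` dataset id and its per-language configs, e.g. `amh` for Amharic):
""")
st.code("""
from datasets import load_dataset

# Load the Amharic portion of MasakhaNER (assumed dataset id and config name)
masakhaner_amh = load_dataset("masakhaner", "amh")
print(masakhaner_amh["train"][0])  # tokens with their NER tag ids
""", language="python")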
""", unsafe_allow_html=True) # Benchmarking st.markdown('
Benchmarking
', unsafe_allow_html=True) st.markdown("""

Evaluating the performance of NER models is crucial to understanding their effectiveness in real-world applications. Below are the benchmark results for the xlm_roberta_large_token_classifier_masakhaner model, focusing on various named entity categories across 10 African languages. The metrics used include F1-score, which is a standard for evaluating classification models.

""", unsafe_allow_html=True) st.markdown(""" --- | language | F1-score | |----------|----------| | amh | 75.76 | | hau | 91.75 | | ibo | 86.26 | | kin | 76.38 | | lug | 84.64 | | luo | 80.65 | | pcm | 89.55 | | swa | 89.48 | | wol | 70.70 | | yor | 82.05 | --- """, unsafe_allow_html=True) st.markdown("""

These results demonstrate the model's ability to accurately identify and classify named entities in multiple African languages. The F1-scores indicate the balance between precision and recall for each language, reflecting the model's robustness across diverse linguistic contexts.

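st.markdown("""
For readers unfamiliar with the metric, F1 is the harmonic mean of precision and recall, as the quick illustration below shows:
""")
st.code("""
# F1 is the harmonic mean of precision and recall
def f1_score(precision: float, recall: float) -> float:
    return 2 * precision * recall / (precision + recall)

# e.g. precision 0.90 and recall 0.85 give F1 ~ 0.874
print(round(f1_score(0.90, 0.85), 3))
""", language="python")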
""", unsafe_allow_html=True) # Conclusion/Summary st.markdown('
Conclusion
', unsafe_allow_html=True) st.markdown("""

The xlm_roberta_large_token_classifier_masakhaner model showcases significant performance in recognizing named entities in 10 African languages. This model leverages xlm_roberta_large embeddings to enhance its understanding and accuracy in identifying entities such as persons, locations, dates, and organizations. Its integration into Spark NLP provides an efficient and scalable solution for processing multilingual text data, making it an invaluable tool for researchers and developers working with African languages.

""", unsafe_allow_html=True) # References st.markdown('
References
', unsafe_allow_html=True) st.markdown("""
""", unsafe_allow_html=True) # Community & Support st.markdown('
Community & Support
', unsafe_allow_html=True) st.markdown("""
""", unsafe_allow_html=True)