license: mit
language:
  - ur

ayeshasameer/xlm-roberta-roman-urdu-sentiment

Model Description

The ayeshasameer/xlm-roberta-roman-urdu-sentiment model is a fine-tuned version of XLM-RoBERTa, adapted for sentiment analysis of Roman Urdu text (Urdu written in the Latin script). XLM-RoBERTa is a multilingual variant of RoBERTa pre-trained on text in about 100 languages, which makes it a practical starting point for NLP tasks in languages beyond English.

This model is trained to classify Roman Urdu text into three sentiment categories:

  • Positive
  • Neutral
  • Negative

Model Architecture

  • Model Type: XLM-RoBERTa
  • Number of Layers: 12
  • Hidden Size: 768
  • Number of Attention Heads: 12
  • Intermediate Size: 3072
  • Max Position Embeddings: 514
  • Vocabulary Size: 250002
  • Hidden Activation Function: GELU
  • Hidden Dropout Probability: 0.1
  • Attention Dropout Probability: 0.1
  • Layer Norm Epsilon: 1e-5
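
To give a feel for what these hyperparameters imply, the following sketch estimates the parameter count from the values listed above. It is a back-of-the-envelope calculation, not a dump of the actual checkpoint. `TYPE_VOCAB`, `NUM_LABELS`, and the shape of the classification head (one hidden dense layer plus an output projection) are assumptions that this card does not state.

```python
# Hyperparameters copied from the architecture list above.
NUM_LAYERS = 12
HIDDEN = 768
INTERMEDIATE = 3072
MAX_POSITIONS = 514
VOCAB = 250002

# Assumptions (not stated in this model card):
TYPE_VOCAB = 1   # a single token-type embedding
NUM_LABELS = 3   # Positive / Neutral / Negative


def linear(n_in: int, n_out: int) -> int:
    """Parameters of a dense layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out


def layer_norm(size: int) -> int:
    """Parameters of a LayerNorm: one scale and one shift per feature."""
    return 2 * size


def embedding_params() -> int:
    # Word, position, and token-type tables, followed by one LayerNorm.
    return (VOCAB + MAX_POSITIONS + TYPE_VOCAB) * HIDDEN + layer_norm(HIDDEN)


def encoder_layer_params() -> int:
    attention = 4 * linear(HIDDEN, HIDDEN)  # query, key, value, output
    feed_forward = linear(HIDDEN, INTERMEDIATE) + linear(INTERMEDIATE, HIDDEN)
    return attention + feed_forward + 2 * layer_norm(HIDDEN)


def classifier_head_params() -> int:
    # Assumed head: hidden dense layer, then a projection to the labels.
    return linear(HIDDEN, HIDDEN) + linear(HIDDEN, NUM_LABELS)


def total_params() -> int:
    return (embedding_params()
            + NUM_LAYERS * encoder_layer_params()
            + classifier_head_params())


print(f"embeddings:        {embedding_params():,}")
print(f"per encoder layer: {encoder_layer_params():,}")
print(f"total (estimated): {total_params():,}")
```

Under these assumptions the estimate comes to roughly 278 million parameters, with about 192 million of them in the embedding tables because of the large multilingual vocabulary.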

Training Data

The model was fine-tuned on a dataset of Roman Urdu text labeled for sentiment. The dataset includes text from social media, news comments, and other sources where Roman Urdu is commonly used. Each example carries one of three labels:

  • Positive
  • Neutral
  • Negative

Intended Use

The model is intended for sentiment analysis of Roman Urdu text, which is commonly used in informal settings like social media, chat applications, and user-generated content platforms. It can be used to understand the sentiment behind user comments, reviews, and other forms of text communication.

Example Usage

Here is an example of how to use this model with the Hugging Face Transformers library in Python:

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the fine-tuned model and its tokenizer from the Hugging Face Hub
model_name = "ayeshasameer/xlm-roberta-roman-urdu-sentiment"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Tokenize the input text, truncating anything beyond the model's input limit
text = "Mein ek bahut acha insaan hon."
inputs = tokenizer(text, return_tensors="pt", truncation=True)

# Run a forward pass without tracking gradients (inference only)
with torch.no_grad():
    logits = model(**inputs).logits

# Convert the logits of the single input into probabilities
scores = torch.softmax(logits, dim=-1)[0].tolist()

# Output the sentiment scores (index order as given in this model card)
sentiment = {
    "Negative": scores[0],
    "Neutral": scores[1],
    "Positive": scores[2]
}
print(sentiment)
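
The post-processing step in the example above (logits to probabilities to a label) can be illustrated without downloading the model. This is a minimal sketch in plain Python. The logits are made up for illustration, and the label order is the one shown in the example, which is worth confirming against the model's own `id2label` mapping in its config.

```python
import math

# Label order as listed in the example above; treat it as an assumption
# until checked against the model's config.
LABELS = ["Negative", "Neutral", "Positive"]


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtracting the maximum avoids overflow in exp()
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_label(logits, labels=LABELS):
    """Return the most probable label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]


# Hypothetical logits, invented for this illustration only
example_logits = [-1.0, 0.5, 2.0]
label, prob = top_label(example_logits)
print(label, round(prob, 3))
```

Reporting the winning probability alongside the label makes it easier to spot low-confidence predictions, for example by routing anything below a chosen threshold to manual review.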

Evaluation

The model was evaluated on a held-out test set of Roman Urdu text and achieved the following performance metrics:

  • Accuracy: 0.XX
  • Precision: 0.XX
  • Recall: 0.XX
  • F1 Score: 0.XX

Accuracy is the fraction of test examples classified correctly, while precision, recall, and F1 score describe how reliably each sentiment class is identified.
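
For readers who want to compute these metrics on their own labeled data, here is a minimal sketch in plain Python. This card does not say how precision, recall, and F1 were averaged across the three classes, so macro-averaging is an assumption here. The `y_true` and `y_pred` lists are made-up illustrations, not results from this model.

```python
LABELS = ["Negative", "Neutral", "Positive"]


def accuracy(y_true, y_pred):
    """Fraction of examples whose predicted label equals the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def per_class_scores(y_true, y_pred, label):
    """Precision, recall, and F1 for one class, treated as one-vs-rest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def macro_scores(y_true, y_pred, labels=LABELS):
    """Unweighted mean of per-class precision, recall, and F1."""
    rows = [per_class_scores(y_true, y_pred, label) for label in labels]
    return tuple(sum(col) / len(rows) for col in zip(*rows))


# Made-up labels for illustration only
y_true = ["Positive", "Negative", "Neutral", "Positive", "Negative", "Neutral"]
y_pred = ["Positive", "Negative", "Positive", "Positive", "Neutral", "Neutral"]

acc = accuracy(y_true, y_pred)
macro_p, macro_r, macro_f1 = macro_scores(y_true, y_pred)
print(f"accuracy={acc:.3f} precision={macro_p:.3f} "
      f"recall={macro_r:.3f} f1={macro_f1:.3f}")
```

Macro-averaging weights every class equally, which keeps a rare class from being hidden by a dominant one. If the test set is heavily imbalanced, a weighted average may be the more appropriate summary.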

Limitations

While the model performs well on the provided dataset, there are some limitations:

  • The model may not generalize well to domains or types of text that were not represented in the training data.
  • Misclassifications can occur, especially with text that contains sarcasm, slang, or context-specific language that the model was not trained on.
  • The model's performance is dependent on the quality and representativeness of the training data.

Ethical Considerations

When using the model, it is essential to consider the ethical implications:

  • Ensure that the text being analyzed does not contain sensitive or private information.
  • Be mindful of potential biases in the training data, which could affect the model's predictions.
  • Use the model responsibly, especially in applications that may impact individuals or communities.