license: mit
language:
- ur
ayeshasameer/xlm-roberta-roman-urdu-sentiment
Model Description
The ayeshasameer/xlm-roberta-roman-urdu-sentiment model is a fine-tuned version of XLM-RoBERTa, adapted for sentiment analysis of Roman Urdu text (Urdu written in Latin script). XLM-RoBERTa is a multilingual variant of RoBERTa pre-trained on text in about 100 languages, which makes it a versatile base for NLP tasks across languages.
This model is trained to classify Roman Urdu text into three sentiment categories:
- Positive
- Neutral
- Negative
Model Architecture
- Model Type: XLM-RoBERTa
- Number of Layers: 12
- Hidden Size: 768
- Number of Attention Heads: 12
- Intermediate Size: 3072
- Max Position Embeddings: 514
- Vocabulary Size: 250002
- Hidden Activation Function: GELU
- Hidden Dropout Probability: 0.1
- Attention Dropout Probability: 0.1
- Layer Norm Epsilon: 1e-5
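As a rough sanity check on the sizes listed above, the parameter count can be estimated with plain arithmetic. This is a sketch, not a figure reported by the model authors. It assumes a standard RoBERTa-style layout with biases on every linear layer, a single token-type embedding row (TYPE_VOCAB_SIZE = 1), and a two-layer classification head with three labels; none of these details are stated in this card.

```python
# Rough parameter count for a RoBERTa-style encoder with the sizes listed above.
# Assumptions (not stated in the model card): one token-type embedding row,
# biases on every linear layer, and a dense + output classification head.

VOCAB_SIZE = 250002
MAX_POSITIONS = 514
HIDDEN = 768
INTERMEDIATE = 3072
NUM_LAYERS = 12
NUM_LABELS = 3
TYPE_VOCAB_SIZE = 1  # assumption, typical for RoBERTa-style models


def linear(n_in, n_out):
    """Weights plus bias of a fully connected layer."""
    return n_in * n_out + n_out


def layer_norm(size):
    """Scale and shift vectors of a LayerNorm."""
    return 2 * size


def embedding_params():
    """Token, position, and token-type embeddings plus the embedding LayerNorm."""
    rows = VOCAB_SIZE + MAX_POSITIONS + TYPE_VOCAB_SIZE
    return rows * HIDDEN + layer_norm(HIDDEN)


def encoder_layer_params():
    """One transformer layer: Q, K, V, output projections, FFN, two LayerNorms."""
    attention = 4 * linear(HIDDEN, HIDDEN)
    ffn = linear(HIDDEN, INTERMEDIATE) + linear(INTERMEDIATE, HIDDEN)
    return attention + ffn + 2 * layer_norm(HIDDEN)


def classifier_head_params():
    """Dense layer followed by a projection to the sentiment labels."""
    return linear(HIDDEN, HIDDEN) + linear(HIDDEN, NUM_LABELS)


def total_params():
    return (
        embedding_params()
        + NUM_LAYERS * encoder_layer_params()
        + classifier_head_params()
    )


print(f"embeddings: {embedding_params():,}")
print(f"per layer:  {encoder_layer_params():,}")
print(f"total:      {total_params():,}")
```

Under these assumptions the total comes to roughly 278 million parameters, most of them in the token embedding table because of the 250,002-entry vocabulary.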
Training Data
The model was fine-tuned on a dataset of Roman Urdu text, labeled for sentiment analysis. The dataset includes text from social media, news comments, and other sources where Roman Urdu is commonly used. The labels for the dataset were:
- Positive
- Neutral
- Negative
Intended Use
The model is intended for sentiment analysis of Roman Urdu text, which is commonly used in informal settings like social media, chat applications, and user-generated content platforms. It can be used to understand the sentiment behind user comments, reviews, and other forms of text communication.
Example Usage
Here is an example of how to use this model with the Hugging Face Transformers library in Python:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
from scipy.special import softmax

# Load the model and tokenizer
model_name = "ayeshasameer/xlm-roberta-roman-urdu-sentiment"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Preprocess the input text
text = "Mein ek bahut acha insaan hon."
inputs = tokenizer(text, return_tensors="pt")

# Get model predictions (gradients are not needed for inference)
with torch.no_grad():
    outputs = model(**inputs)

# Convert the logits of the single input into probabilities
scores = softmax(outputs.logits[0].numpy())

# Output the sentiment scores
# Assumes label order 0=Negative, 1=Neutral, 2=Positive; check model.config.id2label
sentiment = {
    "Negative": float(scores[0]),
    "Neutral": float(scores[1]),
    "Positive": float(scores[2]),
}
print(sentiment)
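The softmax step in the example turns the model's three raw logits into probabilities that sum to 1. A minimal standard-library sketch of that step, and of picking the most likely label, might look like the following. The helper names softmax and top_label are illustrative, the logits are made-up values rather than real model output, and the label order is the same assumption used in the example above.

```python
import math

# Assumed label order, matching the example above; verify it against the
# model's own configuration before relying on it.
LABELS = ["Negative", "Neutral", "Positive"]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    shift = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_label(logits, labels=LABELS):
    """Return the most probable label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]


# Hypothetical logits for illustration only, not actual model output.
example_logits = [-1.2, 0.3, 2.1]
label, prob = top_label(example_logits)
print(label, round(prob, 3))
```

Reporting the top label together with its probability makes it easier to spot low-confidence predictions than printing the label alone.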
Evaluation
The model was evaluated on a held-out test set of Roman Urdu text. The values below are placeholders (0.XX) that have not been filled in:
- Accuracy: 0.XX
- Precision: 0.XX
- Recall: 0.XX
- F1 Score: 0.XX
Accuracy is the fraction of test examples whose sentiment is classified correctly; precision, recall, and F1 score summarize how well each of the three sentiment classes is identified.
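For reference, the four metrics can be computed from true and predicted labels as sketched below. The card does not say how precision, recall, and F1 were averaged across the three classes, so macro averaging here is an assumption. The function name classification_metrics is illustrative, and the label lists are toy data, not results from this model.

```python
LABELS = ["Negative", "Neutral", "Positive"]


def classification_metrics(y_true, y_pred, labels=LABELS):
    """Accuracy plus macro-averaged precision, recall, and F1."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)

    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)

    n = len(labels)
    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / n,  # macro average (assumption)
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Toy labels for illustration only; these are not outputs of the model.
y_true = ["Positive", "Negative", "Neutral", "Positive", "Negative", "Neutral"]
y_pred = ["Positive", "Negative", "Positive", "Positive", "Neutral", "Neutral"]
print(classification_metrics(y_true, y_pred))
```

Macro averaging weights the three classes equally, which matters when one sentiment is much rarer than the others in the test set.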
Limitations
The model has several known limitations:
- The model may not generalize well to domains or types of text that were not represented in the training data.
- Misclassifications can occur, especially with text that contains sarcasm, slang, or context-specific language that the model was not trained on.
- The model's performance is dependent on the quality and representativeness of the training data.
Ethical Considerations
When using the model, it is essential to consider the ethical implications:
- Ensure that the text being analyzed does not contain sensitive or private information.
- Be mindful of potential biases in the training data, which could affect the model's predictions.
- Use the model responsibly, especially in applications that may impact individuals or communities.