README.md · tykea/khmer-fasttext-sentiment-analysis at main

metadata

license: apache-2.0
language:
  - km
metrics:
  - accuracy
base_model:
  - facebook/fasttext-km-vectors
pipeline_tag: text-classification
library_name: fasttext

This model is a fine-tuned version of the FastText KM model for sentiment analysis. It classifies Khmer text into two categories: Positive and Negative.

Task: Sentiment analysis (binary classification).
Languages Supported: Khmer.
Intended Use Cases:
- Analyzing customer reviews.
- Social media sentiment detection.
Limitations:
- Performance may degrade on languages or domains not present in the training data.
- Does not handle sarcasm or highly ambiguous inputs well.

The model was evaluated on a test set of 400 samples, achieving the following performance:
Test Accuracy: 81%
Precision: 81%
Recall: 81%
F1 Score: 81%

Confusion Matrix:

Predicted\Actual	Negative	Positive
Negative	165	44
Positive	31	160
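
The reported figures can be re-derived from the confusion matrix above. The sketch below assumes that rows are predictions and columns are actual labels (as the table header suggests) and that the precision, recall, and F1 values are macro averages over the two classes; the model card does not state which averaging was used, so treat that as an assumption.

```python
# Confusion matrix from the model card.
# Outer key: predicted label, inner key: actual label.
confusion = {
    "negative": {"negative": 165, "positive": 44},
    "positive": {"negative": 31, "positive": 160},
}
labels = ["negative", "positive"]


def per_class_metrics(confusion, label):
    """Return (precision, recall, f1) for one class."""
    true_positives = confusion[label][label]
    predicted_total = sum(confusion[label].values())            # row sum
    actual_total = sum(confusion[p][label] for p in confusion)  # column sum
    precision = true_positives / predicted_total
    recall = true_positives / actual_total
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


total = sum(sum(row.values()) for row in confusion.values())
accuracy = sum(confusion[l][l] for l in labels) / total

per_class = {l: per_class_metrics(confusion, l) for l in labels}
# Macro average (assumed): unweighted mean of the per-class values.
macro = [sum(m[i] for m in per_class.values()) / len(labels) for i in range(3)]

print(f"samples: {total}")
print(f"accuracy: {accuracy:.4f}")
for name, value in zip(("precision", "recall", "f1"), macro):
    print(f"macro {name}: {value:.4f}")
```

Under these assumptions the matrix sums to the 400 test samples, and accuracy and the three macro-averaged scores all round to 81%.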

The model supports a maximum sequence length of 512 tokens.
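
One way to stay within that limit is to truncate the token list before joining it into the space-separated string passed to the model. This is a minimal sketch: it assumes the limit counts word-level tokens such as those produced by a Khmer word tokenizer, which the model card does not specify, and `truncate_tokens` and `MAX_TOKENS` are hypothetical names introduced here for illustration.

```python
MAX_TOKENS = 512  # limit stated in the model card


def truncate_tokens(tokens, max_tokens=MAX_TOKENS):
    """Keep at most max_tokens tokens and join them with single spaces."""
    return " ".join(tokens[:max_tokens])


# Example with placeholder tokens; in practice the list would come from a tokenizer.
long_input = ["token"] * 600
tokenized_text = truncate_tokens(long_input)
print(len(tokenized_text.split()))
```

Truncating from the end keeps the beginning of the text; whether the beginning or the end carries more sentiment signal depends on the data.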

How to Use

from huggingface_hub import hf_hub_download
import fasttext
from khmernltk import word_tokenize

model = fasttext.load_model(hf_hub_download("tykea/khmer-fasttext-sentiment-analysis", "model.bin"))

def predict(text):
    # Tokenize the text
    tokens = word_tokenize(text)
    # Join tokens back into a single string
    tokenized_text = ' '.join(tokens)
    # Make predictions
    predictions = model.predict(tokenized_text)
    # Map labels to human-readable format
    label_mapping = {
        '__label__0': 'negative',
        '__label__1': 'positive'
    }
    # Get the predicted label
    predicted_label = predictions[0][0]
    # Map the predicted label
    human_readable_label = label_mapping.get(predicted_label, 'unknown')
    return human_readable_label

# Print the result so it is visible when run as a script
print(predict('នេះគីជាល្បះអវិជ្ជមានសម្រាប់ប្រជាជនខ្មែរ'))
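
fastText's `predict` returns a probability alongside each label, which can help flag low-confidence predictions. The following is a minimal sketch rather than part of the original usage example: `predict_with_confidence` is a hypothetical helper name, and it expects an already loaded fastText model and an already tokenized, space-joined string.

```python
LABEL_MAPPING = {"__label__0": "negative", "__label__1": "positive"}


def predict_with_confidence(model, tokenized_text):
    """Return (label, probability) for the top prediction.

    `model` is a loaded fastText model; `tokenized_text` is a single line of
    space-separated tokens (fastText does not accept newlines in the input).
    """
    labels, probabilities = model.predict(tokenized_text, k=1)
    label = LABEL_MAPPING.get(labels[0], "unknown")
    return label, float(probabilities[0])


# Usage, assuming `model` and `word_tokenize` are set up as in the example above:
# label, confidence = predict_with_confidence(model, " ".join(word_tokenize(text)))
```

A caller could then treat predictions below a chosen probability threshold as uncertain; any such threshold would need to be tuned on held-out data.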
