---
datasets:
- msislam/marc-code-mixed-small
language:
- de
- en
- es
- fr
metrics:
- seqeval
widget:
- text: >-
    Hala Madrid y nada más. It means Go Madrid and nothing more.
- text: >-
    Hallo, Guten Tag! how are you?
- text: >-
    Sie sind gut. How about you? Comment va ta mère? And what about your school? Estoy aprendiendo español. Thanks.
---
# Code-Mixed Language Detection using XLM-RoBERTa
## Description
This model detects the languages in code-mixed text, together with the boundaries between them, by classifying each token. It currently supports German (DE), English (EN), Spanish (ES), and French (FR). The model is fine-tuned from [xlm-roberta-base](https://huggingface.co/xlm-roberta-base).
## Training Dataset
The training dataset is based on [The Multilingual Amazon Reviews Corpus](https://huggingface.co/datasets/amazon_reviews_multi). The preprocessed dataset can be found [here](https://huggingface.co/datasets/msislam/marc-code-mixed-small).
## Results
```python
{
    'DE': {'precision': 0.9870741390453328,
           'recall': 0.9883516686696866,
           'f1': 0.9877124907612713},
    'EN': {'precision': 0.9901617633147289,
           'recall': 0.9914748508098892,
           'f1': 0.9908178720181748},
    'ES': {'precision': 0.9912407007439404,
           'recall': 0.9912407007439404,
           'f1': 0.9912407007439406},
    'FR': {'precision': 0.9872469872469872,
           'recall': 0.9871314927468414,
           'f1': 0.9871892366188945},
    'overall_precision': 0.9888723454274744,
    'overall_recall': 0.9895702634880803,
    'overall_f1': 0.9892211813585232,
    'overall_accuracy': 0.9993651810717168,
}
```
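For readers who want to sanity-check the numbers, the sketch below recomputes each F1 score as the harmonic mean of the reported precision and recall. The helper name `f1_score` and the `reported` dictionary are illustrative and not part of the model's code; the values are copied from the results above.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall, reported f1) copied from the results above
reported = {
    "DE": (0.9870741390453328, 0.9883516686696866, 0.9877124907612713),
    "EN": (0.9901617633147289, 0.9914748508098892, 0.9908178720181748),
    "ES": (0.9912407007439404, 0.9912407007439404, 0.9912407007439406),
    "FR": (0.9872469872469872, 0.9871314927468414, 0.9871892366188945),
    "overall": (0.9888723454274744, 0.9895702634880803, 0.9892211813585232),
}

for name, (precision, recall, f1) in reported.items():
    # Compare with a small tolerance to absorb floating-point rounding
    matches = abs(f1_score(precision, recall) - f1) < 1e-9
    print(f"{name}: recomputed f1 matches reported value: {matches}")
```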
## Usage
The model can be used as follows:
```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("msislam/code-mixed-language-detection-XLMRoberta")
model = AutoModelForTokenClassification.from_pretrained("msislam/code-mixed-language-detection-XLMRoberta")

text = 'Hala Madrid y nada más. It means Go Madrid and nothing more.'
# Tokenize without special tokens so every prediction belongs to a real subword
inputs = tokenizer(text, add_special_tokens=False, return_tensors="pt")

# Inference only, so gradients are not needed
with torch.no_grad():
    logits = model(**inputs).logits

# Take the highest-scoring label id per token and map it to its language tag
labels_predicted = logits.argmax(-1)
lang_tag_predicted = [model.config.id2label[t.item()] for t in labels_predicted[0]]
lang_tag_predicted
```
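The usage snippet returns one tag per subword token. To recover the language boundaries, consecutive tokens with the same language can be merged into spans. The following is a minimal sketch, not part of the model's code. It assumes that tokens use the SentencePiece word-start marker `▁` (as XLM-RoBERTa tokenizers do) and that tags may carry an IOB prefix such as `B-` or `I-`. The exact label names are an assumption, so check `model.config.id2label`. The helper names `language_of` and `group_language_spans` and the sample tokens and tags are hypothetical.

```python
from itertools import groupby

def language_of(tag):
    """Drop an optional IOB prefix ("B-" or "I-"), e.g. "B-ES" -> "ES"."""
    return tag[2:] if tag[:2] in ("B-", "I-") else tag

def group_language_spans(tokens, tags):
    """Merge consecutive tokens of the same language into (language, text) spans."""
    pairs = zip(tokens, (language_of(tag) for tag in tags))
    spans = []
    for lang, group in groupby(pairs, key=lambda pair: pair[1]):
        pieces = "".join(token for token, _ in group)
        # SentencePiece marks word starts with "▁"; turn those back into spaces
        spans.append((lang, pieces.replace("▁", " ").strip()))
    return spans

# Hand-written tokens and tags for illustration only (not real model output)
tokens = ["▁Hala", "▁Madrid", ".", "▁It", "▁means", "▁Go"]
tags = ["B-ES", "I-ES", "I-ES", "B-EN", "I-EN", "I-EN"]
print(group_language_spans(tokens, tags))
```

With real predictions, the token strings can be obtained by passing the `input_ids` from the tokenizer output to `tokenizer.convert_ids_to_tokens`, and the tags are the `lang_tag_predicted` list.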