---
datasets:
- msislam/marc-code-mixed-small
language:
- de
- en
- es
- fr
metrics:
- seqeval
widget:
- text: >-
    Hala Madrid y nada más. It means Go Madrid and nothing more.
- text: >-
    Hallo, Guten Tag! How are you?
- text: >-
    Sie sind gut. How about you? Comment va ta mère? And what about your
    school? Estoy aprendiendo español. Thanks.
---

# Code-Mixed Language Detection using XLM-RoBERTa

## Description

This model detects the languages present in code-mixed text, along with their boundaries, by classifying each token. It currently supports German (DE), English (EN), Spanish (ES), and French (FR). The model is fine-tuned from [xlm-roberta-base](https://huggingface.co/xlm-roberta-base).

## Training Dataset

The training dataset is based on [The Multilingual Amazon Reviews Corpus](https://huggingface.co/datasets/amazon_reviews_multi). The preprocessed dataset used to train, validate, and test this model can be found [here](https://huggingface.co/datasets/msislam/marc-code-mixed-small); a short loading snippet appears at the end of this card.

## Results

```python
'DE': {'precision': 0.9870741390453328, 'recall': 0.9883516686696866, 'f1': 0.9877124907612713}
'EN': {'precision': 0.9901617633147289, 'recall': 0.9914748508098892, 'f1': 0.9908178720181748}
'ES': {'precision': 0.9912407007439404, 'recall': 0.9912407007439404, 'f1': 0.9912407007439406}
'FR': {'precision': 0.9872469872469872, 'recall': 0.9871314927468414, 'f1': 0.9871892366188945}
'overall_precision': 0.9888723454274744
'overall_recall': 0.9895702634880803
'overall_f1': 0.9892211813585232
'overall_accuracy': 0.9993651810717168
```

## Usage

The model can be used as follows:

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("msislam/code-mixed-language-detection-XLMRoberta")
model = AutoModelForTokenClassification.from_pretrained("msislam/code-mixed-language-detection-XLMRoberta")

text = 'Hala Madrid y nada más. It means Go Madrid and nothing more.'

# Tokenize without special tokens so every prediction corresponds to a text token
inputs = tokenizer(text, add_special_tokens=False, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Take the highest-scoring label for each token and map label ids to language tags
labels_predicted = logits.argmax(-1)
lang_tag_predicted = [model.config.id2label[t.item()] for t in labels_predicted[0]]

print(lang_tag_predicted)
```
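
Because predictions are made per subword token, recovering the language boundaries mentioned above means merging consecutive tokens that received the same tag. Below is a minimal sketch of one way to do that, using the fast tokenizer's offset mapping to slice spans out of the original string. `detect_language_spans` is an illustrative helper, not part of the released model API, and it assumes the `id2label` values are the plain language codes shown in the Results section (if the model emits BIO-style tags such as `B-EN`, strip the prefix first).

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("msislam/code-mixed-language-detection-XLMRoberta")
model = AutoModelForTokenClassification.from_pretrained("msislam/code-mixed-language-detection-XLMRoberta")

def detect_language_spans(text):
    # Hypothetical helper: offset_mapping gives the (start, end) character
    # positions of each token, so spans can be sliced from the original string.
    inputs = tokenizer(text, add_special_tokens=False, return_tensors="pt",
                       return_offsets_mapping=True)
    offsets = inputs.pop("offset_mapping")[0].tolist()

    with torch.no_grad():
        logits = model(**inputs).logits
    tags = [model.config.id2label[i.item()] for i in logits.argmax(-1)[0]]

    # Merge runs of consecutive tokens that share the same language tag
    spans = []
    for tag, (start, end) in zip(tags, offsets):
        if spans and spans[-1][0] == tag:
            spans[-1][2] = end  # extend the current span to this token's end
        else:
            spans.append([tag, start, end])
    return [(tag, text[start:end]) for tag, start, end in spans]

print(detect_language_spans('Hala Madrid y nada más. It means Go Madrid and nothing more.'))
# Illustrative output: [('ES', 'Hala Madrid y nada más.'), ('EN', 'It means Go Madrid and nothing more.')]
```

Working with character offsets keeps the spans aligned with the original input and avoids reassembling SentencePiece pieces by hand.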
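
For reference, the preprocessed dataset linked in the Training Dataset section can be loaded directly from the Hub. This is a minimal sketch that only inspects the dataset, without assuming its split or feature names:

```python
from datasets import load_dataset

# Load the preprocessed code-mixed corpus used to fine-tune this model
ds = load_dataset("msislam/marc-code-mixed-small")
print(ds)  # shows the available splits and their features
```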