metadata
license: apache-2.0
datasets:
- common_language
language:
- ar
- eu
- br
- ca
- zh
- cv
- cs
- nl
- en
- eo
- et
- fr
- ka
- de
- el
- id
- ia
- it
- ja
- rw
- ky
- lv
- mt
- mn
- fa
- pl
- pt
- ro
- rm
- ru
- sl
- es
- sv
- ta
- tt
- tr
- uk
- cy
metrics:
- accuracy
- precision
- recall
- f1
tags:
- language-detection
- Frisian
- Dhivehi
- Hakha_Chin
- Kabyle
- Sakha
Overview
This model detects 45 languages. It was fine-tuned from the multilingual-e5-base model on the common_language dataset.
The overall accuracy on the test set is 98.37%; per-language evaluation results are shown below.
Download the model
from transformers import AutoTokenizer, AutoModelForSequenceClassification
tokenizer = AutoTokenizer.from_pretrained('Mike0307/multilingual-e5-language-detection')
model = AutoModelForSequenceClassification.from_pretrained('Mike0307/multilingual-e5-language-detection', num_labels=45)
Example of language detection
import torch
languages = [
"Arabic", "Basque", "Breton", "Catalan", "Chinese_China", "Chinese_Hongkong",
"Chinese_Taiwan", "Chuvash", "Czech", "Dhivehi", "Dutch", "English",
"Esperanto", "Estonian", "French", "Frisian", "Georgian", "German", "Greek",
"Hakha_Chin", "Indonesian", "Interlingua", "Italian", "Japanese", "Kabyle",
"Kinyarwanda", "Kyrgyz", "Latvian", "Maltese", "Mongolian", "Persian", "Polish",
"Portuguese", "Romanian", "Romansh_Sursilvan", "Russian", "Sakha", "Slovenian",
"Spanish", "Swedish", "Tamil", "Tatar", "Turkish", "Ukranian", "Welsh"
]
def predict(text, model, tokenizer, device=torch.device('cpu')):
    """Return a (batch, num_labels) tensor of class probabilities for `text`."""
    model.to(device)
    model.eval()
    # Pad or truncate every input to 128 tokens.
    tokenized = tokenizer(text, padding='max_length', truncation=True, max_length=128, return_tensors="pt")
    input_ids = tokenized['input_ids']
    attention_mask = tokenized['attention_mask']
    with torch.no_grad():
        input_ids = input_ids.to(device)
        attention_mask = attention_mask.to(device)
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
    logits = outputs.logits
    # Softmax over the label dimension turns logits into probabilities.
    probabilities = torch.nn.functional.softmax(logits, dim=1)
    return probabilities

def get_topk(probabilities, languages, k=3):
    """Return the k highest probabilities and their language names for the first input."""
    topk_prob, topk_indices = torch.topk(probabilities, k)
    topk_prob = topk_prob.cpu().numpy()[0].tolist()
    topk_indices = topk_indices.cpu().numpy()[0].tolist()
    topk_labels = [languages[index] for index in topk_indices]
    return topk_prob, topk_labels
text = "你的測試句子"
probabilities = predict(text, model, tokenizer)
topk_prob, topk_labels = get_topk(probabilities, languages)
print(topk_prob, topk_labels)
# [0.999620258808, 0.00025940246996469, 2.7690215574693e-05]
# ['Chinese_Taiwan', 'Chinese_Hongkong', 'Chinese_China']
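In practice it can help to reject low-confidence predictions instead of always taking the top label. The sketch below is not part of the model's API: `pick_language`, `min_confidence`, and `fallback` are hypothetical names, and the 0.5 threshold is an arbitrary assumption that should be tuned on your own data. It works on the plain lists returned by `get_topk`, which are sorted from most to least probable.

```python
def pick_language(topk_prob, topk_labels, min_confidence=0.5, fallback="unknown"):
    """Return the most probable label, or `fallback` if it is not confident enough.

    Hypothetical helper: expects the two lists returned by get_topk,
    ordered from highest to lowest probability.
    """
    if not topk_prob or not topk_labels:
        return fallback
    best_prob, best_label = topk_prob[0], topk_labels[0]
    return best_label if best_prob >= min_confidence else fallback


# Values taken from the example output above.
confident = pick_language(
    [0.999620258808, 0.00025940246996469, 2.7690215574693e-05],
    ["Chinese_Taiwan", "Chinese_Hongkong", "Chinese_China"],
)

# Made-up probabilities for an ambiguous input.
ambiguous = pick_language(
    [0.40, 0.35, 0.25],
    ["Spanish", "Portuguese", "Catalan"],
)

print(confident, ambiguous)
```

With the values from the example, the first call keeps `Chinese_Taiwan`, while the made-up ambiguous case falls back to `unknown`.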
Evaluation Results
The test dataset is the test split of the common_language dataset.
language | precision | recall | f1-score | support |
---|---|---|---|---|
Arabic | 1.00 | 1.00 | 1.00 | 151 |
Basque | 0.99 | 1.00 | 1.00 | 111 |
Breton | 1.00 | 0.90 | 0.95 | 252 |
Catalan | 0.96 | 0.99 | 0.97 | 96 |
Chinese_China | 0.98 | 1.00 | 0.99 | 100 |
Chinese_Hongkong | 0.97 | 0.87 | 0.92 | 115 |
Chinese_Taiwan | 0.92 | 0.98 | 0.95 | 170 |
Chuvash | 0.98 | 1.00 | 0.99 | 137 |
Czech | 0.98 | 1.00 | 0.99 | 128 |
Dhivehi | 1.00 | 1.00 | 1.00 | 111 |
Dutch | 0.99 | 1.00 | 0.99 | 144 |
English | 0.96 | 1.00 | 0.98 | 98 |
Esperanto | 0.98 | 0.98 | 0.98 | 107 |
Estonian | 1.00 | 0.99 | 0.99 | 93 |
French | 0.95 | 1.00 | 0.98 | 106 |
Frisian | 1.00 | 0.98 | 0.99 | 117 |
Georgian | 1.00 | 1.00 | 1.00 | 110 |
German | 1.00 | 1.00 | 1.00 | 101 |
Greek | 1.00 | 1.00 | 1.00 | 153 |
Hakha_Chin | 0.99 | 1.00 | 0.99 | 202 |
Indonesian | 0.99 | 0.99 | 0.99 | 150 |
Interlingua | 0.96 | 0.97 | 0.96 | 182 |
Italian | 0.99 | 0.94 | 0.96 | 100 |
Japanese | 1.00 | 1.00 | 1.00 | 144 |
Kabyle | 1.00 | 0.96 | 0.98 | 156 |
Kinyarwanda | 0.97 | 1.00 | 0.99 | 103 |
Kyrgyz | 0.98 | 1.00 | 0.99 | 129 |
Latvian | 0.98 | 0.98 | 0.98 | 171 |
Maltese | 0.99 | 0.98 | 0.98 | 152 |
Mongolian | 1.00 | 1.00 | 1.00 | 112 |
Persian | 1.00 | 1.00 | 1.00 | 123 |
Polish | 0.91 | 0.99 | 0.95 | 128 |
Portuguese | 0.94 | 0.99 | 0.96 | 124 |
Romanian | 1.00 | 1.00 | 1.00 | 152 |
Romansh_Sursilvan | 0.99 | 0.95 | 0.97 | 106 |
Russian | 0.99 | 0.99 | 0.99 | 100 |
Sakha | 0.99 | 1.00 | 1.00 | 105 |
Slovenian | 0.99 | 1.00 | 1.00 | 166 |
Spanish | 0.96 | 0.95 | 0.95 | 94 |
Swedish | 0.99 | 1.00 | 0.99 | 190 |
Tamil | 1.00 | 1.00 | 1.00 | 135 |
Tatar | 1.00 | 0.96 | 0.98 | 173 |
Turkish | 1.00 | 1.00 | 1.00 | 137 |
Ukranian | 0.99 | 1.00 | 1.00 | 126 |
Welsh | 0.98 | 1.00 | 0.99 | 103 |
macro avg | 0.98 | 0.99 | 0.98 | 5963 |
weighted avg | 0.98 | 0.98 | 0.98 | 5963 |
overall accuracy | | | 0.9837 | 5963 |
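To make the columns easier to read, here is a minimal sketch of how an F1 score follows from precision and recall, and how a macro average differs from a support-weighted average. The helper names (`f1_score`, `macro_average`, `weighted_average`) are illustrative, not part of any library used above. The inputs are the rounded values from three rows of the table, so results can differ from the reported figures in the last digit.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_average(values):
    """Unweighted mean: every language counts equally."""
    return sum(values) / len(values)


def weighted_average(values, supports):
    """Mean weighted by the number of test samples per language."""
    return sum(v * n for v, n in zip(values, supports)) / sum(supports)


# (precision, recall, support) copied from three rows of the table.
rows = {
    "Breton": (1.00, 0.90, 252),
    "Chinese_Hongkong": (0.97, 0.87, 115),
    "Polish": (0.91, 0.99, 128),
}

f1_by_language = {name: f1_score(p, r) for name, (p, r, _) in rows.items()}
supports = [n for (_, _, n) in rows.values()]

for name, f1 in f1_by_language.items():
    print(f"{name}: f1 = {f1:.2f}")

print("macro avg:", macro_average(list(f1_by_language.values())))
print("weighted avg:", weighted_average(list(f1_by_language.values()), supports))
```

For these three rows the recomputed F1 values round to 0.95, 0.92, and 0.95, matching the table. The averages printed here cover only this three-language subset, not the full 45-language report.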