metadata

license: apache-2.0
datasets:
  - papluca/language-identification
language:
  - en
  - de
  - fr
  - es
metrics:
  - precision
  - recall
  - f1
  - accuracy
pipeline_tag: text-classification

German, English, French and Spanish Language Detector

ImranzamanML/GEFS-language-detector is a fine-tuned version of the base model xlm-roberta-base, trained on the papluca Language Identification dataset (papluca/language-identification).

In testing, the model reaches an F1 score between 0.9995 and 1.0000 for each of the four supported languages. The per-language precision, recall, F1, and accuracy are listed under Testing Results.

Predicted output:

The model returns the detected language as one of the following language codes:

de as German
en as English
fr as French
es as Spanish

Supported languages

The model currently supports four languages; more will be added in the future.

The following languages are supported:

German (de)
English (en)
French (fr)
Spanish (es)

Use a pipeline as a high-level helper

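from transformers import pipeline

# One sample sentence per supported language (German, English, Spanish, French).
texts = [
    "Mir gefällt die Art und Weise, Sprachen zu erkennen",
    "I like the way to detect languages",
    "Me gusta la forma de detectar idiomas",
    "J'aime la façon de détecter les langues",
]

# Load the fine-tuned classifier from the Hugging Face Hub.
pipe = pipeline("text-classification", model="ImranzamanML/GEFS-language-detector")

# top_k=1 keeps only the highest-scoring language for each text.
lang_detect = pipe(texts, top_k=1)
print("The detected languages are", lang_detect)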
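When called with a list of texts and top_k=1, the pipeline is expected to return one entry per text, each holding a label (a language code) and a score. The following is a minimal sketch of turning that output into readable language names. The helper name `describe_detections`, the `LANGUAGE_NAMES` table, and the sample scores are illustrative assumptions, not part of the model or the transformers API.

```python
# Map the model's language codes to names (codes as listed in this model card).
LANGUAGE_NAMES = {"de": "German", "en": "English", "fr": "French", "es": "Spanish"}


def describe_detections(results):
    """Return (language name, score) for the top prediction of each text.

    Accepts either a dict per text or a list of dicts per text, since the
    exact nesting can depend on how the pipeline was called.
    """
    described = []
    for item in results:
        # With top_k set, each item is typically a list of candidate dicts.
        top = item[0] if isinstance(item, list) else item
        # Fall back to the raw code if it is not in the table.
        name = LANGUAGE_NAMES.get(top["label"], top["label"])
        described.append((name, top["score"]))
    return described


# Hypothetical output shape with made-up scores, for illustration only.
sample_results = [
    [{"label": "de", "score": 0.99}],
    [{"label": "en", "score": 0.98}],
]
for name, score in describe_detections(sample_results):
    print(f"{name}: {score:.2f}")
```

In practice, `describe_detections(lang_detect)` could be applied to the pipeline result instead of the sample data.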

Load model directly

from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("ImranzamanML/GEFS-language-detector")
model = AutoModelForSequenceClassification.from_pretrained("ImranzamanML/GEFS-language-detector")
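
Loading the tokenizer and model directly leaves tokenization and label decoding to the caller. The sketch below shows one possible way to do that. It assumes PyTorch is installed and that the model config's `id2label` maps class indices to the language codes listed above; `labels_from_logits` and `detect_languages` are hypothetical helper names introduced for illustration.

```python
import math


def labels_from_logits(logits, id2label):
    """Convert rows of raw logits into (label, probability) pairs."""
    predictions = []
    for row in logits:
        # Numerically stable softmax over one row of logits.
        peak = max(row)
        exps = [math.exp(value - peak) for value in row]
        total = sum(exps)
        probs = [value / total for value in exps]
        # Index of the most probable class.
        best = max(range(len(probs)), key=probs.__getitem__)
        predictions.append((id2label[best], probs[best]))
    return predictions


def detect_languages(texts, model_name="ImranzamanML/GEFS-language-detector"):
    """Tokenize texts, run the classifier, and decode the top label per text."""
    # Heavy imports are kept local so the helper above works without them.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    model.eval()  # disable dropout for inference

    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():  # no gradients are needed for prediction
        logits = model(**inputs).logits
    return labels_from_logits(logits.tolist(), model.config.id2label)
```

Calling `detect_languages(["I like the way to detect languages"])` downloads the model on first use and returns a list of (language code, probability) pairs, one per input text.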

Model Training

Epoch   Training Loss   Validation Loss
1       0.002600        0.000148
2       0.001000        0.000015
3       0.000000        0.000011
4       0.001800        0.000009
5       0.002700        0.000016
6       0.001600        0.000012
7       0.001300        0.000009
8       0.001200        0.000008
9       0.000900        0.000007
10      0.000900        0.000007

Testing Results

Language   Precision   Recall   F1       Accuracy
de         0.9997      0.9998   0.9998   0.9999
en         1.0000      1.0000   1.0000   1.0000
fr         0.9995      0.9996   0.9996   0.9996
es         0.9994      0.9996   0.9995   0.9996
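
F1 is the harmonic mean of precision and recall, F1 = 2PR / (P + R). As a sanity check, the sketch below recomputes F1 from the precision and recall columns above and compares it with the reported F1. Because the table values are already rounded to four decimals, the comparison allows a tolerance of 0.0001, which is an assumption made for this check rather than something stated by the author.

```python
# Precision, recall, and reported F1 per language, copied from the table above.
REPORTED = {
    "de": (0.9997, 0.9998, 0.9998),
    "en": (1.0000, 1.0000, 1.0000),
    "fr": (0.9995, 0.9996, 0.9996),
    "es": (0.9994, 0.9996, 0.9995),
}


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def is_consistent(precision, recall, reported_f1, tolerance=1e-4):
    """Check the reported F1 against the recomputed one, allowing for rounding."""
    return abs(f1_score(precision, recall) - reported_f1) <= tolerance


for lang, (p, r, f1) in REPORTED.items():
    # Print the recomputed F1 and whether it matches the reported value.
    print(lang, round(f1_score(p, r), 5), is_consistent(p, r, f1))
```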

About Author

Name: Muhammad Imran Zaman

Company: Theum AG

Role: Machine Learning Engineer

Professional Links:

Kaggle: Profile
LinkedIn: Profile
Google Scholar: Profile
YouTube: Channel
GitHub: Channel