qgyd2021's picture
Update README.md
c07051b verified
---
license: apache-2.0
language:
- zh
- ja
- ar
- en
- hi
metrics:
- accuracy
library_name: allennlp
---
## Language Identification
该模型是基于 AllenNLP 在 [qgyd2021/language_identification](https://huggingface.co/datasets/qgyd2021/language_identification) 数据集上训练的语种识别模型。
在 valid 验证集上的准确率情况:
| 语种 | 样本数量 | 准确率 |
| :--- | :----: | ------: |
| af | 6221 | 0.8666 |
| ar | 19808 | 0.9994 |
| bg | 19913 | 0.9958 |
| bn | 7396 | 0.9968 |
| bs | 1653 | 0.8232 |
| cs | 19122 | 0.9615 |
| da | 19500 | 0.9727 |
| de | 19702 | 0.996 |
| el | 19455 | 0.9761 |
| en | 39710 | 0.9942 |
| eo | 18542 | 0.9944 |
| es | 19924 | 0.9937 |
| et | 19482 | 0.9727 |
| fi | 19223 | 0.9554 |
| fo | 4612 | 0.9697 |
| fr | 19990 | 0.9957 |
| ga | 19949 | 0.9973 |
| gl | 508 | 0.822 |
| hi | 19984 | 0.9965 |
| hi_en | 1358 | 0.951 |
| hr | 18840 | 0.9789 |
| hu | 669 | 0.8873 |
| hy | 124 | 0.9688 |
| id | 4669 | 0.9968 |
| is | 19795 | 0.9876 |
| it | 19742 | 0.9941 |
| ja | 20130 | 0.9996 |
| ko | 20098 | 0.9998 |
| lt | 19280 | 0.9721 |
| lv | 19459 | 0.9931 |
| mr | 10300 | 0.9961 |
| mt | 19708 | 0.993 |
| nl | 18452 | 0.9258 |
| no | 19404 | 0.9714 |
| pl | 19920 | 0.9973 |
| pt | 19996 | 0.9946 |
| ro | 19804 | 0.9944 |
| ru | 20003 | 0.9954 |
| sk | 19804 | 0.9861 |
| sl | 19665 | 0.9926 |
| sv | 18941 | 0.95 |
| sw | 19768 | 0.9871 |
| th | 19917 | 0.9991 |
| tl | 19572 | 0.9991 |
| tn | 19883 | 0.9933 |
| tr | 19809 | 0.9939 |
| ts | 19752 | 0.9854 |
| uk | 17643 | 0.9994 |
| ur | 19895 | 0.992 |
| vi | 19836 | 0.9982 |
| yo | 1936 | 0.9827 |
| zh | 40108 | 0.9996 |
| zu | 5406 | 0.9905 |
测试代码:
```python
#!/usr/bin/python3
# -*- coding: utf-8 -*-
import argparse
import time
from allennlp.models.archival import archive_model, load_archive
from allennlp.predictors.text_classifier import TextClassifierPredictor
from project_settings import project_path
def get_args():
"""
python3 step_5_predict_by_archive.py
:return:
"""
parser = argparse.ArgumentParser()
parser.add_argument(
"--text",
default="hello guy.",
type=str
)
parser.add_argument(
"--archive_file",
default=(project_path / "trained_models/language_identification").as_posix(),
type=str
)
args = parser.parse_args()
return args
def main():
args = get_args()
archive = load_archive(archive_file=args.archive_file)
predictor = TextClassifierPredictor(
model=archive.model,
dataset_reader=archive.dataset_reader,
)
json_dict = {
"sentence": args.text
}
begin_time = time.time()
outputs = predictor.predict_json(
json_dict
)
label = outputs["label"]
prob = round(max(outputs["probs"]), 4)
print(label)
print(prob)
print('time cost: {}'.format(time.time() - begin_time))
return
if __name__ == '__main__':
main()
```
requirements.txt
```text
allennlp==2.10.1
allennlp-models==2.10.1
torch==1.12.1
overrides==1.9.0
pytorch_pretrained_bert==0.6.2
```