
---
license: apache-2.0
language:
- zh
- ja
- ar
- en
- hi
metrics:
- accuracy
library_name: allennlp
---
## Language Identification

This is a language identification model trained with AllenNLP on the [qgyd2021/language_identification](https://huggingface.co/datasets/qgyd2021/language_identification) dataset.



Accuracy on the `valid` (validation) split:

| Language | Samples | Accuracy |
| :--- | :----: | ------: |
| af | 6221 | 0.8666 |
| ar | 19808 | 0.9994 |
| bg | 19913 | 0.9958 |
| bn | 7396 | 0.9968 |
| bs | 1653 | 0.8232 |
| cs | 19122 | 0.9615 |
| da | 19500 | 0.9727 |
| de | 19702 | 0.996 |
| el | 19455 | 0.9761 |
| en | 39710 | 0.9942 |
| eo | 18542 | 0.9944 |
| es | 19924 | 0.9937 |
| et | 19482 | 0.9727 |
| fi | 19223 | 0.9554 |
| fo | 4612 | 0.9697 |
| fr | 19990 | 0.9957 |
| ga | 19949 | 0.9973 |
| gl | 508 | 0.822 |
| hi | 19984 | 0.9965 |
| hi_en | 1358 | 0.951 |
| hr | 18840 | 0.9789 |
| hu | 669 | 0.8873 |
| hy | 124 | 0.9688 |
| id | 4669 | 0.9968 |
| is | 19795 | 0.9876 |
| it | 19742 | 0.9941 |
| ja | 20130 | 0.9996 |
| ko | 20098 | 0.9998 |
| lt | 19280 | 0.9721 |
| lv | 19459 | 0.9931 |
| mr | 10300 | 0.9961 |
| mt | 19708 | 0.993 |
| nl | 18452 | 0.9258 |
| no | 19404 | 0.9714 |
| pl | 19920 | 0.9973 |
| pt | 19996 | 0.9946 |
| ro | 19804 | 0.9944 |
| ru | 20003 | 0.9954 |
| sk | 19804 | 0.9861 |
| sl | 19665 | 0.9926 |
| sv | 18941 | 0.95 |
| sw | 19768 | 0.9871 |
| th | 19917 | 0.9991 |
| tl | 19572 | 0.9991 |
| tn | 19883 | 0.9933 |
| tr | 19809 | 0.9939 |
| ts | 19752 | 0.9854 |
| uk | 17643 | 0.9994 |
| ur | 19895 | 0.992 |
| vi | 19836 | 0.9982 |
| yo | 1936 | 0.9827 |
| zh | 40108 | 0.9996 |
| zu | 5406 | 0.9905 |
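
The table reports accuracy per language only. One way to condense it into a single figure is a sample-weighted mean, which equals overall accuracy once every row is included, shown next to an unweighted mean over languages. The sketch below is an illustration and not part of the original evaluation: the helper names `weighted_accuracy` and `macro_accuracy` are hypothetical, and only three rows of the table are copied in for brevity.

```python
from typing import List, Tuple

# (language, sample count, accuracy) rows copied from the table above.
# Only three rows are listed here; extend the list to cover the full table.
ROWS: List[Tuple[str, int, float]] = [
    ("af", 6221, 0.8666),
    ("en", 39710, 0.9942),
    ("zh", 40108, 0.9996),
]


def weighted_accuracy(rows: List[Tuple[str, int, float]]) -> float:
    """Mean accuracy weighted by the number of samples per language."""
    total = sum(count for _, count, _ in rows)
    return sum(count * acc for _, count, acc in rows) / total


def macro_accuracy(rows: List[Tuple[str, int, float]]) -> float:
    """Unweighted mean accuracy: every language counts equally."""
    return sum(acc for _, _, acc in rows) / len(rows)


if __name__ == "__main__":
    print("weighted:", round(weighted_accuracy(ROWS), 4))
    print("macro:   ", round(macro_accuracy(ROWS), 4))
```

The two numbers diverge when low-resource languages score lower than high-resource ones, since the weighted mean is dominated by the large classes.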




Test code:
```python
#!/usr/bin/python3
# -*- coding: utf-8 -*-
import argparse
import time

from allennlp.models.archival import load_archive
from allennlp.predictors.text_classifier import TextClassifierPredictor

# project_settings is a project-local module, not part of AllenNLP.
from project_settings import project_path


def get_args():
    """
    python3 step_5_predict_by_archive.py
    :return:
    """
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--text",
        default="hello guy.",
        type=str
    )
    parser.add_argument(
        "--archive_file",
        default=(project_path / "trained_models/language_identification").as_posix(),
        type=str
    )
    args = parser.parse_args()
    return args


def main():
    args = get_args()

    # Load the trained model archive.
    archive = load_archive(archive_file=args.archive_file)

    predictor = TextClassifierPredictor(
        model=archive.model,
        dataset_reader=archive.dataset_reader,
    )

    # TextClassifierPredictor expects the input text under the "sentence" key.
    json_dict = {
        "sentence": args.text
    }

    begin_time = time.time()
    outputs = predictor.predict_json(
        json_dict
    )
    # Predicted language code and the probability of the best class.
    label = outputs["label"]
    prob = round(max(outputs["probs"]), 4)
    print(label)
    print(prob)

    print('time cost: {}'.format(time.time() - begin_time))
    return


if __name__ == '__main__':
    main()

```
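
The script prints only the best label and its probability. For short or mixed-language input it can help to inspect the runner-up labels as well. The sketch below ranks a probability vector shaped like `outputs["probs"]`. `top_k_labels` is a hypothetical helper, and the small `index_to_label` mapping and the probabilities are made up so that the example runs without a model. With a loaded archive, the mapping could presumably be read from the model vocabulary, assuming the classifier uses AllenNLP's default `labels` namespace.

```python
from typing import Dict, List, Sequence, Tuple


def top_k_labels(
    probs: Sequence[float],
    index_to_label: Dict[int, str],
    k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the k most probable (label, probability) pairs, best first."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    return [(index_to_label[i], round(p, 4)) for i, p in ranked[:k]]


if __name__ == "__main__":
    # Made-up mapping and probabilities for illustration only.
    # Assumption: with a real archive the mapping might come from
    # archive.model.vocab.get_index_to_token_vocabulary("labels").
    index_to_label = {0: "en", 1: "nl", 2: "af", 3: "de"}
    probs = [0.05, 0.55, 0.38, 0.02]
    print(top_k_labels(probs, index_to_label, k=2))
```

A small gap between the first and second probability is a reasonable signal that the prediction should be treated with caution.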

requirements.txt
```text
allennlp==2.10.1
allennlp-models==2.10.1
torch==1.12.1
overrides==1.9.0
pytorch_pretrained_bert==0.6.2
```
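
Because the versions are pinned, a quick check of the active environment can catch mismatches before the archive is loaded. A minimal sketch using the standard-library `importlib.metadata` (Python 3.8+): `installed_version` and `find_mismatches` are hypothetical helpers, and the `PINNED` dict simply repeats the pins listed above.

```python
from importlib import metadata
from typing import Callable, Dict, Optional

# Pins repeated from requirements.txt.
PINNED: Dict[str, str] = {
    "allennlp": "2.10.1",
    "allennlp-models": "2.10.1",
    "torch": "1.12.1",
    "overrides": "1.9.0",
    "pytorch_pretrained_bert": "0.6.2",
}


def installed_version(name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(
    pinned: Dict[str, str],
    lookup: Callable[[str], Optional[str]] = installed_version,
) -> Dict[str, Optional[str]]:
    """Map each package whose installed version differs from its pin to what was found."""
    mismatches: Dict[str, Optional[str]] = {}
    for name, wanted in pinned.items():
        found = lookup(name)
        # Ignore local build suffixes such as "+cu113" when comparing.
        base = found.split("+")[0] if found is not None else None
        if base != wanted:
            mismatches[name] = found
    return mismatches


if __name__ == "__main__":
    for name, found in find_mismatches(PINNED).items():
        print(f"{name}: expected {PINNED[name]}, found {found or 'not installed'}")
```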