Chinese Named Entity Recognition for the Medical Domain
Project repository: https://github.com/iioSnail/chinese_medical_ner
Usage:
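from transformers import AutoModelForTokenClassification, BertTokenizerFast
# Load the tokenizer and the token-classification model from the Hugging Face Hub.
tokenizer = BertTokenizerFast.from_pretrained("iioSnail/bert-base-chinese-medical-ner")
model = AutoModelForTokenClassification.from_pretrained("iioSnail/bert-base-chinese-medical-ner")
sentences = ["瘦脸针、水光针和玻尿酸详解!", "半月板钙化的病因有哪些?"]
# add_special_tokens=False omits [CLS] and [SEP], so output positions start at the first token of each sentence.
inputs = tokenizer(sentences, return_tensors="pt", padding=True, add_special_tokens=False)
outputs = model(**inputs)
# Take the highest-scoring tag id per token; multiplying by the attention mask sets padded positions to 0.
outputs = outputs.logits.argmax(-1) * inputs["attention_mask"]
print(outputs)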
Output:
tensor([[1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 4, 4],
        [1, 2, 2, 2, 3, 4, 4, 4, 4, 4, 4, 4, 0, 0]])
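Here 1=B, 2=I, 3=E, and 4=O. The sequence 1, 3 marks a two-character medical entity, 1, 2, 3 a three-character entity, 1, 2, 2, 3 a four-character entity, and so on. A 0 appears only at padded positions, which the attention mask zeroes out.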
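To make the tag scheme concrete, the short sketch below pairs each character of the second example sentence with its tag letter. The names `ID_TO_TAG` and `tag_characters` are hypothetical helpers introduced here for illustration, not part of the project, and the sketch assumes one token per character, which holds for the example sentences but may not hold for text containing Latin words or digits.

```python
# Hypothetical mapping for illustration; 0 is the value left at padded positions.
ID_TO_TAG = {0: "PAD", 1: "B", 2: "I", 3: "E", 4: "O"}

def tag_characters(sentence, tag_ids):
    """Pair each character with its tag letter.

    zip() stops at the end of the sentence, so trailing padding ids are dropped.
    Assumes one token per character.
    """
    return [(ch, ID_TO_TAG[t]) for ch, t in zip(sentence, tag_ids)]

pairs = tag_characters("半月板钙化的病因有哪些?", [1, 2, 2, 2, 3, 4, 4, 4, 4, 4, 4, 4, 0, 0])
print(pairs[:6])
# [('半', 'B'), ('月', 'I'), ('板', 'I'), ('钙', 'I'), ('化', 'E'), ('的', 'O')]
```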
You can use MedicalNerModel.format_outputs(sentences, outputs) from the project to convert the output into entity spans.
The result looks like this:
[
    [
        {'start': 0, 'end': 3, 'word': '瘦脸针'},
        {'start': 4, 'end': 7, 'word': '水光针'},
        {'start': 8, 'end': 11, 'word': '玻尿酸'}
    ],
    [
        {'start': 0, 'end': 5, 'word': '半月板钙化'}
    ]
]
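For readers who want to see the decoding logic without depending on the project code, here is a minimal sketch that turns one row of tag ids into spans of the shape shown above. It is not the project's `format_outputs` implementation; `decode_entities` is a hypothetical name, and the sketch assumes one token per character and an exclusive `end` index, as the example output suggests.

```python
from typing import Dict, List, Union

B, I, E, O = 1, 2, 3, 4  # tag ids from the model card; 0 marks padding

def decode_entities(sentence: str, tag_ids: List[int]) -> List[Dict[str, Union[int, str]]]:
    """Convert B/I/E/O tag ids for one sentence into character spans (end exclusive)."""
    entities = []
    start = None  # position of the currently open B tag, if any
    for idx, tag in enumerate(tag_ids[:len(sentence)]):  # ignore padded positions
        if tag == B:
            start = idx
        elif tag == E and start is not None:
            entities.append({"start": start, "end": idx + 1, "word": sentence[start:idx + 1]})
            start = None
        elif tag != I:
            start = None  # O, padding, or a stray E discards an unfinished entity
    return entities

sentences = ["瘦脸针、水光针和玻尿酸详解!", "半月板钙化的病因有哪些?"]
rows = [
    [1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 4, 4],
    [1, 2, 2, 2, 3, 4, 4, 4, 4, 4, 4, 4, 0, 0],
]
results = [decode_entities(s, r) for s, r in zip(sentences, rows)]
print(results)
```

With the tensor from the usage example, `outputs.tolist()` would supply the `rows` argument.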