BERT for English Named Entity Recognition(bert4ner) Model

英文实体识别模型

bert4ner-base-uncased evaluate CoNLL-2003 test data:

The overall performance of BERT on CoNLL-2003 test:

Accuracy Recall F1
BertSoftmax 0.8956 0.9132 0.9043

在CoNLL-2003的测试集上达到接近SOTA水平。

BertSoftmax的网络结构(原生BERT)。

本项目开源在实体识别项目:nerpy,可支持bert4ner模型,通过如下命令调用:

英文实体识别:

>>> from nerpy import NERModel
>>> model = NERModel("bert", "shibing624/bert4ner-base-uncased")
>>> predictions, raw_outputs, entities = model.predict(["AL-AIN, United Arab Emirates 1996-12-06"], split_on_space=True)
entities:  [('AL-AIN,', 'LOC'), ('United Arab Emirates', 'LOC')]

模型文件组成:

bert4ner-base-uncased
    ├── config.json
    ├── model_args.json
    ├── pytorch_model.bin
    ├── special_tokens_map.json
    ├── tokenizer_config.json
    └── vocab.txt

Usage (HuggingFace Transformers)

Without nerpy, you can use the model like this:

First, you pass your input through the transformer model, then you have to apply the bio tag to get the entity words.

Install package:

pip install transformers seqeval
import os
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification
from seqeval.metrics.sequence_labeling import get_entities

os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained("shibing624/bert4ner-base-uncased")
model = AutoModelForTokenClassification.from_pretrained("shibing624/bert4ner-base-uncased")
label_list = ["E-ORG", "E-LOC", "S-MISC", "I-MISC", "S-PER", "E-PER", "B-MISC", "O", "S-LOC",
              "E-MISC", "B-ORG", "S-ORG", "I-ORG", "B-LOC", "I-LOC", "B-PER", "I-PER"]

sentence = "AL-AIN, United Arab Emirates 1996-12-06"


def get_entity(sentence):
    tokens = tokenizer.tokenize(sentence)
    inputs = tokenizer.encode(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(inputs).logits
    predictions = torch.argmax(outputs, dim=2)
    word_tags = [(token, label_list[prediction]) for token, prediction in zip(tokens, predictions[0].numpy()[1:-1])]
    print(sentence)
    print(word_tags)

    pred_labels = [i[1] for i in word_tags]
    entities = []
    line_entities = get_entities(pred_labels)
    for i in line_entities:
        word = tokens[i[1]: i[2] + 1]
        entity_type = i[0]
        entities.append((word, entity_type))

    print("Sentence entity:")
    print(entities)


get_entity(sentence)

数据集

实体识别数据集

数据集 语料 下载链接 文件大小
CNER中文实体识别数据集 CNER(12万字) CNER github 1.1MB
PEOPLE中文实体识别数据集 人民日报数据集(200万字) PEOPLE github 12.8MB
CoNLL03英文实体识别数据集 CoNLL-2003数据集(22万字) CoNLL03 github 1.7MB

input format

Input format (prefer BIOES tag scheme), with each character its label for one line. Sentences are splited with a null line.

EU	S-ORG
rejects	O
German	S-MISC
call	O
to	O
boycott	O
British	S-MISC
lamb	O
.	O

Peter	B-PER
Blackburn	E-PER

如果需要训练bert4ner,请参考https://github.com/shibing624/nerpy/tree/main/examples

Citation

@software{nerpy,
  author = {Xu Ming},
  title = {nerpy: Named Entity Recognition toolkit},
  year = {2022},
  url = {https://github.com/shibing624/nerpy},
}
Downloads last month
13
Safetensors
Model size
109M params
Tensor type
I64
·
F32
·
Inference Examples
This model does not have enough activity to be deployed to Inference API (serverless) yet. Increase its social visibility and check back later, or deploy to Inference Endpoints (dedicated) instead.