---
library_name: transformers
base_model: google-bert/bert-base-chinese
tags:
- generated_from_trainer
datasets:
- peoples_daily_ner
metrics:
- f1
model-index:
- name: models_for_ner
results:
- task:
type: token-classification
name: Token Classification
dataset:
name: peoples_daily_ner
type: peoples_daily_ner
config: peoples_daily_ner
split: validation
args: peoples_daily_ner
metrics:
- type: f1
value: 0.9508438253415484
name: F1
---
# models_for_ner
This model is a fine-tuned version of [google-bert/bert-base-chinese](https://huggingface.co/google-bert/bert-base-chinese) on the peoples_daily_ner dataset.
It achieves the following results on the evaluation set (a metric sketch follows the list):
- Loss: 0.0219
- F1: 0.9508
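The reported F1 is most likely an entity-level score of the kind produced by the `seqeval` metric, as is typical for token-classification models trained with the Trainer; the evaluation code itself is not part of this card. A minimal sketch, assuming the `evaluate` library's `seqeval` metric and toy BIO tag sequences (not data from this repository):

```python
import evaluate  # pip install evaluate seqeval

# Hypothetical example: predicted vs. reference BIO tags for one sentence.
seqeval = evaluate.load("seqeval")
predictions = [["B-PER", "I-PER", "I-PER", "O", "B-LOC", "I-LOC", "O", "O"]]
references  = [["B-PER", "I-PER", "I-PER", "O", "B-LOC", "I-LOC", "O", "O"]]

results = seqeval.compute(predictions=predictions, references=references)
print(results["overall_f1"])  # 1.0 for this toy example; the card reports 0.9508 on validation
```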
## Model description
### Usage (pipeline approach)
```python
from transformers import pipeline

ner_pipe = pipeline('token-classification',
                    model='roberthsu2003/models_for_ner',
                    aggregation_strategy='simple')

inputs = '徐國堂在台北上班'
res = ner_pipe(inputs)
print(res)

# Group the recognized spans by entity type
res_result = {}
for r in res:
    entity_name = r['entity_group']
    start = r['start']
    end = r['end']
    if entity_name not in res_result:
        res_result[entity_name] = []
    res_result[entity_name].append(inputs[start:end])
print(res_result)
#==output==
#{'PER': ['徐國堂'], 'LOC': ['台北']}
```
### Usage (model and tokenizer)
```python
from transformers import AutoModelForTokenClassification, AutoTokenizer
import numpy as np

# Load the fine-tuned model and tokenizer
model = AutoModelForTokenClassification.from_pretrained('roberthsu2003/models_for_ner')
tokenizer = AutoTokenizer.from_pretrained('roberthsu2003/models_for_ner')

# The label mapping (you might need to adjust this based on your training)
# ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']
label_list = list(model.config.id2label.values())

def predict_ner(text):
    """Predicts NER tags for a given text using the loaded model."""
    # Encode the text
    inputs = tokenizer(text, return_tensors='pt', truncation=True, padding=True)

    # Get model predictions
    outputs = model(**inputs)
    predictions = np.argmax(outputs.logits.detach().numpy(), axis=-1)

    # Get the word IDs from the encoded inputs.
    # Note: word_ids() is a method on the encoding result, not on the tokenizer itself.
    word_ids = inputs.word_ids(batch_index=0)

    pred_tags = []
    for word_id, pred in zip(word_ids, predictions[0]):
        if word_id is None:
            continue  # Skip special tokens ([CLS], [SEP])
        pred_tags.append(label_list[pred])
    return pred_tags

# To get the entities, group consecutive non-O tags:
def get_entities(tags):
    """Groups consecutive NER tags to extract entities."""
    entities = []
    start_index = -1
    current_entity_type = None
    for i, tag in enumerate(tags):
        if tag != 'O':
            if start_index == -1:
                start_index = i
                current_entity_type = tag[2:]  # Extract entity type (e.g., PER, LOC, ORG)
        else:  # tag == 'O'
            if start_index != -1:
                entities.append((start_index, i, current_entity_type))
                start_index = -1
                current_entity_type = None
    if start_index != -1:
        entities.append((start_index, len(tags), current_entity_type))
    return entities

# Example usage:
text = "徐國堂在台北上班"
ner_tags = predict_ner(text)
print(f"Text: {text}")
#==output==
#Text: 徐國堂在台北上班

print(f"NER Tags: {ner_tags}")
#==output==
#NER Tags: ['B-PER', 'I-PER', 'I-PER', 'O', 'B-LOC', 'I-LOC', 'O', 'O']

entities = get_entities(ner_tags)
word_tokens = tokenizer.tokenize(text)  # Tokenize to get the individual tokens
print("Entities:")
for start, end, entity_type in entities:
    entity_text = "".join(word_tokens[start:end])
    print(f"- {entity_text}: {entity_type}")
#==output==
#Entities:
#- 徐國堂: PER
#- 台北: LOC
```
## Intended uses & limitations
More information needed
## Training and evaluation data
More information needed
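The card's metadata names the peoples_daily_ner dataset (validation split for the reported metric). A minimal sketch, assuming the publicly hosted `peoples_daily_ner` dataset on the Hugging Face Hub is the one used, for loading it and inspecting its label set:

```python
from datasets import load_dataset

# Assumption: the public `peoples_daily_ner` dataset on the Hub is the one used here.
ds = load_dataset("peoples_daily_ner")
print(ds)  # train / validation / test splits with `tokens` and `ner_tags` columns

# BIO label names carried by the dataset:
# ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']
label_names = ds["train"].features["ner_tags"].feature.names
print(label_names)
```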
## Training procedure
### Training hyperparameters
The following hyperparameters were used during training (a configuration sketch follows the list):
- learning_rate: 5e-05
- train_batch_size: 64
- eval_batch_size: 128
- seed: 42
- optimizer: AdamW (torch) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
- lr_scheduler_type: linear
- num_epochs: 3
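As a rough sketch only (not the author's training script), these hyperparameters map onto a Hugging Face `TrainingArguments` configuration like the following; `output_dir` and the evaluation strategy are placeholders, not values from the card:

```python
from transformers import TrainingArguments

# Sketch: the explicit values mirror the list above; everything else is a placeholder.
training_args = TrainingArguments(
    output_dir="models_for_ner",        # placeholder, not from the card
    learning_rate=5e-5,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=128,
    seed=42,
    optim="adamw_torch",                # AdamW (torch) with default betas=(0.9, 0.999), eps=1e-8
    lr_scheduler_type="linear",
    num_train_epochs=3,
    eval_strategy="epoch",              # assumption: per-epoch evaluation, matching the results table
)
```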
### Training results
| Training Loss | Epoch | Step | Validation Loss | F1 |
|:-------------:|:-----:|:----:|:---------------:|:------:|
| 0.0274 | 1.0 | 327 | 0.0204 | 0.9510 |
| 0.0127 | 2.0 | 654 | 0.0174 | 0.9592 |
| 0.0063 | 3.0 | 981 | 0.0186 | 0.9602 |
### Framework versions
- Transformers 4.48.3
- Pytorch 2.5.1+cu124
- Datasets 3.3.2
- Tokenizers 0.21.0