|
---
library_name: transformers
base_model: google-bert/bert-base-chinese
tags:
- generated_from_trainer
datasets:
- peoples_daily_ner
metrics:
- f1
model-index:
- name: models_for_ner
  results:
  - task:
      type: token-classification
      name: Token Classification
    dataset:
      name: peoples_daily_ner
      type: peoples_daily_ner
      config: peoples_daily_ner
      split: validation
      args: peoples_daily_ner
    metrics:
    - type: f1
      value: 0.9508438253415484
      name: F1
---
|
|
|
|
|
|
# models_for_ner |
|
|
|
This model is a fine-tuned version of [google-bert/bert-base-chinese](https://huggingface.co/google-bert/bert-base-chinese) on the peoples_daily_ner dataset. |
|
It achieves the following results on the evaluation set: |
|
- Loss: 0.0219 |
|
- F1: 0.9508 |
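
The F1 reported above is an entity-level score. A minimal sketch of how such a score can be computed with the seqeval library (an assumption for illustration; the exact metric implementation used during training is not included in this card):

```python
# Hypothetical sketch: entity-level F1 with seqeval (pip install seqeval).
# The tag sequences below are illustrative, not taken from the evaluation set.
from seqeval.metrics import f1_score

references  = [['B-PER', 'I-PER', 'I-PER', 'O', 'B-LOC', 'I-LOC', 'O', 'O']]
predictions = [['B-PER', 'I-PER', 'I-PER', 'O', 'B-LOC', 'I-LOC', 'O', 'O']]

print(f1_score(references, predictions))  # 1.0 for a perfect match
```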
|
|
|
## Model description |
|
|
|
### Usage (with pipeline)
|
|
|
```python |
|
from transformers import pipeline

# Load the fine-tuned NER model as a token-classification pipeline.
# aggregation_strategy='simple' merges word pieces into whole entities.
ner_pipe = pipeline('token-classification', model='roberthsu2003/models_for_ner', aggregation_strategy='simple')

inputs = '徐國堂在台北上班'
res = ner_pipe(inputs)
print(res)

# Group the recognized entities by entity type.
res_result = {}
for r in res:
    entity_name = r['entity_group']
    start = r['start']
    end = r['end']
    if entity_name not in res_result:
        res_result[entity_name] = []
    res_result[entity_name].append(inputs[start:end])

print(res_result)
#==output==
#{'PER': ['徐國堂'], 'LOC': ['台北']}
|
``` |
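
The `aggregation_strategy='simple'` argument merges word-piece predictions that belong to the same entity into a single `entity_group` result; without it the pipeline returns one prediction per token. If a GPU is available, the same pipeline can be placed on it (a sketch; `device=0` selects the first CUDA device):

```python
from transformers import pipeline

# Same pipeline as above, but running on GPU 0 if one is available.
ner_pipe = pipeline('token-classification', model='roberthsu2003/models_for_ner',
                    aggregation_strategy='simple', device=0)
```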
|
|
|
### Usage (with model and tokenizer)
|
|
|
```python |
|
from transformers import AutoModelForTokenClassification, AutoTokenizer
import numpy as np

# Load the fine-tuned model and tokenizer
model = AutoModelForTokenClassification.from_pretrained('roberthsu2003/models_for_ner')
tokenizer = AutoTokenizer.from_pretrained('roberthsu2003/models_for_ner')

# The label mapping stored in the model config:
# ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']
label_list = list(model.config.id2label.values())


def predict_ner(text):
    """Predicts NER tags for a given text using the loaded model."""
    # Encode the text
    inputs = tokenizer(text, return_tensors='pt', truncation=True, padding=True)

    # Get model predictions
    outputs = model(**inputs)
    predictions = np.argmax(outputs.logits.detach().numpy(), axis=-1)

    # word_ids() is a method on the encoding result (not on the tokenizer itself);
    # it maps each token position back to its word, with None for special tokens.
    word_ids = inputs.word_ids(batch_index=0)

    # Keep one predicted tag per real token, skipping [CLS]/[SEP]/padding.
    pred_tags = []
    for word_id, pred in zip(word_ids, predictions[0]):
        if word_id is None:
            continue  # Skip special tokens
        pred_tags.append(label_list[pred])

    return pred_tags
|
|
|
# To get the entities, group consecutive non-O tags into spans:

def get_entities(tags):
    """Groups BIO tags into (start_index, end_index, entity_type) spans."""
    entities = []
    start_index = -1
    current_entity_type = None
    for i, tag in enumerate(tags):
        if tag == 'O':
            if start_index != -1:
                entities.append((start_index, i, current_entity_type))
                start_index = -1
                current_entity_type = None
        else:
            entity_type = tag[2:]  # Extract entity type (e.g., PER, LOC, ORG)
            # Close the current span when a new B- tag or a different type starts.
            if start_index != -1 and (tag.startswith('B-') or entity_type != current_entity_type):
                entities.append((start_index, i, current_entity_type))
                start_index = i
                current_entity_type = entity_type
            elif start_index == -1:
                start_index = i
                current_entity_type = entity_type
    if start_index != -1:
        entities.append((start_index, len(tags), current_entity_type))
    return entities
|
|
|
# Example usage:
text = "徐國堂在台北上班"
ner_tags = predict_ner(text)
print(f"Text: {text}")
#==output==
#Text: 徐國堂在台北上班

print(f"NER Tags: {ner_tags}")
#==output==
#NER Tags: ['B-PER', 'I-PER', 'I-PER', 'O', 'B-LOC', 'I-LOC', 'O', 'O']

entities = get_entities(ner_tags)
word_tokens = tokenizer.tokenize(text)  # Tokenize so the indices line up with the predicted tags
print("Entities:")
for start, end, entity_type in entities:
    entity_text = "".join(word_tokens[start:end])
    print(f"- {entity_text}: {entity_type}")

#==output==
#Entities:
#- 徐國堂: PER
#- 台北: LOC
|
``` |
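
To obtain the same grouped output as the pipeline example above, the extracted spans can be collected into a dictionary keyed by entity type. This helper is illustrative and reuses the `entities` and `word_tokens` variables from the block above:

```python
# Illustrative: group the (start, end, entity_type) spans into a dict,
# mirroring the {'PER': [...], 'LOC': [...]} output of the pipeline example.
grouped = {}
for start, end, entity_type in entities:
    grouped.setdefault(entity_type, []).append("".join(word_tokens[start:end]))

print(grouped)
#==output==
#{'PER': ['徐國堂'], 'LOC': ['台北']}
```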
|
|
|
## Intended uses & limitations |
|
|
|
The model performs Chinese named-entity recognition, predicting PER, LOC, and ORG entities with the BIO tagging scheme, as shown in the usage examples above. Performance on text that differs from the People's Daily news domain has not been evaluated in this card.
|
|
|
## Training and evaluation data |
|
|
|
The model was fine-tuned and evaluated on the peoples_daily_ner dataset (People's Daily NER corpus); the reported F1 and loss are measured on its validation split.
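
A minimal sketch for loading the corpus from the Hugging Face Hub (an illustration; depending on your `datasets` version you may need to pass `trust_remote_code=True` or use a Parquet-converted copy of the dataset):

```python
from datasets import load_dataset

# Load the People's Daily NER corpus; it provides train/validation/test splits
# with `tokens` and `ner_tags` columns.
ds = load_dataset("peoples_daily_ner")
print(ds)
print(ds["train"].features["ner_tags"].feature.names)
# Expected: ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']
```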
|
|
|
## Training procedure |
|
|
|
### Training hyperparameters |
|
|
|
The following hyperparameters were used during training (a configuration sketch follows the list):
|
- learning_rate: 5e-05 |
|
- train_batch_size: 64 |
|
- eval_batch_size: 128 |
|
- seed: 42 |
|
- optimizer: AdamW (torch implementation) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
|
- lr_scheduler_type: linear |
|
- num_epochs: 3 |
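
A hedged sketch of a `TrainingArguments` configuration matching the values above (the original training script is not included in this card; evaluation once per epoch is inferred from the results table below):

```python
from transformers import TrainingArguments

# Sketch only: mirrors the hyperparameters listed above.
training_args = TrainingArguments(
    output_dir="models_for_ner",
    learning_rate=5e-5,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=128,
    seed=42,
    num_train_epochs=3,
    lr_scheduler_type="linear",
    eval_strategy="epoch",
)
```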
|
|
|
### Training results |
|
|
|
| Training Loss | Epoch | Step | Validation Loss | F1 | |
|
|:-------------:|:-----:|:----:|:---------------:|:------:| |
|
| 0.0274 | 1.0 | 327 | 0.0204 | 0.9510 | |
|
| 0.0127 | 2.0 | 654 | 0.0174 | 0.9592 | |
|
| 0.0063 | 3.0 | 981 | 0.0186 | 0.9602 | |
|
|
|
|
|
### Framework versions |
|
|
|
- Transformers 4.48.3 |
|
- Pytorch 2.5.1+cu124 |
|
- Datasets 3.3.2 |
|
- Tokenizers 0.21.0 |