# BERT-Base Quantized Model for Relation Extraction

This repository hosts a quantized version of the BERT-Base Chinese model, fine-tuned for relation extraction tasks. The model has been optimized for efficient deployment while maintaining high accuracy, making it suitable for resource-constrained environments.

---

## Model Details

- **Model Name:** BERT-Base Chinese
- **Model Architecture:** BERT Base
- **Task:** Relation Extraction/Classification
- **Dataset:** Chinese Entity-Relation Dataset
- **Quantization:** Float16
- **Fine-tuning Framework:** Hugging Face Transformers

---

## Usage

### Installation

```bash
pip install transformers torch evaluate
```

### Loading the Quantized Model

```python
from transformers import BertTokenizerFast, BertForSequenceClassification
import torch

# Load the fine-tuned model and tokenizer
model_path = "final_relation_extraction_model"
tokenizer = BertTokenizerFast.from_pretrained(model_path)
model = BertForSequenceClassification.from_pretrained(model_path)
model.eval()

# Example input with entity markers
text = "笔名:[SUBJ] 木斧 [/SUBJ] 出生地:[OBJ] 成都 [/OBJ]"

# Tokenize input
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=128)

# Inference
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits
    predicted_class = torch.argmax(logits, dim=1).item()

# Map prediction to relation label
label_mapping = {0: "出生地", 1: "出生日期", 2: "民族", 3: "职业"}  # Customize based on your labels
predicted_relation = label_mapping[predicted_class]
print(f"Predicted Relation: {predicted_relation}")
```

---

## Performance Metrics

- **Accuracy:** 0.970222
- **F1 Score:** 0.964973
- **Training Loss:** 0.130104
- **Validation Loss:** 0.066986

---

## Fine-Tuning Details

### Dataset

The model was fine-tuned on a Chinese entity-relation dataset with:

- Entity pairs marked with special tokens `[SUBJ]` and `[OBJ]`
- Text preprocessed to include entity boundaries
- Multiple relation types including biographical information

### Training Configuration

- **Epochs:** 3
- **Batch Size:** 16
- **Learning Rate:** 2e-5
- **Max Length:** 128 tokens
- **Evaluation Strategy:** epoch
- **Weight Decay:** 0.01
- **Optimizer:** AdamW

### Data Processing

The original SPO (Subject-Predicate-Object) format was converted to relation classification:

- Each SPO triple becomes a separate training example
- Entities are marked with special tokens in the text
- Relations are encoded as numerical labels for classification

### Quantization

Post-training quantization was applied using PyTorch's Float16 precision to reduce the model size and improve inference efficiency.
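A Float16 cast of this kind can be reproduced in a few lines. The sketch below is illustrative rather than the notebook's exact export code, and the intermediate full-precision directory name `relation_extraction_model_fp32` is an assumed placeholder:

```python
from transformers import BertTokenizerFast, BertForSequenceClassification

# Load the full-precision fine-tuned checkpoint
# ("relation_extraction_model_fp32" is an assumed intermediate path).
model = BertForSequenceClassification.from_pretrained("relation_extraction_model_fp32")
tokenizer = BertTokenizerFast.from_pretrained("relation_extraction_model_fp32")

# Cast all weights to Float16 (post-training, no calibration data required)
model = model.half()

# Save the quantized model and its tokenizer to the released directory
model.save_pretrained("final_relation_extraction_model")
tokenizer.save_pretrained("final_relation_extraction_model")
```

Note that some CPU kernels do not support half precision; if CPU inference fails or is slow, cast the loaded model back with `model.float()` before running it.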
---

## Repository Structure

```
.
├── final_relation_extraction_model/
│   ├── config.json
│   ├── pytorch_model.bin          # Fine-tuned model
│   ├── tokenizer_config.json
│   ├── special_tokens_map.json
│   ├── tokenizer.json
│   ├── vocab.txt
│   └── added_tokens.json
├── relationship-extraction.ipynb  # Training notebook
└── README.md                      # Model documentation
```

---

## Entity Marking Format

The model expects input text with entities marked using special tokens:

- Subject entities: `[SUBJ] entity_name [/SUBJ]`
- Object entities: `[OBJ] entity_name [/OBJ]`

Example:

```
Input:  "笔名:[SUBJ] 木斧 [/SUBJ]原名:杨莆曾民族: [OBJ] 回族 [/OBJ]"
Output: "民族" (ethnicity relation)
```

---

## Supported Relations

The model can classify various biographical and factual relations in Chinese text, including:

- 出生地 (Birthplace)
- 出生日期 (Birth Date)
- 民族 (Ethnicity)
- 职业 (Occupation)
- Additional relation types present in the training dataset

---

## Limitations

- The model is trained specifically for Chinese text and may not work well with other languages.
- Performance depends on proper entity marking in the input text.
- The model may not generalize well to domains outside the fine-tuning dataset.
- Quantization may result in minor accuracy degradation compared to the full-precision model.

---

## Training Environment

- **Platform:** Kaggle Notebooks with GPU acceleration
- **GPU:** NVIDIA Tesla T4
- **Training Time:** Approximately 1 hour 5 minutes
- **Framework:** Hugging Face Transformers with PyTorch backend

---

## Contributing

Contributions are welcome! Feel free to open an issue or PR for improvements, fixes, or feature extensions.

---
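## End-to-End Example with Entity Marking

The sketch below ties together the loading example and the entity marking format described above. It is not part of the repository: the helper names `mark_entities` and `predict_relation` and the `label_mapping` values are illustrative placeholders that should be adapted to the labels used during fine-tuning.

```python
from transformers import BertTokenizerFast, BertForSequenceClassification
import torch

model_path = "final_relation_extraction_model"
tokenizer = BertTokenizerFast.from_pretrained(model_path)
model = BertForSequenceClassification.from_pretrained(model_path)
model.eval()

# Illustrative label mapping; replace with the labels used during fine-tuning
label_mapping = {0: "出生地", 1: "出生日期", 2: "民族", 3: "职业"}

def mark_entities(text: str, subject: str, obj: str) -> str:
    """Wrap the first occurrence of the subject and object with marker tokens."""
    text = text.replace(subject, f"[SUBJ] {subject} [/SUBJ]", 1)
    return text.replace(obj, f"[OBJ] {obj} [/OBJ]", 1)

def predict_relation(text: str, subject: str, obj: str) -> str:
    """Classify the relation between a subject/object pair mentioned in the text."""
    marked = mark_entities(text, subject, obj)
    inputs = tokenizer(marked, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        logits = model(**inputs).logits
    return label_mapping[int(logits.argmax(dim=-1))]

print(predict_relation("木斧出生于成都", "木斧", "成都"))  # expected: 出生地 (birthplace)
```

---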