# BERT-Base Quantized Model for Relation Extraction

This repository hosts a quantized version of the BERT-Base Chinese model, fine-tuned for relation extraction tasks. The model has been optimized for efficient deployment while maintaining high accuracy, making it suitable for resource-constrained environments.

---

## Model Details

- **Model Name:** BERT-Base Chinese
- **Model Architecture:** BERT Base
- **Task:** Relation Extraction/Classification
- **Dataset:** Chinese Entity-Relation Dataset
- **Quantization:** Float16
- **Fine-tuning Framework:** Hugging Face Transformers

---

## Usage

### Installation

```bash
pip install transformers torch evaluate
```

### Loading the Quantized Model

```python
from transformers import BertTokenizerFast, BertForSequenceClassification
import torch

# Load the fine-tuned model and tokenizer
model_path = "final_relation_extraction_model"
tokenizer = BertTokenizerFast.from_pretrained(model_path)
model = BertForSequenceClassification.from_pretrained(model_path)
model.eval()

# Example input with entity markers
# (roughly: "Pen name: [SUBJ] ζœ¨ζ–§ [/SUBJ] Birthplace: [OBJ] Chengdu [/OBJ]")
text = "η¬”εοΌš[SUBJ] ζœ¨ζ–§ [/SUBJ] ε‡Ίη”Ÿεœ°οΌš[OBJ] ζˆιƒ½ [/OBJ]"

# Tokenize the input
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=128)

# Run inference
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits
    predicted_class = torch.argmax(logits, dim=1).item()

# Map the predicted class index to a relation label
label_mapping = {0: "ε‡Ίη”Ÿεœ°", 1: "ε‡Ίη”Ÿζ—₯期", 2: "民族", 3: "职业"}  # customize to match your label set
predicted_relation = label_mapping[predicted_class]
print(f"Predicted Relation: {predicted_relation}")
```
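
Because the released weights are stored in Float16, you may also want to run inference in half precision. The following is a minimal sketch, assuming a CUDA GPU is available; it reuses `model`, `tokenizer`, and `text` from the example above.

```python
import torch

# Sketch: half-precision inference on GPU (falls back to FP32 on CPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
fp16_model = model.half().to(device) if device.type == "cuda" else model
fp16_model.eval()

inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=128)
inputs = {k: v.to(device) for k, v in inputs.items()}

with torch.no_grad():
    logits = fp16_model(**inputs).logits
print(torch.argmax(logits, dim=1).item())
```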

---

## Performance Metrics

- **Accuracy:** 0.970222
- **F1 Score:** 0.964973
- **Training Loss:** 0.130104
- **Validation Loss:** 0.066986
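
To reproduce these metrics on your own labelled split, a minimal sketch using the `evaluate` library (installed above) could look like the following; `texts` and `labels` are placeholders for your validation sentences (with entity markers) and their numeric relation labels, and the F1 averaging mode is an assumption.

```python
import evaluate
import torch

# Placeholder validation data: marked sentences and numeric relation labels
texts = ["η¬”εοΌš[SUBJ] ζœ¨ζ–§ [/SUBJ] ε‡Ίη”Ÿεœ°οΌš[OBJ] ζˆιƒ½ [/OBJ]"]
labels = [0]

accuracy_metric = evaluate.load("accuracy")
f1_metric = evaluate.load("f1")

predictions = []
for sample in texts:
    inputs = tokenizer(sample, return_tensors="pt", truncation=True, padding=True, max_length=128)
    with torch.no_grad():
        logits = model(**inputs).logits
    predictions.append(torch.argmax(logits, dim=1).item())

print(accuracy_metric.compute(predictions=predictions, references=labels))
print(f1_metric.compute(predictions=predictions, references=labels, average="weighted"))
```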

---

## Fine-Tuning Details

### Dataset

The model was fine-tuned on a Chinese entity-relation dataset with:
- Entity pairs marked with special tokens `[SUBJ]` and `[OBJ]`
- Text preprocessed to include entity boundaries
- Multiple relation types, including biographical information

### Training Configuration

The run used the following hyperparameters (see the `TrainingArguments` sketch after this list):
- **Epochs:** 3
- **Batch Size:** 16
- **Learning Rate:** 2e-5
- **Max Length:** 128 tokens
- **Evaluation Strategy:** epoch
- **Weight Decay:** 0.01
- **Optimizer:** AdamW
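
These hyperparameters map roughly onto the following Hugging Face `TrainingArguments`; this is a sketch rather than the exact training script, and the output directory name is a placeholder.

```python
from transformers import TrainingArguments

# Approximate training setup from the list above (output_dir is a placeholder)
training_args = TrainingArguments(
    output_dir="relation_extraction_checkpoints",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    weight_decay=0.01,
    evaluation_strategy="epoch",  # named eval_strategy in recent transformers releases
)
# The Trainer uses AdamW by default, matching the optimizer listed above.
```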

### Data Processing

The original SPO (Subject-Predicate-Object) annotations were converted into a relation classification task (a conversion sketch follows this list):
- Each SPO triple becomes a separate training example
- Entities are marked with special tokens in the text
- Relations are encoded as numerical labels for classification
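
A minimal sketch of this conversion, assuming SPO triples with `text`, `subject`, `predicate`, and `object` fields (the field names, helper, and label list are illustrative, not the exact preprocessing code):

```python
# Illustrative SPO-to-classification conversion; the label set shown is a subset
relation_labels = ["ε‡Ίη”Ÿεœ°", "ε‡Ίη”Ÿζ—₯期", "民族", "职业"]
label2id = {label: i for i, label in enumerate(relation_labels)}

def spo_to_example(text: str, subject: str, predicate: str, obj: str) -> dict:
    """Mark the subject and object in the text and attach the numeric relation label."""
    marked = text.replace(subject, f"[SUBJ] {subject} [/SUBJ]", 1)
    marked = marked.replace(obj, f"[OBJ] {obj} [/OBJ]", 1)
    return {"text": marked, "label": label2id[predicate]}

example = spo_to_example("ζœ¨ζ–§ε‡Ίη”ŸδΊŽζˆιƒ½", "ζœ¨ζ–§", "ε‡Ίη”Ÿεœ°", "ζˆιƒ½")
print(example)  # {'text': '[SUBJ] ζœ¨ζ–§ [/SUBJ]ε‡Ίη”ŸδΊŽ[OBJ] ζˆιƒ½ [/OBJ]', 'label': 0}
```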

### Quantization

Post-training quantization was applied using PyTorch's Float16 precision to reduce the model size and improve inference efficiency.
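
A minimal sketch of how such a Float16 conversion can be produced with the Transformers API (paths are placeholders, not necessarily the exact commands used for this repository):

```python
from transformers import BertForSequenceClassification, BertTokenizerFast

# Load the full-precision fine-tuned model, cast its weights to Float16, and save it
model = BertForSequenceClassification.from_pretrained("final_relation_extraction_model")
tokenizer = BertTokenizerFast.from_pretrained("final_relation_extraction_model")

model = model.half()  # convert all parameters to torch.float16

model.save_pretrained("final_relation_extraction_model_fp16")
tokenizer.save_pretrained("final_relation_extraction_model_fp16")
```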

---

## Repository Structure

```
.
β”œβ”€β”€ final_relation_extraction_model/
β”‚   β”œβ”€β”€ config.json
β”‚   β”œβ”€β”€ pytorch_model.bin          # Fine-tuned model weights
β”‚   β”œβ”€β”€ tokenizer_config.json
β”‚   β”œβ”€β”€ special_tokens_map.json
β”‚   β”œβ”€β”€ tokenizer.json
β”‚   β”œβ”€β”€ vocab.txt
β”‚   └── added_tokens.json
β”œβ”€β”€ relationship-extraction.ipynb  # Training notebook
└── README.md                      # Model documentation
```

---

## Entity Marking Format

The model expects input text with entities marked using special tokens:
- Subject entities: `[SUBJ] entity_name [/SUBJ]`
- Object entities: `[OBJ] entity_name [/OBJ]`

Example:
```
Input:  "η¬”εοΌš[SUBJ] ζœ¨ζ–§ [/SUBJ]εŽŸεοΌšζ¨θŽ†ζ›Ύζ°‘ζ—οΌš [OBJ] ε›žζ— [/OBJ]"
Output: "民族" (ethnicity relation)
```
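
The released tokenizer already contains these markers (see `added_tokens.json` in the repository structure). If you are reproducing the fine-tuning from the base `bert-base-chinese` checkpoint, the markers typically need to be registered first; a sketch, with `num_labels` as an illustrative value:

```python
from transformers import BertTokenizerFast, BertForSequenceClassification

# Sketch: register the entity markers on the base checkpoint
# (only needed when starting from bert-base-chinese, not with the released model)
tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
model = BertForSequenceClassification.from_pretrained("bert-base-chinese", num_labels=4)

tokenizer.add_special_tokens(
    {"additional_special_tokens": ["[SUBJ]", "[/SUBJ]", "[OBJ]", "[/OBJ]"]}
)
model.resize_token_embeddings(len(tokenizer))  # make room for the new token embeddings
```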

---

## Supported Relations

The model can classify various biographical and factual relations in Chinese text, including:
- ε‡Ίη”Ÿεœ° (Birthplace)
- ε‡Ίη”Ÿζ—₯期 (Birth Date)
- 民族 (Ethnicity)
- 职业 (Occupation)
- And many more, depending on the training dataset

---

## Limitations

- The model is trained specifically for Chinese text and may not work well for other languages
- Performance depends on proper entity marking in the input text
- The model may not generalize well to domains outside the fine-tuning dataset
- Quantization may cause minor accuracy degradation compared to the full-precision model

---

## Training Environment

- **Platform:** Kaggle Notebooks with GPU acceleration
- **GPU:** NVIDIA Tesla T4
- **Training Time:** Approximately 1 hour 5 minutes
- **Framework:** Hugging Face Transformers with PyTorch backend

---

## Contributing

Contributions are welcome! Feel free to open an issue or a PR for improvements, fixes, or feature extensions.

---