|
--- |
|
language: |
|
- en |
|
base_model: |
|
- FacebookAI/xlm-roberta-large |
|
pipeline_tag: text-classification |
|
library_name: transformers |
|
--- |
|
|
|
# Patent Classification Model |
|
|
|
### Model Description |
|
|
|
**multilabel_patent_classifier** is a fine-tuned [XLM-RoBERTa-large](https://huggingface.co/FacebookAI/xlm-roberta-large) model trained on British Patent Office class labels for patents filed between 1855 and 1883, made available [here](http://walkerhanlon.com/data_resources/british_patent_classification_database.zip).
|
|
|
It has been trained to assign each patent title to one or more of 146 patent classes defined by the British Patent Office. The full list of classes is available [here](https://huggingface.co/matthewleechen/multiclass-classifier-patents/blob/main/BPO_classes.csv).
|
|
|
We take the original XLM-RoBERTa-large [weights](https://huggingface.co/FacebookAI/xlm-roberta-large/blob/main/pytorch_model.bin) and fine-tune on our custom dataset for 10 epochs with a learning rate of 2e-05 and a batch size of 64.
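
For reference, a minimal sketch of how these hyperparameters map onto the standard `TrainingArguments`; the `output_dir` is illustrative, and all other settings are left at their defaults:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="multilabel_patent_classifier",  # illustrative path
    num_train_epochs=10,                        # 10 epochs, as above
    learning_rate=2e-5,
    per_device_train_batch_size=64,
)
```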
|
|
|
### Usage |
|
|
|
This model can be used with the Hugging Face Transformers pipelines API for multi-label text classification:
|
|
|
```python
from transformers import pipeline, AutoModelForSequenceClassification, AutoTokenizer

model_name = "matthewleechen/multilabel_patent_classifier"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

pipe = pipeline(
    task="text-classification",
    model=model,
    tokenizer=tokenizer,
    device=0,    # GPU index; use device=-1 (or omit) for CPU
    top_k=None,  # return scores for all classes (replaces the deprecated return_all_scores=True)
)
```
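
For example, you can run the pipeline on a single (invented) patent title and keep every class whose score clears the 0.5 threshold used in training. If the uploaded config does not set `problem_type="multi_label_classification"`, pass `function_to_apply="sigmoid"` to the pipeline so the scores are per-class sigmoids rather than a softmax:

```python
title = "Improvements in the construction of steam boilers"  # illustrative input

scores = pipe([title])[0]  # one list of {"label", "score"} dicts per input

# keep every class whose sigmoid score clears the 0.5 decision threshold
predicted = [s["label"] for s in scores if s["score"] > 0.5]
print(predicted)
```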
|
|
|
### Training Data |
|
|
|
Our training data consists of patent titles, each labelled with a binary (0/1) tag for every patent class. The labels were generated by the British Patent Office between 1855 and 1883, and the patent titles were extracted from the front pages of our specification texts using a patent title NER [model](https://huggingface.co/matthewleechen/patent_titles_ner).
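
Concretely, each example pairs a title with a 146-dimensional multi-hot label vector; the title and class indices below are invented for illustration:

```python
NUM_CLASSES = 146  # number of BPO patent classes

# one illustrative record: a 1.0 at the index of every class the BPO assigned
example = {
    "text": "Improvements in apparatus for spinning cotton",  # hypothetical title
    "labels": [0.0] * NUM_CLASSES,
}
for class_idx in (12, 87):  # hypothetical class indices
    example["labels"][class_idx] = 1.0
```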
|
|
|
### Training Procedure |
|
|
|
We follow the standard multi-label classification protocol with the Hugging Face Trainer API, but replace the default `BCEWithLogitsLoss` with a [focal loss](https://arxiv.org/pdf/1708.02002) (α=1, γ=2) to address class imbalance. Both during evaluation and at inference, we apply a sigmoid to each logit and use a 0.5 threshold to determine the positive labels for each class.
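
The training code is not bundled with this card, but a minimal sketch of the loss and the `Trainer` override might look as follows. Only α=1, γ=2 and the sigmoid/BCE base come from the description above; the function and class names are our own:

```python
import torch
import torch.nn.functional as F
from transformers import Trainer

def focal_loss(logits, labels, alpha=1.0, gamma=2.0):
    """Binary focal loss (Lin et al., 2018) over multi-hot labels."""
    # per-element BCE, i.e. -log(p_t) for each class's true label
    bce = F.binary_cross_entropy_with_logits(logits, labels, reduction="none")
    p_t = torch.exp(-bce)  # recover p_t from -log(p_t)
    return (alpha * (1.0 - p_t) ** gamma * bce).mean()

class FocalLossTrainer(Trainer):
    """Trainer that swaps the default BCEWithLogitsLoss for focal loss."""

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        loss = focal_loss(outputs.logits, labels.float())
        return (loss, outputs) if return_outputs else loss
```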
|
|
|
### Evaluation |
|
|
|
We compute precision, recall, and F1 for each class (with a 0.5 sigmoid threshold), as well as exact match (the predicted class set is identical to the ground truth) and any match (the predicted classes overlap the ground truth in at least one class) percentages.
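
As a sketch, these metrics can be computed with scikit-learn and NumPy along the following lines; the arrays here are random toys standing in for the real test-set labels and sigmoid outputs:

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=(8, 146))  # toy multi-hot ground-truth labels
probs = rng.random(size=(8, 146))           # toy per-class sigmoid outputs

y_pred = (probs > 0.5).astype(int)  # 0.5 sigmoid threshold per class

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="micro", zero_division=0
)

# exact match: predicted label set identical to the ground truth
exact_match = (y_pred == y_true).all(axis=1).mean()

# any match: at least one predicted class also appears in the ground truth
any_match = ((y_pred * y_true).sum(axis=1) > 0).mean()
```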
|
|
|
These scores are aggregated for the test set below. |
|
|
|
| Precision (micro) | Recall (micro) | F1 (micro) | Exact Match | Any Match |
|:-----------------:|:--------------:|:----------:|:-----------:|:---------:|
| 83.4% | 60.3% | 70.0% | 52.9% | 90.8% |
|
|
|
|
|
## References |
|
|
|
```bibtex |
|
@misc{hanlon2016,
  title = {{British Patent Technology Classification Database: 1855–1882}},
  author = {Hanlon, Walker},
  year = {2016},
  url = {http://www.econ.ucla.edu/whanlon/}
}
|
|
|
@misc{lin2018focallossdenseobject, |
|
title={Focal Loss for Dense Object Detection}, |
|
author={Tsung-Yi Lin and Priya Goyal and Ross Girshick and Kaiming He and Piotr Dollár}, |
|
year={2018}, |
|
eprint={1708.02002}, |
|
archivePrefix={arXiv}, |
|
primaryClass={cs.CV}, |
|
url={https://arxiv.org/abs/1708.02002}, |
|
} |
|
``` |
|
|
|
## Citation |
|
|
|
If you use our model in your research, please cite our accompanying paper as follows: |
|
|
|
```bibtex |
|
@article{bct2025, |
|
title = {300 Years of British Patents}, |
|
author = {Enrico Berkes and Matthew Lee Chen and Matteo Tranchero}, |
|
journal = {arXiv preprint arXiv:2401.12345}, |
|
year = {2025}, |
|
url = {https://arxiv.org/abs/2401.12345} |
|
} |
|
``` |