|
--- |
|
language: |
|
- en |
|
base_model: |
|
- FacebookAI/xlm-roberta-large |
|
pipeline_tag: text-classification |
|
library_name: transformers |
|
--- |
|
|
|
# Patent Classification Model |
|
|
|
### Model Description |
|
|
|
**multilabel_patent_classifier** is a fine-tuned [XLM-RoBERTa-large](https://huggingface.co/FacebookAI/xlm-roberta-large) model trained on British Patent Office class labels for patents filed between 1855 and 1883, made available [here](http://walkerhanlon.com/data_resources/british_patent_classification_database.zip).
|
|
|
It has been trained to assign each patent title to one or more of 146 patent classes defined by the British Patent Office. The full list of classes is available [here](https://huggingface.co/matthewleechen/multiclass-classifier-patents/blob/main/BPO_classes.csv).
|
|
|
We take the original XLM-RoBERTa-large [weights](https://huggingface.co/FacebookAI/xlm-roberta-large/blob/main/pytorch_model.bin) and fine-tune on our custom dataset for 10 epochs with a learning rate of 2e-05 and a batch size of 64.
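
For reference, a minimal sketch of how these hyperparameters map onto the standard `TrainingArguments`; the `output_dir` is illustrative, and all other settings are left at their defaults:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="multilabel_patent_classifier",  # illustrative path
    num_train_epochs=10,                        # 10 epochs, as above
    learning_rate=2e-5,
    per_device_train_batch_size=64,
)
```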
|
|
|
### Usage |
|
|
|
This model can be used with the Hugging Face Transformers pipelines API for multi-label text classification:
|
|
|
```python
from transformers import pipeline, AutoModelForSequenceClassification, AutoTokenizer

model_name = "matthewleechen/multilabel_patent_classifier"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

pipe = pipeline(
    task="text-classification",
    model=model,
    tokenizer=tokenizer,
    device=0,    # GPU index; use device=-1 (or omit) for CPU
    top_k=None,  # return scores for all classes (replaces the deprecated return_all_scores=True)
)
```
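
For example, you can run the pipeline on a single (invented) patent title and keep every class whose score clears the 0.5 threshold used in training. If the uploaded config does not set `problem_type="multi_label_classification"`, pass `function_to_apply="sigmoid"` to the pipeline so the scores are per-class sigmoids rather than a softmax:

```python
title = "Improvements in the construction of steam boilers"  # illustrative input

scores = pipe([title])[0]  # one list of {"label", "score"} dicts per input

# keep every class whose sigmoid score clears the 0.5 decision threshold
predicted = [s["label"] for s in scores if s["score"] > 0.5]
print(predicted)
```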
|
|
|
### Training Data |
|
|
|
Our training data consists of patent titles, each labelled with a binary (0/1) tag for every patent class. The labels were generated by the British Patent Office between 1855 and 1883, and the patent titles were extracted from the front pages of our specification texts using a patent title NER [model](https://huggingface.co/matthewleechen/patent_titles_ner).
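
Concretely, each example pairs a title with a 146-dimensional multi-hot label vector; the title and class indices below are invented for illustration:

```python
NUM_CLASSES = 146  # number of BPO patent classes

# one illustrative record: a 1.0 at the index of every class the BPO assigned
example = {
    "text": "Improvements in apparatus for spinning cotton",  # hypothetical title
    "labels": [0.0] * NUM_CLASSES,
}
for class_idx in (12, 87):  # hypothetical class indices
    example["labels"][class_idx] = 1.0
```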
|
|
|
### Training Procedure |
|
|
|
We follow the standard multi-label classification protocol with the Hugging Face Trainer API, but replace the default `BCEWithLogitsLoss` with a [focal loss](https://arxiv.org/pdf/1708.02002) (α=1, γ=2) to address class imbalance. Both during evaluation and at inference, we apply a sigmoid to each logit and use a 0.5 threshold to determine the positive labels for each class.
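
The training code is not bundled with this card, but a minimal sketch of the loss and the `Trainer` override might look as follows. Only α=1, γ=2 and the sigmoid/BCE base come from the description above; the function and class names are our own:

```python
import torch
import torch.nn.functional as F
from transformers import Trainer

def focal_loss(logits, labels, alpha=1.0, gamma=2.0):
    """Binary focal loss (Lin et al., 2018) over multi-hot labels."""
    # per-element BCE, i.e. -log(p_t) for each class's true label
    bce = F.binary_cross_entropy_with_logits(logits, labels, reduction="none")
    p_t = torch.exp(-bce)  # recover p_t from -log(p_t)
    return (alpha * (1.0 - p_t) ** gamma * bce).mean()

class FocalLossTrainer(Trainer):
    """Trainer that swaps the default BCEWithLogitsLoss for focal loss."""

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        loss = focal_loss(outputs.logits, labels.float())
        return (loss, outputs) if return_outputs else loss
```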
|
|
|
### Evaluation |
|
|
|
We compute precision, recall, and F1 for each class (with a 0.5 sigmoid threshold), as well as exact match (the predicted class set is identical to the ground truth) and any match (the predicted classes overlap the ground truth in at least one class) percentages.
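
As a sketch, these metrics can be computed with scikit-learn and NumPy along the following lines; the arrays here are random toys standing in for the real test-set labels and sigmoid outputs:

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=(8, 146))  # toy multi-hot ground-truth labels
probs = rng.random(size=(8, 146))           # toy per-class sigmoid outputs

y_pred = (probs > 0.5).astype(int)  # 0.5 sigmoid threshold per class

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="micro", zero_division=0
)

# exact match: predicted label set identical to the ground truth
exact_match = (y_pred == y_true).all(axis=1).mean()

# any match: at least one predicted class also appears in the ground truth
any_match = ((y_pred * y_true).sum(axis=1) > 0).mean()
```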
|
|
|
These scores are aggregated for the test set below. |
|
|
|
| Precision (micro) | Recall (micro) | F1 (micro) | Exact Match | Any Match |
|:-----------------:|:--------------:|:----------:|:-----------:|:---------:|
| 83.4% | 60.3% | 70.0% | 52.9% | 90.8% |
|
|
|
|
|
## References |
|
|
|
```bibtex |
|
@misc{hanlon2016,
  title = {{British Patent Technology Classification Database: 1855–1882}},
  author = {Hanlon, Walker},
  year = {2016},
  url = {http://www.econ.ucla.edu/whanlon/}
}
|
|
|
@misc{lin2018focallossdenseobject, |
|
title={Focal Loss for Dense Object Detection}, |
|
author={Tsung-Yi Lin and Priya Goyal and Ross Girshick and Kaiming He and Piotr Dollár}, |
|
year={2018}, |
|
eprint={1708.02002}, |
|
archivePrefix={arXiv}, |
|
primaryClass={cs.CV}, |
|
url={https://arxiv.org/abs/1708.02002}, |
|
} |
|
``` |
|
|
|
## Citation |
|
|
|
If you use our model in your research, please cite our accompanying paper as follows: |
|
|
|
```bibtex |
|
@article{bct2025, |
|
title = {300 Years of British Patents}, |
|
author = {Enrico Berkes and Matthew Lee Chen and Matteo Tranchero}, |
|
journal = {arXiv preprint arXiv:2401.12345}, |
|
year = {2025}, |
|
url = {https://arxiv.org/abs/2401.12345} |
|
} |
|
``` |