	---
	library_name: transformers
	tags:
	- citation
	- text-classification
	- science
	license: apache-2.0
	language:
	- af
	- am
	- ar
	- as
	- az
	- be
	- bg
	- bn
	- br
	- bs
	- ca
	- cs
	- cy
	- da
	- de
	- el
	- en
	- eo
	- es
	- et
	- eu
	- fa
	- fi
	- fr
	- fy
	- ga
	- gd
	- gl
	- gu
	- ha
	- he
	- hi
	- hr
	- hu
	- hy
	- id
	- is
	- it
	- ja
	- jv
	- ka
	- kk
	- km
	- kn
	- ko
	- ku
	- ky
	- la
	- lo
	- lt
	- lv
	- mg
	- mk
	- ml
	- mn
	- mr
	- ms
	- my
	- ne
	- nl
	- 'no'
	- om
	- or
	- pa
	- pl
	- ps
	- pt
	- ro
	- ru
	- sa
	- sd
	- si
	- sk
	- sl
	- so
	- sq
	- sr
	- su
	- sv
	- sw
	- ta
	- te
	- th
	- tl
	- tr
	- ug
	- uk
	- ur
	- uz
	- vi
	- xh
	- yi
	- zh
	base_model:
	- distilbert/distilbert-base-multilingual-cased
	---

	# Citation Pre-Screening

	## Overview

	<details>
	<summary>Click to expand</summary>

	- Model type: Language Model
	- Architecture: DistilBERT
	- Language: Multilingual
	- License: Apache 2.0
	- Task: Binary Classification (Citation Pre-Screening)
	- Dataset: SIRIS-Lab/citation-parser-TYPE
	- Additional Resources:
	  - [GitHub](https://github.com/sirisacademic/citation-parser)
	</details>

	## Model description

	The Citation Pre-Screening model is part of the [`Citation Parser`](https://github.com/sirisacademic/citation-parser) package and is fine-tuned to classify citation texts as valid or invalid. Based on DistilBERT, it is designed for automated citation-processing workflows and is the component of the Citation Parser tool that screens input ahead of citation metadata extraction and validation.

	The model was trained on a dataset of citation texts labeled `True` (valid citation) or `False` (invalid citation). The dataset contains 3599 training samples and 400 test samples; each example consists of a citation-related text and its label.

	The model was fine-tuned from `distilbert-base-multilingual-cased`, so it can process multilingual text. Its evaluation, however, was carried out on English citation data.

	## Intended Usage

	This model is intended to classify raw citation text as either a valid or an invalid citation. It is suited to automating the pre-screening step in citation databases or manuscript workflows.

	## How to use

	```python
	from transformers import pipeline

	# Load the model
	citation_classifier = pipeline("text-classification", model="sirisacademic/citation-pre-screening")

	# Example citation text
	citation_text = "MURAKAMI, H等: 'Unique thermal behavior of acrylic PSAs bearing long alkyl side groups and crosslinked by aluminum chelate', 《EUROPEAN POLYMER JOURNAL》"

	# Classify the citation
	result = citation_classifier(citation_text)
	print(result)
	```
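
The pipeline returns one dictionary per input with a `label` and a `score`. As a minimal sketch of a pre-screening step, the helper below keeps only the texts predicted as valid with sufficient confidence. The helper name `keep_valid`, the `min_score` threshold, and the label string `"True"` are assumptions made for illustration; check the label names in the model's configuration before relying on them. The predictions are hand-written stand-ins for pipeline output, not real model scores.

```python
from typing import Dict, List, Union

# Assumption: the "valid" class is exposed under the label string "True".
# Verify against the label mapping in the model's config before use.
VALID_LABEL = "True"

Prediction = Dict[str, Union[str, float]]


def keep_valid(
    texts: List[str],
    predictions: List[Prediction],
    min_score: float = 0.5,
) -> List[str]:
    """Return the texts predicted as valid with score >= min_score."""
    kept = []
    for text, pred in zip(texts, predictions):
        if pred["label"] == VALID_LABEL and pred["score"] >= min_score:
            kept.append(text)
    return kept


# Illustrative inputs; the second and third strings are made up.
texts = [
    "MURAKAMI, H等: 'Unique thermal behavior of acrylic PSAs bearing long "
    "alkyl side groups and crosslinked by aluminum chelate', "
    "《EUROPEAN POLYMER JOURNAL》",
    "see the attached figure for details",
    "Doe, J. (2020). An example title. Example Journal.",
]

# Hand-written mock of pipeline output (same shape: label + score).
mock_predictions = [
    {"label": "True", "score": 0.98},
    {"label": "False", "score": 0.91},
    {"label": "True", "score": 0.55},
]

print(keep_valid(texts, mock_predictions, min_score=0.9))
```

In a real workflow, `mock_predictions` would be replaced by `citation_classifier(texts)`, which accepts a list of strings and returns a list of predictions in the same order.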

	## Training

	The model was trained using the Citation Pre-Screening Dataset consisting of:

	- Training data: 3599 samples
	- Test data: 400 samples

	The following hyperparameters were used for training:

	- Model Path: `distilbert/distilbert-base-multilingual-cased`
	- Batch Size: 32
	- Number of Epochs: 4
	- Learning Rate: 2e-5
	- Max Sequence Length: 512
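
To make the scale of this run concrete, the sketch below collects the listed hyperparameters in a dictionary and derives the number of optimizer steps. It assumes a single device, no gradient accumulation, and that the final partial batch is kept; none of these is stated in this card. The key names mirror parameters of `transformers.TrainingArguments`, except `max_length`, which would be passed to the tokenizer.

```python
import math

TRAIN_SAMPLES = 3599  # training set size reported in this card

# Hyperparameters as listed above. Treating the batch size as a
# per-device value is an assumption.
hyperparameters = {
    "per_device_train_batch_size": 32,
    "num_train_epochs": 4,
    "learning_rate": 2e-5,
    "max_length": 512,  # tokenizer truncation length, not a training argument
}

# One optimizer step per batch (assumes no gradient accumulation and
# that the last, smaller batch is not dropped).
steps_per_epoch = math.ceil(
    TRAIN_SAMPLES / hyperparameters["per_device_train_batch_size"]
)
total_steps = steps_per_epoch * hyperparameters["num_train_epochs"]

print(steps_per_epoch, total_steps)  # 113 452
```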

	## Evaluation Metrics

	The model's performance was evaluated on the test set, and the following results were obtained:

	| Metric          | Value |
	|-----------------|-------|
	| Accuracy        | 0.95  |
	| Macro avg F1    | 0.94  |
	| Weighted avg F1 | 0.95  |
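
For readers unfamiliar with the three metrics, the following sketch shows how accuracy, macro-averaged F1, and weighted-averaged F1 are defined for a binary task. It is a plain-Python illustration on a six-item toy example, not the evaluation script used for this model, and the toy labels are invented; the function names are hypothetical.

```python
from typing import Dict, List


def f1_for_label(y_true: List[bool], y_pred: List[bool], label: bool) -> float:
    """F1 score treating `label` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def summarize(y_true: List[bool], y_pred: List[bool]) -> Dict[str, float]:
    """Accuracy, macro F1 (unweighted mean), weighted F1 (by class support)."""
    labels = sorted(set(y_true))
    f1s = {label: f1_for_label(y_true, y_pred, label) for label in labels}
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    macro_f1 = sum(f1s.values()) / len(labels)
    weighted_f1 = sum(f1s[l] * y_true.count(l) for l in labels) / len(y_true)
    return {"accuracy": accuracy, "macro_f1": macro_f1, "weighted_f1": weighted_f1}


# Toy data: four valid and two invalid citations, with two mistakes.
y_true = [True, True, True, True, False, False]
y_pred = [True, True, True, False, False, True]

result = summarize(y_true, y_pred)
print(result)
```

Macro F1 gives both classes equal weight, while weighted F1 scales each class by its number of true examples, so the two diverge when the classes are imbalanced.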

	## Additional information

	### Authors

	- SIRIS Lab, Research Division of SIRIS Academic.

	### License

	This work is distributed under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0).

	### Contact
	For further information, send an email to either [[email protected]](mailto:[email protected]) or [[email protected]](mailto:[email protected]).