---
library_name: transformers
tags:
- citation
- text-classification
- science
license: apache-2.0
language:
- af
- am
- ar
- as
- az
- be
- bg
- bn
- br
- bs
- ca
- cs
- cy
- da
- de
- el
- en
- eo
- es
- et
- eu
- fa
- fi
- fr
- fy
- ga
- gd
- gl
- gu
- ha
- he
- hi
- hr
- hu
- hy
- id
- is
- it
- ja
- jv
- ka
- kk
- km
- kn
- ko
- ku
- ky
- la
- lo
- lt
- lv
- mg
- mk
- ml
- mn
- mr
- ms
- my
- ne
- nl
- 'no'
- om
- or
- pa
- pl
- ps
- pt
- ro
- ru
- sa
- sd
- si
- sk
- sl
- so
- sq
- sr
- su
- sv
- sw
- ta
- te
- th
- tl
- tr
- ug
- uk
- ur
- uz
- vi
- xh
- yi
- zh
base_model:
- distilbert/distilbert-base-multilingual-cased
---

# Citation Pre-Screening

<!-- Provide a quick summary of what the model is/does. -->

## Overview

<details>
<summary>Click to expand</summary>

- **Model type:** Language Model
- **Architecture:** DistilBERT
- **Language:** Multilingual
- **License:** Apache 2.0
- **Task:** Binary Classification (Citation Pre-Screening)
- **Dataset:** SIRIS-Lab/citation-parser-TYPE
- **Additional Resources:**
  - [GitHub](https://github.com/sirisacademic/citation-parser)
</details>

## Model description

The **Citation Pre-Screening** model is part of the [`Citation Parser`](https://github.com/sirisacademic/citation-parser) package and is fine-tuned to classify citation texts as valid or invalid. Based on **DistilBERT**, it is designed for automated citation-processing workflows, where it serves as the pre-screening step of the **Citation Parser** tool for citation metadata extraction and validation.

The model was trained on a dataset of citation texts labeled `True` (valid citation) or `False` (invalid citation). The dataset contains 3599 training samples and 400 test samples; each example pairs a citation-related text with its label.

The model was fine-tuned from **DistilBERT-base-multilingual-cased**, so it can handle multilingual text, but it was evaluated on English citation data.

## Intended Usage

This model classifies raw citation text as either a valid or an invalid citation. It is suited to automating the pre-screening step in citation databases or manuscript workflows.

## How to use

```python
from transformers import pipeline

# Load the model
citation_classifier = pipeline("text-classification", model="sirisacademic/citation-pre-screening")

# Example citation text
citation_text = "MURAKAMI, H等: 'Unique thermal behavior of acrylic PSAs bearing long alkyl side groups and crosslinked by aluminum chelate', 《EUROPEAN POLYMER JOURNAL》"

# Classify the citation
result = citation_classifier(citation_text)
print(result)
```

## Training

The model was trained using the **Citation Pre-Screening Dataset** consisting of:

- **Training data**: 3599 samples
- **Test data**: 400 samples

The following hyperparameters were used for training:

- **Model Path**: `distilbert/distilbert-base-multilingual-cased`
- **Batch Size**: 32
- **Number of Epochs**: 4
- **Learning Rate**: 2e-5
- **Max Sequence Length**: 512

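The hyperparameters above can be collected into a small configuration object, which also makes the implied training length easy to work out. This is a sketch under stated assumptions, not the authors' training script: the card does not say which training loop was used, so the `FineTuneConfig` class, the `build_training_arguments` helper, and the mapping of "Batch Size" to `per_device_train_batch_size` are hypothetical. The step count assumes a single device, no gradient accumulation, and that the last partial batch is kept.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    """Values copied from the hyperparameter list; the class itself is illustrative."""
    model_path: str = "distilbert/distilbert-base-multilingual-cased"
    batch_size: int = 32
    num_epochs: int = 4
    learning_rate: float = 2e-5
    max_seq_length: int = 512  # would be applied at tokenization time
    train_samples: int = 3599


def optimizer_steps(cfg: FineTuneConfig):
    """Steps per epoch and in total, assuming one device and no gradient accumulation."""
    steps_per_epoch = math.ceil(cfg.train_samples / cfg.batch_size)
    return steps_per_epoch, steps_per_epoch * cfg.num_epochs


def build_training_arguments(cfg: FineTuneConfig, output_dir: str = "citation-pre-screening"):
    """Hypothetical mapping onto transformers.TrainingArguments (imported lazily)."""
    from transformers import TrainingArguments

    return TrainingArguments(
        output_dir=output_dir,
        per_device_train_batch_size=cfg.batch_size,  # assumption: batch size is per device
        num_train_epochs=cfg.num_epochs,
        learning_rate=cfg.learning_rate,
    )


cfg = FineTuneConfig()
steps_per_epoch, total_steps = optimizer_steps(cfg)
print(steps_per_epoch, total_steps)
```

Under those assumptions, 3599 samples at batch size 32 give 113 steps per epoch and 452 steps over 4 epochs.
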
## Evaluation Metrics

The model's performance was evaluated on the test set, and the following results were obtained:

| Metric              | Value |
|---------------------|-------|
| **Accuracy**        | 0.95  |
| **Macro avg F1**    | 0.94  |
| **Weighted avg F1** | 0.95  |

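For readers unfamiliar with the two F1 averages, the sketch below shows how they are usually computed: macro F1 averages the per-class F1 scores equally, while weighted F1 weights each class by its number of true examples. The function names and the toy labels are made up for illustration; they are not the model's predictions and do not reproduce the figures in the table.

```python
from collections import Counter


def f1_for_class(y_true, y_pred, label):
    """F1 for one class, computed from true positives, false positives, false negatives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def summarize(y_true, y_pred):
    """Accuracy, macro-averaged F1, and support-weighted F1."""
    labels = sorted(set(y_true))
    support = Counter(y_true)
    f1 = {c: f1_for_class(y_true, y_pred, c) for c in labels}
    return {
        "accuracy": sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true),
        "macro_f1": sum(f1.values()) / len(labels),
        "weighted_f1": sum(f1[c] * support[c] for c in labels) / len(y_true),
    }


# Toy, imbalanced example: 8 valid and 2 invalid citations, one invalid misclassified.
y_true = [True] * 8 + [False] * 2
y_pred = [True] * 8 + [True, False]
report = summarize(y_true, y_pred)
print(report)
```

In this toy case the minority class has the lower F1, so the macro average falls below the weighted average.
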
## Additional information

### Authors

- SIRIS Lab, Research Division of SIRIS Academic.

### License

This work is distributed under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0).

### Contact

For further information, send an email to either [[email protected]](mailto:[email protected]) or [[email protected]](mailto:[email protected]).