---
library_name: transformers
tags:
- citation
- text-classification
- science
license: apache-2.0
language:
- af
- am
- ar
- as
- az
- be
- bg
- bn
- br
- bs
- ca
- cs
- cy
- da
- de
- el
- en
- eo
- es
- et
- eu
- fa
- fi
- fr
- fy
- ga
- gd
- gl
- gu
- ha
- he
- hi
- hr
- hu
- hy
- id
- is
- it
- ja
- jv
- ka
- kk
- km
- kn
- ko
- ku
- ky
- la
- lo
- lt
- lv
- mg
- mk
- ml
- mn
- mr
- ms
- my
- ne
- nl
- 'no'
- om
- or
- pa
- pl
- ps
- pt
- ro
- ru
- sa
- sd
- si
- sk
- sl
- so
- sq
- sr
- su
- sv
- sw
- ta
- te
- th
- tl
- tr
- ug
- uk
- ur
- uz
- vi
- xh
- yi
- zh
base_model:
- distilbert/distilbert-base-multilingual-cased
---

# Citation Pre-Screening

## Overview
- **Model type:** Language Model
- **Architecture:** DistilBERT
- **Language:** Multilingual
- **License:** Apache 2.0
- **Task:** Binary Classification (Citation Pre-Screening)
- **Dataset:** SIRIS-Lab/citation-parser-TYPE
- **Additional Resources:**
  - [GitHub](https://github.com/sirisacademic/citation-parser)
## Model description

The **Citation Pre-Screening** model is part of the [`Citation Parser`](https://github.com/sirisacademic/citation-parser) package and is fine-tuned to classify citation texts as valid or invalid. Based on **DistilBERT**, it is designed for automated citation-processing workflows and serves as the pre-screening component of the **Citation Parser** tool for citation metadata extraction and validation.

The model was trained on a dataset of citation texts labelled `True` (valid citation) or `False` (invalid citation). The dataset contains 3599 training samples and 400 test samples, each consisting of citation-related text and a corresponding label. Fine-tuning started from the **DistilBERT-base-multilingual-cased** checkpoint, so the model can handle multilingual text, although it was evaluated on English citation data.

## Intended Usage

This model classifies raw citation text as either a valid or an invalid citation. It is suited to automating the pre-screening step in citation databases or manuscript-processing workflows.

## How to use

```python
from transformers import pipeline

# Load the model
citation_classifier = pipeline("text-classification", model="sirisacademic/citation-pre-screening")

# Example citation text (multilingual: 等 is the Chinese equivalent of "et al.")
citation_text = "MURAKAMI, H等: 'Unique thermal behavior of acrylic PSAs bearing long alkyl side groups and crosslinked by aluminum chelate', 《EUROPEAN POLYMER JOURNAL》"

# Classify the citation
result = citation_classifier(citation_text)
print(result)
```

## Training

The model was trained on the **Citation Pre-Screening Dataset**, consisting of:

- **Training data**: 3599 samples
- **Test data**: 400 samples

The following hyperparameters were used for training (a minimal setup consistent with them is sketched in the appendix below):

- **Model Path**: `distilbert/distilbert-base-multilingual-cased`
- **Batch Size**: 32
- **Number of Epochs**: 4
- **Learning Rate**: 2e-5
- **Max Sequence Length**: 512

## Evaluation Metrics

The model was evaluated on the test set, with the following results (see the appendix for a sketch of how comparable metrics can be computed):

| Metric              | Value |
|---------------------|-------|
| **Accuracy**        | 0.95  |
| **Macro avg F1**    | 0.94  |
| **Weighted avg F1** | 0.95  |

## Additional information

### Authors

- SIRIS Lab, Research Division of SIRIS Academic.

### License

This work is distributed under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0).

### Contact

For further information, send an email to either [nicolau.duransilva@sirisacademic.com](mailto:nicolau.duransilva@sirisacademic.com) or [info@sirisacademic.com](mailto:info@sirisacademic.com).
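## Appendix: reproduction sketches

### Fine-tuning sketch

The actual training script is part of the [citation-parser repository](https://github.com/sirisacademic/citation-parser). The snippet below is only a minimal sketch of a `Trainer` setup consistent with the hyperparameters listed under **Training**; the `text`/`label` column names and the label-to-id mapping are assumptions, not taken from this card.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# Assumptions: the dataset exposes "text" and "label" columns, and the
# True/False labels need mapping to integer class ids (0/1). The official
# training script may differ on both points.
dataset = load_dataset("SIRIS-Lab/citation-parser-TYPE")
dataset = dataset.map(lambda ex: {"label": int(ex["label"] in (True, "True"))})

tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-multilingual-cased")

def tokenize(batch):
    # Max sequence length: 512, as listed in the hyperparameters
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True)

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert/distilbert-base-multilingual-cased",
    num_labels=2,  # binary task: valid vs. invalid citation
)

args = TrainingArguments(
    output_dir="citation-pre-screening",
    per_device_train_batch_size=32,  # batch size: 32
    num_train_epochs=4,              # epochs: 4
    learning_rate=2e-5,              # learning rate: 2e-5
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    tokenizer=tokenizer,  # enables dynamic padding via DataCollatorWithPadding
)

trainer.train()
trainer.evaluate()
```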
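### Evaluation sketch

The metric names in the table above match the summary rows of scikit-learn's `classification_report`. The snippet below shows how comparable numbers could be computed on a held-out test set; the `texts` and `gold_labels` variables are hypothetical placeholders, not part of this card.

```python
from sklearn.metrics import classification_report
from transformers import pipeline

citation_classifier = pipeline(
    "text-classification",
    model="sirisacademic/citation-pre-screening",
)

# Hypothetical test data: raw citation strings with gold labels.
texts = ["...citation text 1...", "...citation text 2..."]
gold_labels = ["True", "False"]

# truncation=True keeps inputs within the model's 512-token limit.
predictions = [r["label"] for r in citation_classifier(texts, truncation=True)]

# Prints per-class precision/recall/F1 plus the accuracy,
# macro-average, and weighted-average rows reported above.
# (Assumes the model's id2label mapping yields "True"/"False";
# otherwise, map the gold labels to the model's label names first.)
print(classification_report(gold_labels, predictions))
```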