---
library_name: transformers
tags:
- citation
- text-classification
- science
license: apache-2.0
language:
- af
- am
- ar
- as
- az
- be
- bg
- bn
- br
- bs
- ca
- cs
- cy
- da
- de
- el
- en
- eo
- es
- et
- eu
- fa
- fi
- fr
- fy
- ga
- gd
- gl
- gu
- ha
- he
- hi
- hr
- hu
- hy
- id
- is
- it
- ja
- jv
- ka
- kk
- km
- kn
- ko
- ku
- ky
- la
- lo
- lt
- lv
- mg
- mk
- ml
- mn
- mr
- ms
- my
- ne
- nl
- 'no'
- om
- or
- pa
- pl
- ps
- pt
- ro
- ru
- sa
- sd
- si
- sk
- sl
- so
- sq
- sr
- su
- sv
- sw
- ta
- te
- th
- tl
- tr
- ug
- uk
- ur
- uz
- vi
- xh
- yi
- zh
base_model:
- distilbert/distilbert-base-multilingual-cased
---

# Citation Pre-Screening

A multilingual DistilBERT classifier that flags raw citation strings as valid or invalid before downstream parsing.

## Overview

<details>
<summary>Click to expand</summary>
  
- **Model type:** Language Model
- **Architecture:** DistilBERT
- **Language:** Multilingual
- **License:** Apache 2.0
- **Task:** Binary Classification (Citation Pre-Screening)
- **Dataset:** SIRIS-Lab/citation-parser-TYPE
- **Additional Resources:**
  - [GitHub](https://github.com/sirisacademic/citation-parser)
</details>

## Model description

The **Citation Pre-Screening** model is part of the [`Citation Parser`](https://github.com/sirisacademic/citation-parser) package and is fine-tuned to classify citation strings as valid or invalid. Built on **DistilBERT**, it is designed for automated citation-processing workflows and serves as the pre-screening component of the **Citation Parser** tool for citation metadata extraction and validation.

The model was trained on a dataset of citation texts labelled `True` (valid citation) or `False` (invalid citation). The dataset contains 3,599 training samples and 400 test samples, each consisting of a citation string and its label.
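
As a minimal sketch, the data can be loaded and inspected with the `datasets` library. The split names and column names used below (`train`, `test`, `text`, `label`) are assumptions; check the dataset card for `SIRIS-Lab/citation-parser-TYPE` before relying on them.

```python
from datasets import load_dataset

# Assumed dataset id, split names, and column names -- verify against the dataset card.
ds = load_dataset("SIRIS-Lab/citation-parser-TYPE")

print(ds)                    # expected: roughly 3,599 train / 400 test examples
example = ds["train"][0]
print(example["text"])       # raw citation string
print(example["label"])      # True (valid citation) or False (invalid citation)
```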

Fine-tuning started from the **distilbert/distilbert-base-multilingual-cased** checkpoint, so the model can handle multilingual text; evaluation, however, was performed on English citation data.

## Intended Usage

This model is intended to classify raw citation text as either a valid or invalid citation based on the provided input. It is ideal for automating the pre-screening process in citation databases or manuscript workflows.

## How to use

```python
from transformers import pipeline

# Load the model
citation_classifier = pipeline("text-classification", model="sirisacademic/citation-pre-screening")

# Example citation text
citation_text = "MURAKAMI, H等: 'Unique thermal behavior of acrylic PSAs bearing long alkyl side groups and crosslinked by aluminum chelate', 《EUROPEAN POLYMER JOURNAL》"

# Classify the citation
result = citation_classifier(citation_text)
print(result)
```
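
The pipeline returns a list with one `{"label": ..., "score": ...}` dictionary per input. For batch use, a sketch reusing the `citation_classifier` pipeline from above; `truncation=True` keeps long citations within the model's 512-token limit, and the exact label strings returned depend on the model's `id2label` mapping, so they are an assumption here.

```python
# Batch classification sketch; the example citations below are illustrative.
citations = [
    "Smith, J. (2020). Deep learning for citation parsing. Journal of Data Science, 12(3), 45-67.",
    "lorem ipsum dolor sit amet",  # unlikely to be a valid citation
]

results = citation_classifier(citations, truncation=True)
for text, res in zip(citations, results):
    print(f"{res['label']}\t{res['score']:.3f}\t{text[:60]}")
```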

## Training

The model was trained using the **Citation Pre-Screening Dataset** consisting of:

- **Training data**: 3599 samples
- **Test data**: 400 samples

The following hyperparameters were used for training; a minimal reproduction sketch follows the list:

- **Model Path**: `distilbert/distilbert-base-multilingual-cased`
- **Batch Size**: 32
- **Number of Epochs**: 4
- **Learning Rate**: 2e-5
- **Max Sequence Length**: 512
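
A sketch of how a run with these hyperparameters could look using the `transformers` `Trainer`. This is not the authors' training script: the dataset id and column names are the assumptions carried over from the data-inspection sketch above.

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_path = "distilbert/distilbert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path, num_labels=2)

ds = load_dataset("SIRIS-Lab/citation-parser-TYPE")  # assumed dataset id
ds = ds.map(lambda ex: {"label": int(ex["label"])})  # boolean labels -> integers (assumption)
ds = ds.map(lambda ex: tokenizer(ex["text"], truncation=True,
                                 padding="max_length", max_length=512),
            batched=True)

args = TrainingArguments(
    output_dir="citation-pre-screening",
    per_device_train_batch_size=32,  # Batch Size: 32
    num_train_epochs=4,              # Number of Epochs: 4
    learning_rate=2e-5,              # Learning Rate: 2e-5
)

trainer = Trainer(model=model, args=args,
                  train_dataset=ds["train"], eval_dataset=ds["test"])
trainer.train()
```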

## Evaluation Metrics

The model's performance was evaluated on the test set, and the following results were obtained:

| Metric              | Value |
|---------------------|-------|
| **Accuracy**        | 0.95  |
| **Macro avg F1**    | 0.94  |
| **Weighted avg F1** | 0.95  |
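
Metrics of this kind can be reproduced with scikit-learn once predictions are collected. The sketch below reuses `citation_classifier` (usage example) and `ds` (data-inspection sketch); it is not the authors' evaluation code, and matching pipeline label strings to gold labels via `str()` is an assumption.

```python
from sklearn.metrics import accuracy_score, classification_report

texts = list(ds["test"]["text"])
y_true = [str(label) for label in ds["test"]["label"]]
y_pred = [res["label"] for res in citation_classifier(texts, truncation=True)]

print("accuracy:", accuracy_score(y_true, y_pred))
print(classification_report(y_true, y_pred))  # includes macro avg and weighted avg F1
```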

## Additional information

### Authors 

- SIRIS Lab, Research Division of SIRIS Academic.

### License

This work is distributed under an [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0).

### Contact
For further information, send an email to either [[email protected]](mailto:[email protected]) or [[email protected]](mailto:[email protected]).