---
license: apache-2.0
language:
- en
pipeline_tag: text-classification
library_name: transformers
---


# yizhao-fin-en-scorer 
## Introduction
This is a BERT model fine-tuned on a high-quality English financial dataset. It assigns a financial relevance score to each piece of text, so financial data of different quality levels can be selected by setting score thresholds. For the complete data cleaning process, please refer to [YiZhao](https://github.com/HITsz-TMG/YiZhao).
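
As a rough illustration of threshold-based filtering, the sketch below keeps only records whose score meets a cutoff. The helper name `filter_by_fin_score`, the threshold values, and the example scores are all hypothetical; they are not taken from the YiZhao pipeline, and the scores were not produced by the model.

```python
def filter_by_fin_score(records, threshold):
    """Keep records whose raw 'fin_score' is at or above the threshold."""
    return [r for r in records if r["fin_score"] >= threshold]


# Illustrative records with made-up scores (not produced by the model).
scored = [
    {"text": "The central bank raised interest rates by 25 basis points.", "fin_score": 4.1},
    {"text": "Shares fell after the company cut its earnings guidance.", "fin_score": 3.4},
    {"text": "The recipe calls for two cups of flour.", "fin_score": 0.2},
]

# A stricter threshold yields a smaller, more finance-focused subset.
high_quality = filter_by_fin_score(scored, threshold=4.0)
broad = filter_by_fin_score(scored, threshold=3.0)

print(len(high_quality), len(broad))  # 1 2
```

A suitable threshold depends on how much data you can afford to discard, so it is worth inspecting a few samples around the cutoff before committing to one.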

To collect training samples, we use the **Qwen-72B** model to annotate small batches of samples extracted from English datasets and score each sample from 0 to 5 based on financial relevance. Because the class distribution of the labeled samples is uneven, we apply undersampling to balance the classes. The final English training dataset contains nearly **50,000** samples. During training, we freeze the embedding layer and encoder layer, and save the model parameters that achieve the best **F1 score**.
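
The exact undersampling procedure is not described here. One common variant, shown below as a minimal sketch, is random undersampling: every class is cut down to the size of the smallest class. The function name `undersample`, the seed, and the toy data are assumptions made for illustration only.

```python
import random
from collections import Counter, defaultdict


def undersample(samples, seed=0):
    """Randomly reduce every class to the size of the smallest class.

    `samples` is a list of (text, label) pairs. This is an assumed
    strategy, not necessarily the one used to train the model.
    """
    by_label = defaultdict(list)
    for text, label in samples:
        by_label[label].append((text, label))

    target = min(len(group) for group in by_label.values())
    rng = random.Random(seed)

    balanced = []
    for label in sorted(by_label):
        balanced.extend(rng.sample(by_label[label], target))
    rng.shuffle(balanced)
    return balanced


# Toy labeled data: 10 samples with score 0, 3 with score 1, 7 with score 5.
toy = [("a", 0)] * 10 + [("b", 1)] * 3 + [("c", 5)] * 7
balanced = undersample(toy)
counts = Counter(label for _, label in balanced)
print(sorted(counts.items()))  # [(0, 3), (1, 3), (5, 3)]
```

Undersampling discards data from the larger classes, which is acceptable when the annotated pool is large enough that the smallest class still provides sufficient examples.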
## Quickstart
Here is an example code snippet for generating financial relevance scores using this model.
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

text = "You are a smart robot"
fin_model_name = "fin-model-en-v0.1"

# Load the tokenizer and the fine-tuned scoring model
fin_tokenizer = AutoTokenizer.from_pretrained(fin_model_name)
fin_model = AutoModelForSequenceClassification.from_pretrained(fin_model_name)

# Tokenize the text and run a forward pass
fin_inputs = fin_tokenizer(text, return_tensors="pt", padding="longest", truncation=True)
fin_outputs = fin_model(**fin_inputs)
# The single output logit is used as the raw relevance score
fin_logits = fin_outputs.logits.squeeze(-1).float().detach().numpy()

fin_score = fin_logits.item()
result = {
    "text": text,
    "fin_score": fin_score,
    "fin_int_score": int(round(max(0, min(fin_score, 5))))
}

print(result)
# {'text': 'You are a smart robot', 'fin_score': 0.3258197605609894, 'fin_int_score': 0}
```
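
The integer score in the snippet above is obtained by clamping the raw score to the range 0 to 5 and then rounding. The sketch below isolates that step in a helper; the name `to_int_score` is hypothetical and only mirrors the expression used for `fin_int_score`.

```python
def to_int_score(fin_score: float) -> int:
    """Clamp a raw score to [0, 5] and round it to the nearest integer."""
    clamped = max(0.0, min(fin_score, 5.0))
    return int(round(clamped))


print(to_int_score(0.3258197605609894))  # 0
print(to_int_score(7.2))                 # 5 (clamped from above)
print(to_int_score(-1.3))                # 0 (clamped from below)
print(to_int_score(2.5))                 # 2 (round half to even)
print(to_int_score(3.5))                 # 4 (round half to even)
```

Note that Python's built-in `round` rounds exact halves to the nearest even integer, so 2.5 maps to 2 while 3.5 maps to 4. If a threshold sits exactly on a class boundary, filtering on the raw `fin_score` avoids this ambiguity.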