File size: 4,719 Bytes
cdd0bf3
10f1826
cdd0bf3
 
dbcdaa9
 
cdd0bf3
 
 
 
 
10f1826
 
 
 
092602d
ffe4e01
 
f80aa8e
ffe4e01
ba6b2bf
7a2f56c
f80aa8e
 
 
de6a727
 
f80aa8e
de6a727
f80aa8e
cdd0bf3
 
 
 
 
 
 
82b5221
cdd0bf3
 
 
 
 
9d10ed9
82b5221
 
7285db3
82b5221
7285db3
 
 
 
c003d50
7285db3
 
 
 
82b5221
 
7285db3
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
d216795
7285db3
 
cdd0bf3
 
 
 
 
7285db3
cdd0bf3
d216795
cdd0bf3
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
10f1826
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
---
license: mit
base_model: bert-base-cased
tags:
- CENIA
- News
metrics:
- accuracy
model-index:
- name: bert-base-cased-finetuned
  results: []
datasets:
- cmunhozc/usa_news_en
language:
- en
pipeline_tag: text-classification

widget:
  - text: "Poll: Which COVID-related closure in San Francisco has you the most shook up? || President Trump has pardoned Edward DeBartolo Jr., the former San Francisco 49ers owner convicted in a gambling fraud scandal."
    output:
      - label: RELATED
        score: 0
      - label: UNRELATED
        score: 1     
  - text: "The first batch of 2020 census data surprised many. A look at what's next || There were some genuine surprises in the first batch of data from the nation’s 2020 head count released this week by the U.S. Census Bureau."
    output:
      - label: RELATED
        score: 1
      - label: UNRELATED
        score: 0     
---

<!-- This model card has been generated automatically according to the information the Trainer had access to. You
should probably proofread and complete it, then remove this comment. -->

# bert-base-cased-finetuned

This model is a fine-tuned version of [bert-base-cased](https://huggingface.co/bert-base-cased) on the [usa_news_en train dataset](https://huggingface.co/datasets/cmunhozc/usa_news_en).
It achieves the following results on the evaluation set:
- Loss: 0.0900
- Accuracy: 0.9800

## Model description
 
The fine-tuned model corresponds to a binary classification model that determines whether two English news headlines are related or not related. In the following paper **{News Gathering: Leveraging Transformers to
Rank News}** it can find more details. To utilize the fine-tuned model, you can follow the steps outlined below:

```python 
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from transformers import Trainer

### 1. Load the model:
model_name = "cmunhozc/news-ranking-ft-bert"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

### 2. Dataset:
def preprocess_fctn(examples):
  return tokenizer(examples["sentence1"], examples["sentence2"], truncation=True)
...
encoded_dataset = dataset.map(preprocess_fctn, batched=True, load_from_cache_file=False)
...

### 3. Evaluation:
def compute_metrics(eval_pred):
  predictions, labels = eval_pred
  predictions = np.argmax(predictions, axis=1)

trainer_hf = Trainer(model,
                     eval_dataset    = encoded_dataset['validation'],
                     tokenizer       = tokenizer,
                     compute_metrics = compute_metrics)

trainer_hf.evaluate()

predictions = trainer_hf.predict(encoded_dataset["validation"])
acc_val  = metric.compute(predictions=np.argmax(predictions.predictions,axis=1).tolist(), references=predictions.label_ids)['accuracy']
```
Finally, with the classification above model, you can follow the steps below to generate the news ranking.
- For each news article in the [google_news_en dataset](https://huggingface.co/datasets/cmunhozc/google_news_en) dataset positioned as the first element in a pair, retrieve all corresponding pairs from the dataset.
- Employing pair encoders, rank the news articles that occupy the second position in each pair, determining their relevance to the first article.
- Organize each list generated by the encoders based on the probabilities obtained for the relevance class.

## Intended uses & limitations

More information needed

## Training, evaluation and test data

The training data is sourced from the *train* split in [usa_news_en dataset](https://huggingface.co/datasets/cmunhozc/usa_news_en), and a similar procedure is applied for the *validation* set. In the case of testing, the initial segment for the text classification model is derived from the *test_1* and *test_2* splits. As for the ranking model, the test dataset from [google_news_en dataset](https://huggingface.co/datasets/cmunhozc/google_news_en) is utilized

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 1e-05
- train_batch_size: 32
- eval_batch_size: 32
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 3

### Training results

| Training Loss | Epoch | Step  | Validation Loss | Accuracy |
|:-------------:|:-----:|:-----:|:---------------:|:--------:|
| 0.0967        | 1.0   | 3526  | 0.0651          | 0.9771   |
| 0.0439        | 2.0   | 7052  | 0.0820          | 0.9776   |
| 0.0231        | 3.0   | 10578 | 0.0900          | 0.9800   |


### Framework versions

- Transformers 4.35.2
- Pytorch 2.1.0+cu121
- Datasets 2.16.1
- Tokenizers 0.15.0