---
license: mit
base_model: bert-base-cased
tags:
- CENIA
- News
metrics:
- accuracy
model-index:
- name: bert-base-cased-finetuned
  results: []
datasets:
- cmunhozc/usa_news_en
language:
- en
pipeline_tag: text-classification
widget:
- text: "Poll: Which COVID-related closure in San Francisco has you the most shook up? || President Trump has pardoned Edward DeBartolo Jr., the former San Francisco 49ers owner convicted in a gambling fraud scandal."
  output:
  - label: RELATED
    score: 0
  - label: UNRELATED
    score: 1
- text: "The first batch of 2020 census data surprised many. A look at what's next || There were some genuine surprises in the first batch of data from the nation’s 2020 head count released this week by the U.S. Census Bureau."
  output:
  - label: RELATED
    score: 1
  - label: UNRELATED
    score: 0
---
# bert-base-cased-finetuned
This model is a fine-tuned version of [bert-base-cased](https://huggingface.co/bert-base-cased) on the [usa_news_en train dataset](https://huggingface.co/datasets/cmunhozc/usa_news_en).
It achieves the following results on the evaluation set:
- Loss: 0.0900
- Accuracy: 0.9800
## Model description
The fine-tuned model is a binary classifier that determines whether two English news headlines are related or unrelated. More details can be found in the paper **News Gathering: Leveraging Transformers to Rank News**. To use the fine-tuned model, follow the steps outlined below:
```python
import numpy as np
import evaluate
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from transformers import Trainer

### 1. Load the model:
model_name = "cmunhozc/news-ranking-ft-bert"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

### 2. Dataset:
def preprocess_fctn(examples):
    # Tokenize each headline pair as a single two-segment input.
    return tokenizer(examples["sentence1"], examples["sentence2"], truncation=True)

...
# `dataset` is the DatasetDict prepared in the elided step above.
encoded_dataset = dataset.map(preprocess_fctn, batched=True, load_from_cache_file=False)
...

### 3. Evaluation:
# Assumption: `metric` is the accuracy metric from the `evaluate` library;
# the original snippet uses `metric` without showing where it is defined.
metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    # Trainer expects a dict of metric names to values.
    return metric.compute(predictions=predictions, references=labels)

trainer_hf = Trainer(model,
                     eval_dataset=encoded_dataset["validation"],
                     tokenizer=tokenizer,
                     compute_metrics=compute_metrics)

trainer_hf.evaluate()
predictions = trainer_hf.predict(encoded_dataset["validation"])
acc_val = metric.compute(predictions=np.argmax(predictions.predictions, axis=1).tolist(),
                         references=predictions.label_ids)["accuracy"]
```
Finally, with the classification model above, you can follow the steps below to generate the news ranking.
- For each news article in the [google_news_en dataset](https://huggingface.co/datasets/cmunhozc/google_news_en) positioned as the first element in a pair, retrieve all corresponding pairs from the dataset.
- Employing pair encoders, rank the news articles that occupy the second position in each pair, determining their relevance to the first article.
- Organize each list generated by the encoders based on the probabilities obtained for the relevance class.
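The ranking steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `rank_candidates`, `make_model_scorer`, and the `related_index` parameter are hypothetical names introduced here, and the default `related_index=1` is an assumption that should be checked against `model.config.id2label`.

```python
from typing import Callable, List, Sequence, Tuple

# A scorer maps (query, candidates) to one relevance probability per candidate.
Scorer = Callable[[str, Sequence[str]], List[float]]


def rank_candidates(query: str, candidates: Sequence[str],
                    score_fn: Scorer) -> List[Tuple[str, float]]:
    """Sort candidates by descending probability of being related to the query."""
    probs = score_fn(query, candidates)
    if len(probs) != len(candidates):
        raise ValueError("score_fn must return one probability per candidate")
    return sorted(zip(candidates, probs), key=lambda pair: pair[1], reverse=True)


def make_model_scorer(model_name: str = "cmunhozc/news-ranking-ft-bert",
                      related_index: int = 1) -> Scorer:
    """Build a scorer backed by the fine-tuned pair classifier.

    Assumption: `related_index` is the logit column of the RELATED class;
    check `model.config.id2label` before relying on the default.
    """
    # Imported lazily so rank_candidates can be used without torch installed.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    model.eval()

    def score(query: str, candidates: Sequence[str]) -> List[float]:
        # Encode each (query, candidate) pair as one two-segment input.
        enc = tokenizer([query] * len(candidates), list(candidates),
                        truncation=True, padding=True, return_tensors="pt")
        with torch.no_grad():
            logits = model(**enc).logits
        # Probability of the assumed RELATED class for every pair.
        return torch.softmax(logits, dim=-1)[:, related_index].tolist()

    return score
```

With a scorer built by `make_model_scorer()`, calling `rank_candidates(first_headline, second_headlines, scorer)` would return the second-position headlines ordered from most to least likely related. Passing the scorer in as a function keeps the sorting logic independent of the model, so it can be exercised with a stub.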
## Intended uses & limitations
More information needed
## Training, evaluation and test data
The training data comes from the *train* split of the [usa_news_en dataset](https://huggingface.co/datasets/cmunhozc/usa_news_en), and the validation data comes from its *validation* split. For testing, the first stage, the text classification model, is evaluated on the *test_1* and *test_2* splits. The ranking model is tested on the [google_news_en dataset](https://huggingface.co/datasets/cmunhozc/google_news_en).
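As a rough sketch, the classifier splits named above could be loaded and inspected like this. The helper names `load_classifier_data` and `split_sizes` are hypothetical, and the sketch assumes the dataset repository loads with its default configuration.

```python
from typing import Dict, Mapping, Sized

# Split names taken from the description above.
CLASSIFIER_SPLITS = ("train", "validation", "test_1", "test_2")


def split_sizes(dataset_dict: Mapping[str, Sized]) -> Dict[str, int]:
    """Return the number of examples in each split."""
    return {name: len(split) for name, split in dataset_dict.items()}


def load_classifier_data(repo_id: str = "cmunhozc/usa_news_en"):
    """Download the headline-pair dataset from the Hugging Face Hub.

    Assumption: the repository needs no explicit configuration name.
    """
    # Imported lazily so split_sizes works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(repo_id)
```

Calling `split_sizes(load_classifier_data())` is a quick way to confirm that all four expected splits are present before tokenizing.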
## Training procedure
### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 1e-05
- train_batch_size: 32
- eval_batch_size: 32
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 3
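
For reference, the hyperparameters listed above map onto `transformers.TrainingArguments` roughly as sketched here. The `build_training_args` helper and the `output_dir` value are hypothetical, and anything not listed above (evaluation schedule, logging, mixed precision) is left at the library defaults, which may differ from the original run.

```python
# Hyperparameters copied from the list above, keyed by TrainingArguments names.
TRAINING_KWARGS = {
    "learning_rate": 1e-5,
    "per_device_train_batch_size": 32,
    "per_device_eval_batch_size": 32,
    "seed": 42,
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "adam_epsilon": 1e-8,
    "lr_scheduler_type": "linear",
    "num_train_epochs": 3,
}


def build_training_args(output_dir: str = "bert-base-cased-finetuned"):
    """Create TrainingArguments from the listed hyperparameters.

    Assumption: `output_dir` is an illustrative path, not the one used
    for the original training run.
    """
    # Imported lazily so the settings dict can be inspected without transformers.
    from transformers import TrainingArguments

    return TrainingArguments(output_dir=output_dir, **TRAINING_KWARGS)
```

The resulting object can be passed to `Trainer(model, args=build_training_args(), ...)` when attempting to reproduce the fine-tuning.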
### Training results
| Training Loss | Epoch | Step | Validation Loss | Accuracy |
|:-------------:|:-----:|:-----:|:---------------:|:--------:|
| 0.0967 | 1.0 | 3526 | 0.0651 | 0.9771 |
| 0.0439 | 2.0 | 7052 | 0.0820 | 0.9776 |
| 0.0231 | 3.0 | 10578 | 0.0900 | 0.9800 |
### Framework versions
- Transformers 4.35.2
- Pytorch 2.1.0+cu121
- Datasets 2.16.1
- Tokenizers 0.15.0