|
--- |
|
language: |
|
- en |
|
license: mit |
|
library_name: transformers |
|
tags: |
|
- fake news |
|
metrics: |
|
- accuracy |
|
pipeline_tag: text-classification |
|
--- |
|
|
|
# Model Card for bert-fake-news-recognition
|
|
|
|
Predicts whether a news article's title is fake or real.

This is my first work. If you find the model interesting or useful, please like it; that will encourage me to do more research <3
|
|
|
## Model Details |
|
|
|
### Model Description |
|
|
|
|
This model classifies whether the information given in a news article's title is true or false. It was trained on two datasets that were combined and preprocessed. 0 (LABEL_0) stands for fake and 1 (LABEL_1) stands for real.
|
|
|
- **Developed by:** Ostap Mykhailiv |
|
- **Model type:** Text classification
|
- **Language(s) (NLP):** English |
|
- **License:** MIT |
|
- **Finetuned from model:** google-bert/bert-base-uncased |
|
|
|
|
|
## Bias, Risks, and Limitations |
|
|
|
|
Since this is a BERT-based model, it inherits the biases of its pretraining corpus. Be careful when checking recent claims, as the model was trained on pre-2023 data. Additionally, people's names were not removed from the training data during preprocessing, which may bias predictions for or against certain persons.
|
### Recommendations |
|
|
|
|
To improve overall results, titles were truncated during training. Although this improved performance on both longer and shorter texts, predictions should be made on inputs of no fewer than 6 and no more than 12 words, excluding stopwords. The preprocessing operations are described below. News in other languages can be translated into English first, though this may not give the expected results.
|
|
|
## How to Get Started with the Model |
|
|
|
Use the code below to get started with the model. |
|
```python
from transformers import pipeline

pipe = pipeline("text-classification", model="omykhailiv/bert-fake-news-recognition")
pipe("Some text")
```
|
This will return something like:

[{'label': 'LABEL_0', 'score': 0.7248537290096283}]

where 'LABEL_0' means fake and 'score' is the model's confidence in that label.
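For convenience, here is a small hypothetical helper (not part of the model) that maps the raw labels to the human-readable names used in this card:

```python
# Hypothetical helper: maps raw pipeline labels to this card's convention
# (LABEL_0 = fake, LABEL_1 = real). `pipe` comes from the snippet above.
label_names = {"LABEL_0": "fake", "LABEL_1": "real"}

result = pipe("Some text")[0]
print(f"{label_names[result['label']]} (confidence: {result['score']:.2%})")
```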
|
|
|
### Training Data |
|
|
|
|
https://huggingface.co/datasets/GonzaloA/fake_news |
|
https://github.com/GeorgeMcIntire/fake_real_news_dataset |
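The first dataset can be loaded directly from the Hugging Face Hub; a minimal sketch, assuming the standard `datasets` API:

```python
from datasets import load_dataset

# Loads the GonzaloA/fake_news dataset from the Hugging Face Hub
dataset = load_dataset("GonzaloA/fake_news")
print(dataset)
```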
|
|
|
#### Preprocessing |
|
Preprocessing was done using the function below. Note that the test data reported under Results was not truncated to 12 >= len(new_filtered_words) >= 6, but it was still preprocessed with this function.
|
```python
import re
import string

import spacy
from nltk.corpus import stopwords  # requires nltk.download('stopwords') once

lem = spacy.load('en_core_web_sm')
# Build the stopword set once instead of re-reading it for every word
stop_words = set(stopwords.words('english'))


def testing_data_prep(text):
    """
    Args:
        text (str): The input text string.

    Returns:
        str: The preprocessed text string, or a blank string if the length
        does not meet the specified criteria (6 to 20 words).
    """
    # Convert text to lowercase for case-insensitive processing
    text = str(text).lower()

    # Remove HTML tags and their contents (e.g., "<tag>text</tag>")
    text = re.sub(r'<.*?>+\w+<.*?>', '', text)

    # Remove punctuation
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)

    # Remove words containing digits (e.g., "model2023", "data10")
    text = re.sub(r'\w*\d\w*', '', text)

    # Remove newline characters
    text = re.sub(r'\n', '', text)

    # Collapse multiple whitespace characters into a single space
    text = re.sub(r'\s+', ' ', text)

    # Lemmatize words (convert them to their base form)
    doc = lem(text)
    words = [token.lemma_ for token in doc]

    # Remove stopwords such as "do", "not", "as", etc.
    # (https://gist.github.com/sebleier/554280)
    new_filtered_words = [word for word in words if word not in stop_words]
    if 20 >= len(new_filtered_words) >= 6:
        return ' '.join(new_filtered_words)
    return ' '
```
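For example, one might chain the preprocessing with the pipeline (the headline below is made up for illustration):

```python
# Hypothetical usage: preprocess a headline, then classify it.
title = "Scientists Claim Chocolate Cures All Known Diseases Overnight"
cleaned = testing_data_prep(title)
if cleaned.strip():  # the function returns ' ' when the length check fails
    print(pipe(cleaned))
```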
|
|
|
#### Training Hyperparameters |
|
The following hyperparameters were used during training: |
|
|
|
- learning_rate: 2e-5 |
|
- train_batch_size: 32 |
|
- eval_batch_size: 32 |
|
- num_epochs: 5 |
|
- warmup_steps: 500 |
|
- weight_decay: 0.03 |
|
- random seed: 42 |
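For reference, a minimal sketch of how these values map onto `transformers.TrainingArguments`, assuming the standard `Trainer` API was used (the exact training script is not published):

```python
from transformers import TrainingArguments

# Assumed mapping of the listed hyperparameters; output_dir is a placeholder.
training_args = TrainingArguments(
    output_dir="bert-fake-news-recognition",
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    num_train_epochs=5,
    warmup_steps=500,
    weight_decay=0.03,
    seed=42,
)
```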
|
|
|
### Testing Data, Metrics |
|
|
|
#### Testing Data |
|
|
|
|
|
|
https://huggingface.co/datasets/GonzaloA/fake_news |
|
https://github.com/GeorgeMcIntire/fake_real_news_dataset |
|
https://onlineacademiccommunity.uvic.ca/isot/2022/11/27/fake-news-detection-datasets/ |
|
https://www.kaggle.com/datasets/saurabhshahane/fake-news-classification/data |
|
|
|
|
|
#### Metrics |
|
|
Accuracy, with per-class precision, recall, and F1-score reported in the classification reports below.
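The reports below follow scikit-learn's `classification_report` format. A sketch of how such a report can be produced (an assumed evaluation loop, not the exact script used):

```python
from sklearn.metrics import classification_report

# Assumed evaluation sketch: `pipe` is the pipeline from above;
# `texts` and `y_true` stand in for a preprocessed, labeled test split.
texts = ["first preprocessed headline", "second preprocessed headline"]
y_true = [0, 1]
y_pred = [int(pipe(t)[0]["label"].split("_")[1]) for t in texts]
print(classification_report(y_true, y_pred, digits=2))
```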
|
|
|
### Results |
|
Results on the GonzaloA/fake_news test split:
|
``` |
|
              precision    recall  f1-score   support

           0       0.93      0.94      0.94      3782
           1       0.95      0.94      0.95      4335

    accuracy                           0.94      8117
   macro avg       0.94      0.94      0.94      8117
weighted avg       0.94      0.94      0.94      8117
|
``` |
|
|
|
Results on https://github.com/GeorgeMcIntire/fake_real_news_dataset:
|
``` |
|
              precision    recall  f1-score   support

           0       0.93      0.88      0.90      2297
           1       0.89      0.93      0.91      2297

    accuracy                           0.91      4594
   macro avg       0.91      0.91      0.91      4594
weighted avg       0.91      0.91      0.91      4594
```
|
Results on https://onlineacademiccommunity.uvic.ca/isot/2022/11/27/fake-news-detection-datasets/:
|
``` |
|
              precision    recall  f1-score   support

           0     0.9736    0.9750    0.9743     10455
           1     0.9726    0.9711    0.9718      9541

    accuracy                         0.9731     19996
   macro avg     0.9731    0.9731    0.9731     19996
weighted avg     0.9731    0.9731    0.9731     19996
|
``` |
|
|
|
Results on 1,000 random rows of https://www.kaggle.com/datasets/saurabhshahane/fake-news-classification/data:
|
``` |
|
              precision    recall  f1-score   support

           0       0.87      0.80      0.84       492
           1       0.82      0.89      0.85       508

    accuracy                           0.85      1000
   macro avg       0.85      0.85      0.85      1000
weighted avg       0.85      0.85      0.85      1000
|
``` |
|
#### Hardware |
|
|
|
Trained on a Tesla T4 GPU, available for free in Google Colab.