
Model Card for omykhailiv/bert-fake-news-recognition

Predicts whether a news article's title is fake or real. This is my first project; if you find the model interesting or useful, please like it. It will encourage me to do more research <3

Model Details

Model Description

This model classifies whether the information given in a news article is true or false. It was trained on two datasets that were combined and preprocessed. Label 0 (LABEL_0) stands for false and label 1 (LABEL_1) stands for true.

  • Developed by: Ostap Mykhailiv
  • Model type: Classification
  • Language(s) (NLP): English
  • License: MIT
  • Finetuned from model: google-bert/bert-base-uncased

Bias, Risks, and Limitations

Since it is a BERT model, it inherits the biases of BERT. Be careful when checking claims about recent events, since the model was trained on pre-2023 data. Additionally, people's names were not preprocessed in the training data, which might bias the model towards certain persons.

Recommendations

To get better overall results, I truncated titles during training. Although this improved results for both longer and shorter texts, inputs for prediction should contain no fewer than 6 and no more than 12 words, excluding stopwords. The preprocessing steps are described below. News in other languages can be translated into English first, though this may not give the expected results.
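The 6 to 12 word guideline can be checked before calling the model. This is a minimal sketch, assuming a plain whitespace split and a tiny hypothetical stop-word set (`EXAMPLE_STOP_WORDS`) chosen only for illustration. The preprocessing in this card uses NLTK's English stop-word list after lemmatization, so real counts may differ slightly.

```python
# Hypothetical, deliberately tiny stop-word set used only for illustration.
# The preprocessing in this card uses NLTK's English list instead.
EXAMPLE_STOP_WORDS = {"a", "an", "the", "of", "in", "on", "to", "is", "and", "for"}


def content_word_count(title, stop_words=EXAMPLE_STOP_WORDS):
    """Count whitespace-separated words that are not stop words."""
    return sum(1 for word in title.lower().split() if word not in stop_words)


def is_recommended_length(title, min_words=6, max_words=12,
                          stop_words=EXAMPLE_STOP_WORDS):
    """Return True if the title has between min_words and max_words content words."""
    return min_words <= content_word_count(title, stop_words) <= max_words


title = "Scientists discover a new species of frog in the Amazon rainforest"
print(content_word_count(title), is_recommended_length(title))
```

Titles that fail the check can still be passed to the model, but predictions for them may be less reliable.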

How to Get Started with the Model

Use the code below to get started with the model.

from transformers import pipeline
pipe = pipeline("text-classification", model="omykhailiv/bert-fake-news-recognition")
pipe.predict('Some text')

It will return something like this: [{'label': 'LABEL_0', 'score': 0.7248537290096283}], where 'LABEL_0' means false and 'score' is the probability the model assigns to that label.
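The raw labels can be mapped to readable names after prediction. The sketch below is not part of the model's API; `LABEL_NAMES` and `describe_prediction` are hypothetical helpers, and the input is the example output quoted in this section rather than a live model call.

```python
# Label meanings as stated in this model card: LABEL_0 = false, LABEL_1 = true.
LABEL_NAMES = {"LABEL_0": "fake", "LABEL_1": "real"}


def describe_prediction(prediction):
    """Turn one pipeline result dict into a readable (name, score) pair."""
    name = LABEL_NAMES.get(prediction["label"], prediction["label"])
    return name, round(prediction["score"], 4)


# Example output copied from this section; no model call is made here.
result = [{"label": "LABEL_0", "score": 0.7248537290096283}]
print(describe_prediction(result[0]))
```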

Training Data

  • https://huggingface.co/datasets/GonzaloA/fake_news
  • https://github.com/GeorgeMcIntire/fake_real_news_dataset

Preprocessing

Preprocessing was done with the function below. Note that the data used for the tests below was not restricted to 12 >= len(new_filtered_words) >= 6 (the function accepts 6 to 20 words instead), but it was still preprocessed.

import re
import string
import spacy
from nltk.corpus import stopwords
lem = spacy.load('en_core_web_sm')
def testing_data_prep(text):
    """
    Args:
        text (str): The input text string.

    Returns:
        str: The preprocessed text string, or a single space (' ') if the
             number of remaining words is outside the 6 to 20 range.
    """
    # Convert text to lowercase for case-insensitive processing
    text = str(text).lower()

    # Remove HTML tags and their contents (e.g., "<tag>text</tag>")
    text = re.sub(r'<.*?>+\w+<.*?>', '', text)

    # Remove punctuation using regular expressions and string escaping
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)

    # Remove any word that contains a digit
    # (e.g., "model2023", "data10")
    text = re.sub(r'\w*\d\w*', '', text)

    # Remove newline characters
    text = re.sub('\n', '', text)

    # Replace multiple whitespace characters with a single space
    text = re.sub('\\s+', ' ', text)

    # Lemmatize words (convert them to their base form)
    text = lem(text)
    words = [word.lemma_ for word in text]
    
    # Remove stopwords, such as do, not, as, etc. (https://gist.github.com/sebleier/554280)
    # Build the stop-word set once instead of reloading the list for every word
    stop_words = set(stopwords.words('english'))
    new_filtered_words = [word for word in words if word not in stop_words]
    if 20 >= len(new_filtered_words) >= 6:
        return ' '.join(new_filtered_words)
    return ' '
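
Because the function returns a single space for titles outside the accepted length range, a caller probably wants to skip those inputs rather than classify them. This is a minimal sketch of that check, assuming the preprocessing function and the classifier are passed in as callables. `classify_title`, `fake_preprocess` and `fake_classifier` are hypothetical names, and the two stand-ins exist only so the example runs without spaCy, NLTK or the model weights.

```python
def classify_title(title, preprocess, classifier):
    """Preprocess a title and classify it, or return None if it was filtered out.

    preprocess: callable such as testing_data_prep (returns ' ' when rejected).
    classifier: callable such as the text-classification pipeline.
    """
    cleaned = preprocess(title)
    if not cleaned.strip():
        return None  # Title fell outside the accepted length range.
    return classifier(cleaned)[0]


# Stand-ins so the sketch runs without spaCy, NLTK, or the model weights.
def fake_preprocess(text):
    words = text.lower().split()
    return " ".join(words) if 6 <= len(words) <= 20 else " "


def fake_classifier(text):
    return [{"label": "LABEL_1", "score": 0.9}]


print(classify_title("too short", fake_preprocess, fake_classifier))
print(classify_title("one two three four five six seven",
                     fake_preprocess, fake_classifier))
```

In real use, `testing_data_prep` and the `pipe` object from the getting-started snippet would replace the two stand-ins.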

Training Hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 2e-5
  • train_batch_size: 32
  • eval_batch_size: 32
  • num_epochs: 5
  • warmup_steps: 500
  • weight_decay: 0.03
  • random seed: 42
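
For readers who want to reproduce a similar setup, the values above map onto keyword arguments of `transformers.TrainingArguments`. This is a sketch under assumptions: the card does not say which training loop was used, the batch sizes are assumed to be per device (a single GPU), and `output_dir` is a hypothetical placeholder path.

```python
# Keyword names follow transformers.TrainingArguments; values come from the list above.
# Assumption: batch sizes are per device, since training ran on a single GPU.
TRAINING_KWARGS = {
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 32,
    "per_device_eval_batch_size": 32,
    "num_train_epochs": 5,
    "warmup_steps": 500,
    "weight_decay": 0.03,
    "seed": 42,
}


def build_training_arguments(output_dir="bert-fake-news-output"):
    """Create TrainingArguments; output_dir is a hypothetical placeholder path."""
    # Imported lazily so the dictionary above can be inspected without transformers.
    from transformers import TrainingArguments
    return TrainingArguments(output_dir=output_dir, **TRAINING_KWARGS)


print(TRAINING_KWARGS)
```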

Testing Data, Metrics

Testing Data

  • https://huggingface.co/datasets/GonzaloA/fake_news
  • https://github.com/GeorgeMcIntire/fake_real_news_dataset
  • https://onlineacademiccommunity.uvic.ca/isot/2022/11/27/fake-news-detection-datasets/
  • https://www.kaggle.com/datasets/saurabhshahane/fake-news-classification/data

Metrics

Accuracy, reported together with per-class precision, recall, and F1-score.
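
As a reminder of how these metrics are defined, here is a small pure-Python sketch. It is not the evaluation code used for this card, and the labels in `y_true` and `y_pred` are made up for illustration and do not come from any of the datasets listed here.

```python
def per_class_metrics(y_true, y_pred, label):
    """Return (precision, recall, f1) for one class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Made-up labels for illustration only (0 = false, 1 = true).
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 0, 1]

print(accuracy(y_true, y_pred))
print(per_class_metrics(y_true, y_pred, label=1))
```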

Results

Tested on the test split of the GonzaloA/fake_news dataset:

              precision    recall  f1-score   support

           0       0.93      0.94      0.94      3782
           1       0.95      0.94      0.95      4335

    accuracy                           0.94      8117
   macro avg       0.94      0.94      0.94      8117
weighted avg       0.94      0.94      0.94      8117

Tested on https://github.com/GeorgeMcIntire/fake_real_news_dataset:

               precision    recall  f1-score   support

           0       0.93      0.88      0.90      2297
           1       0.89      0.93      0.91      2297

    accuracy                           0.91      4594
   macro avg       0.91      0.91      0.91      4594
weighted avg       0.91      0.91      0.91      4594

Tested on https://onlineacademiccommunity.uvic.ca/isot/2022/11/27/fake-news-detection-datasets/:

              precision    recall  f1-score   support

           0     0.9736    0.9750    0.9743     10455
           1     0.9726    0.9711    0.9718      9541

    accuracy                         0.9731     19996
   macro avg     0.9731    0.9731    0.9731     19996
weighted avg     0.9731    0.9731    0.9731     19996

Tested on 1,000 random rows of https://www.kaggle.com/datasets/saurabhshahane/fake-news-classification/data:

              precision    recall  f1-score   support

           0       0.87      0.80      0.84       492
           1       0.82      0.89      0.85       508

    accuracy                           0.85      1000
   macro avg       0.85      0.85      0.85      1000
weighted avg       0.85      0.85      0.85      1000

Hardware

Tesla T4 GPU, available for free in Google Colab
