metadata

library_name: transformers
license: apache-2.0
language:
  - en
metrics:
  - accuracy
pipeline_tag: text-classification

Model Card for omykhailiv/bert-fake-news-recognition

Predicts whether a news article's title is fake or real.

Model Details

Model Description

This model classifies whether the information given in a news article is true or false. It was trained on two datasets that were combined and preprocessed. Label 0 (LABEL_0) stands for false and label 1 (LABEL_1) stands for true.

Developed by: Ostap Mykhailiv
Model type: Classification
Language(s) (NLP): English
License: apache-2.0
Finetuned from model: google-bert/bert-base-uncased

Model Usage

This model can be used for any purpose. A site based on this model is hosted here: (todo)

Bias, Risks, and Limitations

Like any BERT model, this model carries bias. It should not be considered state of the art, because it was trained on older data (roughly 2022 and earlier), so it may not be a reliable fake-news checker for topics such as the military conflicts in Ukraine and Israel. Note that the names of people in the data were not preprocessed, so the model may also be biased toward certain names.

Recommendations

To get better overall results, titles were truncated during training. Although this improved results for both longer and shorter texts, inputs for prediction should contain no fewer than 6 and no more than 12 words, excluding stopwords. The preprocessing operations are listed below. News in another language can be translated into English, though this may not give the expected results.
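The length recommendation above can be checked before a title is sent to the model. The following is a minimal sketch and not part of the original training code: the helper names `count_content_words` and `has_recommended_length` and the tiny `DEMO_STOPWORDS` set are assumptions made for illustration. In practice one would likely pass the NLTK English stop-word list used in the preprocessing.

```python
# Hypothetical helper for the 6-to-12 word recommendation.
# DEMO_STOPWORDS is a small illustrative set, not the NLTK list.
DEMO_STOPWORDS = {"a", "an", "the", "is", "are", "of", "in", "on", "to", "and", "for"}


def count_content_words(title, stop_words=DEMO_STOPWORDS):
    """Count whitespace-separated words that are not stop words."""
    return sum(1 for word in title.lower().split() if word not in stop_words)


def has_recommended_length(title, stop_words=DEMO_STOPWORDS, low=6, high=12):
    """Return True if the title has between `low` and `high` content words."""
    return low <= count_content_words(title, stop_words) <= high


if __name__ == "__main__":
    title = "Scientists discover new species of frog in the Amazon rainforest"
    print(count_content_words(title), has_recommended_length(title))
```

Titles that fail the check can still be classified, but the prediction may be less reliable.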

How to Get Started with the Model

Use the code below to get started with the model.

from transformers import pipeline
pipe = pipeline("text-classification", model="omykhailiv/bert-fake-news-recognition")
pipe.predict('Some text')

It will return something like this:
[{'label': 'LABEL_0', 'score': 0.7248537290096283}]
Here 'LABEL_0' means false, and score is the probability the model assigns to that label.
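To make the raw output easier to read, the label can be mapped to a verdict. This is a minimal sketch: the names `LABEL_NAMES` and `interpret` are hypothetical, and the `p_true` value assumes the score is a softmax probability over exactly two labels, so the two probabilities sum to 1.

```python
# Hypothetical post-processing of one pipeline result.
# Assumes LABEL_0 = false and LABEL_1 = true, as described in this card.
LABEL_NAMES = {"LABEL_0": "false", "LABEL_1": "true"}


def interpret(prediction):
    """Turn {'label': ..., 'score': ...} into a readable verdict."""
    verdict = LABEL_NAMES[prediction["label"]]
    score = prediction["score"]
    # Assumption: two-class softmax, so P(true) = 1 - P(false).
    p_true = score if verdict == "true" else 1.0 - score
    return {"verdict": verdict, "confidence": score, "p_true": p_true}


if __name__ == "__main__":
    example = {"label": "LABEL_0", "score": 0.7248537290096283}
    print(interpret(example))
```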

Training Data

https://huggingface.co/datasets/GonzaloA/fake_news
https://github.com/GeorgeMcIntire/fake_real_news_dataset

Preprocessing

Preprocessing was done with this function:

import re
import string
import spacy
from nltk.corpus import stopwords
lem = spacy.load('en_core_web_sm')
stop_words = set(stopwords.words('english'))
def testing_data_prep(text):
    """
    Args:
        text (str): The input text string.

    Returns:
        str: The preprocessed text string. Predictions are most reliable
             when it contains 6 to 12 words.
    """
    # Convert text to lowercase for case-insensitive processing
    text = str(text).lower()

    # Remove HTML tags and their contents (e.g., "<tag>text</tag>")
    text = re.sub(r'<.*?>+\w+<.*?>', '', text)

    # Remove punctuation using regular expressions and string escaping
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)

    # Remove any word that contains a digit
    # (e.g., "model2023", "data10")
    text = re.sub(r'\w*\d\w*', '', text)

    # Remove newline characters
    text = re.sub('\n', '', text)

    # Replace multiple whitespace characters with a single space
    text = re.sub('\\s+', ' ', text)

    # Lemmatize words (convert them to their base form)
    text = lem(text)
    words = [word.lemma_ for word in text]
    
    # Remove stop words such as "do", "not", "as" (https://gist.github.com/sebleier/554280)
    new_filtered_words = [word for word in words if word not in stop_words]

    # Titles with 6 to 12 remaining words work best; the text is returned either way
    return ' '.join(new_filtered_words)

Training Hyperparameters

The following hyperparameters were used during training:

learning_rate: 2e-5
train_batch_size: 32
eval_batch_size: 32
num_epochs: 5
warmup_steps: 500
weight_decay: 0.03
random seed: 42
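The hyperparameters above could be expressed as follows. This is a sketch under assumptions: the card does not state which training loop was used, so the mapping to Hugging Face `TrainingArguments` names, the helper `build_training_args`, and the output directory name are illustrative only. The batch sizes are assumed to be per device on a single GPU.

```python
# Values taken from the list above; keys follow TrainingArguments naming.
TRAINING_KWARGS = {
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 32,  # assumption: single GPU
    "per_device_eval_batch_size": 32,   # assumption: single GPU
    "num_train_epochs": 5,
    "warmup_steps": 500,
    "weight_decay": 0.03,
    "seed": 42,
}


def build_training_args(output_dir="bert-fake-news-output"):
    """Build TrainingArguments; output_dir is a hypothetical path."""
    # Imported lazily so the configuration can be inspected without transformers.
    from transformers import TrainingArguments

    return TrainingArguments(output_dir=output_dir, **TRAINING_KWARGS)
```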

Speeds, Sizes, Times [optional]

[More Information Needed]

Testing Data, Metrics

Testing Data

https://huggingface.co/datasets/GonzaloA/fake_news
https://github.com/GeorgeMcIntire/fake_real_news_dataset
https://onlineacademiccommunity.uvic.ca/isot/2022/11/27/fake-news-detection-datasets/
https://arxiv.org/pdf/1806.00749v1, the dataset download link: https://drive.google.com/file/d/0B3e3qZpPtccsMFo5bk9Ib3VCc2c/view?resourcekey=0-_eqAfKOCKbuE-xFFCmEzyg

Metrics

Accuracy
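Accuracy is the fraction of test examples whose predicted label matches the true label. A minimal sketch of the computation, with a hypothetical `accuracy` helper and made-up labels (not results from this model):

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one example is required")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


if __name__ == "__main__":
    # Illustrative labels only: 0 = false, 1 = true.
    print(accuracy([0, 1, 1, 0], [0, 1, 0, 0]))
```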

Results

[More Information Needed]

Summary

Hardware

Tesla T4 GPU, available for free in Google Colab