---
library_name: transformers
license: apache-2.0
language:
- en
metrics:
- accuracy
pipeline_tag: text-classification
---

# Model Card for omykhailiv/bert-fake-news-recognition

<!-- Provide a quick summary of what the model is/does. -->

Predicts whether a news article's title is fake or real.

## Model Details

### Model Description

<!-- Provide a longer summary of what this model is. -->

This model classifies whether the information given in a news article is true or false. It was trained on two datasets, combined and preprocessed. LABEL_0 stands for false (fake) and LABEL_1 stands for true (real).

- **Developed by:** Ostap Mykhailiv
- **Model type:** Text classification
- **Language(s) (NLP):** English
- **License:** apache-2.0
- **Finetuned from model:** google-bert/bert-base-uncased

### Model Usage

This model can be used for whatever purpose you need. A site based on this model is hosted here: (todo)

## Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->

As a BERT model, this one also carries bias. It cannot be considered a state-of-the-art model: it was trained on older data (from about 2022 and earlier), so it should not be relied on as a fake-news checker for recent events such as the military conflicts in Ukraine, Israel, and so on. Please note that the names of people in the data were not preprocessed, so the model may also be biased toward certain names.

### Recommendations

<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->

To get better overall results, I truncated titles during training. Although this improved results for both longer and shorter texts, you should not give the model fewer than 6 or more than 12 words per prediction, excluding stopwords; a length-check sketch follows below. For the preprocessing operations, see the Preprocessing section. News can also be translated from another language into English, though this may not give the expected results.
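
A minimal sketch of such a length check before prediction (the helper name is mine, not part of the model, and it assumes the NLTK stopword list has been downloaded):

```
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

# Hypothetical helper, not part of this repo.
def within_recommended_length(title: str) -> bool:
    # Count non-stopword tokens; the model works best on 6 to 12 of them.
    content_words = [w for w in title.lower().split() if w not in stop_words]
    return 6 <= len(content_words) <= 12
```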

## How to Get Started with the Model

Use the code below to get started with the model.

```
from transformers import pipeline

pipe = pipeline("text-classification", model="omykhailiv/bert-fake-news-recognition")
pipe('Some text')
```

It will return something like this:

```
[{'label': 'LABEL_0', 'score': 0.7248537290096283}]
```

Here 'LABEL_0' means false (fake), and the score is the model's probability for that label.
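
To turn the raw labels into human-readable names, you can apply a small mapping (a sketch; the `label_names` dictionary is mine, only the LABEL_0/LABEL_1 meanings come from this card):

```
from transformers import pipeline

pipe = pipeline("text-classification", model="omykhailiv/bert-fake-news-recognition")

# LABEL_0 = fake (false), LABEL_1 = real (true), per the mapping described above.
label_names = {"LABEL_0": "fake", "LABEL_1": "real"}

result = pipe("Some news title")[0]
print(f"{label_names[result['label']]} (probability {result['score']:.2%})")
```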

### Training Data

<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

- https://huggingface.co/datasets/GonzaloA/fake_news
- https://github.com/GeorgeMcIntire/fake_real_news_dataset
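
The first dataset can be loaded directly from the Hugging Face Hub, for example (a minimal sketch using the `datasets` library):

```
from datasets import load_dataset

# Load the GonzaloA/fake_news dataset used for training.
dataset = load_dataset("GonzaloA/fake_news")
print(dataset)
```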

#### Preprocessing

Preprocessing was done with the following function:

```
import re
import string
import spacy
from nltk.corpus import stopwords

lem = spacy.load('en_core_web_sm')
stop_words = set(stopwords.words('english'))

def testing_data_prep(text):
    """
    Args:
        text (str): The input text string.

    Returns:
        str: The preprocessed text string, or an empty string if the length
        does not meet the specified criteria (6 to 12 words).
    """
    # Convert text to lowercase for case-insensitive processing
    text = str(text).lower()

    # Remove HTML tags and their contents (e.g., "<tag>text</tag>")
    text = re.sub(r'<.*?>+\w+<.*?>', '', text)

    # Remove punctuation using regular expressions and string escaping
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)

    # Remove words containing alphanumeric characters followed by digits
    # (e.g., "model2023", "data10")
    text = re.sub(r'\w*\d\w*', '', text)

    # Remove newline characters
    text = re.sub('\n', '', text)

    # Replace multiple whitespace characters with a single space
    text = re.sub(r'\s+', ' ', text)

    # Lemmatize words (convert them to their base form)
    doc = lem(text)
    words = [word.lemma_ for word in doc]

    # Remove stopwords, such as do, not, as, etc. (https://gist.github.com/sebleier/554280)
    new_filtered_words = [word for word in words if word not in stop_words]

    # Keep only titles of 6 to 12 content words, per the recommendations above
    if 12 >= len(new_filtered_words) >= 6:
        return ' '.join(new_filtered_words)
    return ''
```
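
Illustrative usage (requires the setup above; the exact lemmas may vary across spaCy versions):

```
print(testing_data_prep("Scientists Discover New Species Of Fish In The Pacific Ocean"))
# e.g. 'scientist discover new species fish pacific ocean'
```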

#### Training Hyperparameters

The following hyperparameters were used during training:

- learning_rate: 2e-5
- train_batch_size: 32
- eval_batch_size: 32
- num_epochs: 5
- warmup_steps: 500
- weight_decay: 0.03
- random seed: 42
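
As a sketch, these map onto `transformers.TrainingArguments` roughly as follows (the `output_dir` value is a placeholder; the original training script is not published here):

```
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="bert-fake-news-recognition",  # placeholder path
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    num_train_epochs=5,
    warmup_steps=500,
    weight_decay=0.03,
    seed=42,
)
```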

#### Speeds, Sizes, Times [optional]

<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->

[More Information Needed]

### Testing Data, Metrics

#### Testing Data

<!-- This should link to a Dataset Card if possible. -->

- https://huggingface.co/datasets/GonzaloA/fake_news
- https://github.com/GeorgeMcIntire/fake_real_news_dataset
- https://onlineacademiccommunity.uvic.ca/isot/2022/11/27/fake-news-detection-datasets/
- https://arxiv.org/pdf/1806.00749v1 (dataset download link: https://drive.google.com/file/d/0B3e3qZpPtccsMFo5bk9Ib3VCc2c/view?resourcekey=0-_eqAfKOCKbuE-xFFCmEzyg)

#### Metrics

<!-- These are the evaluation metrics being used, ideally with a description of why. -->

Accuracy
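
As a sketch, accuracy can be computed during evaluation with the `evaluate` library (not necessarily the exact code used for this model):

```
import numpy as np
import evaluate

accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    # Convert logits to predicted class ids, then score against the labels.
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return accuracy.compute(predictions=predictions, references=labels)
```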

### Results

[More Information Needed]

#### Summary

#### Hardware

Tesla T4 GPU, available for free in Google Colab