---
library_name: transformers
license: apache-2.0
language:
- en
metrics:
- accuracy
pipeline_tag: text-classification
---
# Model Card for bert-fake-news-recognition
<!-- Provide a quick summary of what the model is/does. -->
Predicts whether a news article's title is fake or real.
## Model Details
### Model Description
<!-- Provide a longer summary of what this model is. -->
This model classifies whether the information given in a news article's title is true or false. It was trained on two datasets,
combined and preprocessed. 0 (LABEL_0) stands for fake and 1 (LABEL_1) stands for real.
- **Developed by:** Ostap Mykhailiv
- **Model type:** Classification
- **Language(s) (NLP):** English
- **License:** apache-2.0
- **Finetuned from model:** google-bert/bert-base-uncased
### Model Usage
This model can be used for any purpose you need. A demo site based on this model is hosted here: (todo)
## Bias, Risks, and Limitations
<!-- This section is meant to convey both technical and sociotechnical limitations. -->
Like any BERT-based model, this one carries bias. It cannot be considered a state-of-the-art model, because
it was trained on older data (from about 2022 and earlier), so it should not be relied on as a fake-news checker
for recent events such as the military conflicts in Ukraine, Israel, and elsewhere. Note also that the names of people in the data were not
removed during preprocessing, so the model may be biased toward certain names.
### Recommendations
<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
To get better overall results, titles were truncated during training. Although this improved results for both longer and
shorter texts, one should not submit fewer than 6 or more than 12 words for prediction, excluding stopwords. The preprocessing operations are described below.
News can be translated from another language into English, though this may not give the expected results.
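As a quick sanity check before prediction, one could count the non-stopword words in a title. This is a hypothetical sketch (the `within_recommended_length` helper and its simple whitespace tokenization are assumptions, not part of this repo):
```
from nltk.corpus import stopwords

# nltk.download('stopwords')  # run once if the stopwords corpus is missing
stop_words = set(stopwords.words('english'))

def within_recommended_length(title, low=6, high=12):
    # Count words that are not stopwords; the 6-12 range follows the
    # recommendation above. Hypothetical helper, not part of this repo.
    content_words = [w for w in title.lower().split() if w not in stop_words]
    return low <= len(content_words) <= high
```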
## How to Get Started with the Model
Use the code below to get started with the model.
```
from transformers import pipeline

pipe = pipeline("text-classification", model="omykhailiv/bert-fake-news-recognition")
pipe.predict('Some text')
```
It will return something like this:
```
[{'label': 'LABEL_0', 'score': 0.7248537290096283}]
```
Here 'LABEL_0' means fake and 'score' is the predicted probability of that label.
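Since the raw output uses LABEL_0/LABEL_1, a small wrapper can map them to readable names. This is a hypothetical convenience helper (the `classify` function and `LABEL_NAMES` dict are not part of this repo), based on the label semantics documented above:
```
from transformers import pipeline

pipe = pipeline("text-classification", model="omykhailiv/bert-fake-news-recognition")

# LABEL_0 = fake, LABEL_1 = real, per the model description above
LABEL_NAMES = {'LABEL_0': 'fake', 'LABEL_1': 'real'}

def classify(title):
    result = pipe(title)[0]
    return LABEL_NAMES[result['label']], result['score']

print(classify('Some news title'))  # e.g. ('fake', 0.72...)
```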
### Training Data
<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
- https://huggingface.co/datasets/GonzaloA/fake_news
- https://github.com/GeorgeMcIntire/fake_real_news_dataset
#### Preprocessing
Preprocessing was done using the following function:
```
import re
import string

import spacy
from nltk.corpus import stopwords

# nltk.download('stopwords')  # run once if the stopwords corpus is missing
nlp = spacy.load('en_core_web_sm')
stop_words = set(stopwords.words('english'))

def testing_data_prep(text):
    """
    Args:
        text (str): The input text string.

    Returns:
        str: The preprocessed text string, or an empty string if the length
        does not meet the specified criteria (6 to 12 words).
    """
    # Convert text to lowercase for case-insensitive processing
    text = str(text).lower()
    # Remove HTML tags and their contents (e.g., "<tag>text</tag>")
    text = re.sub(r'<.*?>+\w+<.*?>', '', text)
    # Remove punctuation using regular expressions and string escaping
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    # Remove words containing digits (e.g., "model2023", "data10")
    text = re.sub(r'\w*\d\w*', '', text)
    # Remove newline characters
    text = re.sub('\n', '', text)
    # Replace multiple whitespace characters with a single space
    text = re.sub(r'\s+', ' ', text)
    # Lemmatize words (convert them to their base form)
    doc = nlp(text)
    words = [token.lemma_ for token in doc]
    # Remove stopwords, such as do, not, as, etc. (https://gist.github.com/sebleier/554280)
    new_filtered_words = [word for word in words if word not in stop_words]
    # Keep the title only if it falls within the recommended 6-12 word range
    if 6 <= len(new_filtered_words) <= 12:
        return ' '.join(new_filtered_words)
    return ''
```
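At inference time, the same preprocessing can be applied before calling the pipeline. A minimal sketch, assuming `testing_data_prep` from above is in scope:
```
from transformers import pipeline

pipe = pipeline("text-classification", model="omykhailiv/bert-fake-news-recognition")

title = "Some news article title to check"
prepared = testing_data_prep(title)
if prepared:  # an empty string means the title fell outside the 6-12 word range
    print(pipe(prepared))
else:
    print("Title is outside the recommended 6-12 content-word range")
```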
#### Training Hyperparameters
The following hyperparameters were used during training:
- learning_rate: 2e-5
- train_batch_size: 32
- eval_batch_size: 32
- num_epochs: 5
- warmup_steps: 500
- weight_decay: 0.03
- random seed: 42
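In Hugging Face `Trainer` terms, these settings would roughly correspond to the following `TrainingArguments`. This is a hedged reconstruction (the original training script is not published; `output_dir` is an assumption):
```
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir='bert-fake-news-recognition',  # assumed
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    num_train_epochs=5,
    warmup_steps=500,
    weight_decay=0.03,
    seed=42,
)
```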
#### Speeds, Sizes, Times
<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
[More Information Needed]
### Testing Data, Metrics
#### Testing Data
<!-- This should link to a Dataset Card if possible. -->
- https://huggingface.co/datasets/GonzaloA/fake_news
- https://github.com/GeorgeMcIntire/fake_real_news_dataset
- https://onlineacademiccommunity.uvic.ca/isot/2022/11/27/fake-news-detection-datasets/
- https://arxiv.org/pdf/1806.00749v1 (dataset download link: https://drive.google.com/file/d/0B3e3qZpPtccsMFo5bk9Ib3VCc2c/view?resourcekey=0-_eqAfKOCKbuE-xFFCmEzyg)
#### Metrics
<!-- These are the evaluation metrics being used, ideally with a description of why. -->
Accuracy
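Accuracy is the fraction of titles classified correctly. A typical `compute_metrics` hook for the `Trainer` might look like this; a sketch using the `evaluate` library, not the author's confirmed evaluation code:
```
import numpy as np
import evaluate

accuracy = evaluate.load('accuracy')

def compute_metrics(eval_pred):
    # Convert logits to class predictions, then compare against labels
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return accuracy.compute(predictions=predictions, references=labels)
```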
### Results
[More Information Needed]
#### Summary
#### Hardware
Tesla T4 GPU, available for free in Google Colab