

	.. _rcv1_dataset:

	RCV1 dataset
	------------

	Reuters Corpus Volume I (RCV1) is an archive of over 800,000 manually
	categorized newswire stories made available by Reuters, Ltd. for research
	purposes. The dataset is extensively described in [1]_.

	Data Set Characteristics:

	==============  =====================
	Classes         103
	Samples total   804414
	Dimensionality  47236
	Features        real, between 0 and 1
	==============  =====================

	:func:`sklearn.datasets.fetch_rcv1` will load the following
	version: RCV1-v2, vectors, full sets, topics multilabels::

	    >>> from sklearn.datasets import fetch_rcv1
	    >>> rcv1 = fetch_rcv1()

	It returns a dictionary-like object, with the following attributes:

	``data``:
	    The feature matrix is a scipy CSR sparse matrix with 804414 samples and
	    47236 features. The nonzero values are cosine-normalized, log TF-IDF weights.
	    A nearly chronological split is proposed in [1]_: the first 23149 samples are
	    the training set and the last 781265 samples are the testing set. This follows
	    the official LYRL2004 chronological split. About 0.16% of the matrix entries
	    are nonzero::

	        >>> rcv1.data.shape
	        (804414, 47236)
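
	The row-wise split and the sparsity figure described above can be reproduced
	with ordinary CSR operations. The following is a minimal sketch on a small
	random stand-in matrix, not on the real download; ``split_train_test``,
	``density`` and the toy shapes are hypothetical names and values chosen for
	illustration, and only the default of 23149 training rows is taken from the
	text above.

	.. code-block:: python

	    import numpy as np
	    import scipy.sparse as sp


	    def split_train_test(data, n_train=23149):
	        """Split row-wise: the first n_train rows are train, the rest test."""
	        return data[:n_train], data[n_train:]


	    def density(matrix):
	        """Fraction of entries that are stored as nonzero."""
	        n_rows, n_cols = matrix.shape
	        return matrix.nnz / (n_rows * n_cols)


	    # Toy stand-in for rcv1.data: 10 samples, 50 features, roughly 10% nonzero.
	    rng = np.random.default_rng(0)
	    dense = rng.random((10, 50)) * (rng.random((10, 50)) < 0.1)
	    toy = sp.csr_matrix(dense)

	    train, test = split_train_test(toy, n_train=3)
	    print(train.shape, test.shape)  # prints (3, 50) (7, 50)
	    print(f"density: {density(toy):.2%}")

	Slicing a CSR matrix by rows keeps the result sparse, so the same two
	functions should apply unchanged to the full ``rcv1.data`` matrix.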

	``target``:
	    The target values are stored in a scipy CSR sparse matrix with 804414 samples
	    and 103 categories. Each sample has a value of 1 in its categories and 0 in
	    all others. About 3.15% of the matrix entries are nonzero::

	        >>> rcv1.target.shape
	        (804414, 103)
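
	Because the target is a multilabel indicator matrix, common queries reduce to
	row and column operations on the CSR structure. The sketch below uses a
	hand-written 4 x 5 toy matrix as a stand-in for ``rcv1.target``; the variable
	names and the toy values are assumptions made for the example.

	.. code-block:: python

	    import numpy as np
	    import scipy.sparse as sp

	    # Toy stand-in for rcv1.target: 4 samples, 5 categories.
	    toy_target = sp.csr_matrix(
	        np.array(
	            [
	                [1, 0, 0, 1, 0],
	                [0, 1, 0, 0, 0],
	                [1, 1, 1, 0, 0],
	                [0, 0, 0, 0, 1],
	            ],
	            dtype=np.uint8,
	        )
	    )

	    # Number of categories assigned to each sample (stored entries per row).
	    labels_per_sample = toy_target.getnnz(axis=1)

	    # Column indices of the categories assigned to the sample in row 2.
	    categories_of_sample_2 = toy_target[2].indices

	    # Dense 0/1 vector for one category, usable as a binary target.
	    is_category_0 = toy_target[:, 0].toarray().ravel()

	    print(labels_per_sample.tolist())  # prints [2, 1, 3, 1]
	    print(is_category_0.tolist())  # prints [1, 0, 1, 0]

	Extracting a single column in this way is one simple route from the multilabel
	target to a one-vs-rest binary problem.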

	``sample_id``:
	    Each sample can be identified by its ID, ranging (with gaps) from 2286
	    to 810596::

	        >>> rcv1.sample_id[:3]
	        array([2286, 2287, 2288], dtype=uint32)
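
	Since the IDs have gaps, an ID cannot be turned into a row position by simple
	subtraction. A minimal sketch of one way to look rows up by ID is shown below;
	the toy ID array (including where its gaps fall) and the name ``row_of_id``
	are invented for the example.

	.. code-block:: python

	    import numpy as np

	    # Toy stand-in for rcv1.sample_id: increasing IDs with gaps.
	    toy_sample_id = np.array([2286, 2287, 2288, 2290, 2295], dtype=np.uint32)

	    # Map each ID to its row position; this does not rely on the IDs being
	    # sorted.
	    row_of_id = {int(doc_id): row for row, doc_id in enumerate(toy_sample_id)}

	    print(row_of_id[2290])  # prints 3
	    print(2289 in row_of_id)  # prints False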

	``target_names``:
	    The target values are the topics of each sample. Each sample belongs to at
	    least one topic and at most 17 topics. There are 103 topics, each
	    represented by a string. Their corpus frequencies span five orders of
	    magnitude, from 5 occurrences for 'GMIL' to 381327 for 'CCAT'::

	        >>> rcv1.target_names[:3].tolist() # doctest: +SKIP
	        ['E11', 'ECAT', 'M11']
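
	Per-topic corpus frequencies such as those quoted above are the column sums of
	the target matrix, paired with the topic names. The sketch below shows the
	computation on a toy 4 x 3 matrix; the toy assignments are invented, and the
	three names are reused from the doctest output purely as labels.

	.. code-block:: python

	    import numpy as np
	    import scipy.sparse as sp

	    # Toy stand-ins for rcv1.target and rcv1.target_names.
	    toy_target = sp.csr_matrix(
	        np.array(
	            [
	                [1, 1, 0],
	                [0, 1, 0],
	                [0, 1, 1],
	                [1, 1, 0],
	            ],
	            dtype=np.uint8,
	        )
	    )
	    toy_target_names = np.array(["E11", "ECAT", "M11"], dtype=object)

	    # Column sums give the number of samples tagged with each topic.
	    counts = np.asarray(toy_target.sum(axis=0)).ravel()
	    frequency = dict(zip(toy_target_names.tolist(), counts.tolist()))

	    # Column index of a topic, looked up by its name.
	    ecat_column = int(np.flatnonzero(toy_target_names == "ECAT")[0])

	    print(frequency)  # prints {'E11': 2, 'ECAT': 4, 'M11': 1}
	    print(ecat_column)  # prints 1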

	The dataset will be downloaded from the `rcv1 homepage`_ if necessary.
	The compressed size is about 656 MB.

	.. _rcv1 homepage: http://jmlr.csail.mit.edu/papers/volume5/lewis04a/


	.. rubric:: References

	.. [1] Lewis, D. D., Yang, Y., Rose, T. G., & Li, F. (2004).
	       RCV1: A new benchmark collection for text categorization research.
	       The Journal of Machine Learning Research, 5, 361-397.