Continuing my streak by releasing the Wikireading dataset: a large collection of scraped non-fiction books, predominantly in Russian. its5Q/wikireading
Here are the highlights:
- ~7B tokens, or ~28B characters, making it a great candidate for use in pretraining
- Contains non-fiction works from many knowledge domains
- Includes both the original HTML and extracted text of book chapters
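
As a rough sanity check on the size figures above, ~28B characters over ~7B tokens works out to about 4 characters per token. Here is a minimal Python sketch of that arithmetic, assuming only those two rounded figures. The helper `estimate_tokens` is hypothetical, and the post does not say which tokenizer produced the token count, so treat the ratio as a ballpark:

```python
# Figures quoted in the post (both approximate and rounded).
TOTAL_TOKENS = 7_000_000_000
TOTAL_CHARS = 28_000_000_000

# Average characters per token implied by those two figures.
# Assumption: the real ratio depends on the tokenizer, which the post does not name.
CHARS_PER_TOKEN = TOTAL_CHARS / TOTAL_TOKENS


def estimate_tokens(text: str) -> int:
    """Hypothetical helper: rough token estimate from character count."""
    return round(len(text) / CHARS_PER_TOKEN)


if __name__ == "__main__":
    sample = "Пример текста главы из научно-популярной книги."
    print(f"chars per token: {CHARS_PER_TOKEN:.1f}")
    print(f"estimated tokens in sample: {estimate_tokens(sample)}")
```

This kind of estimate can help when budgeting a pretraining mix from character counts alone, but an actual tokenizer pass is the only reliable measure.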