hlky
posted an update Aug 21
BIG update dropped for bigdata-pw/Flickr - now ~515M images! Target for the next update: 1B

In case you missed them, other recent drops include bigdata-pw/Dinosaurs - a small set of BIG creatures 🦕🦖 - and the first in a series of articles about the art of web scraping! https://huggingface.co/blog/hlky/web-scraping-101 https://huggingface.co/blog/hlky/web-scraping-102

Stay tuned for exciting datasets and models coming soon:
- PC and Console game screenshots
- TV/Film actor biographies and photos (think facial recognition and automatic captioning!)
- bigdata-pw/lyrics-gpt v2
- and more!
deleted
This comment has been hidden

Please refrain from advertising your service on my post, thanks!

Is it possible to use this dataset offline and download in bulk? Or portions of it? It would be so useful to have as a resource for all sorts of projects. Appreciate the upload either way. Thanks.

This article should get you started: https://huggingface.co/blog/hlky/processing-parquets-102

We'll cover more advanced topics like downloading into WebDatasets, which is recommended if you want millions of images, in later articles.
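Until those articles are out, here is a rough sketch of what a WebDataset-style shard looks like on disk: a plain tar file in which files sharing a basename (for example `000000000.jpg` and `000000000.json`) form one sample. This is not the method from the articles, only an illustration using the Python standard library. The `write_shards` helper, the `(key, image_bytes, metadata)` tuple layout and the shard naming are all assumptions made up for this example, and fetching the image bytes is left out.

```python
import io
import json
import tarfile
from pathlib import Path


def write_shards(samples, out_dir, shard_size=1000):
    """Pack (key, image_bytes, metadata) tuples into WebDataset-style tar shards.

    Hypothetical helper for illustration. Keys should not contain dots,
    since loaders typically split the sample key from the extension at
    the first dot of the file name.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    shard_paths = []
    tar = None
    for index, (key, image_bytes, metadata) in enumerate(samples):
        # Start a new shard every `shard_size` samples.
        if index % shard_size == 0:
            if tar is not None:
                tar.close()
            path = out_dir / f"shard-{index // shard_size:06d}.tar"
            shard_paths.append(path)
            tar = tarfile.open(path, "w")
        # Write the image and its metadata side by side under the same key.
        members = (
            ("jpg", image_bytes),
            ("json", json.dumps(metadata).encode("utf-8")),
        )
        for ext, payload in members:
            info = tarfile.TarInfo(name=f"{key}.{ext}")
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    if tar is not None:
        tar.close()
    return shard_paths
```

Because `samples` can be any iterable, a generator that filters rows and downloads images lazily would plug in without holding everything in memory, and the fixed shard size keeps each tar file manageable when the total runs to millions of images.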

If there's any specific kind of filtering you'd like to see covered or anything else just let me know, always happy to help!

Great web scraping tutorials. Will you also write a tutorial about how you handle the artsy data scraping? Thanks.

https://huggingface.co/blog/hlky/web-scraping-201

It's a theoretical lesson on the limitations that apply to some services like Artsy and how to handle them.

Edit: looks like that didn't post properly, I'll have to rewrite it.
