@hlky on Hugging Face: "BIG update dropped for https://huggingface.co/datasets/bigdata-pw/Flickr

hlky

posted an update Aug 21, 2024

BIG update dropped for bigdata-pw/Flickr - now ~515M images! Target for the next update: 1B

In case you missed them, other recent drops include bigdata-pw/Dinosaurs - a small set of BIG creatures 🦕🦖 and the first in a series of articles about the art of web scraping! https://huggingface.co/blog/hlky/web-scraping-101 https://huggingface.co/blog/hlky/web-scraping-102

Stay tuned for exciting datasets and models coming soon:
- PC and Console game screenshots
- TV/Film actor biographies and photos (think facial recognition and automatic captioning!)
- bigdata-pw/lyrics-gpt v2
- and more!

deleted

Aug 22, 2024

This comment has been hidden

hlky

Aug 22, 2024

Please refrain from advertising your service on my post, thanks!

MattHVisual

Aug 23, 2024

Is it possible to use this dataset offline and download in bulk? Or portions of it? It would be so useful to have as a resource for all sorts of projects. Appreciate the upload either way. Thanks.

hlky

Aug 23, 2024

This article should get you started: https://huggingface.co/blog/hlky/processing-parquets-102

We'll cover more advanced topics like downloading into WebDatasets, which is recommended if you want millions of images, in later articles.

If there's any specific kind of filtering you'd like to see covered or anything else just let me know, always happy to help!
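
For anyone wondering what "downloading into WebDatasets" could look like in practice: a WebDataset shard is an ordinary tar archive in which files sharing a basename form one sample. Below is a minimal sketch using only the Python standard library, assuming the image bytes have already been fetched. The helper names `write_shard` and `read_shard`, the `.jpg`/`.json` pairing, and the sample metadata are illustrative assumptions, not taken from the linked article or from the dataset's actual schema.

```python
import io
import json
import tarfile


def write_shard(samples, fileobj):
    """Write (key, image_bytes, metadata) samples into one tar shard.

    Hypothetical helper: files that share a basename (key.jpg, key.json)
    are treated as one sample by WebDataset-style loaders.
    Keys are assumed to contain no dots.
    """
    with tarfile.open(fileobj=fileobj, mode="w") as tar:
        for key, image_bytes, metadata in samples:
            members = (
                (f"{key}.jpg", image_bytes),
                (f"{key}.json", json.dumps(metadata).encode("utf-8")),
            )
            for name, payload in members:
                info = tarfile.TarInfo(name=name)
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))


def read_shard(fileobj):
    """Group tar members by basename back into {key: {ext: bytes}}."""
    samples = {}
    with tarfile.open(fileobj=fileobj, mode="r") as tar:
        for member in tar:
            key, _, ext = member.name.rpartition(".")
            samples.setdefault(key, {})[ext] = tar.extractfile(member).read()
    return samples


if __name__ == "__main__":
    # Placeholder bytes stand in for a downloaded image.
    buffer = io.BytesIO()
    write_shard([("000001", b"placeholder-image-bytes", {"id": 1})], buffer)
    buffer.seek(0)
    print(sorted(read_shard(buffer)["000001"]))
```

Writing to a real file instead of an in-memory buffer only requires passing an open binary file handle; splitting a large download into many such shards keeps each archive a manageable size.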

eijoac

Dec 12, 2024

Great web scraping tutorials. Will you also write a tutorial about how you handle the artsy data scraping? Thanks.

hlky

Dec 12, 2024

edited Dec 14, 2024

https://huggingface.co/blog/hlky/web-scraping-201

It's a theoretical lesson on limitations that apply to some services like Artsy and how to handle them.

Edit: looks like that didn't post properly, I'll have to rewrite it.
