Querying Hugging Face Datasets with the DuckDB UI

Community Article Published April 3, 2025

Hugging Face hosts a whopping 384k+ datasets that range from a few thousand rows to hundreds of millions. While the browser-based Data Studio (powered by DuckDB WASM) is powerful, exploring very large datasets or running complex queries can sometimes be limited by browser constraints.

This is where the new DuckDB Local UI comes into play! MotherDuck and DuckDB Labs collaborated to bring a local UI to the DuckDB CLI, available starting in DuckDB v1.2.1.

image/png

This is particularly powerful because it leverages your local machine's resources (CPU, RAM), bypassing browser limitations for significantly faster and more complex queries on any Hugging Face Dataset.

Why use the DuckDB UI?

  • Leverages your machine's full CPU cores and available RAM
  • Significantly faster for multi-million row datasets
  • Fully featured UI
    • Column Explorer
    • Schema Viewers
    • Table Summaries
    • Notebook-like cells

My favorite feature is the Column Explorer.

DuckDB UI

Image from DuckDB.org

Getting Started

To launch the UI, simply open your terminal and run:

duckdb --ui

If you haven't installed DuckDB before, it's as easy as:

curl https://install.duckdb.org | sh

or via Homebrew:

brew install duckdb

Launching with duckdb --ui starts DuckDB automatically with an in-memory database.

Connecting to 384k+ Hugging Face Datasets

DuckDB integrates seamlessly with Hugging Face datasets. Here are the two primary ways to connect:

Method 1: Using the hf:// protocol

DuckDB's httpfs extension understands the hf:// protocol, allowing you to query datasets directly. For optimal performance, use the @~parquet suffix in the path. This tells DuckDB to access the efficient Parquet file conversions of the dataset hosted by Hugging Face. Here is a helpful guide to understand how the hf:// protocol works.

DuckDB UI

As an example, here's what the SQL would look like if we wanted to query the glaiveai/reasoning-v1-20m dataset.

select * from 'hf://datasets/glaiveai/reasoning-v1-20m@~parquet/default/train/*.parquet' limit 500
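The path in that query follows a regular pattern: dataset repo, the @~parquet revision, then config and split. As a rough sketch of that anatomy (the helper names hf_parquet_glob and preview_query are hypothetical, and the "default" config and "train" split are assumptions that will not hold for every dataset), here is a small Python snippet that builds the same string:

```python
def hf_parquet_glob(repo_id: str, config: str = "default", split: str = "train") -> str:
    """Build the hf:// glob for a dataset's Parquet conversion.

    repo_id is the 'owner/name' shown on the dataset page. The config and
    split defaults are assumptions; check the dataset's Data Studio for the
    names it actually uses.
    """
    return f"hf://datasets/{repo_id.strip('/')}@~parquet/{config}/{split}/*.parquet"


def preview_query(repo_id: str, limit: int = 500, config: str = "default", split: str = "train") -> str:
    """Return a SELECT statement that previews the first `limit` rows."""
    path = hf_parquet_glob(repo_id, config=config, split=split)
    return f"SELECT * FROM '{path}' LIMIT {int(limit)}"


if __name__ == "__main__":
    # Prints a query you can paste into a DuckDB UI cell.
    print(preview_query("glaiveai/reasoning-v1-20m"))
```

Nothing here talks to DuckDB or the Hub; it only assembles the SQL text, which you then paste into a notebook cell in the UI.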

Method 2: Using the "Copy for DuckDB CLI" button in Data Studio (faster)

The Hugging Face Data Studio provides a handy shortcut:

  1. Navigate to a dataset on the Hugging Face Hub (e.g., facebook/natural_reasoning)
  2. Open the Data Studio
  3. Run any initial query (or just the default LIMIT 10 query)
  4. Click the "Copy for DuckDB CLI" button

This will copy SQL code to your clipboard that first creates a convenient view for the dataset split and includes your query.

The SQL looks like this:

CREATE VIEW train AS (SELECT * FROM read_parquet('hf://datasets/facebook/natural_reasoning@~parquet/default/train/*.parquet'));
-- The SQL console is powered by DuckDB WASM and runs entirely in the browser.
-- Get started by typing a query or selecting a view from the options below.
SELECT * FROM train LIMIT 10

If we copy this into the DuckDB UI and run it, we get something that looks like this!

image/png
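If a dataset has more than one split, you may want one view per split rather than copying each from Data Studio. The sketch below generates CREATE VIEW statements in the same shape as the copied SQL. The function name create_view_sql is hypothetical, the "default" config is an assumption, and CREATE OR REPLACE is used here (instead of the plain CREATE VIEW the button produces) so the statement can be re-run in the same session.

```python
def create_view_sql(repo_id: str, split: str, config: str = "default") -> str:
    """Return a CREATE OR REPLACE VIEW statement for one dataset split.

    The view is named after the split, mirroring what Data Studio copies.
    The config default is an assumption; adjust it to match the dataset.
    """
    path = f"hf://datasets/{repo_id}@~parquet/{config}/{split}/*.parquet"
    return (
        f'CREATE OR REPLACE VIEW "{split}" AS '
        f"(SELECT * FROM read_parquet('{path}'));"
    )


if __name__ == "__main__":
    # "train" is the split used in the article; add others only if the
    # dataset actually provides them.
    for split in ["train"]:
        print(create_view_sql("facebook/natural_reasoning", split))
```

Paste the printed statements into a DuckDB UI cell, then query each view by its split name.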

In summary, the DuckDB Local UI provides a fast, powerful, and feature-rich way to explore Hugging Face datasets directly on your machine. Give it a try!

Have you used DuckDB with HF Datasets? Let us know your experience!
