PyArrow
Arrow is a columnar format and a toolbox for fast data interchange and in-memory analytics.
Since PyArrow supports fsspec for reading and writing remote data, you can use Hugging Face paths (hf://) to read and write data on the Hub.
This is especially useful for Parquet data, since Parquet is the most common file format on Hugging Face. Parquet is particularly efficient thanks to its columnar structure, typing, metadata and compression.
Load a Table
You can load data from local files or from remote storage like Hugging Face Datasets. PyArrow supports many formats, including CSV, JSON and, most importantly, Parquet:
>>> import pyarrow.parquet as pq
>>> table = pq.read_table("path/to/data.parquet")
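Other formats have their own reader functions; for example, a CSV file can be loaded with pyarrow.csv (the file path here is just a placeholder):

>>> import pyarrow.csv as pv
>>> table = pv.read_csv("path/to/data.csv")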
To load a file from Hugging Face, the path needs to start with hf://. For example, the path to the stanfordnlp/imdb dataset repository is hf://datasets/stanfordnlp/imdb. This dataset contains multiple Parquet files. The Parquet file format is designed to make reading and writing data frames efficient, and to make sharing data across data analysis languages easy. Here is how to load the file plain_text/train-00000-of-00001.parquet as a pyarrow Table (it requires pyarrow>=21.0):
>>> import pyarrow.parquet as pq
>>> table = pq.read_table("hf://datasets/stanfordnlp/imdb/plain_text/train-00000-of-00001.parquet")
>>> table
pyarrow.Table
text: string
label: int64
----
text: [["I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it (... 1542 chars omitted)", ...],...,[..., "The story centers around Barry McKenzie who must go to England if he wishes to claim his inheritan (... 221 chars omitted)"]]
label: [[0,0,0,0,0,...,0,0,0,0,0],...,[1,1,1,1,1,...,1,1,1,1,1]]
If you don't want to load the full Parquet data, you can read the Parquet metadata or load it one row group at a time instead:
>>> import pyarrow.parquet as pq
>>> pf = pq.ParquetFile("hf://datasets/stanfordnlp/imdb/plain_text/train-00000-of-00001.parquet")
>>> pf.metadata
<pyarrow._parquet.FileMetaData object at 0x1171b4090>
created_by: parquet-cpp-arrow version 12.0.0
num_columns: 2
num_rows: 25000
num_row_groups: 25
format_version: 2.6
serialized_size: 62036
>>> for i in range(pf.num_row_groups):
...     table = pf.read_row_group(i)
...     ...
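You can also read only the columns you need; since Parquet stores each column separately, this generally avoids fetching the rest of the data. A minimal example using the "label" column shown above:

>>> table = pq.read_table(
...     "hf://datasets/stanfordnlp/imdb/plain_text/train-00000-of-00001.parquet",
...     columns=["label"],
... )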
For more information on the Hugging Face paths and how they are implemented, please refer to the client library's documentation on the HfFileSystem.
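For example, instead of using the hf:// prefix, you can pass an HfFileSystem instance explicitly, which is handy if you want to configure it (e.g. with a specific token). A minimal sketch, assuming huggingface_hub is installed:

>>> import pyarrow.parquet as pq
>>> from huggingface_hub import HfFileSystem
>>> fs = HfFileSystem()  # optionally HfFileSystem(token=...)
>>> table = pq.read_table(
...     "datasets/stanfordnlp/imdb/plain_text/train-00000-of-00001.parquet",
...     filesystem=fs,
... )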
Save a Table
You can save a pyarrow Table using pyarrow.parquet.write_table, either to a local file or directly to Hugging Face.
To save the Table on Hugging Face, you first need to log in to your Hugging Face account, for example using:
hf auth login
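Alternatively, you can log in from Python (this assumes huggingface_hub is installed):

from huggingface_hub import login

login()  # prompts for an access token; you can also pass token=...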
Then you can create a dataset repository, for example using:
from huggingface_hub import HfApi
HfApi().create_repo(repo_id="username/my_dataset", repo_type="dataset")
Finally, you can use Hugging Face paths in PyArrow:
import pyarrow.parquet as pq
pq.write_table(table, "hf://datasets/username/my_dataset/imdb.parquet", use_content_defined_chunking=True)
# or write in separate files if the dataset has train/validation/test splits
pq.write_table(table_train, "hf://datasets/username/my_dataset/train.parquet", use_content_defined_chunking=True)
pq.write_table(table_valid, "hf://datasets/username/my_dataset/validation.parquet", use_content_defined_chunking=True)
pq.write_table(table_test, "hf://datasets/username/my_dataset/test.parquet", use_content_defined_chunking=True)
We use use_content_defined_chunking=True to enable faster uploads and downloads from Hugging Face thanks to Xet deduplication (it requires pyarrow>=21.0).
Content-defined chunking (CDC) makes the Parquet writer split data pages so that duplicate data is chunked and compressed identically. Without CDC, pages are chunked arbitrarily, so duplicate data becomes impossible to detect once compressed. Thanks to CDC, Parquet uploads and downloads from Hugging Face are faster, since duplicate data is uploaded or downloaded only once.
Find more information about Xet here.
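Once uploaded, the data can be loaded back from the same Hugging Face paths, for example:

import pyarrow.parquet as pq

table = pq.read_table("hf://datasets/username/my_dataset/train.parquet")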
Use Images
You can load a folder with a metadata file containing a field for the names or paths to the images, structured like this:
Example 1:                    Example 2:
folder/                       folder/
├── metadata.parquet          ├── metadata.parquet
├── img000.png                └── images
├── img001.png                    ├── img000.png
...                               ...
└── imgNNN.png                    └── imgNNN.png
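If you don't have a metadata file yet, you can build one with PyArrow. Here is a minimal sketch, assuming a flat folder of PNG files as in Example 1:

from pathlib import Path

import pyarrow as pa
import pyarrow.parquet as pq

folder_path = Path("path/to/folder")
file_names = sorted(path.name for path in folder_path.glob("*.png"))
metadata_table = pa.table({"file_name": file_names})
pq.write_table(metadata_table, folder_path / "metadata.parquet")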
You can iterate over the image paths like this:
from pathlib import Path

import pyarrow.parquet as pq

folder_path = Path("path/to/folder")
table = pq.read_table(folder_path / "metadata.parquet")
for file_name in table["file_name"].to_pylist():
    image_path = folder_path / file_name
    ...
Since the dataset is in a supported structure (a metadata.parquet file with a file_name field), you can save it to Hugging Face, and the Dataset Viewer shows both the metadata and the images.
from huggingface_hub import HfApi

api = HfApi()
api.upload_folder(
    folder_path=folder_path,
    repo_id="username/my_image_dataset",
    repo_type="dataset",
)
Embed Images inside Parquet
PyArrow has a binary type which makes it possible to store the image bytes in Arrow tables. This allows saving the dataset as a single Parquet file containing both the images (bytes and path) and the sample metadata:
import json

import pyarrow as pa
import pyarrow.parquet as pq

# Embed the image bytes in Arrow
image_array = pa.array([
    {
        "bytes": (folder_path / file_name).read_bytes(),
        "path": file_name,
    }
    for file_name in table["file_name"].to_pylist()
])
table = table.append_column("image", image_array)

# (Optional) Set the HF Image type for the Dataset Viewer and the `datasets` library
features = {"image": {"_type": "Image"}}  # or using datasets.Features(...).to_dict()
schema_metadata = {"huggingface": json.dumps({"dataset_info": {"features": features}})}
table = table.replace_schema_metadata(schema_metadata)

# Save to Parquet
# (Optional) with use_content_defined_chunking for faster uploads and downloads
pq.write_table(table, "data.parquet", use_content_defined_chunking=True)
Setting the Image type in the Arrow schema metadata allows other libraries and the Hugging Face Dataset Viewer to know that "image" contains images and not just binary data.
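To read the images back from such a Parquet file, you can decode the stored bytes with an imaging library; a minimal sketch, assuming Pillow is installed:

import io

import pyarrow.parquet as pq
from PIL import Image

table = pq.read_table("data.parquet")
images = table["image"].to_pylist()  # list of {"bytes": ..., "path": ...}
first_image = Image.open(io.BytesIO(images[0]["bytes"]))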
Use Audios
You can load a folder with a metadata file containing a field for the names or paths to the audio files, structured like this:
Example 1:                    Example 2:
folder/                       folder/
├── metadata.parquet          ├── metadata.parquet
├── rec000.wav                └── audios
├── rec001.wav                    ├── rec000.wav
...                               ...
└── recNNN.wav                    └── recNNN.wav
You can iterate over the audio paths like this:
from pathlib import Path

import pyarrow.parquet as pq

folder_path = Path("path/to/folder")
table = pq.read_table(folder_path / "metadata.parquet")
for file_name in table["file_name"].to_pylist():
    audio_path = folder_path / file_name
    ...
Since the dataset is in a supported structure (a metadata.parquet file with a file_name field), you can save it to Hugging Face, and the Hub Dataset Viewer shows both the metadata and the audio.
from huggingface_hub import HfApi

api = HfApi()
api.upload_folder(
    folder_path=folder_path,
    repo_id="username/my_audio_dataset",
    repo_type="dataset",
)
Embed Audio inside Parquet
PyArrow has a binary type which makes it possible to store the audio bytes in Arrow tables. This allows saving the dataset as a single Parquet file containing both the audio (bytes and path) and the sample metadata:
import json

import pyarrow as pa
import pyarrow.parquet as pq

# Embed the audio bytes in Arrow
audio_array = pa.array([
    {
        "bytes": (folder_path / file_name).read_bytes(),
        "path": file_name,
    }
    for file_name in table["file_name"].to_pylist()
])
table = table.append_column("audio", audio_array)

# (Optional) Set the HF Audio type for the Dataset Viewer and the `datasets` library
features = {"audio": {"_type": "Audio"}}  # or using datasets.Features(...).to_dict()
schema_metadata = {"huggingface": json.dumps({"dataset_info": {"features": features}})}
table = table.replace_schema_metadata(schema_metadata)

# Save to Parquet
# (Optional) with use_content_defined_chunking for faster uploads and downloads
pq.write_table(table, "data.parquet", use_content_defined_chunking=True)
Setting the Audio type in the Arrow schema metadata enables other libraries and the Hugging Face Dataset Viewer to recognize that "audio" contains audio data, not just binary data.
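Similarly, you can read the audio bytes back from the Parquet file and write them to a file or pass them to an audio library; a minimal sketch using only PyArrow and the standard library (the output file name is just an example):

import pyarrow.parquet as pq

table = pq.read_table("data.parquet")
first_audio = table["audio"].to_pylist()[0]  # {"bytes": ..., "path": ...}
with open("first_audio.wav", "wb") as f:
    f.write(first_audio["bytes"])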