Scaling AI-based Data Processing with Hugging Face + Dask
β’
20
outlines
library with transformers
to to define a JSON schema that the generation has to follow. It uses a Finite State Machine with token_id
as transitions.Search and save datasets generated with a LLM in real time
Generate synthetic dataset files (JSON Lines)
ReWrite datasets with a text instruction
JupyterLab Notebooks With PySpark Optimized for HF Datasets
Generate synthetic dataset files (JSON Lines)
Visualize HDF5 data of the dataset
JupyterLab Notebooks With PySpark Optimized for HF Datasets