import gradio as gr
from datasets import load_dataset, Features, Value, Audio
from huggingface_hub import create_repo

# --- Default configuration ---
animal_keywords = [
    "dog", "cat", "bird", "fish", "horse", "cow", "sheep", "pig", "chicken",
    "duck", "goat", "lion", "tiger", "bear", "elephant", "monkey", "zebra",
    "giraffe", "rhino", "hippo", "crocodile", "snake", "frog", "turtle",
    "lizard", "spider", "ant", "bee", "butterfly", "wolf", "fox", "deer",
    "rabbit", "squirrel", "mouse", "rat", "hamster", "guinea pig", "parrot",
    "owl", "eagle", "hawk", "penguin", "dolphin", "whale", "shark", "seal",
    "octopus", "crab", "lobster", "shrimp", "snail", "worm", "kangaroo", "koala",
    "panda", "sloth", "hedgehog", "raccoon", "skunk", "beaver", "otter",
    "platypus", "jaguar", "leopard", "cheetah", "puma", "ostrich", "emu",
    "flamingo", "peacock", "swan", "goose", "turkey", "pigeon", "seagull", "antelope",
    "bison", "buffalo", "camel", "llama", "alpaca", "donkey", "mule", "ferret",
    "mongoose", "meerkat", "wombat", "dingo", "armadillo", "badger", "chipmunk", "porcupine"
]
def filter_and_push(dataset_name, split_name, keywords_text, new_dataset_repo_id, hf_token):
    """Filters a dataset on prompt keywords and pushes the subset to the Hub."""
    if not hf_token:
        return "Error: Hugging Face token is required. Please provide it.", None

    try:
        # --- 1. Load the dataset in streaming mode ---
        dataset = load_dataset(dataset_name, split=split_name, streaming=True)

        # --- 2. Process keywords: split the comma-separated string, strip
        #        whitespace, and lowercase for case-insensitive matching ---
        keywords = [keyword.strip().lower() for keyword in keywords_text.split(',') if keyword.strip()]
        if not keywords:
            keywords = animal_keywords  # Fall back to the default animal keywords

        # --- 3. Collect matching indices in a single streaming pass ---
        # Enumerate the *unfiltered* stream so each index refers to a position in
        # the original dataset; enumerating a filtered stream would yield
        # 0, 1, 2, ... and .select() would then pick the wrong rows.
        matching_indices = [
            i for i, example in enumerate(dataset)
            if any(keyword in example["prompt"].lower() for keyword in keywords)
        ]
        if not matching_indices:
            return "No matching examples found with the provided keywords.", None

        # --- 4. Create the subset using .select() on the non-streaming dataset ---
        full_dataset = load_dataset(dataset_name, split=split_name, streaming=False)
        subset_dataset = full_dataset.select(matching_indices)
        # --- 5. Define features (for a consistent schema) ---
        features = Features({
            'prompt': Value(dtype='string', id=None),
            'audio': Audio(sampling_rate=16000),  # Keep the original sampling rate; adjust if needed
            'strategy': Value(dtype='string', id=None),
            'seed': Value(dtype='int64', id=None)
        })
        try:
            subset_dataset = subset_dataset.cast(features)  # Cast to ensure the features match
        except Exception as e:
            return f"An error occurred during casting; please ensure the selected dataset has the expected columns: {e}", None

        # --- 6. Upload the subset dataset ---
        # Create the repository (if it doesn't already exist)
        try:
            create_repo(new_dataset_repo_id, token=hf_token, repo_type="dataset")
            print(f"Repository '{new_dataset_repo_id}' created.")
        except Exception as e:
            if "Repo already exists" not in str(e):
                return f"Error creating repository: {e}", None

        # Upload to the Hugging Face Hub, authenticating with the provided token
        subset_dataset.push_to_hub(new_dataset_repo_id, token=hf_token)
        dataset_url = f"https://huggingface.co/datasets/{new_dataset_repo_id}"
        return f"Subset dataset uploaded successfully! {len(matching_indices)} examples found.", dataset_url

    except Exception as e:
        return f"An error occurred: {e}", None
# --- Gradio Interface ---
with gr.Blocks() as demo:
    gr.Markdown("# Dataset Filter and Push")
    with gr.Row():
        dataset_name_input = gr.Textbox(label="Source Dataset Name (e.g., declare-lab/audio-alpaca)", value="declare-lab/audio-alpaca")
        split_name_input = gr.Textbox(label="Split Name (e.g., train)", value="train")
        keywords_input = gr.Textbox(label="Keywords (comma-separated, e.g., dog, cat, bird)", value="dog, cat, bird")
    with gr.Row():
        new_dataset_repo_id_input = gr.Textbox(label="New Dataset Repo ID (e.g., your_username/your_dataset)")
        hf_token_input = gr.Textbox(label="Hugging Face Token", type="password")
    submit_button = gr.Button("Filter and Push")
    with gr.Row():
        output_text = gr.Textbox(label="Status")
        dataset_output_link = gr.Textbox(label="Dataset URL")

    submit_button.click(
        filter_and_push,
        inputs=[dataset_name_input, split_name_input, keywords_input, new_dataset_repo_id_input, hf_token_input],
        outputs=[output_text, dataset_output_link],
    )

if __name__ == "__main__":
    demo.launch()
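For quick testing, the filtering function can also be called directly, without the UI. A minimal sketch, assuming a valid write token in an HF_TOKEN environment variable and a hypothetical target repo id:

import os

status, url = filter_and_push(
    dataset_name="declare-lab/audio-alpaca",
    split_name="train",
    keywords_text="dog, cat",
    new_dataset_repo_id="your_username/audio-alpaca-animals",  # hypothetical repo id
    hf_token=os.environ.get("HF_TOKEN", ""),
)
print(status, url)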
Key changes and explanations:
Gradio Integration: The code is now structured to work within a Gradio interface. We define input components (Textboxes, a Button) and output components (Textboxes).
filter_and_push Function: The core logic is encapsulated in a function that takes the user inputs from the Gradio components as arguments; this is how Gradio wires UI values into the Python code.
Error Handling: The code includes comprehensive try...except blocks to catch potential errors, such as:
Invalid Hugging Face token.
Problems loading the dataset.
Repository creation errors (checks if the repo already exists).
Issues during the dataset upload.
No matching examples are found.
Casting errors (schema mismatches).
These errors are reported to the user through the Gradio output Textbox, providing helpful feedback.
Keyword Processing: The keywords_input is now a single Textbox where users can enter comma-separated keywords. The code splits this string, trims whitespace, and lowercases each keyword for case-insensitive matching. An empty keyword list falls back to the default animal keywords.
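For example, messy input such as " Dog, CAT ,bird, " parses to a clean list — empty entries are dropped and everything is lowercased:

keywords_text = " Dog, CAT ,bird, "
keywords = [k.strip().lower() for k in keywords_text.split(',') if k.strip()]
print(keywords)  # ['dog', 'cat', 'bird']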
HF Token Handling: The HF token is now an input field (with type="password" for security). It's passed directly to the filter_and_push function.
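If you deploy this as a Space, a common alternative (sketched here as an assumption, not part of the code above) is to read the token from a Space secret via an environment variable instead of asking the user to paste it each time:

import os
hf_token = os.environ.get("HF_TOKEN")  # set "HF_TOKEN" as a secret in the Space settings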
Return Values for Gradio: The filter_and_push function now returns two values:
A status message (string) to display in the output_text Textbox.
The dataset URL (string), or None on failure, for the dataset_output_link Textbox.
Dataset URL Output: After a successful upload, the URL of the newly created dataset is displayed in the dataset_output_link Textbox, making it easy for the user to access their filtered dataset.
Clearer Instructions and Labels: The Gradio interface has descriptive labels for each input field, making it user-friendly.
if __name__ == "__main__": block: This standard Python construct ensures that demo.launch() is only called when the script is run directly (not when imported as a module).
No Hardcoded Values: All user-configurable parameters are now taken as inputs through the Gradio interface, making the space more flexible.
How to run this:
Install Libraries:
pip install gradio datasets huggingface_hub
Save: Save the code as a Python file (e.g., app.py).
Run:
python app.py
Open in Browser: Gradio will provide a local URL (usually http://127.0.0.1:7860) that you can open in your web browser.
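If you need to reach the app from another machine, or want Gradio to create a temporary public link, a small sketch of common launch options (defaults are otherwise assumed to be fine):

demo.launch(
    server_name="0.0.0.0",  # listen on all interfaces instead of localhost only
    share=True,             # ask Gradio to create a temporary public URL
)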
Hugging Face Login (Important): You don't need to run huggingface-cli login for this Gradio app (you enter the token directly), but you do need a Hugging Face account and an API token with "write" access. You can create a token here: https://huggingface.co/settings/tokens
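If you're unsure whether a token is valid, a quick sketch of a check with huggingface_hub — whoami() raises on an invalid token:

from huggingface_hub import HfApi

api = HfApi(token="hf_...")   # your token here
print(api.whoami()["name"])   # raises an error if the token is invalid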
This improved version provides a robust and user-friendly Hugging Face Space for filtering and uploading datasets. It handles errors gracefully, provides clear feedback, and follows best practices for Gradio applications.