# 🚀 AI Dataset Studio - Complete Deployment Guide

**Deploy your AI-powered dataset creation platform with Perplexity integration**

---

## 📋 Pre-Deployment Checklist

### ✅ **Required Files**

Ensure you have all of these files ready:

```
ai-dataset-studio/
├── app.py                 # Main application with Perplexity integration
├── perplexity_client.py   # Perplexity AI client
├── config.py              # Configuration management
├── requirements.txt       # Dependencies
├── README.md              # Documentation
├── DEPLOYMENT.md          # This guide
└── utils.py               # Utility functions (optional)
```

### ✅ **API Keys & Environment**

- [ ] **Perplexity API Key** - get one from [Perplexity AI](https://www.perplexity.ai/)
- [ ] **HuggingFace Account** - for Space hosting
- [ ] **Optional**: HuggingFace token for private datasets

---

## 🎯 Deployment Options

### **Option 1: Full AI-Powered Deployment (Recommended)**

*Best for: professional use, maximum features*

#### Hardware: **T4 Small** ($0.60/hour)

- ✅ GPU acceleration for AI models
- ✅ Fast processing (5-15s per article)
- ✅ All Perplexity features enabled
- ✅ Production-ready performance

#### **Step-by-Step:**

1. **Create a HuggingFace Space**
   ```bash
   # Go to: https://huggingface.co/new-space
   Space Name: ai-dataset-studio
   SDK: Gradio
   Hardware: T4 Small
   Visibility: Public (or Private)
   ```

2. **Upload Files**
   - Copy all files from the artifacts above
   - Ensure `app.py` is the main file
   - Keep the file structure intact

3. **Set Environment Variables**
   ```bash
   # In Space Settings → Repository secrets:
   PERPLEXITY_API_KEY = your_perplexity_api_key_here

   # Optional:
   HF_TOKEN = your_huggingface_token
   LOG_LEVEL = INFO
   DEBUG = false
   ```

4. **Deploy & Test**
   - The Space builds automatically (2-3 minutes)
   - Test the Perplexity integration first
   - Verify all templates work

---

### **Option 2: Budget-Friendly Deployment**

*Best for: testing, learning, cost-conscious users*

#### Hardware: **CPU Basic** (Free)

- ⚡ Basic functionality available
- ⚠️ Slower AI processing (30-60s per article)
- ✅ Perplexity discovery still works
- ✅ Perfect for getting started

#### **Step-by-Step:**

1. **Create a Space with CPU Basic**
   ```bash
   Space Name: ai-dataset-studio
   SDK: Gradio
   Hardware: CPU Basic (Free)
   ```

2. **Upload Core Files**
   ```bash
   # Essential files only:
   app.py
   perplexity_client.py
   config.py
   requirements.txt
   README.md
   ```

3. **Set the API Key**
   ```bash
   PERPLEXITY_API_KEY = your_api_key
   ```

4. **Gradual Upgrade Path**
   - Start with CPU Basic
   - Test functionality
   - Upgrade to T4 Small when ready

---

### **Option 3: Enterprise Deployment**

*Best for: high-volume usage, team collaboration*

#### Hardware: **A10G Small** ($1.05/hour)

- 🚀 Maximum performance (3-8s per article)
- 💪 Handles large batch processing
- 👥 Supports multiple concurrent users
- 📈 Production-scale capabilities

#### **Additional Setup:**

1. **Persistent Storage**
   ```bash
   # In Space settings:
   Storage: Small Persistent ($5/month)
   # Enables data persistence between restarts
   ```

2. **Advanced Configuration**
   ```bash
   # Environment variables:
   MAX_SOURCES_PER_SEARCH = 50
   BATCH_SIZE = 16
   ENABLE_CACHING = true
   CONCURRENT_REQUESTS = 5
   ```

3. **Monitoring Setup**
   ```bash
   # Enable detailed logging:
   LOG_LEVEL = DEBUG
   ENABLE_METRICS = true
   ```

---

## 🔧 Configuration Details

### **Perplexity API Setup**

1. **Get an API Key**
   - Visit <https://www.perplexity.ai/>
   - Sign up for an account
   - Navigate to the API section
   - Generate a new API key
   - Copy the key for the environment setup

2. **Test the API Key**
   ```python
   # Quick test script:
   import requests

   headers = {
       'Authorization': 'Bearer YOUR_API_KEY',
       'Content-Type': 'application/json'
   }

   response = requests.post(
       'https://api.perplexity.ai/chat/completions',
       headers=headers,
       json={
           "model": "llama-3.1-sonar-large-128k-online",
           "messages": [{"role": "user", "content": "Test message"}]
       },
       timeout=30  # avoid hanging indefinitely on network issues
   )

   print("API Status:", response.status_code)
   ```
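
Once the key is stored as a Space secret, it surfaces to the app as an environment variable. A minimal sketch of how `config.py` might read and validate it (the function name is hypothetical; the real module may structure this differently):

```python
import os

def load_perplexity_key() -> str:
    """Read the Perplexity API key from the environment.

    In a HuggingFace Space, Repository secrets are exposed to the
    running app as environment variables.
    (Illustrative helper, not the actual config.py API.)
    """
    key = os.environ.get("PERPLEXITY_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "PERPLEXITY_API_KEY is not set - add it under "
            "Space Settings -> Repository secrets"
        )
    return key
```

Failing fast with a clear message here is what produces a useful log line instead of a cryptic authentication error later in the pipeline.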

### **Hardware Requirements by Use Case**

| Use Case | Hardware | Monthly Cost | Performance | Best For |
|----------|----------|--------------|-------------|----------|
| **Learning** | CPU Basic | Free | Basic | Students, hobbyists |
| **Development** | CPU Upgrade | $22 | Good | Developers, testing |
| **Production** | T4 Small | $432 | Excellent | Businesses, researchers |
| **Enterprise** | A10G Small | $756 | Maximum | High-volume, teams |

*Monthly costs assume 24/7 uptime; auto-sleep (see Cost Optimization below) reduces actual spend.*

### **Memory & Storage Planning**

```bash
# Model memory usage:
BART Summarization:  ~1.5GB
RoBERTa Sentiment:   ~500MB
BERT NER:            ~400MB
Base Application:    ~200MB
Total GPU Memory:    ~2.5GB (T4 Small = 16GB, plenty of headroom)

# Storage usage:
Application Files:   ~50MB
Model Cache:         ~2GB
Temporary Data:      ~100MB per project
Persistent Storage:  Optional, recommended for large projects
```

---

## 🧪 Testing Your Deployment

### **Basic Functionality Test**

1. **Launch the Application**
   - Your Space URL: `https://huggingface.co/spaces/YOUR_USERNAME/ai-dataset-studio`
   - Wait for the "Running" status
   - The interface should load within 30-60 seconds

2. **Test Project Creation**
   ```
   Project Name: "Test Sentiment Analysis"
   Template: Sentiment Analysis
   Description: "Testing the deployment"
   Click: "Create Project"
   Expected: "✅ Project created successfully"
   ```

3. **Test the Perplexity Integration**
   ```
   AI Search Description: "Product reviews for sentiment analysis"
   Search Type: General
   Max Sources: 10
   Click: "Discover Sources with AI"
   Expected: List of relevant URLs with quality scores
   ```

### **Advanced Testing**

4. **Test the Complete Workflow**
   ```
   # Use discovered sources from step 3
   Click: "Use These Sources"
   Click: "Start Scraping"
   Wait: For processing to complete
   Click: "Process Data"
   Select: Same template as the project
   Click: "Export Dataset"
   Format: JSON
   Expected: Downloadable dataset file
   ```

5. **Performance Benchmarks**
   ```
   # Timing expectations:
   AI Source Discovery:  5-15 seconds
   Scraping 10 URLs:     30-120 seconds
   Processing Data:      30-180 seconds (depends on hardware)
   Export:               5-10 seconds
   ```

---

## 🚨 Troubleshooting

### **Common Issues & Solutions**

#### ❌ **"Perplexity API key not found"**

```
# Problem: environment variable not set
# Solution:
1. Go to Space Settings → Repository secrets
2. Add: PERPLEXITY_API_KEY = your_key_here
3. Restart the Space
4. Check the logs for "✅ Perplexity AI client initialized"
```

#### ❌ **"No sources found" from AI discovery**

```
# Problem: search query too specific, or API limits reached
# Solutions:
1. Make the description more general
2. Try different search types
3. Check that the API key has sufficient credits
4. Use manual URL entry as a fallback
```

#### ❌ **"Model loading failed"**

```
# Problem: insufficient memory or network issues
# Solutions:
1. Upgrade to T4 Small for GPU memory
2. Wait 2-3 minutes for model downloads
3. Check the Space logs for specific errors
4. Restart the Space if the problem persists
```

#### ❌ **"Scraping failed" for multiple URLs**

```
# Problem: rate limiting or blocked access
# Solutions:
1. Reduce concurrent requests
2. Check robots.txt compliance
3. Use more diverse sources
4. Verify the URLs are publicly accessible
```
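
The robots.txt compliance check mentioned above can be done with the standard library's `urllib.robotparser`. A minimal sketch, assuming you have already fetched the robots.txt body for the host (helper name and user agent are illustrative):

```python
from urllib.robotparser import RobotFileParser

def allowed_by_robots(robots_txt: str, url: str,
                      agent: str = "AIDatasetStudio") -> bool:
    """Check a URL against an already-fetched robots.txt body.

    (Illustrative helper; in production you would fetch robots.txt
    once per host and cache the parser between requests.)
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # feed rules line by line
    return parser.can_fetch(agent, url)
```

A URL that fails this check should simply be skipped rather than retried; retrying a disallowed path looks like abuse to the target site.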

### **Debug Mode**

Enable detailed logging for troubleshooting:

```bash
# Environment variables:
DEBUG = true
LOG_LEVEL = DEBUG

# Then check the Space logs for detailed information
```

### **Health Check Script**

```python
# Add this to test basic functionality:
def health_check():
    """Test all components."""

    # Test imports
    try:
        import gradio
        print("✅ Gradio imported")
    except ImportError:
        print("❌ Gradio import failed")

    # Test Perplexity
    try:
        from perplexity_client import PerplexityClient
        client = PerplexityClient()
        if client._validate_api_key():
            print("✅ Perplexity API key valid")
        else:
            print("❌ Perplexity API key invalid")
    except Exception as e:
        print(f"❌ Perplexity error: {e}")

    # Test models
    try:
        from transformers import pipeline
        print("✅ Transformers available")
    except ImportError:
        print("⚠️ Transformers not available (CPU fallback)")

# Run the health check in your Space:
health_check()
```

---

## 🔄 Maintenance & Updates

### **Regular Maintenance Tasks**

1. **Monitor API Usage**
   - Check the Perplexity dashboard for:
     - API calls remaining
     - Rate limit status
     - Billing usage

2. **Update Dependencies**
   ```
   # Periodically update requirements.txt:
   gradio>=4.44.0          # Check for the latest version
   transformers>=4.30.0
   # Test thoroughly after updates
   ```

3. **Performance Monitoring**
   - Monitor Space metrics:
     - CPU/GPU usage
     - Memory consumption
     - Request response times
     - Error rates

### **Backup Strategy**

Important data to back up:

1. Configuration files (`app.py`, `config.py`)
2. Custom templates or modifications
3. API keys and environment variables
4. Any persistent data or datasets

HuggingFace Spaces automatically versions your files; use git commands to manage versions.

---

## 📈 Scaling & Optimization

### **Performance Optimization**

1. **Model Optimization**
   ```python
   # In config.py, adjust for your needs:
   batch_size = 16              # Increase for better GPU utilization
   max_sequence_length = 256    # Reduce for faster processing
   confidence_threshold = 0.8   # Raise for higher-quality output
   ```

2. **Caching Strategy**
   ```python
   # Enable model caching:
   cache_models = True
   model_cache_dir = "./model_cache"

   # Cache API responses:
   cache_api_responses = True
   cache_ttl_hours = 24
   ```

3. **Resource Management**
   ```python
   # Optimize memory usage:
   clear_cache_after_processing = True
   max_concurrent_requests = 3
   timeout_per_url = 10  # seconds
   ```
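
The `cache_api_responses` / `cache_ttl_hours` flags above could be backed by a very small in-memory time-to-live cache. A minimal sketch (the class is illustrative; the real app may use a library such as `diskcache` instead):

```python
import time

class TTLCache:
    """Tiny time-to-live cache for API responses.

    (Illustrative sketch of what the caching flags could map to.)
    """

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._store: dict = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, stored_at = entry
        if time.monotonic() - stored_at > self.ttl:
            del self._store[key]  # expired - evict and report a miss
            return None
        return value

    def set(self, key, value):
        self._store[key] = (value, time.monotonic())
```

Keying the cache on a normalized search description lets repeated or near-identical AI discovery queries skip the Perplexity round trip entirely.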

### **Cost Optimization**

1. **Auto-Sleep Configuration**
   - HuggingFace Spaces auto-sleep after one hour of inactivity
   - No additional configuration needed
   - The Space automatically resumes on the next request

2. **Hardware Scheduling**
   - Strategy: start with CPU Basic
   - Upgrade to T4 Small during heavy processing
   - Downgrade back to CPU Basic when idle

3. **API Cost Management**
   - Cache search results for similar queries
   - Use more specific search terms
   - Implement request batching
   - Set reasonable max_sources limits

---

## 🌟 Best Practices

### **Security Best Practices**

1. **API Key Management**
   - ✅ Store keys in HuggingFace Spaces secrets
   - ✅ Never commit keys to git repositories
   - ✅ Rotate keys periodically
   - ✅ Monitor usage for anomalies

2. **Safe Scraping**
   - ✅ Respect robots.txt
   - ✅ Implement rate limiting
   - ✅ Use appropriate user agents
   - ✅ Avoid private/internal networks

3. **Data Privacy**
   - ✅ No persistent data storage by default
   - ✅ Clear temporary files after processing
   - ✅ Respect copyright and fair use
   - ✅ Provide clear data-source attribution
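
The rate limiting recommended under Safe Scraping can be as simple as enforcing a minimum interval between outgoing requests. A minimal sketch (class name and interval are illustrative; a production scraper might use a per-domain token bucket instead):

```python
import time

class RateLimiter:
    """Enforce a minimum interval between outgoing requests."""

    def __init__(self, min_interval_s: float):
        self.min_interval = min_interval_s
        self._last_call = None  # monotonic timestamp of the last request

    def wait(self):
        """Block until at least min_interval_s has passed since the last call."""
        now = time.monotonic()
        if self._last_call is not None:
            elapsed = now - self._last_call
            if elapsed < self.min_interval:
                time.sleep(self.min_interval - elapsed)
        self._last_call = time.monotonic()
```

Calling `limiter.wait()` immediately before each HTTP request keeps the scraper polite regardless of how fast the rest of the pipeline runs.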

### **Development Best Practices**

1. **Testing Strategy**
   - Test with small datasets first
   - Verify each step of the pipeline
   - Use diverse source types
   - Test error conditions

2. **Version Control**
   - Use git for file management
   - Tag stable releases
   - Document changes and updates
   - Keep rollback capability

3. **Documentation**
   - Keep README.md updated
   - Document custom configurations
   - Provide usage examples
   - Include troubleshooting guides

---

## 📞 Getting Help

### **Support Channels**

1. **HuggingFace Community**
   - Discussions: share issues and solutions
   - Discord: real-time help from the community

2. **GitHub Issues**
   - Bug reports and feature requests
   - Include logs and configuration details

3. **Documentation**
   - README.md: complete usage guide
   - DEPLOYMENT.md: this guide
   - Code comments: inline documentation

### **Information to Include When Asking for Help**

1. Deployment type (CPU Basic, T4 Small, etc.)
2. Error messages (exact text)
3. Space logs (relevant sections)
4. Configuration details (without API keys)
5. Steps to reproduce the issue
6. Expected vs. actual behavior

---

## 🎉 Success Indicators

Your deployment is successful when you see:

- ✅ The Space builds without errors
- ✅ The interface loads within 60 seconds
- ✅ Perplexity AI discovery works
- ✅ You can create projects and scrape URLs
- ✅ AI processing generates quality data
- ✅ Export produces valid dataset files
- ✅ No persistent errors in the logs

---

## 🚀 What's Next?

After a successful deployment:

1. **Create Your First Dataset**
   - Start with a simple sentiment-analysis project
   - Use AI discovery to find sources
   - Process and export a small dataset

2. **Explore Advanced Features**
   - Try different templates
   - Experiment with search types
   - Test batch processing

3. **Optimize for Your Use Case**
   - Adjust configurations
   - Create custom templates
   - Integrate with your ML pipeline

4. **Share and Collaborate**
   - Make your Space public to help others
   - Contribute improvements
   - Share success stories

**Your AI Dataset Studio is now ready to revolutionize how you create training datasets!** 🎯

*From idea to ML-ready dataset in minutes, not weeks.*