Spaces: MagicMeWizard / AI_Powered_Web_Scraper


MagicMeWizard committed 26 days ago

Commit dcb20f6 · verified · 1 Parent(s): ccc5d44

Create DEPLOYMENT.md

Files changed (1)

DEPLOYMENT.md +592 -0

DEPLOYMENT.md ADDED

	@@ -0,0 +1,592 @@

+# 🚀 AI Dataset Studio - Complete Deployment Guide
+**Deploy your AI-powered dataset creation platform with Perplexity integration**
+---
+## 📋 Pre-Deployment Checklist
+### ✅ **Required Files**
+Ensure you have all these files ready:
+```
+ai-dataset-studio/
+├── app.py                    # Main application with Perplexity integration
+├── perplexity_client.py     # Perplexity AI client
+├── config.py               # Configuration management
+├── requirements.txt        # Dependencies
+├── README.md              # Documentation
+├── DEPLOYMENT.md          # This guide
+└── utils.py               # Utility functions (optional)
+```
+### ✅ **API Keys & Environment**
+- [ ] **Perplexity API Key** - Get from [Perplexity AI](https://www.perplexity.ai/)
+- [ ] **HuggingFace Account** - For Space hosting
+- [ ] **Optional**: HuggingFace Token for private datasets
+---
+## 🎯 Deployment Options
+### **Option 1: Full AI-Powered Deployment (Recommended)**
+*Best for: Professional use, maximum features*
+#### Hardware: **T4 Small** ($0.60/hour)
+- ✅ GPU acceleration for AI models
+- ✅ Fast processing (5-15s per article)
+- ✅ All Perplexity features enabled
+- ✅ Production-ready performance
+#### **Step-by-Step:**
+1. **Create HuggingFace Space**
+   ```bash
+   # Go to: https://huggingface.co/new-space
+   Space Name: ai-dataset-studio
+   SDK: Gradio
+   Hardware: T4 Small
+   Visibility: Public (or Private)
+   ```
+2. **Upload Files**
+   - Copy all files listed in the checklist above
+   - Ensure `app.py` is the main file
+   - Keep file structure intact
+3. **Set Environment Variables**
+   ```bash
+   # In Space Settings → Repository secrets:
+   PERPLEXITY_API_KEY = your_perplexity_api_key_here
+   # Optional:
+   HF_TOKEN = your_huggingface_token
+   LOG_LEVEL = INFO
+   DEBUG = false
+   ```
+4. **Deploy & Test**
+   - Space will build automatically (2-3 minutes)
+   - Test Perplexity integration first
+   - Verify all templates work
+---
+### **Option 2: Budget-Friendly Deployment**
+*Best for: Testing, learning, cost-conscious users*
+#### Hardware: **CPU Basic** (Free)
+- ⚡ Basic functionality available
+- ⚠️ Slower AI processing (30-60s per article)
+- ✅ Perplexity discovery still works
+- ✅ Perfect for getting started
+#### **Step-by-Step:**
+1. **Create Space with CPU Basic**
+   ```bash
+   Space Name: ai-dataset-studio
+   SDK: Gradio
+   Hardware: CPU Basic (Free)
+   ```
+2. **Upload Core Files**
+   ```bash
+   # Essential files only:
+   app.py
+   perplexity_client.py
+   requirements.txt
+   README.md
+   config.py
+   ```
+3. **Set API Key**
+   ```bash
+   PERPLEXITY_API_KEY = your_api_key
+   ```
+4. **Gradual Upgrade Path**
+   - Start with CPU Basic
+   - Test functionality
+   - Upgrade to T4 Small when ready
+---
+### **Option 3: Enterprise Deployment**
+*Best for: High-volume usage, team collaboration*
+#### Hardware: **A10G Small** ($1.05/hour)
+- 🚀 Maximum performance (3-8s per article)
+- 💪 Handle large batch processing
+- 🔄 Support multiple concurrent users
+- 📈 Production-scale capabilities
+#### **Additional Setup:**
+1. **Persistent Storage**
+   ```bash
+   # In Space settings:
+   Storage: Small Persistent ($5/month)
+   # Enables data persistence between restarts
+   ```
+2. **Advanced Configuration**
+   ```bash
+   # Environment variables:
+   MAX_SOURCES_PER_SEARCH = 50
+   BATCH_SIZE = 16
+   ENABLE_CACHING = true
+   CONCURRENT_REQUESTS = 5
+   ```
+3. **Monitoring Setup**
+   ```bash
+   # Enable detailed logging:
+   LOG_LEVEL = DEBUG
+   ENABLE_METRICS = true
+   ```
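The environment variables above arrive in the application as strings, so they need to be converted before use. The following is a minimal sketch of one way to do that, not the project's actual `config.py`: the `RuntimeSettings` class, the helper names, and the default values are all assumptions made for illustration.

```python
import os
from dataclasses import dataclass


def _env_int(name: str, default: int) -> int:
    """Read an integer variable, falling back to a default when unset."""
    raw = os.environ.get(name)
    return default if raw is None else int(raw.strip())


def _env_bool(name: str, default: bool) -> bool:
    """Treat common truthy spellings as True; anything else is False."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class RuntimeSettings:
    """Hypothetical container for the tuning variables named in this guide."""
    max_sources_per_search: int
    batch_size: int
    enable_caching: bool
    concurrent_requests: int


def load_settings() -> RuntimeSettings:
    # The defaults below are illustrative assumptions, not the project's values.
    return RuntimeSettings(
        max_sources_per_search=_env_int("MAX_SOURCES_PER_SEARCH", 10),
        batch_size=_env_int("BATCH_SIZE", 8),
        enable_caching=_env_bool("ENABLE_CACHING", False),
        concurrent_requests=_env_int("CONCURRENT_REQUESTS", 3),
    )
```

Calling `load_settings()` once at startup keeps the parsing in one place and makes a malformed value (for example `BATCH_SIZE = sixteen`) fail immediately with a `ValueError` rather than later during processing.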
+---
+## 🔧 Configuration Details
+### **Perplexity API Setup**
+1. **Get API Key**
+   ```bash
+   # Visit: https://www.perplexity.ai/
+   # Sign up for account
+   # Navigate to API section
+   # Generate new API key
+   # Copy key for environment setup
+   ```
+2. **Test API Key**
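+   ```python
+   # Quick test script: reads the key from the environment so it is never hard-coded.
+   import os
+   import requests
+
+   headers = {
+       'Authorization': f"Bearer {os.environ['PERPLEXITY_API_KEY']}",
+       'Content-Type': 'application/json'
+   }
+   response = requests.post(
+       'https://api.perplexity.ai/chat/completions',
+       headers=headers,
+       json={
+           # Model names change over time; confirm this one against the current Perplexity docs.
+           "model": "llama-3.1-sonar-large-128k-online",
+           "messages": [{"role": "user", "content": "Test message"}]
+       },
+       timeout=30,  # avoid hanging if the API is unreachable
+   )
+   print("API Status:", response.status_code)
+   if not response.ok:
+       print("Error body:", response.text)
+   ```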
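The status code printed by the test script is easier to act on with a short interpretation. The helper below is a sketch based on standard HTTP semantics only. It does not inspect Perplexity-specific error bodies, and the function name `explain_status` is hypothetical.

```python
def explain_status(status_code: int) -> str:
    """Map an HTTP status code to a likely next step (generic HTTP meaning)."""
    if status_code == 200:
        return "OK: the key was accepted and the request succeeded"
    if status_code == 400:
        return "Bad request: check the JSON payload and the model name"
    if status_code in (401, 403):
        return "Auth failure: the key is missing, invalid, or lacks access"
    if status_code == 429:
        return "Rate limited: slow down or check remaining credits"
    if 500 <= status_code < 600:
        return "Server error: retry later"
    return f"Unexpected status {status_code}: inspect the response body"
```

For example, `explain_status(response.status_code)` can replace the bare `print` when running the check in a Space, since the logs then say what to fix.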
+### **Hardware Requirements by Use Case**
+| Use Case | Hardware | Monthly Cost (running 24/7) | Performance | Best For |
+|----------|----------|--------------|-------------|----------|
+| **Learning** | CPU Basic | Free | Basic | Students, hobbyists |
+| **Development** | CPU Upgrade | $22 | Good | Developers, testing |
+| **Production** | T4 Small | $432 | Excellent | Businesses, researchers |
+| **Enterprise** | A10G Small | $756 | Maximum | High-volume, teams |
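The monthly figures in the table appear to follow from the hourly rates quoted earlier in this guide, multiplied by 24 hours and 30 days. The sketch below reproduces that arithmetic and shows how the cost drops when the hardware runs only part of the day. The rates are the ones stated in this guide and may not match current pricing; the function name is hypothetical.

```python
# Hourly rates as quoted in this guide (USD); verify against current pricing.
HOURLY_RATES_USD = {
    "CPU Basic": 0.00,
    "T4 Small": 0.60,
    "A10G Small": 1.05,
}


def monthly_cost(hardware: str, hours_per_day: float = 24.0, days: int = 30) -> float:
    """Estimate the cost of running one Space for a month."""
    return round(HOURLY_RATES_USD[hardware] * hours_per_day * days, 2)


if __name__ == "__main__":
    for name in HOURLY_RATES_USD:
        print(f"{name}: ${monthly_cost(name):.2f} always on, "
              f"${monthly_cost(name, hours_per_day=2):.2f} at 2 h/day")
```

Running a T4 Small for two hours a day instead of around the clock brings the estimate from $432 down to $36, which is the reasoning behind upgrading only while processing.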
+### **Memory & Storage Planning**
+```bash
+# Model Memory Usage:
+BART Summarization: ~1.5GB
+RoBERTa Sentiment: ~500MB
+BERT NER: ~400MB
+Base Application: ~200MB
+Total GPU Memory: ~2.6GB (T4 Small = 16GB, plenty of headroom)
+# Storage Usage:
+Application Files: ~50MB
+Model Cache: ~2GB
+Temporary Data: ~100MB per project
+Persistent Storage: Optional, recommended for large projects
+```
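Summing the per-component estimates listed above gives the total footprint and the remaining headroom on a given GPU. This is a small worked calculation using the guide's own approximate numbers. The function name is hypothetical, and real usage will vary with batch size and sequence length.

```python
# Approximate memory per component in GB, taken from the estimates in this guide.
MODEL_MEMORY_GB = {
    "BART summarization": 1.5,
    "RoBERTa sentiment": 0.5,
    "BERT NER": 0.4,
    "Base application": 0.2,
}


def memory_headroom(gpu_gb: float, usage: dict = MODEL_MEMORY_GB) -> tuple:
    """Return (total estimated usage, remaining headroom), both in GB."""
    total = round(sum(usage.values()), 2)
    return total, round(gpu_gb - total, 2)


if __name__ == "__main__":
    total, free = memory_headroom(16.0)  # T4 Small has 16 GB
    print(f"Estimated usage: {total} GB, headroom: {free} GB")
```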
+---
+## 🧪 Testing Your Deployment
+### **Basic Functionality Test**
+1. **Launch Application**
+   ```bash
+   # Your Space URL: https://huggingface.co/spaces/YOUR_USERNAME/ai-dataset-studio
+   # Wait for "Running" status
+   # Interface should load within 30-60 seconds
+   ```
+2. **Test Project Creation**
+   ```bash
+   Project Name: "Test Sentiment Analysis"
+   Template: Sentiment Analysis
+   Description: "Testing the deployment"
+   Click: "Create Project"
+   Expected: "✅ Project created successfully"
+   ```
+3. **Test Perplexity Integration**
+   ```bash
+   AI Search Description: "Product reviews for sentiment analysis"
+   Search Type: General
+   Max Sources: 10
+   Click: "Discover Sources with AI"
+   Expected: List of relevant URLs with quality scores
+   ```
+### **Advanced Testing**
+4. **Test Complete Workflow**
+   ```bash
+   # Use discovered sources from step 3
+   Click: "Use These Sources"
+   Click: "Start Scraping"
+   Wait: Processing to complete
+   Click: "Process Data"
+   Select: Same template as project
+   Click: "Export Dataset"
+   Format: JSON
+   Expected: Downloadable dataset file
+   ```
+5. **Performance Benchmarks**
+   ```bash
+   # Timing expectations:
+   AI Source Discovery: 5-15 seconds
+   Scraping 10 URLs: 30-120 seconds
+   Processing Data: 30-180 seconds (depends on hardware)
+   Export: 5-10 seconds
+   ```
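To compare a deployment against the timing expectations above, each stage can be wrapped in a timer. The sketch below uses only the standard library. The stage names and the `timed` and `over_budget` helpers are assumptions for illustration and are not part of the application.

```python
import time
from contextlib import contextmanager

# Expected (min, max) seconds per stage, copied from the benchmarks in this guide.
EXPECTED_SECONDS = {
    "discovery": (5, 15),
    "scraping": (30, 120),
    "processing": (30, 180),
    "export": (5, 10),
}


@contextmanager
def timed(stage: str, results: dict):
    """Record the wall-clock duration of the wrapped block under `stage`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[stage] = time.perf_counter() - start


def over_budget(results: dict) -> list:
    """List the stages that took longer than their expected maximum."""
    return [
        stage for stage, seconds in results.items()
        if seconds > EXPECTED_SECONDS[stage][1]
    ]
```

Usage is `with timed("scraping", results): ...` around each step of the workflow, followed by `over_budget(results)` to see which stages point to a hardware or network bottleneck.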
+---
+## 🚨 Troubleshooting
+### **Common Issues & Solutions**
+#### ❌ **"Perplexity API key not found"**
+```bash
+# Problem: Environment variable not set
+# Solution:
+1. Go to Space Settings → Repository secrets
+2. Add: PERPLEXITY_API_KEY = your_key_here
+3. Restart Space
+4. Check logs for "✅ Perplexity AI client initialized"
+```
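A missing secret is easier to diagnose when the application fails at startup with a clear message instead of later during the first search. The following is a minimal sketch of that check. `MissingApiKeyError` and `require_api_key` are hypothetical names and do not describe how `perplexity_client.py` actually behaves.

```python
import os
from typing import Mapping, Optional


class MissingApiKeyError(RuntimeError):
    """Raised when the expected secret is absent or empty."""


def require_api_key(env: Optional[Mapping[str, str]] = None,
                    name: str = "PERPLEXITY_API_KEY") -> str:
    """Return the stripped key, or raise with instructions on how to set it."""
    env = os.environ if env is None else env
    key = (env.get(name) or "").strip()
    if not key:
        raise MissingApiKeyError(
            f"{name} is not set. Add it under Space Settings -> "
            "Repository secrets, then restart the Space."
        )
    return key
```

Passing a plain dictionary as `env` makes the check testable without touching the real environment.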
+#### ❌ **"No sources found" from AI discovery**
+```bash
+# Problem: Search query too specific or API limits
+# Solutions:
+1. Make description more general
+2. Try different search types
+3. Check API key has sufficient credits
+4. Use manual URL entry as fallback
+```
+#### ❌ **"Model loading failed"**
+```bash
+# Problem: Insufficient memory or network issues
+# Solutions:
+1. Upgrade to T4 Small for GPU memory
+2. Wait 2-3 minutes for model downloads
+3. Check Space logs for specific errors
+4. Restart Space if persistent
+```
+#### ❌ **"Scraping failed" for multiple URLs**
+```bash
+# Problem: Rate limiting or blocked access
+# Solutions:
+1. Reduce concurrent requests
+2. Check robots.txt compliance
+3. Use more diverse sources
+4. Verify URLs are publicly accessible
+```
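Checking robots.txt compliance can be done with Python's standard `urllib.robotparser` module. The sketch below parses robots.txt content that has already been downloaded, so it runs without network access. In a live scraper, `RobotFileParser.set_url()` followed by `read()` would fetch the file instead. The wrapper name `allowed_by_robots` is hypothetical.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if `user_agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    rules = "User-agent: *\nDisallow: /private/\n"
    print(allowed_by_robots(rules, "https://example.com/blog/post"))
    print(allowed_by_robots(rules, "https://example.com/private/report"))
```

Filtering the URL list through a check like this before scraping removes one common cause of blocked requests.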
+### **Debug Mode**
+Enable detailed logging for troubleshooting:
+```bash
+# Environment variables:
+DEBUG = true
+LOG_LEVEL = DEBUG
+# Then check Space logs for detailed information
+```
+### **Health Check Script**
+```python
+# Add this to test basic functionality:
+def health_check():
+    """Test all components"""
+    # Test imports
+    try:
+        import gradio
+        print("✅ Gradio imported")
+    except ImportError:
+        print("❌ Gradio import failed")
+    # Test Perplexity
+    try:
+        from perplexity_client import PerplexityClient
+        client = PerplexityClient()
+        if client._validate_api_key():
+            print("✅ Perplexity API key valid")
+        else:
+            print("❌ Perplexity API key invalid")
+    except Exception as e:
+        print(f"❌ Perplexity error: {e}")
+    # Test models
+    try:
+        from transformers import pipeline
+        print("✅ Transformers available")
+    except ImportError:
+        print("⚠️ Transformers not available (CPU fallback)")
+# Run the health check when this file is executed directly in your Space
+if __name__ == "__main__":
+    health_check()
+```
+---
+## 🔄 Maintenance & Updates
+### **Regular Maintenance Tasks**
+1. **Monitor API Usage**
+   ```bash
+   # Check Perplexity dashboard for:
+   - API calls remaining
+   - Rate limit status
+   - Billing usage
+   ```
+2. **Update Dependencies**
+   ```bash
+   # Periodically update requirements.txt:
+   gradio>=4.44.0  # Check for latest version
+   transformers>=4.30.0
+   # Test thoroughly after updates
+   ```
+3. **Performance Monitoring**
+   ```bash
+   # Monitor Space metrics:
+   - CPU/GPU usage
+   - Memory consumption
+   - Request response times
+   - Error rates
+   ```
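After a dependency update, it helps to confirm that the installed packages meet the minimums pinned in `requirements.txt`. The sketch below uses the standard `importlib.metadata` module. The version comparison is deliberately simplified (leading numeric components only, so pre-release suffixes are ignored), and the function names are hypothetical.

```python
from importlib import metadata

# Minimum versions as listed in this guide.
MINIMUMS = {"gradio": "4.44.0", "transformers": "4.30.0"}


def version_tuple(version: str) -> tuple:
    """Keep the leading numeric parts only: '4.44.0rc1' -> (4, 44, 0)."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def check_minimums(minimums: dict = MINIMUMS) -> dict:
    """Report, per package, the installed version and whether it meets the minimum."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
            continue
        ok = version_tuple(installed) >= version_tuple(minimum)
        report[name] = f"{installed} ({'ok' if ok else 'below ' + minimum})"
    return report
```

Note that comparing tuples avoids the string-comparison trap where `"4.9.1"` would sort after `"4.44.0"`.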
+### **Backup Strategy**
+```bash
+# Important data to backup:
+1. Configuration files (app.py, config.py)
+2. Custom templates or modifications
+3. API keys and environment variables
+4. Any persistent data or datasets
+# HuggingFace Spaces automatically versions your files
+# Use git commands to manage versions
+```
+---
+## 📈 Scaling & Optimization
+### **Performance Optimization**
+1. **Model Optimization**
+   ```python
+   # In config.py, adjust for your needs:
+   batch_size = 16  # Increase for better GPU utilization
+   max_sequence_length = 256  # Reduce for faster processing
+   confidence_threshold = 0.8  # Higher for better quality
+   ```
+2. **Caching Strategy**
+   ```python
+   # Enable model caching:
+   cache_models = True
+   model_cache_dir = "./model_cache"
+   # Cache API responses:
+   cache_api_responses = True
+   cache_ttl_hours = 24
+   ```
+3. **Resource Management**
+   ```python
+   # Optimize memory usage:
+   clear_cache_after_processing = True
+   max_concurrent_requests = 3
+   timeout_per_url = 10  # seconds
+   ```
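The `cache_api_responses` and `cache_ttl_hours` settings above imply a time-limited cache. How the application implements it is not shown in this guide, so the following is only a sketch of the general technique: an in-memory dictionary whose entries expire after a fixed number of hours. The `TTLCache` class is hypothetical.

```python
import time


class TTLCache:
    """In-memory cache whose entries expire after `ttl_hours`."""

    def __init__(self, ttl_hours: float = 24, clock=time.monotonic):
        self._ttl_seconds = ttl_hours * 3600
        self._clock = clock  # injectable so expiry can be tested without waiting
        self._store = {}

    def set(self, key, value) -> None:
        self._store[key] = (self._clock(), value)

    def get(self, key):
        """Return the cached value, or None if it is missing or expired."""
        entry = self._store.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self._clock() - stored_at > self._ttl_seconds:
            del self._store[key]  # drop stale entries on access
            return None
        return value
```

Keying the cache on the normalized search description means that repeated or near-identical queries reuse earlier results instead of spending API credits. An in-memory cache like this is lost whenever the Space restarts.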
+### **Cost Optimization**
+1. **Auto-Sleep Configuration**
+   ```bash
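+   # Free CPU Basic Spaces go to sleep automatically after a long idle period
+   # (about 48 hours at the time of writing) and resume on the next request.
+   # Paid hardware keeps running (and is billed) unless you set a sleep time in Space Settings.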
+   ```
+2. **Hardware Scheduling**
+   ```bash
+   # Strategy: Start with CPU Basic
+   # Upgrade to T4 Small during processing
+   # Downgrade back to CPU Basic when idle
+   ```
+3. **API Cost Management**
+   ```bash
+   # Perplexity API optimization:
+   - Cache search results for similar queries
+   - Use more specific search terms
+   - Implement request batching
+   - Set reasonable max_sources limits
+   ```
+---
+## 🎓 Best Practices
+### **Security Best Practices**
+1. **API Key Management**
+   ```bash
+   ✅ Store in HuggingFace Spaces secrets
+   ✅ Never commit to git repositories
+   ✅ Rotate keys periodically
+   ✅ Monitor usage for anomalies
+   ```
+2. **Safe Scraping**
+   ```bash
+   ✅ Respect robots.txt
+   ✅ Implement rate limiting
+   ✅ Use appropriate user agents
+   ✅ Avoid private/internal networks
+   ```
+3. **Data Privacy**
+   ```bash
+   ✅ No persistent data storage by default
+   ✅ Clear temporary files after processing
+   ✅ Respect copyright and fair use
+   ✅ Provide clear data source attribution
+   ```
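The "avoid private/internal networks" item can be enforced with a check before each request. The sketch below uses the standard `ipaddress` and `urllib.parse` modules to reject non-HTTP schemes, `localhost`, and literal IP addresses that are not globally routable. It is an illustration under a stated limitation: a DNS name can still resolve to a private address, so a production scraper would also need to check the resolved IP. The function name is hypothetical.

```python
import ipaddress
from urllib.parse import urlparse


def is_public_target(url: str) -> bool:
    """Return False for URLs that clearly point at local or private hosts."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    host = parsed.hostname
    if host == "localhost" or host.endswith(".localhost"):
        return False
    try:
        address = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal: a DNS name, which this sketch does not resolve.
        return True
    return address.is_global
```

Applying this filter to both discovered and manually entered URLs keeps the scraper from being pointed at cloud metadata endpoints or services on the host's own network.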
+### **Development Best Practices**
+1. **Testing Strategy**
+   ```bash
+   # Test with small datasets first
+   # Verify each step of the pipeline
+   # Use diverse source types
+   # Test error conditions
+   ```
+2. **Version Control**
+   ```bash
+   # Use git for file management
+   # Tag stable releases
+   # Document changes and updates
+   # Keep rollback capability
+   ```
+3. **Documentation**
+   ```bash
+   # Keep README.md updated
+   # Document custom configurations
+   # Provide usage examples
+   # Include troubleshooting guides
+   ```
+---
+## 🆘 Getting Help
+### **Support Channels**
+1. **HuggingFace Community**
+   - Discussions: Share issues and solutions
+   - Discord: Real-time help from community
+2. **GitHub Issues**
+   - Bug reports and feature requests
+   - Include logs and configuration details
+3. **Documentation**
+   - README.md: Complete usage guide
+   - DEPLOYMENT.md: This guide
+   - Code comments: Inline documentation
+### **Information to Include When Asking for Help**
+```bash
+1. Deployment type (CPU Basic, T4 Small, etc.)
+2. Error messages (exact text)
+3. Space logs (relevant sections)
+4. Configuration details (without API keys)
+5. Steps to reproduce the issue
+6. Expected vs actual behavior
+```
+---
+## 🎉 Success Indicators
+Your deployment is successful when you see:
+```bash
+✅ Space builds without errors
+✅ Interface loads within 60 seconds
+✅ Perplexity AI discovery works
+✅ Can create projects and scrape URLs
+✅ AI processing generates quality data
+✅ Export produces valid dataset files
+✅ No persistent errors in logs
+```
+---
+## 🚀 What's Next?
+After successful deployment:
+1. **Create Your First Dataset**
+   - Start with a simple sentiment analysis project
+   - Use AI discovery to find sources
+   - Process and export a small dataset
+2. **Explore Advanced Features**
+   - Try different templates
+   - Experiment with search types
+   - Test batch processing
+3. **Optimize for Your Use Case**
+   - Adjust configurations
+   - Create custom templates
+   - Integrate with your ML pipeline
+4. **Share and Collaborate**
+   - Make Space public to help others
+   - Contribute improvements
+   - Share success stories
+**Your AI Dataset Studio is now ready to revolutionize how you create training datasets!** 🎯
+*From idea to ML-ready dataset in minutes, not weeks.*