Spaces:

MagicMeWizard
/

AI_Powered_Web_Scraper

Running

App Files Files Community

AI_Powered_Web_Scraper / README.md

MagicMeWizard

Update README.md

fdb51e4 verified 5 days ago

preview code

raw

history blame contribute delete

12 kB

	---
	title: AI Dataset Studio
	emoji: 🚀
	colorFrom: blue
	colorTo: purple
	sdk: gradio
	sdk_version: 5.35.0
	app_file: app.py
	pinned: false
	---

	# 🚀 AI Dataset Studio

	Create high-quality training datasets with AI-powered source discovery

	A comprehensive platform for building ML datasets that combines web scraping, AI processing, and smart source discovery using Perplexity AI. Perfect for researchers, data scientists, and AI enthusiasts who need quality training data without the complexity.

	---

	## ✨ Key Features

	### 🧠 AI-Powered Source Discovery
	- Perplexity AI Integration: Automatically discover relevant sources based on your project description
	- Smart Search Types: General, academic, news, technical, and specialized searches
	- Quality Scoring: AI evaluates source quality and relevance for ML training
	- Diverse Source Types: Academic papers, news articles, blogs, government sources, and more

	### 🎯 Professional Dataset Creation
	- 6 ML Templates: Sentiment analysis, text classification, NER, Q&A, summarization, translation
	- Advanced AI Processing: BART, RoBERTa, and other state-of-the-art models
	- Quality Filtering: Automatic content validation and cleaning
	- Batch Processing: Handle hundreds of URLs efficiently

	### 📊 Enterprise-Grade Export
	- Multiple Formats: JSON, CSV, HuggingFace Datasets, JSONL
	- Production Ready: Proper data structure for immediate ML use
	- Rich Metadata: Source tracking, confidence scores, processing timestamps

	### 🛡️ Security & Ethics
	- Robots.txt Compliance: Respects website crawling policies
	- Rate Limiting: Responsible scraping practices
	- Content Validation: Safety checks and quality filters
	- Privacy First: No data storage, memory-only processing

	---

	## 🚀 Quick Start

	### 1. Deploy on Hugging Face Spaces

	```bash
	# Create new Space
	# Name: ai-dataset-studio
	# SDK: Gradio
	# Hardware: T4 Small (recommended) or CPU Basic (free)
	```

	### 2. Set Up Perplexity AI (Optional)

	To enable AI-powered source discovery:

	1. Get Perplexity API Key:
	- Visit [Perplexity AI](https://www.perplexity.ai/)
	- Sign up for an account
	- Get your API key from the dashboard

	2. Set Environment Variable:
	- In your Hugging Face Space settings
	- Go to "Repository secrets"
	- Add: `PERPLEXITY_API_KEY` = `your_api_key_here`

	3. Restart Your Space:
	- The AI source discovery will now be available!

	### 3. Upload Files

	Copy these files to your Space:
	- `app.py` (main application)
	- `perplexity_client.py` (AI integration)
	- `requirements.txt` (dependencies)
	- `README.md` (this file)

	---

	## 📖 How to Use

	### Step 1: 📋 Project Setup
	1. Create Project: Give your dataset a name and description
	2. Choose Template: Select ML task type (sentiment analysis, classification, etc.)
	3. Review Configuration: Check fields and example data structure

	### Step 2: 🧠 AI Source Discovery (Recommended)
	1. Describe Your Needs: Tell AI what sources you need
	```
	Example: "I need product reviews from e-commerce sites for sentiment analysis training data"
	```
	2. Configure Search: Choose search type, max sources, include academic/news
	3. Review Results: AI finds and scores relevant sources
	4. Use Sources: One-click to add discovered URLs to scraping list

	### Step 3: 🕷️ Manual URLs (Alternative)
	- Add URLs manually if not using AI discovery
	- One URL per line
	- Supports most public websites

	### Step 4: ⚙️ Data Processing
	1. Scrape Content: Extract text from all URLs
	2. AI Processing: Apply template-specific AI models
	3. Quality Control: Filter and validate results
	4. Preview Data: Review processed examples

	### Step 5: 📦 Export Dataset
	1. Choose Format: JSON, CSV, HuggingFace, or JSONL
	2. Download: Get your ML-ready dataset
	3. Use Immediately: Compatible with popular ML frameworks

	---

	## 🎯 Use Cases

	### 📰 For Journalists
	```
	Project: "News sentiment analysis across political topics"
	AI Discovery: Finds news articles from diverse sources
	Processing: Sentiment analysis with confidence scores
	Export: Clean dataset for editorial sentiment tracking
	```

	### 🏢 For Businesses
	```
	Project: "Customer review classification for product insights"
	AI Discovery: Discovers review sites and forums
	Processing: Multi-class sentiment + topic classification
	Export: Business intelligence dataset
	```

	### 🎓 For Researchers
	```
	Project: "Academic paper summarization dataset"
	AI Discovery: Finds peer-reviewed papers and preprints
	Processing: Abstractive summarization with BART
	Export: Research training dataset
	```

	### 🚀 For Startups
	```
	Project: "Competitor analysis sentiment dataset"
	AI Discovery: Finds discussions about competitor products
	Processing: NER + sentiment analysis
	Export: Market intelligence dataset
	```

	---

	## 🧠 Perplexity AI Integration

	### What It Does
	- Intelligent Search: Understands your project needs and finds relevant sources
	- Quality Assessment: Scores sources based on content quality and ML suitability
	- Diverse Discovery: Finds sources you might not think of manually
	- Time Saving: Reduces dataset creation time by 80%

	### Search Types
	- General: Broad search across all content types
	- Academic: Focus on research papers and scholarly content
	- News: Prioritize journalistic and news sources
	- Technical: Target documentation, tutorials, and technical content

	### Example Queries That Work Well
	```
	✅ "Customer reviews for electronics products sentiment analysis"
	✅ "News articles about climate change for topic classification"
	✅ "Medical research papers for text summarization"
	✅ "Social media posts about brand mentions"
	✅ "FAQ pages for question-answering datasets"
	```

	---

	## 🛠️ Configuration

	### Hardware Requirements

	\| Use Case \| Hardware \| Cost \| Performance \|
	\|----------\|----------\|------\|-------------\|
	\| Development \| CPU Basic \| Free \| 30-60s per article \|
	\| Small Projects \| CPU Upgrade \| $0.03/hr \| 15-30s per article \|
	\| Production \| T4 Small \| $0.60/hr \| 5-15s per article \|
	\| Large Scale \| A10G Small \| $1.05/hr \| 3-8s per article \|

	### Environment Variables

	```bash
	# Required for AI source discovery
	PERPLEXITY_API_KEY=your_perplexity_api_key

	# Optional customization
	MAX_SOURCES_PER_SEARCH=50
	REQUEST_TIMEOUT=30
	ENABLE_GPU_ACCELERATION=true
	```

	### Model Configuration

	The application automatically uses the best available models:

	- Sentiment Analysis: `cardiffnlp/twitter-roberta-base-sentiment-latest`
	- Summarization: `facebook/bart-large-cnn`
	- NER: `dbmdz/bert-large-cased-finetuned-conll03-english`
	- Fallbacks: Keyword-based processing when models unavailable

	---

	## 📊 Dataset Templates

	### 1. 📊 Sentiment Analysis
	```json
	{
	"text": "This product is amazing!",
	"sentiment": "positive",
	"confidence": 0.95,
	"source_url": "https://example.com/review"
	}
	```

	### 2. 📂 Text Classification
	```json
	{
	"text": "Breaking: Stock market reaches new high",
	"category": "finance",
	"source_url": "https://news.example.com"
	}
	```

	### 3. 🏷️ Named Entity Recognition
	```json
	{
	"text": "Apple Inc. was founded by Steve Jobs",
	"entities": [
	{"text": "Apple Inc.", "label": "ORG"},
	{"text": "Steve Jobs", "label": "PERSON"}
	]
	}
	```

	### 4. ❓ Question Answering
	```json
	{
	"context": "The capital of France is Paris",
	"question": "What is the capital of France?",
	"answer": "Paris"
	}
	```

	### 5. 📝 Text Summarization
	```json
	{
	"text": "Long article content...",
	"summary": "Brief summary of key points"
	}
	```

	---

	## 🚨 Troubleshooting

	### Common Issues

	#### ❌ "No Perplexity API key found"
	Solution: Set `PERPLEXITY_API_KEY` in your Space settings under "Repository secrets"

	#### ❌ "No sources found"
	Solutions:
	- Make your project description more specific
	- Try different search types (academic, news, technical)
	- Use manual URL entry as fallback

	#### ❌ "Failed to scrape URL"
	Solutions:
	- Check if URL is publicly accessible
	- Some sites block automated access (respect robots.txt)
	- Use alternative sources discovered by AI

	#### ❌ "Models not loading"
	Solutions:
	- Upgrade to T4 Small for GPU acceleration
	- Wait 2-3 minutes for model downloads
	- Use minimal version for basic functionality

	### Getting Help

	1. Check Space Logs: Look for specific error messages
	2. Try Minimal Version: Use basic functionality first
	3. Contact Support: Include error details and configuration

	---

	## 🎯 Pro Tips

	### Maximize AI Discovery Success
	```
	✅ Be specific: "Product reviews for smartphone sentiment analysis"
	❌ Be vague: "Text data for ML"

	✅ Include context: "News articles about renewable energy for classification"
	❌ Missing context: "Articles for classification"

	✅ Specify domain: "Academic papers on machine learning for summarization"
	❌ Too broad: "Papers for summarization"
	```

	### Quality Dataset Creation
	- Start with AI discovery to find diverse, high-quality sources
	- Use multiple search types for comprehensive coverage
	- Review discovered sources before bulk scraping
	- Filter by quality scores to maintain dataset standards
	- Export early and often to avoid losing work

	### Performance Optimization
	- Use T4 Small for best AI model performance
	- Enable persistent storage for large projects
	- Batch process related URLs together
	- Monitor Space usage to optimize costs

	---

	## 🌟 Advanced Features

	### Batch Source Discovery
	```python
	# The AI can find sources for multiple related projects
	projects = [
	"Product reviews for sentiment analysis",
	"News articles for topic classification",
	"Social media posts for trend analysis"
	]
	# Each gets tailored source recommendations
	```

	### Custom Templates
	- Modify existing templates for specific needs
	- Add custom fields and processing logic
	- Create domain-specific datasets

	### API Integration
	- Export datasets directly to HuggingFace Hub
	- Integrate with existing ML pipelines
	- Automate dataset updates

	---

	## 🎉 Success Stories

	> "Reduced dataset creation time from weeks to hours!" - ML Research Team

	> "AI discovery found sources we never would have thought of manually." - Data Science Startup

	> "Finally, a tool that handles the entire pipeline from idea to dataset." - Independent Researcher

	---

	## 📈 Roadmap

	### Coming Soon
	- 🔄 Auto-refresh: Automatically update datasets with new content
	- 🌍 Multi-language: Support for non-English content
	- 🤖 Custom Models: Use your own fine-tuned models
	- 📊 Analytics Dashboard: Dataset quality metrics and insights

	### Future Integrations
	- 📚 Academic APIs: PubMed, arXiv, Google Scholar
	- 🐦 Social Media: Twitter, Reddit, LinkedIn APIs
	- 💾 Cloud Storage: Direct export to S3, GCS, Azure
	- 🔗 ML Platforms: Native integration with major ML services

	---

	## 🤝 Contributing

	We welcome contributions! Areas where you can help:

	- 🐛 Bug Reports: Test edge cases and report issues
	- 💡 Feature Ideas: Suggest new templates and capabilities
	- 📖 Documentation: Improve guides and examples
	- 🧪 Testing: Try with different domains and use cases

	---

	## 📄 License

	MIT License - Feel free to use, modify, and distribute!

	---

	## 🙏 Acknowledgments

	- Perplexity AI for intelligent source discovery
	- Hugging Face for transformers and hosting platform
	- Gradio for the beautiful interface framework
	- Community for feedback and feature requests

	---

	Ready to create amazing datasets? Deploy your AI Dataset Studio today! 🚀

	Transform your ideas into ML-ready datasets in minutes, not weeks.