MagicMeWizard committed
Commit ed05d05 · verified · 1 Parent(s): d420c16

Update README.md

Files changed (1):
  1. README.md +394 -241

README.md CHANGED
@@ -1,242 +1,395 @@
  ---
- title: AI Powered Web Scraper
- emoji: 🏃
- colorFrom: yellow
- colorTo: pink
- sdk: gradio
- sdk_version: 5.35.0
- app_file: app.py
- pinned: true
- license: mit
- short_description: 'ai powered web scrapping tool '
- thumbnail: >-
- https://cdn-uploads.huggingface.co/production/uploads/6508b189ac5108b93a5f111b/MV3haSrhEtdlc5prx9rVO.png
- ---
-
- title: AI-Powered Web Scraper
- emoji: 🤖
- colorFrom: blue
- colorTo: purple
- sdk: gradio
- sdk_version: 4.44.0
- app_file: app.py
- pinned: false
- license: apache-2.0
- python_version: 3.10
- suggested_hardware: t4-small
- suggested_storage: small
- short_description: Professional web content extraction and AI summarization tool for journalists, analysts, and researchers
- tags:
-
- web-scraping
- content-extraction
- ai-summarization
- journalism
- research
- analysis
- nlp
- bart
- content-analysis
- models:
- facebook/bart-large-cnn
- sshleifer/distilbart-cnn-12-6
-
-
- 🤖 AI-Powered Web Scraper
- Professional-grade web content extraction and AI summarization tool designed for journalists, analysts, and researchers.
- 🚀 Features
- 🛡️ Security & Compliance
-
- Built-in URL validation and security checks
- Robots.txt compliance checking
- Protection against internal network access
- Input sanitization and validation
-
- 🤖 AI-Powered Analysis
-
- Advanced content summarization using BART models
- Intelligent keyword extraction
- Content quality assessment
- Reading time estimation
-
- 📊 Rich Metadata Extraction
-
- Article titles and authors
- Publication dates
- Meta descriptions
- Word count and reading metrics
- Social media metadata (Open Graph)
-
- 💾 Export & Data Management
-
- CSV and JSON export formats
- Batch processing capabilities
- Session data management
- Professional report generation
-
- 🔧 Technical Excellence
-
- Modular, maintainable code architecture
- Comprehensive error handling
- Async processing capabilities
- Fallback mechanisms for reliability
-
- 🎯 Target Users
-
- Journalists: Quick article summarization and fact-checking
- Research Analysts: Content analysis and data extraction
- Academic Researchers: Literature review and content analysis
- Content Strategists: Competitive analysis and trend research
-
- 📖 How to Use
-
- Enter URL: Paste the URL of the content you want to analyze
- Configure Settings: Adjust summary length and other parameters
- Extract & Analyze: Click the extract button to process content
- Review Results: Examine the AI summary, metadata, and keywords
- Export Data: Save results in your preferred format
-
- ⚙️ Technical Specifications
- AI Models
-
- Primary: Facebook BART-Large-CNN for summarization
- Fallback: DistilBART-CNN for faster processing
- Keyword Extraction: Custom frequency-based algorithm
-
- Content Processing
-
- Parser: BeautifulSoup4 with multiple extraction strategies
- Security: Multi-layer validation and sanitization
- Compliance: Automatic robots.txt checking
- Formats: HTML, XHTML, XML content support
-
- Performance
-
- Processing Time: ~5-15 seconds per article
- Content Length: Supports articles up to 50,000 words
- Concurrent Requests: Optimized for batch processing
- Memory Usage: Efficient model loading and caching
-
- 🛠️ Development
- Architecture
- ├── ContentExtractor # Web scraping and content extraction
- ├── AISummarizer # AI-powered summarization
- ├── SecurityValidator # URL and content validation
- ├── RobotsTxtChecker # Compliance verification
- └── WebScraperApp # Main application orchestrator
- Security Features
-
- URL scheme validation (HTTP/HTTPS only)
- Internal network protection
- Robots.txt compliance
- Rate limiting and throttling
- Input sanitization
-
- Error Handling
-
- Graceful degradation for failed requests
- Fallback summarization methods
- Comprehensive logging
- User-friendly error messages
-
- 📋 Supported Content Types
- ✅ Fully Supported
-
- News articles and blog posts
- Academic papers and research
- Documentation and tutorials
- Magazine articles and features
- Press releases and announcements
-
- ⚠️ Limited Support
-
- Dynamic JavaScript-heavy sites
- Single-page applications (SPAs)
- Password-protected content
- Sites with aggressive anti-bot measures
-
- ❌ Not Supported
-
- PDF documents (direct upload)
- Video/audio content
- Images and multimedia
- Social media posts (API required)
-
- 🔐 Privacy & Ethics
-
- No Data Storage: Content is processed in memory only
- Respect for robots.txt: Automatic compliance checking
- Rate Limiting: Respectful crawling practices
- User Privacy: No tracking or analytics
- Content Rights: Users responsible for usage rights
-
- 🚨 Troubleshooting
- Common Issues & Solutions
- Issue: ModuleNotFoundError: No module named 'bs4'
- bash# Solution 1: Use minimal requirements
- pip install gradio requests beautifulsoup4 pandas
-
- # Solution 2: Run the fix script
- python quick_fix.py
-
- # Solution 3: Manual installation
- pip install beautifulsoup4
- Issue: AI models not loading
-
- ✅ App still works: Uses extractive summarization as fallback
- 🔧 To enable AI: Ensure GPU is available or wait for model download
- ⚠️ First run: Models download automatically (2-3 minutes)
-
- Issue: Slow performance
-
- 💡 Upgrade hardware: Use T4 Small GPU for 5-10x speedup
- 🔧 Optimize settings: Reduce summary length for faster processing
- ⚡ Batch processing: More efficient for multiple URLs
-
- Deployment Troubleshooting
-
- Check Space logs: Look for specific error messages
- Verify requirements.txt: Ensure all packages are listed
- Hardware requirements: Upgrade if memory issues occur
- Restart Space: Factory reboot clears all caches
-
- Fallback Features
- The app includes robust fallback mechanisms:
-
- No AI models: Uses extractive summarization
- No NLTK: Uses basic text processing
- Network issues: Graceful error handling
- Invalid URLs: Security validation with clear messages
-
- 📈 Performance Tips
-
- Batch Processing: Process multiple URLs for efficiency
- Summary Length: Shorter summaries process faster
- Content Quality: Clean, well-structured content works best
- Network: Stable internet connection recommended
-
- 🤝 Contributing
- Contributions welcome! Areas for improvement:
-
- Additional content extractors
- Enhanced keyword algorithms
- Support for more file formats
- Advanced AI models
- Performance optimizations
-
- 📄 License
- Apache 2.0 License - See LICENSE file for details
- ⚡ Quick Start Examples
- Basic Usage
- URL: https://example.com/article
- Summary Length: 200 words
- → Extract & Summarize
- Batch Analysis
- 1. Process first URL
- 2. Review and export
- 3. Process next URL
- 4. Combine results
- 5. Final export
-
- Built with ❤️ for the research and journalism community
- This tool respects content creators' rights and website policies. Please use responsibly and in accordance with applicable laws and terms of service.
 
+ # 🚀 AI Dataset Studio
+
+ **Create high-quality training datasets with AI-powered source discovery**
+
+ A comprehensive platform for building ML datasets that combines web scraping, AI processing, and smart source discovery using Perplexity AI. Perfect for researchers, data scientists, and AI enthusiasts who need quality training data without the complexity.
+
+ ---
+
+ ## ✨ Key Features
+
+ ### 🧠 **AI-Powered Source Discovery**
+ - **Perplexity AI Integration**: Automatically discover relevant sources based on your project description
+ - **Smart Search Types**: General, academic, news, technical, and specialized searches
+ - **Quality Scoring**: AI evaluates source quality and relevance for ML training
+ - **Diverse Source Types**: Academic papers, news articles, blogs, government sources, and more
+
+ ### 🎯 **Professional Dataset Creation**
+ - **6 ML Templates**: Sentiment analysis, text classification, NER, Q&A, summarization, translation
+ - **Advanced AI Processing**: BART, RoBERTa, and other state-of-the-art models
+ - **Quality Filtering**: Automatic content validation and cleaning
+ - **Batch Processing**: Handle hundreds of URLs efficiently
+
+ ### 📊 **Enterprise-Grade Export**
+ - **Multiple Formats**: JSON, CSV, HuggingFace Datasets, JSONL
+ - **Production Ready**: Proper data structure for immediate ML use
+ - **Rich Metadata**: Source tracking, confidence scores, processing timestamps
+
+ ### 🛡️ **Security & Ethics**
+ - **Robots.txt Compliance**: Respects website crawling policies
+ - **Rate Limiting**: Responsible scraping practices
+ - **Content Validation**: Safety checks and quality filters
+ - **Privacy First**: No data storage, memory-only processing
+
+ ---
+
+ ## 🚀 Quick Start
+
+ ### 1. **Deploy on Hugging Face Spaces**
+
+ ```bash
+ # Create new Space
+ # Name: ai-dataset-studio
+ # SDK: Gradio
+ # Hardware: T4 Small (recommended) or CPU Basic (free)
+ ```
+
+ ### 2. **Set Up Perplexity AI (Optional)**
+
+ To enable AI-powered source discovery:
+
+ 1. **Get Perplexity API Key**:
+ - Visit [Perplexity AI](https://www.perplexity.ai/)
+ - Sign up for an account
+ - Get your API key from the dashboard
+
+ 2. **Set Environment Variable**:
+ - In your Hugging Face Space settings
+ - Go to "Repository secrets"
+ - Add: `PERPLEXITY_API_KEY` = `your_api_key_here`
+
+ 3. **Restart Your Space**:
+ - The AI source discovery will now be available!
+
+ ### 3. **Upload Files**
+
+ Copy these files to your Space:
+ - `app.py` (main application)
+ - `perplexity_client.py` (AI integration)
+ - `requirements.txt` (dependencies)
+ - `README.md` (this file)
+
+ ---
+
+ ## 📖 How to Use
+
+ ### Step 1: 📋 **Project Setup**
+ 1. **Create Project**: Give your dataset a name and description
+ 2. **Choose Template**: Select ML task type (sentiment analysis, classification, etc.)
+ 3. **Review Configuration**: Check fields and example data structure
+
+ ### Step 2: 🧠 **AI Source Discovery** (Recommended)
+ 1. **Describe Your Needs**: Tell AI what sources you need
+ ```
+ Example: "I need product reviews from e-commerce sites for sentiment analysis training data"
+ ```
+ 2. **Configure Search**: Choose search type, max sources, include academic/news
+ 3. **Review Results**: AI finds and scores relevant sources
+ 4. **Use Sources**: One-click to add discovered URLs to scraping list
+
+ ### Step 3: 🕷️ **Manual URLs** (Alternative)
+ - Add URLs manually if not using AI discovery
+ - One URL per line
+ - Supports most public websites
+
+ ### Step 4: ⚙️ **Data Processing**
+ 1. **Scrape Content**: Extract text from all URLs
+ 2. **AI Processing**: Apply template-specific AI models
+ 3. **Quality Control**: Filter and validate results
+ 4. **Preview Data**: Review processed examples
+
+ ### Step 5: 📦 **Export Dataset**
+ 1. **Choose Format**: JSON, CSV, HuggingFace, or JSONL
+ 2. **Download**: Get your ML-ready dataset
+ 3. **Use Immediately**: Compatible with popular ML frameworks
+
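Editor's note: once a dataset is exported, the JSONL and CSV files load directly into standard tooling. A minimal sketch, assuming a JSONL export saved as `sentiment_dataset.jsonl` (the filename is hypothetical; use whatever name you chose at export time):

```python
# Minimal sketch: load an exported JSONL file into common ML tooling.
# The filename "sentiment_dataset.jsonl" is hypothetical.
import pandas as pd
from datasets import load_dataset

# Hugging Face `datasets` reads JSON Lines directly.
ds = load_dataset("json", data_files="sentiment_dataset.jsonl", split="train")
print(ds)        # features inferred from the exported fields
print(ds[0])     # e.g. {"text": ..., "sentiment": ..., "confidence": ..., "source_url": ...}

# The same file also loads into pandas for quick inspection.
df = pd.read_json("sentiment_dataset.jsonl", lines=True)
print(df.head())
```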
+ ---
+
+ ## 🎯 Use Cases
+
+ ### 📰 **For Journalists**
+ ```
+ Project: "News sentiment analysis across political topics"
+ AI Discovery: Finds news articles from diverse sources
+ Processing: Sentiment analysis with confidence scores
+ Export: Clean dataset for editorial sentiment tracking
+ ```
+
+ ### 🏢 **For Businesses**
+ ```
+ Project: "Customer review classification for product insights"
+ AI Discovery: Discovers review sites and forums
+ Processing: Multi-class sentiment + topic classification
+ Export: Business intelligence dataset
+ ```
+
+ ### 🎓 **For Researchers**
+ ```
+ Project: "Academic paper summarization dataset"
+ AI Discovery: Finds peer-reviewed papers and preprints
+ Processing: Abstractive summarization with BART
+ Export: Research training dataset
+ ```
+
+ ### 🚀 **For Startups**
+ ```
+ Project: "Competitor analysis sentiment dataset"
+ AI Discovery: Finds discussions about competitor products
+ Processing: NER + sentiment analysis
+ Export: Market intelligence dataset
+ ```
+
+ ---
+
+ ## 🧠 Perplexity AI Integration
+
+ ### **What It Does**
+ - **Intelligent Search**: Understands your project needs and finds relevant sources
+ - **Quality Assessment**: Scores sources based on content quality and ML suitability
+ - **Diverse Discovery**: Finds sources you might not think of manually
+ - **Time Saving**: Reduces dataset creation time by 80%
+
+ ### **Search Types**
+ - **General**: Broad search across all content types
+ - **Academic**: Focus on research papers and scholarly content
+ - **News**: Prioritize journalistic and news sources
+ - **Technical**: Target documentation, tutorials, and technical content
+
+ ### **Example Queries That Work Well**
+ ```
+ ✅ "Customer reviews for electronics products sentiment analysis"
+ ✅ "News articles about climate change for topic classification"
+ ✅ "Medical research papers for text summarization"
+ ✅ "Social media posts about brand mentions"
+ ✅ "FAQ pages for question-answering datasets"
+ ```
+
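Editor's note: the README does not show how `perplexity_client.py` issues these queries. The sketch below is one plausible shape for such a call, assuming Perplexity's OpenAI-style chat-completions endpoint; the endpoint URL, model name, and response parsing are assumptions and may differ from the bundled client.

```python
# Hedged sketch of a Perplexity query for source discovery.
# Assumptions: OpenAI-style chat-completions endpoint and a placeholder
# model name; the real perplexity_client.py may look different.
import os
import requests

API_URL = "https://api.perplexity.ai/chat/completions"  # assumed endpoint
API_KEY = os.environ.get("PERPLEXITY_API_KEY", "")

def discover_sources(project_description: str) -> str:
    """Ask Perplexity for candidate source URLs for a dataset project."""
    payload = {
        "model": "sonar",  # placeholder; check Perplexity's current model list
        "messages": [
            {"role": "system", "content": "You find high-quality public sources for ML training data."},
            {"role": "user", "content": f"List relevant source URLs for: {project_description}"},
        ],
    }
    headers = {"Authorization": f"Bearer {API_KEY}"}
    response = requests.post(API_URL, json=payload, headers=headers, timeout=30)
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(discover_sources("Customer reviews for electronics products sentiment analysis"))
```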
+ ---
+
+ ## 🛠️ Configuration
+
+ ### **Hardware Requirements**
+
+ | Use Case | Hardware | Cost | Performance |
+ |----------|----------|------|-------------|
+ | **Development** | CPU Basic | Free | 30-60s per article |
+ | **Small Projects** | CPU Upgrade | $0.03/hr | 15-30s per article |
+ | **Production** | T4 Small | $0.60/hr | 5-15s per article |
+ | **Large Scale** | A10G Small | $1.05/hr | 3-8s per article |
+
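Editor's note: if you are unsure whether the hardware tier you picked actually exposes a GPU to the app, a quick check from the Space (assumes PyTorch is installed alongside the transformer models):

```python
# Quick check that the selected hardware exposes a GPU to the app.
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
```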
+ ### **Environment Variables**
+
+ ```bash
+ # Required for AI source discovery
+ PERPLEXITY_API_KEY=your_perplexity_api_key
+
+ # Optional customization
+ MAX_SOURCES_PER_SEARCH=50
+ REQUEST_TIMEOUT=30
+ ENABLE_GPU_ACCELERATION=true
+ ```
+
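Editor's note: how the application consumes these variables is not spelled out here. A minimal sketch of reading them with fallback defaults (the names match the block above; the defaults and the disable-discovery behavior are assumptions):

```python
# Minimal sketch: read the configuration variables listed above with defaults.
# Variable names come from the block above; default values are assumptions.
import os

PERPLEXITY_API_KEY = os.environ.get("PERPLEXITY_API_KEY")  # None disables AI discovery
MAX_SOURCES_PER_SEARCH = int(os.environ.get("MAX_SOURCES_PER_SEARCH", "50"))
REQUEST_TIMEOUT = int(os.environ.get("REQUEST_TIMEOUT", "30"))
ENABLE_GPU_ACCELERATION = os.environ.get("ENABLE_GPU_ACCELERATION", "true").lower() == "true"

if PERPLEXITY_API_KEY is None:
    print("PERPLEXITY_API_KEY not set - AI source discovery will be unavailable.")
```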
+ ### **Model Configuration**
+
+ The application automatically uses the best available models:
+
+ - **Sentiment Analysis**: `cardiffnlp/twitter-roberta-base-sentiment-latest`
+ - **Summarization**: `facebook/bart-large-cnn`
+ - **NER**: `dbmdz/bert-large-cased-finetuned-conll03-english`
+ - **Fallbacks**: Keyword-based processing when models unavailable
+
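Editor's note: a minimal sketch of the "use the named model if it loads, otherwise fall back" pattern, using the sentiment model listed above; the actual loading logic in `app.py` may differ, and the keyword fallback here is only an illustrative stand-in.

```python
# Sketch of loading a named model with a graceful fallback.
# The keyword-based fallback below is a stand-in, not the app's actual logic.
from transformers import pipeline

def load_sentiment_model():
    try:
        return pipeline(
            "sentiment-analysis",
            model="cardiffnlp/twitter-roberta-base-sentiment-latest",
        )
    except Exception as err:  # e.g. download failure or missing weights
        print(f"Falling back to keyword heuristics: {err}")
        return None

def analyze_sentiment(text: str, model=None) -> str:
    if model is not None:
        return model(text[:512])[0]["label"]
    # Crude keyword fallback (illustrative only).
    positives = ("great", "amazing", "excellent", "love")
    return "positive" if any(w in text.lower() for w in positives) else "neutral"

model = load_sentiment_model()
print(analyze_sentiment("This product is amazing!", model))
```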
+ ---
+
+ ## 📊 Dataset Templates
+
+ ### 1. **📊 Sentiment Analysis**
+ ```json
+ {
+ "text": "This product is amazing!",
+ "sentiment": "positive",
+ "confidence": 0.95,
+ "source_url": "https://example.com/review"
+ }
+ ```
+
+ ### 2. **📂 Text Classification**
+ ```json
+ {
+ "text": "Breaking: Stock market reaches new high",
+ "category": "finance",
+ "source_url": "https://news.example.com"
+ }
+ ```
+
+ ### 3. **🏷️ Named Entity Recognition**
+ ```json
+ {
+ "text": "Apple Inc. was founded by Steve Jobs",
+ "entities": [
+ {"text": "Apple Inc.", "label": "ORG"},
+ {"text": "Steve Jobs", "label": "PERSON"}
+ ]
+ }
+ ```
+
+ ### 4. **❓ Question Answering**
+ ```json
+ {
+ "context": "The capital of France is Paris",
+ "question": "What is the capital of France?",
+ "answer": "Paris"
+ }
+ ```
+
+ ### 5. **📝 Text Summarization**
+ ```json
+ {
+ "text": "Long article content...",
+ "summary": "Brief summary of key points"
+ }
+ ```
+
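Editor's note: to make the record shapes concrete, the sketch below writes a couple of sentiment-template records to JSON Lines and loads them back with the `datasets` library; only the field names come from the template above, the values and filename are illustrative.

```python
# Sketch: serialize records shaped like the sentiment template to JSONL
# and load them back as a Hugging Face Dataset. Values are illustrative.
import json
from datasets import Dataset

records = [
    {"text": "This product is amazing!", "sentiment": "positive",
     "confidence": 0.95, "source_url": "https://example.com/review"},
    {"text": "Shipping took far too long.", "sentiment": "negative",
     "confidence": 0.88, "source_url": "https://example.com/review-2"},
]

with open("sentiment_sample.jsonl", "w", encoding="utf-8") as fh:
    for record in records:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")

ds = Dataset.from_list(records)  # the same records, directly as a Dataset
print(ds.features)
```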
+ ---
+
+ ## 🚨 Troubleshooting
+
+ ### **Common Issues**
+
+ #### ❌ **"No Perplexity API key found"**
+ **Solution**: Set `PERPLEXITY_API_KEY` in your Space settings under "Repository secrets"
+
+ #### ❌ **"No sources found"**
+ **Solutions**:
+ - Make your project description more specific
+ - Try different search types (academic, news, technical)
+ - Use manual URL entry as fallback
+
+ #### ❌ **"Failed to scrape URL"**
+ **Solutions**:
+ - Check if URL is publicly accessible
+ - Some sites block automated access (respect robots.txt)
+ - Use alternative sources discovered by AI
+
+ #### ❌ **"Models not loading"**
+ **Solutions**:
+ - Upgrade to T4 Small for GPU acceleration
+ - Wait 2-3 minutes for model downloads
+ - Use minimal version for basic functionality
+
+ ### **Getting Help**
+
+ 1. **Check Space Logs**: Look for specific error messages
+ 2. **Try Minimal Version**: Use basic functionality first
+ 3. **Contact Support**: Include error details and configuration
+
+ ---
+
+ ## 🎯 Pro Tips
+
+ ### **Maximize AI Discovery Success**
+ ```
+ ✅ Be specific: "Product reviews for smartphone sentiment analysis"
+ ❌ Be vague: "Text data for ML"
+
+ ✅ Include context: "News articles about renewable energy for classification"
+ ❌ Missing context: "Articles for classification"
+
+ ✅ Specify domain: "Academic papers on machine learning for summarization"
+ ❌ Too broad: "Papers for summarization"
+ ```
+
+ ### **Quality Dataset Creation**
+ - **Start with AI discovery** to find diverse, high-quality sources
+ - **Use multiple search types** for comprehensive coverage
+ - **Review discovered sources** before bulk scraping
+ - **Filter by quality scores** to maintain dataset standards
+ - **Export early and often** to avoid losing work
+
+ ### **Performance Optimization**
+ - **Use T4 Small** for best AI model performance
+ - **Enable persistent storage** for large projects
+ - **Batch process** related URLs together
+ - **Monitor Space usage** to optimize costs
+
+ ---
+
+ ## 🌟 Advanced Features
+
+ ### **Batch Source Discovery**
+ ```python
+ # The AI can find sources for multiple related projects
+ projects = [
+ "Product reviews for sentiment analysis",
+ "News articles for topic classification",
+ "Social media posts for trend analysis"
+ ]
+ # Each gets tailored source recommendations
+ ```
+
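Editor's note: extending the snippet above into something runnable, a hedged sketch that loops over the project descriptions; `discover_sources` is the hypothetical helper sketched in the Perplexity AI Integration section, stubbed out here so the example runs on its own, and is not the bundled client's actual interface.

```python
# Sketch: run discovery for several related projects in one pass.
# `discover_sources` is a stub standing in for the hypothetical helper
# sketched earlier; the real perplexity_client.py may differ.
def discover_sources(project_description: str) -> str:
    return f"(candidate sources for: {project_description})"  # stub output

projects = [
    "Product reviews for sentiment analysis",
    "News articles for topic classification",
    "Social media posts for trend analysis",
]

for description in projects:
    print(f"### {description}")
    print(discover_sources(description))
```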
+ ### **Custom Templates**
+ - Modify existing templates for specific needs
+ - Add custom fields and processing logic
+ - Create domain-specific datasets
+
+ ### **API Integration**
+ - Export datasets directly to HuggingFace Hub
+ - Integrate with existing ML pipelines
+ - Automate dataset updates
+
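Editor's note: the Hub export path is not documented here; one way it could be done with the `datasets` library is sketched below. The repository id and filename are placeholders, and a write token must already be configured (via `huggingface-cli login` or the `HF_TOKEN` secret).

```python
# Hedged sketch: push an exported dataset to the Hugging Face Hub.
# Repository id and filename are placeholders; write access must be set up.
from datasets import load_dataset

ds = load_dataset("json", data_files="sentiment_dataset.jsonl", split="train")
ds.push_to_hub("your-username/ai-dataset-studio-sentiment")  # placeholder repo id
```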
+ ---
+
+ ## 🎉 Success Stories
+
+ > **"Reduced dataset creation time from weeks to hours!"** - ML Research Team
+
+ > **"AI discovery found sources we never would have thought of manually."** - Data Science Startup
+
+ > **"Finally, a tool that handles the entire pipeline from idea to dataset."** - Independent Researcher
+
+ ---
+
+ ## 📈 Roadmap
+
+ ### **Coming Soon**
+ - 🔄 **Auto-refresh**: Automatically update datasets with new content
+ - 🌍 **Multi-language**: Support for non-English content
+ - 🤖 **Custom Models**: Use your own fine-tuned models
+ - 📊 **Analytics Dashboard**: Dataset quality metrics and insights
+
+ ### **Future Integrations**
+ - 📚 **Academic APIs**: PubMed, arXiv, Google Scholar
+ - 🐦 **Social Media**: Twitter, Reddit, LinkedIn APIs
+ - 💾 **Cloud Storage**: Direct export to S3, GCS, Azure
+ - 🔗 **ML Platforms**: Native integration with major ML services
+
  ---
+
+ ## 🤝 Contributing
+
+ We welcome contributions! Areas where you can help:
+
+ - 🐛 **Bug Reports**: Test edge cases and report issues
+ - 💡 **Feature Ideas**: Suggest new templates and capabilities
+ - 📖 **Documentation**: Improve guides and examples
+ - 🧪 **Testing**: Try with different domains and use cases
+
+ ---
+
+ ## 📄 License
+
+ MIT License - Feel free to use, modify, and distribute!
+
+ ---
+
+ ## 🙏 Acknowledgments
+
+ - **Perplexity AI** for intelligent source discovery
+ - **Hugging Face** for transformers and hosting platform
+ - **Gradio** for the beautiful interface framework
+ - **Community** for feedback and feature requests
+
+ ---
+
+ **Ready to create amazing datasets? Deploy your AI Dataset Studio today!** 🚀
+
+ *Transform your ideas into ML-ready datasets in minutes, not weeks.*