---
title: AI Powered Web Scraper
emoji: 🏃
colorFrom: yellow
colorTo: pink
sdk: gradio
sdk_version: 5.35.0
app_file: app.py
pinned: true
license: mit
short_description: 'AI-powered web scraping tool'
thumbnail: >-
https://cdn-uploads.huggingface.co/production/uploads/6508b189ac5108b93a5f111b/MV3haSrhEtdlc5prx9rVO.png
---
title: AI-Powered Web Scraper
emoji: 🤖
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 4.44.0
app_file: app.py
pinned: false
license: apache-2.0
python_version: "3.10"
suggested_hardware: t4-small
suggested_storage: small
short_description: Professional web content extraction and AI summarization tool for journalists, analysts, and researchers
tags:
  - web-scraping
  - content-extraction
  - ai-summarization
  - journalism
  - research
  - analysis
  - nlp
  - bart
  - content-analysis
models:
  - facebook/bart-large-cnn
  - sshleifer/distilbart-cnn-12-6
🤖 AI-Powered Web Scraper
Professional-grade web content extraction and AI summarization tool designed for journalists, analysts, and researchers.
🚀 Features
🛡️ Security & Compliance
Built-in URL validation and security checks
Robots.txt compliance checking
Protection against internal network access
Input sanitization and validation
🤖 AI-Powered Analysis
Advanced content summarization using BART models
Intelligent keyword extraction
Content quality assessment
Reading time estimation
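Reading time estimation is usually a simple word-count calculation. The sketch below assumes an average reading speed of 200 words per minute; that figure and the function name `estimate_reading_time` are illustrative assumptions, not values taken from the app.

```python
import math

# Assumed average adult reading speed; the app's actual value is not documented here.
WORDS_PER_MINUTE = 200


def estimate_reading_time(text: str, wpm: int = WORDS_PER_MINUTE) -> int:
    """Return the estimated reading time in whole minutes.

    Empty text yields 0; any non-empty text yields at least 1 minute.
    """
    word_count = len(text.split())
    if word_count == 0:
        return 0
    return max(1, math.ceil(word_count / wpm))
```

Rounding up keeps short articles from showing as "0 minutes".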
📊 Rich Metadata Extraction
Article titles and authors
Publication dates
Meta descriptions
Word count and reading metrics
Social media metadata (Open Graph)
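As a rough illustration of how these fields can be read from a page, the sketch below collects `<title>` and `<meta>` tags and prefers Open Graph values when present. It uses the standard library's `html.parser` as a stand-in so the example has no dependencies; the app itself is described as using BeautifulSoup4, and the names `MetaTagParser` and `extract_metadata` are hypothetical.

```python
from html.parser import HTMLParser


class MetaTagParser(HTMLParser):
    """Collects <title> text and name/property -> content pairs from <meta> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.meta = {}
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attr = dict(attrs)
            # Open Graph uses "property"; classic meta tags use "name".
            key = attr.get("property") or attr.get("name")
            content = attr.get("content")
            if key and content:
                self.meta[key.lower()] = content.strip()

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_metadata(html: str) -> dict:
    """Return title, description, author, and publication date if available."""
    parser = MetaTagParser()
    parser.feed(html)
    meta = parser.meta
    return {
        "title": meta.get("og:title") or parser.title.strip(),
        "description": meta.get("og:description") or meta.get("description", ""),
        "author": meta.get("author", ""),
        "published": meta.get("article:published_time", ""),
    }
```

Missing fields come back as empty strings rather than raising, which matches the graceful-degradation approach described elsewhere in this README.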
💾 Export & Data Management
CSV and JSON export formats
Batch processing capabilities
Session data management
Professional report generation
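A minimal sketch of the two export formats, assuming each extraction result is held as a plain dictionary. The field names used in the usage note (`url`, `title`, `word_count`) and the helper names `to_json` and `to_csv` are assumptions for illustration only.

```python
import csv
import io
import json


def to_json(records: list) -> str:
    """Serialize result dictionaries to pretty-printed JSON."""
    return json.dumps(records, indent=2, ensure_ascii=False)


def to_csv(records: list) -> str:
    """Serialize result dictionaries to CSV text.

    Columns are the union of all keys in first-seen order; missing values
    are written as empty cells.
    """
    if not records:
        return ""
    fieldnames = list(dict.fromkeys(key for record in records for key in record))
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()
```

For example, `to_csv([{"url": "https://example.com/a", "word_count": 120}, {"url": "https://example.com/b", "title": "B"}])` produces a three-column CSV in which each row leaves the field it lacks empty.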
🔧 Technical Excellence
Modular, maintainable code architecture
Comprehensive error handling
Async processing capabilities
Fallback mechanisms for reliability
🎯 Target Users
Journalists: Quick article summarization and fact-checking
Research Analysts: Content analysis and data extraction
Academic Researchers: Literature review and content analysis
Content Strategists: Competitive analysis and trend research
📖 How to Use
Enter URL: Paste the URL of the content you want to analyze
Configure Settings: Adjust summary length and other parameters
Extract & Analyze: Click the extract button to process content
Review Results: Examine the AI summary, metadata, and keywords
Export Data: Save results in your preferred format
⚙️ Technical Specifications
AI Models
Primary: Facebook BART-Large-CNN for summarization
Fallback: DistilBART-CNN for faster processing
Keyword Extraction: Custom frequency-based algorithm
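The README does not spell out the keyword algorithm, so the following is only one plausible frequency-based approach: lowercase the text, drop short words and a small stop-word list, then rank by count. The stop-word list, the minimum word length, and the name `extract_keywords` are all assumptions.

```python
import re
from collections import Counter

# Deliberately tiny stop-word list for illustration; a real list would be longer.
STOP_WORDS = {
    "that", "this", "with", "from", "have", "were", "been", "they",
    "their", "which", "about", "would", "there", "will", "your", "more",
}


def extract_keywords(text: str, top_n: int = 10, min_length: int = 4) -> list:
    """Return the top_n most frequent words, ignoring short words and stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(
        word for word in words
        if len(word) >= min_length and word not in STOP_WORDS
    )
    return [word for word, _ in counts.most_common(top_n)]
```

Words with equal counts keep the order in which they first appear in the text.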
Content Processing
Parser: BeautifulSoup4 with multiple extraction strategies
Security: Multi-layer validation and sanitization
Compliance: Automatic robots.txt checking
Formats: HTML, XHTML, XML content support
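Robots.txt checking can be done with Python's standard `urllib.robotparser`. This sketch takes the robots.txt body as a string so it can be shown without network access; the user agent name `ExampleScraperBot` and the helper `is_allowed` are hypothetical and are not taken from the app's code.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical user agent string; the app's real value may differ.
USER_AGENT = "ExampleScraperBot"


def is_allowed(robots_txt: str, url: str, user_agent: str = USER_AGENT) -> bool:
    """Return True if the given robots.txt content permits fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

Against a live site, `RobotFileParser.set_url()` followed by `read()` fetches the file directly instead of passing its text in.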
Performance
Processing Time: ~5-15 seconds per article
Content Length: Supports articles up to 50,000 words
Concurrent Requests: Optimized for batch processing
Memory Usage: Efficient model loading and caching
🛠️ Development
Architecture
├── ContentExtractor # Web scraping and content extraction
├── AISummarizer # AI-powered summarization
├── SecurityValidator # URL and content validation
├── RobotsTxtChecker # Compliance verification
└── WebScraperApp # Main application orchestrator
Security Features
URL scheme validation (HTTP/HTTPS only)
Internal network protection
Robots.txt compliance
Rate limiting and throttling
Input sanitization
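The scheme check and internal network protection listed above could look roughly like the sketch below. It is an assumption about how such a validator might work, not the actual `SecurityValidator` code; the function names and error messages are made up for the example.

```python
import ipaddress
import socket
from urllib.parse import urlparse

ALLOWED_SCHEMES = {"http", "https"}


def is_public_ip(address: str) -> bool:
    """Return True only for addresses outside private and special-purpose ranges."""
    ip = ipaddress.ip_address(address)
    return not (
        ip.is_private or ip.is_loopback or ip.is_link_local
        or ip.is_reserved or ip.is_multicast or ip.is_unspecified
    )


def validate_url(url: str, resolve: bool = True) -> tuple:
    """Return (ok, message). Set resolve=False to skip the DNS lookup."""
    parsed = urlparse(url.strip())
    if parsed.scheme not in ALLOWED_SCHEMES:
        return False, "Only http and https URLs are allowed"
    host = parsed.hostname
    if not host:
        return False, "URL has no hostname"
    if host == "localhost" or host.endswith(".localhost"):
        return False, "Localhost is not allowed"
    try:
        literal_ok = is_public_ip(host)
    except ValueError:
        literal_ok = None  # not an IP literal, so treat it as a DNS name
    if literal_ok is not None:
        if literal_ok:
            return True, "OK"
        return False, "Internal addresses are not allowed"
    if resolve:
        try:
            infos = socket.getaddrinfo(host, None)
        except socket.gaierror:
            return False, "Hostname could not be resolved"
        for info in infos:
            address = info[4][0].split("%")[0]  # drop any IPv6 scope id
            if not is_public_ip(address):
                return False, "Hostname resolves to an internal address"
    return True, "OK"
```

Resolving the hostname before fetching catches public names that point at private addresses, though a check like this cannot fully rule out DNS changes between validation and the actual request.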
Error Handling
Graceful degradation for failed requests
Fallback summarization methods
Comprehensive logging
User-friendly error messages
📋 Supported Content Types
✅ Fully Supported
News articles and blog posts
Academic papers and research
Documentation and tutorials
Magazine articles and features
Press releases and announcements
⚠️ Limited Support
Dynamic JavaScript-heavy sites
Single-page applications (SPAs)
Password-protected content
Sites with aggressive anti-bot measures
❌ Not Supported
PDF documents (direct upload)
Video/audio content
Images and multimedia
Social media posts (API required)
🔐 Privacy & Ethics
No Data Storage: Content is processed in memory only
Respect for robots.txt: Automatic compliance checking
Rate Limiting: Respectful crawling practices
User Privacy: No tracking or analytics
Content Rights: Users responsible for usage rights
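One common way to keep crawling respectful is to enforce a minimum delay between requests to the same host. The class below is a sketch of that idea under assumed defaults (a one-second interval); `HostThrottle` is a hypothetical name and the app's real throttling logic may differ.

```python
import time
from urllib.parse import urlparse


class HostThrottle:
    """Enforces a minimum interval between requests to the same host."""

    def __init__(self, min_interval: float = 1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock  # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last_request = {}

    def wait(self, url: str) -> float:
        """Sleep if needed before requesting url; return the delay applied in seconds."""
        host = urlparse(url).netloc.lower()
        now = self._clock()
        last = self._last_request.get(host)
        delay = 0.0
        if last is not None:
            delay = max(0.0, self.min_interval - (now - last))
        if delay > 0:
            self._sleep(delay)
        # Record when this request is actually allowed to go out.
        self._last_request[host] = now + delay
        return delay
```

Call `throttle.wait(url)` immediately before each request; requests to different hosts do not delay one another.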
🚨 Troubleshooting
Common Issues & Solutions
Issue: ModuleNotFoundError: No module named 'bs4'
# Solution 1: Use minimal requirements
pip install gradio requests beautifulsoup4 pandas
# Solution 2: Run the fix script
python quick_fix.py
# Solution 3: Manual installation
pip install beautifulsoup4
Issue: AI models not loading
✅ App still works: Uses extractive summarization as fallback
🔧 To enable AI: Ensure GPU is available or wait for model download
⚠️ First run: Models download automatically (2-3 minutes)
Issue: Slow performance
💡 Upgrade hardware: Use T4 Small GPU for 5-10x speedup
🔧 Optimize settings: Reduce summary length for faster processing
⚡ Batch processing: More efficient for multiple URLs
Deployment Troubleshooting
Check Space logs: Look for specific error messages
Verify requirements.txt: Ensure all packages are listed
Hardware requirements: Upgrade if memory issues occur
Restart Space: Factory reboot clears all caches
Fallback Features
The app includes robust fallback mechanisms:
No AI models: Uses extractive summarization
No NLTK: Uses basic text processing
Network issues: Graceful error handling
Invalid URLs: Security validation with clear messages
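To give a sense of what an extractive fallback can look like when no AI model is available, here is a minimal sketch that scores sentences by average word frequency and keeps the highest-scoring ones in their original order. The scoring rule and the name `extractive_summary` are assumptions; the app's actual fallback may work differently.

```python
import re
from collections import Counter


def extractive_summary(text: str, max_sentences: int = 3) -> str:
    """Pick the max_sentences highest-scoring sentences, kept in document order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if len(sentences) <= max_sentences:
        return " ".join(sentences)

    # Frequency of words with four or more letters across the whole text.
    word_freq = Counter(re.findall(r"[a-z]{4,}", text.lower()))

    def score(sentence: str) -> float:
        words = re.findall(r"[a-z]{4,}", sentence.lower())
        if not words:
            return 0.0
        # Average frequency, so long sentences are not favored automatically.
        return sum(word_freq[w] for w in words) / len(words)

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:max_sentences])
    return " ".join(sentences[i] for i in chosen)
```

Because it needs only the standard library, a function like this keeps working when neither the BART models nor NLTK can be loaded.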
📈 Performance Tips
Batch Processing: Process multiple URLs for efficiency
Summary Length: Shorter summaries process faster
Content Quality: Clean, well-structured content works best
Network: Stable internet connection recommended
🤝 Contributing
Contributions welcome! Areas for improvement:
Additional content extractors
Enhanced keyword algorithms
Support for more file formats
Advanced AI models
Performance optimizations
📄 License
Apache 2.0 License - See LICENSE file for details
⚡ Quick Start Examples
Basic Usage
URL: https://example.com/article
Summary Length: 200 words
→ Extract & Summarize
Batch Analysis
1. Process first URL
2. Review and export
3. Process next URL
4. Combine results
5. Final export
Built with ❤️ for the research and journalism community
This tool respects content creators' rights and website policies. Please use responsibly and in accordance with applicable laws and terms of service.