title: AI Powered Web Scraper
emoji: πŸƒ
colorFrom: yellow
colorTo: pink
sdk: gradio
sdk_version: 5.35.0
app_file: app.py
pinned: true
license: mit
short_description: 'AI-powered web scraping tool'
thumbnail: >-
  https://cdn-uploads.huggingface.co/production/uploads/6508b189ac5108b93a5f111b/MV3haSrhEtdlc5prx9rVO.png

title: AI-Powered Web Scraper
emoji: 🤖
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 4.44.0
app_file: app.py
pinned: false
license: apache-2.0
python_version: 3.10
suggested_hardware: t4-small
suggested_storage: small
short_description: Professional web content extraction and AI summarization tool for journalists, analysts, and researchers
tags:
  - web-scraping
  - content-extraction
  - ai-summarization
  - journalism
  - research
  - analysis
  - nlp
  - bart
  - content-analysis
models:
  - facebook/bart-large-cnn
  - sshleifer/distilbart-cnn-12-6

🤖 AI-Powered Web Scraper

Professional-grade web content extraction and AI summarization tool designed for journalists, analysts, and researchers.

🚀 Features

🛡️ Security & Compliance

- Built-in URL validation and security checks
- Robots.txt compliance checking
- Protection against internal network access
- Input sanitization and validation
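
As a rough illustration of the robots.txt compliance check, here is a minimal sketch built on Python's standard `urllib.robotparser`. The function names, the `ExampleScraperBot` user agent, and the sample rules are assumptions made for this example, not the app's actual code. The rules are parsed from a string so the snippet runs without network access.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

USER_AGENT = "ExampleScraperBot"  # hypothetical user-agent string


def robots_url_for(url: str) -> str:
    """Build the robots.txt location for the site that hosts `url`."""
    parts = urlparse(url)
    return f"{parts.scheme}://{parts.netloc}/robots.txt"


def is_allowed(url: str, robots_txt: str, user_agent: str = USER_AGENT) -> bool:
    """Return True if the given robots.txt text permits fetching `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules used only for demonstration.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

if __name__ == "__main__":
    print(robots_url_for("https://example.com/articles/1"))
    print(is_allowed("https://example.com/articles/1", SAMPLE_ROBOTS))
    print(is_allowed("https://example.com/private/report", SAMPLE_ROBOTS))
```

In a live check, the text passed to `is_allowed` would come from fetching the address returned by `robots_url_for`.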

🤖 AI-Powered Analysis

- Advanced content summarization using BART models
- Intelligent keyword extraction
- Content quality assessment
- Reading time estimation
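
Reading time estimation is usually a simple function of word count. The sketch below assumes an average reading speed of 200 words per minute. That figure and the function names are assumptions for illustration, since this README does not state how the app computes the value.

```python
import math
import re

WORDS_PER_MINUTE = 200  # assumed average reading speed, not taken from the app


def word_count(text: str) -> int:
    """Count runs of word characters in the text."""
    return len(re.findall(r"\b\w+\b", text))


def reading_time_minutes(text: str, wpm: int = WORDS_PER_MINUTE) -> int:
    """Estimate reading time in whole minutes, rounding up."""
    words = word_count(text)
    if words == 0:
        return 0
    return max(1, math.ceil(words / wpm))


if __name__ == "__main__":
    article = "word " * 450
    print(word_count(article), "words,", reading_time_minutes(article), "min read")
```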

📊 Rich Metadata Extraction

- Article titles and authors
- Publication dates
- Meta descriptions
- Word count and reading metrics
- Social media metadata (Open Graph)
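
To show what this kind of metadata extraction involves, the following sketch reads the `<title>` element and common `<meta>` tags, preferring Open Graph values when present. It uses the standard library's `html.parser` so it runs without extra dependencies, whereas the app itself is described as using BeautifulSoup4. The class name, function name, and output keys are hypothetical.

```python
from html.parser import HTMLParser


class MetadataParser(HTMLParser):
    """Collects <title> text and <meta> name/property values."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self._in_title = False
        self._title_parts = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            key = attributes.get("property") or attributes.get("name")
            content = attributes.get("content")
            if key and content:
                self.meta[key.lower()] = content.strip()

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self._title_parts.append(data)

    @property
    def page_title(self) -> str:
        return "".join(self._title_parts).strip()


def extract_metadata(html: str) -> dict:
    """Return title, author, publication date, and description if present."""
    parser = MetadataParser()
    parser.feed(html)
    parser.close()
    meta = parser.meta
    return {
        "title": meta.get("og:title") or parser.page_title,
        "author": meta.get("author"),
        "published": meta.get("article:published_time"),
        "description": meta.get("og:description") or meta.get("description"),
    }
```

Fields that a page does not declare come back as `None` (or an empty string for the title), so callers can decide how to display missing values.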

💾 Export & Data Management

- CSV and JSON export formats
- Batch processing capabilities
- Session data management
- Professional report generation
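
A CSV and JSON export of session results can be sketched with the standard library alone. The field names in `FIELDS` and the record layout are assumptions for this example. The app's real export columns are not listed in this README.

```python
import csv
import io
import json

# Hypothetical column set; the app's actual export fields may differ.
FIELDS = ["url", "title", "word_count", "summary"]


def to_json(records: list) -> str:
    """Serialize records as pretty-printed JSON, keeping non-ASCII text."""
    return json.dumps(records, indent=2, ensure_ascii=False)


def to_csv(records: list, fields=FIELDS) -> str:
    """Serialize records as CSV; keys outside `fields` are ignored."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


if __name__ == "__main__":
    session = [
        {"url": "https://example.com/a", "title": "First, with comma",
         "word_count": 120, "summary": "Short summary."},
    ]
    print(to_csv(session))
    print(to_json(session))
```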

🔧 Technical Excellence

- Modular, maintainable code architecture
- Comprehensive error handling
- Async processing capabilities
- Fallback mechanisms for reliability

🎯 Target Users

- Journalists: Quick article summarization and fact-checking
- Research Analysts: Content analysis and data extraction
- Academic Researchers: Literature review and content analysis
- Content Strategists: Competitive analysis and trend research

📖 How to Use

  1. Enter URL: Paste the URL of the content you want to analyze
  2. Configure Settings: Adjust summary length and other parameters
  3. Extract & Analyze: Click the extract button to process content
  4. Review Results: Examine the AI summary, metadata, and keywords
  5. Export Data: Save results in your preferred format

βš™οΈ Technical Specifications AI Models

- Primary: Facebook BART-Large-CNN for summarization
- Fallback: DistilBART-CNN for faster processing
- Keyword Extraction: Custom frequency-based algorithm
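
A frequency-based keyword extractor can be as small as the sketch below. It is an assumed approximation of the idea, not the app's custom algorithm: the stopword list, the minimum word length, and the function name are all illustrative choices.

```python
import re
from collections import Counter

# A deliberately small, illustrative stopword list (assumption).
STOPWORDS = frozenset({
    "the", "and", "for", "that", "with", "this", "from", "are", "was",
    "were", "has", "have", "but", "not", "you", "his", "her", "its",
})


def extract_keywords(text: str, top_n: int = 5, min_length: int = 3) -> list:
    """Return the most frequent non-stopword terms, most common first."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(
        word for word in words
        if len(word) >= min_length and word not in STOPWORDS
    )
    return [word for word, _ in counts.most_common(top_n)]


if __name__ == "__main__":
    sample = (
        "Scraping tools extract content. Content extraction helps "
        "journalists; journalists review content."
    )
    print(extract_keywords(sample, top_n=2))
```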

Content Processing

- Parser: BeautifulSoup4 with multiple extraction strategies
- Security: Multi-layer validation and sanitization
- Compliance: Automatic robots.txt checking
- Formats: HTML, XHTML, XML content support

Performance

- Processing Time: ~5-15 seconds per article
- Content Length: Supports articles up to 50,000 words
- Concurrent Requests: Optimized for batch processing
- Memory Usage: Efficient model loading and caching

πŸ› οΈ Development Architecture β”œβ”€β”€ ContentExtractor # Web scraping and content extraction β”œβ”€β”€ AISummarizer # AI-powered summarization β”œβ”€β”€ SecurityValidator # URL and content validation β”œβ”€β”€ RobotsTxtChecker # Compliance verification └── WebScraperApp # Main application orchestrator Security Features

- URL scheme validation (HTTP/HTTPS only)
- Internal network protection
- Robots.txt compliance
- Rate limiting and throttling
- Input sanitization
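
The first two checks in this list, scheme validation and internal network protection, might be sketched as follows. This is an illustrative assumption rather than the `SecurityValidator` implementation: the function name, messages, and blocked-host set are hypothetical. It inspects only literal IP addresses and obvious local hostnames, so a production check would also need to resolve DNS names and test the resolved addresses.

```python
import ipaddress
from urllib.parse import urlparse

ALLOWED_SCHEMES = {"http", "https"}
BLOCKED_HOSTNAMES = {"localhost"}  # illustrative; extend as needed


def validate_url(url: str) -> tuple:
    """Return (is_valid, message) for a user-supplied URL."""
    try:
        parts = urlparse(url.strip())
    except ValueError:
        return False, "URL could not be parsed"

    if parts.scheme not in ALLOWED_SCHEMES:
        return False, "Only HTTP and HTTPS URLs are allowed"

    host = (parts.hostname or "").rstrip(".")
    if not host:
        return False, "URL has no hostname"
    if host in BLOCKED_HOSTNAMES or host.endswith(".localhost"):
        return False, "Local hostnames are not allowed"

    try:
        address = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal: treated as a public DNS name in this sketch.
        return True, "OK"

    if (address.is_private or address.is_loopback or address.is_link_local
            or address.is_reserved or address.is_multicast
            or address.is_unspecified):
        return False, "Internal network addresses are not allowed"
    return True, "OK"


if __name__ == "__main__":
    for candidate in ("https://example.com/article", "http://127.0.0.1/admin"):
        print(candidate, validate_url(candidate))
```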

Error Handling

- Graceful degradation for failed requests
- Fallback summarization methods
- Comprehensive logging
- User-friendly error messages

📋 Supported Content Types

✅ Fully Supported

- News articles and blog posts
- Academic papers and research
- Documentation and tutorials
- Magazine articles and features
- Press releases and announcements

⚠️ Limited Support

- Dynamic JavaScript-heavy sites
- Single-page applications (SPAs)
- Password-protected content
- Sites with aggressive anti-bot measures

❌ Not Supported

- PDF documents (direct upload)
- Video/audio content
- Images and multimedia
- Social media posts (API required)

🔐 Privacy & Ethics

- No Data Storage: Content is processed in memory only
- Respect for robots.txt: Automatic compliance checking
- Rate Limiting: Respectful crawling practices
- User Privacy: No tracking or analytics
- Content Rights: Users responsible for usage rights
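
Respectful crawling usually means spacing out requests to the same host. The sketch below shows one simple way to do that with a per-host minimum interval. The class name, the one-second default, and the injectable `clock` and `sleep` parameters are assumptions for illustration. The README does not describe how the app's rate limiting is implemented.

```python
import time


class HostThrottle:
    """Enforces a minimum delay between requests to the same host."""

    def __init__(self, min_interval: float = 1.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock      # injectable so the logic can be tested
        self._sleep = sleep
        self._last_request = {}  # host -> time of the most recent request

    def wait(self, host: str) -> float:
        """Block until `host` may be contacted; return the seconds waited."""
        now = self._clock()
        last = self._last_request.get(host)
        delay = 0.0
        if last is not None:
            delay = max(0.0, self.min_interval - (now - last))
        if delay > 0:
            self._sleep(delay)
        self._last_request[host] = now + delay
        return delay


if __name__ == "__main__":
    throttle = HostThrottle(min_interval=0.2)
    for _ in range(3):
        waited = throttle.wait("example.com")
        print(f"waited {waited:.2f}s before requesting example.com")
```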

🚨 Troubleshooting

Common Issues & Solutions

Issue: ModuleNotFoundError: No module named 'bs4'

Solution 1: Use minimal requirements

pip install gradio requests beautifulsoup4 pandas

Solution 2: Run the fix script

python quick_fix.py

Solution 3: Manual installation

pip install beautifulsoup4

Issue: AI models not loading

- ✅ App still works: Uses extractive summarization as fallback
- 🔧 To enable AI: Ensure GPU is available or wait for model download
- ⚠️ First run: Models download automatically (2-3 minutes)

Issue: Slow performance

- 💡 Upgrade hardware: Use T4 Small GPU for 5-10x speedup
- 🔧 Optimize settings: Reduce summary length for faster processing
- ⚡ Batch processing: More efficient for multiple URLs

Deployment Troubleshooting

- Check Space logs: Look for specific error messages
- Verify requirements.txt: Ensure all packages are listed
- Hardware requirements: Upgrade if memory issues occur
- Restart Space: Factory reboot clears all caches

Fallback Features

The app includes robust fallback mechanisms:

- No AI models: Uses extractive summarization
- No NLTK: Uses basic text processing
- Network issues: Graceful error handling
- Invalid URLs: Security validation with clear messages
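
The extractive fallback can be pictured as scoring sentences by word frequency and keeping the top few in their original order. The sketch below is one common way to do this, offered as an assumption about the general technique, not as the app's code. The `model` argument is a hypothetical callable standing in for whatever AI summarizer is loaded. If it is missing or raises, the extractive path is used instead.

```python
import re
from collections import Counter


def extractive_summary(text: str, max_sentences: int = 2) -> str:
    """Keep the highest-scoring sentences, preserving their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if len(sentences) <= max_sentences:
        return " ".join(sentences)

    frequencies = Counter(re.findall(r"[a-z]+", text.lower()))

    def score(sentence: str) -> float:
        # Average word frequency, so long sentences are not favored by length.
        words = re.findall(r"[a-z]+", sentence.lower())
        return sum(frequencies[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max_sentences])
    return " ".join(sentences[i] for i in keep)


def summarize(text: str, model=None, max_sentences: int = 2) -> str:
    """Use `model` (a hypothetical callable) if it works, else fall back."""
    if model is not None:
        try:
            return model(text)
        except Exception:
            pass  # fall through to the extractive method
    return extractive_summary(text, max_sentences)


if __name__ == "__main__":
    print(summarize("Cats sleep. Cats eat fish. Dogs bark.", max_sentences=1))
```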

📈 Performance Tips

- Batch Processing: Process multiple URLs for efficiency
- Summary Length: Shorter summaries process faster
- Content Quality: Clean, well-structured content works best
- Network: Stable internet connection recommended

🤝 Contributing

Contributions welcome! Areas for improvement:

- Additional content extractors
- Enhanced keyword algorithms
- Support for more file formats
- Advanced AI models
- Performance optimizations

📄 License

Apache 2.0 License - See LICENSE file for details

⚡ Quick Start Examples

Basic Usage

URL: https://example.com/article
Summary Length: 200 words
→ Extract & Summarize

Batch Analysis

  1. Process first URL
  2. Review and export
  3. Process next URL
  4. Combine results
  5. Final export
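
The batch steps above amount to a loop that processes each URL, keeps going when one fails, and combines the results for a final export. Here is a minimal sketch of that loop. `process_batch` and the `process_url` callable are hypothetical names, and `process_url` stands in for whatever function performs extraction and summarization for a single URL.

```python
def process_batch(urls, process_url):
    """Run `process_url` on each URL; return (results, errors).

    `process_url` is a hypothetical callable that returns a dict of
    fields for one page and raises an exception on failure.
    """
    results, errors = [], []
    for url in urls:
        try:
            record = dict(process_url(url))
            record.setdefault("url", url)  # keep the source URL with each row
            results.append(record)
        except Exception as exc:
            # One failing page should not abort the rest of the batch.
            errors.append({"url": url, "error": str(exc)})
    return results, errors


if __name__ == "__main__":
    def demo_processor(url):
        if "missing" in url:
            raise ValueError("page not found")
        return {"title": f"Title for {url}"}

    combined, failed = process_batch(
        ["https://example.com/a", "https://example.com/missing"],
        demo_processor,
    )
    print(combined)
    print(failed)
```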

Built with ❤️ for the research and journalism community

This tool respects content creators' rights and website policies. Please use responsibly and in accordance with applicable laws and terms of service.