Update README.md
README.md
CHANGED
@@ -6,9 +6,237 @@ colorTo: pink
|
|
6 |
sdk: gradio
|
7 |
sdk_version: 5.35.0
|
8 |
app_file: app.py
|
9 |
-
pinned:
|
10 |
license: mit
|
11 |
short_description: 'AI-powered web scraping tool'
|
12 |
---
|
13 |
|
14 |
-
6 |
sdk: gradio
|
7 |
sdk_version: 5.35.0
|
8 |
app_file: app.py
|
9 |
+
pinned: true
|
10 |
license: mit
|
11 |
short_description: 'AI-powered web scraping tool'
|
12 |
+
thumbnail: >-
|
13 |
+
https://cdn-uploads.huggingface.co/production/uploads/6508b189ac5108b93a5f111b/MV3haSrhEtdlc5prx9rVO.png
|
14 |
---
|
15 |
|
16 |
+
title: AI-Powered Web Scraper
|
17 |
+
emoji: 🤖
|
18 |
+
colorFrom: blue
|
19 |
+
colorTo: purple
|
20 |
+
sdk: gradio
|
21 |
+
sdk_version: 4.44.0
|
22 |
+
app_file: app.py
|
23 |
+
pinned: false
|
24 |
+
license: apache-2.0
|
25 |
+
python_version: "3.10"
|
26 |
+
suggested_hardware: t4-small
|
27 |
+
suggested_storage: small
|
28 |
+
short_description: Professional web content extraction and AI summarization tool for journalists, analysts, and researchers
|
29 |
+
tags:
|
30 |
+
|
31 |
+
web-scraping
|
32 |
+
content-extraction
|
33 |
+
ai-summarization
|
34 |
+
journalism
|
35 |
+
research
|
36 |
+
analysis
|
37 |
+
nlp
|
38 |
+
bart
|
39 |
+
content-analysis
|
40 |
+
models:
|
41 |
+
facebook/bart-large-cnn
|
42 |
+
sshleifer/distilbart-cnn-12-6
|
43 |
+
|
44 |
+
|
45 |
+
🤖 AI-Powered Web Scraper
|
46 |
+
Professional-grade web content extraction and AI summarization tool designed for journalists, analysts, and researchers.
|
47 |
+
Features
|
48 |
+
🛡️ Security & Compliance
|
49 |
+
|
50 |
+
Built-in URL validation and security checks
|
51 |
+
Robots.txt compliance checking
|
52 |
+
Protection against internal network access
|
53 |
+
Input sanitization and validation
|
54 |
+
|
55 |
+
🤖 AI-Powered Analysis
|
56 |
+
|
57 |
+
Advanced content summarization using BART models
|
58 |
+
Intelligent keyword extraction
|
59 |
+
Content quality assessment
|
60 |
+
Reading time estimation
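Reading time estimation can be sketched in a few lines. This is a minimal illustration, not the app's actual code: the function name `reading_stats` and the default of 200 words per minute are assumptions made for the example.

```python
import math


def reading_stats(text: str, words_per_minute: int = 200) -> dict:
    """Return word count and estimated reading time in whole minutes.

    The 200 wpm default is an assumed average reading speed, not a value
    taken from the app.
    """
    word_count = len(text.split())
    # Round up so that any non-empty text reports at least one minute.
    minutes = math.ceil(word_count / words_per_minute) if word_count else 0
    return {"word_count": word_count, "reading_time_minutes": minutes}


if __name__ == "__main__":
    print(reading_stats("word " * 450))
```

Rounding up rather than to the nearest minute avoids reporting "0 minutes" for short articles.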
|
61 |
+
|
62 |
+
Rich Metadata Extraction
|
63 |
+
|
64 |
+
Article titles and authors
|
65 |
+
Publication dates
|
66 |
+
Meta descriptions
|
67 |
+
Word count and reading metrics
|
68 |
+
Social media metadata (Open Graph)
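As an illustration of Open Graph extraction, the sketch below collects `og:*` properties from `<meta>` tags. The app itself is described as using BeautifulSoup4; the standard-library `html.parser` is swapped in here only to keep the example dependency-free, and the names `OpenGraphParser` and `extract_open_graph` are hypothetical.

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collect <meta property="og:..."> values while parsing HTML."""

    def __init__(self) -> None:
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        prop = attr_map.get("property") or ""
        content = attr_map.get("content")
        if prop.startswith("og:") and content is not None:
            # Keep the first occurrence if a property is repeated.
            self.properties.setdefault(prop, content)


def extract_open_graph(html: str) -> dict:
    parser = OpenGraphParser()
    parser.feed(html)
    return parser.properties


if __name__ == "__main__":
    sample = '<head><meta property="og:title" content="Hello"></head>'
    print(extract_open_graph(sample))
```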
|
69 |
+
|
70 |
+
💾 Export & Data Management
|
71 |
+
|
72 |
+
CSV and JSON export formats
|
73 |
+
Batch processing capabilities
|
74 |
+
Session data management
|
75 |
+
Professional report generation
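A rough sketch of how session results might be serialized to the two export formats, using only the standard library. The record fields (`url`, `title`, `word_count`) and the function names are assumptions for illustration; the app's real export schema is not shown in this README.

```python
import csv
import io
import json
from typing import Dict, List


def to_json(records: List[Dict]) -> str:
    """Serialize records as pretty-printed JSON."""
    return json.dumps(records, indent=2, ensure_ascii=False)


def to_csv(records: List[Dict]) -> str:
    """Serialize records as CSV, using the first record's keys as the header."""
    if not records:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


if __name__ == "__main__":
    rows = [{"url": "https://example.com/a", "title": "A", "word_count": 120}]
    print(to_csv(rows))
    print(to_json(rows))
```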
|
76 |
+
|
77 |
+
🔧 Technical Excellence
|
78 |
+
|
79 |
+
Modular, maintainable code architecture
|
80 |
+
Comprehensive error handling
|
81 |
+
Async processing capabilities
|
82 |
+
Fallback mechanisms for reliability
|
83 |
+
|
84 |
+
🎯 Target Users
|
85 |
+
|
86 |
+
Journalists: Quick article summarization and fact-checking
|
87 |
+
Research Analysts: Content analysis and data extraction
|
88 |
+
Academic Researchers: Literature review and content analysis
|
89 |
+
Content Strategists: Competitive analysis and trend research
|
90 |
+
|
91 |
+
How to Use
|
92 |
+
|
93 |
+
Enter URL: Paste the URL of the content you want to analyze
|
94 |
+
Configure Settings: Adjust summary length and other parameters
|
95 |
+
Extract & Analyze: Click the extract button to process content
|
96 |
+
Review Results: Examine the AI summary, metadata, and keywords
|
97 |
+
Export Data: Save results in your preferred format
|
98 |
+
|
99 |
+
⚙️ Technical Specifications
|
100 |
+
AI Models
|
101 |
+
|
102 |
+
Primary: Facebook BART-Large-CNN for summarization
|
103 |
+
Fallback: DistilBART-CNN for faster processing
|
104 |
+
Keyword Extraction: Custom frequency-based algorithm
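A frequency-based keyword extractor can be as simple as counting non-stopword tokens. The sketch below is one plausible reading of "custom frequency-based algorithm", not the app's implementation; the stopword list, the minimum word length, and the name `extract_keywords` are all assumptions.

```python
import re
from collections import Counter
from typing import List

# A deliberately tiny stopword list, for illustration only.
STOPWORDS = {"the", "and", "for", "are", "was", "with", "that", "this", "from"}


def extract_keywords(text: str, top_n: int = 5, min_length: int = 3) -> List[str]:
    """Return the top_n most frequent words that are not stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(
        token for token in tokens
        if len(token) >= min_length and token not in STOPWORDS
    )
    return [word for word, _ in counts.most_common(top_n)]


if __name__ == "__main__":
    sample = "Solar power grows. Solar panels convert sunlight; solar energy is clean energy."
    print(extract_keywords(sample, top_n=2))
```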
|
105 |
+
|
106 |
+
Content Processing
|
107 |
+
|
108 |
+
Parser: BeautifulSoup4 with multiple extraction strategies
|
109 |
+
Security: Multi-layer validation and sanitization
|
110 |
+
Compliance: Automatic robots.txt checking
|
111 |
+
Formats: HTML, XHTML, XML content support
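For the robots.txt check, Python's standard `urllib.robotparser` covers the basic case. The sketch below parses robots.txt content that has already been fetched, which keeps it testable offline; the helper name `is_allowed` and the user agent `ExampleScraperBot` are hypothetical, and the app's own checker may differ.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, user_agent: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() accepts an iterable of lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    rules = "User-agent: *\nDisallow: /private/\n"
    print(is_allowed(rules, "https://example.com/news/story", "ExampleScraperBot"))
    print(is_allowed(rules, "https://example.com/private/data", "ExampleScraperBot"))
```

To check a live site instead, `RobotFileParser` also provides `set_url()` and `read()` to download the file before calling `can_fetch()`.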
|
112 |
+
|
113 |
+
Performance
|
114 |
+
|
115 |
+
Processing Time: ~5-15 seconds per article
|
116 |
+
Content Length: Supports articles up to 50,000 words
|
117 |
+
Concurrent Requests: Optimized for batch processing
|
118 |
+
Memory Usage: Efficient model loading and caching
|
119 |
+
|
120 |
+
🛠️ Development
|
121 |
+
Architecture
|
122 |
+
├── ContentExtractor      # Web scraping and content extraction
├── AISummarizer          # AI-powered summarization
├── SecurityValidator     # URL and content validation
├── RobotsTxtChecker      # Compliance verification
└── WebScraperApp         # Main application orchestrator
|
127 |
+
Security Features
|
128 |
+
|
129 |
+
URL scheme validation (HTTP/HTTPS only)
|
130 |
+
Internal network protection
|
131 |
+
Robots.txt compliance
|
132 |
+
Rate limiting and throttling
|
133 |
+
Input sanitization
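The first two checks in the list above (scheme validation and internal network protection) could look roughly like this. It is a sketch under assumptions: the function names and error messages are invented for the example, and the real `SecurityValidator` may apply different or additional rules.

```python
import ipaddress
import socket
from typing import Tuple
from urllib.parse import urlparse

ALLOWED_SCHEMES = {"http", "https"}


def _is_internal(host: str) -> bool:
    """Return True if host is, or resolves to, a non-public address."""
    try:
        # Literal IP addresses are checked directly, without DNS.
        addresses = [ipaddress.ip_address(host)]
    except ValueError:
        try:
            infos = socket.getaddrinfo(host, None)
        except OSError:
            return True  # Unresolvable hosts are rejected.
        # Drop any IPv6 scope suffix such as "%eth0" before parsing.
        addresses = [ipaddress.ip_address(info[4][0].split("%")[0]) for info in infos]
    return any(
        addr.is_private or addr.is_loopback or addr.is_link_local or addr.is_reserved
        for addr in addresses
    )


def validate_url(url: str) -> Tuple[bool, str]:
    """Return (is_valid, message) for a user-supplied URL."""
    try:
        parsed = urlparse(url)
    except ValueError:
        return False, "URL could not be parsed"
    if parsed.scheme.lower() not in ALLOWED_SCHEMES:
        return False, "Only http and https URLs are allowed"
    if not parsed.hostname:
        return False, "URL has no host"
    if _is_internal(parsed.hostname):
        return False, "Host is on an internal or reserved network"
    return True, "OK"


if __name__ == "__main__":
    print(validate_url("http://127.0.0.1/admin"))
    print(validate_url("ftp://example.com/file"))
```

Note that a check like this validates the address at lookup time only; a hostname that later resolves to a different address would need to be re-checked at request time.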
|
134 |
+
|
135 |
+
Error Handling
|
136 |
+
|
137 |
+
Graceful degradation for failed requests
|
138 |
+
Fallback summarization methods
|
139 |
+
Comprehensive logging
|
140 |
+
User-friendly error messages
|
141 |
+
|
142 |
+
Supported Content Types
|
143 |
+
✅ Fully Supported
|
144 |
+
|
145 |
+
News articles and blog posts
|
146 |
+
Academic papers and research
|
147 |
+
Documentation and tutorials
|
148 |
+
Magazine articles and features
|
149 |
+
Press releases and announcements
|
150 |
+
|
151 |
+
⚠️ Limited Support
|
152 |
+
|
153 |
+
Dynamic JavaScript-heavy sites
|
154 |
+
Single-page applications (SPAs)
|
155 |
+
Password-protected content
|
156 |
+
Sites with aggressive anti-bot measures
|
157 |
+
|
158 |
+
❌ Not Supported
|
159 |
+
|
160 |
+
PDF documents (direct upload)
|
161 |
+
Video/audio content
|
162 |
+
Images and multimedia
|
163 |
+
Social media posts (API required)
|
164 |
+
|
165 |
+
Privacy & Ethics
|
166 |
+
|
167 |
+
No Data Storage: Content is processed in memory only
|
168 |
+
Respect for robots.txt: Automatic compliance checking
|
169 |
+
Rate Limiting: Respectful crawling practices
|
170 |
+
User Privacy: No tracking or analytics
|
171 |
+
Content Rights: Users responsible for usage rights
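One way to implement the "respectful crawling" point above is a per-host minimum delay between requests. The class below is a hypothetical sketch (the name `HostThrottle` and the one-second default are assumptions); the clock and sleep functions are injectable so the behaviour can be tested without real waiting.

```python
import time
from urllib.parse import urlparse


class HostThrottle:
    """Enforce a minimum interval between requests to the same host."""

    def __init__(self, min_interval: float = 1.0, clock=time.monotonic, sleep=time.sleep):
        self._min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_request = {}  # host -> time of the most recent request

    def wait(self, url: str) -> float:
        """Block until url's host may be contacted; return the delay applied."""
        host = urlparse(url).netloc
        now = self._clock()
        last = self._last_request.get(host)
        delay = 0.0
        if last is not None:
            delay = max(0.0, self._min_interval - (now - last))
        if delay > 0:
            self._sleep(delay)
        self._last_request[host] = now + delay
        return delay


if __name__ == "__main__":
    throttle = HostThrottle(min_interval=0.2)
    for path in ("a", "b"):
        waited = throttle.wait(f"https://example.com/{path}")
        print(f"waited {waited:.2f}s before requesting /{path}")
```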
|
172 |
+
|
173 |
+
🚨 Troubleshooting
|
174 |
+
Common Issues & Solutions
|
175 |
+
Issue: ModuleNotFoundError: No module named 'bs4'
|
176 |
+
# Solution 1: Use minimal requirements
|
177 |
+
pip install gradio requests beautifulsoup4 pandas
|
178 |
+
|
179 |
+
# Solution 2: Run the fix script
|
180 |
+
python quick_fix.py
|
181 |
+
|
182 |
+
# Solution 3: Manual installation
|
183 |
+
pip install beautifulsoup4
|
184 |
+
Issue: AI models not loading
|
185 |
+
|
186 |
+
✅ App still works: Uses extractive summarization as fallback
|
187 |
+
🔧 To enable AI: Ensure GPU is available or wait for model download
|
188 |
+
⚠️ First run: Models download automatically (2-3 minutes)
|
189 |
+
|
190 |
+
Issue: Slow performance
|
191 |
+
|
192 |
+
💡 Upgrade hardware: Use T4 Small GPU for 5-10x speedup
|
193 |
+
🔧 Optimize settings: Reduce summary length for faster processing
|
194 |
+
⚡ Batch processing: More efficient for multiple URLs
|
195 |
+
|
196 |
+
Deployment Troubleshooting
|
197 |
+
|
198 |
+
Check Space logs: Look for specific error messages
|
199 |
+
Verify requirements.txt: Ensure all packages are listed
|
200 |
+
Hardware requirements: Upgrade if memory issues occur
|
201 |
+
Restart Space: Factory reboot clears all caches
|
202 |
+
|
203 |
+
Fallback Features
|
204 |
+
The app includes robust fallback mechanisms:
|
205 |
+
|
206 |
+
No AI models: Uses extractive summarization
|
207 |
+
No NLTK: Uses basic text processing
|
208 |
+
Network issues: Graceful error handling
|
209 |
+
Invalid URLs: Security validation with clear messages
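The extractive fallback mentioned above is not specified in detail, so the following is only one common approach: score each sentence by the average frequency of its words and keep the highest-scoring sentences in their original order. The function name and the sentence-splitting regex are assumptions.

```python
import re
from collections import Counter


def extractive_summary(text: str, max_sentences: int = 2) -> str:
    """Pick the max_sentences highest-scoring sentences, in document order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if len(sentences) <= max_sentences:
        return " ".join(sentences)

    # Word frequencies over the whole text act as a crude importance signal.
    frequencies = Counter(re.findall(r"[a-z]+", text.lower()))

    def score(sentence: str) -> float:
        words = re.findall(r"[a-z]+", sentence.lower())
        return sum(frequencies[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max_sentences])  # restore original sentence order
    return " ".join(sentences[i] for i in keep)


if __name__ == "__main__":
    sample = (
        "Cats sleep a lot. Cats hunt mice. "
        "The weather was odd yesterday. Cats purr when cats are happy."
    )
    print(extractive_summary(sample, max_sentences=2))
```

Averaging by sentence length keeps long sentences from winning purely because they contain more words.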
|
210 |
+
|
211 |
+
Performance Tips
|
212 |
+
|
213 |
+
Batch Processing: Process multiple URLs for efficiency
|
214 |
+
Summary Length: Shorter summaries process faster
|
215 |
+
Content Quality: Clean, well-structured content works best
|
216 |
+
Network: Stable internet connection recommended
|
217 |
+
|
218 |
+
Contributing
|
219 |
+
Contributions welcome! Areas for improvement:
|
220 |
+
|
221 |
+
Additional content extractors
|
222 |
+
Enhanced keyword algorithms
|
223 |
+
Support for more file formats
|
224 |
+
Advanced AI models
|
225 |
+
Performance optimizations
|
226 |
+
|
227 |
+
License
|
228 |
+
Apache 2.0 License - See LICENSE file for details
|
229 |
+
⚡ Quick Start Examples
|
230 |
+
Basic Usage
|
231 |
+
URL: https://example.com/article
|
232 |
+
Summary Length: 200 words
|
233 |
+
→ Extract & Summarize
|
234 |
+
Batch Analysis
|
235 |
+
1. Process first URL
|
236 |
+
2. Review and export
|
237 |
+
3. Process next URL
|
238 |
+
4. Combine results
|
239 |
+
5. Final export
|
240 |
+
|
241 |
+
Built with ❤️ for the research and journalism community
|
242 |
+
This tool respects content creators' rights and website policies. Please use responsibly and in accordance with applicable laws and terms of service.
|