# CrawlResult

The `CrawlResult` class represents the result of a web crawling operation. It provides access to various forms of extracted content and metadata from the crawled webpage.
## Class Definition

```python
from typing import Dict, List, Optional

from pydantic import BaseModel


class CrawlResult(BaseModel):
    """Result of a web crawling operation."""

    # Basic Information
    url: str                                      # Crawled URL
    success: bool                                 # Whether the crawl succeeded
    status_code: Optional[int] = None             # HTTP status code
    error_message: Optional[str] = None           # Error message if failed

    # Content
    html: str                                     # Raw HTML content
    cleaned_html: Optional[str] = None            # Cleaned HTML
    fit_html: Optional[str] = None                # Most relevant HTML content
    markdown: Optional[str] = None                # HTML converted to markdown
    fit_markdown: Optional[str] = None            # Most relevant markdown content
    downloaded_files: Optional[List[str]] = None  # Downloaded files

    # Extracted Data
    extracted_content: Optional[str] = None       # Content from extraction strategy
    media: Dict[str, List[Dict]] = {}             # Extracted media information
    links: Dict[str, List[Dict]] = {}             # Extracted links
    metadata: Optional[dict] = None               # Page metadata

    # Additional Data
    screenshot: Optional[str] = None              # Base64-encoded screenshot
    session_id: Optional[str] = None              # Session identifier
    response_headers: Optional[dict] = None       # HTTP response headers
```
## Properties and Their Data Structures

### Basic Information

```python
# Access basic information
result = await crawler.arun(url="https://example.com")

print(result.url)            # "https://example.com"
print(result.success)        # True/False
print(result.status_code)    # 200, 404, etc.
print(result.error_message)  # Error details if the crawl failed
```
### Content Properties

#### HTML Content

```python
# Raw HTML
html_content = result.html

# Cleaned HTML (ads, popups, etc. removed)
clean_content = result.cleaned_html

# Most relevant HTML content
main_content = result.fit_html
```

#### Markdown Content

```python
# Full markdown version
markdown_content = result.markdown

# Most relevant markdown content
main_content = result.fit_markdown
```
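When a page lacks a clear main-content region, `fit_markdown` may be empty, so downstream code typically falls back to the full conversion before persisting it. A minimal sketch (the output file name is illustrative):

```python
from pathlib import Path

# Prefer the focused markdown; fall back to the full conversion
content = result.fit_markdown or result.markdown or ""

# Persist for downstream processing (file name is illustrative)
Path("page.md").write_text(content, encoding="utf-8")
```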
### Media Content

The `media` dictionary contains organized media elements:

```python
# Structure
media = {
    "images": [
        {
            "src": str,      # Image URL
            "alt": str,      # Alt text
            "desc": str,     # Contextual description
            "score": float,  # Relevance score (0-10)
            "type": str,     # "image"
            "width": int,    # Image width (if available)
            "height": int,   # Image height (if available)
            "context": str,  # Surrounding text
            "lazy": bool     # Whether the image was lazy-loaded
        }
    ],
    "videos": [
        {
            "src": str,         # Video URL
            "type": str,        # "video"
            "title": str,       # Video title
            "poster": str,      # Thumbnail URL
            "duration": str,    # Video duration
            "description": str  # Video description
        }
    ],
    "audios": [
        {
            "src": str,         # Audio URL
            "type": str,        # "audio"
            "title": str,       # Audio title
            "duration": str,    # Audio duration
            "description": str  # Audio description
        }
    ]
}

# Example usage
for image in result.media["images"]:
    if image["score"] > 5:  # High-relevance images
        print(f"High-quality image: {image['src']}")
        print(f"Context: {image['context']}")
```
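The relevance score makes it easy to keep only meaningful images. The sketch below filters by score and downloads the survivors; it assumes `httpx` is installed and that `src` values are absolute URLs (relative ones would need resolving against `result.url` first), and the file-naming scheme is illustrative:

```python
import httpx

async def download_relevant_images(result, min_score: float = 5.0) -> list:
    """Download images above a relevance threshold; return saved file names."""
    saved = []
    async with httpx.AsyncClient() as client:
        for i, image in enumerate(result.media["images"]):
            if image["score"] <= min_score:
                continue
            response = await client.get(image["src"])
            response.raise_for_status()
            filename = f"image_{i}.jpg"  # illustrative naming scheme
            with open(filename, "wb") as f:
                f.write(response.content)
            saved.append(filename)
    return saved
```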
### Link Analysis

The `links` dictionary organizes discovered links:

```python
# Structure
links = {
    "internal": [
        {
            "href": str,    # URL
            "text": str,    # Link text
            "title": str,   # Title attribute
            "type": str,    # Link type (nav, content, etc.)
            "context": str, # Surrounding text
            "score": float  # Relevance score
        }
    ],
    "external": [
        {
            "href": str,    # External URL
            "text": str,    # Link text
            "title": str,   # Title attribute
            "domain": str,  # Domain name
            "type": str,    # Link type
            "context": str  # Surrounding text
        }
    ]
}

# Example usage
for link in result.links["internal"]:
    print(f"Internal link: {link['href']}")
    print(f"Context: {link['context']}")
```
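For link analysis it is often useful to aggregate external links by domain. A small sketch using only the standard library (`urlparse` serves as a fallback in case the `domain` key is missing):

```python
from collections import Counter
from urllib.parse import urlparse

def external_domains(result) -> Counter:
    """Count how often each external domain is linked."""
    counts = Counter()
    for link in result.links["external"]:
        domain = link.get("domain") or urlparse(link["href"]).netloc
        counts[domain] += 1
    return counts

# Example: the five most frequently linked external domains
# print(external_domains(result).most_common(5))
```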
### Metadata

The `metadata` dictionary contains page information:

```python
# Structure
metadata = {
    "title": str,           # Page title
    "description": str,     # Meta description
    "keywords": List[str],  # Meta keywords
    "author": str,          # Author information
    "published_date": str,  # Publication date
    "modified_date": str,   # Last modified date
    "language": str,        # Page language
    "canonical_url": str,   # Canonical URL
    "og_data": Dict,        # Open Graph data
    "twitter_data": Dict    # Twitter card data
}

# Example usage
if result.metadata:
    print(f"Title: {result.metadata['title']}")
    print(f"Author: {result.metadata.get('author', 'Unknown')}")
```
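Since any of these keys may be absent on a given page, pulling them through a small helper with safe defaults keeps downstream code simple. A minimal sketch (helper name and defaults are illustrative):

```python
def summarize_metadata(result) -> dict:
    """Collect commonly used metadata fields with safe defaults."""
    meta = result.metadata or {}
    return {
        "title": meta.get("title", "Untitled"),
        "description": meta.get("description", ""),
        "language": meta.get("language", "unknown"),
        "canonical_url": meta.get("canonical_url", result.url),
    }
```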
### Extracted Content

Content from extraction strategies:

```python
import json

# For LLM or CSS extraction strategies
if result.extracted_content:
    structured_data = json.loads(result.extracted_content)
    print(structured_data)
```
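Because `extracted_content` is a plain string, parsing fails if a strategy returned something other than valid JSON. A defensive helper (returning `None` on bad input is a design choice for this sketch, not library behavior):

```python
import json
from typing import Any, Optional

def parse_extracted(result) -> Optional[Any]:
    """Parse extracted_content, returning None instead of raising on bad JSON."""
    if not result.extracted_content:
        return None
    try:
        return json.loads(result.extracted_content)
    except json.JSONDecodeError:
        return None
```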
### Screenshot

Base64-encoded screenshot:

```python
import base64

# Decode and save the screenshot if available
if result.screenshot:
    with open("screenshot.png", "wb") as f:
        f.write(base64.b64decode(result.screenshot))
```
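Note that `screenshot` is only populated when a capture was requested for the crawl; in most Crawl4AI versions this is done by passing `screenshot=True` to `arun()` (verify against your installed version's signature):

```python
import base64

from crawl4ai import AsyncWebCrawler

async with AsyncWebCrawler() as crawler:
    # screenshot=True asks the crawler to capture the rendered page
    result = await crawler.arun(url="https://example.com", screenshot=True)
    if result.screenshot:
        with open("screenshot.png", "wb") as f:
            f.write(base64.b64decode(result.screenshot))
```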
## Usage Examples

### Basic Content Access

```python
from crawl4ai import AsyncWebCrawler

async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(url="https://example.com")
    if result.success:
        # Get clean content
        print(result.fit_markdown)

        # Process images
        for image in result.media["images"]:
            if image["score"] > 7:
                print(f"High-quality image: {image['src']}")
```
### Complete Data Processing

```python
from typing import Dict

from crawl4ai import AsyncWebCrawler

async def process_webpage(url: str) -> Dict:
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url)
        if not result.success:
            raise Exception(f"Crawl failed: {result.error_message}")
        return {
            "content": result.fit_markdown,
            "images": [
                img for img in result.media["images"]
                if img["score"] > 5
            ],
            "internal_links": [
                link["href"] for link in result.links["internal"]
            ],
            "metadata": result.metadata,
            "status": result.status_code
        }
```
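To call `process_webpage` from synchronous code (e.g., a script entry point), run it through an event loop:

```python
import asyncio

data = asyncio.run(process_webpage("https://example.com"))
print(data["status"], len(data["images"]))
```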
### Error Handling

```python
from typing import Dict

from crawl4ai import AsyncWebCrawler

async def safe_crawl(url: str) -> Dict:
    async with AsyncWebCrawler() as crawler:
        try:
            result = await crawler.arun(url=url)
            if not result.success:
                return {
                    "success": False,
                    "error": result.error_message,
                    "status": result.status_code
                }
            return {
                "success": True,
                "content": result.fit_markdown,
                "status": result.status_code
            }
        except Exception as e:
            return {
                "success": False,
                "error": str(e),
                "status": None
            }
```
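Transient failures (timeouts, 5xx responses) often succeed on a second attempt. A simple retry wrapper around `safe_crawl` with exponential backoff (the delays and attempt count are illustrative):

```python
import asyncio
from typing import Dict

async def crawl_with_retries(url: str, attempts: int = 3) -> Dict:
    """Retry safe_crawl, backing off exponentially between attempts."""
    result: Dict = {"success": False, "error": "no attempts made", "status": None}
    for attempt in range(attempts):
        result = await safe_crawl(url)
        if result["success"]:
            return result
        if attempt < attempts - 1:
            await asyncio.sleep(2 ** attempt)  # 1s, 2s, 4s, ...
    return result
```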
## Best Practices

1. **Always Check Success**

   ```python
   if not result.success:
       print(f"Error: {result.error_message}")
       return
   ```

2. **Use fit_markdown for Articles**

   ```python
   # Better for article content; fall back to the full markdown
   content = result.fit_markdown or result.markdown
   ```

3. **Filter Media by Score**

   ```python
   relevant_images = [
       img for img in result.media["images"]
       if img["score"] > 5
   ]
   ```

4. **Handle Missing Data**

   ```python
   metadata = result.metadata or {}
   title = metadata.get('title', 'Unknown Title')
   ```
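Putting these practices together, a sketch of an end-to-end helper (the function name is illustrative; field names follow the structures documented above):

```python
from typing import Optional

from crawl4ai import AsyncWebCrawler

async def crawl_article(url: str) -> Optional[dict]:
    """Crawl a page and return its title, main content, and relevant images."""
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url)
        if not result.success:                      # 1. always check success
            print(f"Error: {result.error_message}")
            return None
        metadata = result.metadata or {}            # 4. handle missing data
        return {
            "title": metadata.get("title", "Unknown Title"),
            "content": result.fit_markdown or result.markdown,  # 2. prefer fit_markdown
            "images": [                             # 3. filter media by score
                img for img in result.media["images"]
                if img["score"] > 5
            ],
        }
```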