# CrawlResult

The `CrawlResult` class represents the result of a web crawling operation. It provides access to various forms of extracted content and metadata from the crawled webpage.

## Class Definition

```python
from typing import Dict, List, Optional

from pydantic import BaseModel


class CrawlResult(BaseModel):
    """Result of a web crawling operation."""
    
    # Basic Information
    url: str                                # Crawled URL
    success: bool                           # Whether crawl succeeded
    status_code: Optional[int] = None       # HTTP status code
    error_message: Optional[str] = None     # Error message if failed
    
    # Content
    html: str                               # Raw HTML content
    cleaned_html: Optional[str] = None      # Cleaned HTML
    fit_html: Optional[str] = None          # Most relevant HTML content
    markdown: Optional[str] = None          # HTML converted to markdown
    fit_markdown: Optional[str] = None      # Most relevant markdown content
    downloaded_files: Optional[List[str]] = None  # Downloaded files
    
    # Extracted Data
    extracted_content: Optional[str] = None  # Content from extraction strategy
    media: Dict[str, List[Dict]] = {}       # Extracted media information
    links: Dict[str, List[Dict]] = {}       # Extracted links
    metadata: Optional[dict] = None         # Page metadata
    
    # Additional Data
    screenshot: Optional[str] = None         # Base64 encoded screenshot
    session_id: Optional[str] = None         # Session identifier
    response_headers: Optional[dict] = None  # HTTP response headers
```

## Properties and Their Data Structures

### Basic Information

```python
# Access basic information
# (inside an async function, with `crawler` an open AsyncWebCrawler)
result = await crawler.arun(url="https://example.com")

print(result.url)          # "https://example.com"
print(result.success)      # True/False
print(result.status_code)  # 200, 404, etc.
print(result.error_message)  # Error details if failed
```

### Content Properties

#### HTML Content
```python
# Raw HTML
html_content = result.html

# Cleaned HTML (ads, popups, etc. removed)
clean_content = result.cleaned_html

# Most relevant HTML content
main_content = result.fit_html
```

#### Markdown Content
```python
# Full markdown version
markdown_content = result.markdown

# Most relevant markdown content
main_content = result.fit_markdown
```
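
Because every content field except `html` is `Optional`, callers usually want a fallback order rather than a single attribute. The following is a minimal sketch of that idea, not part of Crawl4AI: `best_text` is a hypothetical helper, and a `SimpleNamespace` stands in for a real `CrawlResult` so the snippet runs without a crawl.

```python
from types import SimpleNamespace
from typing import Optional


def best_text(result) -> Optional[str]:
    """Return the first non-empty content field, most focused first.

    Hypothetical helper: it only relies on the attribute names
    documented above (fit_markdown, markdown, cleaned_html, html).
    """
    for field in ("fit_markdown", "markdown", "cleaned_html", "html"):
        value = getattr(result, field, None)
        if value and value.strip():
            return value
    return None


# Stand-in object with the same attribute names as CrawlResult.
fake_result = SimpleNamespace(
    fit_markdown=None,
    markdown="# Title\n\nBody",
    cleaned_html="<h1>Title</h1><p>Body</p>",
    html="<html><body><h1>Title</h1><p>Body</p></body></html>",
)
print(best_text(fake_result))
```

The order puts the most filtered representation first, so the raw HTML is only returned when nothing else is available.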

### Media Content

The `media` dictionary groups extracted media elements by type under the keys `images`, `videos`, and `audios`:

```python
# Structure
media = {
    "images": [
        {
            "src": str,           # Image URL
            "alt": str,           # Alt text
            "desc": str,          # Contextual description
            "score": float,       # Relevance score (0-10)
            "type": str,          # "image"
            "width": int,         # Image width (if available)
            "height": int,        # Image height (if available)
            "context": str,       # Surrounding text
            "lazy": bool          # Whether image was lazy-loaded
        }
    ],
    "videos": [
        {
            "src": str,           # Video URL
            "type": str,          # "video"
            "title": str,         # Video title
            "poster": str,        # Thumbnail URL
            "duration": str,      # Video duration
            "description": str    # Video description
        }
    ],
    "audios": [
        {
            "src": str,           # Audio URL
            "type": str,          # "audio"
            "title": str,         # Audio title
            "duration": str,      # Audio duration
            "description": str    # Audio description
        }
    ]
}

# Example usage
for image in result.media.get("images", []):  # media defaults to {}
    if image.get("score", 0) > 5:  # High-relevance images
        print(f"High-relevance image: {image['src']}")
        print(f"Context: {image.get('context', '')}")
```
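
Filtering by score is often combined with ranking. The sketch below assumes only the dictionary layout shown above and tolerates a missing `images` key or a missing `score`. `top_images` and the sample data are hypothetical, for illustration only.

```python
from typing import Dict, List


def top_images(media: Dict[str, List[Dict]], min_score: float = 5.0, limit: int = 3) -> List[Dict]:
    """Return up to `limit` images scoring above `min_score`, best first.

    Hypothetical helper; images without a score are treated as score 0.
    """
    images = media.get("images", [])
    relevant = [img for img in images if (img.get("score") or 0) > min_score]
    relevant.sort(key=lambda img: img.get("score") or 0, reverse=True)
    return relevant[:limit]


# Sample data in the documented shape (values are made up).
sample_media = {
    "images": [
        {"src": "https://example.com/a.png", "score": 8.0},
        {"src": "https://example.com/b.png", "score": 2.5},
        {"src": "https://example.com/c.png"},  # no score recorded
        {"src": "https://example.com/d.png", "score": 6.0},
    ]
}

for img in top_images(sample_media):
    print(img["src"], img["score"])
```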

### Link Analysis

The `links` dictionary splits discovered links into `internal` and `external` lists:

```python
# Structure
links = {
    "internal": [
        {
            "href": str,          # URL
            "text": str,          # Link text
            "title": str,         # Title attribute
            "type": str,          # Link type (nav, content, etc.)
            "context": str,       # Surrounding text
            "score": float        # Relevance score
        }
    ],
    "external": [
        {
            "href": str,          # External URL
            "text": str,          # Link text
            "title": str,         # Title attribute
            "domain": str,        # Domain name
            "type": str,          # Link type
            "context": str        # Surrounding text
        }
    ]
}

# Example usage
for link in result.links.get("internal", []):  # links defaults to {}
    print(f"Internal link: {link['href']}")
    print(f"Context: {link.get('context', '')}")
```
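
A common use of the `external` list is counting which domains a page points to. The sketch below assumes the structure shown above and falls back to parsing `href` when a `domain` key is absent. `external_domains` and the sample links are hypothetical.

```python
from collections import Counter
from typing import Dict, List
from urllib.parse import urlparse


def external_domains(links: Dict[str, List[Dict]]) -> Counter:
    """Count external links per domain.

    Hypothetical helper: prefers the documented "domain" key and
    derives the domain from "href" when that key is missing.
    """
    counts: Counter = Counter()
    for link in links.get("external", []):
        domain = link.get("domain") or urlparse(link.get("href", "")).netloc
        if domain:
            counts[domain.lower()] += 1
    return counts


# Sample data in the documented shape (values are made up).
sample_links = {
    "external": [
        {"href": "https://docs.python.org/3/", "domain": "docs.python.org"},
        {"href": "https://docs.python.org/3/library/", "text": "Library"},
        {"href": "https://www.iana.org/domains/example"},
    ]
}

for domain, count in external_domains(sample_links).most_common():
    print(domain, count)
```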

### Metadata

The `metadata` dictionary holds page-level information. It may be `None`, and individual keys may be missing:

```python
# Structure
metadata = {
    "title": str,             # Page title
    "description": str,       # Meta description
    "keywords": List[str],    # Meta keywords
    "author": str,            # Author information
    "published_date": str,    # Publication date
    "modified_date": str,     # Last modified date
    "language": str,          # Page language
    "canonical_url": str,     # Canonical URL
    "og_data": Dict,          # Open Graph data
    "twitter_data": Dict      # Twitter card data
}

# Example usage
if result.metadata:
    print(f"Title: {result.metadata.get('title', 'Untitled')}")
    print(f"Author: {result.metadata.get('author', 'Unknown')}")
```
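
Since `metadata` can be `None` and any key can be absent, it helps to funnel access through one function with defaults. A minimal sketch, assuming the key names listed above; `describe_page` is a hypothetical helper and the default strings are arbitrary.

```python
from typing import Optional


def describe_page(metadata: Optional[dict]) -> str:
    """Build a one-line summary that tolerates missing metadata.

    Hypothetical helper; "Untitled" and "Unknown" are arbitrary defaults.
    """
    meta = metadata or {}
    title = meta.get("title") or "Untitled"
    author = meta.get("author") or "Unknown"
    keywords = meta.get("keywords") or []
    line = f"{title} (by {author})"
    if keywords:
        line += " [" + ", ".join(keywords) + "]"
    return line


print(describe_page(None))
print(describe_page({"title": "Example Domain", "keywords": ["example", "docs"]}))
```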

### Extracted Content

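`extracted_content` holds the output of the configured extraction strategy as a string. For the LLM and CSS strategies shown here, that string is JSON: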

```python
import json

# For LLM or CSS extraction strategies
if result.extracted_content:
    structured_data = json.loads(result.extracted_content)
    print(structured_data)
```
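
`json.loads` raises `json.JSONDecodeError` on malformed input, which can happen if a strategy returns plain text. The sketch below wraps the parse so callers get `None` instead of an exception. `parse_extracted` is a hypothetical helper, not a Crawl4AI function.

```python
import json
from typing import Any, Optional


def parse_extracted(extracted_content: Optional[str]) -> Any:
    """Parse extracted content as JSON, returning None when it is
    missing or not valid JSON. Hypothetical helper."""
    if not extracted_content:
        return None
    try:
        return json.loads(extracted_content)
    except json.JSONDecodeError:
        return None


print(parse_extracted('[{"title": "Example"}]'))
print(parse_extracted("plain text, not JSON"))
```

Returning `None` for both the missing and the malformed case keeps the call site simple. Log the raw string before discarding it if the two cases need to be told apart.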

### Screenshot

`screenshot` holds the page screenshot as a Base64-encoded string, or `None` if no screenshot was captured:

```python
import base64

# Save screenshot if available
if result.screenshot:
    # Decode and save
    with open("screenshot.png", "wb") as f:
        f.write(base64.b64decode(result.screenshot))
```
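
Before writing the decoded bytes to disk, it can be worth checking that they really are an image. The sketch below assumes the screenshot is a PNG, which the `screenshot.png` filename above suggests but this page does not state outright. `decode_png` is a hypothetical helper.

```python
import base64
import binascii
from typing import Optional

# Every PNG file starts with this fixed 8-byte signature.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def decode_png(screenshot_b64: Optional[str]) -> Optional[bytes]:
    """Decode a Base64 string and return the bytes only if they
    start with the PNG signature. Hypothetical helper; the PNG
    format is an assumption about what the crawler produces."""
    if not screenshot_b64:
        return None
    try:
        data = base64.b64decode(screenshot_b64)
    except (binascii.Error, ValueError):
        return None
    return data if data.startswith(PNG_SIGNATURE) else None


# Fake payloads: a PNG-looking byte string and a non-image string.
fake_png = base64.b64encode(PNG_SIGNATURE + b"fake image data").decode("ascii")
not_png = base64.b64encode(b"hello").decode("ascii")

print(decode_png(fake_png) is not None)
print(decode_png(not_png) is None)
```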

## Usage Examples

### Basic Content Access
```python
import asyncio

from crawl4ai import AsyncWebCrawler


async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com")

        if result.success:
            # Get clean content
            print(result.fit_markdown)

            # Process images
            for image in result.media.get("images", []):
                if image.get("score", 0) > 7:
                    print(f"High-relevance image: {image['src']}")


asyncio.run(main())
```

### Complete Data Processing
```python
from typing import Dict

from crawl4ai import AsyncWebCrawler


async def process_webpage(url: str) -> Dict:
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url)

        if not result.success:
            raise RuntimeError(f"Crawl failed: {result.error_message}")

        return {
            "content": result.fit_markdown,
            "images": [
                img for img in result.media.get("images", [])
                if img.get("score", 0) > 5
            ],
            "internal_links": [
                link["href"] for link in result.links.get("internal", [])
            ],
            "metadata": result.metadata,
            "status": result.status_code
        }
```

### Error Handling
```python
from typing import Dict

from crawl4ai import AsyncWebCrawler


async def safe_crawl(url: str) -> Dict:
    async with AsyncWebCrawler() as crawler:
        try:
            result = await crawler.arun(url=url)
            
            if not result.success:
                return {
                    "success": False,
                    "error": result.error_message,
                    "status": result.status_code
                }
            
            return {
                "success": True,
                "content": result.fit_markdown,
                "status": result.status_code
            }
            
        except Exception as e:
            return {
                "success": False,
                "error": str(e),
                "status": None
            }
```
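
When a crawl fails, `status_code` gives a rough idea of why. The sketch below maps a result to a coarse category. The category names and the `failure_kind` helper are hypothetical, and a `SimpleNamespace` stands in for a real `CrawlResult`.

```python
from types import SimpleNamespace


def failure_kind(result) -> str:
    """Classify a crawl result into a coarse category.

    Hypothetical helper; relies only on the documented
    `success` and `status_code` fields.
    """
    if result.success:
        return "ok"
    code = result.status_code
    if code is None:
        return "no_response"   # no HTTP status was recorded
    if 400 <= code < 500:
        return "client_error"  # e.g. 404 Not Found
    if 500 <= code < 600:
        return "server_error"  # e.g. 503 Service Unavailable
    return "other"


# Stand-in results (values are made up).
samples = [
    SimpleNamespace(success=True, status_code=200),
    SimpleNamespace(success=False, status_code=404),
    SimpleNamespace(success=False, status_code=None),
]
print([failure_kind(r) for r in samples])
```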

## Best Practices

1. **Always Check Success**
```python
# Inside the function that processes the result
if not result.success:
    print(f"Error: {result.error_message}")
    return
```

2. **Use fit_markdown for Articles**
```python
# Prefer the focused version; fall back to the full markdown
content = result.fit_markdown or result.markdown
```

3. **Filter Media by Score**
```python
relevant_images = [
    img for img in result.media.get("images", [])
    if img.get("score", 0) > 5
]
```

4. **Handle Missing Data**
```python
metadata = result.metadata or {}
title = metadata.get('title', 'Unknown Title')
```