Crawl4AI

Runtime error

File size: 3,145 Bytes

03c0888

# Output Formats

Crawl4AI provides multiple output formats to suit different needs, ranging from raw HTML to structured data using LLM or pattern-based extraction, and versatile markdown outputs.

## Basic Formats

```python
result = await crawler.arun(url="https://example.com")

# Access different formats
raw_html = result.html                # Original HTML
clean_html = result.cleaned_html      # Sanitized HTML
markdown_v2 = result.markdown_v2      # Detailed markdown generation results
fit_md = result.markdown_v2.fit_markdown  # Most relevant content in markdown
```

> **Note**: The `markdown_v2` property will soon be replaced by `markdown`. It is recommended to start transitioning to using `markdown` for new implementations.

## Raw HTML

Original, unmodified HTML from the webpage. Useful when you need to:
- Preserve the exact page structure.
- Process HTML with your own tools.
- Debug page issues.

```python
result = await crawler.arun(url="https://example.com")
print(result.html)  # Complete HTML including headers, scripts, etc.
```

## Cleaned HTML

Sanitized HTML with unnecessary elements removed. Automatically:
- Removes scripts and styles.
- Cleans up formatting.
- Preserves semantic structure.

```python
config = CrawlerRunConfig(
    excluded_tags=['form', 'header', 'footer'],  # Additional tags to remove
    keep_data_attributes=False  # Remove data-* attributes
)
result = await crawler.arun(url="https://example.com", config=config)
print(result.cleaned_html)
```

## Standard Markdown

HTML converted to clean markdown format. This output is useful for:
- Content analysis.
- Documentation.
- Readability.

```python
config = CrawlerRunConfig(
    markdown_generator=DefaultMarkdownGenerator(
        options={"include_links": True}  # Include links in markdown
    )
)
result = await crawler.arun(url="https://example.com", config=config)
print(result.markdown_v2.raw_markdown)  # Standard markdown with links
```

## Fit Markdown

Extract and convert only the most relevant content into markdown format. Best suited for:
- Article extraction.
- Focusing on the main content.
- Removing boilerplate.

To generate `fit_markdown`, use a content filter like `PruningContentFilter`:

```python
from crawl4ai.content_filter_strategy import PruningContentFilter

config = CrawlerRunConfig(
    content_filter=PruningContentFilter(
        threshold=0.7,
        threshold_type="dynamic",
        min_word_threshold=100
    )
)
result = await crawler.arun(url="https://example.com", config=config)
print(result.markdown_v2.fit_markdown)  # Extracted main content in markdown
```

## Markdown with Citations

Generate markdown that includes citations for links. This format is ideal for:
- Creating structured documentation.
- Including references for extracted content.

```python
config = CrawlerRunConfig(
    markdown_generator=DefaultMarkdownGenerator(
        options={"citations": True}  # Enable citations
    )
)
result = await crawler.arun(url="https://example.com", config=config)
print(result.markdown_v2.markdown_with_citations)
print(result.markdown_v2.references_markdown)  # Citations section
```