# Output Formats
Crawl4AI exposes several output formats for a crawled page: raw HTML, cleaned HTML, several markdown variants, and structured data extracted with an LLM or with pattern-based strategies.
## Basic Formats
```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com")
        # Access different formats
        raw_html = result.html                    # Original HTML
        clean_html = result.cleaned_html          # Sanitized HTML
        markdown_v2 = result.markdown_v2          # Detailed markdown generation results
        fit_md = result.markdown_v2.fit_markdown  # Most relevant content in markdown

asyncio.run(main())
```
> **Note**: The `markdown_v2` property will soon be replaced by `markdown`. Prefer `markdown` in new implementations.
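
During the transition, code that must run against both attribute names can look the object up defensively. The helper below is a minimal sketch, not part of Crawl4AI: `get_markdown_result` is a hypothetical name, and it assumes that `markdown` will eventually hold the same kind of object that `markdown_v2` holds today (one exposing `raw_markdown`), while on some versions `markdown` may still be a plain string.

```python
from types import SimpleNamespace

def get_markdown_result(result):
    """Return the detailed markdown object from a crawl result, or None."""
    for name in ("markdown", "markdown_v2"):
        candidate = getattr(result, name, None)
        # Accept only objects that expose raw_markdown; a plain string is skipped.
        if hasattr(candidate, "raw_markdown"):
            return candidate
    return None

# Stand-in objects for illustration; a real result comes from crawler.arun().
old_style = SimpleNamespace(
    markdown="# Title",
    markdown_v2=SimpleNamespace(raw_markdown="# Title"),
)
new_style = SimpleNamespace(markdown=SimpleNamespace(raw_markdown="# Title"))

print(get_markdown_result(old_style).raw_markdown)
print(get_markdown_result(new_style).raw_markdown)
```

Checking for the `raw_markdown` attribute rather than the attribute name keeps the helper independent of which release is installed.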
## Raw HTML
Original, unmodified HTML from the webpage. Useful when you need to:
- Preserve the exact page structure.
- Process HTML with your own tools.
- Debug page issues.
```python
result = await crawler.arun(url="https://example.com")
print(result.html)  # Complete HTML including headers, scripts, etc.
```
## Cleaned HTML
Sanitized HTML with unnecessary elements removed. Automatically:
- Removes scripts and styles.
- Cleans up formatting.
- Preserves semantic structure.
```python
from crawl4ai.async_configs import CrawlerRunConfig

config = CrawlerRunConfig(
    excluded_tags=['form', 'header', 'footer'],  # Additional tags to remove
    keep_data_attributes=False                   # Remove data-* attributes
)
result = await crawler.arun(url="https://example.com", config=config)
print(result.cleaned_html)
```
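To make the effect of `excluded_tags` and `keep_data_attributes` concrete, here is a standard-library sketch of the same idea: drop scripts, styles, and the listed elements together with their content, and optionally strip `data-*` attributes. This is an illustration only, not Crawl4AI's implementation; `TagStripper` and `strip_tags` are hypothetical names, and the real cleaner does more (formatting cleanup, for example).

```python
from html import escape
from html.parser import HTMLParser

class TagStripper(HTMLParser):
    """Re-emits HTML while dropping excluded elements and, optionally, data-* attributes."""

    def __init__(self, excluded_tags, keep_data_attributes=False):
        super().__init__(convert_charrefs=True)
        self.excluded = set(excluded_tags) | {"script", "style"}
        self.keep_data_attributes = keep_data_attributes
        self.skip_depth = 0  # > 0 while inside an excluded element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.excluded:
            self.skip_depth += 1
        if self.skip_depth:
            return
        kept = [(k, v) for k, v in attrs
                if self.keep_data_attributes or not k.startswith("data-")]
        rendered = "".join(
            f" {k}" if v is None else f' {k}="{escape(v, quote=True)}"'
            for k, v in kept
        )
        self.parts.append(f"<{tag}{rendered}>")

    def handle_endtag(self, tag):
        if tag in self.excluded:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if not self.skip_depth:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(escape(data, quote=False))

def strip_tags(html_text, excluded_tags, keep_data_attributes=False):
    parser = TagStripper(excluded_tags, keep_data_attributes)
    parser.feed(html_text)
    parser.close()
    return "".join(parser.parts)

page = ('<div data-id="7" class="a"><script>x()</script>'
        '<p>Hello</p><footer>bye</footer></div>')
print(strip_tags(page, ["form", "header", "footer"]))
```

The depth counter only tracks excluded elements, so void tags such as `<br>` nested inside a removed element do not unbalance it.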
## Standard Markdown
HTML converted to clean markdown format. This output is useful for:
- Content analysis.
- Documentation.
- Readability.
```python
from crawl4ai.async_configs import CrawlerRunConfig
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

config = CrawlerRunConfig(
    markdown_generator=DefaultMarkdownGenerator(
        options={"include_links": True}  # Include links in markdown
    )
)
result = await crawler.arun(url="https://example.com", config=config)
print(result.markdown_v2.raw_markdown)  # Standard markdown with links
```
## Fit Markdown
Extracts only the most relevant content and converts it to markdown. Best suited for:
- Article extraction.
- Focusing on the main content.
- Removing boilerplate.
To generate `fit_markdown`, use a content filter like `PruningContentFilter`:
```python
from crawl4ai.async_configs import CrawlerRunConfig
from crawl4ai.content_filter_strategy import PruningContentFilter

config = CrawlerRunConfig(
    content_filter=PruningContentFilter(
        threshold=0.7,
        threshold_type="dynamic",
        min_word_threshold=100
    )
)
result = await crawler.arun(url="https://example.com", config=config)
print(result.markdown_v2.fit_markdown)  # Extracted main content in markdown
```
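Because `fit_markdown` depends on a content filter being configured, downstream code may want a fallback for results where it is missing or empty. The following is a small sketch under that assumption; `best_markdown` is a hypothetical helper, not a Crawl4AI function, and the stand-in objects only mimic the two attributes used here.

```python
from types import SimpleNamespace

def best_markdown(md_result):
    """Prefer fit_markdown; fall back to raw_markdown when it is missing or blank."""
    fit = getattr(md_result, "fit_markdown", None)
    if isinstance(fit, str) and fit.strip():
        return fit
    return getattr(md_result, "raw_markdown", None) or ""

# Stand-ins for result.markdown_v2, with and without a content filter applied.
filtered = SimpleNamespace(raw_markdown="# Page\nnav | footer\nBody", fit_markdown="Body")
unfiltered = SimpleNamespace(raw_markdown="# Page\nBody", fit_markdown=None)

print(best_markdown(filtered))
print(best_markdown(unfiltered))
```

In a real crawl this would be called as `best_markdown(result.markdown_v2)`.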
## Markdown with Citations
Generates markdown in which links are replaced by numbered citations, with the targets collected in a separate references section. This format is ideal for:
- Creating structured documentation.
- Including references for extracted content.
```python
from crawl4ai.async_configs import CrawlerRunConfig
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

config = CrawlerRunConfig(
    markdown_generator=DefaultMarkdownGenerator(
        options={"citations": True}  # Enable citations
    )
)
result = await crawler.arun(url="https://example.com", config=config)
print(result.markdown_v2.markdown_with_citations)
print(result.markdown_v2.references_markdown)  # Citations section
```
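The idea behind the citation format can be sketched with the standard library alone: each inline link is replaced by its text plus a number, and the URLs are collected into a reference list. This is an illustrative approximation; `to_citations` is a hypothetical function, and the exact markers and layout that Crawl4AI emits in `markdown_with_citations` and `references_markdown` may differ.

```python
import re

# Inline markdown links [text](url); the lookbehind skips images such as ![alt](src).
LINK_RE = re.compile(r"(?<!!)\[([^\]]+)\]\(([^)\s]+)\)")

def to_citations(markdown):
    """Return (body_with_numbered_citations, references_block)."""
    numbers = {}  # url -> citation number, in order of first appearance

    def replace(match):
        text, url = match.group(1), match.group(2)
        number = numbers.setdefault(url, len(numbers) + 1)
        return f"{text} [{number}]"

    body = LINK_RE.sub(replace, markdown)
    references = "\n".join(f"[{n}] {url}" for url, n in numbers.items())
    return body, references

sample = ("See [docs](https://example.com/docs) and [home](https://example.com), "
          "or [docs again](https://example.com/docs).")
body, references = to_citations(sample)
print(body)
print(references)
```

Repeated URLs share one number, which keeps the reference list short on link-heavy pages.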