# Output Formats
Crawl4AI exposes several output formats for different needs: raw HTML, cleaned HTML, several markdown variants, and structured data produced by LLM-based or pattern-based extraction.
## Basic Formats
```python
result = await crawler.arun(url="https://example.com")
# Access different formats
raw_html = result.html # Original HTML
clean_html = result.cleaned_html # Sanitized HTML
markdown_v2 = result.markdown_v2 # Detailed markdown generation results
fit_md = result.markdown_v2.fit_markdown # Most relevant content in markdown
```
> **Note**: The `markdown_v2` property will soon be replaced by `markdown`. Prefer `markdown` in new implementations.
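To keep code working across the transition described in the note, one option is a small accessor that prefers `markdown` and falls back to `markdown_v2`. The sketch below is only an illustration: `pick_markdown_result` is a hypothetical helper, the `SimpleNamespace` objects are stand-ins for a real crawl result, and it assumes that `markdown` may be a plain string in some releases (in which case it lacks fields such as `raw_markdown`).

```python
from types import SimpleNamespace


def pick_markdown_result(result):
    """Return the first attribute that exposes detailed markdown fields.

    Hypothetical helper: checks `markdown` first, then `markdown_v2`.
    Plain strings are skipped because they have no `raw_markdown` attribute.
    """
    for name in ("markdown", "markdown_v2"):
        candidate = getattr(result, name, None)
        if candidate is not None and hasattr(candidate, "raw_markdown"):
            return candidate
    return None


# Stand-in objects (not real Crawl4AI results), used only to show the fallback.
new_style = SimpleNamespace(
    markdown=SimpleNamespace(raw_markdown="# A", fit_markdown="A")
)
old_style = SimpleNamespace(
    markdown="# plain string",
    markdown_v2=SimpleNamespace(raw_markdown="# B", fit_markdown="B"),
)

print(pick_markdown_result(new_style).fit_markdown)
print(pick_markdown_result(old_style).fit_markdown)
```

With a real crawl, `result` would be the object returned by `crawler.arun(...)`.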
## Raw HTML
Original, unmodified HTML from the webpage. Useful when you need to:
- Preserve the exact page structure.
- Process HTML with your own tools.
- Debug page issues.
```python
result = await crawler.arun(url="https://example.com")
print(result.html) # Complete HTML including headers, scripts, etc.
```
## Cleaned HTML
Sanitized HTML with unnecessary elements removed. Automatically:
- Removes scripts and styles.
- Cleans up formatting.
- Preserves semantic structure.
```python
from crawl4ai.async_configs import CrawlerRunConfig

config = CrawlerRunConfig(
    excluded_tags=['form', 'header', 'footer'],  # Additional tags to remove
    keep_data_attributes=False                   # Remove data-* attributes
)
result = await crawler.arun(url="https://example.com", config=config)
print(result.cleaned_html)
```
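To make the effect of `excluded_tags` concrete, here is a minimal standard-library sketch of tag removal. It is not how Crawl4AI implements cleaning; `TagStripper` and `strip_tags` are hypothetical names, and the sketch only drops the listed elements together with their contents.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Re-emits HTML while dropping excluded tags and everything inside them."""

    def __init__(self, excluded_tags):
        super().__init__(convert_charrefs=False)
        self.excluded = set(excluded_tags)
        self.depth = 0   # > 0 while inside an excluded element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.excluded:
            self.depth += 1
        elif self.depth == 0:
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> never change the nesting depth.
        if tag not in self.excluded and self.depth == 0:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in self.excluded:
            self.depth = max(0, self.depth - 1)
        elif self.depth == 0:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.depth == 0:
            self.parts.append(data)

    def handle_entityref(self, name):
        if self.depth == 0:
            self.parts.append(f"&{name};")

    def handle_charref(self, name):
        if self.depth == 0:
            self.parts.append(f"&#{name};")


def strip_tags(html, excluded_tags=("script", "style")):
    parser = TagStripper(excluded_tags)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


sample = '<p>Hi &amp; bye</p><script>alert(1)</script><form><input name="q"></form>'
print(strip_tags(sample, ["script", "style", "form"]))
```

The real cleaner also tidies formatting and can strip `data-*` attributes, which this sketch does not attempt.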
## Standard Markdown
HTML converted to clean markdown format. This output is useful for:
- Content analysis.
- Documentation.
- Readability.
```python
from crawl4ai.async_configs import CrawlerRunConfig
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

config = CrawlerRunConfig(
    markdown_generator=DefaultMarkdownGenerator(
        options={"include_links": True}  # Include links in markdown
    )
)
result = await crawler.arun(url="https://example.com", config=config)
print(result.markdown_v2.raw_markdown)  # Standard markdown with links
```
## Fit Markdown
Extract and convert only the most relevant content into markdown format. Best suited for:
- Article extraction.
- Focusing on the main content.
- Removing boilerplate.
To generate `fit_markdown`, use a content filter like `PruningContentFilter`:
```python
from crawl4ai.async_configs import CrawlerRunConfig
from crawl4ai.content_filter_strategy import PruningContentFilter

config = CrawlerRunConfig(
    content_filter=PruningContentFilter(
        threshold=0.7,
        threshold_type="dynamic",
        min_word_threshold=100
    )
)
result = await crawler.arun(url="https://example.com", config=config)
print(result.markdown_v2.fit_markdown)  # Extracted main content in markdown
```
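As a rough intuition for `min_word_threshold`, the sketch below drops text blocks that contain too few words. This is a deliberately simplified illustration, not the scoring that `PruningContentFilter` performs; `drop_short_blocks` is a hypothetical helper, and reading the parameter as a per-block minimum word count is an assumption.

```python
def drop_short_blocks(blocks, min_word_threshold=100):
    """Keep only text blocks with at least `min_word_threshold` words."""
    return [
        block for block in blocks
        if len(block.split()) >= min_word_threshold
    ]


# Illustrative input: a navigation line and one sentence of article text.
blocks = [
    "Home | About | Contact",
    "Crawl4AI converts the main article body into markdown and drops navigation menus.",
]

# A low threshold is used here only so the short demo has something to keep.
kept = drop_short_blocks(blocks, min_word_threshold=8)
print(kept)
```

Short fragments such as menus and footers tend to fall below the threshold, which is why a filter of this kind removes boilerplate.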
## Markdown with Citations
Generate markdown that includes citations for links. This format is ideal for:
- Creating structured documentation.
- Including references for extracted content.
```python
from crawl4ai.async_configs import CrawlerRunConfig
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

config = CrawlerRunConfig(
    markdown_generator=DefaultMarkdownGenerator(
        options={"citations": True}  # Enable citations
    )
)
result = await crawler.arun(url="https://example.com", config=config)
print(result.markdown_v2.markdown_with_citations)
print(result.markdown_v2.references_markdown)  # Citations section
```
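To show the idea behind citation-style output, here is a minimal sketch that rewrites inline markdown links as numbered citations and collects the URLs into a separate reference list. It is an illustration only: `to_citations` is a hypothetical function, and the `[n]` marker and `[n]: url` reference format are assumptions that may differ from what Crawl4AI emits.

```python
import re

# Matches [text](url) but skips images, which start with "!".
LINK_RE = re.compile(r"(?<!!)\[([^\]]+)\]\(([^)\s]+)\)")


def to_citations(markdown):
    """Return (body, references) with inline links replaced by numbered markers."""
    numbers = {}  # URL -> citation number, in order of first appearance

    def replace(match):
        text, url = match.group(1), match.group(2)
        number = numbers.setdefault(url, len(numbers) + 1)
        return f"{text}[{number}]"

    body = LINK_RE.sub(replace, markdown)
    references = "\n".join(f"[{n}]: {url}" for url, n in numbers.items())
    return body, references


text = (
    "See [docs](https://example.com/docs) and [home](https://example.com). "
    "Again [docs](https://example.com/docs)."
)
body, references = to_citations(text)
print(body)
print(references)
```

Repeated URLs reuse the same number, so the reference list stays free of duplicates.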