# Extraction Strategies Overview

Crawl4AI provides several extraction strategies for pulling structured data out of web pages. Each strategy targets a different kind of content and takes a different approach to extraction.

## Available Strategies

### [LLM-Based Extraction](llm.md)

`LLMExtractionStrategy` uses a language model to extract structured data from web content. This approach is highly flexible and can interpret content semantically.

```python
from pydantic import BaseModel
from crawl4ai.extraction_strategy import LLMExtractionStrategy

class Product(BaseModel):
    name: str
    price: float
    description: str

strategy = LLMExtractionStrategy(
    provider="ollama/llama2",
    schema=Product.schema(),  # JSON schema the model's output should follow
    instruction="Extract product details from the page"
)

# `crawler` is an already-open AsyncWebCrawler; run this inside an async function
result = await crawler.arun(
    url="https://example.com/product",
    extraction_strategy=strategy
)
```

**Best for:**

- Complex data structures
- Content requiring interpretation
- Flexible content formats
- Natural language processing

### [CSS-Based Extraction](css.md)

`JsonCssExtractionStrategy` extracts data using CSS selectors. It is fast and deterministic, which makes it a good fit for pages with a consistent structure.

```python
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

schema = {
    "name": "Product Listing",
    "baseSelector": ".product-card",
    "fields": [
        {"name": "title", "selector": "h2", "type": "text"},
        {"name": "price", "selector": ".price", "type": "text"},
        {"name": "image", "selector": "img", "type": "attribute", "attribute": "src"}
    ]
}

strategy = JsonCssExtractionStrategy(schema)

result = await crawler.arun(
    url="https://example.com/products",
    extraction_strategy=strategy
)
```

**Best for:**

- E-commerce product listings
- News article collections
- Structured content pages
- High-performance needs
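
Because every field in the schema above is declared with `"type": "text"`, values such as prices arrive as strings. The sketch below shows one way to post-process them. It assumes `result.extracted_content` is a JSON array of objects keyed by the schema's field names; `parse_price` and the sample rows are hypothetical and not part of Crawl4AI.

```python
import json
import re
from decimal import Decimal
from typing import Optional

def parse_price(text: str) -> Optional[Decimal]:
    """Pull the first number out of a price string such as '$1,299.00'.

    Assumes ',' is a thousands separator and '.' is the decimal point.
    """
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        return None
    return Decimal(match.group(0).replace(",", ""))

# Hypothetical stand-in for result.extracted_content
extracted_content = json.dumps([
    {"title": "Laptop", "price": "$1,299.00", "image": "/img/laptop.png"},
    {"title": "Mouse", "price": "Call for price", "image": "/img/mouse.png"},
])

products = json.loads(extracted_content)
for product in products:
    # Replace the raw text with a Decimal, or None when no number is present
    product["price"] = parse_price(product["price"])
```

Using `Decimal` rather than `float` avoids rounding surprises when prices are later summed or compared.
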

### [Cosine Strategy](cosine.md)

`CosineStrategy` uses similarity-based clustering to identify and extract relevant content sections.

```python
from crawl4ai.extraction_strategy import CosineStrategy

strategy = CosineStrategy(
    semantic_filter="product reviews",  # Content focus
    word_count_threshold=10,            # Minimum words per cluster
    sim_threshold=0.3,                  # Similarity threshold
    max_dist=0.2,                       # Maximum cluster distance
    top_k=3                             # Number of top clusters to extract
)

result = await crawler.arun(
    url="https://example.com/reviews",
    extraction_strategy=strategy
)
```

**Best for:**

- Content similarity analysis
- Topic clustering
- Relevant content extraction
- Pattern recognition in text
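
To make the roles of `semantic_filter`, `word_count_threshold`, `sim_threshold`, and `top_k` more concrete, here is a simplified, standard-library sketch of similarity-based filtering. It is not Crawl4AI's implementation: it scores plain bag-of-words vectors and skips the clustering step entirely (so `max_dist` has no counterpart here). `rank_chunks` and the sample text are hypothetical.

```python
import math
import re
from collections import Counter

def vectorize(text):
    """Lower-cased bag-of-words term counts."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))

def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank_chunks(chunks, semantic_filter, word_count_threshold=10,
                sim_threshold=0.3, top_k=3):
    """Keep the top_k chunks that are long enough and similar enough to the filter."""
    query = vectorize(semantic_filter)
    scored = []
    for chunk in chunks:
        if len(chunk.split()) < word_count_threshold:
            continue  # too short to be meaningful
        score = cosine_similarity(query, vectorize(chunk))
        if score >= sim_threshold:
            scored.append((score, chunk))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k]]

chunks = [
    "Great product reviews say the battery lasts two days and the screen is sharp",
    "Sign up for our newsletter",
    "Shipping and returns policy for all orders placed in the online store this year",
]
relevant = rank_chunks(chunks, "product reviews", top_k=2)
```

Raising `sim_threshold` makes the filter stricter, while `top_k` caps how many matching sections are returned.
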

## Strategy Selection Guide

Choose your strategy based on these factors:

1. **Content Structure**
   - Well-structured HTML → Use CSS Strategy
   - Natural language text → Use LLM Strategy
   - Mixed/Complex content → Use Cosine Strategy

2. **Performance Requirements**
   - Fastest: CSS Strategy
   - Moderate: Cosine Strategy
   - Variable: LLM Strategy (depends on provider)

3. **Accuracy Needs**
   - Highest structure accuracy: CSS Strategy
   - Best semantic understanding: LLM Strategy
   - Best content relevance: Cosine Strategy
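
The first factor in this guide can be written down as a small lookup, which is handy when a pipeline has to pick a strategy per site. This is only a sketch: `recommend_strategy` and the category keys are hypothetical names, and the function returns the class name as a string rather than constructing a strategy.

```python
# Mapping taken from the "Content Structure" factor in the guide
RECOMMENDATIONS = {
    "structured_html": "JsonCssExtractionStrategy",
    "natural_language": "LLMExtractionStrategy",
    "mixed": "CosineStrategy",
}

def recommend_strategy(content_structure: str) -> str:
    """Return the suggested strategy class name for a kind of content."""
    try:
        return RECOMMENDATIONS[content_structure]
    except KeyError:
        valid = ", ".join(sorted(RECOMMENDATIONS))
        raise ValueError(
            f"Unknown content structure {content_structure!r}; expected one of: {valid}"
        ) from None
```
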

## Combining Strategies

You can run more than one strategy against the same page and combine the results yourself, for example CSS extraction for the page structure followed by an LLM pass for semantic analysis:

```python
# css_strategy and llm_strategy are strategy instances built as shown above

# First use CSS strategy for initial structure
css_result = await crawler.arun(
    url="https://example.com",
    extraction_strategy=css_strategy
)

# Then use LLM for semantic analysis
llm_result = await crawler.arun(
    url="https://example.com",
    extraction_strategy=llm_strategy
)
```
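
The snippet above runs two independent crawls; merging their outputs is left to the caller. One possible way to do that is sketched below, assuming both `extracted_content` values are JSON arrays of objects and that records can be matched on a shared key (`name` here). The `merge_records` helper and the sample payloads are hypothetical.

```python
import json

def merge_records(css_json, llm_json, key="name"):
    """Attach LLM-derived fields to CSS-extracted rows that share the same key."""
    css_rows = json.loads(css_json)
    llm_rows = {row[key]: row for row in json.loads(llm_json) if key in row}
    merged = []
    for row in css_rows:
        extra = llm_rows.get(row.get(key), {})
        # On conflicting fields the CSS value wins, since it comes straight from the markup
        merged.append({**extra, **row})
    return merged

# Hypothetical stand-ins for css_result.extracted_content and llm_result.extracted_content
css_json = json.dumps([
    {"name": "Laptop", "price": "$999"},
    {"name": "Mouse", "price": "$25"},
])
llm_json = json.dumps([
    {"name": "Laptop", "sentiment": "positive", "price": "about 1000 dollars"},
])

combined = merge_records(css_json, llm_json)
```

Rows without an LLM counterpart pass through unchanged, so the CSS result remains the source of truth for which records exist.
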

## Common Use Cases

1. **E-commerce Scraping**

   ```python
   # CSS Strategy for product listings
   schema = {
       "name": "Products",
       "baseSelector": ".product",
       "fields": [
           {"name": "name", "selector": ".title", "type": "text"},
           {"name": "price", "selector": ".price", "type": "text"}
       ]
   }
   ```

2. **News Article Extraction**

   ```python
   from pydantic import BaseModel
   from crawl4ai.extraction_strategy import LLMExtractionStrategy

   # LLM Strategy for article content
   class Article(BaseModel):
       title: str
       content: str
       author: str
       date: str

   strategy = LLMExtractionStrategy(
       provider="ollama/llama2",
       schema=Article.schema()
   )
   ```

3. **Content Analysis**

   ```python
   from crawl4ai.extraction_strategy import CosineStrategy

   # Cosine Strategy for topic analysis
   strategy = CosineStrategy(
       semantic_filter="technology trends",
       top_k=5
   )
   ```

## Input Formats

All extraction strategies accept an `input_format` option that controls which representation of the page they process:

- **markdown** (default): Uses the raw markdown conversion of the HTML content. Best for general text extraction where HTML structure isn't critical.
- **html**: Uses the raw HTML content. Useful when you need to preserve HTML structure or extract data from specific HTML elements.
- **fit_markdown**: Uses the cleaned and filtered markdown content. Best for extracting relevant content while removing noise. Requires a markdown generator with a content filter to be configured.

To specify an input format:

```python
strategy = LLMExtractionStrategy(
    input_format="html",  # or "markdown" or "fit_markdown"
    provider="openai/gpt-4",
    instruction="Extract product information"
)
```

Note: When using "fit_markdown", ensure your `CrawlerRunConfig` includes a markdown generator with a content filter:

```python
from crawl4ai import CrawlerRunConfig
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

config = CrawlerRunConfig(
    extraction_strategy=strategy,
    markdown_generator=DefaultMarkdownGenerator(
        content_filter=PruningContentFilter()  # Content filter goes here for fit_markdown
    )
)
```

If `fit_markdown` is requested but not available (no markdown generator or content filter), the system will automatically fall back to raw markdown with a warning.
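
The fallback rule described above can be pictured as follows. This is an illustrative sketch of the documented behaviour, not Crawl4AI's internal code; `select_input` and its arguments are hypothetical.

```python
import warnings
from typing import Optional

def select_input(input_format: str, html: str, raw_markdown: str,
                 fit_markdown: Optional[str] = None) -> str:
    """Pick the text handed to the extraction strategy for a given input_format."""
    if input_format == "html":
        return html
    if input_format == "markdown":
        return raw_markdown
    if input_format == "fit_markdown":
        if fit_markdown:
            return fit_markdown
        # No filtered markdown was produced: warn and use the raw conversion instead
        warnings.warn(
            "fit_markdown is not available; falling back to raw markdown",
            stacklevel=2,
        )
        return raw_markdown
    raise ValueError(f"Unknown input_format: {input_format!r}")
```
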

## Best Practices

1. **Choose the Right Strategy**
   - Start with CSS for structured data
   - Use LLM for complex interpretation
   - Try Cosine for content relevance

2. **Optimize Performance**
   - Cache LLM results
   - Keep CSS selectors specific
   - Tune similarity thresholds

3. **Handle Errors**

   ```python
   import json

   result = await crawler.arun(
       url="https://example.com",
       extraction_strategy=strategy
   )

   if not result.success:
       print(f"Extraction failed: {result.error_message}")
   else:
       # extracted_content is a JSON string
       data = json.loads(result.extracted_content)
   ```
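
The `else` branch above assumes `extracted_content` always holds valid JSON. A slightly more defensive variant is sketched below. `parse_extraction` is a hypothetical helper, and the `SimpleNamespace` objects only stand in for a crawl result carrying the three attributes used in the snippet above (`success`, `error_message`, `extracted_content`).

```python
import json
from types import SimpleNamespace

def parse_extraction(result):
    """Return the parsed data, or None when the crawl or the parsing failed."""
    if not result.success:
        print(f"Extraction failed: {result.error_message}")
        return None
    if not result.extracted_content:
        print("Crawl succeeded but nothing was extracted")
        return None
    try:
        return json.loads(result.extracted_content)
    except json.JSONDecodeError as exc:
        print(f"Extracted content is not valid JSON: {exc}")
        return None

# Stand-in result objects for illustration only
ok = SimpleNamespace(success=True, error_message=None,
                     extracted_content='[{"title": "Hello"}]')
malformed = SimpleNamespace(success=True, error_message=None,
                            extracted_content="not json")
failed = SimpleNamespace(success=False, error_message="timeout",
                         extracted_content=None)

data = parse_extraction(ok)
```

Returning `None` for every failure mode keeps the calling code to a single check before it uses the data.
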

Each strategy has its strengths and optimal use cases. Explore the detailed documentation for each strategy to learn more about their specific features and configurations.