# Extraction & Chunking Strategies API

This page is the API reference for the extraction and chunking strategies in Crawl4AI.

## Extraction Strategies

All extraction strategies inherit from the base `ExtractionStrategy` class and implement two key methods:

- `extract(url: str, html: str) -> List[Dict[str, Any]]`
- `run(url: str, sections: List[str]) -> List[Dict[str, Any]]`
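
To make the two-method contract concrete, here is a minimal sketch of a custom strategy. It deliberately does not import Crawl4AI: `ExtractionStrategyBase` is a hypothetical stand-in for the real `ExtractionStrategy` class, `HeadingCountStrategy` is an invented toy, and the assumption that `run` calls `extract` once per section is illustrative rather than a description of the library's internals.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List


class ExtractionStrategyBase(ABC):
    """Hypothetical stand-in for Crawl4AI's ExtractionStrategy base class."""

    @abstractmethod
    def extract(self, url: str, html: str) -> List[Dict[str, Any]]:
        """Extract records from a single piece of content."""

    def run(self, url: str, sections: List[str]) -> List[Dict[str, Any]]:
        """Assumed behavior: apply extract() to each section and concatenate."""
        results: List[Dict[str, Any]] = []
        for section in sections:
            results.extend(self.extract(url, section))
        return results


class HeadingCountStrategy(ExtractionStrategyBase):
    """Toy strategy: reports how many <h2> tags each section contains."""

    def extract(self, url: str, html: str) -> List[Dict[str, Any]]:
        return [{"url": url, "h2_count": html.lower().count("<h2")}]


strategy = HeadingCountStrategy()
records = strategy.run(
    "https://example.com",
    ["<h2>A</h2><h2>B</h2>", "<p>no headings</p>"],
)
print(records)
```

A subclass written against the real base class would follow the same shape: put the per-chunk logic in `extract` and let `run` fan out over the sections.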

### LLMExtractionStrategy

Extracts structured data from page content using a large language model (LLM).

```python
LLMExtractionStrategy(
    # Provider Settings
    provider: str = DEFAULT_PROVIDER,     # LLM provider (e.g., "ollama/llama2")
    api_token: Optional[str] = None,      # API token

    # Extraction Configuration
    instruction: str = None,              # Custom extraction instruction
    schema: Dict = None,                  # JSON schema for structured data (e.g., from a Pydantic model)
    extraction_type: str = "block",       # "block" or "schema"

    # Chunking Parameters
    chunk_token_threshold: int = 4000,    # Maximum tokens per chunk
    overlap_rate: float = 0.1,            # Overlap between chunks, as a fraction (0.1 = 10%)
    word_token_rate: float = 0.75,        # Word to token conversion rate
    apply_chunking: bool = True,          # Enable/disable chunking

    # API Configuration
    base_url: str = None,                 # Base URL for API
    extra_args: Dict = {},                # Additional provider arguments
    verbose: bool = False                 # Enable verbose logging
)
```
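
The chunking parameters are easier to reason about with a worked example. The sketch below shows one plausible way `chunk_token_threshold`, `overlap_rate`, and `word_token_rate` could interact. It is not the library's implementation: the function name `merge_into_chunks` is hypothetical, and it assumes that estimated tokens equal the word count multiplied by `word_token_rate`.

```python
from typing import List


def merge_into_chunks(
    sections: List[str],
    chunk_token_threshold: int = 4000,
    overlap_rate: float = 0.1,
    word_token_rate: float = 0.75,
) -> List[str]:
    """Greedy word-based chunking under an estimated token budget.

    Assumption: estimated tokens = word count * word_token_rate.
    """
    words = " ".join(sections).split()
    # Largest number of words whose estimated token count fits the budget.
    words_per_chunk = max(1, int(chunk_token_threshold / word_token_rate))
    # Number of words repeated at the start of the next chunk.
    overlap_words = int(words_per_chunk * overlap_rate)
    step = max(1, words_per_chunk - overlap_words)

    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + words_per_chunk]))
        if start + words_per_chunk >= len(words):
            break  # The last window already reached the end of the text.
    return chunks


text = " ".join(f"w{i}" for i in range(25))
chunks = merge_into_chunks(
    [text], chunk_token_threshold=10, overlap_rate=0.2, word_token_rate=1.0
)
for chunk in chunks:
    parts = chunk.split()
    print(len(parts), parts[0], parts[-1])
```

With a budget of 10 tokens, a rate of 1.0, and 20% overlap, each chunk holds up to 10 words and shares its last 2 words with the start of the next chunk.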

### CosineStrategy

Clusters page content by semantic similarity and extracts the most relevant clusters.

```python
CosineStrategy(
    # Content Filtering
    semantic_filter: str = None,        # Topic/keyword filter
    word_count_threshold: int = 10,     # Minimum words per cluster
    sim_threshold: float = 0.3,         # Similarity threshold

    # Clustering Parameters
    max_dist: float = 0.2,              # Maximum cluster distance
    linkage_method: str = 'ward',       # Clustering method
    top_k: int = 3,                     # Top clusters to return

    # Model Configuration
    model_name: str = 'sentence-transformers/all-MiniLM-L6-v2',  # Embedding model

    verbose: bool = False               # Enable verbose logging
)
```
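
As a rough illustration of how the filtering parameters fit together, the sketch below scores text blocks against a query with cosine similarity, then applies `word_count_threshold`, `sim_threshold`, and `top_k`. It makes several assumptions: bag-of-words counts stand in for the sentence-transformer embeddings, the clustering step (`max_dist`, `linkage_method`) is left out entirely, and `filter_blocks` is a hypothetical helper rather than part of Crawl4AI.

```python
import math
from collections import Counter
from typing import List, Tuple


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_blocks(
    blocks: List[str],
    semantic_filter: str,
    word_count_threshold: int = 10,
    sim_threshold: float = 0.3,
    top_k: int = 3,
) -> List[Tuple[float, str]]:
    """Keep the top_k blocks that are long enough and similar enough."""
    query = Counter(semantic_filter.lower().split())
    scored = []
    for block in blocks:
        words = block.lower().split()
        if len(words) < word_count_threshold:
            continue  # Too short to be useful.
        score = cosine_similarity(query, Counter(words))
        if score >= sim_threshold:
            scored.append((score, block))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


blocks = [
    "python web crawler extracts data",
    "cooking pasta with tomato sauce",
    "web crawler",
]
result = filter_blocks(blocks, "web crawler", word_count_threshold=3)
print(result)
```

Raising `sim_threshold` trades recall for precision, while `word_count_threshold` removes fragments such as the two-word block above before any scoring happens.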

### JsonCssExtractionStrategy

Extracts structured data using CSS selectors.

```python
JsonCssExtractionStrategy(
    schema: Dict[str, Any],  # Extraction schema
    verbose: bool = False    # Enable verbose logging
)

# Schema Structure
schema = {
    "name": str,              # Schema name
    "baseSelector": str,      # Base CSS selector (one match per extracted item)
    "fields": [               # List of fields to extract
        {
            "name": str,      # Field name
            "selector": str,  # CSS selector, relative to the base element
            "type": str,      # Field type: "text", "attribute", "html", "regex"
            "attribute": str, # For type="attribute"
            "pattern": str,   # For type="regex"
            "transform": str, # Optional: "lowercase", "uppercase", "strip"
            "default": Any    # Default value if extraction fails
        }
    ]
}
```
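
The per-field options `pattern`, `transform`, and `default` can be read as a small post-processing pipeline applied to whatever the selector matched. The sketch below is one way that pipeline might look. It is not the library's code: `finalize_field` is a hypothetical helper, and the order of the steps (regex, then default, then transform) is an assumption.

```python
import re
from typing import Any, Dict, Optional

# Transform names taken from the schema description above.
TRANSFORMS = {
    "lowercase": str.lower,
    "uppercase": str.upper,
    "strip": str.strip,
}


def finalize_field(raw: Optional[str], field: Dict[str, Any]) -> Any:
    """Post-process the raw text matched by a field's selector."""
    value = raw
    if value is not None and field.get("type") == "regex":
        match = re.search(field["pattern"], value)
        if match is None:
            value = None
        else:
            # Prefer the first capture group when the pattern defines one.
            value = match.group(1) if match.groups() else match.group(0)
    if value is None:
        # Nothing matched: fall back to the field's default, if any.
        return field.get("default")
    transform = field.get("transform")
    if transform in TRANSFORMS:
        value = TRANSFORMS[transform](value)
    return value


price_text = finalize_field("  $19.99 ", {"name": "price", "type": "text", "transform": "strip"})
price_number = finalize_field("Price: $19.99", {"name": "price", "type": "regex", "pattern": r"\$(\d+\.\d+)"})
missing = finalize_field(None, {"name": "rating", "type": "text", "default": "n/a"})
print(price_text, price_number, missing)
```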

## Chunking Strategies

All chunking strategies inherit from `ChunkingStrategy` and implement the `chunk(text: str) -> list` method.

### RegexChunking

Splits text based on regex patterns.

```python
RegexChunking(
    patterns: List[str] = None  # Regex patterns for splitting
                                # Default: [r'\n\n']
)
```
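
A minimal sketch of regex-based splitting, using only the standard library. The function `regex_chunk` is a hypothetical stand-in for `RegexChunking.chunk`. Applying the patterns one after another and dropping whitespace-only chunks are assumptions made for this example.

```python
import re
from typing import List, Optional


def regex_chunk(text: str, patterns: Optional[List[str]] = None) -> List[str]:
    """Split text on each pattern in turn, discarding empty chunks."""
    patterns = patterns or [r"\n\n"]  # Same default as documented above.
    chunks = [text]
    for pattern in patterns:
        next_chunks: List[str] = []
        for chunk in chunks:
            next_chunks.extend(re.split(pattern, chunk))
        chunks = next_chunks
    return [chunk for chunk in chunks if chunk.strip()]


paragraphs = regex_chunk("Intro paragraph.\n\nSecond paragraph.\n\n\n\nThird.")
print(paragraphs)

# Splitting on paragraph breaks first, then on sentence boundaries.
sentences = regex_chunk("A. B\n\nC", patterns=[r"\n\n", r"\. "])
print(sentences)
```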

### SlidingWindowChunking

Creates overlapping chunks with a sliding window approach.

```python
SlidingWindowChunking(
    window_size: int = 100,  # Window size in words
    step: int = 50           # Step size between windows
)
```
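
The sliding-window idea can be sketched in a few lines: take `window_size` words, advance by `step` words, and repeat. The helper `sliding_window_chunk` below is hypothetical and only illustrates the technique. How the real class treats the final partial window is not specified here, so stopping once a window reaches the end of the text is an assumption.

```python
from typing import List


def sliding_window_chunk(text: str, window_size: int = 100, step: int = 50) -> List[str]:
    """Return word windows of window_size, each starting step words after the last."""
    if window_size <= 0 or step <= 0:
        raise ValueError("window_size and step must be positive")
    words = text.split()
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + window_size]))
        if start + window_size >= len(words):
            break  # This window already covers the end of the text.
    return chunks


windows = sliding_window_chunk(" ".join(f"w{i}" for i in range(10)), window_size=4, step=2)
print(windows)
```

When `step` is smaller than `window_size`, consecutive windows share `window_size - step` words. When the two are equal, the chunks do not overlap.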

### OverlappingWindowChunking

Creates chunks with specified overlap.

```python
OverlappingWindowChunking(
    window_size: int = 1000,  # Chunk size in words
    overlap: int = 100        # Overlap size in words
)
```
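
Here the overlap is stated directly instead of being implied by a step size. The sketch below shows that parameterization: each chunk starts `overlap` words before the previous chunk ended. `overlapping_window_chunk` is a hypothetical helper, not the library's method, and rejecting `overlap >= window_size` is a choice made for this example so that the loop always advances.

```python
from typing import List


def overlapping_window_chunk(text: str, window_size: int = 1000, overlap: int = 100) -> List[str]:
    """Return word chunks of window_size that share overlap words with their neighbor."""
    if window_size <= 0 or overlap < 0 or overlap >= window_size:
        raise ValueError("require window_size > 0 and 0 <= overlap < window_size")
    words = text.split()
    chunks: List[str] = []
    start = 0
    while start < len(words):
        end = start + window_size
        chunks.append(" ".join(words[start:end]))
        if end >= len(words):
            break
        start = end - overlap  # Step back so the next chunk repeats the tail.
    return chunks


overlapped = overlapping_window_chunk(" ".join(f"w{i}" for i in range(10)), window_size=4, overlap=1)
print(overlapped)
```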

## Usage Examples

### LLM Extraction

```python
import asyncio
import json

from pydantic import BaseModel
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy

# Define schema
class Article(BaseModel):
    title: str
    content: str
    author: str

# Create strategy
strategy = LLMExtractionStrategy(
    provider="ollama/llama2",
    schema=Article.schema(),   # Pydantic v2 also offers Article.model_json_schema()
    extraction_type="schema",  # Use the schema instead of the default "block" mode
    instruction="Extract article details"
)

async def main():
    # Use with crawler
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com/article",
            extraction_strategy=strategy
        )

    # Access extracted data
    data = json.loads(result.extracted_content)
    print(data)

asyncio.run(main())
```

### CSS Extraction

```python
import asyncio
import json

from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

# Define schema
schema = {
    "name": "Product List",
    "baseSelector": ".product-card",
    "fields": [
        {
            "name": "title",
            "selector": "h2.title",
            "type": "text"
        },
        {
            "name": "price",
            "selector": ".price",
            "type": "text",
            "transform": "strip"
        },
        {
            "name": "image",
            "selector": "img",
            "type": "attribute",
            "attribute": "src"
        }
    ]
}

async def main():
    # Create and use strategy
    strategy = JsonCssExtractionStrategy(schema)
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com/products",
            extraction_strategy=strategy
        )

    # Access extracted data
    products = json.loads(result.extracted_content)
    print(products)

asyncio.run(main())
```

### Content Chunking

```python
import asyncio

from crawl4ai import AsyncWebCrawler
from crawl4ai.chunking_strategy import OverlappingWindowChunking
from crawl4ai.extraction_strategy import LLMExtractionStrategy

# Create chunking strategy
chunker = OverlappingWindowChunking(
    window_size=500,  # 500 words per chunk
    overlap=50        # 50 words overlap
)

# Use with extraction strategy
strategy = LLMExtractionStrategy(
    provider="ollama/llama2",
    chunking_strategy=chunker
)

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com/long-article",
            extraction_strategy=strategy
        )

asyncio.run(main())
```

## Best Practices

1. **Choose the Right Strategy**
   - Use `LLMExtractionStrategy` for complex, unstructured content
   - Use `JsonCssExtractionStrategy` for well-structured HTML
   - Use `CosineStrategy` for content similarity and clustering

2. **Optimize Chunking**
   ```python
   # For long documents
   strategy = LLMExtractionStrategy(
       chunk_token_threshold=2000,  # Smaller chunks
       overlap_rate=0.1             # 10% overlap
   )
   ```

3. **Handle Errors**
   ```python
   import json

   try:
       result = await crawler.arun(
           url="https://example.com",
           extraction_strategy=strategy
       )
       if result.success:
           content = json.loads(result.extracted_content)
       else:
           # The crawl finished but reported a failure
           print(f"Crawl failed: {result.error_message}")
   except Exception as e:
       print(f"Extraction failed: {e}")
   ```

4. **Monitor Performance**
   ```python
   strategy = CosineStrategy(
       verbose=True,             # Enable logging
       word_count_threshold=20,  # Filter short content
       top_k=5                   # Limit results
   )
   ```