# Complete Parameter Guide for arun()

The following parameters can be passed to the `arun()` method. They are grouped by primary usage context and functionality.
## Core Parameters

```python
await crawler.arun(
    url="https://example.com",     # Required: URL to crawl
    verbose=True,                  # Enable detailed logging
    cache_mode=CacheMode.ENABLED,  # Control cache behavior
    warmup=True                    # Whether to run warmup check
)
```
## Cache Control

```python
from crawl4ai import CacheMode

await crawler.arun(
    cache_mode=CacheMode.ENABLED,        # Normal caching (read/write)
    # Other cache modes:
    # cache_mode=CacheMode.DISABLED      # No caching at all
    # cache_mode=CacheMode.READ_ONLY     # Only read from cache
    # cache_mode=CacheMode.WRITE_ONLY    # Only write to cache
    # cache_mode=CacheMode.BYPASS        # Skip cache for this operation
)
```
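The comments above imply a small read/write matrix. The sketch below models it with a stand-in enum; `SketchCacheMode`, `should_read`, and `should_write` are hypothetical names for illustration, not crawl4ai's API, and it assumes `BYPASS` neither reads nor writes for the current call.

```python
from enum import Enum


class SketchCacheMode(Enum):
    """Stand-in for illustration only; not crawl4ai's CacheMode."""
    ENABLED = "enabled"
    DISABLED = "disabled"
    READ_ONLY = "read_only"
    WRITE_ONLY = "write_only"
    BYPASS = "bypass"


def should_read(mode: SketchCacheMode) -> bool:
    # Only modes that allow reading consult the cache before fetching.
    return mode in (SketchCacheMode.ENABLED, SketchCacheMode.READ_ONLY)


def should_write(mode: SketchCacheMode) -> bool:
    # Only modes that allow writing store the fresh result afterwards.
    return mode in (SketchCacheMode.ENABLED, SketchCacheMode.WRITE_ONLY)


for mode in SketchCacheMode:
    print(f"{mode.name:<10} read={should_read(mode)} write={should_write(mode)}")
```

Under this reading, `WRITE_ONLY` always fetches a fresh page and then stores it, which is why it suits pages whose content changes between crawls.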
## Content Processing Parameters

### Text Processing

```python
await crawler.arun(
    word_count_threshold=10,                 # Minimum words per content block
    image_description_min_word_threshold=5,  # Minimum words for image descriptions
    only_text=False,                         # Extract only text content
    excluded_tags=['form', 'nav'],           # HTML tags to exclude
    keep_data_attributes=False,              # Preserve data-* attributes
)
```
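To make the effect of `word_count_threshold` concrete, here is a minimal sketch of a word-count filter over text blocks. `filter_blocks` is a hypothetical helper, not part of crawl4ai, and it assumes words are whitespace-separated and that the threshold is inclusive.

```python
def filter_blocks(blocks: list[str], word_count_threshold: int) -> list[str]:
    """Keep blocks with at least `word_count_threshold` whitespace-separated words."""
    return [b for b in blocks if len(b.split()) >= word_count_threshold]


blocks = [
    "Home | About | Contact",  # short navigation-style block
    "This paragraph has clearly more than ten words, so it survives the threshold filter.",
]

kept = filter_blocks(blocks, word_count_threshold=10)
print(kept)  # only the long paragraph remains
```

A higher threshold removes more boilerplate but can also drop short, meaningful blocks such as captions.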
### Content Selection

```python
await crawler.arun(
    css_selector=".main-content",  # CSS selector for content extraction
    remove_forms=True,             # Remove all form elements
    remove_overlay_elements=True,  # Remove popups/modals/overlays
)
```
### Link Handling

```python
await crawler.arun(
    exclude_external_links=True,          # Remove external links
    exclude_social_media_links=True,      # Remove social media links
    exclude_external_images=True,         # Remove external images
    exclude_domains=["ads.example.com"],  # Specific domains to exclude
    social_media_domains=[                # Additional social media domains
        "facebook.com",
        "twitter.com",
        "instagram.com"
    ]
)
```
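The link filters above come down to comparing hosts. The following sketch shows one plausible way to classify links with the standard library. `is_external` and `in_domains` are hypothetical helpers, and the subdomain-matching rule is an assumption rather than crawl4ai's documented behavior.

```python
from urllib.parse import urlparse


def is_external(link: str, base_url: str) -> bool:
    """True when the link's host differs from the base page's host."""
    return urlparse(link).netloc.lower() != urlparse(base_url).netloc.lower()


def in_domains(link: str, domains: list[str]) -> bool:
    """True when the link's host equals, or is a subdomain of, a listed domain."""
    host = urlparse(link).hostname or ""
    return any(host == d or host.endswith("." + d) for d in domains)


base = "https://example.com/page"
links = [
    "https://example.com/about",
    "https://ads.example.com/banner",
    "https://twitter.com/someone",
]
excluded = ["ads.example.com"]
social = ["facebook.com", "twitter.com", "instagram.com"]

kept = [
    link for link in links
    if not is_external(link, base) and not in_domains(link, excluded + social)
]
print(kept)
```

Note that this sketch treats relative links as external because they carry no host; resolve them against the base URL first (for example with `urllib.parse.urljoin`) if that matters.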
## Browser Control Parameters

### Basic Browser Settings

```python
await crawler.arun(
    headless=True,              # Run browser in headless mode
    browser_type="chromium",    # Browser engine: "chromium", "firefox", "webkit"
    page_timeout=60000,         # Page load timeout in milliseconds
    user_agent="custom-agent",  # Custom user agent
)
```
### Navigation and Waiting

```python
await crawler.arun(
    wait_for="css:.dynamic-content",  # Wait for element/condition
    delay_before_return_html=2.0,     # Wait before returning HTML (seconds)
)
```
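The timing parameters use different units: `page_timeout` is in milliseconds, while `delay_before_return_html` is in seconds. One way to avoid mixing them up is to express every duration as a `timedelta` and convert at the call site. The helper names below are hypothetical.

```python
from datetime import timedelta


def to_page_timeout_ms(duration: timedelta) -> int:
    """Convert a duration to whole milliseconds, the unit page_timeout expects."""
    return int(duration.total_seconds() * 1000)


def to_delay_seconds(duration: timedelta) -> float:
    """Convert a duration to seconds, the unit delay_before_return_html expects."""
    return duration.total_seconds()


timing_kwargs = {
    "page_timeout": to_page_timeout_ms(timedelta(minutes=1)),
    "delay_before_return_html": to_delay_seconds(timedelta(seconds=2)),
}
print(timing_kwargs)
```

The resulting dictionary can be unpacked into the call as `**timing_kwargs`.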
### JavaScript Execution

```python
await crawler.arun(
    js_code=[       # JavaScript to execute (string or list)
        "window.scrollTo(0, document.body.scrollHeight);",
        "document.querySelector('.load-more').click();"
    ],
    js_only=False,  # Only execute JavaScript without reloading page
)
```
### Anti-Bot Features

```python
await crawler.arun(
    magic=True,              # Enable all anti-detection features
    simulate_user=True,      # Simulate human behavior
    override_navigator=True  # Override navigator properties
)
```
### Session Management

```python
await crawler.arun(
    session_id="my_session",  # Session identifier for persistent browsing
)
```
### Screenshot Options

```python
await crawler.arun(
    screenshot=True,          # Take page screenshot
    screenshot_wait_for=2.0,  # Wait before screenshot (seconds)
)
```
### Proxy Configuration

```python
await crawler.arun(
    proxy="http://proxy.example.com:8080",  # Simple proxy URL
    proxy_config={                          # Advanced proxy settings
        "server": "http://proxy.example.com:8080",
        "username": "user",
        "password": "pass"
    }
)
```
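If proxy credentials arrive embedded in a single URL, they can be split into the `server`, `username`, and `password` keys shown above. This is a minimal sketch using only the standard library. `proxy_config_from_url` is a hypothetical helper, not part of crawl4ai.

```python
from urllib.parse import unquote, urlsplit


def proxy_config_from_url(proxy_url: str) -> dict:
    """Split scheme://user:pass@host:port into a proxy_config-style dict."""
    parts = urlsplit(proxy_url)
    if not parts.scheme or not parts.hostname:
        raise ValueError(f"not a usable proxy URL: {proxy_url!r}")

    server = f"{parts.scheme}://{parts.hostname}"
    if parts.port:
        server += f":{parts.port}"

    config = {"server": server}
    # urlsplit does not percent-decode credentials, so decode them here.
    if parts.username:
        config["username"] = unquote(parts.username)
    if parts.password:
        config["password"] = unquote(parts.password)
    return config


print(proxy_config_from_url("http://user:pass@proxy.example.com:8080"))
```

Keeping credentials out of the `server` string also avoids leaking them when the server address is logged.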
## Content Extraction Parameters

### Extraction Strategy

```python
from crawl4ai.extraction_strategy import LLMExtractionStrategy

# MySchema is a user-defined Pydantic model describing the data to extract
await crawler.arun(
    extraction_strategy=LLMExtractionStrategy(
        provider="ollama/llama2",
        schema=MySchema.schema(),
        instruction="Extract specific data"
    )
)
```
### Chunking Strategy

```python
from crawl4ai.chunking_strategy import RegexChunking

await crawler.arun(
    chunking_strategy=RegexChunking(
        patterns=[r'\n\n', r'\.\s+']  # Split on blank lines, then on sentence ends
    )
)
```
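To see what those two patterns do to a piece of text, the sketch below applies each pattern in turn with `re.split`. `regex_chunk` is a hypothetical stand-in written for illustration; it is assumed to resemble, but is not, the library's `RegexChunking` implementation.

```python
import re


def regex_chunk(text: str, patterns: list[str]) -> list[str]:
    """Split text on each pattern in turn, dropping empty pieces."""
    chunks = [text]
    for pattern in patterns:
        next_chunks = []
        for chunk in chunks:
            next_chunks.extend(re.split(pattern, chunk))
        chunks = next_chunks
    return [c.strip() for c in chunks if c.strip()]


text = "First sentence. Second sentence.\n\nNew paragraph here."
chunks = regex_chunk(text, [r"\n\n", r"\.\s+"])
print(chunks)
# ['First sentence', 'Second sentence.', 'New paragraph here.']
```

Because `re.split` consumes the matched separator, a period followed by whitespace is dropped from the chunk, while a period at the very end of a paragraph is kept.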
### HTML to Text Options

```python
await crawler.arun(
    html2text={
        "ignore_links": False,
        "ignore_images": False,
        "escape_dot": False,
        "body_width": 0,
        "protect_links": True,
        "unicode_snob": True
    }
)
```
## Debug Options

```python
await crawler.arun(
    log_console=True,  # Log browser console messages
)
```
## Parameter Interactions and Notes

1. **Cache and Performance Setup**

   ```python
   # Optimal caching for repeated crawls
   await crawler.arun(
       cache_mode=CacheMode.ENABLED,
       word_count_threshold=10,
       process_iframes=False
   )
   ```

2. **Dynamic Content Handling**

   ```python
   # Handle lazy-loaded content
   await crawler.arun(
       js_code="window.scrollTo(0, document.body.scrollHeight);",
       wait_for="css:.lazy-content",
       delay_before_return_html=2.0,
       cache_mode=CacheMode.WRITE_ONLY  # Cache results after dynamic load
   )
   ```

3. **Content Extraction Pipeline**

   ```python
   # Complete extraction setup
   await crawler.arun(
       css_selector=".main-content",
       word_count_threshold=20,
       extraction_strategy=my_strategy,
       chunking_strategy=my_chunking,
       process_iframes=True,
       remove_overlay_elements=True,
       cache_mode=CacheMode.ENABLED
   )
   ```
## Best Practices

1. **Performance Optimization**

   ```python
   await crawler.arun(
       cache_mode=CacheMode.ENABLED,  # Use full caching
       word_count_threshold=10,       # Filter out noise
       process_iframes=False          # Skip iframes if not needed
   )
   ```

2. **Reliable Scraping**

   ```python
   await crawler.arun(
       magic=True,                      # Enable anti-detection
       delay_before_return_html=1.0,    # Wait for dynamic content
       page_timeout=60000,              # Longer timeout for slow pages
       cache_mode=CacheMode.WRITE_ONLY  # Cache results after successful crawl
   )
   ```

3. **Clean Content**

   ```python
   await crawler.arun(
       remove_overlay_elements=True,    # Remove popups
       excluded_tags=['nav', 'aside'],  # Remove unnecessary elements
       keep_data_attributes=False,      # Remove data attributes
       cache_mode=CacheMode.ENABLED     # Use cache for faster processing
   )
   ```
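The three presets above set `cache_mode` differently, so combining them requires deciding which value wins. A minimal sketch of one approach is to keep each preset as a dictionary and merge them in order, with later presets overriding earlier ones. `merge_presets` and the preset names are hypothetical, and the cache modes are written as plain strings here only to keep the sketch self-contained.

```python
def merge_presets(*presets: dict) -> dict:
    """Merge keyword-argument presets; later presets override earlier ones."""
    merged: dict = {}
    for preset in presets:
        merged.update(preset)
    return merged


# String placeholders stand in for CacheMode members in this sketch.
PERFORMANCE = {
    "cache_mode": "ENABLED",
    "word_count_threshold": 10,
    "process_iframes": False,
}
RELIABLE = {
    "magic": True,
    "delay_before_return_html": 1.0,
    "page_timeout": 60000,
    "cache_mode": "WRITE_ONLY",
}

crawl_kwargs = merge_presets(PERFORMANCE, RELIABLE)
print(crawl_kwargs["cache_mode"])  # WRITE_ONLY, because RELIABLE was merged last
```

With real `CacheMode` values in place of the strings, the merged dictionary could be passed as `await crawler.arun(url=..., **crawl_kwargs)`.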