# Browser Configuration

Crawl4AI supports multiple browser engines and offers extensive configuration options for browser behavior.

## Browser Types

Choose from three browser engines:

```python
from crawl4ai import AsyncWebCrawler

# Chromium (default)
async with AsyncWebCrawler(browser_type="chromium") as crawler:
    result = await crawler.arun(url="https://example.com")

# Firefox
async with AsyncWebCrawler(browser_type="firefox") as crawler:
    result = await crawler.arun(url="https://example.com")

# WebKit
async with AsyncWebCrawler(browser_type="webkit") as crawler:
    result = await crawler.arun(url="https://example.com")
```
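The snippets throughout this page assume they run inside a coroutine. A minimal, complete script wrapping the default Chromium example might look like this:

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(browser_type="chromium") as crawler:
        result = await crawler.arun(url="https://example.com")
        print(result.markdown[:300])  # Preview the extracted markdown

if __name__ == "__main__":
    asyncio.run(main())
```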
## Basic Configuration

Common browser settings:

```python
async with AsyncWebCrawler(
    headless=True,          # Run in headless mode (no GUI)
    verbose=True,           # Enable detailed logging
    sleep_on_close=False    # No delay when closing the browser
) as crawler:
    result = await crawler.arun(url="https://example.com")
```
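During development it can help to watch the browser work; a short sketch flipping the flags above:

```python
# Show the browser window while debugging selectors and timing
async with AsyncWebCrawler(headless=False, verbose=True) as crawler:
    result = await crawler.arun(url="https://example.com")
```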
## Identity Management

Control how your crawler appears to websites:

```python
# Custom user agent
async with AsyncWebCrawler(
    user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
) as crawler:
    result = await crawler.arun(url="https://example.com")

# Custom headers
headers = {
    "Accept-Language": "en-US,en;q=0.9",
    "Cache-Control": "no-cache"
}
async with AsyncWebCrawler(headers=headers) as crawler:
    result = await crawler.arun(url="https://example.com")
```
## Screenshot Capabilities

Capture page screenshots; the screenshot is returned as a Base64-encoded image:

```python
import base64

result = await crawler.arun(
    url="https://example.com",
    screenshot=True,           # Enable screenshot capture
    screenshot_wait_for=2.0    # Wait 2 seconds before capturing
)

if result.screenshot:  # Base64-encoded image
    with open("screenshot.png", "wb") as f:
        f.write(base64.b64decode(result.screenshot))
```
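If a capture can fail, it may help to check the result before decoding. A minimal sketch, assuming the result object exposes an `error_message` field alongside the `success` field shown in the comprehensive example below (`error_message` is an assumption; verify against your version):

```python
import base64

result = await crawler.arun(url="https://example.com", screenshot=True)

if result.success and result.screenshot:
    with open("screenshot.png", "wb") as f:
        f.write(base64.b64decode(result.screenshot))
else:
    # error_message is an assumed field; adjust to whatever your version exposes
    print("Screenshot failed:", getattr(result, "error_message", "unknown error"))
```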
## Timeouts and Waiting

Control page loading behavior:

```python
result = await crawler.arun(
    url="https://example.com",
    page_timeout=60000,              # Page load timeout (ms)
    delay_before_return_html=2.0,    # Wait before capturing content
    wait_for="css:.dynamic-content"  # Wait for a specific element
)
```
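Besides the `css:` prefix, `wait_for` also accepts a JavaScript predicate with the `js:` prefix (used again under Handling Dynamic Content below); the predicate here is illustrative:

```python
# Wait until a JavaScript condition becomes true instead of a CSS selector
result = await crawler.arun(
    url="https://example.com",
    wait_for="js:() => document.readyState === 'complete'"
)
```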
## JavaScript Execution

Execute custom JavaScript before crawling:

```python
# Single JavaScript command
result = await crawler.arun(
    url="https://example.com",
    js_code="window.scrollTo(0, document.body.scrollHeight);"
)

# Multiple commands (optional chaining avoids an error if the element is missing)
js_commands = [
    "window.scrollTo(0, document.body.scrollHeight);",
    "document.querySelector('.load-more')?.click();"
]
result = await crawler.arun(
    url="https://example.com",
    js_code=js_commands
)
```
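Because `js_code` runs before content capture, it pairs naturally with `wait_for` so the crawler waits for whatever the script triggers; a sketch combining the two parameters shown above (the `.new-items` selector is illustrative):

```python
# Click "load more", then wait for the new items before capturing
result = await crawler.arun(
    url="https://example.com",
    js_code="document.querySelector('.load-more')?.click();",
    wait_for="css:.new-items"  # Illustrative selector for the loaded content
)
```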
## Proxy Configuration

Use proxies for enhanced access:

```python
# Simple proxy
async with AsyncWebCrawler(
    proxy="http://proxy.example.com:8080"
) as crawler:
    result = await crawler.arun(url="https://example.com")

# Proxy with authentication
proxy_config = {
    "server": "http://proxy.example.com:8080",
    "username": "user",
    "password": "pass"
}
async with AsyncWebCrawler(proxy_config=proxy_config) as crawler:
    result = await crawler.arun(url="https://example.com")
```
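Since the proxy is fixed per crawler instance, rotating proxies amounts to creating a new crawler per proxy; a minimal sketch, assuming a plain Python list of your own proxy endpoints:

```python
import random

# Hypothetical proxy pool; replace with your own endpoints
proxies = [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
]

async with AsyncWebCrawler(proxy=random.choice(proxies)) as crawler:
    result = await crawler.arun(url="https://example.com")
```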
## Anti-Detection Features

Enable stealth features to avoid bot detection:

```python
result = await crawler.arun(
    url="https://example.com",
    simulate_user=True,       # Simulate human behavior
    override_navigator=True,  # Mask automation signals
    magic=True                # Enable all anti-detection features
)
```
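Since `magic=True` is described above as enabling all anti-detection features, it can likely stand in for the individual flags; a minimal sketch under that assumption:

```python
# Assumes magic=True subsumes simulate_user and override_navigator,
# per the comment above; verify against your version
result = await crawler.arun(
    url="https://example.com",
    magic=True
)
```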
## Handling Dynamic Content

Configure the browser to handle dynamic content:

```python
# Wait for dynamic content
result = await crawler.arun(
    url="https://example.com",
    wait_for="js:() => document.querySelector('.content').children.length > 10",
    process_iframes=True  # Process iframe content
)

# Handle lazy-loaded images
result = await crawler.arun(
    url="https://example.com",
    js_code="window.scrollTo(0, document.body.scrollHeight);",
    delay_before_return_html=2.0  # Wait for images to load
)
```
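For long pages, a single jump to the bottom can skip lazy-load triggers in between; a sketch that scrolls in stages instead (the step count and delay are illustrative, and whether `js_code` awaits the async script is an assumption to verify against your version):

```python
# Scroll the page in stages so each viewport gets a chance to lazy-load
scroll_in_steps = """
(async () => {
    for (let i = 1; i <= 5; i++) {
        window.scrollTo(0, document.body.scrollHeight * i / 5);
        await new Promise(resolve => setTimeout(resolve, 500));
    }
})();
"""
result = await crawler.arun(
    url="https://example.com",
    js_code=scroll_in_steps,
    delay_before_return_html=2.0  # Extra settle time after scrolling
)
```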
## Comprehensive Example

Here's how to combine various browser configurations:

```python
async def crawl_with_advanced_config(url: str):
    async with AsyncWebCrawler(
        # Browser setup
        browser_type="chromium",
        headless=True,
        verbose=True,
        # Identity
        user_agent="Custom User Agent",
        headers={"Accept-Language": "en-US"},
        # Proxy setup
        proxy="http://proxy.example.com:8080"
    ) as crawler:
        result = await crawler.arun(
            url=url,
            # Content handling
            process_iframes=True,
            screenshot=True,
            # Timing
            page_timeout=60000,
            delay_before_return_html=2.0,
            # Anti-detection
            magic=True,
            simulate_user=True,
            # Dynamic content
            js_code=[
                "window.scrollTo(0, document.body.scrollHeight);",
                "document.querySelector('.load-more')?.click();"
            ],
            wait_for="css:.dynamic-content"
        )
        return {
            "content": result.markdown,
            "screenshot": result.screenshot,
            "success": result.success
        }
```
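A quick way to run the helper above from a script (the URL is illustrative):

```python
import asyncio

async def main():
    data = await crawl_with_advanced_config("https://example.com")
    if data["success"]:
        print(data["content"][:500])  # Preview the first 500 characters

asyncio.run(main())
```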