# Browser Configuration
Crawl4AI supports multiple browser engines and offers extensive configuration options for browser behavior.
## Browser Types
Choose from three browser engines:
```python
# Chromium (default)
async with AsyncWebCrawler(browser_type="chromium") as crawler:
    result = await crawler.arun(url="https://example.com")

# Firefox
async with AsyncWebCrawler(browser_type="firefox") as crawler:
    result = await crawler.arun(url="https://example.com")

# WebKit
async with AsyncWebCrawler(browser_type="webkit") as crawler:
    result = await crawler.arun(url="https://example.com")
```
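All snippets on this page use the same entry-point pattern; here is a minimal runnable skeleton, assuming the standard `from crawl4ai import AsyncWebCrawler` import from the quick start (the URL and printed preview are placeholders):
```python
import asyncio

from crawl4ai import AsyncWebCrawler

async def main():
    # Any snippet from this page can be dropped into an async function like this one.
    async with AsyncWebCrawler(browser_type="chromium") as crawler:
        result = await crawler.arun(url="https://example.com")
        if result.success:
            print(result.markdown[:300])  # Preview the extracted markdown

if __name__ == "__main__":
    asyncio.run(main())
```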
## Basic Configuration
Common browser settings:
```python
async with AsyncWebCrawler(
    headless=True,         # Run in headless mode (no GUI)
    verbose=True,          # Enable detailed logging
    sleep_on_close=False   # No delay when closing browser
) as crawler:
    result = await crawler.arun(url="https://example.com")
```
## Identity Management
Control how your crawler appears to websites:
```python
# Custom user agent
async with AsyncWebCrawler(
    user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
) as crawler:
    result = await crawler.arun(url="https://example.com")

# Custom headers
headers = {
    "Accept-Language": "en-US,en;q=0.9",
    "Cache-Control": "no-cache"
}
async with AsyncWebCrawler(headers=headers) as crawler:
    result = await crawler.arun(url="https://example.com")
```
## Screenshot Capabilities
Capture page screenshots; the image is returned as a Base64-encoded string on the result:
```python
import base64

result = await crawler.arun(
    url="https://example.com",
    screenshot=True,          # Enable screenshot
    screenshot_wait_for=2.0   # Wait 2 seconds before capture
)

if result.screenshot:  # Base64-encoded image
    with open("screenshot.png", "wb") as f:
        f.write(base64.b64decode(result.screenshot))
```
## Timeouts and Waiting
Control page loading behavior:
```python
result = await crawler.arun(
    url="https://example.com",
    page_timeout=60000,               # Page load timeout (ms)
    delay_before_return_html=2.0,     # Wait before content capture (s)
    wait_for="css:.dynamic-content"   # Wait for a specific element
)
```
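If the page or the `wait_for` condition doesn't resolve in time, check the result before using it; a sketch, assuming the `success` and `error_message` fields of the standard result object (the short timeout is only for illustration):
```python
result = await crawler.arun(
    url="https://example.com",
    page_timeout=10000,               # Deliberately short timeout (10 s) for illustration
    wait_for="css:.dynamic-content"
)

if not result.success:
    # Log the reason and move on instead of aborting the whole crawl run.
    print(f"Crawl failed: {result.error_message}")
```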
## JavaScript Execution
Execute custom JavaScript on the page before the content is captured:
```python
# Single JavaScript command
result = await crawler.arun(
    url="https://example.com",
    js_code="window.scrollTo(0, document.body.scrollHeight);"
)

# Multiple commands
js_commands = [
    "window.scrollTo(0, document.body.scrollHeight);",
    "document.querySelector('.load-more').click();"
]
result = await crawler.arun(
    url="https://example.com",
    js_code=js_commands
)
```
## Proxy Configuration
Route requests through a proxy, with or without authentication:
```python
# Simple proxy
async with AsyncWebCrawler(
    proxy="http://proxy.example.com:8080"
) as crawler:
    result = await crawler.arun(url="https://example.com")

# Proxy with authentication
proxy_config = {
    "server": "http://proxy.example.com:8080",
    "username": "user",
    "password": "pass"
}
async with AsyncWebCrawler(proxy_config=proxy_config) as crawler:
    result = await crawler.arun(url="https://example.com")
```
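If you need to spread requests across several proxies, a simple rotation can be built on top of the same `proxy` parameter; a sketch with a hypothetical proxy pool (endpoints and URLs are placeholders):
```python
# Hypothetical proxy pool; substitute your own endpoints.
proxies = [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
]

async def crawl_with_rotation(urls):
    results = []
    for i, url in enumerate(urls):
        proxy = proxies[i % len(proxies)]  # Round-robin over the pool
        async with AsyncWebCrawler(proxy=proxy) as crawler:
            results.append(await crawler.arun(url=url))
    return results
```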
## Anti-Detection Features
Enable stealth features to reduce the likelihood of bot detection:
```python
result = await crawler.arun(
    url="https://example.com",
    simulate_user=True,       # Simulate human behavior
    override_navigator=True,  # Mask automation signals
    magic=True                # Enable all anti-detection features
)
```
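These per-request flags compose with the identity settings from earlier; a sketch combining only parameters already shown above:
```python
# Constructor-level identity plus per-request stealth flags.
async with AsyncWebCrawler(
    user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    headers={"Accept-Language": "en-US,en;q=0.9"}
) as crawler:
    result = await crawler.arun(
        url="https://example.com",
        magic=True,          # Anti-detection bundle
        simulate_user=True   # Simulate human behavior
    )
```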
## Handling Dynamic Content
Configure the browser to handle dynamic content:
```python
# Wait for dynamic content
result = await crawler.arun(
    url="https://example.com",
    wait_for="js:() => document.querySelector('.content').children.length > 10",
    process_iframes=True  # Process iframe content
)

# Handle lazy-loaded images
result = await crawler.arun(
    url="https://example.com",
    js_code="window.scrollTo(0, document.body.scrollHeight);",
    delay_before_return_html=2.0  # Wait for images to load
)
```
## Comprehensive Example
Here's how to combine various browser configurations:
```python
async def crawl_with_advanced_config(url: str):
    async with AsyncWebCrawler(
        # Browser setup
        browser_type="chromium",
        headless=True,
        verbose=True,
        # Identity
        user_agent="Custom User Agent",
        headers={"Accept-Language": "en-US"},
        # Proxy setup
        proxy="http://proxy.example.com:8080"
    ) as crawler:
        result = await crawler.arun(
            url=url,
            # Content handling
            process_iframes=True,
            screenshot=True,
            # Timing
            page_timeout=60000,
            delay_before_return_html=2.0,
            # Anti-detection
            magic=True,
            simulate_user=True,
            # Dynamic content
            js_code=[
                "window.scrollTo(0, document.body.scrollHeight);",
                "document.querySelector('.load-more')?.click();"
            ],
            wait_for="css:.dynamic-content"
        )

        return {
            "content": result.markdown,
            "screenshot": result.screenshot,
            "success": result.success
        }
```
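To run the function above as a script, wrap it in an async entry point (standard asyncio usage; the URL is a placeholder):
```python
import asyncio

async def main():
    data = await crawl_with_advanced_config("https://example.com")
    if data["success"]:
        print(data["content"][:500])  # Preview the extracted markdown
    else:
        print("Crawl did not succeed")

if __name__ == "__main__":
    asyncio.run(main())
```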