# Browser Configuration

Crawl4AI supports multiple browser engines and offers extensive configuration options for browser behavior.

## Browser Types

Choose from three browser engines:

```python
from crawl4ai import AsyncWebCrawler

# Chromium (default)
async with AsyncWebCrawler(browser_type="chromium") as crawler:
    result = await crawler.arun(url="https://example.com")

# Firefox
async with AsyncWebCrawler(browser_type="firefox") as crawler:
    result = await crawler.arun(url="https://example.com")

# WebKit
async with AsyncWebCrawler(browser_type="webkit") as crawler:
    result = await crawler.arun(url="https://example.com")
```
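The snippets throughout this page assume they run inside a coroutine. A minimal, complete script wrapping the default Chromium example might look like this:

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(browser_type="chromium") as crawler:
        result = await crawler.arun(url="https://example.com")
        print(result.markdown[:300])  # Preview the extracted markdown

if __name__ == "__main__":
    asyncio.run(main())
```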
## Basic Configuration

Common browser settings:

```python
async with AsyncWebCrawler(
    headless=True,          # Run in headless mode (no GUI)
    verbose=True,           # Enable detailed logging
    sleep_on_close=False    # No delay when closing the browser
) as crawler:
    result = await crawler.arun(url="https://example.com")
```
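During development it can help to watch the browser work; a short sketch flipping the flags above:

```python
# Show the browser window while debugging selectors and timing
async with AsyncWebCrawler(headless=False, verbose=True) as crawler:
    result = await crawler.arun(url="https://example.com")
```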
## Identity Management

Control how your crawler appears to websites:

```python
# Custom user agent
async with AsyncWebCrawler(
    user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
) as crawler:
    result = await crawler.arun(url="https://example.com")

# Custom headers
headers = {
    "Accept-Language": "en-US,en;q=0.9",
    "Cache-Control": "no-cache"
}
async with AsyncWebCrawler(headers=headers) as crawler:
    result = await crawler.arun(url="https://example.com")
```
## Screenshot Capabilities

Capture page screenshots; the screenshot is returned as a Base64-encoded image:

```python
import base64

result = await crawler.arun(
    url="https://example.com",
    screenshot=True,           # Enable screenshot capture
    screenshot_wait_for=2.0    # Wait 2 seconds before capturing
)

if result.screenshot:  # Base64-encoded image
    with open("screenshot.png", "wb") as f:
        f.write(base64.b64decode(result.screenshot))
```
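If a capture can fail, it may help to check the result before decoding. A minimal sketch, assuming the result object exposes an `error_message` field alongside the `success` field shown in the comprehensive example below (`error_message` is an assumption; verify against your version):

```python
import base64

result = await crawler.arun(url="https://example.com", screenshot=True)

if result.success and result.screenshot:
    with open("screenshot.png", "wb") as f:
        f.write(base64.b64decode(result.screenshot))
else:
    # error_message is an assumed field; adjust to whatever your version exposes
    print("Screenshot failed:", getattr(result, "error_message", "unknown error"))
```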
## Timeouts and Waiting

Control page loading behavior:

```python
result = await crawler.arun(
    url="https://example.com",
    page_timeout=60000,              # Page load timeout (ms)
    delay_before_return_html=2.0,    # Wait before capturing content
    wait_for="css:.dynamic-content"  # Wait for a specific element
)
```
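Besides the `css:` prefix, `wait_for` also accepts a JavaScript predicate with the `js:` prefix (used again under Handling Dynamic Content below); the predicate here is illustrative:

```python
# Wait until a JavaScript condition becomes true instead of a CSS selector
result = await crawler.arun(
    url="https://example.com",
    wait_for="js:() => document.readyState === 'complete'"
)
```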
## JavaScript Execution

Execute custom JavaScript before crawling:

```python
# Single JavaScript command
result = await crawler.arun(
    url="https://example.com",
    js_code="window.scrollTo(0, document.body.scrollHeight);"
)

# Multiple commands (optional chaining avoids an error if the element is missing)
js_commands = [
    "window.scrollTo(0, document.body.scrollHeight);",
    "document.querySelector('.load-more')?.click();"
]
result = await crawler.arun(
    url="https://example.com",
    js_code=js_commands
)
```
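Because `js_code` runs before content capture, it pairs naturally with `wait_for` so the crawler waits for whatever the script triggers; a sketch combining the two parameters shown above (the `.new-items` selector is illustrative):

```python
# Click "load more", then wait for the new items before capturing
result = await crawler.arun(
    url="https://example.com",
    js_code="document.querySelector('.load-more')?.click();",
    wait_for="css:.new-items"  # Illustrative selector for the loaded content
)
```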
## Proxy Configuration

Use proxies for enhanced access:

```python
# Simple proxy
async with AsyncWebCrawler(
    proxy="http://proxy.example.com:8080"
) as crawler:
    result = await crawler.arun(url="https://example.com")

# Proxy with authentication
proxy_config = {
    "server": "http://proxy.example.com:8080",
    "username": "user",
    "password": "pass"
}
async with AsyncWebCrawler(proxy_config=proxy_config) as crawler:
    result = await crawler.arun(url="https://example.com")
```
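Since the proxy is fixed per crawler instance, rotating proxies amounts to creating a new crawler per proxy; a minimal sketch, assuming a plain Python list of your own proxy endpoints:

```python
import random

# Hypothetical proxy pool; replace with your own endpoints
proxies = [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
]

async with AsyncWebCrawler(proxy=random.choice(proxies)) as crawler:
    result = await crawler.arun(url="https://example.com")
```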
## Anti-Detection Features

Enable stealth features to avoid bot detection:

```python
result = await crawler.arun(
    url="https://example.com",
    simulate_user=True,       # Simulate human behavior
    override_navigator=True,  # Mask automation signals
    magic=True                # Enable all anti-detection features
)
```
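Since `magic=True` is described above as enabling all anti-detection features, it can likely stand in for the individual flags; a minimal sketch under that assumption:

```python
# Assumes magic=True subsumes simulate_user and override_navigator,
# per the comment above; verify against your version
result = await crawler.arun(
    url="https://example.com",
    magic=True
)
```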
## Handling Dynamic Content

Configure the browser to handle dynamic content:

```python
# Wait for dynamic content
result = await crawler.arun(
    url="https://example.com",
    wait_for="js:() => document.querySelector('.content').children.length > 10",
    process_iframes=True  # Process iframe content
)

# Handle lazy-loaded images
result = await crawler.arun(
    url="https://example.com",
    js_code="window.scrollTo(0, document.body.scrollHeight);",
    delay_before_return_html=2.0  # Wait for images to load
)
```
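For long pages, a single jump to the bottom can skip lazy-load triggers in between; a sketch that scrolls in stages instead (the step count and delay are illustrative, and whether `js_code` awaits the async script is an assumption to verify against your version):

```python
# Scroll the page in stages so each viewport gets a chance to lazy-load
scroll_in_steps = """
(async () => {
    for (let i = 1; i <= 5; i++) {
        window.scrollTo(0, document.body.scrollHeight * i / 5);
        await new Promise(resolve => setTimeout(resolve, 500));
    }
})();
"""
result = await crawler.arun(
    url="https://example.com",
    js_code=scroll_in_steps,
    delay_before_return_html=2.0  # Extra settle time after scrolling
)
```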
## Comprehensive Example

Here's how to combine various browser configurations:

```python
async def crawl_with_advanced_config(url: str):
    async with AsyncWebCrawler(
        # Browser setup
        browser_type="chromium",
        headless=True,
        verbose=True,
        # Identity
        user_agent="Custom User Agent",
        headers={"Accept-Language": "en-US"},
        # Proxy setup
        proxy="http://proxy.example.com:8080"
    ) as crawler:
        result = await crawler.arun(
            url=url,
            # Content handling
            process_iframes=True,
            screenshot=True,
            # Timing
            page_timeout=60000,
            delay_before_return_html=2.0,
            # Anti-detection
            magic=True,
            simulate_user=True,
            # Dynamic content
            js_code=[
                "window.scrollTo(0, document.body.scrollHeight);",
                "document.querySelector('.load-more')?.click();"
            ],
            wait_for="css:.dynamic-content"
        )
        return {
            "content": result.markdown,
            "screenshot": result.screenshot,
            "success": result.success
        }
```
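A quick way to run the helper above from a script (the URL is illustrative):

```python
import asyncio

async def main():
    data = await crawl_with_advanced_config("https://example.com")
    if data["success"]:
        print(data["content"][:500])  # Preview the first 500 characters

asyncio.run(main())
```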