# AsyncWebCrawler
The `AsyncWebCrawler` class is the main interface for web crawling operations. It provides asynchronous web crawling capabilities with extensive configuration options.
## Constructor
```python
AsyncWebCrawler(
    # Browser Settings
    browser_type: str = "chromium",  # Options: "chromium", "firefox", "webkit"
    headless: bool = True,           # Run browser in headless mode
    verbose: bool = False,           # Enable verbose logging

    # Cache Settings
    always_by_pass_cache: bool = False,  # Always bypass cache
    base_directory: str = str(os.getenv("CRAWL4_AI_BASE_DIRECTORY", Path.home())),  # Base directory for cache

    # Network Settings
    proxy: str = None,               # Simple proxy URL
    proxy_config: Dict = None,       # Advanced proxy configuration

    # Browser Behavior
    sleep_on_close: bool = False,    # Wait before closing browser

    # Custom Settings
    user_agent: str = None,          # Custom user agent
    headers: Dict[str, str] = {},    # Custom HTTP headers
    js_code: Union[str, List[str]] = None,  # Default JavaScript to execute
)
```
### Parameters in Detail
#### Browser Settings
- **browser_type** (str, optional)
  - Default: `"chromium"`
  - Options: `"chromium"`, `"firefox"`, `"webkit"`
  - Controls which browser engine to use

  ```python
  # Example: Using Firefox
  crawler = AsyncWebCrawler(browser_type="firefox")
  ```

- **headless** (bool, optional)
  - Default: `True`
  - When `True`, browser runs without GUI
  - Set to `False` for debugging

  ```python
  # Visible browser for debugging
  crawler = AsyncWebCrawler(headless=False)
  ```

- **verbose** (bool, optional)
  - Default: `False`
  - Enables detailed logging

  ```python
  # Enable detailed logging
  crawler = AsyncWebCrawler(verbose=True)
  ```
#### Cache Settings
- **always_by_pass_cache** (bool, optional)
  - Default: `False`
  - When `True`, always fetches fresh content

  ```python
  # Always fetch fresh content
  crawler = AsyncWebCrawler(always_by_pass_cache=True)
  ```

- **base_directory** (str, optional)
  - Default: User's home directory
  - Base path for cache storage

  ```python
  # Custom cache directory
  crawler = AsyncWebCrawler(base_directory="/path/to/cache")
  ```
#### Network Settings
- **proxy** (str, optional)
  - Simple proxy URL

  ```python
  # Using simple proxy
  crawler = AsyncWebCrawler(proxy="http://proxy.example.com:8080")
  ```

- **proxy_config** (Dict, optional)
  - Advanced proxy configuration with authentication

  ```python
  # Advanced proxy with auth
  crawler = AsyncWebCrawler(proxy_config={
      "server": "http://proxy.example.com:8080",
      "username": "user",
      "password": "pass"
  })
  ```
#### Browser Behavior
- **sleep_on_close** (bool, optional)
  - Default: `False`
  - Adds delay before closing browser

  ```python
  # Wait before closing
  crawler = AsyncWebCrawler(sleep_on_close=True)
  ```
#### Custom Settings
- **user_agent** (str, optional)
  - Custom user agent string

  ```python
  # Custom user agent
  crawler = AsyncWebCrawler(
      user_agent="Mozilla/5.0 (Custom Agent) Chrome/90.0"
  )
  ```

- **headers** (Dict[str, str], optional)
  - Custom HTTP headers

  ```python
  # Custom headers
  crawler = AsyncWebCrawler(
      headers={
          "Accept-Language": "en-US",
          "Custom-Header": "Value"
      }
  )
  ```

- **js_code** (Union[str, List[str]], optional)
  - Default JavaScript to execute on each page

  ```python
  # Default JavaScript
  crawler = AsyncWebCrawler(
      js_code=[
          "window.scrollTo(0, document.body.scrollHeight);",
          "document.querySelector('.load-more').click();"
      ]
  )
  ```
## Methods
### arun()
The primary method for crawling web pages.
```python
async def arun(
    # Required
    url: str,                                  # URL to crawl

    # Content Selection
    css_selector: str = None,                  # CSS selector for content
    word_count_threshold: int = 10,            # Minimum words per block

    # Cache Control
    bypass_cache: bool = False,                # Bypass cache for this request

    # Session Management
    session_id: str = None,                    # Session identifier

    # Screenshot Options
    screenshot: bool = False,                  # Take screenshot
    screenshot_wait_for: float = None,         # Wait before screenshot

    # Content Processing
    process_iframes: bool = False,             # Process iframe content
    remove_overlay_elements: bool = False,     # Remove popups/modals

    # Anti-Bot Settings
    simulate_user: bool = False,               # Simulate human behavior
    override_navigator: bool = False,          # Override navigator properties
    magic: bool = False,                       # Enable all anti-detection

    # Content Filtering
    excluded_tags: List[str] = None,           # HTML tags to exclude
    exclude_external_links: bool = False,      # Remove external links
    exclude_social_media_links: bool = False,  # Remove social media links

    # JavaScript Handling
    js_code: Union[str, List[str]] = None,     # JavaScript to execute
    wait_for: str = None,                      # Wait condition

    # Page Loading
    page_timeout: int = 60000,                 # Page load timeout (ms)
    delay_before_return_html: float = None,    # Wait before return

    # Extraction
    extraction_strategy: ExtractionStrategy = None,  # Extraction strategy
) -> CrawlResult:
```
### Usage Examples
#### Basic Crawling
```python
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(url="https://example.com")
```
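The returned `CrawlResult` carries the page content alongside status information. A minimal sketch of reading it, assuming the commonly documented `success`, `markdown`, `html`, and `error_message` fields:

```python
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(url="https://example.com")
    if result.success:
        print(result.markdown[:300])  # Markdown rendering of the page
        print(len(result.html))       # Raw HTML is also available
    else:
        print(f"Crawl failed: {result.error_message}")
```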
#### Advanced Crawling
```python
async with AsyncWebCrawler(
    browser_type="firefox",
    verbose=True,
    headers={"Custom-Header": "Value"}
) as crawler:
    result = await crawler.arun(
        url="https://example.com",
        css_selector=".main-content",
        word_count_threshold=20,
        process_iframes=True,
        magic=True,
        wait_for="css:.dynamic-content",
        screenshot=True
    )
```
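When `screenshot=True`, the capture comes back on the result rather than being written to disk. A minimal sketch for saving it, assuming `result.screenshot` holds the image as a base64-encoded string:

```python
import base64

# Assumption: result.screenshot is a base64-encoded image string
if result.screenshot:
    with open("example.png", "wb") as f:
        f.write(base64.b64decode(result.screenshot))
```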
#### Session Management
```python
async with AsyncWebCrawler() as crawler:
    # First request
    result1 = await crawler.arun(
        url="https://example.com/login",
        session_id="my_session"
    )

    # Subsequent request using same session
    result2 = await crawler.arun(
        url="https://example.com/protected",
        session_id="my_session"
    )
```
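#### Content Filtering
The filtering parameters from the `arun()` signature can be combined to strip navigation chrome and off-site links from the extracted content:

```python
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(
        url="https://example.com/article",
        excluded_tags=["nav", "footer", "aside"],  # Drop these HTML tags
        exclude_external_links=True,               # Remove external links
        exclude_social_media_links=True,           # Remove social media links
        remove_overlay_elements=True               # Remove popups/modals
    )
```

#### Structured Extraction
The `extraction_strategy` parameter plugs a structured extractor into the crawl. A sketch using `JsonCssExtractionStrategy`; the schema below is illustrative, and its selectors depend on the target page:

```python
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

# Illustrative schema: adjust selectors to the target page
schema = {
    "name": "Articles",
    "baseSelector": "article.post",
    "fields": [
        {"name": "title", "selector": "h2", "type": "text"},
        {"name": "url", "selector": "a", "type": "attribute", "attribute": "href"}
    ]
}

async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(
        url="https://example.com/blog",
        extraction_strategy=JsonCssExtractionStrategy(schema)
    )
    print(result.extracted_content)  # JSON string of extracted items
```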
## Context Manager
AsyncWebCrawler implements the async context manager protocol:
```python
async def __aenter__(self) -> 'AsyncWebCrawler':
    # Initialize browser and resources
    return self

async def __aexit__(self, *args):
    # Cleanup resources
    pass
```
Always use `AsyncWebCrawler` as an async context manager:
```python
async with AsyncWebCrawler() as crawler:
    # Your crawling code here
    pass
```
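Because every crawler method is a coroutine, the context manager itself must run inside an event loop. A complete minimal script using the standard-library `asyncio.run` (the `markdown` field is assumed as above):

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com")
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())
```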
## Best Practices
1. **Resource Management**
    ```python
    # Always use context manager
    async with AsyncWebCrawler() as crawler:
        # Crawler will be properly cleaned up
        pass
    ```
2. **Error Handling**
    ```python
    try:
        async with AsyncWebCrawler() as crawler:
            result = await crawler.arun(url="https://example.com")
            if not result.success:
                print(f"Crawl failed: {result.error_message}")
    except Exception as e:
        print(f"Error: {str(e)}")
    ```
3. **Performance Optimization**
    ```python
    # Enable caching for better performance
    crawler = AsyncWebCrawler(
        always_by_pass_cache=False,
        verbose=True
    )
    ```
4. **Anti-Detection**
    ```python
    # Maximum stealth
    crawler = AsyncWebCrawler(
        headless=True,
        user_agent="Mozilla/5.0...",
        headers={"Accept-Language": "en-US"}
    )

    result = await crawler.arun(
        url="https://example.com",
        magic=True,
        simulate_user=True
    )
    ```
## Note on Browser Types
Each browser type has its own characteristics:

- **chromium**: Best overall compatibility
- **firefox**: A useful alternative when a site renders or behaves differently under Chromium
- **webkit**: Lighter weight, good for basic crawling

Choose based on your specific needs:
```python
# High compatibility
crawler = AsyncWebCrawler(browser_type="chromium")

# Memory efficient
crawler = AsyncWebCrawler(browser_type="webkit")
```