# AsyncWebCrawler

The `AsyncWebCrawler` class is the main interface for web crawling operations. It provides asynchronous web crawling capabilities with extensive configuration options.

## Constructor

```python
AsyncWebCrawler(
    # Browser Settings
    browser_type: str = "chromium",         # Options: "chromium", "firefox", "webkit"
    headless: bool = True,                  # Run browser in headless mode
    verbose: bool = False,                  # Enable verbose logging
    
    # Cache Settings
    always_by_pass_cache: bool = False,     # Always bypass cache
    base_directory: str = str(os.getenv("CRAWL4_AI_BASE_DIRECTORY", Path.home())), # Base directory for cache
    
    # Network Settings
    proxy: str = None,                      # Simple proxy URL
    proxy_config: Dict = None,              # Advanced proxy configuration
    
    # Browser Behavior
    sleep_on_close: bool = False,           # Wait before closing browser
    
    # Custom Settings
    user_agent: str = None,                 # Custom user agent
    headers: Dict[str, str] = {},           # Custom HTTP headers
    js_code: Union[str, List[str]] = None,  # Default JavaScript to execute
)
```

### Parameters in Detail

#### Browser Settings

- **browser_type** (str, optional)
  - Default: `"chromium"`
  - Options: `"chromium"`, `"firefox"`, `"webkit"`
  - Controls which browser engine to use
  ```python
  # Example: Using Firefox
  crawler = AsyncWebCrawler(browser_type="firefox")
  ```

- **headless** (bool, optional)
  - Default: `True`
  - When `True`, browser runs without GUI
  - Set to `False` for debugging
  ```python
  # Visible browser for debugging
  crawler = AsyncWebCrawler(headless=False)
  ```

- **verbose** (bool, optional)
  - Default: `False`
  - Enables detailed logging
  ```python
  # Enable detailed logging
  crawler = AsyncWebCrawler(verbose=True)
  ```

#### Cache Settings

- **always_by_pass_cache** (bool, optional)
  - Default: `False`
  - When `True`, always fetches fresh content
  ```python
  # Always fetch fresh content
  crawler = AsyncWebCrawler(always_by_pass_cache=True)
  ```

- **base_directory** (str, optional)
  - Default: User's home directory
  - Base path for cache storage
  ```python
  # Custom cache directory
  crawler = AsyncWebCrawler(base_directory="/path/to/cache")
  ```

#### Network Settings

- **proxy** (str, optional)
  - Simple proxy URL
  ```python
  # Using simple proxy
  crawler = AsyncWebCrawler(proxy="http://proxy.example.com:8080")
  ```

- **proxy_config** (Dict, optional)
  - Advanced proxy configuration with authentication
  ```python
  # Advanced proxy with auth
  crawler = AsyncWebCrawler(proxy_config={
      "server": "http://proxy.example.com:8080",
      "username": "user",
      "password": "pass"
  })
  ```

#### Browser Behavior

- **sleep_on_close** (bool, optional)
  - Default: `False`
  - Adds delay before closing browser
  ```python
  # Wait before closing
  crawler = AsyncWebCrawler(sleep_on_close=True)
  ```

#### Custom Settings

- **user_agent** (str, optional)
  - Custom user agent string
  ```python
  # Custom user agent
  crawler = AsyncWebCrawler(
      user_agent="Mozilla/5.0 (Custom Agent) Chrome/90.0"
  )
  ```

- **headers** (Dict[str, str], optional)
  - Custom HTTP headers
  ```python
  # Custom headers
  crawler = AsyncWebCrawler(
      headers={
          "Accept-Language": "en-US",
          "Custom-Header": "Value"
      }
  )
  ```

- **js_code** (Union[str, List[str]], optional)
  - Default JavaScript to execute on each page
  ```python
  # Default JavaScript
  crawler = AsyncWebCrawler(
      js_code=[
          "window.scrollTo(0, document.body.scrollHeight);",
          "document.querySelector('.load-more').click();"
      ]
  )
  ```

## Methods

### arun()

The primary method for crawling web pages.

```python
async def arun(
    # Required
    url: str,                              # URL to crawl
    
    # Content Selection
    css_selector: str = None,              # CSS selector for content
    word_count_threshold: int = 10,        # Minimum words per block
    
    # Cache Control
    bypass_cache: bool = False,            # Bypass cache for this request
    
    # Session Management
    session_id: str = None,                # Session identifier
    
    # Screenshot Options
    screenshot: bool = False,              # Take screenshot
    screenshot_wait_for: float = None,     # Wait before screenshot
    
    # Content Processing
    process_iframes: bool = False,         # Process iframe content
    remove_overlay_elements: bool = False, # Remove popups/modals
    
    # Anti-Bot Settings
    simulate_user: bool = False,           # Simulate human behavior
    override_navigator: bool = False,      # Override navigator properties
    magic: bool = False,                   # Enable all anti-detection
    
    # Content Filtering
    excluded_tags: List[str] = None,       # HTML tags to exclude
    exclude_external_links: bool = False,  # Remove external links
    exclude_social_media_links: bool = False, # Remove social media links
    
    # JavaScript Handling
    js_code: Union[str, List[str]] = None, # JavaScript to execute
    wait_for: str = None,                  # Wait condition
    
    # Page Loading
    page_timeout: int = 60000,            # Page load timeout (ms)
    delay_before_return_html: float = None, # Wait before returning HTML
    
    # Extraction
    extraction_strategy: ExtractionStrategy = None  # Extraction strategy
) -> CrawlResult:
```
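
`arun()` returns a `CrawlResult` object. The sketch below shows the fields most commonly inspected after a crawl; the field names used here (`success`, `markdown`, `cleaned_html`, `screenshot`, `error_message`) follow the library's result model but may vary slightly between versions:

```python
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(url="https://example.com", screenshot=True)

    if result.success:
        print(result.markdown[:500])     # Markdown rendering of the page content
        print(len(result.cleaned_html))  # Sanitized HTML
        # result.screenshot holds base64-encoded image data when screenshot=True
    else:
        print(f"Crawl failed: {result.error_message}")
```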

### Usage Examples

#### Basic Crawling
```python
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(url="https://example.com")
```

#### Advanced Crawling
```python
async with AsyncWebCrawler(
    browser_type="firefox",
    verbose=True,
    headers={"Custom-Header": "Value"}
) as crawler:
    result = await crawler.arun(
        url="https://example.com",
        css_selector=".main-content",
        word_count_threshold=20,
        process_iframes=True,
        magic=True,
        wait_for="css:.dynamic-content",
        screenshot=True
    )
```
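
#### Structured Extraction
The `extraction_strategy` parameter accepts any `ExtractionStrategy` implementation. A minimal sketch using the CSS-based strategy (the `JsonCssExtractionStrategy` import path and schema keys are assumed from the library's extraction module; the URL and selectors are placeholders):
```python
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

schema = {
    "name": "Articles",
    "baseSelector": "article.post",  # One result entry per matching element
    "fields": [
        {"name": "title", "selector": "h2", "type": "text"},
        {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"},
    ],
}

async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(
        url="https://example.com/blog",
        extraction_strategy=JsonCssExtractionStrategy(schema),
    )
    print(result.extracted_content)  # JSON string produced by the strategy
```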

#### Session Management
```python
async with AsyncWebCrawler() as crawler:
    # First request
    result1 = await crawler.arun(
        url="https://example.com/login",
        session_id="my_session"
    )
    
    # Subsequent request using same session
    result2 = await crawler.arun(
        url="https://example.com/protected",
        session_id="my_session"
    )
```
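
Sessions keep a browser tab open between calls. When a session is no longer needed, it can be released explicitly; the `kill_session` helper on the crawler strategy shown below is an assumption and may differ between versions:
```python
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(
        url="https://example.com/dashboard",
        session_id="my_session"
    )
    # Release the tab held by "my_session" when finished
    # (kill_session is assumed here; verify against your installed version)
    await crawler.crawler_strategy.kill_session("my_session")
```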

## Context Manager

AsyncWebCrawler implements the async context manager protocol:

```python
async def __aenter__(self) -> 'AsyncWebCrawler':
    # Initialize browser and resources
    return self

async def __aexit__(self, *args):
    # Cleanup resources
    pass
```

Always use AsyncWebCrawler as an async context manager:
```python
async with AsyncWebCrawler() as crawler:
    # Your crawling code here
    pass
```

## Best Practices

1. **Resource Management**
```python
# Always use context manager
async with AsyncWebCrawler() as crawler:
    # Crawler will be properly cleaned up
    pass
```

2. **Error Handling**
```python
try:
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com")
        if not result.success:
            print(f"Crawl failed: {result.error_message}")
except Exception as e:
    print(f"Error: {str(e)}")
```

3. **Performance Optimization**
```python
# Enable caching for better performance
crawler = AsyncWebCrawler(
    always_by_pass_cache=False,
    verbose=True
)
```

4. **Anti-Detection**
```python
# Maximum stealth
async with AsyncWebCrawler(
    headless=True,
    user_agent="Mozilla/5.0...",
    headers={"Accept-Language": "en-US"}
) as crawler:
    result = await crawler.arun(
        url="https://example.com",
        magic=True,
        simulate_user=True
    )
```
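
5. **Retrying Transient Failures**
Timeouts and flaky pages can often be handled with a plain retry loop around `arun()`; a minimal sketch (attempt count and backoff delays are placeholder choices):
```python
import asyncio

async def crawl_with_retries(url: str, attempts: int = 3):
    async with AsyncWebCrawler() as crawler:
        for attempt in range(1, attempts + 1):
            result = await crawler.arun(url=url, bypass_cache=True)
            if result.success:
                return result
            # Exponential backoff: 1s, 2s, 4s, ...
            await asyncio.sleep(2 ** (attempt - 1))
    return result
```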

## Note on Browser Types

Each browser type has its own characteristics:

- **chromium**: Best overall compatibility
- **firefox**: Useful when a site renders or behaves differently under Chromium
- **webkit**: Lighter weight, good for basic crawling

Choose based on your specific needs:
```python
# High compatibility
crawler = AsyncWebCrawler(browser_type="chromium")

# Memory efficient
crawler = AsyncWebCrawler(browser_type="webkit")
```