# Complete Parameter Guide for arun()

The following parameters can be passed to the `arun()` method. They are organized by their primary usage context and functionality.
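
All of the snippets below assume an open `AsyncWebCrawler` instance named `crawler`. For reference, a minimal end-to-end call looks like this (the URL is illustrative):

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(url="https://example.com")
        if result.success:
            print(result.markdown[:500])  # preview the extracted markdown

asyncio.run(main())
```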

## Core Parameters

```python
await crawler.arun(
    url="https://example.com",   # Required: URL to crawl
    verbose=True,               # Enable detailed logging
    cache_mode=CacheMode.ENABLED,  # Control cache behavior
    warmup=True                # Whether to run warmup check
)
```

## Cache Control

```python
from crawl4ai import CacheMode

await crawler.arun(
    cache_mode=CacheMode.ENABLED,    # Normal caching (read/write)
    # Other cache modes:
    # cache_mode=CacheMode.DISABLED   # No caching at all
    # cache_mode=CacheMode.READ_ONLY  # Only read from cache
    # cache_mode=CacheMode.WRITE_ONLY # Only write to cache
    # cache_mode=CacheMode.BYPASS     # Skip cache for this operation
)
```

## Content Processing Parameters

### Text Processing
```python
await crawler.arun(
    word_count_threshold=10,                # Minimum words per content block
    image_description_min_word_threshold=5,  # Minimum words for image descriptions
    only_text=False,                        # Extract only text content
    excluded_tags=['form', 'nav'],          # HTML tags to exclude
    keep_data_attributes=False,             # Preserve data-* attributes
)
```

### Content Selection
```python
await crawler.arun(
    css_selector=".main-content",  # CSS selector for content extraction
    remove_forms=True,             # Remove all form elements
    remove_overlay_elements=True,  # Remove popups/modals/overlays
)
```

### Link Handling
```python
await crawler.arun(
    exclude_external_links=True,          # Remove external links
    exclude_social_media_links=True,      # Remove social media links
    exclude_external_images=True,         # Remove external images
    exclude_domains=["ads.example.com"],  # Specific domains to exclude
    social_media_domains=[               # Additional social media domains
        "facebook.com",
        "twitter.com",
        "instagram.com"
    ]
)
```

## Browser Control Parameters

### Basic Browser Settings
```python
await crawler.arun(
    headless=True,                # Run browser in headless mode
    browser_type="chromium",      # Browser engine: "chromium", "firefox", "webkit"
    page_timeout=60000,          # Page load timeout in milliseconds
    user_agent="custom-agent",    # Custom user agent
)
```

### Navigation and Waiting
```python
await crawler.arun(
    wait_for="css:.dynamic-content",  # Wait for element/condition
    delay_before_return_html=2.0,     # Wait before returning HTML (seconds)
)
```
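
The `wait_for` value accepts a `css:` prefix (wait for a selector to appear) or a `js:` prefix (wait until a JavaScript expression returns true, e.g. `js:() => document.querySelectorAll('.item').length > 10`).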

### JavaScript Execution
```python
await crawler.arun(
    js_code=[                     # JavaScript to execute (string or list)
        "window.scrollTo(0, document.body.scrollHeight);",
        "document.querySelector('.load-more').click();"
    ],
    js_only=False,               # When True, run JS in the existing session without reloading the page
)
```

### Anti-Bot Features
```python
await crawler.arun(
    magic=True,              # Enable all anti-detection features
    simulate_user=True,      # Simulate human behavior
    override_navigator=True  # Override navigator properties
)
```
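
Note that `magic=True` already enables the full anti-detection suite, so setting `simulate_user` and `override_navigator` alongside it is redundant; they are shown individually here for cases where you want finer control.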

### Session Management
```python
await crawler.arun(
    session_id="my_session",  # Session identifier for persistent browsing
)
```
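
A common follow-up pattern (sketched below; the URL and selector are illustrative) is to reuse the same `session_id` across calls, setting `js_only=True` on subsequent calls so they run in the already-loaded page instead of navigating again:

```python
# First call: load the page and keep the browser tab alive under "my_session"
result = await crawler.arun(
    url="https://example.com/listing",
    session_id="my_session"
)

# Second call: click "next page" in the same tab, without a fresh page load
result = await crawler.arun(
    url="https://example.com/listing",
    session_id="my_session",
    js_code="document.querySelector('.next-page').click();",  # illustrative selector
    js_only=True
)
```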

### Screenshot Options
```python
await crawler.arun(
    screenshot=True,              # Take page screenshot
    screenshot_wait_for=2.0,      # Wait before screenshot (seconds)
)
```

### Proxy Configuration
```python
await crawler.arun(
    proxy="http://proxy.example.com:8080",     # Simple proxy URL
    proxy_config={                             # Advanced proxy settings
        "server": "http://proxy.example.com:8080",
        "username": "user",
        "password": "pass"
    }
)
```

## Content Extraction Parameters

### Extraction Strategy
```python
from crawl4ai.extraction_strategy import LLMExtractionStrategy

await crawler.arun(
    extraction_strategy=LLMExtractionStrategy(
        provider="ollama/llama2",
        schema=MySchema.schema(),
        instruction="Extract specific data"
    )
)
```
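
`MySchema` above is a placeholder. A minimal sketch of such a schema, assuming a Pydantic model, could look like this:

```python
from pydantic import BaseModel, Field

class MySchema(BaseModel):
    # Illustrative fields; define whatever the LLM should extract
    title: str = Field(..., description="Page or product title")
    price: float = Field(..., description="Listed price, if any")
```

`MySchema.schema()` serializes the model into the JSON schema that the extraction strategy passes to the LLM.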

### Chunking Strategy
```python
from crawl4ai.chunking_strategy import RegexChunking

await crawler.arun(
    chunking_strategy=RegexChunking(
        patterns=[r'\n\n', r'\.\s+']
    )
)
```

### HTML to Text Options
```python
await crawler.arun(
    html2text={
        "ignore_links": False,
        "ignore_images": False,
        "escape_dot": False,
        "body_width": 0,
        "protect_links": True,
        "unicode_snob": True
    }
)
```
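
These keys correspond to options of the underlying `html2text` converter used for the markdown conversion, so they follow that library's naming.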

## Debug Options
```python
await crawler.arun(
    log_console=True,   # Log browser console messages
)
```

## Parameter Interactions and Notes

1. **Cache and Performance Setup**
   ```python
   # Optimal caching for repeated crawls
   await crawler.arun(
       cache_mode=CacheMode.ENABLED,
       word_count_threshold=10,
       process_iframes=False
   )
   ```

2. **Dynamic Content Handling**
   ```python
   # Handle lazy-loaded content
   await crawler.arun(
       js_code="window.scrollTo(0, document.body.scrollHeight);",
       wait_for="css:.lazy-content",
       delay_before_return_html=2.0,
       cache_mode=CacheMode.WRITE_ONLY  # Cache results after dynamic load
   )
   ```

3. **Content Extraction Pipeline**
   ```python
   # Complete extraction setup
   await crawler.arun(
       css_selector=".main-content",
       word_count_threshold=20,
       extraction_strategy=my_strategy,
       chunking_strategy=my_chunking,
       process_iframes=True,
       remove_overlay_elements=True,
       cache_mode=CacheMode.ENABLED
   )
   ```

## Best Practices

1. **Performance Optimization**
   ```python
   await crawler.arun(
       cache_mode=CacheMode.ENABLED,  # Use full caching
       word_count_threshold=10,      # Filter out noise
       process_iframes=False         # Skip iframes if not needed
   )
   ```

2. **Reliable Scraping**
   ```python
   await crawler.arun(
       magic=True,                   # Enable anti-detection
       delay_before_return_html=1.0, # Wait for dynamic content
       page_timeout=60000,          # Longer timeout for slow pages
       cache_mode=CacheMode.WRITE_ONLY  # Cache results after successful crawl
   )
   ```

3. **Clean Content**
   ```python
   await crawler.arun(
       remove_overlay_elements=True,  # Remove popups
       excluded_tags=['nav', 'aside'],  # Remove unnecessary elements
       keep_data_attributes=False,    # Remove data attributes
       cache_mode=CacheMode.ENABLED   # Use cache for faster processing
   )
   ```