# Hooks & Custom Code
Crawl4AI supports a **hook** system that lets you run your own Python code at specific points in the crawling pipeline. By injecting logic into these hooks, you can automate tasks like:
- **Authentication** (log in before navigating)
- **Content manipulation** (modify HTML, inject scripts, etc.)
- **Session or browser configuration** (e.g., adjusting user agents, local storage)
- **Custom data collection** (scrape extra details or track state at each stage)
In this tutorial, you’ll learn about:
1. What hooks are available
2. How to attach code to each hook
3. Practical examples (auth flows, user agent changes, content manipulation, etc.)
> **Prerequisites**
> - Familiarity with [AsyncWebCrawler Basics](./async-webcrawler-basics.md).
> - Comfort with Python async/await.
---
## 1. Overview of Available Hooks
| Hook Name | Called When / Purpose | Context / Objects Provided |
|--------------------------|-----------------------------------------------------------------|-----------------------------------------------------|
| **`on_browser_created`** | Immediately after the browser is launched, but **before** any page or context is created. | **Browser** object only (no `page` yet). Use it for broad browser-level config. |
| **`on_page_context_created`** | Right after a new page context is created. Perfect for setting default timeouts, injecting scripts, etc. | Typically provides `page` and `context`. |
| **`on_user_agent_updated`** | Whenever the user agent changes. For advanced user agent logic or additional header updates. | Typically provides `page` and updated user agent string. |
| **`on_execution_started`** | Right before your main crawling logic runs (before rendering the page). Good for one-time setup or variable initialization. | Typically provides `page`, possibly `context`. |
| **`before_goto`** | Right before navigating to the URL (i.e., `page.goto(...)`). Great for setting cookies, altering the URL, or hooking in authentication steps. | Typically provides `page`, `context`, and `goto_params`. |
| **`after_goto`** | Immediately after navigation completes, but before scraping. For post-login checks or initial content adjustments. | Typically provides `page`, `context`, `response`. |
| **`before_retrieve_html`** | Right before retrieving or finalizing the page’s HTML content. Good for in-page manipulation (e.g., removing ads or disclaimers). | Typically provides `page` or final HTML reference. |
| **`before_return_html`** | Just before the HTML is returned to the crawler pipeline. Last chance to alter or sanitize content. | Typically provides final HTML or a `page`. |
### A Note on `on_browser_created` (the “unbrowser” hook)
- **No `page`** object is available because no page context exists yet. You can, however, set up browser-wide properties.
- For example, you might control [CDP sessions][cdp] or advanced browser flags here.
---
## 2. Registering Hooks
You can attach hooks by calling:
```python
crawler.crawler_strategy.set_hook("hook_name", your_hook_function)
```
or by passing a `hooks` dictionary to `AsyncWebCrawler` or your strategy constructor:
```python
hooks = {
    "before_goto": my_before_goto_hook,
    "after_goto": my_after_goto_hook,
    # ... etc.
}
async with AsyncWebCrawler(hooks=hooks) as crawler:
    ...
```
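Because hooks are looked up by name, a typo in a dictionary key is easy to miss. The sketch below is a small, hypothetical helper (`validate_hooks` and `KNOWN_HOOKS` are not part of Crawl4AI) that checks a `hooks` dictionary against the hook names listed in the table in section 1 before you pass it to the crawler.

```python
# Hook names taken from the table in section 1 of this tutorial.
KNOWN_HOOKS = {
    "on_browser_created",
    "on_page_context_created",
    "on_user_agent_updated",
    "on_execution_started",
    "before_goto",
    "after_goto",
    "before_retrieve_html",
    "before_return_html",
}


def validate_hooks(hooks):
    """Hypothetical helper: fail early on unknown names or non-callable values."""
    unknown = sorted(set(hooks) - KNOWN_HOOKS)
    if unknown:
        raise ValueError(f"Unknown hook name(s): {', '.join(unknown)}")
    not_callable = sorted(name for name, fn in hooks.items() if not callable(fn))
    if not_callable:
        raise TypeError(f"Hook(s) must be callable: {', '.join(not_callable)}")
    return hooks
```

Calling `validate_hooks(hooks)` right before constructing the crawler turns a silently ignored key such as `"before_gotoo"` into an immediate `ValueError`.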
### Hook Signature
Each hook is a function (async or sync, depending on your usage) that receives **certain parameters**—most often `page`, `context`, or custom arguments relevant to that stage. The library then awaits or calls your hook before continuing.
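To make the "awaits or calls" behavior concrete, here is a minimal, standalone sketch of how a dispatcher could treat sync and async hooks uniformly. It is not Crawl4AI's actual implementation; `run_hook` and the two demo hooks are hypothetical names used only for illustration.

```python
import asyncio
import inspect


async def run_hook(hooks, name, *args, **kwargs):
    """Hypothetical dispatcher: call the hook registered under `name`, if any."""
    hook = hooks.get(name)
    if hook is None:
        return None
    result = hook(*args, **kwargs)
    # An async hook returns an awaitable; a sync hook returns its value directly.
    if inspect.isawaitable(result):
        result = await result
    return result


async def async_hook(page, **kwargs):
    return f"async saw {page}"


def sync_hook(page, **kwargs):
    return f"sync saw {page}"


async def demo():
    hooks = {"before_goto": async_hook, "after_goto": sync_hook}
    return [
        await run_hook(hooks, "before_goto", "page-1"),
        await run_hook(hooks, "after_goto", "page-1"),
        await run_hook(hooks, "before_return_html", "page-1"),  # not registered
    ]


results = asyncio.run(demo())
print(results)  # ['async saw page-1', 'sync saw page-1', None]
```

Accepting `**kwargs` in every hook, as the examples in this tutorial do, keeps a hook working even if the caller passes extra arguments it does not use.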
---
## 3. Real-Life Examples
Below are concrete scenarios where hooks come in handy.
---
### 3.1 Authentication Before Navigation
One of the most frequent tasks is logging in or applying authentication **before** the crawler navigates to a URL (so that the user is recognized immediately).
#### Using `before_goto`
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def before_goto_auth_hook(page, context, goto_params, **kwargs):
    """
    Example: Set cookies or localStorage to simulate login.
    This hook runs right before page.goto() is called.
    """
    # Example: Insert cookie-based auth or local storage data
    # (You could also do more complex actions, like fill forms if you already have a 'page' open.)
    print("[HOOK] Setting auth data before goto.")
    await context.add_cookies([
        {
            "name": "session",
            "value": "abcd1234",
            "domain": "example.com",
            "path": "/"
        }
    ])
    # Optionally manipulate goto_params if needed:
    # goto_params["url"] = goto_params["url"] + "?debug=1"

async def main():
    hooks = {
        "before_goto": before_goto_auth_hook
    }
    browser_cfg = BrowserConfig(headless=True)
    crawler_cfg = CrawlerRunConfig()
    async with AsyncWebCrawler(config=browser_cfg, hooks=hooks) as crawler:
        result = await crawler.arun(url="https://example.com/protected", config=crawler_cfg)
        if result.success:
            print("[OK] Logged in and fetched protected page.")
        else:
            print("[ERROR]", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())
```
**Key Points**
- `before_goto` receives `page`, `context`, `goto_params` so you can add cookies, localStorage, or even change the URL itself.
- If you need to run a real login flow (submitting forms) once at the start, consider `on_page_context_created`, which has a `page` to work with. `on_browser_created` runs earlier but provides no `page`, so it suits browser-level setup only.
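The commented-out line in the example above appends `"?debug=1"` by string concatenation, which produces a malformed URL when the original already has a query string. The following sketch shows one safer way to do it with the standard library. `with_query_param` and `before_goto_debug_hook` are hypothetical names, and the hook assumes `goto_params` is a dict with a `"url"` key, as described above.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_query_param(url, key, value):
    """Return `url` with `key=value` in its query string, replacing any existing `key`."""
    parts = urlsplit(url)
    query = [
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k != key
    ]
    query.append((key, value))
    return urlunsplit(parts._replace(query=urlencode(query)))


async def before_goto_debug_hook(page, context, goto_params, **kwargs):
    # Assumption: goto_params is a mutable dict that carries the target URL.
    goto_params["url"] = with_query_param(goto_params["url"], "debug", "1")
```

For example, `with_query_param("https://example.com/p?a=1", "debug", "1")` yields `https://example.com/p?a=1&debug=1`, and any `#fragment` is preserved.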
---
### 3.2 Setting Up the Browser in `on_browser_created`
If you need to do advanced browser-level configuration (e.g., hooking into the Chrome DevTools Protocol, adjusting command-line flags, etc.), you’ll use `on_browser_created`. No `page` is available yet, but you can set up the **browser** instance itself.
```python
async def on_browser_created_hook(browser, **kwargs):
    """
    Runs immediately after the browser is created, before any pages.
    'browser' here is a Playwright Browser object.
    """
    print("[HOOK] Browser created. Setting up custom stuff.")
    # Possibly open a DevTools (CDP) session or create a fresh, incognito-like context.
    # Example (pseudo-code; note that new_context() does not take a `devtools` argument):
    # context = await browser.new_context()

# Usage (inside an async function):
async with AsyncWebCrawler(hooks={"on_browser_created": on_browser_created_hook}) as crawler:
    ...
```
---
### 3.3 Adjusting Page or Context in `on_page_context_created`
If you’d like to set default timeouts or inject scripts right after a page context is spun up:
```python
async def on_page_context_created_hook(page, context, **kwargs):
    print("[HOOK] Page context created. Setting default timeouts or scripts.")
    # set_default_timeout() is a plain (non-async) method in Playwright, so it is not awaited.
    page.set_default_timeout(20000)  # 20 seconds
    # Possibly inject a script or set user locale

# Usage:
hooks = {
    "on_page_context_created": on_page_context_created_hook
}
```
---
### 3.4 Dynamically Updating User Agents
`on_user_agent_updated` is fired whenever the strategy updates the user agent. For instance, you might want to set certain cookies or console-log changes for debugging:
```python
async def on_user_agent_updated_hook(page, context, new_ua, **kwargs):
    print(f"[HOOK] User agent updated to {new_ua}")
    # Maybe add a custom header based on new UA
    await context.set_extra_http_headers({"X-UA-Source": new_ua})

hooks = {
    "on_user_agent_updated": on_user_agent_updated_hook
}
```
---
### 3.5 Initializing Stuff with `on_execution_started`
`on_execution_started` runs before your main crawling logic. It’s a good place for short, one-time setup tasks (like clearing old caches, or storing a timestamp).
```python
async def on_execution_started_hook(page, context, **kwargs):
    print("[HOOK] Execution started. Setting a start timestamp or logging.")
    context.set_default_navigation_timeout(45000)  # 45s if your site is slow

hooks = {
    "on_execution_started": on_execution_started_hook
}
```
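The "storing a timestamp" idea needs somewhere to keep state between hooks. One possible approach, sketched below, is to use bound methods of a small object as the hook functions. `CrawlTimer` is a hypothetical helper, and the sketch assumes `before_return_html` receives `page`, `context`, and `html`, as in the example in section 3.7. The simulation at the end calls the hooks directly with placeholder arguments, so it runs without a browser.

```python
import asyncio
import time


class CrawlTimer:
    """Hypothetical helper that shares a start timestamp between two hooks."""

    def __init__(self):
        self.started_at = None
        self.elapsed = None

    async def on_execution_started(self, page, context, **kwargs):
        # monotonic() is unaffected by system clock adjustments.
        self.started_at = time.monotonic()

    async def before_return_html(self, page, context, html, **kwargs):
        if self.started_at is not None:
            self.elapsed = time.monotonic() - self.started_at
        return html  # leave the HTML unchanged


timer = CrawlTimer()
hooks = {
    "on_execution_started": timer.on_execution_started,
    "before_return_html": timer.before_return_html,
}


async def simulate():
    # None stands in for the real page/context objects, which are not used here.
    await hooks["on_execution_started"](None, None)
    return await hooks["before_return_html"](None, None, "<p>hi</p>")


html = asyncio.run(simulate())
print(html, f"elapsed={timer.elapsed:.6f}s")
```

Keeping the state on an object rather than in module-level globals makes it straightforward to use one timer per crawler instance.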
---
### 3.6 Post-Processing with `after_goto`
After the crawler finishes navigating (i.e., the page has presumably loaded), you can do additional checks or manipulations—like verifying you’re on the right page, or removing interstitials:
```python
async def after_goto_hook(page, context, response, **kwargs):
    """
    Called right after page.goto() finishes, but before the crawler extracts HTML.
    """
    if response and response.ok:
        print("[HOOK] After goto. Status:", response.status)
        # Maybe remove popups or check if we landed on a login failure page.
        await page.evaluate("""() => {
            const popup = document.querySelector(".annoying-popup");
            if (popup) popup.remove();
        }""")
    else:
        print("[HOOK] Navigation might have failed, status not ok or no response.")

hooks = {
    "after_goto": after_goto_hook
}
```
---
### 3.7 Last-Minute Modifications in `before_retrieve_html` or `before_return_html`
Sometimes you need to tweak the page or raw HTML right before it’s captured.
```python
async def before_retrieve_html_hook(page, context, **kwargs):
    """
    Modify the DOM just before the crawler finalizes the HTML.
    """
    print("[HOOK] Removing adverts before capturing HTML.")
    await page.evaluate("""() => {
        const ads = document.querySelectorAll(".ad-banner");
        ads.forEach(ad => ad.remove());
    }""")

async def before_return_html_hook(page, context, html, **kwargs):
    """
    'html' is the near-finished HTML string. Return an updated string if you like.
    """
    # For example, remove personal data or certain tags from the final text
    print("[HOOK] Sanitizing final HTML.")
    sanitized_html = html.replace("PersonalInfo:", "[REDACTED]")
    return sanitized_html

hooks = {
    "before_retrieve_html": before_retrieve_html_hook,
    "before_return_html": before_return_html_hook
}
```
**Note**: If you want to make last-second changes in `before_return_html`, you can manipulate the `html` string directly. Return a new string if you want to override.
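Note that a plain `html.replace("PersonalInfo:", "[REDACTED]")` swaps out only the label and leaves the value after it in place. If the goal is to hide the value itself, a regular expression is one option. The sketch below assumes a hypothetical format in which the sensitive value follows the `PersonalInfo:` label and runs until the next tag or end of line; `redact_personal_info` is an illustrative name, not a Crawl4AI function.

```python
import re

# Assumed format: the label "PersonalInfo:" followed by the sensitive value,
# which runs until the next tag ("<") or the end of the line.
_PERSONAL_INFO = re.compile(r"(PersonalInfo:)[^<\n]*")


def redact_personal_info(html):
    """Keep the label but replace the value after it with a placeholder."""
    return _PERSONAL_INFO.sub(r"\1 [REDACTED]", html)


async def before_return_html_hook(page, context, html, **kwargs):
    # Return the new string so it overrides the original HTML.
    return redact_personal_info(html)
```

Keeping the string transformation in a plain function like this makes it easy to unit-test without launching a browser.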
---
## 4. Putting It All Together
You can combine multiple hooks in a single run. For instance:
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def on_browser_created_hook(browser, **kwargs):
    print("[HOOK] Browser is up, no page yet. Good for broad config.")

async def before_goto_auth_hook(page, context, goto_params, **kwargs):
    print("[HOOK] Adding cookies for auth.")
    # Playwright requires each cookie to have either a "url" or both "domain" and "path".
    await context.add_cookies([
        {"name": "session", "value": "abcd1234", "domain": "example.com", "path": "/"}
    ])

async def after_goto_log_hook(page, context, response, **kwargs):
    if response:
        print("[HOOK] after_goto: Status code:", response.status)

async def main():
    hooks = {
        "on_browser_created": on_browser_created_hook,
        "before_goto": before_goto_auth_hook,
        "after_goto": after_goto_log_hook
    }
    browser_cfg = BrowserConfig(headless=True)
    crawler_cfg = CrawlerRunConfig(verbose=True)
    async with AsyncWebCrawler(config=browser_cfg, hooks=hooks) as crawler:
        result = await crawler.arun("https://example.com/protected", config=crawler_cfg)
        if result.success:
            print("[OK] Protected page length:", len(result.html))
        else:
            print("[ERROR]", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())
```
This example:
1. **`on_browser_created`** sets up the brand-new browser instance.
2. **`before_goto`** ensures you inject an auth cookie before accessing the page.
3. **`after_goto`** logs the resulting HTTP status code.
---
## 5. Common Pitfalls & Best Practices
1. **Hook Order**: If multiple hooks do overlapping tasks (e.g., two `before_goto` hooks), be mindful of conflicts or repeated logic.
2. **Async vs Sync**: Some hooks might be used in a synchronous or asynchronous style. Confirm your function signature. If the crawler expects `async`, define `async def`.
3. **Mutating goto_params**: `goto_params` is a dict that eventually goes to Playwright’s `page.goto()`. Changing the `url` or adding extra fields can be powerful but can also lead to confusion. Document your changes carefully.
4. **Browser vs Page vs Context**: Not all hooks have both `page` and `context`. For example, `on_browser_created` only has access to **`browser`**.
5. **Avoid Overdoing It**: Hooks are powerful but can lead to complexity. If you find yourself writing massive code inside a hook, consider if a separate “how-to” function with a simpler approach might suffice.
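On the first pitfall: since the `hooks` dictionary maps each hook name to a single function, one way to run several pieces of logic at the same stage is to wrap them in one function that calls them in a fixed order. The sketch below is a hypothetical helper (`chain_hooks` is not part of Crawl4AI) that accepts both sync and async callables.

```python
import asyncio
import inspect


def chain_hooks(*hooks):
    """Hypothetical helper: combine several callables into one hook, run in order."""

    async def chained(*args, **kwargs):
        result = None
        for hook in hooks:
            result = hook(*args, **kwargs)
            if inspect.isawaitable(result):
                result = await result
        return result  # value returned by the last hook in the chain

    return chained


calls = []


async def add_auth_cookie(page, context, goto_params, **kwargs):
    calls.append("auth")


def log_navigation(page, context, goto_params, **kwargs):
    calls.append("log")


hooks = {"before_goto": chain_hooks(add_auth_cookie, log_navigation)}

# Simulate the call with placeholder arguments; no browser is needed.
asyncio.run(hooks["before_goto"](None, None, {"url": "https://example.com"}))
print(calls)  # ['auth', 'log']
```

Making the order explicit in one place avoids guessing which of two overlapping hooks runs first.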
---
## Conclusion & Next Steps
**Hooks** let you bend Crawl4AI to your will:
- **Authentication** (cookies, localStorage) with `before_goto`
- **Browser-level config** with `on_browser_created`
- **Page or context config** with `on_page_context_created`
- **Content modifications** before capturing HTML (`before_retrieve_html` or `before_return_html`)
**Where to go next**:
- **[Identity-Based Crawling & Anti-Bot](./identity-anti-bot.md)**: Combine hooks with advanced user simulation to avoid bot detection.
- **[Reference → AsyncPlaywrightCrawlerStrategy](../../reference/browser-strategies.md)**: Learn more about how hooks are implemented under the hood.
- **[How-To Guides](../../how-to/)**: Check short, specific recipes for tasks like scraping multiple pages with repeated “Load More” clicks.
With the hook system, you have near-complete control over the browser’s lifecycle—whether it’s setting up environment variables, customizing user agents, or manipulating the HTML. Enjoy the freedom to create sophisticated, fully customized crawling pipelines!
**Last Updated**: 2024-XX-XX