# Hooks & Custom Code

Crawl4AI supports a **hook** system that lets you run your own Python code at specific points in the crawling pipeline. By injecting logic into these hooks, you can automate tasks like:

- **Authentication** (log in before navigating)
- **Content manipulation** (modify HTML, inject scripts, etc.)
- **Session or browser configuration** (e.g., adjusting user agents, local storage)
- **Custom data collection** (scrape extra details or track state at each stage)

In this tutorial, you’ll learn about:

1. What hooks are available
2. How to attach code to each hook
3. Practical examples (auth flows, user agent changes, content manipulation, etc.)

> **Prerequisites**
> - Familiar with [AsyncWebCrawler Basics](./async-webcrawler-basics.md).
> - Comfortable with Python async/await.

---

## 1. Overview of Available Hooks

| Hook Name | Called When / Purpose | Context / Objects Provided |
|--------------------------|-----------------------------------------------------------------|-----------------------------------------------------|
| **`on_browser_created`** | Immediately after the browser is launched, but **before** any page or context is created. | **Browser** object only (no `page` yet). Use it for broad browser-level config. |
| **`on_page_context_created`** | Right after a new page context is created. Perfect for setting default timeouts, injecting scripts, etc. | Typically provides `page` and `context`. |
| **`on_user_agent_updated`** | Whenever the user agent changes. For advanced user agent logic or additional header updates. | Typically provides `page` and updated user agent string. |
| **`on_execution_started`** | Right before your main crawling logic runs (before rendering the page). Good for one-time setup or variable initialization. | Typically provides `page`, possibly `context`. |
| **`before_goto`** | Right before navigating to the URL (i.e., `page.goto(...)`). Great for setting cookies, altering the URL, or hooking in authentication steps. | Typically provides `page`, `context`, and `goto_params`. |
| **`after_goto`** | Immediately after navigation completes, but before scraping. For post-login checks or initial content adjustments. | Typically provides `page`, `context`, `response`. |
| **`before_retrieve_html`** | Right before retrieving or finalizing the page’s HTML content. Good for in-page manipulation (e.g., removing ads or disclaimers). | Typically provides `page` or final HTML reference. |
| **`before_return_html`** | Just before the HTML is returned to the crawler pipeline. Last chance to alter or sanitize content. | Typically provides final HTML or a `page`. |

### A Note on `on_browser_created` (the “unbrowser” hook)

- **No `page`** object is available because no page context exists yet. You can, however, set up browser-wide properties.
- For example, you might control CDP (Chrome DevTools Protocol) sessions or advanced browser flags here.

---

## 2. Registering Hooks

You can attach hooks by calling:

```python
crawler.crawler_strategy.set_hook("hook_name", your_hook_function)
```

or by passing a `hooks` dictionary to `AsyncWebCrawler` or your strategy constructor:

```python
hooks = {
    "before_goto": my_before_goto_hook,
    "after_goto": my_after_goto_hook,
    # ... etc.
}

async with AsyncWebCrawler(hooks=hooks) as crawler:
    ...
```

### Hook Signature

Each hook is a function (async or sync, depending on your usage) that receives **certain parameters**, most often `page`, `context`, or custom arguments relevant to that stage. The library then awaits or calls your hook before continuing.

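
The "awaits or calls" behaviour can be pictured with a small dispatcher. This is only a sketch, assuming a hook runner that inspects what the hook returns; `run_hook` and the two sample hooks are hypothetical names for illustration, not Crawl4AI internals. Accepting `**kwargs` in each hook keeps it tolerant of extra arguments the caller may pass.

```python
import asyncio
import inspect


async def run_hook(hook, *args, **kwargs):
    """Call a hook that may be sync or async and return its result.

    Hypothetical helper, for illustration only.
    """
    result = hook(*args, **kwargs)
    # Calling an async function yields an awaitable; a sync function yields its value.
    if inspect.isawaitable(result):
        result = await result
    return result


def sync_hook(page, **kwargs):
    return f"sync saw {page}"


async def async_hook(page, **kwargs):
    await asyncio.sleep(0)  # stand-in for real async work
    return f"async saw {page}"


async def demo():
    # A plain string stands in for a real Playwright page object.
    return [
        await run_hook(sync_hook, "fake-page"),
        await run_hook(async_hook, "fake-page"),
    ]


if __name__ == "__main__":
    print(asyncio.run(demo()))
```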

---

## 3. Real-Life Examples

Below are concrete scenarios where hooks come in handy.

---

### 3.1 Authentication Before Navigation

One of the most frequent tasks is logging in or applying authentication **before** the crawler navigates to a URL (so that the user is recognized immediately).

#### Using `before_goto`

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def before_goto_auth_hook(page, context, goto_params, **kwargs):
    """
    Example: Set cookies or localStorage to simulate login.
    This hook runs right before page.goto() is called.
    """
    # Example: Insert cookie-based auth or local storage data
    # (You could also do more complex actions, like fill forms if you already have a 'page' open.)
    print("[HOOK] Setting auth data before goto.")
    await context.add_cookies([
        {
            "name": "session",
            "value": "abcd1234",
            "domain": "example.com",
            "path": "/"
        }
    ])
    # Optionally manipulate goto_params if needed:
    # goto_params["url"] = goto_params["url"] + "?debug=1"

async def main():
    hooks = {
        "before_goto": before_goto_auth_hook
    }

    browser_cfg = BrowserConfig(headless=True)
    crawler_cfg = CrawlerRunConfig()

    async with AsyncWebCrawler(config=browser_cfg, hooks=hooks) as crawler:
        result = await crawler.arun(url="https://example.com/protected", config=crawler_cfg)
        if result.success:
            print("[OK] Logged in and fetched protected page.")
        else:
            print("[ERROR]", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())
```

**Key Points**

- `before_goto` receives `page`, `context`, and `goto_params`, so you can add cookies, set localStorage, or even change the URL itself.
- If you need to run a real login flow (submitting forms) once at the start, consider `on_page_context_created`, which already has a `page` to drive. `on_browser_created` runs before any page exists, so a form-based login there would require opening your own page first.

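
Changing the URL by plain string concatenation (for example appending `?debug=1`) breaks when the URL already carries a query string. The helper below is a minimal sketch using only the standard library. `add_query_param` is a hypothetical name, and the commented usage assumes, as this tutorial does, that `goto_params` holds the target under a `"url"` key.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def add_query_param(url: str, key: str, value: str) -> str:
    """Return url with key=value added, replacing any existing value for key."""
    parts = urlsplit(url)
    # Keep existing parameters (including blank ones) except the one being set.
    params = [
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k != key
    ]
    params.append((key, value))
    return urlunsplit(parts._replace(query=urlencode(params)))


# Possible use inside a before_goto hook (assumed goto_params layout):
#     goto_params["url"] = add_query_param(goto_params["url"], "debug", "1")

if __name__ == "__main__":
    print(add_query_param("https://example.com/protected", "debug", "1"))
    print(add_query_param("https://example.com/search?q=hooks#top", "debug", "1"))
```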

---

### 3.2 Setting Up the Browser in `on_browser_created`

If you need to do advanced browser-level configuration (e.g., hooking into the Chrome DevTools Protocol, adjusting command-line flags, etc.), you’ll use `on_browser_created`. No `page` is available yet, but you can set up the **browser** instance itself.

```python
from crawl4ai import AsyncWebCrawler

async def on_browser_created_hook(browser, **kwargs):
    """
    Runs immediately after the browser is created, before any pages.
    'browser' here is a Playwright Browser object.
    """
    print("[HOOK] Browser created. Setting up custom stuff.")
    # Possibly connect to DevTools or create an incognito context
    # Example (pseudo-code):
    # context = await browser.new_context()

# Usage (must run inside an async function):
async def main():
    async with AsyncWebCrawler(hooks={"on_browser_created": on_browser_created_hook}) as crawler:
        ...
```

---

### 3.3 Adjusting Page or Context in `on_page_context_created`

If you’d like to set default timeouts or inject scripts right after a page context is spun up:

```python
async def on_page_context_created_hook(page, context, **kwargs):
    print("[HOOK] Page context created. Setting default timeouts or scripts.")
    # set_default_timeout is a regular (non-awaitable) method in Playwright
    page.set_default_timeout(20000)  # 20 seconds
    # Possibly inject a script or set user locale

# Usage:
hooks = {
    "on_page_context_created": on_page_context_created_hook
}
```

---

### 3.4 Dynamically Updating User Agents

`on_user_agent_updated` is fired whenever the strategy updates the user agent. For instance, you might want to set certain cookies or log the change for debugging:

```python
async def on_user_agent_updated_hook(page, context, new_ua, **kwargs):
    print(f"[HOOK] User agent updated to {new_ua}")
    # Maybe add a custom header based on new UA
    await context.set_extra_http_headers({"X-UA-Source": new_ua})

hooks = {
    "on_user_agent_updated": on_user_agent_updated_hook
}
```

---

### 3.5 Initializing Stuff with `on_execution_started`

`on_execution_started` runs before your main crawling logic. It’s a good place for short, one-time setup tasks (like clearing old caches, or storing a timestamp).

```python
async def on_execution_started_hook(page, context, **kwargs):
    print("[HOOK] Execution started. Setting a start timestamp or logging.")
    context.set_default_navigation_timeout(45000)  # 45s if your site is slow

hooks = {
    "on_execution_started": on_execution_started_hook
}
```
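
Storing a timestamp needs somewhere to keep state between hook calls. One option is a closure over a shared dictionary, sketched below. `make_timing_hooks` and the `state` keys are hypothetical, and the hook signatures simply mirror the ones used in this tutorial. The demo passes `None` where a real crawl would supply `page` and `context`.

```python
import asyncio
import time


def make_timing_hooks():
    """Build two hooks that share a state dict through a closure."""
    state = {}

    async def on_execution_started(page, context, **kwargs):
        # Record a monotonic start time, unaffected by system clock changes.
        state["started_at"] = time.monotonic()

    async def before_return_html(page, context, html, **kwargs):
        # Compute the elapsed time and pass the HTML through unchanged.
        state["elapsed"] = time.monotonic() - state["started_at"]
        return html

    hooks = {
        "on_execution_started": on_execution_started,
        "before_return_html": before_return_html,
    }
    return hooks, state


async def demo():
    # None stands in for the page/context objects a real crawl would supply.
    hooks, state = make_timing_hooks()
    await hooks["on_execution_started"](None, None)
    html = await hooks["before_return_html"](None, None, "<p>hi</p>")
    return html, state


if __name__ == "__main__":
    html, state = asyncio.run(demo())
    print(html, f"{state['elapsed']:.6f}s")
```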

---

### 3.6 Post-Processing with `after_goto`

After the crawler finishes navigating (i.e., the page has presumably loaded), you can do additional checks or manipulations, such as verifying you’re on the right page or removing interstitials:

```python
async def after_goto_hook(page, context, response, **kwargs):
    """
    Called right after page.goto() finishes, but before the crawler extracts HTML.
    """
    if response and response.ok:
        print("[HOOK] After goto. Status:", response.status)
        # Maybe remove popups or check if we landed on a login failure page.
        await page.evaluate("""() => {
            const popup = document.querySelector(".annoying-popup");
            if (popup) popup.remove();
        }""")
    else:
        print("[HOOK] Navigation might have failed, status not ok or no response.")

hooks = {
    "after_goto": after_goto_hook
}
```
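
For the "login failure page" check mentioned in the comment above, one simple approach is to compare the path of the URL the browser ended up on against known login paths. This is a sketch only: `landed_on_login` and the `/login` and `/signin` paths are hypothetical and site-specific. In an `after_goto` hook, the final URL could be read from Playwright's `page.url` property.

```python
from urllib.parse import urlsplit

# Hypothetical login paths; adjust for the site being crawled.
LOGIN_PATHS = ("/login", "/signin")


def landed_on_login(final_url: str, login_paths=LOGIN_PATHS) -> bool:
    """Return True if final_url's path is, or sits under, a known login path."""
    path = urlsplit(final_url).path.rstrip("/") or "/"
    return any(path == p or path.startswith(p + "/") for p in login_paths)


# Possible use inside an after_goto hook:
#     if landed_on_login(page.url):
#         print("[HOOK] Redirected to login; the auth cookie was probably rejected.")

if __name__ == "__main__":
    print(landed_on_login("https://example.com/login?next=/protected"))
    print(landed_on_login("https://example.com/protected"))
```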

---

### 3.7 Last-Minute Modifications in `before_retrieve_html` or `before_return_html`

Sometimes you need to tweak the page or raw HTML right before it’s captured.

```python
async def before_retrieve_html_hook(page, context, **kwargs):
    """
    Modify the DOM just before the crawler finalizes the HTML.
    """
    print("[HOOK] Removing adverts before capturing HTML.")
    await page.evaluate("""() => {
        const ads = document.querySelectorAll(".ad-banner");
        ads.forEach(ad => ad.remove());
    }""")

async def before_return_html_hook(page, context, html, **kwargs):
    """
    'html' is the near-finished HTML string. Return an updated string if you like.
    """
    # For example, remove personal data or certain tags from the final text
    print("[HOOK] Sanitizing final HTML.")
    sanitized_html = html.replace("PersonalInfo:", "[REDACTED]")
    return sanitized_html

hooks = {
    "before_retrieve_html": before_retrieve_html_hook,
    "before_return_html": before_return_html_hook
}
```

**Note**: To make last-second changes in `before_return_html`, build a modified copy of the `html` string and return it. Python strings are immutable, so return the new string if you want to override the original.

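
A slightly richer sanitizer than a single `str.replace` might strip script blocks and HTML comments. The sketch below uses regular expressions, which is an assumption of convenience: regexes are not a full HTML parser and can miss unusual markup. `sanitize_html` is a hypothetical helper, and the hook signature mirrors the one used above.

```python
import re

# Match <script ...>...</script> blocks, case-insensitively, across newlines.
SCRIPT_RE = re.compile(r"<script\b[^>]*>.*?</script\s*>", re.IGNORECASE | re.DOTALL)
# Match HTML comments such as <!-- tracking id -->.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)


def sanitize_html(html: str) -> str:
    """Return a copy of html with script blocks and comments removed."""
    html = SCRIPT_RE.sub("", html)
    return COMMENT_RE.sub("", html)


async def before_return_html_hook(page, context, html, **kwargs):
    # Return the cleaned string so it can replace the original.
    return sanitize_html(html)


if __name__ == "__main__":
    sample = '<p>Hi</p><script src="a.js"></script><!-- note --><p>Bye</p>'
    print(sanitize_html(sample))
```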

---

## 4. Putting It All Together

You can combine multiple hooks in a single run. For instance:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def on_browser_created_hook(browser, **kwargs):
    print("[HOOK] Browser is up, no page yet. Good for broad config.")

async def before_goto_auth_hook(page, context, goto_params, **kwargs):
    print("[HOOK] Adding cookies for auth.")
    # Playwright expects each cookie to carry either a url or a domain/path pair
    await context.add_cookies([
        {"name": "session", "value": "abcd1234", "domain": "example.com", "path": "/"}
    ])

async def after_goto_log_hook(page, context, response, **kwargs):
    if response:
        print("[HOOK] after_goto: Status code:", response.status)

async def main():
    hooks = {
        "on_browser_created": on_browser_created_hook,
        "before_goto": before_goto_auth_hook,
        "after_goto": after_goto_log_hook
    }

    browser_cfg = BrowserConfig(headless=True)
    crawler_cfg = CrawlerRunConfig(verbose=True)

    async with AsyncWebCrawler(config=browser_cfg, hooks=hooks) as crawler:
        result = await crawler.arun("https://example.com/protected", config=crawler_cfg)
        if result.success:
            print("[OK] Protected page length:", len(result.html))
        else:
            print("[ERROR]", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())
```

In this example:

1. **`on_browser_created`** runs once the brand-new browser instance is up (here it only logs a message).
2. **`before_goto`** injects an auth cookie before the page is requested.
3. **`after_goto`** logs the resulting HTTP status code.

---

## 5. Common Pitfalls & Best Practices

1. **Hook Order**: If multiple hooks do overlapping tasks (e.g., two `before_goto` hooks), be mindful of conflicts or repeated logic.
2. **Async vs Sync**: Hooks may be written in a synchronous or asynchronous style. Confirm your function signature. If the crawler expects `async`, define the hook with `async def`.
3. **Mutating `goto_params`**: `goto_params` is a dict that eventually goes to Playwright’s `page.goto()`. Changing the `url` or adding extra fields can be powerful but can also lead to confusion. Document your changes carefully.
4. **Browser vs Page vs Context**: Not all hooks have both `page` and `context`. For example, `on_browser_created` only has access to **`browser`**.
5. **Avoid Overdoing It**: Hooks are powerful but can lead to complexity. If you find yourself writing massive code inside a hook, consider whether moving the logic into a separate, simpler helper function would suffice.

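
For the hook-order pitfall, one way to keep ordering explicit when several pieces of logic target the same hook is to compose them into a single function. This is a sketch under the assumption that one callable is registered per hook name (as the `hooks` dictionary suggests). `chain_hooks` and the sample hooks are hypothetical, and the demo passes `None` in place of real `page` and `context` objects.

```python
import asyncio
import inspect


def chain_hooks(*hooks):
    """Combine several hooks into one that runs them in the given order."""

    async def chained(*args, **kwargs):
        for hook in hooks:
            result = hook(*args, **kwargs)
            # Support both sync and async hooks.
            if inspect.isawaitable(result):
                await result

    return chained


calls = []  # records the order in which the sample hooks run


async def add_auth_cookie(page, context, goto_params, **kwargs):
    calls.append("auth")


def add_debug_flag(page, context, goto_params, **kwargs):
    calls.append("debug")
    goto_params["url"] = goto_params["url"] + "?debug=1"


async def demo():
    calls.clear()
    hook = chain_hooks(add_auth_cookie, add_debug_flag)
    params = {"url": "https://example.com"}
    # None stands in for the real page and context objects.
    await hook(None, None, params)
    return list(calls), params


if __name__ == "__main__":
    print(asyncio.run(demo()))
```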

---

## Conclusion & Next Steps

**Hooks** let you bend Crawl4AI to your will:

- **Authentication** (cookies, localStorage) with `before_goto`
- **Browser-level config** with `on_browser_created`
- **Page or context config** with `on_page_context_created`
- **Content modifications** before capturing HTML (`before_retrieve_html` or `before_return_html`)

**Where to go next**:

- **[Identity-Based Crawling & Anti-Bot](./identity-anti-bot.md)**: Combine hooks with advanced user simulation to avoid bot detection.
- **[Reference → AsyncPlaywrightCrawlerStrategy](../../reference/browser-strategies.md)**: Learn more about how hooks are implemented under the hood.
- **[How-To Guides](../../how-to/)**: Check short, specific recipes for tasks like scraping multiple pages with repeated “Load More” clicks.

With the hook system, you have near-complete control over the browser’s lifecycle, whether it’s setting up environment variables, customizing user agents, or manipulating the HTML. Enjoy the freedom to create sophisticated, fully customized crawling pipelines!

**Last Updated**: 2024-XX-XX