# Hooks & Custom Code

Crawl4AI supports a **hook** system that lets you run your own Python code at specific points in the crawling pipeline. By injecting logic into these hooks, you can automate tasks like:

- **Authentication** (log in before navigating)  
- **Content manipulation** (modify HTML, inject scripts, etc.)  
- **Session or browser configuration** (e.g., adjusting user agents, local storage)  
- **Custom data collection** (scrape extra details or track state at each stage)

In this tutorial, you’ll learn about:

1. What hooks are available  
2. How to attach code to each hook  
3. Practical examples (auth flows, user agent changes, content manipulation, etc.)

> **Prerequisites**  
> - Familiar with [AsyncWebCrawler Basics](./async-webcrawler-basics.md).  
> - Comfortable with Python async/await.

---

## 1. Overview of Available Hooks

| Hook Name                | Called When / Purpose                                           | Context / Objects Provided                         |
|--------------------------|-----------------------------------------------------------------|-----------------------------------------------------|
| **`on_browser_created`** | Immediately after the browser is launched, but **before** any page or context is created. | **Browser** object only (no `page` yet). Use it for broad browser-level config. |
| **`on_page_context_created`** | Right after a new page context is created. Perfect for setting default timeouts, injecting scripts, etc. | Typically provides `page` and `context`.           |
| **`on_user_agent_updated`** | Whenever the user agent changes. For advanced user agent logic or additional header updates. | Typically provides `page` and updated user agent string. |
| **`on_execution_started`** | Right before your main crawling logic runs (before rendering the page). Good for one-time setup or variable initialization. | Typically provides `page`, possibly `context`.      |
| **`before_goto`**        | Right before navigating to the URL (i.e., `page.goto(...)`). Great for setting cookies, altering the URL, or hooking in authentication steps. | Typically provides `page`, `context`, and `goto_params`. |
| **`after_goto`**         | Immediately after navigation completes, but before scraping. For post-login checks or initial content adjustments. | Typically provides `page`, `context`, `response`.   |
| **`before_retrieve_html`** | Right before retrieving or finalizing the page’s HTML content. Good for in-page manipulation (e.g., removing ads or disclaimers). | Typically provides `page` or final HTML reference.  |
| **`before_return_html`** | Just before the HTML is returned to the crawler pipeline. Last chance to alter or sanitize content. | Typically provides final HTML or a `page`.          |

### A Note on `on_browser_created` (the “unbrowser” hook)
- **No `page`** object is available because no page context exists yet. You can, however, set up browser-wide properties.  
- For example, you might open Chrome DevTools Protocol (CDP) sessions or adjust other browser-level settings here.

---

## 2. Registering Hooks

You can attach hooks by calling:

```python
crawler.crawler_strategy.set_hook("hook_name", your_hook_function)
```

or by passing a `hooks` dictionary to `AsyncWebCrawler` or your strategy constructor:

```python
hooks = {
    "before_goto": my_before_goto_hook,
    "after_goto": my_after_goto_hook,
    # ... etc.
}
async with AsyncWebCrawler(hooks=hooks) as crawler:
    ...
```

### Hook Signature

Each hook is a function (async or sync, depending on your usage) that receives **certain parameters**—most often `page`, `context`, or custom arguments relevant to that stage. The library then awaits or calls your hook before continuing.
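
To make the "async or sync" point concrete, here is a minimal sketch of how a dispatcher could invoke either kind of hook. It shows the general pattern only: `call_hook`, `sync_hook`, and `async_hook` are hypothetical names for illustration, not Crawl4AI's internal implementation.

```python
import asyncio
import inspect

async def call_hook(hook, *args, **kwargs):
    """Call a hook and await its result if it is awaitable (hypothetical helper)."""
    result = hook(*args, **kwargs)
    if inspect.isawaitable(result):
        result = await result
    return result

def sync_hook(page, **kwargs):
    # A plain function is enough when nothing needs to be awaited.
    return f"sync saw {page}"

async def async_hook(page, **kwargs):
    # Use `async def` whenever the hook awaits browser calls.
    await asyncio.sleep(0)
    return f"async saw {page}"

async def demo():
    # Strings stand in for real page objects in this sketch.
    return [
        await call_hook(sync_hook, "page-1"),
        await call_hook(async_hook, "page-2"),
    ]

if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Accepting `**kwargs` in every hook, as the examples in this tutorial do, keeps a hook working if extra keyword arguments are passed to it.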

---

## 3. Real-Life Examples

Below are concrete scenarios where hooks come in handy.

---

### 3.1 Authentication Before Navigation

One of the most frequent tasks is logging in or applying authentication **before** the crawler navigates to a URL (so that the user is recognized immediately).

#### Using `before_goto`

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def before_goto_auth_hook(page, context, goto_params, **kwargs):
    """
    Example: Set cookies or localStorage to simulate login.
    This hook runs right before page.goto() is called.
    """
    # Example: Insert cookie-based auth or local storage data
    # (You could also do more complex actions, like fill forms if you already have a 'page' open.)
    print("[HOOK] Setting auth data before goto.")
    await context.add_cookies([
        {
            "name": "session",
            "value": "abcd1234",
            "domain": "example.com",
            "path": "/"
        }
    ])
    # Optionally manipulate goto_params if needed:
    # goto_params["url"] = goto_params["url"] + "?debug=1"

async def main():
    hooks = {
        "before_goto": before_goto_auth_hook
    }

    browser_cfg = BrowserConfig(headless=True)
    crawler_cfg = CrawlerRunConfig()

    async with AsyncWebCrawler(config=browser_cfg, hooks=hooks) as crawler:
        result = await crawler.arun(url="https://example.com/protected", config=crawler_cfg)
        if result.success:
            print("[OK] Logged in and fetched protected page.")
        else:
            print("[ERROR]", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())
```

**Key Points**  
- `before_goto` receives `page`, `context`, `goto_params` so you can add cookies, localStorage, or even change the URL itself.  
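- If you need to run a real login flow (submitting forms) and want to do it only once at the start, consider `on_browser_created` or `on_page_context_created`. Only the latter receives a `page`; in `on_browser_created` you would have to create your own page from the `browser` object.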
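
The commented-out `goto_params["url"] + "?debug=1"` line in the example above would produce a malformed URL if the original URL already had a query string. One possible way to add a parameter safely is sketched below with the standard library. `with_query_param` and `before_goto_debug_hook` are hypothetical names, and the sketch assumes `goto_params` is a mutable dict with a `"url"` entry, as described in this tutorial.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def with_query_param(url: str, key: str, value: str) -> str:
    """Return `url` with key=value set, keeping other query parameters intact."""
    parts = urlsplit(url)
    # Drop any existing value for `key`, then append the new one.
    pairs = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True) if k != key]
    pairs.append((key, value))
    return urlunsplit(parts._replace(query=urlencode(pairs)))

async def before_goto_debug_hook(page, context, goto_params, **kwargs):
    # Assumption: goto_params is a dict whose "url" entry is used for navigation.
    goto_params["url"] = with_query_param(goto_params["url"], "debug", "1")
```

For instance, `with_query_param("https://example.com/p?a=1", "debug", "1")` keeps `a=1` and appends `debug=1` with the correct separator.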

---

### 3.2 Setting Up the Browser in `on_browser_created`

If you need to do advanced browser-level configuration (e.g., hooking into the Chrome DevTools Protocol, adjusting command-line flags, etc.), you’ll use `on_browser_created`. No `page` is available yet, but you can set up the **browser** instance itself.

```python
async def on_browser_created_hook(browser, **kwargs):
    """
    Runs immediately after the browser is created, before any pages.
    'browser' here is a Playwright Browser object.
    """
    print("[HOOK] Browser created. Setting up custom stuff.")
    # Possibly open a browser-level CDP session (Chromium-based browsers only):
    # cdp_session = await browser.new_browser_cdp_session()

# Usage (inside an async function):
async with AsyncWebCrawler(hooks={"on_browser_created": on_browser_created_hook}) as crawler:
    ...
```

---

### 3.3 Adjusting Page or Context in `on_page_context_created`

If you’d like to set default timeouts or inject scripts right after a page context is spun up:

```python
async def on_page_context_created_hook(page, context, **kwargs):
    print("[HOOK] Page context created. Setting default timeouts or scripts.")
    page.set_default_timeout(20000)  # 20 seconds; this Playwright method is synchronous, so no await
    # Possibly inject a script or set user locale

# Usage:
hooks = {
    "on_page_context_created": on_page_context_created_hook
}
```

---

### 3.4 Dynamically Updating User Agents

`on_user_agent_updated` is fired whenever the strategy updates the user agent. For instance, you might want to set certain cookies or console-log changes for debugging:

```python
async def on_user_agent_updated_hook(page, context, new_ua, **kwargs):
    print(f"[HOOK] User agent updated to {new_ua}")
    # Maybe add a custom header based on new UA
    await context.set_extra_http_headers({"X-UA-Source": new_ua})

hooks = {
    "on_user_agent_updated": on_user_agent_updated_hook
}
```

---

### 3.5 Initializing Stuff with `on_execution_started`

`on_execution_started` runs before your main crawling logic. It’s a good place for short, one-time setup tasks (like clearing old caches, or storing a timestamp).

```python
async def on_execution_started_hook(page, context, **kwargs):
    print("[HOOK] Execution started. Raising the navigation timeout.")
    context.set_default_navigation_timeout(45000)  # 45s if your site is slow

hooks = {
    "on_execution_started": on_execution_started_hook
}
```

---

### 3.6 Post-Processing with `after_goto`

After the crawler finishes navigating (i.e., the page has presumably loaded), you can do additional checks or manipulations—like verifying you’re on the right page, or removing interstitials:

```python
async def after_goto_hook(page, context, response, **kwargs):
    """
    Called right after page.goto() finishes, but before the crawler extracts HTML.
    """
    if response and response.ok:
        print("[HOOK] After goto. Status:", response.status)
        # Maybe remove popups or check if we landed on a login failure page.
        await page.evaluate("""() => {
            const popup = document.querySelector(".annoying-popup");
            if (popup) popup.remove();
        }""")
    else:
        print("[HOOK] Navigation might have failed, status not ok or no response.")

hooks = {
    "after_goto": after_goto_hook
}
```
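
The comment about checking for a login failure page can be made concrete. The sketch below inspects the final URL after navigation and warns if it looks like a login page. `LOGIN_PATHS`, `landed_on_login`, and `after_goto_auth_check_hook` are hypothetical names, and the path list is an assumption you would adapt to your target site.

```python
from urllib.parse import urlsplit

# Hypothetical login paths; adjust these for the site you are crawling.
LOGIN_PATHS = ("/login", "/signin")

def landed_on_login(final_url: str) -> bool:
    """Return True if the URL's path ends with one of the known login paths."""
    path = urlsplit(final_url).path.rstrip("/")
    return path.endswith(LOGIN_PATHS)

async def after_goto_auth_check_hook(page, context, response, **kwargs):
    # page.url is Playwright's property for the page's current URL (after redirects).
    if landed_on_login(page.url):
        print(f"[HOOK] Auth may have failed: landed on {page.url}")
```

Checking the URL rather than the status code helps because a redirect to a login form usually still returns a 200 response.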

---

### 3.7 Last-Minute Modifications in `before_retrieve_html` or `before_return_html`

Sometimes you need to tweak the page or raw HTML right before it’s captured.

```python
async def before_retrieve_html_hook(page, context, **kwargs):
    """
    Modify the DOM just before the crawler finalizes the HTML.
    """
    print("[HOOK] Removing adverts before capturing HTML.")
    await page.evaluate("""() => {
        const ads = document.querySelectorAll(".ad-banner");
        ads.forEach(ad => ad.remove());
    }""")

async def before_return_html_hook(page, context, html, **kwargs):
    """
    'html' is the near-finished HTML string. Return an updated string if you like.
    """
    # For example, remove personal data or certain tags from the final text
    print("[HOOK] Sanitizing final HTML.")
    sanitized_html = html.replace("PersonalInfo:", "[REDACTED]")
    return sanitized_html

hooks = {
    "before_retrieve_html": before_retrieve_html_hook,
    "before_return_html": before_return_html_hook
}
```

**Note**: If you want to make last-second changes in `before_return_html`, you can manipulate the `html` string directly. Return a new string if you want to override.
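
For sanitizing that goes beyond a fixed string, a regular expression is one option. The following sketch redacts email addresses; `EMAIL_RE`, `redact_emails`, and `redact_emails_hook` are hypothetical names, the pattern is deliberately simple rather than a full address validator, and the hook relies on the return-a-new-string behavior described in the note above.

```python
import re

# Simple illustrative pattern; it does not cover every valid email address.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def redact_emails(html: str) -> str:
    """Replace anything that looks like an email address with a placeholder."""
    return EMAIL_RE.sub("[REDACTED]", html)

async def redact_emails_hook(page, context, html, **kwargs):
    # Return the updated string so it replaces the original HTML.
    return redact_emails(html)
```

Keeping the transformation in a plain function such as `redact_emails` lets you unit-test it without launching a browser.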

---

## 4. Putting It All Together

You can combine multiple hooks in a single run. For instance:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def on_browser_created_hook(browser, **kwargs):
    print("[HOOK] Browser is up, no page yet. Good for broad config.")

async def before_goto_auth_hook(page, context, goto_params, **kwargs):
    print("[HOOK] Adding cookies for auth.")
    # Playwright requires either a "url" or a "domain"/"path" pair for each cookie.
    await context.add_cookies([{"name": "session", "value": "abcd1234", "domain": "example.com", "path": "/"}])

async def after_goto_log_hook(page, context, response, **kwargs):
    if response:
        print("[HOOK] after_goto: Status code:", response.status)

async def main():
    hooks = {
        "on_browser_created": on_browser_created_hook,
        "before_goto": before_goto_auth_hook,
        "after_goto": after_goto_log_hook
    }

    browser_cfg = BrowserConfig(headless=True)
    crawler_cfg = CrawlerRunConfig(verbose=True)

    async with AsyncWebCrawler(config=browser_cfg, hooks=hooks) as crawler:
        result = await crawler.arun("https://example.com/protected", config=crawler_cfg)
        if result.success:
            print("[OK] Protected page length:", len(result.html))
        else:
            print("[ERROR]", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())
```

This example:

1. **`on_browser_created`** sets up the brand-new browser instance.  
2. **`before_goto`** ensures you inject an auth cookie before accessing the page.  
3. **`after_goto`** logs the resulting HTTP status code.

---

## 5. Common Pitfalls & Best Practices

1. **Hook Order**: If multiple hooks do overlapping tasks (e.g., two `before_goto` hooks), be mindful of conflicts or repeated logic.  
2. **Async vs Sync**: Hooks may be written as synchronous or asynchronous functions. If your hook awaits anything (for example, `context.add_cookies(...)`), it must be defined with `async def`.  
3. **Mutating goto_params**: `goto_params` is a dict that eventually goes to Playwright’s `page.goto()`. Changing the `url` or adding extra fields can be powerful but can also lead to confusion. Document your changes carefully.  
4. **Browser vs Page vs Context**: Not all hooks have both `page` and `context`. For example, `on_browser_created` only has access to **`browser`**.  
5. **Avoid Overdoing It**: Hooks are powerful but can lead to complexity. If you find yourself writing a large amount of code inside a hook, consider moving it into a separate helper function or looking for a simpler approach.
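
On the hook-order point: when several pieces of logic belong to the same stage, one way to make their order explicit is to combine them into a single hook yourself. The sketch below is one possible approach, not a Crawl4AI feature. `chain_hooks`, `add_auth`, and `log_target` are hypothetical names, and the combined hook discards return values, so it suits hooks such as `before_goto` where the return value is not used.

```python
import asyncio
import inspect

def chain_hooks(*hooks):
    """Build one async hook that runs the given hooks in order (hypothetical helper)."""
    async def combined(*args, **kwargs):
        for hook in hooks:
            result = hook(*args, **kwargs)
            # Support both sync and async hooks.
            if inspect.isawaitable(result):
                await result
    return combined

async def demo():
    calls = []

    async def add_auth(page, context, goto_params, **kwargs):
        calls.append("auth")

    def log_target(page, context, goto_params, **kwargs):
        calls.append(f"goto {goto_params['url']}")

    before_goto = chain_hooks(add_auth, log_target)
    # None stands in for the real page and context objects in this dry run.
    await before_goto(None, None, {"url": "https://example.com"})
    return calls

if __name__ == "__main__":
    print(asyncio.run(demo()))
```

You would then register the combined function under a single hook name, for example `hooks = {"before_goto": chain_hooks(add_auth, log_target)}`.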

---

## Conclusion & Next Steps

**Hooks** let you bend Crawl4AI to your will:

- **Authentication** (cookies, localStorage) with `before_goto`  
- **Browser-level config** with `on_browser_created`  
- **Page or context config** with `on_page_context_created`  
- **Content modifications** before capturing HTML (`before_retrieve_html` or `before_return_html`)  

**Where to go next**:

- **[Identity-Based Crawling & Anti-Bot](./identity-anti-bot.md)**: Combine hooks with advanced user simulation to avoid bot detection.  
- **[Reference → AsyncPlaywrightCrawlerStrategy](../../reference/browser-strategies.md)**: Learn more about how hooks are implemented under the hood.  
- **[How-To Guides](../../how-to/)**: Check short, specific recipes for tasks like scraping multiple pages with repeated “Load More” clicks.

With the hook system, you have near-complete control over the browser’s lifecycle—whether it’s setting up environment variables, customizing user agents, or manipulating the HTML. Enjoy the freedom to create sophisticated, fully customized crawling pipelines!

**Last Updated**: 2024-XX-XX