# Hooks & Custom Code

Crawl4AI supports a **hook** system that lets you run your own Python code at specific points in the crawling pipeline. By injecting logic into these hooks, you can automate tasks like:

- **Authentication** (log in before navigating)
- **Content manipulation** (modify HTML, inject scripts, etc.)
- **Session or browser configuration** (e.g., adjusting user agents, local storage)
- **Custom data collection** (scrape extra details or track state at each stage)

In this tutorial, you’ll learn about:

1. What hooks are available
2. How to attach code to each hook
3. Practical examples (auth flows, user agent changes, content manipulation, etc.)

> **Prerequisites**
> - Familiar with [AsyncWebCrawler Basics](./async-webcrawler-basics.md).
> - Comfortable with Python async/await.

---

## 1. Overview of Available Hooks

| Hook Name | Called When / Purpose | Context / Objects Provided |
|--------------------------|-----------------------------------------------------------------|-----------------------------------------------------|
| **`on_browser_created`** | Immediately after the browser is launched, but **before** any page or context is created. | **Browser** object only (no `page` yet). Use it for broad browser-level config. |
| **`on_page_context_created`** | Right after a new page context is created. Perfect for setting default timeouts, injecting scripts, etc. | Typically provides `page` and `context`. |
| **`on_user_agent_updated`** | Whenever the user agent changes. For advanced user agent logic or additional header updates. | Typically provides `page` and updated user agent string. |
| **`on_execution_started`** | Right before your main crawling logic runs (before rendering the page). Good for one-time setup or variable initialization. | Typically provides `page`, possibly `context`. |
| **`before_goto`** | Right before navigating to the URL (i.e., `page.goto(...)`). Great for setting cookies, altering the URL, or hooking in authentication steps. | Typically provides `page`, `context`, and `goto_params`. |
| **`after_goto`** | Immediately after navigation completes, but before scraping. For post-login checks or initial content adjustments. | Typically provides `page`, `context`, `response`. |
| **`before_retrieve_html`** | Right before retrieving or finalizing the page’s HTML content. Good for in-page manipulation (e.g., removing ads or disclaimers). | Typically provides `page` or final HTML reference. |
| **`before_return_html`** | Just before the HTML is returned to the crawler pipeline. Last chance to alter or sanitize content. | Typically provides final HTML or a `page`. |

### A Note on `on_browser_created` (the “unbrowser” hook)

- **No `page`** object is available because no page context exists yet. You can, however, set up browser-wide properties.
- For example, you might control CDP (Chrome DevTools Protocol) sessions or advanced browser flags here.

---

## 2. Registering Hooks

You can attach hooks by calling:

```python
crawler.crawler_strategy.set_hook("hook_name", your_hook_function)
```

or by passing a `hooks` dictionary to `AsyncWebCrawler` or your strategy constructor:

```python
hooks = {
    "before_goto": my_before_goto_hook,
    "after_goto": my_after_goto_hook,
    # ... etc.
}

async def main():
    async with AsyncWebCrawler(hooks=hooks) as crawler:
        ...
```

### Hook Signature

Each hook is a function (async or sync, depending on your usage) that receives **certain parameters**: most often `page`, `context`, or custom arguments relevant to that stage. The library then awaits or calls your hook before continuing.

---

## 3. Real-Life Examples

Below are concrete scenarios where hooks come in handy.

---

### 3.1 Authentication Before Navigation

One of the most frequent tasks is logging in or applying authentication **before** the crawler navigates to a URL, so that the site recognizes the user on the very first request.
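
Before wiring anything into a hook, it can help to build the cookie payload on its own. The sketch below assumes a hypothetical helper, `build_auth_cookies`, which is not part of Crawl4AI. It only produces the list of dictionaries that Playwright's `context.add_cookies()` accepts, where each cookie needs either a `url` or both a `domain` and a `path`. The cookie name and value are placeholders.

```python
from typing import Dict, List


def build_auth_cookies(values: Dict[str, str], domain: str, path: str = "/") -> List[dict]:
    """Hypothetical helper (not part of Crawl4AI): turn name/value pairs
    into cookie dicts in the shape Playwright's add_cookies() expects."""
    if not domain:
        # Playwright rejects cookies that have neither a url nor a domain/path pair.
        raise ValueError("domain is required when no url is given")
    return [
        {"name": name, "value": value, "domain": domain, "path": path}
        for name, value in values.items()
    ]


# Placeholder session cookie for illustration only.
cookies = build_auth_cookies({"session": "abcd1234"}, domain="example.com")
print(cookies)
```

Inside a hook that receives `context`, the resulting list could then be passed along with `await context.add_cookies(cookies)`.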

#### Using `before_goto`

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def before_goto_auth_hook(page, context, goto_params, **kwargs):
    """
    Example: Set cookies or localStorage to simulate login.
    This hook runs right before page.goto() is called.
    """
    # Example: Insert cookie-based auth or local storage data
    # (You could also do more complex actions, like fill forms if you already have a 'page' open.)
    print("[HOOK] Setting auth data before goto.")
    await context.add_cookies([
        {
            "name": "session",
            "value": "abcd1234",
            "domain": "example.com",
            "path": "/"
        }
    ])
    # Optionally manipulate goto_params if needed:
    # goto_params["url"] = goto_params["url"] + "?debug=1"

async def main():
    hooks = {
        "before_goto": before_goto_auth_hook
    }

    browser_cfg = BrowserConfig(headless=True)
    crawler_cfg = CrawlerRunConfig()

    async with AsyncWebCrawler(config=browser_cfg, hooks=hooks) as crawler:
        result = await crawler.arun(url="https://example.com/protected", config=crawler_cfg)
        if result.success:
            print("[OK] Logged in and fetched protected page.")
        else:
            print("[ERROR]", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())
```

**Key Points**

- `before_goto` receives `page`, `context`, `goto_params` so you can add cookies, localStorage, or even change the URL itself.
- If you need to run a real login flow (submitting forms) once at the start, consider `on_page_context_created`, or `on_browser_created` if you are prepared to open a page yourself, since that hook has no `page`.

---

### 3.2 Setting Up the Browser in `on_browser_created`

If you need to do advanced browser-level configuration (e.g., hooking into the Chrome DevTools Protocol, adjusting command-line flags, etc.), you’ll use `on_browser_created`. No `page` is available yet, but you can set up the **browser** instance itself.

```python
async def on_browser_created_hook(browser, **kwargs):
    """
    Runs immediately after the browser is created, before any pages.
    'browser' here is a Playwright Browser object.
    """
    print("[HOOK] Browser created. Setting up custom stuff.")
    # Possibly connect to DevTools or create an incognito context
    # Example (Chromium only, left commented out):
    # cdp_session = await browser.new_browser_cdp_session()

# Usage:
async def main():
    async with AsyncWebCrawler(hooks={"on_browser_created": on_browser_created_hook}) as crawler:
        ...
```

---

### 3.3 Adjusting Page or Context in `on_page_context_created`

If you’d like to set default timeouts or inject scripts right after a page context is spun up:

```python
async def on_page_context_created_hook(page, context, **kwargs):
    print("[HOOK] Page context created. Setting default timeouts or scripts.")
    # set_default_timeout is synchronous in Playwright, so it is not awaited
    page.set_default_timeout(20000)  # 20 seconds
    # Possibly inject a script or set user locale

# Usage:
hooks = {
    "on_page_context_created": on_page_context_created_hook
}
```

---

### 3.4 Dynamically Updating User Agents

`on_user_agent_updated` is fired whenever the strategy updates the user agent. For instance, you might want to set certain cookies or log the change for debugging:

```python
async def on_user_agent_updated_hook(page, context, new_ua, **kwargs):
    print(f"[HOOK] User agent updated to {new_ua}")
    # Maybe add a custom header based on new UA
    await context.set_extra_http_headers({"X-UA-Source": new_ua})

hooks = {
    "on_user_agent_updated": on_user_agent_updated_hook
}
```

---

### 3.5 Initializing Stuff with `on_execution_started`

`on_execution_started` runs before your main crawling logic. It’s a good place for short, one-time setup tasks (like clearing old caches, or storing a timestamp).

```python
async def on_execution_started_hook(page, context, **kwargs):
    print("[HOOK] Execution started. Setting a start timestamp or logging.")
    context.set_default_navigation_timeout(45000)  # 45s if your site is slow

hooks = {
    "on_execution_started": on_execution_started_hook
}
```

---

### 3.6 Post-Processing with `after_goto`

After the crawler finishes navigating (i.e., the page has presumably loaded), you can do additional checks or manipulations, such as verifying you’re on the right page or removing interstitials:

```python
async def after_goto_hook(page, context, response, **kwargs):
    """
    Called right after page.goto() finishes, but before the crawler extracts HTML.
    """
    if response and response.ok:
        print("[HOOK] After goto. Status:", response.status)
        # Maybe remove popups or check if we landed on a login failure page.
        await page.evaluate("""() => {
            const popup = document.querySelector(".annoying-popup");
            if (popup) popup.remove();
        }""")
    else:
        print("[HOOK] Navigation might have failed, status not ok or no response.")

hooks = {
    "after_goto": after_goto_hook
}
```

---

### 3.7 Last-Minute Modifications in `before_retrieve_html` or `before_return_html`

Sometimes you need to tweak the page or raw HTML right before it’s captured.

```python
async def before_retrieve_html_hook(page, context, **kwargs):
    """
    Modify the DOM just before the crawler finalizes the HTML.
    """
    print("[HOOK] Removing adverts before capturing HTML.")
    await page.evaluate("""() => {
        const ads = document.querySelectorAll(".ad-banner");
        ads.forEach(ad => ad.remove());
    }""")

async def before_return_html_hook(page, context, html, **kwargs):
    """
    'html' is the near-finished HTML string. Return an updated string if you like.
    """
    # For example, remove personal data or certain tags from the final text
    print("[HOOK] Sanitizing final HTML.")
    sanitized_html = html.replace("PersonalInfo:", "[REDACTED]")
    return sanitized_html

hooks = {
    "before_retrieve_html": before_retrieve_html_hook,
    "before_return_html": before_return_html_hook
}
```

**Note**: If you want to make last-second changes in `before_return_html`, you can manipulate the `html` string directly. Return a new string if you want to override the original.

---

## 4. Putting It All Together

You can combine multiple hooks in a single run. For instance:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def on_browser_created_hook(browser, **kwargs):
    print("[HOOK] Browser is up, no page yet. Good for broad config.")

async def before_goto_auth_hook(page, context, goto_params, **kwargs):
    print("[HOOK] Adding cookies for auth.")
    # Playwright needs either a url or a domain/path pair on each cookie
    await context.add_cookies([
        {"name": "session", "value": "abcd1234", "domain": "example.com", "path": "/"}
    ])

async def after_goto_log_hook(page, context, response, **kwargs):
    if response:
        print("[HOOK] after_goto: Status code:", response.status)

async def main():
    hooks = {
        "on_browser_created": on_browser_created_hook,
        "before_goto": before_goto_auth_hook,
        "after_goto": after_goto_log_hook
    }

    browser_cfg = BrowserConfig(headless=True)
    crawler_cfg = CrawlerRunConfig(verbose=True)

    async with AsyncWebCrawler(config=browser_cfg, hooks=hooks) as crawler:
        result = await crawler.arun("https://example.com/protected", config=crawler_cfg)
        if result.success:
            print("[OK] Protected page length:", len(result.html))
        else:
            print("[ERROR]", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())
```

This example:

1. **`on_browser_created`** sets up the brand-new browser instance.
2. **`before_goto`** ensures you inject an auth cookie before accessing the page.
3. **`after_goto`** logs the resulting HTTP status code.

---

## 5. Common Pitfalls & Best Practices

1. **Hook Order**: If multiple hooks do overlapping tasks (e.g., two `before_goto` hooks), be mindful of conflicts or repeated logic.
2. **Async vs Sync**: Some hooks might be used in a synchronous or asynchronous style. Confirm your function signature. If the crawler expects `async`, define `async def`.
3. **Mutating goto_params**: `goto_params` is a dict that eventually goes to Playwright’s `page.goto()`. Changing the `url` or adding extra fields can be powerful but can also lead to confusion. Document your changes carefully.
4. **Browser vs Page vs Context**: Not all hooks have both `page` and `context`. For example, `on_browser_created` only has access to **`browser`**.
5. **Avoid Overdoing It**: Hooks are powerful but can lead to complexity. If you find yourself writing a large amount of code inside a hook, consider moving it into a separate helper function or finding a simpler approach.

---

## Conclusion & Next Steps

**Hooks** let you bend Crawl4AI to your will:

- **Authentication** (cookies, localStorage) with `before_goto`
- **Browser-level config** with `on_browser_created`
- **Page or context config** with `on_page_context_created`
- **Content modifications** before capturing HTML (`before_retrieve_html` or `before_return_html`)

**Where to go next**:

- **[Identity-Based Crawling & Anti-Bot](./identity-anti-bot.md)**: Combine hooks with advanced user simulation to avoid bot detection.
- **[Reference → AsyncPlaywrightCrawlerStrategy](../../reference/browser-strategies.md)**: Learn more about how hooks are implemented under the hood.
- **[How-To Guides](../../how-to/)**: Check short, specific recipes for tasks like scraping multiple pages with repeated “Load More” clicks.

With the hook system, you have near-complete control over the browser’s lifecycle, whether it’s setting up environment variables, customizing user agents, or manipulating the HTML. Enjoy the freedom to create sophisticated, fully customized crawling pipelines!

**Last Updated**: 2024-XX-XX