Smart Crawling Techniques
In the previous tutorial (AsyncWebCrawler Basics), you learned how to create an AsyncWebCrawler instance, run a basic crawl, and inspect the CrawlResult. Now it's time to explore some of the targeted crawling features that let you:
- Select specific parts of a webpage using CSS selectors
- Exclude or ignore certain page elements
- Wait for dynamic content to load using wait_for (with css: or js: rules)
- (Optionally) Handle iframes if your target site embeds additional content
Prerequisites
- You’ve read or completed AsyncWebCrawler Basics.
- You have a working environment for Crawl4AI (Playwright installed, etc.).
1. Targeting Specific Elements with CSS Selectors
1.1 Simple CSS Selector Usage
Let's say you only need to crawl the main article content of a news page. By setting css_selector in CrawlerRunConfig, your final HTML or Markdown output focuses on that region. For example:
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def main():
    browser_cfg = BrowserConfig(headless=True)
    crawler_cfg = CrawlerRunConfig(
        css_selector=".article-body",      # Only capture .article-body content
        excluded_tags=["nav", "footer"]    # Optional: skip big nav & footer sections
    )

    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        result = await crawler.arun(
            url="https://news.example.com/story/12345",
            config=crawler_cfg
        )
        if result.success:
            print("[OK] Extracted content length:", len(result.html))
        else:
            print("[ERROR]", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())
Key Parameters:
- css_selector: Tells the crawler to focus on .article-body.
- excluded_tags: Tells the crawler to skip specific HTML tags altogether (e.g., nav or footer).
Tip: For extremely noisy pages, you can further refine how you exclude certain elements by using excluded_selector, which takes a CSS selector you want removed from the final output.
1.2 Excluding Content with excluded_selector
If you want to remove certain sections within .article-body (like "related stories" sidebars), set:
CrawlerRunConfig(
    css_selector=".article-body",
    excluded_selector=".related-stories, .ads-banner"
)
This combination grabs the main article content while filtering out sidebars or ads.
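If you want to sanity-check the exclusion, one quick approach is to crawl the page and confirm that text from the removed sections no longer shows up in the output. Below is a minimal sketch; the URL, class names, and the "Related Stories" phrase are placeholders for illustration, not part of any real site:

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    config = CrawlerRunConfig(
        css_selector=".article-body",
        excluded_selector=".related-stories, .ads-banner"
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://news.example.com/story/12345",  # placeholder URL
            config=config
        )
        if result.success:
            md = str(result.markdown)  # markdown generated from the filtered content
            # Rough sanity check: the sidebar heading text should be gone
            print("Sidebar removed?", "Related Stories" not in md)
            print("[OK] Filtered content length:", len(md))
        else:
            print("[ERROR]", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())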
2. Handling Iframes
Some sites embed extra content via <iframe> elements, for example embedded videos or external forms. If you want the crawler to traverse these iframes and merge their content into the final HTML or Markdown, set:
crawler_cfg = CrawlerRunConfig(
    process_iframes=True
)
- process_iframes=True: Tells the crawler (specifically the underlying Playwright strategy) to recursively fetch iframe content and integrate it into result.html and result.markdown.
Warning: Not all sites allow iframes to be crawled (some cross-origin policies might block it). If you see partial or missing data, check the domain policy or logs for warnings.
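A practical way to see whether iframe content actually made it into your result is to crawl the same page with and without process_iframes and compare the output sizes. This is just a rough sketch with a placeholder URL; depending on your version, you may also need to disable result caching so the second request is fetched fresh:

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    url = "https://example.com/iframe-heavy"  # placeholder URL

    async with AsyncWebCrawler() as crawler:
        plain = await crawler.arun(url=url, config=CrawlerRunConfig(process_iframes=False))
        merged = await crawler.arun(url=url, config=CrawlerRunConfig(process_iframes=True))

    if plain.success and merged.success:
        # If the iframes were readable, the merged result is usually noticeably larger
        print("Without iframes:", len(plain.html))
        print("With iframes:   ", len(merged.html))
    else:
        print("[ERROR]", plain.error_message or merged.error_message)

if __name__ == "__main__":
    asyncio.run(main())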
3. Waiting for Dynamic Content
Many modern sites load content dynamically (e.g., after user interaction or asynchronously). Crawl4AI helps you wait for specific conditions before capturing the final HTML. Let's look at wait_for.
3.1 wait_for Basics
In CrawlerRunConfig, wait_for can be a simple CSS selector or a JavaScript condition. Under the hood, Crawl4AI uses smart_wait to interpret what you provide.
crawler_cfg = CrawlerRunConfig(
    wait_for="css:.main-article-loaded",
    page_timeout=30000
)
Example: css:.main-article-loaded means "Wait for an element with the class .main-article-loaded to appear in the DOM." If it doesn't appear within 30 seconds, you'll get a timeout.
3.2 Using Explicit Prefixes
The js: and css: prefixes can explicitly tell the crawler which approach to use:
- wait_for="css:.comments-section" → Wait for .comments-section to appear
- wait_for="js:() => document.querySelectorAll('.comments').length > 5" → Wait until there are at least 6 comment elements
Code Example:
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    config = CrawlerRunConfig(
        wait_for="js:() => document.querySelectorAll('.dynamic-items li').length >= 10",
        page_timeout=20000  # 20s
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com/async-list",
            config=config
        )
        if result.success:
            print("[OK] Dynamic items loaded. HTML length:", len(result.html))
        else:
            print("[ERROR]", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())
3.3 Fallback Logic
If you don't prefix js: or css:, Crawl4AI tries to detect whether your string looks like a CSS selector or a JavaScript snippet. It'll first attempt a CSS selector. If that fails, it tries to evaluate it as a JavaScript function. This can be convenient but can also lead to confusion if the library guesses incorrectly. It's often best to be explicit:
- "css:.my-selector" → Force CSS
- "js:() => myAppState.isReady()" → Force JavaScript
What Should My JavaScript Return?
- A function that returns true once the condition is met (or false if it fails).
- The function can be sync or async, but note that the crawler wraps it in an async loop to poll until true or timeout.
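For example, if the target page exposes some readiness flag you can poll, you can combine it with an element count in a single js: condition. The snippet below is a sketch under assumptions: window.appState.commentsLoaded, the .comment selector, and the URL are all hypothetical, and the async form relies on the note above that async conditions are supported:

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

# Hypothetical condition: the page sets window.appState.commentsLoaded = true
# once its comment widget has finished rendering.
wait_condition = """js:async () => {
    const enoughComments = document.querySelectorAll('.comment').length >= 3;
    const appReady = window.appState && window.appState.commentsLoaded === true;
    return enoughComments && appReady;
}"""

async def main():
    config = CrawlerRunConfig(wait_for=wait_condition, page_timeout=25000)
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com/comments", config=config)
        if result.success:
            print("[OK] Comments loaded. HTML length:", len(result.html))
        else:
            print("[ERROR]", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())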
4. Example: Targeted Crawl with Iframes & Wait-For
Below is a more advanced snippet combining these features:
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def main():
    browser_cfg = BrowserConfig(headless=True)
    crawler_cfg = CrawlerRunConfig(
        css_selector=".main-content",
        process_iframes=True,
        wait_for="css:.loaded-indicator",   # Wait for .loaded-indicator to appear
        excluded_tags=["script", "style"],  # Remove script/style tags
        page_timeout=30000,
        verbose=True
    )

    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        result = await crawler.arun(
            url="https://example.com/iframe-heavy",
            config=crawler_cfg
        )
        if result.success:
            print("[OK] Crawled with iframes. Length of final HTML:", len(result.html))
        else:
            print("[ERROR]", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())
What’s Happening:
css_selector=".main-content"
→ Focus only on.main-content
for final extraction.process_iframes=True
→ Recursively handle<iframe>
content.wait_for="css:.loaded-indicator"
→ Don’t extract until the page shows.loaded-indicator
.excluded_tags=["script", "style"]
→ Remove script and style tags for a cleaner result.
5. Common Pitfalls & Tips
- Be Explicit: Using "js:" or "css:" prefixes can spare you headaches if the library guesses incorrectly.
- Timeouts: If the site never triggers your wait condition, a TimeoutError can occur. Check your logs or use verbose=True for more clues.
- Infinite Scroll: If you have repeated "load more" loops, you might use Hooks & Custom Code or add your own JavaScript for repeated scrolling (see the sketch after this list).
- Iframes: Some iframes are cross-origin or protected. In those cases, you might not be able to read their content. Check your logs for permission errors.
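For the infinite-scroll case, one starting point before reaching for hooks is to inject your own scrolling JavaScript and then wait for the item count you expect. The sketch below assumes your Crawl4AI version supports a js_code option on CrawlerRunConfig for running scripts after page load; the URL, the .feed-item selector, and the loop/timing values are placeholders to tune for your site:

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

# Scroll to the bottom a few times, pausing so the page can fetch more items.
scroll_script = """
(async () => {
    for (let i = 0; i < 5; i++) {
        window.scrollTo(0, document.body.scrollHeight);
        await new Promise(r => setTimeout(r, 1000));
    }
})();
"""

async def main():
    config = CrawlerRunConfig(
        js_code=scroll_script,  # assumption: js_code is available in your version
        wait_for="js:() => document.querySelectorAll('.feed-item').length >= 50",
        page_timeout=60000
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com/infinite-feed", config=config)
        if result.success:
            print("[OK] Items captured. HTML length:", len(result.html))
        else:
            print("[ERROR]", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())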
6. Summary & Next Steps
With these smart crawling techniques you can:
- Precisely target or exclude content using CSS selectors.
- Automatically wait for dynamic elements to load using wait_for.
- Merge iframe content into your main page result.
Where to Go Next?
- Link & Media Analysis: Dive deeper into analyzing extracted links and media items.
- Hooks & Custom Code: Learn how to implement repeated actions like infinite scroll or login sequences using hooks.
- Reference: For an exhaustive list of parameters and advanced usage, see CrawlerRunConfig Reference.
If you run into issues or want to see real examples from other users, check the How-To Guides or raise a question on GitHub.
Last updated: 2024-XX-XX
That's it for Smart Crawling Techniques! You're now equipped to handle complex pages that rely on dynamic loading, custom CSS selectors, and iframe embedding.