Below is a **draft** of a follow-up tutorial, **“Smart Crawling Techniques,”** building on the **“AsyncWebCrawler Basics”** tutorial. This tutorial focuses on three main points:
1. **Advanced usage of CSS selectors** (e.g., partial extraction, exclusions)
2. **Handling iframes** (if relevant for your workflow)
3. **Waiting for dynamic content** using `wait_for`, including the new `css:` and `js:` prefixes
Feel free to adjust code snippets, wording, or emphasis to match your library updates or user feedback.
---
# Smart Crawling Techniques
In the previous tutorial ([AsyncWebCrawler Basics](./async-webcrawler-basics.md)), you learned how to create an `AsyncWebCrawler` instance, run a basic crawl, and inspect the `CrawlResult`. Now it’s time to explore some of the **targeted crawling** features that let you:
1. Select specific parts of a webpage using CSS selectors
2. Exclude or ignore certain page elements
3. Wait for dynamic content to load using `wait_for` (with `css:` or `js:` rules)
4. (Optionally) Handle iframes if your target site embeds additional content
> **Prerequisites**
> - You’ve read or completed [AsyncWebCrawler Basics](./async-webcrawler-basics.md).
> - You have a working environment for Crawl4AI (Playwright installed, etc.).
---
## 1. Targeting Specific Elements with CSS Selectors
### 1.1 Simple CSS Selector Usage
Let’s say you only need to crawl the main article content of a news page. By setting `css_selector` in `CrawlerRunConfig`, your final HTML or Markdown output focuses on that region. For example:
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def main():
    browser_cfg = BrowserConfig(headless=True)
    crawler_cfg = CrawlerRunConfig(
        css_selector=".article-body",     # Only capture .article-body content
        excluded_tags=["nav", "footer"]   # Optional: skip big nav & footer sections
    )

    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        result = await crawler.arun(
            url="https://news.example.com/story/12345",
            config=crawler_cfg
        )
        if result.success:
            print("[OK] Extracted content length:", len(result.html))
        else:
            print("[ERROR]", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())
```
**Key Parameters**:
- **`css_selector`**: Tells the crawler to focus on `.article-body`.
- **`excluded_tags`**: Tells the crawler to skip specific HTML tags altogether (e.g., `nav` or `footer`).
**Tip**: For extremely noisy pages, you can refine exclusions further with `excluded_selector`, which takes a CSS selector (or a comma-separated list of selectors) matching elements to remove from the final output.
### 1.2 Excluding Content with `excluded_selector`
If you want to remove certain sections within `.article-body` (like “related stories” sidebars), set:
```python
CrawlerRunConfig(
    css_selector=".article-body",
    excluded_selector=".related-stories, .ads-banner"
)
```
This combination grabs the main article content while filtering out sidebars or ads.
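Here is a minimal runnable sketch putting the two together. The URL and selectors reuse the illustrative values from above; swap in the ones that match your target site:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    # Focus on the article body, drop sidebars and ad banners
    config = CrawlerRunConfig(
        css_selector=".article-body",
        excluded_selector=".related-stories, .ads-banner"
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://news.example.com/story/12345",  # illustrative URL
            config=config
        )
        if result.success:
            # The excluded elements never reach the final HTML/Markdown
            print("[OK] Clean article length:", len(result.html))
        else:
            print("[ERROR]", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())
```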
---
## 2. Handling Iframes
Some sites embed extra content via `<iframe>` elements—for example, embedded videos or external forms. If you want the crawler to traverse these iframes and merge their content into the final HTML or Markdown, set:
```python
crawler_cfg = CrawlerRunConfig(
    process_iframes=True
)
```
- **`process_iframes=True`**: Tells the crawler (specifically the underlying Playwright strategy) to recursively fetch iframe content and integrate it into `result.html` and `result.markdown`.
**Warning**: Not all sites allow iframes to be crawled (some cross-origin policies might block it). If you see partial or missing data, check the domain policy or logs for warnings.
---
## 3. Waiting for Dynamic Content
Many modern sites load content dynamically (e.g., after user interaction or asynchronously). Crawl4AI helps you wait for specific conditions before capturing the final HTML. Let’s look at `wait_for`.
### 3.1 `wait_for` Basics
In `CrawlerRunConfig`, `wait_for` can be a simple CSS selector or a JavaScript condition. Under the hood, Crawl4AI uses `smart_wait` to interpret what you provide.
```python
crawler_cfg = CrawlerRunConfig(
    wait_for="css:.main-article-loaded",
    page_timeout=30000
)
```
**Example**: `css:.main-article-loaded` means “Wait for an element with the class `main-article-loaded` to appear in the DOM.” If it doesn’t appear within 30 seconds (the `page_timeout` of 30,000 ms above), you’ll get a timeout.
### 3.2 Using Explicit Prefixes
The **`js:`** and **`css:`** prefixes explicitly tell the crawler which approach to use:
- **`wait_for="css:.comments-section"`** → Wait for `.comments-section` to appear
- **`wait_for="js:() => document.querySelectorAll('.comments').length > 5"`** → Wait until there are at least 6 comment elements
**Code Example**:
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    config = CrawlerRunConfig(
        wait_for="js:() => document.querySelectorAll('.dynamic-items li').length >= 10",
        page_timeout=20000  # 20s
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com/async-list",
            config=config
        )
        if result.success:
            print("[OK] Dynamic items loaded. HTML length:", len(result.html))
        else:
            print("[ERROR]", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())
```
### 3.3 Fallback Logic
If you **don’t** prefix `js:` or `css:`, Crawl4AI tries to detect whether your string looks like a CSS selector or a JavaScript snippet. It’ll first attempt a CSS selector. If that fails, it tries to evaluate it as a JavaScript function. This can be convenient but can also lead to confusion if the library guesses incorrectly. It’s often best to be explicit:
- **`"css:.my-selector"`** → Force CSS
- **`"js:() => myAppState.isReady()"`** → Force JavaScript
**What Should My JavaScript Return?**
- A function that returns `true` once the condition is met (and `false` while it isn’t).
- The function can be sync or async; either way, the crawler polls it until it returns `true` or the page timeout is reached.
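As a concrete illustration, here are both flavors as small sketches. The `.loading-spinner` selector is illustrative, and `window.myAppState` (echoing the `myAppState.isReady()` example above) is a hypothetical global your app would need to expose:

```python
from crawl4ai import CrawlerRunConfig

# Sync predicate: true once the loading spinner disappears
# (".loading-spinner" is an illustrative selector)
spinner_cfg = CrawlerRunConfig(
    wait_for="js:() => !document.querySelector('.loading-spinner')"
)

# Async predicate: polled until it resolves to true or page_timeout expires.
# `window.myAppState` is a hypothetical global exposed by your page.
app_state_cfg = CrawlerRunConfig(
    wait_for="js:async () => window.myAppState && window.myAppState.isReady()"
)
```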
---
## 4. Example: Targeted Crawl with Iframes & Wait-For
Below is a more advanced snippet combining these features:
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def main():
    browser_cfg = BrowserConfig(headless=True)
    crawler_cfg = CrawlerRunConfig(
        css_selector=".main-content",
        process_iframes=True,
        wait_for="css:.loaded-indicator",   # Wait for .loaded-indicator to appear
        excluded_tags=["script", "style"],  # Remove script/style tags
        page_timeout=30000,
        verbose=True
    )

    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        result = await crawler.arun(
            url="https://example.com/iframe-heavy",
            config=crawler_cfg
        )
        if result.success:
            print("[OK] Crawled with iframes. Length of final HTML:", len(result.html))
        else:
            print("[ERROR]", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())
```
**What’s Happening**:
1. **`css_selector=".main-content"`** → Focus only on `.main-content` for final extraction.
2. **`process_iframes=True`** → Recursively handle `<iframe>` content.
3. **`wait_for="css:.loaded-indicator"`** → Don’t extract until the page shows `.loaded-indicator`.
4. **`excluded_tags=["script", "style"]`** → Remove script and style tags for a cleaner result.
---
## 5. Common Pitfalls & Tips
1. **Be Explicit**: Using `"js:"` or `"css:"` can spare you headaches if the library guesses incorrectly.
2. **Timeouts**: If the site never triggers your wait condition, a `TimeoutError` can occur. Check your logs or use `verbose=True` for more clues.
3. **Infinite Scroll**: If you have repeated “load more” loops, you might use [Hooks & Custom Code](./hooks-custom.md) or add your own JavaScript for repeated scrolling (see the sketch after this list).
4. **Iframes**: Some iframes are cross-origin or protected. In those cases, you might not be able to read their content. Check your logs for permission errors.
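For the infinite-scroll case (point 3), here is a rough sketch using `CrawlerRunConfig`’s `js_code` parameter to scroll before extraction. The scroll count, delay, `.feed-item` selector, item target, and URL are all placeholders to tune for your site:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

# Illustrative: scroll to the bottom a few times, pausing for content to load
SCROLL_SCRIPT = """
(async () => {
    for (let i = 0; i < 5; i++) {                     // 5 passes is a placeholder
        window.scrollTo(0, document.body.scrollHeight);
        await new Promise(r => setTimeout(r, 1000));  // 1s pause per pass
    }
})();
"""

async def main():
    config = CrawlerRunConfig(
        js_code=SCROLL_SCRIPT,
        # Illustrative target: wait until enough items have accumulated
        wait_for="js:() => document.querySelectorAll('.feed-item').length >= 30"
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com/infinite-feed",  # placeholder URL
            config=config
        )
        if result.success:
            print("[OK] Items captured. HTML length:", len(result.html))
        else:
            print("[ERROR]", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())
```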
---
## 6. Summary & Next Steps
With these **smart crawling techniques**, you can:
- Precisely target or exclude content using CSS selectors.
- Automatically wait for dynamic elements to load using `wait_for`.
- Merge iframe content into your main page result.
### Where to Go Next?
- **[Link & Media Analysis](./link-media-analysis.md)**: Dive deeper into analyzing extracted links and media items.
- **[Hooks & Custom Code](./hooks-custom.md)**: Learn how to implement repeated actions like infinite scroll or login sequences using hooks.
- **Reference**: For an exhaustive list of parameters and advanced usage, see [CrawlerRunConfig Reference](../../reference/configuration.md).
If you run into issues or want to see real examples from other users, check the [How-To Guides](../../how-to/) or raise a question on GitHub.
**Last updated**: 2024-XX-XX
---
That’s it for **Smart Crawling Techniques**! You’re now equipped to handle complex pages that rely on dynamic loading, custom CSS selectors, and iframe embedding.