# Smart Crawling Techniques

In the previous tutorial ([AsyncWebCrawler Basics](./async-webcrawler-basics.md)), you learned how to create an `AsyncWebCrawler` instance, run a basic crawl, and inspect the `CrawlResult`. Now it's time to explore some of the **targeted crawling** features that let you:

1. Select specific parts of a webpage using CSS selectors
2. Exclude or ignore certain page elements
3. Wait for dynamic content to load using `wait_for` (with `css:` or `js:` rules)
4. (Optionally) Handle iframes if your target site embeds additional content

> **Prerequisites**
> - You've read or completed [AsyncWebCrawler Basics](./async-webcrawler-basics.md).
> - You have a working environment for Crawl4AI (Playwright installed, etc.).

---
## 1. Targeting Specific Elements with CSS Selectors

### 1.1 Simple CSS Selector Usage

Let's say you only need to crawl the main article content of a news page. By setting `css_selector` in `CrawlerRunConfig`, your final HTML or Markdown output focuses on that region. For example:
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def main():
    browser_cfg = BrowserConfig(headless=True)
    crawler_cfg = CrawlerRunConfig(
        css_selector=".article-body",      # Only capture .article-body content
        excluded_tags=["nav", "footer"]    # Optional: skip big nav & footer sections
    )

    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        result = await crawler.arun(
            url="https://news.example.com/story/12345",
            config=crawler_cfg
        )
        if result.success:
            print("[OK] Extracted content length:", len(result.html))
        else:
            print("[ERROR]", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())
```
**Key Parameters**:

- **`css_selector`**: Tells the crawler to focus on `.article-body`.
- **`excluded_tags`**: Tells the crawler to skip specific HTML tags altogether (e.g., `nav` or `footer`).

**Tip**: For extremely noisy pages, you can further refine the output with `excluded_selector`, which takes a CSS selector matching elements you want removed from the final output.
### 1.2 Excluding Content with `excluded_selector`

If you want to remove certain sections within `.article-body` (like "related stories" sidebars), set:

```python
CrawlerRunConfig(
    css_selector=".article-body",
    excluded_selector=".related-stories, .ads-banner"
)
```
This combination grabs the main article content while filtering out sidebars or ads.
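Here's the same configuration end to end, as a minimal runnable sketch; the URL and selectors are placeholders for whatever site you're targeting:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    # Hypothetical selectors: keep the article body, drop sidebars and ad banners.
    cfg = CrawlerRunConfig(
        css_selector=".article-body",
        excluded_selector=".related-stories, .ads-banner"
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://news.example.com/story/12345",  # placeholder URL
            config=cfg
        )
        if result.success:
            print("Filtered HTML length:", len(result.html))
        else:
            print("[ERROR]", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())
```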
---

## 2. Handling Iframes

Some sites embed extra content via `<iframe>` elements, such as embedded videos or external forms. If you want the crawler to traverse these iframes and merge their content into the final HTML or Markdown, set:

```python
crawler_cfg = CrawlerRunConfig(
    process_iframes=True
)
```

- **`process_iframes=True`**: Tells the crawler (specifically the underlying Playwright strategy) to recursively fetch iframe content and integrate it into `result.html` and `result.markdown`.

**Warning**: Not all sites allow iframes to be crawled (some cross-origin policies might block it). If you see partial or missing data, check the domain policy or logs for warnings.
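If you want to try this in isolation before combining it with other options, here's a minimal sketch; the URL is a placeholder for any page with embedded frames:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    cfg = CrawlerRunConfig(process_iframes=True, verbose=True)
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com/embeds",  # placeholder URL
            config=cfg
        )
        if result.success:
            # Iframe bodies are merged into the page output before extraction,
            # so the HTML should be longer than an iframe-free crawl of the same page.
            print("HTML length with iframe content:", len(result.html))
        else:
            print("[ERROR]", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())
```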
---

## 3. Waiting for Dynamic Content

Many modern sites load content dynamically (e.g., after user interaction or asynchronously). Crawl4AI helps you wait for specific conditions before capturing the final HTML. Let's look at `wait_for`.

### 3.1 `wait_for` Basics

In `CrawlerRunConfig`, `wait_for` can be a simple CSS selector or a JavaScript condition. Under the hood, Crawl4AI uses `smart_wait` to interpret what you provide.

```python
crawler_cfg = CrawlerRunConfig(
    wait_for="css:.main-article-loaded",
    page_timeout=30000
)
```

**Example**: `css:.main-article-loaded` means "Wait for an element with the class `.main-article-loaded` to appear in the DOM." If it doesn't appear within 30 seconds (`page_timeout` is in milliseconds), you'll get a timeout.
### 3.2 Using Explicit Prefixes

The **`js:`** and **`css:`** prefixes tell the crawler explicitly which approach to use:

- **`wait_for="css:.comments-section"`** → Wait for `.comments-section` to appear
- **`wait_for="js:() => document.querySelectorAll('.comments').length > 5"`** → Wait until there are at least 6 comment elements
**Code Example**:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    config = CrawlerRunConfig(
        wait_for="js:() => document.querySelectorAll('.dynamic-items li').length >= 10",
        page_timeout=20000  # 20s
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com/async-list",
            config=config
        )
        if result.success:
            print("[OK] Dynamic items loaded. HTML length:", len(result.html))
        else:
            print("[ERROR]", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())
```
### 3.3 Fallback Logic

If you **don't** prefix `js:` or `css:`, Crawl4AI tries to detect whether your string looks like a CSS selector or a JavaScript snippet. It first attempts a CSS selector; if that fails, it evaluates the string as a JavaScript function. This can be convenient, but it can also lead to confusion if the library guesses incorrectly. It's often best to be explicit:

- **`"css:.my-selector"`** → Force CSS
- **`"js:() => myAppState.isReady()"`** → Force JavaScript

**What Should My JavaScript Return?**

- A function that returns `true` once the condition is met (or `false` if it fails).
- The function can be sync or async; either way, the crawler polls it in a loop until it returns `true` or the timeout expires.
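To make the sync/async distinction concrete, here are two configs side by side. The `document.readyState` check is standard DOM; the `/api/ready` endpoint in the async variant is purely hypothetical:

```python
from crawl4ai import CrawlerRunConfig

# Synchronous predicate: polled until it returns true or page_timeout expires.
sync_cfg = CrawlerRunConfig(
    wait_for="js:() => document.readyState === 'complete'"
)

# Asynchronous predicate: the awaited result is polled the same way.
# /api/ready is a made-up endpoint standing in for your app's readiness check.
async_cfg = CrawlerRunConfig(
    wait_for="js:async () => (await fetch('/api/ready')).ok"
)
```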
---

## 4. Example: Targeted Crawl with Iframes & Wait-For

Below is a more advanced snippet combining these features:
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def main():
    browser_cfg = BrowserConfig(headless=True)
    crawler_cfg = CrawlerRunConfig(
        css_selector=".main-content",
        process_iframes=True,
        wait_for="css:.loaded-indicator",   # Wait for .loaded-indicator to appear
        excluded_tags=["script", "style"],  # Remove script/style tags
        page_timeout=30000,
        verbose=True
    )

    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        result = await crawler.arun(
            url="https://example.com/iframe-heavy",
            config=crawler_cfg
        )
        if result.success:
            print("[OK] Crawled with iframes. Length of final HTML:", len(result.html))
        else:
            print("[ERROR]", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())
```
**What's Happening**:

1. **`css_selector=".main-content"`** → Focus only on `.main-content` for final extraction.
2. **`process_iframes=True`** → Recursively handle `<iframe>` content.
3. **`wait_for="css:.loaded-indicator"`** → Don't extract until the page shows `.loaded-indicator`.
4. **`excluded_tags=["script", "style"]`** → Remove script and style tags for a cleaner result.
---

## 5. Common Pitfalls & Tips

1. **Be Explicit**: Using `"js:"` or `"css:"` prefixes can spare you headaches if the library would otherwise guess incorrectly.
2. **Timeouts**: If the site never triggers your wait condition, a `TimeoutError` can occur. Check your logs or use `verbose=True` for more clues.
3. **Infinite Scroll**: If you have repeated "load more" loops, you might use [Hooks & Custom Code](./hooks-custom.md) or add your own JavaScript for repeated scrolling (see the sketch after this list).
4. **Iframes**: Some iframes are cross-origin or protected. In those cases, you might not be able to read their content. Check your logs for permission errors.
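For simple "load more" pages, one rough pattern is to pair `js_code` scrolling with a `wait_for` condition. This is a sketch, not a complete infinite-scroll solution; the `.feed-item` selector and the target count of 30 are hypothetical:

```python
from crawl4ai import CrawlerRunConfig

# Scroll to the bottom a couple of times to trigger lazy loading,
# then wait until enough items are present (or page_timeout hits).
scroll_cfg = CrawlerRunConfig(
    js_code=[
        "window.scrollTo(0, document.body.scrollHeight);",
        "window.scrollTo(0, document.body.scrollHeight);",
    ],
    wait_for="js:() => document.querySelectorAll('.feed-item').length >= 30",
    page_timeout=45000,  # 45s: give slow feeds extra room
)
```

For a genuinely unbounded scroll, a hook that keeps scrolling until the item count stops growing is a better fit; see the Hooks tutorial linked above.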
---

## 6. Summary & Next Steps

With these **smart crawling techniques** you can:

- Precisely target or exclude content using CSS selectors.
- Automatically wait for dynamic elements to load using `wait_for`.
- Merge iframe content into your main page result.

### Where to Go Next?

- **[Link & Media Analysis](./link-media-analysis.md)**: Dive deeper into analyzing extracted links and media items.
- **[Hooks & Custom Code](./hooks-custom.md)**: Learn how to implement repeated actions like infinite scroll or login sequences using hooks.
- **Reference**: For an exhaustive list of parameters and advanced usage, see [CrawlerRunConfig Reference](../../reference/configuration.md).

If you run into issues or want to see real examples from other users, check the [How-To Guides](../../how-to/) or raise a question on GitHub.

**Last updated**: 2024-XX-XX

---

That's it for **Smart Crawling Techniques**! You're now equipped to handle complex pages that rely on dynamic loading, custom CSS selectors, and iframe embedding.