Below is a **draft** of a follow-up tutorial, **“Smart Crawling Techniques,”** building on the **“AsyncWebCrawler Basics”** tutorial. This tutorial focuses on three main points:
1. **Advanced usage of CSS selectors** (e.g., partial extraction, exclusions)
2. **Handling iframes** (if relevant for your workflow)
3. **Waiting for dynamic content** using `wait_for`, including the new `css:` and `js:` prefixes
Feel free to adjust code snippets, wording, or emphasis to match your library updates or user feedback.
---
# Smart Crawling Techniques
In the previous tutorial ([AsyncWebCrawler Basics](./async-webcrawler-basics.md)), you learned how to create an `AsyncWebCrawler` instance, run a basic crawl, and inspect the `CrawlResult`. Now it’s time to explore some of the **targeted crawling** features that let you:
1. Select specific parts of a webpage using CSS selectors
2. Exclude or ignore certain page elements
3. Wait for dynamic content to load using `wait_for` (with `css:` or `js:` rules)
4. (Optionally) Handle iframes if your target site embeds additional content
> **Prerequisites**
> - You’ve read or completed [AsyncWebCrawler Basics](./async-webcrawler-basics.md).
> - You have a working environment for Crawl4AI (Playwright installed, etc.).
---
## 1. Targeting Specific Elements with CSS Selectors
### 1.1 Simple CSS Selector Usage
Let’s say you only need to crawl the main article content of a news page. By setting `css_selector` in `CrawlerRunConfig`, your final HTML or Markdown output focuses on that region. For example:
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def main():
    browser_cfg = BrowserConfig(headless=True)
    crawler_cfg = CrawlerRunConfig(
        css_selector=".article-body",     # Only capture .article-body content
        excluded_tags=["nav", "footer"]   # Optional: skip big nav & footer sections
    )

    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        result = await crawler.arun(
            url="https://news.example.com/story/12345",
            config=crawler_cfg
        )
        if result.success:
            print("[OK] Extracted content length:", len(result.html))
        else:
            print("[ERROR]", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())
```
**Key Parameters**:
- **`css_selector`**: Tells the crawler to focus on `.article-body`.
- **`excluded_tags`**: Tells the crawler to skip specific HTML tags altogether (e.g., `nav` or `footer`).
**Tip**: For extremely noisy pages, you can refine exclusions further with `excluded_selector`, which takes a CSS selector (or a comma-separated list of selectors) matching elements to remove from the final output.
### 1.2 Excluding Content with `excluded_selector`
If you want to remove certain sections within `.article-body` (like “related stories” sidebars), set:
```python
CrawlerRunConfig(
    css_selector=".article-body",
    excluded_selector=".related-stories, .ads-banner"
)
```
This combination grabs the main article content while filtering out sidebars or ads.
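Here is a minimal runnable sketch putting the two together. The URL and selectors reuse the illustrative values from above; swap in the ones that match your target site:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    # Focus on the article body, drop sidebars and ad banners
    config = CrawlerRunConfig(
        css_selector=".article-body",
        excluded_selector=".related-stories, .ads-banner"
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://news.example.com/story/12345",  # illustrative URL
            config=config
        )
        if result.success:
            # The excluded elements never reach the final HTML/Markdown
            print("[OK] Clean article length:", len(result.html))
        else:
            print("[ERROR]", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())
```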
---
## 2. Handling Iframes
Some sites embed extra content via `<iframe>` elements—for example, embedded videos or external forms. If you want the crawler to traverse these iframes and merge their content into the final HTML or Markdown, set:
```python
crawler_cfg = CrawlerRunConfig(
    process_iframes=True
)
```
- **`process_iframes=True`**: Tells the crawler (specifically the underlying Playwright strategy) to recursively fetch iframe content and integrate it into `result.html` and `result.markdown`.
**Warning**: Not all sites allow iframes to be crawled (some cross-origin policies might block it). If you see partial or missing data, check the domain policy or logs for warnings.
---
## 3. Waiting for Dynamic Content
Many modern sites load content dynamically (e.g., after user interaction or asynchronously). Crawl4AI helps you wait for specific conditions before capturing the final HTML. Let’s look at `wait_for`.
### 3.1 `wait_for` Basics
In `CrawlerRunConfig`, `wait_for` can be a simple CSS selector or a JavaScript condition. Under the hood, Crawl4AI uses `smart_wait` to interpret what you provide.
```python
crawler_cfg = CrawlerRunConfig(
    wait_for="css:.main-article-loaded",
    page_timeout=30000
)
```
**Example**: `css:.main-article-loaded` means “Wait for an element with the class `main-article-loaded` to appear in the DOM.” If it doesn’t appear within 30 seconds (the `page_timeout` of 30,000 ms above), you’ll get a timeout.
### 3.2 Using Explicit Prefixes
The **`js:`** and **`css:`** prefixes explicitly tell the crawler which approach to use:
- **`wait_for="css:.comments-section"`** → Wait for `.comments-section` to appear
- **`wait_for="js:() => document.querySelectorAll('.comments').length > 5"`** → Wait until there are at least 6 comment elements
**Code Example**:
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    config = CrawlerRunConfig(
        wait_for="js:() => document.querySelectorAll('.dynamic-items li').length >= 10",
        page_timeout=20000  # 20s
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com/async-list",
            config=config
        )
        if result.success:
            print("[OK] Dynamic items loaded. HTML length:", len(result.html))
        else:
            print("[ERROR]", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())
```
### 3.3 Fallback Logic
If you **don’t** prefix `js:` or `css:`, Crawl4AI tries to detect whether your string looks like a CSS selector or a JavaScript snippet. It’ll first attempt a CSS selector. If that fails, it tries to evaluate it as a JavaScript function. This can be convenient but can also lead to confusion if the library guesses incorrectly. It’s often best to be explicit:
- **`"css:.my-selector"`** → Force CSS
- **`"js:() => myAppState.isReady()"`** → Force JavaScript
**What Should My JavaScript Return?**
- A function that returns `true` once the condition is met (and `false` while it isn’t).
- The function can be sync or async; either way, the crawler polls it until it returns `true` or the page timeout is reached.
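As a concrete illustration, here are both flavors as small sketches. The `.loading-spinner` selector is illustrative, and `window.myAppState` (echoing the `myAppState.isReady()` example above) is a hypothetical global your app would need to expose:

```python
from crawl4ai import CrawlerRunConfig

# Sync predicate: true once the loading spinner disappears
# (".loading-spinner" is an illustrative selector)
spinner_cfg = CrawlerRunConfig(
    wait_for="js:() => !document.querySelector('.loading-spinner')"
)

# Async predicate: polled until it resolves to true or page_timeout expires.
# `window.myAppState` is a hypothetical global exposed by your page.
app_state_cfg = CrawlerRunConfig(
    wait_for="js:async () => window.myAppState && window.myAppState.isReady()"
)
```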
---
## 4. Example: Targeted Crawl with Iframes & Wait-For
Below is a more advanced snippet combining these features:
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def main():
    browser_cfg = BrowserConfig(headless=True)
    crawler_cfg = CrawlerRunConfig(
        css_selector=".main-content",
        process_iframes=True,
        wait_for="css:.loaded-indicator",   # Wait for .loaded-indicator to appear
        excluded_tags=["script", "style"],  # Remove script/style tags
        page_timeout=30000,
        verbose=True
    )

    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        result = await crawler.arun(
            url="https://example.com/iframe-heavy",
            config=crawler_cfg
        )
        if result.success:
            print("[OK] Crawled with iframes. Length of final HTML:", len(result.html))
        else:
            print("[ERROR]", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())
```
**What’s Happening**:
1. **`css_selector=".main-content"`** → Focus only on `.main-content` for final extraction.
2. **`process_iframes=True`** → Recursively handle `<iframe>` content.
3. **`wait_for="css:.loaded-indicator"`** → Don’t extract until the page shows `.loaded-indicator`.
4. **`excluded_tags=["script", "style"]`** → Remove script and style tags for a cleaner result.
---
## 5. Common Pitfalls & Tips
1. **Be Explicit**: Using `"js:"` or `"css:"` can spare you headaches if the library guesses incorrectly.
2. **Timeouts**: If the site never triggers your wait condition, a `TimeoutError` can occur. Check your logs or use `verbose=True` for more clues.
3. **Infinite Scroll**: If you have repeated “load more” loops, you might use [Hooks & Custom Code](./hooks-custom.md) or add your own JavaScript for repeated scrolling (see the sketch after this list).
4. **Iframes**: Some iframes are cross-origin or protected. In those cases, you might not be able to read their content. Check your logs for permission errors.
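For the infinite-scroll case (point 3), here is a rough sketch using `CrawlerRunConfig`’s `js_code` parameter to scroll before extraction. The scroll count, delay, `.feed-item` selector, item target, and URL are all placeholders to tune for your site:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

# Illustrative: scroll to the bottom a few times, pausing for content to load
SCROLL_SCRIPT = """
(async () => {
    for (let i = 0; i < 5; i++) {                     // 5 passes is a placeholder
        window.scrollTo(0, document.body.scrollHeight);
        await new Promise(r => setTimeout(r, 1000));  // 1s pause per pass
    }
})();
"""

async def main():
    config = CrawlerRunConfig(
        js_code=SCROLL_SCRIPT,
        # Illustrative target: wait until enough items have accumulated
        wait_for="js:() => document.querySelectorAll('.feed-item').length >= 30"
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com/infinite-feed",  # placeholder URL
            config=config
        )
        if result.success:
            print("[OK] Items captured. HTML length:", len(result.html))
        else:
            print("[ERROR]", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())
```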
---
## 6. Summary & Next Steps
With these **smart crawling techniques**, you can:
- Precisely target or exclude content using CSS selectors.
- Automatically wait for dynamic elements to load using `wait_for`.
- Merge iframe content into your main page result.
### Where to Go Next?
- **[Link & Media Analysis](./link-media-analysis.md)**: Dive deeper into analyzing extracted links and media items.
- **[Hooks & Custom Code](./hooks-custom.md)**: Learn how to implement repeated actions like infinite scroll or login sequences using hooks.
- **Reference**: For an exhaustive list of parameters and advanced usage, see [CrawlerRunConfig Reference](../../reference/configuration.md).
If you run into issues or want to see real examples from other users, check the [How-To Guides](../../how-to/) or raise a question on GitHub.
**Last updated**: 2024-XX-XX
---
That’s it for **Smart Crawling Techniques**! You’re now equipped to handle complex pages that rely on dynamic loading, custom CSS selectors, and iframe embedding.