Smart Crawling Techniques
In the previous tutorial (AsyncWebCrawler Basics), you learned how to create an AsyncWebCrawler instance, run a basic crawl, and inspect the CrawlResult. Now it's time to explore some of the targeted crawling features that let you:
- Select specific parts of a webpage using CSS selectors
- Exclude or ignore certain page elements
- Wait for dynamic content to load using wait_for (with css: or js: rules)
- (Optionally) Handle iframes if your target site embeds additional content
Prerequisites
- You’ve read or completed AsyncWebCrawler Basics.
- You have a working environment for Crawl4AI (Playwright installed, etc.).
1. Targeting Specific Elements with CSS Selectors
1.1 Simple CSS Selector Usage
Let's say you only need to crawl the main article content of a news page. By setting css_selector in CrawlerRunConfig, your final HTML or Markdown output focuses on that region. For example:
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def main():
    browser_cfg = BrowserConfig(headless=True)
    crawler_cfg = CrawlerRunConfig(
        css_selector=".article-body",      # Only capture .article-body content
        excluded_tags=["nav", "footer"]    # Optional: skip big nav & footer sections
    )

    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        result = await crawler.arun(
            url="https://news.example.com/story/12345",
            config=crawler_cfg
        )
        if result.success:
            print("[OK] Extracted content length:", len(result.html))
        else:
            print("[ERROR]", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())
Key Parameters:
- css_selector: Tells the crawler to focus on .article-body.
- excluded_tags: Tells the crawler to skip specific HTML tags altogether (e.g., nav or footer).
Tip: For extremely noisy pages, you can further refine how you exclude certain elements by using excluded_selector, which takes a CSS selector you want removed from the final output.
1.2 Excluding Content with excluded_selector
If you want to remove certain sections within .article-body (like "related stories" sidebars), set:
CrawlerRunConfig(
    css_selector=".article-body",
    excluded_selector=".related-stories, .ads-banner"
)
This combination grabs the main article content while filtering out sidebars or ads.
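If you want to sanity-check the exclusion, one quick approach is to crawl the page and confirm that text from the removed sections no longer shows up in the output. Below is a minimal sketch; the URL, class names, and the "Related Stories" phrase are placeholders for illustration, not part of any real site:

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    config = CrawlerRunConfig(
        css_selector=".article-body",
        excluded_selector=".related-stories, .ads-banner"
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://news.example.com/story/12345",  # placeholder URL
            config=config
        )
        if result.success:
            md = str(result.markdown)  # markdown generated from the filtered content
            # Rough sanity check: the sidebar heading text should be gone
            print("Sidebar removed?", "Related Stories" not in md)
            print("[OK] Filtered content length:", len(md))
        else:
            print("[ERROR]", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())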
2. Handling Iframes
Some sites embed extra content via <iframe> elements, for example embedded videos or external forms. If you want the crawler to traverse these iframes and merge their content into the final HTML or Markdown, set:
crawler_cfg = CrawlerRunConfig(
    process_iframes=True
)
- process_iframes=True: Tells the crawler (specifically the underlying Playwright strategy) to recursively fetch iframe content and integrate it into result.html and result.markdown.
Warning: Not all sites allow iframes to be crawled (some cross-origin policies might block it). If you see partial or missing data, check the domain policy or logs for warnings.
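A practical way to see whether iframe content actually made it into your result is to crawl the same page with and without process_iframes and compare the output sizes. This is just a rough sketch with a placeholder URL; depending on your version, you may also need to disable result caching so the second request is fetched fresh:

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    url = "https://example.com/iframe-heavy"  # placeholder URL

    async with AsyncWebCrawler() as crawler:
        plain = await crawler.arun(url=url, config=CrawlerRunConfig(process_iframes=False))
        merged = await crawler.arun(url=url, config=CrawlerRunConfig(process_iframes=True))

    if plain.success and merged.success:
        # If the iframes were readable, the merged result is usually noticeably larger
        print("Without iframes:", len(plain.html))
        print("With iframes:   ", len(merged.html))
    else:
        print("[ERROR]", plain.error_message or merged.error_message)

if __name__ == "__main__":
    asyncio.run(main())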
3. Waiting for Dynamic Content
Many modern sites load content dynamically (e.g., after user interaction or asynchronously). Crawl4AI helps you wait for specific conditions before capturing the final HTML. Let's look at wait_for.
3.1 wait_for Basics
In CrawlerRunConfig, wait_for can be a simple CSS selector or a JavaScript condition. Under the hood, Crawl4AI uses smart_wait to interpret what you provide.
crawler_cfg = CrawlerRunConfig(
    wait_for="css:.main-article-loaded",
    page_timeout=30000
)
Example: css:.main-article-loaded means "Wait for an element with the class .main-article-loaded to appear in the DOM." If it doesn't appear within 30 seconds, you'll get a timeout.
3.2 Using Explicit Prefixes
The js: and css: prefixes can explicitly tell the crawler which approach to use:
- wait_for="css:.comments-section" → Wait for .comments-section to appear
- wait_for="js:() => document.querySelectorAll('.comments').length > 5" → Wait until there are at least 6 comment elements
Code Example:
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    config = CrawlerRunConfig(
        wait_for="js:() => document.querySelectorAll('.dynamic-items li').length >= 10",
        page_timeout=20000  # 20s
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com/async-list",
            config=config
        )
        if result.success:
            print("[OK] Dynamic items loaded. HTML length:", len(result.html))
        else:
            print("[ERROR]", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())
3.3 Fallback Logic
If you don't prefix js: or css:, Crawl4AI tries to detect whether your string looks like a CSS selector or a JavaScript snippet. It'll first attempt a CSS selector. If that fails, it tries to evaluate it as a JavaScript function. This can be convenient but can also lead to confusion if the library guesses incorrectly. It's often best to be explicit:
- "css:.my-selector" → Force CSS
- "js:() => myAppState.isReady()" → Force JavaScript
What Should My JavaScript Return?
- A function that returns true once the condition is met (or false if it fails).
- The function can be sync or async, but note that the crawler wraps it in an async loop to poll until true or timeout.
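For example, if the target page exposes some readiness flag you can poll, you can combine it with an element count in a single js: condition. The snippet below is a sketch under assumptions: window.appState.commentsLoaded, the .comment selector, and the URL are all hypothetical, and the async form relies on the note above that async conditions are supported:

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

# Hypothetical condition: the page sets window.appState.commentsLoaded = true
# once its comment widget has finished rendering.
wait_condition = """js:async () => {
    const enoughComments = document.querySelectorAll('.comment').length >= 3;
    const appReady = window.appState && window.appState.commentsLoaded === true;
    return enoughComments && appReady;
}"""

async def main():
    config = CrawlerRunConfig(wait_for=wait_condition, page_timeout=25000)
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com/comments", config=config)
        if result.success:
            print("[OK] Comments loaded. HTML length:", len(result.html))
        else:
            print("[ERROR]", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())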
4. Example: Targeted Crawl with Iframes & Wait-For
Below is a more advanced snippet combining these features:
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def main():
    browser_cfg = BrowserConfig(headless=True)
    crawler_cfg = CrawlerRunConfig(
        css_selector=".main-content",
        process_iframes=True,
        wait_for="css:.loaded-indicator",   # Wait for .loaded-indicator to appear
        excluded_tags=["script", "style"],  # Remove script/style tags
        page_timeout=30000,
        verbose=True
    )

    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        result = await crawler.arun(
            url="https://example.com/iframe-heavy",
            config=crawler_cfg
        )
        if result.success:
            print("[OK] Crawled with iframes. Length of final HTML:", len(result.html))
        else:
            print("[ERROR]", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())
What’s Happening:
css_selector=".main-content"
→ Focus only on.main-content
for final extraction.process_iframes=True
→ Recursively handle<iframe>
content.wait_for="css:.loaded-indicator"
→ Don’t extract until the page shows.loaded-indicator
.excluded_tags=["script", "style"]
→ Remove script and style tags for a cleaner result.
5. Common Pitfalls & Tips
- Be Explicit: Using "js:" or "css:" prefixes can spare you headaches if the library guesses incorrectly.
- Timeouts: If the site never triggers your wait condition, a TimeoutError can occur. Check your logs or use verbose=True for more clues.
- Infinite Scroll: If you have repeated "load more" loops, you might use Hooks & Custom Code or add your own JavaScript for repeated scrolling (see the sketch after this list).
- Iframes: Some iframes are cross-origin or protected. In those cases, you might not be able to read their content. Check your logs for permission errors.
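For the infinite-scroll case, one starting point before reaching for hooks is to inject your own scrolling JavaScript and then wait for the item count you expect. The sketch below assumes your Crawl4AI version supports a js_code option on CrawlerRunConfig for running scripts after page load; the URL, the .feed-item selector, and the loop/timing values are placeholders to tune for your site:

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

# Scroll to the bottom a few times, pausing so the page can fetch more items.
scroll_script = """
(async () => {
    for (let i = 0; i < 5; i++) {
        window.scrollTo(0, document.body.scrollHeight);
        await new Promise(r => setTimeout(r, 1000));
    }
})();
"""

async def main():
    config = CrawlerRunConfig(
        js_code=scroll_script,  # assumption: js_code is available in your version
        wait_for="js:() => document.querySelectorAll('.feed-item').length >= 50",
        page_timeout=60000
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com/infinite-feed", config=config)
        if result.success:
            print("[OK] Items captured. HTML length:", len(result.html))
        else:
            print("[ERROR]", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())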
6. Summary & Next Steps
With these smart crawling techniques you can:
- Precisely target or exclude content using CSS selectors.
- Automatically wait for dynamic elements to load using wait_for.
- Merge iframe content into your main page result.
Where to Go Next?
- Link & Media Analysis: Dive deeper into analyzing extracted links and media items.
- Hooks & Custom Code: Learn how to implement repeated actions like infinite scroll or login sequences using hooks.
- Reference: For an exhaustive list of parameters and advanced usage, see CrawlerRunConfig Reference.
If you run into issues or want to see real examples from other users, check the How-To Guides or raise a question on GitHub.
Last updated: 2024-XX-XX
That's it for Smart Crawling Techniques! You're now equipped to handle complex pages that rely on dynamic loading, custom CSS selectors, and iframe embedding.