Spaces:
Runtime error
Runtime error
Content Selection
Crawl4AI provides multiple ways to select and filter specific content from webpages. Learn how to precisely target the content you need.
CSS Selectors
Extract specific content using a CrawlerRunConfig
with CSS selectors:
from crawl4ai.async_configs import CrawlerRunConfig
config = CrawlerRunConfig(css_selector=".main-article") # Target main article content
result = await crawler.arun(url="https://crawl4ai.com", config=config)
config = CrawlerRunConfig(css_selector="article h1, article .content") # Target heading and content
result = await crawler.arun(url="https://crawl4ai.com", config=config)
Content Filtering
Control content inclusion or exclusion with CrawlerRunConfig
:
config = CrawlerRunConfig(
word_count_threshold=10, # Minimum words per block
excluded_tags=['form', 'header', 'footer', 'nav'], # Excluded tags
exclude_external_links=True, # Remove external links
exclude_social_media_links=True, # Remove social media links
exclude_external_images=True # Remove external images
)
result = await crawler.arun(url="https://crawl4ai.com", config=config)
Iframe Content
Process iframe content by enabling specific options in CrawlerRunConfig
:
config = CrawlerRunConfig(
process_iframes=True, # Extract iframe content
remove_overlay_elements=True # Remove popups/modals that might block iframes
)
result = await crawler.arun(url="https://crawl4ai.com", config=config)
Structured Content Selection Using LLMs
Leverage LLMs for intelligent content extraction:
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel
from typing import List
class ArticleContent(BaseModel):
title: str
main_points: List[str]
conclusion: str
strategy = LLMExtractionStrategy(
provider="ollama/nemotron",
schema=ArticleContent.schema(),
instruction="Extract the main article title, key points, and conclusion"
)
config = CrawlerRunConfig(extraction_strategy=strategy)
result = await crawler.arun(url="https://crawl4ai.com", config=config)
article = json.loads(result.extracted_content)
Pattern-Based Selection
Extract content matching repetitive patterns:
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
schema = {
"name": "News Articles",
"baseSelector": "article.news-item",
"fields": [
{"name": "headline", "selector": "h2", "type": "text"},
{"name": "summary", "selector": ".summary", "type": "text"},
{"name": "category", "selector": ".category", "type": "text"},
{
"name": "metadata",
"type": "nested",
"fields": [
{"name": "author", "selector": ".author", "type": "text"},
{"name": "date", "selector": ".date", "type": "text"}
]
}
]
}
strategy = JsonCssExtractionStrategy(schema)
config = CrawlerRunConfig(extraction_strategy=strategy)
result = await crawler.arun(url="https://crawl4ai.com", config=config)
articles = json.loads(result.extracted_content)
Comprehensive Example
Combine different selection methods using CrawlerRunConfig
:
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
async def extract_article_content(url: str):
# Define structured extraction
article_schema = {
"name": "Article",
"baseSelector": "article.main",
"fields": [
{"name": "title", "selector": "h1", "type": "text"},
{"name": "content", "selector": ".content", "type": "text"}
]
}
# Define configuration
config = CrawlerRunConfig(
extraction_strategy=JsonCssExtractionStrategy(article_schema),
word_count_threshold=10,
excluded_tags=['nav', 'footer'],
exclude_external_links=True
)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(url=url, config=config)
return json.loads(result.extracted_content)