### Content Selection
Crawl4AI provides multiple ways to select and filter specific content from webpages. Learn how to precisely target the content you need.
#### CSS Selectors
Extract specific content using a `CrawlerRunConfig` with CSS selectors:
```python
from crawl4ai.async_configs import CrawlerRunConfig

# Target the main article content
config = CrawlerRunConfig(css_selector=".main-article")
result = await crawler.arun(url="https://crawl4ai.com", config=config)

# Target the heading and content blocks (comma-separated selectors are supported)
config = CrawlerRunConfig(css_selector="article h1, article .content")
result = await crawler.arun(url="https://crawl4ai.com", config=config)
```
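The snippets above assume an existing `AsyncWebCrawler` instance named `crawler`. For reference, a minimal end-to-end sketch (the selector is illustrative):
```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import CrawlerRunConfig

async def main():
    config = CrawlerRunConfig(css_selector=".main-article")
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://crawl4ai.com", config=config)
        print(result.markdown)  # Markdown generated only from the matched element(s)

asyncio.run(main())
```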
#### Content Filtering
Control content inclusion or exclusion with `CrawlerRunConfig`:
```python
config = CrawlerRunConfig(
    word_count_threshold=10,                            # Minimum words per text block
    excluded_tags=['form', 'header', 'footer', 'nav'],  # Tags to strip entirely
    exclude_external_links=True,                        # Remove links to other domains
    exclude_social_media_links=True,                    # Remove links to social media sites
    exclude_external_images=True                        # Remove images hosted on other domains
)
result = await crawler.arun(url="https://crawl4ai.com", config=config)
```
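As a rough sanity check, one hedged sketch is to compare filtered and unfiltered markdown lengths (recent versions may return a `MarkdownGenerationResult` rather than a plain string, hence the `str()` coercion):
```python
baseline = await crawler.arun(url="https://crawl4ai.com")
filtered = await crawler.arun(url="https://crawl4ai.com", config=config)

# Fewer characters survive once excluded tags and short blocks are stripped
print(f"unfiltered: {len(str(baseline.markdown))} chars")
print(f"filtered:   {len(str(filtered.markdown))} chars")
```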
#### Iframe Content
Process iframe content by enabling specific options in `CrawlerRunConfig`:
```python
config = CrawlerRunConfig(
    process_iframes=True,          # Inline iframe content into the extracted page
    remove_overlay_elements=True   # Remove popups/modals that might block iframes
)
result = await crawler.arun(url="https://crawl4ai.com", config=config)
```
#### Structured Content Selection Using LLMs
Leverage LLMs for intelligent content extraction:
```python
import json
from crawl4ai.async_configs import CrawlerRunConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel
from typing import List

class ArticleContent(BaseModel):
    title: str
    main_points: List[str]
    conclusion: str

strategy = LLMExtractionStrategy(
    provider="ollama/nemotron",
    schema=ArticleContent.schema(),  # on Pydantic v2, prefer ArticleContent.model_json_schema()
    instruction="Extract the main article title, key points, and conclusion"
)

config = CrawlerRunConfig(extraction_strategy=strategy)
result = await crawler.arun(url="https://crawl4ai.com", config=config)
article = json.loads(result.extracted_content)
```
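Depending on the provider and chunking, `extracted_content` may parse to a JSON list of blocks rather than a single object, so a defensive read looks like this (a sketch, not part of the official API):
```python
blocks = article if isinstance(article, list) else [article]
for block in blocks:
    print(block.get("title"))
    for point in block.get("main_points", []):
        print(" -", point)
```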
#### Pattern-Based Selection
Extract content matching repetitive patterns:
```python
import json
from crawl4ai.async_configs import CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

schema = {
    "name": "News Articles",
    "baseSelector": "article.news-item",
    "fields": [
        {"name": "headline", "selector": "h2", "type": "text"},
        {"name": "summary", "selector": ".summary", "type": "text"},
        {"name": "category", "selector": ".category", "type": "text"},
        {
            "name": "metadata",
            "type": "nested",
            "selector": ".metadata",  # nested fields need their own container selector (illustrative)
            "fields": [
                {"name": "author", "selector": ".author", "type": "text"},
                {"name": "date", "selector": ".date", "type": "text"}
            ]
        }
    ]
}

strategy = JsonCssExtractionStrategy(schema)
config = CrawlerRunConfig(extraction_strategy=strategy)
result = await crawler.arun(url="https://crawl4ai.com", config=config)
articles = json.loads(result.extracted_content)
```
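`extracted_content` parses to a JSON array with one object per `baseSelector` match, and nested fields come back as nested dictionaries. For example:
```python
for item in articles:
    meta = item.get("metadata", {})
    print(f"[{item.get('category')}] {item.get('headline')}")
    print(f"    by {meta.get('author')} on {meta.get('date')}")
```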
#### Comprehensive Example
Combine different selection methods using `CrawlerRunConfig`:
```python
import json
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def extract_article_content(url: str):
    # Define structured extraction
    article_schema = {
        "name": "Article",
        "baseSelector": "article.main",
        "fields": [
            {"name": "title", "selector": "h1", "type": "text"},
            {"name": "content", "selector": ".content", "type": "text"}
        ]
    }

    # Combine structured extraction with content filtering
    config = CrawlerRunConfig(
        extraction_strategy=JsonCssExtractionStrategy(article_schema),
        word_count_threshold=10,
        excluded_tags=['nav', 'footer'],
        exclude_external_links=True
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url, config=config)
        return json.loads(result.extracted_content)
```
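Calling the helper from synchronous code (the URL is a placeholder):
```python
import asyncio

article = asyncio.run(extract_article_content("https://crawl4ai.com"))
print(article)
```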