# Page Interaction
Crawl4AI provides powerful features for interacting with dynamic webpages, handling JavaScript execution, and managing page events.
## JavaScript Execution
### Basic Execution
```python
from crawl4ai.async_configs import CrawlerRunConfig

# Single JavaScript command
config = CrawlerRunConfig(
    js_code="window.scrollTo(0, document.body.scrollHeight);"
)
result = await crawler.arun(url="https://example.com", config=config)

# Multiple commands
js_commands = [
    "window.scrollTo(0, document.body.scrollHeight);",
    "document.querySelector('.load-more').click();",
    "document.querySelector('#consent-button').click();"
]
config = CrawlerRunConfig(js_code=js_commands)
result = await crawler.arun(url="https://example.com", config=config)
```
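The snippets on this page assume an existing `crawler` inside an async context. A minimal runnable scaffold (using `https://example.com` as a stand-in URL) looks like this:

```python
import asyncio

from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import CrawlerRunConfig

async def main():
    config = CrawlerRunConfig(
        js_code="window.scrollTo(0, document.body.scrollHeight);"
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com", config=config)
        print(len(result.cleaned_html))  # Size of the captured, cleaned HTML

asyncio.run(main())
```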
## Wait Conditions
### CSS-Based Waiting
Wait for elements to appear:
```python
config = CrawlerRunConfig(wait_for="css:.dynamic-content") # Wait for element with class 'dynamic-content'
result = await crawler.arun(url="https://example.com", config=config)
```
### JavaScript-Based Waiting
Wait for custom conditions:
```python
# Wait for number of elements
wait_condition = """() => {
    return document.querySelectorAll('.item').length > 10;
}"""
config = CrawlerRunConfig(wait_for=f"js:{wait_condition}")
result = await crawler.arun(url="https://example.com", config=config)

# Wait for dynamic content to load
wait_for_content = """() => {
    const content = document.querySelector('.content');
    return content && content.innerText.length > 100;
}"""
config = CrawlerRunConfig(wait_for=f"js:{wait_for_content}")
result = await crawler.arun(url="https://example.com", config=config)
```
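The same pattern works for negative conditions, such as waiting until a loading indicator disappears (a small sketch; the `.spinner` selector is an assumption about the target page):

```python
# Wait until no loading spinner is present ('.spinner' is a hypothetical selector)
wait_no_spinner = """() => !document.querySelector('.spinner')"""
config = CrawlerRunConfig(wait_for=f"js:{wait_no_spinner}")
result = await crawler.arun(url="https://example.com", config=config)
```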
## Handling Dynamic Content
### Load More Content
Handle infinite scroll or load more buttons:
```python
config = CrawlerRunConfig(
    js_code=[
        "window.previousCount = document.querySelectorAll('.item').length;",  # Remember the current item count
        "window.scrollTo(0, document.body.scrollHeight);",  # Scroll to bottom
        "const loadMore = document.querySelector('.load-more'); if (loadMore) loadMore.click();"  # Click load more
    ],
    wait_for="js:() => document.querySelectorAll('.item').length > window.previousCount"  # Wait for new items
)
result = await crawler.arun(url="https://example.com", config=config)
```
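For pages that lazy-load on every scroll, repeated scrolling can also happen inside a single JS snippet. This is a hedged sketch: whether the crawler awaits the async snippet's promise may vary by version, so `delay_before_return_html` is included as a fallback settle time:

```python
# Scroll several times in one snippet, pausing so lazy-loaded items can render
scroll_script = """
(async () => {
    for (let i = 0; i < 5; i++) {
        window.scrollTo(0, document.body.scrollHeight);
        await new Promise(resolve => setTimeout(resolve, 1000));
    }
})();
"""
config = CrawlerRunConfig(
    js_code=scroll_script,
    delay_before_return_html=2.0  # Extra settle time in case the promise isn't awaited
)
result = await crawler.arun(url="https://example.com", config=config)
```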
### Form Interaction
Handle forms and inputs:
```python
js_form_interaction = """
document.querySelector('#search').value = 'search term'; // Fill form fields
document.querySelector('form').submit(); // Submit form
"""
config = CrawlerRunConfig(
js_code=js_form_interaction,
wait_for="css:.results" # Wait for results to load
)
result = await crawler.arun(url="https://example.com", config=config)
```
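Some JavaScript frameworks ignore direct `.value` assignment and only react to DOM events. Dispatching an `input` event usually makes them register the change (a sketch; `#search` and `.results` are assumed selectors, as above):

```python
js_framework_form = """
const input = document.querySelector('#search');
input.value = 'search term';
input.dispatchEvent(new Event('input', { bubbles: true }));  // Let the framework see the change
document.querySelector('form').submit();
"""
config = CrawlerRunConfig(
    js_code=js_framework_form,
    wait_for="css:.results"
)
result = await crawler.arun(url="https://example.com", config=config)
```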
## Timing Control
### Delays and Timeouts
Control timing of interactions:
```python
config = CrawlerRunConfig(
    page_timeout=60000,           # Page load timeout in milliseconds
    delay_before_return_html=2.0  # Seconds to wait before capturing the final HTML
)
result = await crawler.arun(url="https://example.com", config=config)
```
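These knobs combine naturally with `wait_for`: the wait condition decides when content counts as ready, while the timeout bounds how long the crawl may take overall (a sketch, assuming a `wait_for` that never passes surfaces as a failure within `page_timeout`):

```python
config = CrawlerRunConfig(
    wait_for="js:() => document.querySelectorAll('.item').length > 10",
    page_timeout=30000,           # Give up after 30 seconds
    delay_before_return_html=1.0  # Brief settle time once the condition passes
)
result = await crawler.arun(url="https://example.com", config=config)
```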
## Complex Interactions Example
Here's an example of handling a dynamic page with multiple interactions:
```python
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import CrawlerRunConfig

async def crawl_dynamic_content():
    async with AsyncWebCrawler() as crawler:
        # Initial page load
        config = CrawlerRunConfig(
            js_code="document.querySelector('.cookie-accept')?.click();",  # Handle cookie consent
            wait_for="css:.main-content"
        )
        result = await crawler.arun(url="https://example.com", config=config)

        # Load more content
        session_id = "dynamic_session"  # Keep one session across multiple interactions
        for page in range(3):  # Load 3 pages of content
            config = CrawlerRunConfig(
                session_id=session_id,
                js_code=[
                    "window.scrollTo(0, document.body.scrollHeight);",  # Scroll to bottom
                    "window.previousCount = document.querySelectorAll('.item').length;",  # Store item count
                    "document.querySelector('.load-more')?.click();"  # Click load more
                ],
                wait_for="""js:() => {
                    const currentCount = document.querySelectorAll('.item').length;
                    return currentCount > window.previousCount;
                }""",
                js_only=(page > 0)  # After the first load, run JS without re-navigating
            )
            result = await crawler.arun(url="https://example.com", config=config)
            print(f"Page {page + 1} cleaned HTML length:", len(result.cleaned_html))

        # Clean up session
        await crawler.crawler_strategy.kill_session(session_id)
```
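To run the coroutine from a script:

```python
import asyncio

asyncio.run(crawl_dynamic_content())
```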
## Using with Extraction Strategies
Combine page interaction with structured extraction:
```python
from typing import List

from pydantic import BaseModel

from crawl4ai.async_configs import CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy, LLMExtractionStrategy

# Pattern-based extraction after interaction
schema = {
    "name": "Dynamic Items",
    "baseSelector": ".item",
    "fields": [
        {"name": "title", "selector": "h2", "type": "text"},
        {"name": "description", "selector": ".desc", "type": "text"}
    ]
}
config = CrawlerRunConfig(
    js_code="window.scrollTo(0, document.body.scrollHeight);",
    wait_for="css:.item:nth-child(10)",  # Wait until at least 10 items exist
    extraction_strategy=JsonCssExtractionStrategy(schema)
)
result = await crawler.arun(url="https://example.com", config=config)

# Or use an LLM to analyze dynamic content
class ContentAnalysis(BaseModel):
    topics: List[str]
    summary: str

config = CrawlerRunConfig(
    js_code="document.querySelector('.show-more').click();",
    wait_for="css:.full-content",
    extraction_strategy=LLMExtractionStrategy(
        provider="ollama/nemotron",
        schema=ContentAnalysis.schema(),
        instruction="Analyze the full content"
    )
)
result = await crawler.arun(url="https://example.com", config=config)
```
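With `JsonCssExtractionStrategy`, the structured rows land on the result as a JSON string in `extracted_content`; parsing it back is a one-liner (assuming the crawl above succeeded):

```python
import json

items = json.loads(result.extracted_content)
print(items[0]["title"])  # First extracted item's title
```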