|
# CrawlerRunConfig Parameters Documentation |
|
|
|
## Content Processing Parameters |
|
|
|
| Parameter | Type | Default | Description | |
|
|-----------|------|---------|-------------| |
|
| `word_count_threshold` | int | 200 | Minimum word count threshold before processing content | |
|
| `extraction_strategy` | ExtractionStrategy | None | Strategy to extract structured data from crawled pages. When None, uses NoExtractionStrategy | |
|
| `chunking_strategy` | ChunkingStrategy | RegexChunking() | Strategy to chunk content before extraction | |
|
| `markdown_generator` | MarkdownGenerationStrategy | None | Strategy for generating markdown from extracted content | |
|
| `content_filter` | RelevantContentFilter | None | Optional filter to prune irrelevant content | |
|
| `only_text` | bool | False | If True, attempt to extract text-only content where applicable | |
|
| `css_selector` | str | None | CSS selector to extract a specific portion of the page | |
|
| `excluded_tags` | list[str] | [] | List of HTML tags to exclude from processing | |
|
| `keep_data_attributes` | bool | False | If True, retain `data-*` attributes while removing unwanted attributes | |
|
| `remove_forms` | bool | False | If True, remove all `<form>` elements from the HTML | |
|
| `prettiify` | bool | False | If True, apply `fast_format_html` to produce prettified HTML output | |
|
|
|
## Caching Parameters |
|
|
|
| Parameter | Type | Default | Description | |
|
|-----------|------|---------|-------------| |
|
| `cache_mode` | CacheMode | None | Defines how caching is handled. Defaults to CacheMode.ENABLED internally | |
|
| `session_id` | str | None | Optional session ID to persist browser context and page instance | |
|
| `bypass_cache` | bool | False | Legacy parameter, if True acts like CacheMode.BYPASS | |
|
| `disable_cache` | bool | False | Legacy parameter, if True acts like CacheMode.DISABLED | |
|
| `no_cache_read` | bool | False | Legacy parameter, if True acts like CacheMode.WRITE_ONLY | |
|
| `no_cache_write` | bool | False | Legacy parameter, if True acts like CacheMode.READ_ONLY | |
|
|
|
## Page Navigation and Timing Parameters |
|
|
|
| Parameter | Type | Default | Description | |
|
|-----------|------|---------|-------------| |
|
| `wait_until` | str | "domcontentloaded" | The condition to wait for when navigating | |
|
| `page_timeout` | int | 60000 | Timeout in milliseconds for page operations like navigation | |
|
| `wait_for` | str | None | CSS selector or JS condition to wait for before extracting content | |
|
| `wait_for_images` | bool | True | If True, wait for images to load before extracting content | |
|
| `delay_before_return_html` | float | 0.1 | Delay in seconds before retrieving final HTML | |
|
| `mean_delay` | float | 0.1 | Mean base delay between requests when calling arun_many | |
|
| `max_range` | float | 0.3 | Max random additional delay range for requests in arun_many | |
|
| `semaphore_count` | int | 5 | Number of concurrent operations allowed | |
|
|
|
## Page Interaction Parameters |
|
|
|
| Parameter | Type | Default | Description | |
|
|-----------|------|---------|-------------| |
|
| `js_code` | str or list[str] | None | JavaScript code/snippets to run on the page | |
|
| `js_only` | bool | False | If True, indicates subsequent calls are JS-driven updates | |
|
| `ignore_body_visibility` | bool | True | If True, ignore whether the body is visible before proceeding | |
|
| `scan_full_page` | bool | False | If True, scroll through the entire page to load all content | |
|
| `scroll_delay` | float | 0.2 | Delay in seconds between scroll steps if scan_full_page is True | |
|
| `process_iframes` | bool | False | If True, attempts to process and inline iframe content | |
|
| `remove_overlay_elements` | bool | False | If True, remove overlays/popups before extracting HTML | |
|
| `simulate_user` | bool | False | If True, simulate user interactions for anti-bot measures | |
|
| `override_navigator` | bool | False | If True, overrides navigator properties for more human-like behavior | |
|
| `magic` | bool | False | If True, attempts automatic handling of overlays/popups | |
|
| `adjust_viewport_to_content` | bool | False | If True, adjust viewport according to page content dimensions | |
|
|
|
## Media Handling Parameters |
|
|
|
| Parameter | Type | Default | Description | |
|
|-----------|------|---------|-------------| |
|
| `screenshot` | bool | False | Whether to take a screenshot after crawling | |
|
| `screenshot_wait_for` | float | None | Additional wait time before taking a screenshot | |
|
| `screenshot_height_threshold` | int | 20000 | Threshold for page height to decide screenshot strategy | |
|
| `pdf` | bool | False | Whether to generate a PDF of the page | |
|
| `image_description_min_word_threshold` | int | 50 | Minimum words for image description extraction | |
|
| `image_score_threshold` | int | 3 | Minimum score threshold for processing an image | |
|
| `exclude_external_images` | bool | False | If True, exclude all external images from processing | |
|
|
|
## Link and Domain Handling Parameters |
|
|
|
| Parameter | Type | Default | Description | |
|
|-----------|------|---------|-------------| |
|
| `exclude_social_media_domains` | list[str] | SOCIAL_MEDIA_DOMAINS | List of domains to exclude for social media links | |
|
| `exclude_external_links` | bool | False | If True, exclude all external links from the results | |
|
| `exclude_social_media_links` | bool | False | If True, exclude links pointing to social media domains | |
|
| `exclude_domains` | list[str] | [] | List of specific domains to exclude from results | |
|
|
|
## Debugging and Logging Parameters |
|
|
|
| Parameter | Type | Default | Description | |
|
|-----------|------|---------|-------------| |
|
| `verbose` | bool | True | Enable verbose logging | |
|
| `log_console` | bool | False | If True, log console messages from the page | |