# CrawlerRunConfig Parameters Documentation
## Content Processing Parameters
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `word_count_threshold` | int | 200 | Minimum word count threshold before processing content |
| `extraction_strategy` | ExtractionStrategy | None | Strategy to extract structured data from crawled pages. When None, uses `NoExtractionStrategy` |
| `chunking_strategy` | ChunkingStrategy | RegexChunking() | Strategy to chunk content before extraction |
| `markdown_generator` | MarkdownGenerationStrategy | None | Strategy for generating markdown from extracted content |
| `content_filter` | RelevantContentFilter | None | Optional filter to prune irrelevant content |
| `only_text` | bool | False | If True, attempt to extract text-only content where applicable |
| `css_selector` | str | None | CSS selector to extract a specific portion of the page |
| `excluded_tags` | list[str] | [] | List of HTML tags to exclude from processing |
| `keep_data_attributes` | bool | False | If True, retain `data-*` attributes while removing unwanted attributes |
| `remove_forms` | bool | False | If True, remove all `<form>` elements from the HTML |
| `prettiify` | bool | False | If True, apply `fast_format_html` to produce prettified HTML output |
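
To make the effect of `word_count_threshold` concrete, here is a minimal sketch of a word-count filter over text blocks. The `filter_blocks` helper, the whitespace-based word count, and the inclusive `>=` comparison are assumptions for illustration and are not taken from the library's implementation.

```python
def filter_blocks(blocks: list[str], word_count_threshold: int = 200) -> list[str]:
    """Keep only blocks whose whitespace-separated word count meets the threshold.

    Hypothetical helper: the library may count words or compare differently.
    """
    return [block for block in blocks if len(block.split()) >= word_count_threshold]


# A short navigation-like block and a long article-like block.
blocks = ["Home About Contact", "word " * 250]
kept = filter_blocks(blocks)
```

With the default of 200, only the long block survives; lowering the threshold keeps shorter fragments such as navigation text.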
## Caching Parameters
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `cache_mode` | CacheMode | None | Defines how caching is handled. Defaults to `CacheMode.ENABLED` internally |
| `session_id` | str | None | Optional session ID to persist browser context and page instance |
| `bypass_cache` | bool | False | Legacy parameter, if True acts like `CacheMode.BYPASS` |
| `disable_cache` | bool | False | Legacy parameter, if True acts like `CacheMode.DISABLED` |
| `no_cache_read` | bool | False | Legacy parameter, if True acts like `CacheMode.WRITE_ONLY` |
| `no_cache_write` | bool | False | Legacy parameter, if True acts like `CacheMode.READ_ONLY` |
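
The mapping from the legacy flags to cache modes in the table above can be sketched as follows. The `CacheMode` enum here is a local stand-in containing only the members the table names, and `resolve_cache_mode` is a hypothetical helper; the precedence order when several flags are set is an assumption, not documented behavior.

```python
from enum import Enum
from typing import Optional


class CacheMode(Enum):
    """Local stand-in with only the members named in the table."""
    ENABLED = "enabled"
    DISABLED = "disabled"
    READ_ONLY = "read_only"
    WRITE_ONLY = "write_only"
    BYPASS = "bypass"


def resolve_cache_mode(
    cache_mode: Optional[CacheMode] = None,
    bypass_cache: bool = False,
    disable_cache: bool = False,
    no_cache_read: bool = False,
    no_cache_write: bool = False,
) -> CacheMode:
    """Translate legacy boolean flags into a single CacheMode value."""
    # Assumption: an explicit cache_mode wins over any legacy flag.
    if cache_mode is not None:
        return cache_mode
    # Assumption: this precedence order is chosen for illustration only.
    if disable_cache:
        return CacheMode.DISABLED
    if bypass_cache:
        return CacheMode.BYPASS
    if no_cache_read:
        return CacheMode.WRITE_ONLY
    if no_cache_write:
        return CacheMode.READ_ONLY
    # Per the table, an unset cache_mode defaults to ENABLED.
    return CacheMode.ENABLED


mode = resolve_cache_mode(no_cache_read=True)
```

Since the boolean flags are described as legacy, new code would presumably pass `cache_mode` directly.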
## Page Navigation and Timing Parameters
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `wait_until` | str | "domcontentloaded" | The condition to wait for when navigating |
| `page_timeout` | int | 60000 | Timeout in milliseconds for page operations like navigation |
| `wait_for` | str | None | CSS selector or JS condition to wait for before extracting content |
| `wait_for_images` | bool | True | If True, wait for images to load before extracting content |
| `delay_before_return_html` | float | 0.1 | Delay in seconds before retrieving final HTML |
| `mean_delay` | float | 0.1 | Mean base delay between requests when calling `arun_many` |
| `max_range` | float | 0.3 | Max random additional delay range for requests in `arun_many` |
| `semaphore_count` | int | 5 | Number of concurrent operations allowed |
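
One way to read `mean_delay`, `max_range`, and `semaphore_count` together is a base delay plus a random extra, applied under a concurrency limit. The sketch below uses only `asyncio` and `random`; `request_delay` and `fetch_all` are hypothetical names, the uniform distribution is an assumption, and no real network requests are made.

```python
import asyncio
import random


def request_delay(mean_delay: float = 0.1, max_range: float = 0.3) -> float:
    """Base delay plus a random extra in [0, max_range] (assumed uniform)."""
    return mean_delay + random.uniform(0, max_range)


async def fetch_all(
    urls: list[str],
    semaphore_count: int = 5,
    mean_delay: float = 0.1,
    max_range: float = 0.3,
) -> list[str]:
    """Simulate fetching URLs with at most semaphore_count running at once."""
    semaphore = asyncio.Semaphore(semaphore_count)

    async def fetch_one(url: str) -> str:
        async with semaphore:
            # Sleep stands in for the pause between requests.
            await asyncio.sleep(request_delay(mean_delay, max_range))
            return f"fetched {url}"

    # gather preserves the order of the input URLs.
    return await asyncio.gather(*(fetch_one(url) for url in urls))


results = asyncio.run(
    fetch_all(
        ["https://example.com/a", "https://example.com/b"],
        mean_delay=0.01,
        max_range=0.02,
    )
)
```

Under this reading, the defaults give each request a pause between 0.1 and 0.4 seconds, with no more than five operations in flight.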
## Page Interaction Parameters
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `js_code` | str or list[str] | None | JavaScript code/snippets to run on the page |
| `js_only` | bool | False | If True, indicates subsequent calls are JS-driven updates |
| `ignore_body_visibility` | bool | True | If True, ignore whether the body is visible before proceeding |
| `scan_full_page` | bool | False | If True, scroll through the entire page to load all content |
| `scroll_delay` | float | 0.2 | Delay in seconds between scroll steps if `scan_full_page` is True |
| `process_iframes` | bool | False | If True, attempts to process and inline iframe content |
| `remove_overlay_elements` | bool | False | If True, remove overlays/popups before extracting HTML |
| `simulate_user` | bool | False | If True, simulate user interactions for anti-bot measures |
| `override_navigator` | bool | False | If True, overrides navigator properties for more human-like behavior |
| `magic` | bool | False | If True, attempts automatic handling of overlays/popups |
| `adjust_viewport_to_content` | bool | False | If True, adjust viewport according to page content dimensions |
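
Because `js_code` accepts either a single string or a list of strings, calling code often wants one uniform shape. The following is a minimal sketch of such a normalization; `normalize_js_code` is a hypothetical helper and does not describe how the library handles the value internally.

```python
from typing import Union


def normalize_js_code(js_code: Union[str, list[str], None]) -> list[str]:
    """Return js_code as a list of snippets, whatever form it was given in."""
    if js_code is None:
        return []
    if isinstance(js_code, str):
        return [js_code]
    # Copy so later changes to the result do not affect the caller's list.
    return list(js_code)


single = normalize_js_code("window.scrollTo(0, document.body.scrollHeight);")
several = normalize_js_code([
    "document.querySelector('button.load-more')?.click();",
    "window.scrollTo(0, document.body.scrollHeight);",
])
```

The selector `button.load-more` is a placeholder; any snippet that is valid in the target page can be used.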
## Media Handling Parameters
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `screenshot` | bool | False | Whether to take a screenshot after crawling |
| `screenshot_wait_for` | float | None | Additional wait time before taking a screenshot |
| `screenshot_height_threshold` | int | 20000 | Threshold for page height to decide screenshot strategy |
| `pdf` | bool | False | Whether to generate a PDF of the page |
| `image_description_min_word_threshold` | int | 50 | Minimum words for image description extraction |
| `image_score_threshold` | int | 3 | Minimum score threshold for processing an image |
| `exclude_external_images` | bool | False | If True, exclude all external images from processing |
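
The interaction between `image_score_threshold` and `exclude_external_images` can be illustrated with a small filter. The `filter_images` helper and the image records with `src` and `score` keys are assumptions; how the library actually scores images is not covered here, so scores are taken as given.

```python
from urllib.parse import urlparse


def filter_images(
    images: list[dict],
    page_url: str,
    image_score_threshold: int = 3,
    exclude_external_images: bool = False,
) -> list[dict]:
    """Keep images that meet the score threshold and, optionally, are internal."""
    page_host = urlparse(page_url).netloc.lower()
    kept = []
    for image in images:
        if image["score"] < image_score_threshold:
            continue
        host = urlparse(image["src"]).netloc.lower()
        # A relative src has an empty host and is treated as internal.
        if exclude_external_images and host and host != page_host:
            continue
        kept.append(image)
    return kept


images = [
    {"src": "/img/hero.png", "score": 5},
    {"src": "https://cdn.example.org/banner.png", "score": 4},
    {"src": "/img/spacer.gif", "score": 1},
]
internal_only = filter_images(images, "https://example.com/", exclude_external_images=True)
```

Here the spacer is dropped by the score threshold and the CDN banner by the external-image flag, leaving only the hero image.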
## Link and Domain Handling Parameters
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `exclude_social_media_domains` | list[str] | SOCIAL_MEDIA_DOMAINS | List of domains to exclude for social media links |
| `exclude_external_links` | bool | False | If True, exclude all external links from the results |
| `exclude_social_media_links` | bool | False | If True, exclude links pointing to social media domains |
| `exclude_domains` | list[str] | [] | List of specific domains to exclude from results |
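
The four link parameters above combine into a single filtering pass, which can be sketched as follows. `filter_links` and `matches_domain` are hypothetical helpers, the `SOCIAL_MEDIA_DOMAINS` list is a short stand-in rather than the library's actual constant, and subdomain matching is an assumption.

```python
from urllib.parse import urlparse

# Stand-in list for illustration; the library's constant may differ.
SOCIAL_MEDIA_DOMAINS = ["facebook.com", "twitter.com", "linkedin.com"]


def matches_domain(host: str, domains) -> bool:
    """True if host equals a listed domain or is a subdomain of one."""
    return any(host == d or host.endswith("." + d) for d in domains)


def filter_links(
    links: list[str],
    page_url: str,
    exclude_external_links: bool = False,
    exclude_social_media_links: bool = False,
    exclude_social_media_domains=SOCIAL_MEDIA_DOMAINS,
    exclude_domains=(),
) -> list[str]:
    """Drop links according to the exclusion flags and domain lists."""
    page_host = urlparse(page_url).netloc.lower()
    kept = []
    for link in links:
        host = urlparse(link).netloc.lower()
        # Relative links have an empty host and count as internal.
        is_external = bool(host) and host != page_host
        if exclude_external_links and is_external:
            continue
        if exclude_social_media_links and matches_domain(host, exclude_social_media_domains):
            continue
        if matches_domain(host, exclude_domains):
            continue
        kept.append(link)
    return kept


links = [
    "https://example.com/docs",
    "https://www.twitter.com/example",
    "https://ads.example.net/banner",
    "/relative",
]
filtered = filter_links(
    links,
    "https://example.com/",
    exclude_social_media_links=True,
    exclude_domains=["example.net"],
)
```

In this sketch `exclude_social_media_domains` only takes effect when `exclude_social_media_links` is True, while `exclude_domains` always applies.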
## Debugging and Logging Parameters
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `verbose` | bool | True | Enable verbose logging |
| `log_console` | bool | False | If True, log console messages from the page |