# CrawlerRunConfig Parameters Documentation

## Content Processing Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `word_count_threshold` | int | 200 | Minimum word count threshold before processing content |
| `extraction_strategy` | ExtractionStrategy | None | Strategy to extract structured data from crawled pages. When None, uses NoExtractionStrategy |
| `chunking_strategy` | ChunkingStrategy | RegexChunking() | Strategy to chunk content before extraction |
| `markdown_generator` | MarkdownGenerationStrategy | None | Strategy for generating markdown from extracted content |
| `content_filter` | RelevantContentFilter | None | Optional filter to prune irrelevant content |
| `only_text` | bool | False | If True, attempt to extract text-only content where applicable |
| `css_selector` | str | None | CSS selector to extract a specific portion of the page |
| `excluded_tags` | list[str] | [] | List of HTML tags to exclude from processing |
| `keep_data_attributes` | bool | False | If True, retain `data-*` attributes while removing unwanted attributes |
| `remove_forms` | bool | False | If True, remove all `<form>` elements from the HTML |
| `prettiify` | bool | False | If True, apply `fast_format_html` to produce prettified HTML output |
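
To make `word_count_threshold` concrete, here is a minimal sketch of a word-count filter over text blocks. The helper `filter_blocks` is hypothetical and not part of Crawl4AI, and the "at least the threshold" comparison is an assumption based on the description above, not a statement of how the library applies the value.

```python
def filter_blocks(blocks, word_count_threshold=200):
    """Keep only text blocks containing at least `word_count_threshold` words.

    Hypothetical helper for illustration; the inclusive comparison is an assumption.
    """
    return [block for block in blocks if len(block.split()) >= word_count_threshold]


# A short navigation snippet is dropped, a long article body is kept.
blocks = ["Home About Contact", "word " * 250]
kept = filter_blocks(blocks)
```

Lowering the threshold keeps shorter fragments such as captions, at the cost of letting more boilerplate through.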

## Caching Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `cache_mode` | CacheMode | None | Defines how caching is handled. When None, `CacheMode.ENABLED` is used internally |
| `session_id` | str | None | Optional session ID to persist browser context and page instance |
| `bypass_cache` | bool | False | Legacy parameter; if True, acts like `CacheMode.BYPASS` |
| `disable_cache` | bool | False | Legacy parameter; if True, acts like `CacheMode.DISABLED` |
| `no_cache_read` | bool | False | Legacy parameter; if True, acts like `CacheMode.WRITE_ONLY` |
| `no_cache_write` | bool | False | Legacy parameter; if True, acts like `CacheMode.READ_ONLY` |
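
The mapping from legacy flags to cache modes can be sketched as a small resolver. The `CacheMode` enum below is a local stand-in that only mirrors the member names from the table, and `resolve_cache_mode` is a hypothetical helper. The precedence order (explicit `cache_mode` first, then the flags in the order shown) is an assumption, not the library's documented behavior.

```python
from enum import Enum


class CacheMode(Enum):
    """Local stand-in mirroring the member names used in the table above."""
    ENABLED = "enabled"
    DISABLED = "disabled"
    READ_ONLY = "read_only"
    WRITE_ONLY = "write_only"
    BYPASS = "bypass"


def resolve_cache_mode(cache_mode=None, bypass_cache=False, disable_cache=False,
                       no_cache_read=False, no_cache_write=False):
    """Translate legacy boolean flags into a single CacheMode value."""
    # Assumed precedence: an explicit cache_mode always wins over legacy flags.
    if cache_mode is not None:
        return cache_mode
    if disable_cache:
        return CacheMode.DISABLED
    if bypass_cache:
        return CacheMode.BYPASS
    if no_cache_read:
        return CacheMode.WRITE_ONLY   # never read, still write
    if no_cache_write:
        return CacheMode.READ_ONLY    # still read, never write
    # No flag set and cache_mode is None: fall back to ENABLED.
    return CacheMode.ENABLED
```

New code would normally pass `cache_mode` directly and leave the legacy flags at their defaults.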

## Page Navigation and Timing Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `wait_until` | str | "domcontentloaded" | The condition to wait for when navigating |
| `page_timeout` | int | 60000 | Timeout in milliseconds for page operations like navigation |
| `wait_for` | str | None | CSS selector or JS condition to wait for before extracting content |
| `wait_for_images` | bool | True | If True, wait for images to load before extracting content |
| `delay_before_return_html` | float | 0.1 | Delay in seconds before retrieving final HTML |
| `mean_delay` | float | 0.1 | Mean base delay between requests when calling `arun_many` |
| `max_range` | float | 0.3 | Max random additional delay range for requests in `arun_many` |
| `semaphore_count` | int | 5 | Number of concurrent operations allowed |
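
One way to read `mean_delay`, `max_range`, and `semaphore_count` together is as a base delay plus random jitter, applied inside a concurrency limit. The sketch below uses only the standard library. `request_delay`, `run_many`, and the `fetch` callback are hypothetical names, and the uniform jitter is an assumption about how the "random additional delay" is drawn.

```python
import asyncio
import random


def request_delay(mean_delay=0.1, max_range=0.3):
    """Base delay plus a random extra drawn uniformly from [0, max_range] (assumed)."""
    return mean_delay + random.uniform(0, max_range)


async def run_many(urls, fetch, semaphore_count=5, mean_delay=0.1, max_range=0.3):
    """Run `fetch(url)` for every URL with at most `semaphore_count` in flight."""
    semaphore = asyncio.Semaphore(semaphore_count)

    async def worker(url):
        async with semaphore:
            # Pause before each request so calls are spread out in time.
            await asyncio.sleep(request_delay(mean_delay, max_range))
            return await fetch(url)

    # gather preserves the order of the input URLs in its result list.
    return await asyncio.gather(*(worker(url) for url in urls))
```

With the defaults, each request would wait between 0.1 and 0.4 seconds, and no more than five requests would run at once.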

## Page Interaction Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `js_code` | str or list[str] | None | JavaScript code/snippets to run on the page |
| `js_only` | bool | False | If True, indicates subsequent calls are JS-driven updates |
| `ignore_body_visibility` | bool | True | If True, ignore whether the body is visible before proceeding |
| `scan_full_page` | bool | False | If True, scroll through the entire page to load all content |
| `scroll_delay` | float | 0.2 | Delay in seconds between scroll steps if `scan_full_page` is True |
| `process_iframes` | bool | False | If True, attempts to process and inline iframe content |
| `remove_overlay_elements` | bool | False | If True, remove overlays/popups before extracting HTML |
| `simulate_user` | bool | False | If True, simulate user interactions for anti-bot measures |
| `override_navigator` | bool | False | If True, overrides navigator properties for more human-like behavior |
| `magic` | bool | False | If True, attempts automatic handling of overlays/popups |
| `adjust_viewport_to_content` | bool | False | If True, adjust viewport according to page content dimensions |
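
Because `js_code` accepts either a single string or a list of strings, calling code often wants one uniform shape. The following is a small sketch of that normalization. `normalize_js_code` is a hypothetical helper, not a Crawl4AI function.

```python
def normalize_js_code(js_code):
    """Return `js_code` as a list of snippets, whatever form it was given in."""
    if js_code is None:
        return []            # default: nothing to run
    if isinstance(js_code, str):
        return [js_code]     # single snippet becomes a one-element list
    return list(js_code)     # already a sequence of snippets


snippets = normalize_js_code("window.scrollTo(0, document.body.scrollHeight);")
```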

## Media Handling Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `screenshot` | bool | False | Whether to take a screenshot after crawling |
| `screenshot_wait_for` | float | None | Additional wait time before taking a screenshot |
| `screenshot_height_threshold` | int | 20000 | Threshold for page height to decide screenshot strategy |
| `pdf` | bool | False | Whether to generate a PDF of the page |
| `image_description_min_word_threshold` | int | 50 | Minimum words for image description extraction |
| `image_score_threshold` | int | 3 | Minimum score threshold for processing an image |
| `exclude_external_images` | bool | False | If True, exclude all external images from processing |
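
The two image filters, `image_score_threshold` and `exclude_external_images`, can be pictured as a simple predicate. This sketch makes several assumptions: images are plain dicts with hypothetical `src` and `score` keys, "external" means a different host than the page, and an image passes when its score is at least the threshold. The library's own scoring and data shapes may differ.

```python
from urllib.parse import urlparse


def keep_image(image, page_url, image_score_threshold=3, exclude_external_images=False):
    """Decide whether an image dict ({"src": ..., "score": ...}) should be processed."""
    # Assumed rule: scores below the threshold are dropped.
    if image["score"] < image_score_threshold:
        return False
    if exclude_external_images:
        src_host = urlparse(image["src"]).netloc
        # Relative URLs have no host and are treated as internal.
        if src_host and src_host != urlparse(page_url).netloc:
            return False
    return True


images = [
    {"src": "https://example.com/hero.png", "score": 5},
    {"src": "https://cdn.other.net/banner.png", "score": 5},
    {"src": "/icons/dot.png", "score": 1},
]
internal_only = [img["src"] for img in images
                 if keep_image(img, "https://example.com/post", exclude_external_images=True)]
```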

## Link and Domain Handling Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `exclude_social_media_domains` | list[str] | SOCIAL_MEDIA_DOMAINS | List of domains to exclude for social media links |
| `exclude_external_links` | bool | False | If True, exclude all external links from the results |
| `exclude_social_media_links` | bool | False | If True, exclude links pointing to social media domains |
| `exclude_domains` | list[str] | [] | List of specific domains to exclude from results |
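
How the four link options might combine is sketched below using only `urllib.parse`. The `filter_links` helper is hypothetical, the `SOCIAL_MEDIA_DOMAINS` list is a short stand-in for the library constant of the same name, and the host-matching rule (exact match or subdomain) is an assumption.

```python
from urllib.parse import urlparse

# Short stand-in for the library's SOCIAL_MEDIA_DOMAINS constant (contents assumed).
SOCIAL_MEDIA_DOMAINS = ["facebook.com", "twitter.com", "linkedin.com"]


def _matches(host, domains):
    """True if host equals one of the domains or is a subdomain of it."""
    return any(host == d or host.endswith("." + d) for d in domains)


def filter_links(links, page_url, exclude_external_links=False,
                 exclude_social_media_links=False,
                 exclude_social_media_domains=SOCIAL_MEDIA_DOMAINS,
                 exclude_domains=()):
    """Return the links that survive the configured exclusions."""
    page_host = urlparse(page_url).netloc.lower()
    kept = []
    for link in links:
        host = urlparse(link).netloc.lower()
        # Relative links have an empty host and count as internal.
        is_external = bool(host) and host != page_host
        if exclude_external_links and is_external:
            continue
        if exclude_social_media_links and _matches(host, exclude_social_media_domains):
            continue
        if _matches(host, exclude_domains):
            continue
        kept.append(link)
    return kept


links = [
    "https://example.com/about",
    "https://www.facebook.com/example",
    "https://ads.tracker.io/pixel",
    "/contact",
]
no_social = filter_links(links, "https://example.com", exclude_social_media_links=True)
```

In this sketch `exclude_domains` always applies, while the social media list only takes effect when `exclude_social_media_links` is True.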

## Debugging and Logging Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `verbose` | bool | True | Enable verbose logging |
| `log_console` | bool | False | If True, log console messages from the page |
