|
# Download Handling in Crawl4AI |
|
|
|
This guide explains how to use Crawl4AI to handle file downloads during crawling. You'll learn how to trigger downloads, specify download locations, and access downloaded files. |
|
|
|
## Enabling Downloads |
|
|
|
To enable downloads, set the `accept_downloads` parameter in the `BrowserConfig` object and pass it to the crawler. |
|
|
|
```python
import asyncio

from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import BrowserConfig


async def main():
    config = BrowserConfig(accept_downloads=True)  # Enable downloads globally
    async with AsyncWebCrawler(config=config) as crawler:
        result = await crawler.arun(url="https://example.com")
        # ... your crawling logic ...


asyncio.run(main())
```
|
|
|
Or, enable it for a specific crawl by using `CrawlerRunConfig`: |
|
|
|
```python
import asyncio

from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import CrawlerRunConfig


async def main():
    async with AsyncWebCrawler() as crawler:
        config = CrawlerRunConfig(accept_downloads=True)
        result = await crawler.arun(url="https://example.com", config=config)
        # ...


asyncio.run(main())
```
|
|
|
## Specifying Download Location |
|
|
|
Specify the download directory using the `downloads_path` attribute in the `BrowserConfig` object. If not provided, Crawl4AI defaults to creating a "downloads" directory inside the `.crawl4ai` folder in your home directory. |
|
|
|
```python
import asyncio
import os

from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import BrowserConfig

downloads_path = os.path.join(os.getcwd(), "my_downloads")  # Custom download path
os.makedirs(downloads_path, exist_ok=True)

config = BrowserConfig(accept_downloads=True, downloads_path=downloads_path)


async def main():
    async with AsyncWebCrawler(config=config) as crawler:
        result = await crawler.arun(url="https://example.com")
        # ...


asyncio.run(main())
```
|
|
|
## Triggering Downloads |
|
|
|
Downloads are typically triggered by user interactions on a web page, such as clicking a download button. Use `js_code` in `CrawlerRunConfig` to simulate these actions and `wait_for` to allow sufficient time for downloads to start. |
|
|
|
```python
from crawl4ai.async_configs import CrawlerRunConfig

config = CrawlerRunConfig(
    js_code="""
        const downloadLink = document.querySelector('a[href$=".exe"]');
        if (downloadLink) {
            downloadLink.click();
        }
    """,
    wait_for=5  # Wait 5 seconds for the download to start
)

# Inside an active AsyncWebCrawler context (see the examples above):
result = await crawler.arun(url="https://www.python.org/downloads/", config=config)
```
|
|
|
## Accessing Downloaded Files |
|
|
|
The `downloaded_files` attribute of the `CrawlResult` object contains paths to downloaded files. |
|
|
|
```python
import os

if result.downloaded_files:
    print("Downloaded files:")
    for file_path in result.downloaded_files:
        print(f"- {file_path}")
        file_size = os.path.getsize(file_path)
        print(f"- File size: {file_size} bytes")
else:
    print("No files downloaded.")
```
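
The reported paths point into the configured `downloads_path`, and the browser may assign its own file names. If you need the files in a predictable place, a small post-processing step can copy them out. This is a minimal sketch using only the standard library; `collect_downloads` is a hypothetical helper written for this guide, not part of the Crawl4AI API.

```python
import os
import shutil


def collect_downloads(result, target_dir: str) -> list:
    """Copy downloaded files into target_dir and return the new paths.

    Hypothetical helper: assumes `result` is a CrawlResult from a crawl
    with accept_downloads=True, as in the examples above.
    """
    os.makedirs(target_dir, exist_ok=True)
    collected = []
    for file_path in result.downloaded_files or []:
        destination = os.path.join(target_dir, os.path.basename(file_path))
        shutil.copy2(file_path, destination)  # copy2 preserves timestamps
        collected.append(destination)
    return collected
```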
|
|
|
## Example: Downloading Multiple Files |
|
|
|
```python
import asyncio
import os
from pathlib import Path

from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig


async def download_multiple_files(url: str, download_path: str):
    config = BrowserConfig(accept_downloads=True, downloads_path=download_path)
    async with AsyncWebCrawler(config=config) as crawler:
        run_config = CrawlerRunConfig(
            js_code="""
                const downloadLinks = document.querySelectorAll('a[download]');
                for (const link of downloadLinks) {
                    link.click();
                    await new Promise(r => setTimeout(r, 2000)); // Delay between clicks
                }
            """,
            wait_for=10  # Wait for all downloads to start
        )
        result = await crawler.arun(url=url, config=run_config)

        if result.downloaded_files:
            print("Downloaded files:")
            for file in result.downloaded_files:
                print(f"- {file}")
        else:
            print("No files downloaded.")


# Usage
download_path = os.path.join(Path.home(), ".crawl4ai", "downloads")
os.makedirs(download_path, exist_ok=True)

asyncio.run(download_multiple_files("https://www.python.org/downloads/windows/", download_path))
```
|
|
|
## Important Considerations |
|
|
|
- **Browser Context:** Downloads are managed within the browser context. Ensure `js_code` correctly targets the download triggers on the webpage.
- **Timing:** Use `wait_for` in `CrawlerRunConfig` to give downloads time to start before the crawl returns.
- **Error Handling:** Wrap crawl calls in `try`/`except` and check `downloaded_files` before touching the filesystem, so failed downloads and stale paths are handled gracefully (see the sketch below).
- **Security:** Scan downloaded files for potential security threats before use.
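
As a concrete illustration of the error-handling point, here is a minimal sketch that wraps a crawl in `try`/`except` and keeps only the reported files that actually exist on disk. `safe_crawl` is a hypothetical helper written for this guide, not part of the Crawl4AI API.

```python
import asyncio
import os

from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import BrowserConfig


async def safe_crawl(url: str) -> list:
    """Crawl with downloads enabled; return only files that exist on disk."""
    config = BrowserConfig(accept_downloads=True)
    try:
        async with AsyncWebCrawler(config=config) as crawler:
            result = await crawler.arun(url=url)
    except Exception as exc:
        print(f"Crawl failed for {url}: {exc}")
        return []
    # downloaded_files may be None, and a reported path can be stale if a
    # download was interrupted, so keep only files present on disk.
    return [p for p in (result.downloaded_files or []) if os.path.isfile(p)]


files = asyncio.run(safe_crawl("https://example.com"))
```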
|
|
|
All download behavior in Crawl4AI is configured through `BrowserConfig` and `CrawlerRunConfig`, consistent with the rest of the API.