### Session Management
Session management in Crawl4AI is a powerful feature that allows you to maintain state across multiple requests, making it particularly suitable for handling complex multi-step crawling tasks. It enables you to reuse the same browser tab (or page object) across sequential actions and crawls, which is beneficial for:
- **Performing JavaScript actions before and after crawling.**
- **Executing multiple sequential crawls faster** without needing to reopen tabs or allocate memory repeatedly.
**Note:** This feature is designed for sequential workflows and is not suitable for parallel operations.
---
#### Basic Session Usage
Use `BrowserConfig` and `CrawlerRunConfig` to maintain state with a `session_id`:
```python
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig

browser_config = BrowserConfig(headless=True)
async with AsyncWebCrawler(config=browser_config) as crawler:
    session_id = "my_session"

    # Define configurations that share the same session
    config1 = CrawlerRunConfig(url="https://example.com/page1", session_id=session_id)
    config2 = CrawlerRunConfig(url="https://example.com/page2", session_id=session_id)

    # First request
    result1 = await crawler.arun(config=config1)

    # Subsequent request reusing the same browser tab
    result2 = await crawler.arun(config=config2)

    # Clean up when done
    await crawler.crawler_strategy.kill_session(session_id)
```
---
#### Dynamic Content with Sessions
Here's an example of crawling GitHub commits across multiple pages while preserving session state:
```python
import json

from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
from crawl4ai.cache_context import CacheMode

async def crawl_dynamic_content():
    async with AsyncWebCrawler() as crawler:
        session_id = "github_commits_session"
        url = "https://github.com/microsoft/TypeScript/commits/main"
        all_commits = []

        # Define extraction schema
        schema = {
            "name": "Commit Extractor",
            "baseSelector": "li.Box-sc-g0xbh4-0",
            "fields": [{"name": "title", "selector": "h4.markdown-title", "type": "text"}],
        }
        extraction_strategy = JsonCssExtractionStrategy(schema)

        # JavaScript to click "next page" and a wait condition for commit items
        js_next_page = """document.querySelector('a[data-testid="pagination-next-button"]').click();"""
        wait_for = """() => document.querySelectorAll('li.Box-sc-g0xbh4-0').length > 0"""

        # Crawl multiple pages within the same session
        for page in range(3):
            config = CrawlerRunConfig(
                url=url,
                session_id=session_id,
                extraction_strategy=extraction_strategy,
                js_code=js_next_page if page > 0 else None,
                wait_for=wait_for if page > 0 else None,
                js_only=page > 0,  # After the first load, only run JS; don't re-navigate
                cache_mode=CacheMode.BYPASS
            )
            result = await crawler.arun(config=config)
            if result.success:
                commits = json.loads(result.extracted_content)
                all_commits.extend(commits)
                print(f"Page {page + 1}: Found {len(commits)} commits")

        # Clean up session
        await crawler.crawler_strategy.kill_session(session_id)
        return all_commits
```
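To run the example end to end, drive it from a standard asyncio entry point, for example:
```python
import asyncio

if __name__ == "__main__":
    commits = asyncio.run(crawl_dynamic_content())
    print(f"Total commits collected: {len(commits)}")
```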
---
#### Session Best Practices
1. **Descriptive Session IDs**:
   Use meaningful names for session IDs to organize workflows:
   ```python
   session_id = "login_flow_session"
   session_id = "product_catalog_session"
   ```
2. **Resource Management**:
   Always ensure sessions are cleaned up to free resources (a reusable cleanup helper is sketched after this list):
   ```python
   try:
       # Your crawling code here
       pass
   finally:
       await crawler.crawler_strategy.kill_session(session_id)
   ```
3. **State Maintenance**:
   Reuse the session for subsequent actions within the same workflow:
   ```python
   # Step 1: Log in
   login_config = CrawlerRunConfig(
       url="https://example.com/login",
       session_id=session_id,
       js_code="document.querySelector('form').submit();"
   )
   await crawler.arun(config=login_config)

   # Step 2: Verify login success
   dashboard_config = CrawlerRunConfig(
       url="https://example.com/dashboard",
       session_id=session_id,
       wait_for="css:.user-profile"  # Wait for authenticated content
   )
   result = await crawler.arun(config=dashboard_config)
   ```
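As referenced in practice 2, a small async context manager can make the try/finally cleanup reusable. This is a minimal sketch of a convenience wrapper, not part of the Crawl4AI API; it assumes only the `kill_session` call shown above:
```python
from contextlib import asynccontextmanager

@asynccontextmanager
async def managed_session(crawler, session_id):
    # Hypothetical helper: yields the session ID and guarantees cleanup,
    # even if the crawling code inside the block raises.
    try:
        yield session_id
    finally:
        await crawler.crawler_strategy.kill_session(session_id)

# Usage:
# async with managed_session(crawler, "login_flow_session") as sid:
#     await crawler.arun(config=CrawlerRunConfig(url="https://example.com/login", session_id=sid))
```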
---
#### Common Use Cases for Sessions
1. **Authentication Flows**: Login and interact with secured pages.
2. **Pagination Handling**: Navigate through multiple pages.
3. **Form Submissions**: Fill out forms, submit them, and process the results (see the sketch below).
4. **Multi-step Processes**: Complete workflows that span multiple actions.
5. **Dynamic Content Navigation**: Handle JavaScript-rendered or event-triggered content.
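For use case 3, a form-submission flow follows the same two-step pattern as the login example above. The sketch below uses placeholder URLs, selectors, and field values, and assumes it runs inside an active `async with AsyncWebCrawler() as crawler:` block:
```python
from crawl4ai.async_configs import CrawlerRunConfig

# Step 1: Fill in and submit the form (URL and selectors are hypothetical)
form_config = CrawlerRunConfig(
    url="https://example.com/search",
    session_id="form_session",
    js_code="""
        document.querySelector('#query').value = 'crawl4ai';
        document.querySelector('form').submit();
    """
)
await crawler.arun(config=form_config)

# Step 2: Process the results page in the same session
results_config = CrawlerRunConfig(
    url="https://example.com/results",
    session_id="form_session",
    wait_for="css:.result-item"  # Hypothetical selector for result entries
)
result = await crawler.arun(config=results_config)
```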