Crawl4AI Quick Start Guide: Your All-in-One AI-Ready Web Crawling & AI Integration Solution
Crawl4AI, the #1 trending GitHub repository, streamlines web content extraction into AI-ready formats. Perfect for AI assistants, semantic search engines, or data pipelines, Crawl4AI transforms raw HTML into structured Markdown or JSON effortlessly. Integrate with LLMs, open-source models, or your own retrieval-augmented generation workflows.
What Crawl4AI is not:
Crawl4AI is not a replacement for traditional web scraping libraries, Selenium, or Playwright. It's not designed as a general-purpose web automation tool. Instead, Crawl4AI has a specific, focused goal:
- To generate perfect, AI-friendly data (particularly for LLMs) from web content
- To maximize speed and efficiency in data extraction and processing
- To operate at scale, from Raspberry Pi to cloud infrastructures
Crawl4AI is engineered with a "scale-first" mindset, aiming to handle millions of links while maintaining exceptional performance. It's super efficient and fast, optimized to:
- Transform raw web content into structured, LLM-ready formats (Markdown/JSON)
- Implement intelligent extraction strategies to reduce reliance on costly API calls
- Provide a streamlined pipeline for AI data preparation and ingestion
In essence, Crawl4AI bridges the gap between web content and AI systems, focusing on delivering high-quality, processed data rather than offering broad web automation capabilities.
Key Links:
- Website: https://crawl4ai.com
- GitHub: https://github.com/unclecode/crawl4ai
- Colab Notebook: Try on Google Colab
- Quickstart Code Example: quickstart_async.config.py
- Examples Folder: Crawl4AI Examples
Table of Contents
- Crawl4AI Quick Start Guide: Your All-in-One AI-Ready Web Crawling & AI Integration Solution
- Table of Contents
- 1. Introduction & Key Concepts
- 2. Installation & Environment Setup
- 3. Core Concepts & Configuration
- 4. Basic Crawling & Simple Extraction
- 5. Markdown Generation & AI-Optimized Output
- 6. Structured Data Extraction (CSS, XPath, LLM)
- 7. Advanced Extraction: LLM & Open-Source Models
- 8. Page Interactions, JS Execution, & Dynamic Content
- 9. Media, Links, & Metadata Handling
- 10. Authentication & Identity Preservation
- 11. Proxy & Security Enhancements
- 12. Screenshots, PDFs & File Downloads
- 13. Caching & Performance Optimization
- 14. Hooks for Custom Logic
- 15. Dockerization & Scaling
- 16. Troubleshooting & Common Pitfalls
- 17. Comprehensive End-to-End Example
- 18. Further Resources & Community
1. Introduction & Key Concepts
Crawl4AI transforms websites into structured, AI-friendly data. It efficiently handles large-scale crawling, integrates with both proprietary and open-source LLMs, and optimizes content for semantic search or RAG pipelines.
Quick Test:
```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def test_run():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com")
        print(result.markdown)

asyncio.run(test_run())
```
If you see Markdown output, everything is working!
More info: See /docs/introduction or 1_introduction.ex.md
2. Installation & Environment Setup
```bash
# Install the package
pip install crawl4ai
crawl4ai-setup

# Install Playwright with system dependencies (recommended)
playwright install --with-deps           # Installs all browsers

# Or install specific browsers:
playwright install --with-deps chrome    # Recommended for Colab/Linux
playwright install --with-deps firefox
playwright install --with-deps webkit
playwright install --with-deps chromium

# Keep Playwright updated periodically
playwright install
```
Note: For Google Colab and some Linux environments, use `chrome` instead of `chromium`; it tends to work more reliably.
Test Your Installation
Try these one-liners:
```bash
# Visible browser test
python -c "from playwright.sync_api import sync_playwright; p = sync_playwright().start(); browser = p.chromium.launch(headless=False); page = browser.new_page(); page.goto('https://example.com'); input('Press Enter to close...')"

# Headless test (for servers/CI)
python -c "from playwright.sync_api import sync_playwright; p = sync_playwright().start(); browser = p.chromium.launch(headless=True); page = browser.new_page(); page.goto('https://example.com'); print(f'Title: {page.title()}'); browser.close()"
```
You should see a browser window (in the visible test) loading example.com. If you get errors, try Firefox with `playwright install --with-deps firefox`.
Try in Colab:
Open Colab Notebook
More info: See /docs/configuration or 2_configuration.md
3. Core Concepts & Configuration
Use `AsyncWebCrawler`, `CrawlerRunConfig`, and `BrowserConfig` to control crawling.
Example config:
```python
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig

browser_config = BrowserConfig(
    headless=True,
    verbose=True,
    viewport_width=1080,
    viewport_height=600,
    text_mode=False,
    ignore_https_errors=True,
    java_script_enabled=True
)

run_config = CrawlerRunConfig(
    css_selector="article.main",
    word_count_threshold=50,
    excluded_tags=['nav', 'footer'],
    exclude_external_links=True,
    wait_for="css:.article-loaded",
    page_timeout=60000,
    delay_before_return_html=1.0,
    mean_delay=0.1,
    max_range=0.3,
    process_iframes=True,
    remove_overlay_elements=True,
    js_code="""
    (async () => {
        window.scrollTo(0, document.body.scrollHeight);
        await new Promise(r => setTimeout(r, 2000));
        document.querySelector('.load-more')?.click();
    })();
    """
)

# Cache modes: ENABLED, DISABLED, BYPASS, READ_ONLY, WRITE_ONLY
# run_config.cache_mode = CacheMode.ENABLED
```
URL prefixes:
- `http://` or `https://` for live pages
- `file://local.html` for a local file
- `raw:<html>` for raw HTML strings
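For example, the `raw:` prefix lets `arun()` process an HTML string you already have in memory; a minimal sketch (the inline markup below is just a placeholder):

```python
import asyncio

from crawl4ai import AsyncWebCrawler

async def crawl_raw_html():
    # Placeholder HTML; any markup you already hold in memory works the same way
    raw_page = "raw:<html><body><h1>Hello</h1><p>Inline content.</p></body></html>"
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(raw_page)
        print(result.markdown)

asyncio.run(crawl_raw_html())
```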
More info: See /docs/async_webcrawler or 3_async_webcrawler.ex.md
4. Basic Crawling & Simple Extraction
```python
async with AsyncWebCrawler(config=browser_config) as crawler:
    result = await crawler.arun("https://news.example.com/article", config=run_config)
    print(result.markdown)  # Basic markdown content
```
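Before using the output, it's worth checking whether the crawl succeeded; a minimal sketch, assuming the result object exposes `success`, `status_code`, and `error_message` fields:

```python
async with AsyncWebCrawler(config=browser_config) as crawler:
    result = await crawler.arun("https://news.example.com/article", config=run_config)

    if result.success:
        print("Status:", result.status_code)
        print(result.markdown[:500])            # First 500 characters of the markdown
    else:
        print("Crawl failed:", result.error_message)
```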
More info: See /docs/browser_context_page or 4_browser_context_page.ex.md
5. Markdown Generation & AI-Optimized Output
After crawling, `result.markdown_v2` provides:
- `raw_markdown`: Unfiltered markdown
- `markdown_with_citations`: Links as references at the bottom
- `references_markdown`: A separate list of reference links
- `fit_markdown`: Filtered, relevant markdown (e.g., after BM25)
- `fit_html`: The HTML used to produce `fit_markdown`
Example:
print("RAW:", result.markdown_v2.raw_markdown[:200])
print("CITED:", result.markdown_v2.markdown_with_citations[:200])
print("REFERENCES:", result.markdown_v2.references_markdown)
print("FIT MARKDOWN:", result.markdown_v2.fit_markdown)
For AI training, `fit_markdown` focuses on the most relevant content.
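A common next step is persisting the filtered markdown for a RAG or fine-tuning corpus; a minimal sketch that continues from the crawl above (the output path is just an example, and it falls back to `raw_markdown` when no filter produced `fit_markdown`):

```python
from pathlib import Path

md = result.markdown_v2
output = Path("corpus/example_article.md")           # Example output location
output.parent.mkdir(parents=True, exist_ok=True)
output.write_text(md.fit_markdown or md.raw_markdown, encoding="utf-8")
print(f"Saved {output} ({output.stat().st_size} bytes)")
```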
More info: See /docs/markdown_generation or 5_markdown_generation.ex.md
6. Structured Data Extraction (CSS, XPath, LLM)
Extract JSON data without LLMs:
CSS:
```python
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

schema = {
    "name": "Products",
    "baseSelector": ".product",
    "fields": [
        {"name": "title", "selector": "h2", "type": "text"},
        {"name": "price", "selector": ".price", "type": "text"}
    ]
}
run_config.extraction_strategy = JsonCssExtractionStrategy(schema)
```
XPath:
```python
from crawl4ai.extraction_strategy import JsonXPathExtractionStrategy

xpath_schema = {
    "name": "Articles",
    "baseSelector": "//div[@class='article']",
    "fields": [
        {"name": "headline", "selector": ".//h1", "type": "text"},
        {"name": "summary", "selector": ".//p[@class='summary']", "type": "text"}
    ]
}
run_config.extraction_strategy = JsonXPathExtractionStrategy(xpath_schema)
```
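With either strategy attached, the matched records come back as a JSON string in `result.extracted_content`; a minimal sketch that reuses the `browser_config`/`run_config` from above (the catalog URL is a placeholder):

```python
import asyncio
import json

from crawl4ai import AsyncWebCrawler

async def extract_products():
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun("https://shop.example.com/catalog", config=run_config)
        if result.success and result.extracted_content:
            # extracted_content is a JSON string of records shaped like the schema
            for product in json.loads(result.extracted_content)[:5]:
                print(product.get("title"), "-", product.get("price"))

asyncio.run(extract_products())
```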
More info: See /docs/extraction_strategies or 7_extraction_strategies.ex.md
7. Advanced Extraction: LLM & Open-Source Models
Use LLMExtractionStrategy for complex tasks. Works with OpenAI or open-source models (e.g., Ollama).
```python
from pydantic import BaseModel
from crawl4ai.extraction_strategy import LLMExtractionStrategy

class TravelData(BaseModel):
    destination: str
    attractions: list

run_config.extraction_strategy = LLMExtractionStrategy(
    provider="ollama/nemotron",
    schema=TravelData.schema(),
    instruction="Extract destination and top attractions."
)
```
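The extracted JSON can be validated back into the Pydantic model; a hedged sketch, assuming `result.extracted_content` comes back as a JSON string (typically a list of objects matching the schema) and using a placeholder travel-blog URL:

```python
import asyncio
import json

from crawl4ai import AsyncWebCrawler

async def extract_travel_data():
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun("https://travel.example.com/kyoto", config=run_config)
        if result.success and result.extracted_content:
            payload = json.loads(result.extracted_content)
            items = payload if isinstance(payload, list) else [payload]
            # Re-validate each record against the schema the LLM was asked to follow
            for item in items:
                data = TravelData(**item)
                print(data.destination, "->", data.attractions)

asyncio.run(extract_travel_data())
```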
More info: See /docs/extraction_strategies or 7_extraction_strategies.ex.md
8. Page Interactions, JS Execution, & Dynamic Content
Insert `js_code` and use `wait_for` to ensure content loads. Example:

```python
run_config.js_code = """
(async () => {
    document.querySelector('.load-more')?.click();
    await new Promise(r => setTimeout(r, 2000));
})();
"""
run_config.wait_for = "css:.item-loaded"
```
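`wait_for` also accepts a JavaScript predicate via a `js:` prefix, mirroring the `css:` form above; a minimal sketch with a hypothetical `.item` selector and count:

```python
# Wait until the page has rendered at least 10 items (hypothetical selector/count)
run_config.wait_for = "js:() => document.querySelectorAll('.item').length >= 10"

# Give lazy-loaded content a moment to settle before the HTML snapshot is taken
run_config.delay_before_return_html = 1.0
```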
More info: See /docs/page_interaction or 11_page_interaction.md
9. Media, Links, & Metadata Handling
result.media["images"]
: List of images with src
, score
, alt
. Score indicates relevance.
result.media["videos"]
, result.media["audios"]
similarly hold media info.
result.links["internal"]
, result.links["external"]
, result.links["social"]
: Categorized links. Each link has href
, text
, context
, type
.
result.metadata
: Title, description, keywords, author.
Example:
```python
# Images
for img in result.media["images"]:
    print("Image:", img["src"], "Score:", img["score"], "Alt:", img.get("alt", "N/A"))

# Links
for link in result.links["external"]:
    print("External Link:", link["href"], "Text:", link["text"])

# Metadata
print("Page Title:", result.metadata["title"])
print("Description:", result.metadata["description"])
```
More info: See /docs/content_selection or 8_content_selection.ex.md
10. Authentication & Identity Preservation
Manual Setup via User Data Directory
Open Chrome with a custom user data dir:

```bash
"C:\Program Files\Google\Chrome\Application\chrome.exe" --user-data-dir="C:\MyChromeProfile"
```

On macOS:

```bash
"/Applications/Google Chrome.app/Contents/MacOS/Google Chrome" --user-data-dir="/Users/username/ChromeProfiles/MyProfile"
```

Log in to sites, solve CAPTCHAs, and adjust settings manually. The browser saves cookies/localStorage in that directory. Then point `user_data_dir` in `BrowserConfig` at the same profile:

```python
browser_config = BrowserConfig(
    headless=True,
    user_data_dir="/Users/username/ChromeProfiles/MyProfile"
)
```
Now the crawler starts with those cookies, sessions, etc.
Using `storage_state`
Alternatively, export and reuse storage states:

```python
browser_config = BrowserConfig(
    headless=True,
    storage_state="mystate.json"  # Pre-saved state
)
```
No repeated logins needed.
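If you don't have a state file yet, one way to create it is with plain Playwright: open a visible browser, log in by hand, and export the context's storage state (the login URL and file name below are placeholders):

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    context = browser.new_context()
    page = context.new_page()
    page.goto("https://example.com/login")        # Placeholder login URL
    input("Finish logging in, then press Enter...")
    context.storage_state(path="mystate.json")    # Reusable via BrowserConfig(storage_state=...)
    browser.close()
```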
More info: See /docs/storage_state or 16_storage_state.md
11. Proxy & Security Enhancements
Use `proxy_config` for authenticated proxies:

```python
browser_config.proxy_config = {
    "server": "http://proxy.example.com:8080",
    "username": "proxyuser",
    "password": "proxypass"
}
```

Combine with `headers` or `ignore_https_errors` as needed.
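A combined sketch, assuming `proxy_config`, `headers`, and `ignore_https_errors` can all be passed to the `BrowserConfig` constructor as the section above suggests (the proxy endpoint, credentials, and header value are placeholders):

```python
from crawl4ai.async_configs import BrowserConfig

browser_config = BrowserConfig(
    headless=True,
    ignore_https_errors=True,                       # Tolerate self-signed certs behind the proxy
    proxy_config={
        "server": "http://proxy.example.com:8080",  # Placeholder proxy endpoint
        "username": "proxyuser",
        "password": "proxypass",
    },
    headers={"Accept-Language": "en-US,en;q=0.9"},  # Example extra request header
)
```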
More info: See /docs/proxy_security or 14_proxy_security.md
12. Screenshots, PDFs & File Downloads
Enable `screenshot=True` or `pdf=True` in `CrawlerRunConfig`:

```python
run_config.screenshot = True
run_config.pdf = True
```
After crawling:
```python
if result.screenshot:
    with open("page.png", "wb") as f:
        f.write(result.screenshot)

if result.pdf:
    with open("page.pdf", "wb") as f:
        f.write(result.pdf)
```
File Downloads:
```python
browser_config.accept_downloads = True
browser_config.downloads_path = "./downloads"
run_config.js_code = """document.querySelector('a.download')?.click();"""

# After the crawl:
print("Downloaded files:", result.downloaded_files)
```
More info: See /docs/screenshot_and_pdf_export or 15_screenshot_and_pdf_export.md
Also 10_file_download.md
13. Caching & Performance Optimization
Set `cache_mode` to reuse fetch results:

```python
from crawl4ai import CacheMode

run_config.cache_mode = CacheMode.ENABLED
```

Adjust delays, increase concurrency, or use `text_mode=True` for faster extraction.
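For example, a speed-oriented run might skip the cache for a fresh fetch while using a lighter, text-only browser; a sketch, with `text_mode` set on `BrowserConfig` as in Section 3:

```python
from crawl4ai import CacheMode
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig

# Lighter browser: text mode skips most rendering work (see Section 3)
fast_browser_config = BrowserConfig(headless=True, text_mode=True)

fast_run_config = CrawlerRunConfig(page_timeout=30000)
fast_run_config.cache_mode = CacheMode.BYPASS   # Fresh fetch that skips any cached copy
```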
More info: See /docs/cache_modes or 9_cache_modes.md
14. Hooks for Custom Logic
Hooks let you run code at specific lifecycle events without creating pages manually in `on_browser_created`. Use `on_page_context_created` to apply routing or modify page contexts before the URL is crawled:
Example Hook:
```python
async def on_page_context_created_hook(context, page, **kwargs):
    # Block all images to speed up load
    await context.route("**/*.{png,jpg,jpeg}", lambda route: route.abort())
    print("[HOOK] Image requests blocked")

async with AsyncWebCrawler(config=browser_config) as crawler:
    crawler.crawler_strategy.set_hook("on_page_context_created", on_page_context_created_hook)
    result = await crawler.arun("https://imageheavy.example.com", config=run_config)
    print("Crawl finished with images blocked.")
```
This hook is clean and doesn’t create a separate page itself—it just modifies the current context/page setup.
More info: See /docs/hooks_auth or 13_hooks_auth.md
15. Dockerization & Scaling
Use Docker images:
```bash
# AMD64 basic
docker pull unclecode/crawl4ai:basic-amd64
docker run -p 11235:11235 unclecode/crawl4ai:basic-amd64

# ARM64 for M1/M2
docker pull unclecode/crawl4ai:basic-arm64
docker run -p 11235:11235 unclecode/crawl4ai:basic-arm64

# GPU support
docker pull unclecode/crawl4ai:gpu-amd64
docker run --gpus all -p 11235:11235 unclecode/crawl4ai:gpu-amd64
```
Scale with load balancers or Kubernetes.
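Once a container is listening on port 11235 you can drive it over HTTP. The exact REST API is documented in the project README; the snippet below is only a hypothetical sketch assuming a `/crawl` endpoint that accepts a JSON list of URLs, so adjust the path and payload to match the documented API:

```python
import requests

# Hypothetical request shape; confirm the real endpoint and payload in the Crawl4AI README
response = requests.post(
    "http://localhost:11235/crawl",
    json={"urls": ["https://example.com"]},
    timeout=60,
)
response.raise_for_status()
print(response.json())
```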
More info: See /docs/proxy_security (for proxy) or relevant Docker instructions in README
16. Troubleshooting & Common Pitfalls
- Empty results? Relax filters, check selectors.
- Timeouts? Increase `page_timeout` or refine `wait_for`.
- CAPTCHAs? Use `user_data_dir` or `storage_state` after manual solving.
- JS errors? Try headful mode for debugging (see the sketch below).
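For the headful-debugging tip above, a minimal sketch: rerun the failing crawl with a visible, verbose browser so you can watch the page and its console while it loads:

```python
from crawl4ai.async_configs import BrowserConfig

# Visible, verbose browser for step-by-step debugging of JS-heavy pages
debug_browser_config = BrowserConfig(headless=False, verbose=True)
```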
Check examples & quickstart_async.config.py for more code.
17. Comprehensive End-to-End Example
Combine hooks, JS execution, PDF saving, LLM extraction—see quickstart_async.config.py for a full example.
18. Further Resources & Community
- Docs: https://crawl4ai.com
- Issues & PRs: https://github.com/unclecode/crawl4ai/issues
Follow @unclecode for news & community updates.
Happy Crawling!
Leverage Crawl4AI to feed your AI models with clean, structured web data today.