Crawl4AI Quick Start Guide: Your All-in-One AI-Ready Web Crawling & AI Integration Solution
Crawl4AI, the #1 trending GitHub repository, streamlines web content extraction into AI-ready formats. Perfect for AI assistants, semantic search engines, or data pipelines, Crawl4AI transforms raw HTML into structured Markdown or JSON effortlessly. Integrate with LLMs, open-source models, or your own retrieval-augmented generation workflows.
What Crawl4AI is not:
Crawl4AI is not a replacement for traditional web scraping libraries, Selenium, or Playwright. It's not designed as a general-purpose web automation tool. Instead, Crawl4AI has a specific, focused goal:
- To generate perfect, AI-friendly data (particularly for LLMs) from web content
- To maximize speed and efficiency in data extraction and processing
- To operate at scale, from Raspberry Pi to cloud infrastructures
Crawl4AI is engineered with a "scale-first" mindset, aiming to handle millions of links while maintaining exceptional performance. It's super efficient and fast, optimized to:
- Transform raw web content into structured, LLM-ready formats (Markdown/JSON)
- Implement intelligent extraction strategies to reduce reliance on costly API calls
- Provide a streamlined pipeline for AI data preparation and ingestion
In essence, Crawl4AI bridges the gap between web content and AI systems, focusing on delivering high-quality, processed data rather than offering broad web automation capabilities.
Key Links:
- Website: https://crawl4ai.com
- GitHub: https://github.com/unclecode/crawl4ai
- Colab Notebook: Try on Google Colab
- Quickstart Code Example: quickstart_async.config.py
- Examples Folder: Crawl4AI Examples
Table of Contents
- Crawl4AI Quick Start Guide: Your All-in-One AI-Ready Web Crawling & AI Integration Solution
- Table of Contents
- 1. Introduction & Key Concepts
- 2. Installation & Environment Setup
- 3. Core Concepts & Configuration
- 4. Basic Crawling & Simple Extraction
- 5. Markdown Generation & AI-Optimized Output
- 6. Structured Data Extraction (CSS, XPath, LLM)
- 7. Advanced Extraction: LLM & Open-Source Models
- 8. Page Interactions, JS Execution, & Dynamic Content
- 9. Media, Links, & Metadata Handling
- 10. Authentication & Identity Preservation
- 11. Proxy & Security Enhancements
- 12. Screenshots, PDFs & File Downloads
- 13. Caching & Performance Optimization
- 14. Hooks for Custom Logic
- 15. Dockerization & Scaling
- 16. Troubleshooting & Common Pitfalls
- 17. Comprehensive End-to-End Example
- 18. Further Resources & Community
1. Introduction & Key Concepts
Crawl4AI transforms websites into structured, AI-friendly data. It efficiently handles large-scale crawling, integrates with both proprietary and open-source LLMs, and optimizes content for semantic search or RAG pipelines.
Quick Test:
```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def test_run():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com")
        print(result.markdown)

asyncio.run(test_run())
```
If you see Markdown output, everything is working!
More info: See /docs/introduction or 1_introduction.ex.md
2. Installation & Environment Setup
```bash
# Install the package
pip install crawl4ai
crawl4ai-setup

# Install Playwright with system dependencies (recommended)
playwright install --with-deps           # Installs all browsers

# Or install specific browsers:
playwright install --with-deps chrome    # Recommended for Colab/Linux
playwright install --with-deps firefox
playwright install --with-deps webkit
playwright install --with-deps chromium

# Keep Playwright updated periodically
playwright install
```
Note: For Google Colab and some Linux environments, use `chrome` instead of `chromium`; it tends to work more reliably.
Test Your Installation
Try these one-liners:
```bash
# Visible browser test
python -c "from playwright.sync_api import sync_playwright; p = sync_playwright().start(); browser = p.chromium.launch(headless=False); page = browser.new_page(); page.goto('https://example.com'); input('Press Enter to close...')"

# Headless test (for servers/CI)
python -c "from playwright.sync_api import sync_playwright; p = sync_playwright().start(); browser = p.chromium.launch(headless=True); page = browser.new_page(); page.goto('https://example.com'); print(f'Title: {page.title()}'); browser.close()"
```
You should see a browser window (in the visible test) loading example.com. If you get errors, try Firefox with `playwright install --with-deps firefox`.
Try in Colab:
Open Colab Notebook
More info: See /docs/configuration or 2_configuration.md
3. Core Concepts & Configuration
Use `AsyncWebCrawler`, `CrawlerRunConfig`, and `BrowserConfig` to control crawling.
Example config:
```python
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig

browser_config = BrowserConfig(
    headless=True,
    verbose=True,
    viewport_width=1080,
    viewport_height=600,
    text_mode=False,
    ignore_https_errors=True,
    java_script_enabled=True
)

run_config = CrawlerRunConfig(
    css_selector="article.main",
    word_count_threshold=50,
    excluded_tags=['nav', 'footer'],
    exclude_external_links=True,
    wait_for="css:.article-loaded",
    page_timeout=60000,
    delay_before_return_html=1.0,
    mean_delay=0.1,
    max_range=0.3,
    process_iframes=True,
    remove_overlay_elements=True,
    js_code="""
    (async () => {
        window.scrollTo(0, document.body.scrollHeight);
        await new Promise(r => setTimeout(r, 2000));
        document.querySelector('.load-more')?.click();
    })();
    """
)

# Cache modes: ENABLED, DISABLED, BYPASS, READ_ONLY, WRITE_ONLY
# run_config.cache_mode = CacheMode.ENABLED
```
URL prefixes:
- `http://` or `https://` for live pages
- `file://local.html` for a local file
- `raw:<html>` for raw HTML strings
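For example, the `raw:` prefix lets `arun()` process an HTML string you already have in memory; a minimal sketch (the inline markup below is just a placeholder):

```python
import asyncio

from crawl4ai import AsyncWebCrawler

async def crawl_raw_html():
    # Placeholder HTML; any markup you already hold in memory works the same way
    raw_page = "raw:<html><body><h1>Hello</h1><p>Inline content.</p></body></html>"
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(raw_page)
        print(result.markdown)

asyncio.run(crawl_raw_html())
```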
More info: See /docs/async_webcrawler or 3_async_webcrawler.ex.md
4. Basic Crawling & Simple Extraction
```python
async with AsyncWebCrawler(config=browser_config) as crawler:
    result = await crawler.arun("https://news.example.com/article", config=run_config)
    print(result.markdown)  # Basic markdown content
```
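Before using the output, it's worth checking whether the crawl succeeded; a minimal sketch, assuming the result object exposes `success`, `status_code`, and `error_message` fields:

```python
async with AsyncWebCrawler(config=browser_config) as crawler:
    result = await crawler.arun("https://news.example.com/article", config=run_config)

    if result.success:
        print("Status:", result.status_code)
        print(result.markdown[:500])            # First 500 characters of the markdown
    else:
        print("Crawl failed:", result.error_message)
```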
More info: See /docs/browser_context_page or 4_browser_context_page.ex.md
5. Markdown Generation & AI-Optimized Output
After crawling, `result.markdown_v2` provides:
- `raw_markdown`: Unfiltered markdown
- `markdown_with_citations`: Links as references at the bottom
- `references_markdown`: A separate list of reference links
- `fit_markdown`: Filtered, relevant markdown (e.g., after BM25)
- `fit_html`: The HTML used to produce `fit_markdown`
Example:
print("RAW:", result.markdown_v2.raw_markdown[:200])
print("CITED:", result.markdown_v2.markdown_with_citations[:200])
print("REFERENCES:", result.markdown_v2.references_markdown)
print("FIT MARKDOWN:", result.markdown_v2.fit_markdown)
For AI training, `fit_markdown` focuses on the most relevant content.
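A common next step is persisting the filtered markdown for a RAG or fine-tuning corpus; a minimal sketch that continues from the crawl above (the output path is just an example, and it falls back to `raw_markdown` when no filter produced `fit_markdown`):

```python
from pathlib import Path

md = result.markdown_v2
output = Path("corpus/example_article.md")           # Example output location
output.parent.mkdir(parents=True, exist_ok=True)
output.write_text(md.fit_markdown or md.raw_markdown, encoding="utf-8")
print(f"Saved {output} ({output.stat().st_size} bytes)")
```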
More info: See /docs/markdown_generation or 5_markdown_generation.ex.md
6. Structured Data Extraction (CSS, XPath, LLM)
Extract JSON data without LLMs:
CSS:
```python
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

schema = {
    "name": "Products",
    "baseSelector": ".product",
    "fields": [
        {"name": "title", "selector": "h2", "type": "text"},
        {"name": "price", "selector": ".price", "type": "text"}
    ]
}
run_config.extraction_strategy = JsonCssExtractionStrategy(schema)
```
XPath:
```python
from crawl4ai.extraction_strategy import JsonXPathExtractionStrategy

xpath_schema = {
    "name": "Articles",
    "baseSelector": "//div[@class='article']",
    "fields": [
        {"name": "headline", "selector": ".//h1", "type": "text"},
        {"name": "summary", "selector": ".//p[@class='summary']", "type": "text"}
    ]
}
run_config.extraction_strategy = JsonXPathExtractionStrategy(xpath_schema)
```
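With either strategy attached, the matched records come back as a JSON string in `result.extracted_content`; a minimal sketch that reuses the `browser_config`/`run_config` from above (the catalog URL is a placeholder):

```python
import asyncio
import json

from crawl4ai import AsyncWebCrawler

async def extract_products():
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun("https://shop.example.com/catalog", config=run_config)
        if result.success and result.extracted_content:
            # extracted_content is a JSON string of records shaped like the schema
            for product in json.loads(result.extracted_content)[:5]:
                print(product.get("title"), "-", product.get("price"))

asyncio.run(extract_products())
```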
More info: See /docs/extraction_strategies or 7_extraction_strategies.ex.md
7. Advanced Extraction: LLM & Open-Source Models
Use LLMExtractionStrategy for complex tasks. Works with OpenAI or open-source models (e.g., Ollama).
```python
from pydantic import BaseModel
from crawl4ai.extraction_strategy import LLMExtractionStrategy

class TravelData(BaseModel):
    destination: str
    attractions: list

run_config.extraction_strategy = LLMExtractionStrategy(
    provider="ollama/nemotron",
    schema=TravelData.schema(),
    instruction="Extract destination and top attractions."
)
```
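The extracted JSON can be validated back into the Pydantic model; a hedged sketch, assuming `result.extracted_content` comes back as a JSON string (typically a list of objects matching the schema) and using a placeholder travel-blog URL:

```python
import asyncio
import json

from crawl4ai import AsyncWebCrawler

async def extract_travel_data():
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun("https://travel.example.com/kyoto", config=run_config)
        if result.success and result.extracted_content:
            payload = json.loads(result.extracted_content)
            items = payload if isinstance(payload, list) else [payload]
            # Re-validate each record against the schema the LLM was asked to follow
            for item in items:
                data = TravelData(**item)
                print(data.destination, "->", data.attractions)

asyncio.run(extract_travel_data())
```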
More info: See /docs/extraction_strategies or 7_extraction_strategies.ex.md
8. Page Interactions, JS Execution, & Dynamic Content
Insert `js_code` and use `wait_for` to ensure content loads. Example:

```python
run_config.js_code = """
(async () => {
    document.querySelector('.load-more')?.click();
    await new Promise(r => setTimeout(r, 2000));
})();
"""
run_config.wait_for = "css:.item-loaded"
```
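`wait_for` also accepts a JavaScript predicate via a `js:` prefix, mirroring the `css:` form above; a minimal sketch with a hypothetical `.item` selector and count:

```python
# Wait until the page has rendered at least 10 items (hypothetical selector/count)
run_config.wait_for = "js:() => document.querySelectorAll('.item').length >= 10"

# Give lazy-loaded content a moment to settle before the HTML snapshot is taken
run_config.delay_before_return_html = 1.0
```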
More info: See /docs/page_interaction or 11_page_interaction.md
9. Media, Links, & Metadata Handling
result.media["images"]
: List of images with src
, score
, alt
. Score indicates relevance.
result.media["videos"]
, result.media["audios"]
similarly hold media info.
result.links["internal"]
, result.links["external"]
, result.links["social"]
: Categorized links. Each link has href
, text
, context
, type
.
result.metadata
: Title, description, keywords, author.
Example:
```python
# Images
for img in result.media["images"]:
    print("Image:", img["src"], "Score:", img["score"], "Alt:", img.get("alt", "N/A"))

# Links
for link in result.links["external"]:
    print("External Link:", link["href"], "Text:", link["text"])

# Metadata
print("Page Title:", result.metadata["title"])
print("Description:", result.metadata["description"])
```
More info: See /docs/content_selection or 8_content_selection.ex.md
10. Authentication & Identity Preservation
Manual Setup via User Data Directory
Open Chrome with a custom user data dir:

```bash
"C:\Program Files\Google\Chrome\Application\chrome.exe" --user-data-dir="C:\MyChromeProfile"
```

On macOS:

```bash
"/Applications/Google Chrome.app/Contents/MacOS/Google Chrome" --user-data-dir="/Users/username/ChromeProfiles/MyProfile"
```

Log in to sites, solve CAPTCHAs, and adjust settings manually. The browser saves cookies/localStorage in that directory. Then point `user_data_dir` in `BrowserConfig` at the same profile:

```python
browser_config = BrowserConfig(
    headless=True,
    user_data_dir="/Users/username/ChromeProfiles/MyProfile"
)
```
Now the crawler starts with those cookies, sessions, etc.
Using `storage_state`
Alternatively, export and reuse storage states:

```python
browser_config = BrowserConfig(
    headless=True,
    storage_state="mystate.json"  # Pre-saved state
)
```
No repeated logins needed.
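If you don't have a state file yet, one way to create it is with plain Playwright: open a visible browser, log in by hand, and export the context's storage state (the login URL and file name below are placeholders):

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    context = browser.new_context()
    page = context.new_page()
    page.goto("https://example.com/login")        # Placeholder login URL
    input("Finish logging in, then press Enter...")
    context.storage_state(path="mystate.json")    # Reusable via BrowserConfig(storage_state=...)
    browser.close()
```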
More info: See /docs/storage_state or 16_storage_state.md
11. Proxy & Security Enhancements
Use `proxy_config` for authenticated proxies:

```python
browser_config.proxy_config = {
    "server": "http://proxy.example.com:8080",
    "username": "proxyuser",
    "password": "proxypass"
}
```

Combine with `headers` or `ignore_https_errors` as needed.
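A combined sketch, assuming `proxy_config`, `headers`, and `ignore_https_errors` can all be passed to the `BrowserConfig` constructor as the section above suggests (the proxy endpoint, credentials, and header value are placeholders):

```python
from crawl4ai.async_configs import BrowserConfig

browser_config = BrowserConfig(
    headless=True,
    ignore_https_errors=True,                       # Tolerate self-signed certs behind the proxy
    proxy_config={
        "server": "http://proxy.example.com:8080",  # Placeholder proxy endpoint
        "username": "proxyuser",
        "password": "proxypass",
    },
    headers={"Accept-Language": "en-US,en;q=0.9"},  # Example extra request header
)
```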
More info: See /docs/proxy_security or 14_proxy_security.md
12. Screenshots, PDFs & File Downloads
Enable `screenshot=True` or `pdf=True` in `CrawlerRunConfig`:

```python
run_config.screenshot = True
run_config.pdf = True
```
After crawling:
```python
if result.screenshot:
    with open("page.png", "wb") as f:
        f.write(result.screenshot)

if result.pdf:
    with open("page.pdf", "wb") as f:
        f.write(result.pdf)
```
File Downloads:
```python
browser_config.accept_downloads = True
browser_config.downloads_path = "./downloads"
run_config.js_code = """document.querySelector('a.download')?.click();"""

# After the crawl:
print("Downloaded files:", result.downloaded_files)
```
More info: See /docs/screenshot_and_pdf_export or 15_screenshot_and_pdf_export.md
Also 10_file_download.md
13. Caching & Performance Optimization
Set `cache_mode` to reuse fetch results:

```python
from crawl4ai import CacheMode

run_config.cache_mode = CacheMode.ENABLED
```

Adjust delays, increase concurrency, or use `text_mode=True` for faster extraction.
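For example, a speed-oriented run might skip the cache for a fresh fetch while using a lighter, text-only browser; a sketch, with `text_mode` set on `BrowserConfig` as in Section 3:

```python
from crawl4ai import CacheMode
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig

# Lighter browser: text mode skips most rendering work (see Section 3)
fast_browser_config = BrowserConfig(headless=True, text_mode=True)

fast_run_config = CrawlerRunConfig(page_timeout=30000)
fast_run_config.cache_mode = CacheMode.BYPASS   # Fresh fetch that skips any cached copy
```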
More info: See /docs/cache_modes or 9_cache_modes.md
14. Hooks for Custom Logic
Hooks let you run code at specific lifecycle events without creating pages manually in `on_browser_created`. Use `on_page_context_created` to apply routing or modify page contexts before the URL is crawled:
Example Hook:
```python
async def on_page_context_created_hook(context, page, **kwargs):
    # Block all images to speed up load
    await context.route("**/*.{png,jpg,jpeg}", lambda route: route.abort())
    print("[HOOK] Image requests blocked")

async with AsyncWebCrawler(config=browser_config) as crawler:
    crawler.crawler_strategy.set_hook("on_page_context_created", on_page_context_created_hook)
    result = await crawler.arun("https://imageheavy.example.com", config=run_config)
    print("Crawl finished with images blocked.")
```
This hook is clean and doesn’t create a separate page itself—it just modifies the current context/page setup.
More info: See /docs/hooks_auth or 13_hooks_auth.md
15. Dockerization & Scaling
Use Docker images:
```bash
# AMD64 basic
docker pull unclecode/crawl4ai:basic-amd64
docker run -p 11235:11235 unclecode/crawl4ai:basic-amd64

# ARM64 for M1/M2
docker pull unclecode/crawl4ai:basic-arm64
docker run -p 11235:11235 unclecode/crawl4ai:basic-arm64

# GPU support
docker pull unclecode/crawl4ai:gpu-amd64
docker run --gpus all -p 11235:11235 unclecode/crawl4ai:gpu-amd64
```
Scale with load balancers or Kubernetes.
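Once a container is listening on port 11235 you can drive it over HTTP. The exact REST API is documented in the project README; the snippet below is only a hypothetical sketch assuming a `/crawl` endpoint that accepts a JSON list of URLs, so adjust the path and payload to match the documented API:

```python
import requests

# Hypothetical request shape; confirm the real endpoint and payload in the Crawl4AI README
response = requests.post(
    "http://localhost:11235/crawl",
    json={"urls": ["https://example.com"]},
    timeout=60,
)
response.raise_for_status()
print(response.json())
```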
More info: See /docs/proxy_security (for proxy) or relevant Docker instructions in README
16. Troubleshooting & Common Pitfalls
- Empty results? Relax filters, check selectors.
- Timeouts? Increase `page_timeout` or refine `wait_for`.
- CAPTCHAs? Use `user_data_dir` or `storage_state` after manual solving.
- JS errors? Try headful mode for debugging (see the sketch below).
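For the headful-debugging tip above, a minimal sketch: rerun the failing crawl with a visible, verbose browser so you can watch the page and its console while it loads:

```python
from crawl4ai.async_configs import BrowserConfig

# Visible, verbose browser for step-by-step debugging of JS-heavy pages
debug_browser_config = BrowserConfig(headless=False, verbose=True)
```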
Check examples & quickstart_async.config.py for more code.
17. Comprehensive End-to-End Example
Combine hooks, JS execution, PDF saving, LLM extraction—see quickstart_async.config.py for a full example.
18. Further Resources & Community
- Docs: https://crawl4ai.com
- Issues & PRs: https://github.com/unclecode/crawl4ai/issues
Follow @unclecode for news & community updates.
Happy Crawling!
Leverage Crawl4AI to feed your AI models with clean, structured web data today.