
Crawl4AI Quick Start Guide: Your All-in-One AI-Ready Web Crawling & AI Integration Solution

Crawl4AI, the #1 trending GitHub repository, streamlines web content extraction into AI-ready formats. Perfect for AI assistants, semantic search engines, or data pipelines, Crawl4AI transforms raw HTML into structured Markdown or JSON effortlessly. Integrate with LLMs, open-source models, or your own retrieval-augmented generation workflows.

What Crawl4AI is not:

Crawl4AI is not a replacement for traditional web scraping libraries, Selenium, or Playwright. It's not designed as a general-purpose web automation tool. Instead, Crawl4AI has a specific, focused goal:

  • To generate perfect, AI-friendly data (particularly for LLMs) from web content
  • To maximize speed and efficiency in data extraction and processing
  • To operate at scale, from Raspberry Pi to cloud infrastructures

Crawl4AI is engineered with a "scale-first" mindset, aiming to handle millions of links while maintaining exceptional speed and efficiency. It is optimized to:

  1. Transform raw web content into structured, LLM-ready formats (Markdown/JSON)
  2. Implement intelligent extraction strategies to reduce reliance on costly API calls
  3. Provide a streamlined pipeline for AI data preparation and ingestion

In essence, Crawl4AI bridges the gap between web content and AI systems, focusing on delivering high-quality, processed data rather than offering broad web automation capabilities.

Key Links:


Table of Contents

  1. Introduction & Key Concepts
  2. Installation & Environment Setup
  3. Core Concepts & Configuration
  4. Basic Crawling & Simple Extraction
  5. Markdown Generation & AI-Optimized Output
  6. Structured Data Extraction (CSS, XPath, LLM)
  7. Advanced Extraction: LLM & Open-Source Models
  8. Page Interactions, JS Execution, & Dynamic Content
  9. Media, Links, & Metadata Handling
  10. Authentication & Identity Preservation
  11. Proxy & Security Enhancements
  12. Screenshots, PDFs & File Downloads
  13. Caching & Performance Optimization
  14. Hooks for Custom Logic
  15. Dockerization & Scaling
  16. Troubleshooting & Common Pitfalls
  17. Comprehensive End-to-End Example
  18. Further Resources & Community

1. Introduction & Key Concepts

Crawl4AI transforms websites into structured, AI-friendly data. It efficiently handles large-scale crawling, integrates with both proprietary and open-source LLMs, and optimizes content for semantic search or RAG pipelines.

Quick Test:

import asyncio
from crawl4ai import AsyncWebCrawler

async def test_run():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com")
        print(result.markdown)

asyncio.run(test_run())

If you see Markdown output, everything is working!

More info: See /docs/introduction or 1_introduction.ex.md


2. Installation & Environment Setup

# Install the package
pip install crawl4ai
crawl4ai-setup

# Install Playwright with system dependencies (recommended)
playwright install --with-deps  # Installs all browsers

# Or install specific browsers:
playwright install --with-deps chrome  # Recommended for Colab/Linux
playwright install --with-deps firefox
playwright install --with-deps webkit
playwright install --with-deps chromium

# Keep Playwright updated periodically
playwright install

Note: For Google Colab and some Linux environments, use chrome instead of chromium - it tends to work more reliably.

Test Your Installation

Try these one-liners:

# Visible browser test
python -c "from playwright.sync_api import sync_playwright; p = sync_playwright().start(); browser = p.chromium.launch(headless=False); page = browser.new_page(); page.goto('https://example.com'); input('Press Enter to close...')"

# Headless test (for servers/CI)
python -c "from playwright.sync_api import sync_playwright; p = sync_playwright().start(); browser = p.chromium.launch(headless=True); page = browser.new_page(); page.goto('https://example.com'); print(f'Title: {page.title()}'); browser.close()"

You should see a browser window (in visible test) loading example.com. If you get errors, try with Firefox using playwright install --with-deps firefox.
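
Recent releases also ship a crawl4ai-doctor command (installed alongside crawl4ai-setup); if your version includes it, it is a quick way to verify the whole setup in one step:

# Run Crawl4AI's built-in diagnostics (available in recent versions)
crawl4ai-doctor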

Try in Colab:
Open Colab Notebook

More info: See /docs/configuration or 2_configuration.md


3. Core Concepts & Configuration

Use AsyncWebCrawler, CrawlerRunConfig, and BrowserConfig to control crawling.

Example config:

from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig

browser_config = BrowserConfig(
    headless=True,
    verbose=True,
    viewport_width=1080,
    viewport_height=600,
    text_mode=False,
    ignore_https_errors=True,
    java_script_enabled=True
)

run_config = CrawlerRunConfig(
    css_selector="article.main",
    word_count_threshold=50,
    excluded_tags=['nav','footer'],
    exclude_external_links=True,
    wait_for="css:.article-loaded",
    page_timeout=60000,
    delay_before_return_html=1.0,
    mean_delay=0.1,
    max_range=0.3,
    process_iframes=True,
    remove_overlay_elements=True,
    js_code="""
        (async () => {
            window.scrollTo(0, document.body.scrollHeight);
            await new Promise(r => setTimeout(r, 2000));
            document.querySelector('.load-more')?.click();
        })();
    """
)

# Cache control (requires: from crawl4ai import CacheMode)
# Available modes: ENABLED, DISABLED, BYPASS, READ_ONLY, WRITE_ONLY
# run_config.cache_mode = CacheMode.ENABLED

Prefixes:

  • http:// or https:// for live pages
  • file:// for local HTML files (e.g., file:///path/to/page.html)
  • raw: followed by an HTML string (e.g., raw:<html>...</html>) for in-memory content; see the sketch below
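
For example, inside the async with AsyncWebCrawler() as crawler block from the quick test, the same arun call handles all three; the file path below is a placeholder:

# Raw HTML string: no network request is made
raw_result = await crawler.arun("raw:<html><body><h1>Hello</h1><p>Inline HTML</p></body></html>")

# Local file (placeholder path)
local_result = await crawler.arun("file:///path/to/page.html")

print(raw_result.markdown)
print(local_result.markdown)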

More info: See /docs/async_webcrawler or 3_async_webcrawler.ex.md


4. Basic Crawling & Simple Extraction

# Inside an async function, as in the quick test from Section 1:
async with AsyncWebCrawler(config=browser_config) as crawler:
    result = await crawler.arun("https://news.example.com/article", config=run_config)
    print(result.markdown)  # Basic markdown content

More info: See /docs/browser_context_page or 4_browser_context_page.ex.md


5. Markdown Generation & AI-Optimized Output

After crawling, result.markdown_v2 provides:

  • raw_markdown: Unfiltered markdown
  • markdown_with_citations: Links as references at the bottom
  • references_markdown: A separate list of reference links
  • fit_markdown: Filtered, relevant markdown (e.g., after BM25)
  • fit_html: The HTML used to produce fit_markdown

Example:

print("RAW:", result.markdown_v2.raw_markdown[:200])
print("CITED:", result.markdown_v2.markdown_with_citations[:200])
print("REFERENCES:", result.markdown_v2.references_markdown)
print("FIT MARKDOWN:", result.markdown_v2.fit_markdown)

For AI training, fit_markdown focuses on the most relevant content.
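
As a rough sketch of how fit_markdown is produced, the snippet below wires a BM25-based content filter into the markdown generator. The class and parameter names (DefaultMarkdownGenerator, BM25ContentFilter, user_query) are assumptions based on the current crawl4ai API, so check the markdown generation docs for your version:

from crawl4ai.async_configs import CrawlerRunConfig
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
from crawl4ai.content_filter_strategy import BM25ContentFilter

# Keep only content relevant to the query, then generate fit_markdown from it
run_config = CrawlerRunConfig(
    markdown_generator=DefaultMarkdownGenerator(
        content_filter=BM25ContentFilter(user_query="machine learning tutorials")
    )
)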

More info: See /docs/markdown_generation or 5_markdown_generation.ex.md


6. Structured Data Extraction (CSS, XPath, LLM)

Extract JSON data without LLMs:

CSS:

from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

schema = {
  "name": "Products",
  "baseSelector": ".product",
  "fields": [
    {"name": "title", "selector": "h2", "type": "text"},
    {"name": "price", "selector": ".price", "type": "text"}
  ]
}
run_config.extraction_strategy = JsonCssExtractionStrategy(schema)

XPath:

from crawl4ai.extraction_strategy import JsonXPathExtractionStrategy

xpath_schema = {
  "name": "Articles",
  "baseSelector": "//div[@class='article']",
  "fields": [
    {"name": "headline", "selector": ".//h1", "type": "text"},
    {"name": "summary", "selector": ".//p[@class='summary']", "type": "text"}
  ]
}
run_config.extraction_strategy = JsonXPathExtractionStrategy(xpath_schema)
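
With either strategy attached, the extracted records arrive as a JSON string on result.extracted_content. A minimal usage sketch (the URL is a placeholder; run this inside an async function):

import json
from crawl4ai import AsyncWebCrawler

async with AsyncWebCrawler() as crawler:
    result = await crawler.arun("https://shop.example.com/products", config=run_config)
    products = json.loads(result.extracted_content)  # JSON string -> Python objects
    print(f"Extracted {len(products)} items")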

More info: See /docs/extraction_strategies or 7_extraction_strategies.ex.md


7. Advanced Extraction: LLM & Open-Source Models

Use LLMExtractionStrategy for complex tasks. Works with OpenAI or open-source models (e.g., Ollama).

from pydantic import BaseModel
from crawl4ai.extraction_strategy import LLMExtractionStrategy

class TravelData(BaseModel):
    destination: str
    attractions: list

run_config.extraction_strategy = LLMExtractionStrategy(
    provider="ollama/nemotron",
    schema=TravelData.schema(),
    instruction="Extract destination and top attractions."
)
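
Running it follows the same pattern, with the JSON output on result.extracted_content. For hosted providers, an API token is typically passed via the strategy's api_token argument (the provider string and token handling below are illustrative; check the extraction docs for your version):

import json
import os
from crawl4ai import AsyncWebCrawler

run_config.extraction_strategy = LLMExtractionStrategy(
    provider="openai/gpt-4o-mini",            # hosted model; local Ollama models need no token
    api_token=os.getenv("OPENAI_API_KEY"),
    schema=TravelData.schema(),
    instruction="Extract destination and top attractions."
)

async with AsyncWebCrawler() as crawler:
    result = await crawler.arun("https://travel.example.com/guide", config=run_config)
    print(json.loads(result.extracted_content))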

More info: See /docs/extraction_strategies or 7_extraction_strategies.ex.md


8. Page Interactions, JS Execution, & Dynamic Content

Insert js_code and use wait_for to ensure content loads. Example:

run_config.js_code = """
(async () => {
   document.querySelector('.load-more')?.click();
   await new Promise(r => setTimeout(r, 2000));
})();
"""
run_config.wait_for = "css:.item-loaded"
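
Putting it together, a short sketch of a full run against a hypothetical infinite-scroll page (run inside an async function):

async with AsyncWebCrawler(config=browser_config) as crawler:
    result = await crawler.arun(
        "https://infinite-scroll.example.com",  # placeholder dynamic page
        config=run_config                       # js_code + wait_for from above
    )
    print(result.markdown[:500])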

More info: See /docs/page_interaction or 11_page_interaction.md


9. Media, Links, & Metadata Handling

result.media["images"]: List of images with src, score, alt. Score indicates relevance.

result.media["videos"], result.media["audios"] similarly hold media info.

result.links["internal"], result.links["external"], result.links["social"]: Categorized links. Each link has href, text, context, type.

result.metadata: Title, description, keywords, author.

Example:

# Images
for img in result.media["images"]:
    print("Image:", img["src"], "Score:", img["score"], "Alt:", img.get("alt","N/A"))

# Links
for link in result.links["external"]:
    print("External Link:", link["href"], "Text:", link["text"])

# Metadata
print("Page Title:", result.metadata["title"])
print("Description:", result.metadata["description"])

More info: See /docs/content_selection or 8_content_selection.ex.md


10. Authentication & Identity Preservation

Manual Setup via User Data Directory

  1. Open Chrome with a custom user data dir:

    "C:\Program Files\Google\Chrome\Application\chrome.exe" --user-data-dir="C:\MyChromeProfile"
    

    On macOS:

    "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome" --user-data-dir="/Users/username/ChromeProfiles/MyProfile"
    
  2. Log in to sites, solve CAPTCHAs, adjust settings manually.
    The browser saves cookies/localStorage in that directory.

  3. Use user_data_dir in BrowserConfig:

    browser_config = BrowserConfig(
        headless=True,
        user_data_dir="/Users/username/ChromeProfiles/MyProfile"
    )
    

    Now the crawler starts with those cookies, sessions, etc.

Using storage_state

Alternatively, export and reuse storage states:

browser_config = BrowserConfig(
    headless=True,
    storage_state="mystate.json"  # Pre-saved state
)

No repeated logins needed.
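
One way to produce mystate.json is to log in once with plain Playwright and save the context's storage state (storage_state is a standard Playwright API; the login URL is a placeholder):

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    context = browser.new_context()
    page = context.new_page()
    page.goto("https://example.com/login")  # placeholder login page
    input("Log in manually, then press Enter to save the session...")
    context.storage_state(path="mystate.json")  # reuse this file in BrowserConfig
    browser.close()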

More info: See /docs/storage_state or 16_storage_state.md


11. Proxy & Security Enhancements

Use proxy_config for authenticated proxies:

browser_config.proxy_config = {
    "server": "http://proxy.example.com:8080",
    "username": "proxyuser",
    "password": "proxypass"
}

Combine with headers or ignore_https_errors as needed.
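
For example, assuming your version of BrowserConfig accepts a headers dict (verify in the configuration docs), a sketch combining the proxy with custom headers and relaxed TLS checks might look like this:

from crawl4ai.async_configs import BrowserConfig

browser_config = BrowserConfig(
    headless=True,
    ignore_https_errors=True,                       # shown in Section 3
    headers={"Accept-Language": "en-US,en;q=0.9"},  # assumed parameter; confirm for your version
    proxy_config={
        "server": "http://proxy.example.com:8080",
        "username": "proxyuser",
        "password": "proxypass"
    }
)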

More info: See /docs/proxy_security or 14_proxy_security.md


12. Screenshots, PDFs & File Downloads

Enable screenshot=True or pdf=True in CrawlerRunConfig:

run_config.screenshot = True
run_config.pdf = True

After crawling:

if result.screenshot:
    with open("page.png", "wb") as f:
        f.write(result.screenshot)

if result.pdf:
    with open("page.pdf", "wb") as f:
        f.write(result.pdf)

File Downloads:

browser_config.accept_downloads = True
browser_config.downloads_path = "./downloads"
run_config.js_code = """document.querySelector('a.download')?.click();"""

# After crawl:
print("Downloaded files:", result.downloaded_files)

More info: See /docs/screenshot_and_pdf_export or 15_screenshot_and_pdf_export.md
Also 10_file_download.md


13. Caching & Performance Optimization

Set cache_mode to reuse fetch results:

from crawl4ai import CacheMode
run_config.cache_mode = CacheMode.ENABLED

Adjust delays, increase concurrency, or use text_mode=True for faster extraction.
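
A sketch of a faster configuration: text_mode and the delay knobs were shown in Section 3, and arun_many is the batch counterpart of arun (verify its exact signature for your version). Run this inside an async function:

from crawl4ai import AsyncWebCrawler, CacheMode
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig

browser_config = BrowserConfig(headless=True, text_mode=True)  # skip heavy rendering
run_config = CrawlerRunConfig(
    cache_mode=CacheMode.ENABLED,  # reuse previously fetched pages
    mean_delay=0.1,                # small randomized delay between requests
    max_range=0.3
)

urls = ["https://example.com/page1", "https://example.com/page2"]  # placeholder batch

async with AsyncWebCrawler(config=browser_config) as crawler:
    results = await crawler.arun_many(urls, config=run_config)
    for r in results:
        print(r.url, len(r.markdown or ""))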

More info: See /docs/cache_modes or 9_cache_modes.md


14. Hooks for Custom Logic

Hooks let you run custom code at specific lifecycle events. Rather than creating pages manually in on_browser_created, use on_page_context_created to apply routing rules or otherwise modify the context and page before the URL is crawled:

Example Hook:

async def on_page_context_created_hook(context, page, **kwargs):
    # Block all images to speed up load
    await context.route("**/*.{png,jpg,jpeg}", lambda route: route.abort())
    print("[HOOK] Image requests blocked")

async with AsyncWebCrawler(config=browser_config) as crawler:
    crawler.crawler_strategy.set_hook("on_page_context_created", on_page_context_created_hook)
    result = await crawler.arun("https://imageheavy.example.com", config=run_config)
    print("Crawl finished with images blocked.")

This hook does not create a separate page; it simply adjusts the context and page that the crawler is about to use.

More info: See /docs/hooks_auth or 13_hooks_auth.md


15. Dockerization & Scaling

Use Docker images:

  • AMD64 basic:
    docker pull unclecode/crawl4ai:basic-amd64
    docker run -p 11235:11235 unclecode/crawl4ai:basic-amd64
  • ARM64 (Apple Silicon M1/M2):
    docker pull unclecode/crawl4ai:basic-arm64
    docker run -p 11235:11235 unclecode/crawl4ai:basic-arm64
  • GPU support:
    docker pull unclecode/crawl4ai:gpu-amd64
    docker run --gpus all -p 11235:11235 unclecode/crawl4ai:gpu-amd64

Scale with load balancers or Kubernetes.

More info: See /docs/proxy_security (for proxy) or relevant Docker instructions in README


16. Troubleshooting & Common Pitfalls

  • Empty results? Relax filters, check selectors.
  • Timeouts? Increase page_timeout or refine wait_for.
  • CAPTCHAs? Use user_data_dir or storage_state after manual solving.
  • JS errors? Try headful mode for debugging (see the sketch below).
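
A minimal debug sketch built from options already shown in Section 3; a visible browser plus verbose logging makes JS and selector issues much easier to spot (result.success and result.error_message are standard fields on the crawl result; run inside an async function):

from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig

debug_browser = BrowserConfig(headless=False, verbose=True)  # watch the page load
debug_run = CrawlerRunConfig(page_timeout=120000)            # give slow pages more time

async with AsyncWebCrawler(config=debug_browser) as crawler:
    result = await crawler.arun("https://problematic.example.com", config=debug_run)
    print(result.success, (result.error_message or "")[:200])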

Check examples & quickstart_async.config.py for more code.


17. Comprehensive End-to-End Example

Combine hooks, JS execution, PDF saving, and LLM extraction; see quickstart_async.config.py for a full example, or start from the condensed sketch below.
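
Below is a condensed, hedged sketch of such a pipeline, assembled only from pieces shown earlier in this guide (the URL, schema, and model are placeholders):

import asyncio
import json

from pydantic import BaseModel
from crawl4ai import AsyncWebCrawler, CacheMode
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy

class Article(BaseModel):
    title: str
    key_points: list

async def block_images(context, page, **kwargs):
    # Hook from Section 14: skip image requests to speed up the crawl
    await context.route("**/*.{png,jpg,jpeg}", lambda route: route.abort())

async def main():
    browser_config = BrowserConfig(headless=True, verbose=True)
    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.ENABLED,                               # Section 13
        pdf=True,                                                   # Section 12
        js_code="window.scrollTo(0, document.body.scrollHeight);",  # Section 8
        extraction_strategy=LLMExtractionStrategy(                  # Section 7
            provider="ollama/nemotron",
            schema=Article.schema(),
            instruction="Extract the article title and its key points."
        )
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        crawler.crawler_strategy.set_hook("on_page_context_created", block_images)  # Section 14
        result = await crawler.arun("https://news.example.com/article", config=run_config)

        if result.pdf:
            with open("article.pdf", "wb") as f:
                f.write(result.pdf)
        if result.extracted_content:
            print(json.loads(result.extracted_content))

asyncio.run(main())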


18. Further Resources & Community

Follow @unclecode for news & community updates.

Happy Crawling!
Leverage Crawl4AI to feed your AI models with clean, structured web data today.