# Crawl4AI Quick Start Guide: Your All-in-One AI-Ready Web Crawling & AI Integration Solution

Crawl4AI, the **#1 trending GitHub repository**, streamlines web content extraction into AI-ready formats. Perfect for AI assistants, semantic search engines, or data pipelines, Crawl4AI transforms raw HTML into structured Markdown or JSON effortlessly. Integrate it with LLMs, open-source models, or your own retrieval-augmented generation (RAG) workflows.

**What Crawl4AI is not:** Crawl4AI is not a replacement for traditional web scraping libraries, Selenium, or Playwright, and it is not designed as a general-purpose web automation tool. Instead, Crawl4AI has a specific, focused goal:

- To generate perfect, AI-friendly data (particularly for LLMs) from web content
- To maximize speed and efficiency in data extraction and processing
- To operate at scale, from a Raspberry Pi to cloud infrastructure

Crawl4AI is engineered with a "scale-first" mindset, aiming to handle millions of links while maintaining exceptional performance. It is optimized to:

1. Transform raw web content into structured, LLM-ready formats (Markdown/JSON)
2. Implement intelligent extraction strategies to reduce reliance on costly API calls
3. Provide a streamlined pipeline for AI data preparation and ingestion

In essence, Crawl4AI bridges the gap between web content and AI systems, focusing on delivering high-quality, processed data rather than offering broad web automation capabilities.

**Key Links:**

- **Website:** [https://crawl4ai.com](https://crawl4ai.com)
- **GitHub:** [https://github.com/unclecode/crawl4ai](https://github.com/unclecode/crawl4ai)
- **Colab Notebook:** [Try on Google Colab](https://colab.research.google.com/drive/1SgRPrByQLzjRfwoRNq1wSGE9nYY_EE8C?usp=sharing)
- **Quickstart Code Example:** [quickstart_async.config.py](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/quickstart_async.config.py)
- **Examples Folder:** [Crawl4AI Examples](https://github.com/unclecode/crawl4ai/tree/main/docs/examples)

---

## Table of Contents

- [Crawl4AI Quick Start Guide: Your All-in-One AI-Ready Web Crawling \& AI Integration Solution](#crawl4ai-quick-start-guide-your-all-in-one-ai-ready-web-crawling--ai-integration-solution)
  - [Table of Contents](#table-of-contents)
  - [1. Introduction \& Key Concepts](#1-introduction--key-concepts)
  - [2. Installation \& Environment Setup](#2-installation--environment-setup)
    - [Test Your Installation](#test-your-installation)
  - [3. Core Concepts \& Configuration](#3-core-concepts--configuration)
  - [4. Basic Crawling \& Simple Extraction](#4-basic-crawling--simple-extraction)
  - [5. Markdown Generation \& AI-Optimized Output](#5-markdown-generation--ai-optimized-output)
  - [6. Structured Data Extraction (CSS, XPath, LLM)](#6-structured-data-extraction-css-xpath-llm)
  - [7. Advanced Extraction: LLM \& Open-Source Models](#7-advanced-extraction-llm--open-source-models)
  - [8. Page Interactions, JS Execution, \& Dynamic Content](#8-page-interactions-js-execution--dynamic-content)
  - [9. Media, Links, \& Metadata Handling](#9-media-links--metadata-handling)
  - [10. Authentication \& Identity Preservation](#10-authentication--identity-preservation)
    - [Manual Setup via User Data Directory](#manual-setup-via-user-data-directory)
    - [Using `storage_state`](#using-storage_state)
  - [11. Proxy \& Security Enhancements](#11-proxy--security-enhancements)
  - [12. Screenshots, PDFs \& File Downloads](#12-screenshots-pdfs--file-downloads)
  - [13. Caching \& Performance Optimization](#13-caching--performance-optimization)
  - [14. Hooks for Custom Logic](#14-hooks-for-custom-logic)
  - [15. Dockerization \& Scaling](#15-dockerization--scaling)
  - [16. Troubleshooting \& Common Pitfalls](#16-troubleshooting--common-pitfalls)
  - [17. Comprehensive End-to-End Example](#17-comprehensive-end-to-end-example)
  - [18. Further Resources \& Community](#18-further-resources--community)
---

## 1. Introduction & Key Concepts

Crawl4AI transforms websites into structured, AI-friendly data. It efficiently handles large-scale crawling, integrates with both proprietary and open-source LLMs, and optimizes content for semantic search or RAG pipelines.

**Quick Test:**

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def test_run():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com")
        print(result.markdown)

asyncio.run(test_run())
```

If you see Markdown output, everything is working!

**More info:** [See /docs/introduction](#) or [1_introduction.ex.md](https://github.com/unclecode/crawl4ai/blob/main/introduction.ex.md)

---

## 2. Installation & Environment Setup

```bash
# Install the package
pip install crawl4ai

# Run the post-installation setup
crawl4ai-setup

# Install Playwright with system dependencies (recommended)
playwright install --with-deps            # Installs all browsers

# Or install specific browsers:
playwright install --with-deps chrome     # Recommended for Colab/Linux
playwright install --with-deps firefox
playwright install --with-deps webkit
playwright install --with-deps chromium

# Keep Playwright updated periodically
playwright install
```

> **Note**: For Google Colab and some Linux environments, use `chrome` instead of `chromium` - it tends to work more reliably.

### Test Your Installation

Try these one-liners:

```bash
# Visible browser test
python -c "from playwright.sync_api import sync_playwright; p = sync_playwright().start(); browser = p.chromium.launch(headless=False); page = browser.new_page(); page.goto('https://example.com'); input('Press Enter to close...')"

# Headless test (for servers/CI)
python -c "from playwright.sync_api import sync_playwright; p = sync_playwright().start(); browser = p.chromium.launch(headless=True); page = browser.new_page(); page.goto('https://example.com'); print(f'Title: {page.title()}'); browser.close()"
```

You should see a browser window (in the visible test) loading example.com. If you get errors, try Firefox via `playwright install --with-deps firefox`.

**Try in Colab:** [Open Colab Notebook](https://colab.research.google.com/drive/1SgRPrByQLzjRfwoRNq1wSGE9nYY_EE8C?usp=sharing)

**More info:** [See /docs/configuration](#) or [2_configuration.md](https://github.com/unclecode/crawl4ai/blob/main/configuration.md)

---

## 3. Core Concepts & Configuration

Use `AsyncWebCrawler`, `CrawlerRunConfig`, and `BrowserConfig` to control crawling.
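As a quick orientation, here is a minimal sketch of how the three classes fit together, using only calls shown elsewhere in this guide (the URL and threshold are arbitrary); a fuller configuration example follows.

```python
import asyncio

from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig

async def minimal_run():
    # BrowserConfig controls how the browser launches;
    # CrawlerRunConfig controls a single crawl (filters, waits, JS, extraction).
    browser_config = BrowserConfig(headless=True)
    run_config = CrawlerRunConfig(word_count_threshold=50)

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun("https://example.com", config=run_config)
        print(result.markdown[:300])  # first few hundred characters of the generated markdown

asyncio.run(minimal_run())
```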
**Example config:**

```python
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig

browser_config = BrowserConfig(
    headless=True,
    verbose=True,
    viewport_width=1080,
    viewport_height=600,
    text_mode=False,
    ignore_https_errors=True,
    java_script_enabled=True
)

run_config = CrawlerRunConfig(
    css_selector="article.main",
    word_count_threshold=50,
    excluded_tags=['nav', 'footer'],
    exclude_external_links=True,
    wait_for="css:.article-loaded",
    page_timeout=60000,
    delay_before_return_html=1.0,
    mean_delay=0.1,
    max_range=0.3,
    process_iframes=True,
    remove_overlay_elements=True,
    js_code="""
    (async () => {
        window.scrollTo(0, document.body.scrollHeight);
        await new Promise(r => setTimeout(r, 2000));
        document.querySelector('.load-more')?.click();
    })();
    """
)

# Cache modes (requires `from crawl4ai import CacheMode`):
# ENABLED, DISABLED, BYPASS, READ_ONLY, WRITE_ONLY
# run_config.cache_mode = CacheMode.ENABLED
```

**Prefixes:**

- `http://` or `https://` for live pages
- `file://` for local HTML files
- `raw:` for raw HTML strings

**More info:** [See /docs/async_webcrawler](#) or [3_async_webcrawler.ex.md](https://github.com/unclecode/crawl4ai/blob/main/async_webcrawler.ex.md)

---

## 4. Basic Crawling & Simple Extraction

```python
async with AsyncWebCrawler(config=browser_config) as crawler:
    result = await crawler.arun("https://news.example.com/article", config=run_config)
    print(result.markdown)  # Basic markdown content
```

**More info:** [See /docs/browser_context_page](#) or [4_browser_context_page.ex.md](https://github.com/unclecode/crawl4ai/blob/main/browser_context_page.ex.md)

---

## 5. Markdown Generation & AI-Optimized Output

After crawling, `result.markdown_v2` provides:

- `raw_markdown`: Unfiltered markdown
- `markdown_with_citations`: Links as references at the bottom
- `references_markdown`: A separate list of reference links
- `fit_markdown`: Filtered, relevant markdown (e.g., after BM25)
- `fit_html`: The HTML used to produce `fit_markdown`

**Example:**

```python
print("RAW:", result.markdown_v2.raw_markdown[:200])
print("CITED:", result.markdown_v2.markdown_with_citations[:200])
print("REFERENCES:", result.markdown_v2.references_markdown)
print("FIT MARKDOWN:", result.markdown_v2.fit_markdown)
```

For AI training, `fit_markdown` focuses on the most relevant content.

**More info:** [See /docs/markdown_generation](#) or [5_markdown_generation.ex.md](https://github.com/unclecode/crawl4ai/blob/main/markdown_generation.ex.md)

---

## 6. Structured Data Extraction (CSS, XPath, LLM)

Extract JSON data without LLMs:

**CSS:**

```python
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

schema = {
    "name": "Products",
    "baseSelector": ".product",
    "fields": [
        {"name": "title", "selector": "h2", "type": "text"},
        {"name": "price", "selector": ".price", "type": "text"}
    ]
}

run_config.extraction_strategy = JsonCssExtractionStrategy(schema)
```

**XPath:**

```python
from crawl4ai.extraction_strategy import JsonXPathExtractionStrategy

xpath_schema = {
    "name": "Articles",
    "baseSelector": "//div[@class='article']",
    "fields": [
        {"name": "headline", "selector": ".//h1", "type": "text"},
        {"name": "summary", "selector": ".//p[@class='summary']", "type": "text"}
    ]
}

run_config.extraction_strategy = JsonXPathExtractionStrategy(xpath_schema)
```
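Whichever schema-based strategy you attach, the crawl result carries the structured output as a JSON string. Below is a minimal sketch of consuming it, assuming (as in the Crawl4AI examples, though not shown above) that it is exposed as `result.extracted_content`; it reuses `browser_config`/`run_config` from the snippets above, and the URL is a placeholder.

```python
import asyncio
import json

from crawl4ai import AsyncWebCrawler

async def extract_products():
    # Assumes `browser_config` and `run_config` (with the CSS schema above) are already defined.
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun("https://shop.example.com/products", config=run_config)

        if result.extracted_content:  # assumed attribute: JSON string produced by the strategy
            products = json.loads(result.extracted_content)  # list of dicts shaped by the schema
            for item in products:
                print(item["title"], "-", item["price"])

asyncio.run(extract_products())
```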
**More info:** [See /docs/extraction_strategies](#) or [7_extraction_strategies.ex.md](https://github.com/unclecode/crawl4ai/blob/main/extraction_strategies.ex.md)

---

## 7. Advanced Extraction: LLM & Open-Source Models

Use `LLMExtractionStrategy` for complex extraction tasks. It works with OpenAI or open-source models (e.g., via Ollama).

```python
from pydantic import BaseModel
from crawl4ai.extraction_strategy import LLMExtractionStrategy

class TravelData(BaseModel):
    destination: str
    attractions: list

run_config.extraction_strategy = LLMExtractionStrategy(
    provider="ollama/nemotron",
    schema=TravelData.schema(),
    instruction="Extract destination and top attractions."
)
```

**More info:** [See /docs/extraction_strategies](#) or [7_extraction_strategies.ex.md](https://github.com/unclecode/crawl4ai/blob/main/extraction_strategies.ex.md)

---

## 8. Page Interactions, JS Execution, & Dynamic Content

Insert `js_code` and use `wait_for` to ensure dynamic content has loaded. Example:

```python
run_config.js_code = """
(async () => {
    document.querySelector('.load-more')?.click();
    await new Promise(r => setTimeout(r, 2000));
})();
"""
run_config.wait_for = "css:.item-loaded"
```

**More info:** [See /docs/page_interaction](#) or [11_page_interaction.md](https://github.com/unclecode/crawl4ai/blob/main/page_interaction.md)

---

## 9. Media, Links, & Metadata Handling

- `result.media["images"]`: List of images with `src`, `score`, and `alt`; the score indicates relevance. `result.media["videos"]` and `result.media["audios"]` hold similar media info.
- `result.links["internal"]`, `result.links["external"]`, `result.links["social"]`: Categorized links. Each link has `href`, `text`, `context`, and `type`.
- `result.metadata`: Title, description, keywords, author.

**Example:**

```python
# Images
for img in result.media["images"]:
    print("Image:", img["src"], "Score:", img["score"], "Alt:", img.get("alt", "N/A"))

# Links
for link in result.links["external"]:
    print("External Link:", link["href"], "Text:", link["text"])

# Metadata
print("Page Title:", result.metadata["title"])
print("Description:", result.metadata["description"])
```

**More info:** [See /docs/content_selection](#) or [8_content_selection.ex.md](https://github.com/unclecode/crawl4ai/blob/main/content_selection.ex.md)

---

## 10. Authentication & Identity Preservation

### Manual Setup via User Data Directory

1. **Open Chrome with a custom user data directory:**

   ```bash
   "C:\Program Files\Google\Chrome\Application\chrome.exe" --user-data-dir="C:\MyChromeProfile"
   ```

   On macOS:

   ```bash
   "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome" --user-data-dir="/Users/username/ChromeProfiles/MyProfile"
   ```

2. **Log in to sites, solve CAPTCHAs, and adjust settings manually.** The browser saves cookies and localStorage in that directory.

3. **Use `user_data_dir` in `BrowserConfig`:**

   ```python
   browser_config = BrowserConfig(
       headless=True,
       user_data_dir="/Users/username/ChromeProfiles/MyProfile"
   )
   ```

   Now the crawler starts with those cookies, sessions, and settings.

### Using `storage_state`

Alternatively, export and reuse a storage state:

```python
browser_config = BrowserConfig(
    headless=True,
    storage_state="mystate.json"  # Pre-saved state
)
```

No repeated logins needed.

**More info:** [See /docs/storage_state](#) or [16_storage_state.md](https://github.com/unclecode/crawl4ai/blob/main/storage_state.md)

---

## 11. Proxy & Security Enhancements

Use `proxy_config` for authenticated proxies:

```python
browser_config.proxy_config = {
    "server": "http://proxy.example.com:8080",
    "username": "proxyuser",
    "password": "proxypass"
}
```

Combine with `headers` or `ignore_https_errors` as needed, as sketched below.
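For instance, here is a hedged sketch of a browser setup that combines the authenticated proxy with custom headers and relaxed TLS checks. The proxy endpoint, credentials, and header values are placeholders, and passing `headers` to `BrowserConfig` is assumed based on the note above.

```python
from crawl4ai.async_configs import BrowserConfig

browser_config = BrowserConfig(
    headless=True,
    ignore_https_errors=True,                        # tolerate certificate problems behind the proxy
    headers={"Accept-Language": "en-US,en;q=0.9"},   # assumed option: extra default request headers
)

# Authenticated proxy, set exactly as in the snippet above (placeholder endpoint and credentials).
browser_config.proxy_config = {
    "server": "http://proxy.example.com:8080",
    "username": "proxyuser",
    "password": "proxypass",
}
```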
**More info:** [See /docs/proxy_security](#) or [14_proxy_security.md](https://github.com/unclecode/crawl4ai/blob/main/proxy_security.md)

---

## 12. Screenshots, PDFs & File Downloads

Enable `screenshot=True` or `pdf=True` in `CrawlerRunConfig`:

```python
run_config.screenshot = True
run_config.pdf = True
```

After crawling:

```python
if result.screenshot:
    with open("page.png", "wb") as f:
        f.write(result.screenshot)

if result.pdf:
    with open("page.pdf", "wb") as f:
        f.write(result.pdf)
```

**File Downloads:**

```python
browser_config.accept_downloads = True
browser_config.downloads_path = "./downloads"

run_config.js_code = """document.querySelector('a.download')?.click();"""

# After the crawl:
print("Downloaded files:", result.downloaded_files)
```

**More info:** [See /docs/screenshot_and_pdf_export](#) or [15_screenshot_and_pdf_export.md](https://github.com/unclecode/crawl4ai/blob/main/screenshot_and_pdf_export.md). Also see [10_file_download.md](https://github.com/unclecode/crawl4ai/blob/main/file_download.md)

---

## 13. Caching & Performance Optimization

Set `cache_mode` to reuse fetched results:

```python
from crawl4ai import CacheMode

run_config.cache_mode = CacheMode.ENABLED
```

Adjust delays, increase concurrency, or use `text_mode=True` for faster extraction.

**More info:** [See /docs/cache_modes](#) or [9_cache_modes.md](https://github.com/unclecode/crawl4ai/blob/main/cache_modes.md)

---

## 14. Hooks for Custom Logic

Hooks let you run custom code at specific points in the crawler lifecycle, without creating pages manually in `on_browser_created`. Use `on_page_context_created` to apply routing or modify the page/context before the URL is crawled:

**Example Hook:**

```python
async def on_page_context_created_hook(context, page, **kwargs):
    # Block all images to speed up page load
    await context.route("**/*.{png,jpg,jpeg}", lambda route: route.abort())
    print("[HOOK] Image requests blocked")

async with AsyncWebCrawler(config=browser_config) as crawler:
    crawler.crawler_strategy.set_hook("on_page_context_created", on_page_context_created_hook)
    result = await crawler.arun("https://imageheavy.example.com", config=run_config)
    print("Crawl finished with images blocked.")
```

This hook is clean and doesn't create a separate page itself; it just modifies the current context/page setup.

**More info:** [See /docs/hooks_auth](#) or [13_hooks_auth.md](https://github.com/unclecode/crawl4ai/blob/main/hooks_auth.md)

---

## 15. Dockerization & Scaling

Use the prebuilt Docker images:

- AMD64 basic:

  ```bash
  docker pull unclecode/crawl4ai:basic-amd64
  docker run -p 11235:11235 unclecode/crawl4ai:basic-amd64
  ```

- ARM64 for M1/M2:

  ```bash
  docker pull unclecode/crawl4ai:basic-arm64
  docker run -p 11235:11235 unclecode/crawl4ai:basic-arm64
  ```

- GPU support:

  ```bash
  docker pull unclecode/crawl4ai:gpu-amd64
  docker run --gpus all -p 11235:11235 unclecode/crawl4ai:gpu-amd64
  ```

Scale out with load balancers or Kubernetes.

**More info:** [See the Docker instructions in the README](#)

---

## 16. Troubleshooting & Common Pitfalls

- Empty results? Relax filters and check your selectors.
- Timeouts? Increase `page_timeout` or refine `wait_for`.
- CAPTCHAs? Use `user_data_dir` or `storage_state` after solving them manually.
- JS errors? Try headful mode (`headless=False`) for debugging.

Check the [examples](https://github.com/unclecode/crawl4ai/tree/main/docs/examples) and [quickstart_async.config.py](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/quickstart_async.config.py) for more code.

---

## 17. Comprehensive End-to-End Example

Combine hooks, JS execution, PDF saving, and LLM extraction; see [quickstart_async.config.py](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/quickstart_async.config.py) for a full example.
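As a smaller, self-contained taste of that, here is a hedged sketch combining the image-blocking hook (Section 14), JS scrolling, CSS-based extraction (Section 6), a screenshot, and cache bypass. The URL and selectors are placeholders, and the run options are passed as constructor arguments rather than set as attributes as in earlier sections.

```python
import asyncio
import json

from crawl4ai import AsyncWebCrawler, CacheMode
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def block_images(context, page, **kwargs):
    # Hook from Section 14: abort image requests to speed up the crawl.
    await context.route("**/*.{png,jpg,jpeg}", lambda route: route.abort())

async def main():
    browser_config = BrowserConfig(headless=True, verbose=True)

    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,                                 # always fetch fresh content
        js_code="window.scrollTo(0, document.body.scrollHeight);",   # trigger lazy loading
        wait_for="css:.product",                                     # wait until products render
        screenshot=True,
        extraction_strategy=JsonCssExtractionStrategy({
            "name": "Products",
            "baseSelector": ".product",
            "fields": [
                {"name": "title", "selector": "h2", "type": "text"},
                {"name": "price", "selector": ".price", "type": "text"},
            ],
        }),
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        crawler.crawler_strategy.set_hook("on_page_context_created", block_images)
        result = await crawler.arun("https://shop.example.com/products", config=run_config)  # placeholder URL

        print(result.markdown[:300])
        if result.extracted_content:                     # JSON string from the CSS strategy
            print(json.loads(result.extracted_content))
        print("Screenshot captured:", result.screenshot is not None)

asyncio.run(main())
```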
---

## 18. Further Resources & Community

- **Docs:** [https://crawl4ai.com](https://crawl4ai.com)
- **Issues & PRs:** [https://github.com/unclecode/crawl4ai/issues](https://github.com/unclecode/crawl4ai/issues)

Follow [@unclecode](https://x.com/unclecode) for news and community updates.

**Happy Crawling!** Leverage Crawl4AI to feed your AI models with clean, structured web data today.