|
# Crawl4AI Quick Start Guide: Your All-in-One AI-Ready Web Crawling & AI Integration Solution |
|
|
|
Crawl4AI, the **#1 trending GitHub repository**, streamlines web content extraction into AI-ready formats. Perfect for AI assistants, semantic search engines, or data pipelines, Crawl4AI transforms raw HTML into structured Markdown or JSON effortlessly. Integrate with LLMs, open-source models, or your own retrieval-augmented generation workflows. |
|
|
|
**What Crawl4AI is not:** |
|
|
|
Crawl4AI is not a replacement for traditional web scraping libraries, Selenium, or Playwright. It's not designed as a general-purpose web automation tool. Instead, Crawl4AI has a specific, focused goal: |
|
|
|
- To generate perfect, AI-friendly data (particularly for LLMs) from web content |
|
- To maximize speed and efficiency in data extraction and processing |
|
- To operate at scale, from Raspberry Pi to cloud infrastructures |
|
|
|
Crawl4AI is engineered with a "scale-first" mindset, aiming to handle millions of links while maintaining exceptional performance. It is optimized to:
|
|
|
1. Transform raw web content into structured, LLM-ready formats (Markdown/JSON) |
|
2. Implement intelligent extraction strategies to reduce reliance on costly API calls |
|
3. Provide a streamlined pipeline for AI data preparation and ingestion |
|
|
|
In essence, Crawl4AI bridges the gap between web content and AI systems, focusing on delivering high-quality, processed data rather than offering broad web automation capabilities. |
|
|
|
**Key Links:** |
|
|
|
- **Website:** [https://crawl4ai.com](https://crawl4ai.com) |
|
- **GitHub:** [https://github.com/unclecode/crawl4ai](https://github.com/unclecode/crawl4ai) |
|
- **Colab Notebook:** [Try on Google Colab](https://colab.research.google.com/drive/1SgRPrByQLzjRfwoRNq1wSGE9nYY_EE8C?usp=sharing) |
|
- **Quickstart Code Example:** [quickstart_async.config.py](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/quickstart_async.config.py) |
|
- **Examples Folder:** [Crawl4AI Examples](https://github.com/unclecode/crawl4ai/tree/main/docs/examples) |
|
|
|
--- |
|
|
|
## Table of Contents |
|
|
|
- [Crawl4AI Quick Start Guide: Your All-in-One AI-Ready Web Crawling \& AI Integration Solution](#crawl4ai-quick-start-guide-your-all-in-one-ai-ready-web-crawling--ai-integration-solution) |
|
- [Table of Contents](#table-of-contents) |
|
- [1. Introduction \& Key Concepts](#1-introduction--key-concepts) |
|
- [2. Installation \& Environment Setup](#2-installation--environment-setup) |
|
- [Test Your Installation](#test-your-installation) |
|
- [3. Core Concepts \& Configuration](#3-core-concepts--configuration) |
|
- [4. Basic Crawling \& Simple Extraction](#4-basic-crawling--simple-extraction) |
|
- [5. Markdown Generation \& AI-Optimized Output](#5-markdown-generation--ai-optimized-output) |
|
- [6. Structured Data Extraction (CSS, XPath, LLM)](#6-structured-data-extraction-css-xpath-llm) |
|
- [7. Advanced Extraction: LLM \& Open-Source Models](#7-advanced-extraction-llm--open-source-models) |
|
- [8. Page Interactions, JS Execution, \& Dynamic Content](#8-page-interactions-js-execution--dynamic-content) |
|
- [9. Media, Links, \& Metadata Handling](#9-media-links--metadata-handling) |
|
- [10. Authentication \& Identity Preservation](#10-authentication--identity-preservation) |
|
- [Manual Setup via User Data Directory](#manual-setup-via-user-data-directory) |
|
- [Using `storage_state`](#using-storage_state) |
|
- [11. Proxy \& Security Enhancements](#11-proxy--security-enhancements) |
|
- [12. Screenshots, PDFs \& File Downloads](#12-screenshots-pdfs--file-downloads) |
|
- [13. Caching \& Performance Optimization](#13-caching--performance-optimization) |
|
- [14. Hooks for Custom Logic](#14-hooks-for-custom-logic) |
|
- [15. Dockerization \& Scaling](#15-dockerization--scaling) |
|
- [16. Troubleshooting \& Common Pitfalls](#16-troubleshooting--common-pitfalls) |
|
- [17. Comprehensive End-to-End Example](#17-comprehensive-end-to-end-example) |
|
- [18. Further Resources \& Community](#18-further-resources--community) |
|
|
|
--- |
|
|
|
## 1. Introduction & Key Concepts |
|
|
|
Crawl4AI transforms websites into structured, AI-friendly data. It efficiently handles large-scale crawling, integrates with both proprietary and open-source LLMs, and optimizes content for semantic search or RAG pipelines. |
|
|
|
**Quick Test:** |
|
|
|
```python |
|
import asyncio |
|
from crawl4ai import AsyncWebCrawler |
|
|
|
async def test_run(): |
|
async with AsyncWebCrawler() as crawler: |
|
result = await crawler.arun("https://example.com") |
|
print(result.markdown) |
|
|
|
asyncio.run(test_run()) |
|
``` |
|
|
|
If you see Markdown output, everything is working! |
|
|
|
**More info:** [See /docs/introduction](#) or [1_introduction.ex.md](https://github.com/unclecode/crawl4ai/blob/main/introduction.ex.md) |
|
|
|
--- |
|
|
|
## 2. Installation & Environment Setup |
|
|
|
```bash |
|
# Install the package |
|
pip install crawl4ai |
|
crawl4ai-setup |
|
|
|
# Install Playwright with system dependencies (recommended) |
|
playwright install --with-deps # Installs all browsers |
|
|
|
# Or install specific browsers: |
|
playwright install --with-deps chrome # Recommended for Colab/Linux |
|
playwright install --with-deps firefox |
|
playwright install --with-deps webkit |
|
playwright install --with-deps chromium |
|
|
|
# Keep Playwright updated periodically |
|
playwright install |
|
``` |
|
|
|
> **Note**: For Google Colab and some Linux environments, use `chrome` instead of `chromium` - it tends to work more reliably. |
|
|
|
### Test Your Installation |
|
Try these one-liners: |
|
|
|
```bash
|
# Visible browser test |
|
python -c "from playwright.sync_api import sync_playwright; p = sync_playwright().start(); browser = p.chromium.launch(headless=False); page = browser.new_page(); page.goto('https://example.com'); input('Press Enter to close...')" |
|
|
|
# Headless test (for servers/CI) |
|
python -c "from playwright.sync_api import sync_playwright; p = sync_playwright().start(); browser = p.chromium.launch(headless=True); page = browser.new_page(); page.goto('https://example.com'); print(f'Title: {page.title()}'); browser.close()" |
|
``` |
|
|
|
You should see a browser window (in the visible test) loading example.com. If you get errors, try Firefox instead: install it with `playwright install --with-deps firefox` and swap `p.chromium` for `p.firefox` in the test.
|
|
|
|
|
**Try in Colab:** |
|
[Open Colab Notebook](https://colab.research.google.com/drive/1SgRPrByQLzjRfwoRNq1wSGE9nYY_EE8C?usp=sharing) |
|
|
|
**More info:** [See /docs/configuration](#) or [2_configuration.md](https://github.com/unclecode/crawl4ai/blob/main/configuration.md) |
|
|
|
--- |
|
|
|
## 3. Core Concepts & Configuration |
|
|
|
Use `AsyncWebCrawler`, `CrawlerRunConfig`, and `BrowserConfig` to control crawling. |
|
|
|
**Example config:** |
|
|
|
```python |
|
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig |
|
|
|
browser_config = BrowserConfig( |
|
headless=True, |
|
verbose=True, |
|
viewport_width=1080, |
|
viewport_height=600, |
|
text_mode=False, |
|
ignore_https_errors=True, |
|
java_script_enabled=True |
|
) |
|
|
|
run_config = CrawlerRunConfig( |
|
css_selector="article.main", |
|
word_count_threshold=50, |
|
excluded_tags=['nav','footer'], |
|
exclude_external_links=True, |
|
wait_for="css:.article-loaded", |
|
page_timeout=60000, |
|
delay_before_return_html=1.0, |
|
mean_delay=0.1, |
|
max_range=0.3, |
|
process_iframes=True, |
|
remove_overlay_elements=True, |
|
js_code=""" |
|
(async () => { |
|
window.scrollTo(0, document.body.scrollHeight); |
|
await new Promise(r => setTimeout(r, 2000)); |
|
document.querySelector('.load-more')?.click(); |
|
})(); |
|
""" |
|
) |
|
|
|
# Cache modes (import with: from crawl4ai import CacheMode):
# ENABLED, DISABLED, BYPASS, READ_ONLY, WRITE_ONLY
# run_config.cache_mode = CacheMode.ENABLED
|
``` |
|
|
|
**Prefixes:** |
|
|
|
- `http://` or `https://` for live web pages

- `file://local.html` for local HTML files

- `raw:<html>` for raw HTML strings (a sketch follows below)
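
For example, a minimal sketch of the `raw:` prefix, useful when the HTML is already in memory (the prefix usage here follows the description above):

```python
import asyncio

from crawl4ai import AsyncWebCrawler

async def crawl_raw_html():
    html = "<html><body><h1>Hello</h1><p>No network request needed.</p></body></html>"
    async with AsyncWebCrawler() as crawler:
        # "raw:" tells the crawler to treat the string itself as the page source
        result = await crawler.arun(f"raw:{html}")
        print(result.markdown)

asyncio.run(crawl_raw_html())
```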
|
|
|
**More info:** [See /docs/async_webcrawler](#) or [3_async_webcrawler.ex.md](https://github.com/unclecode/crawl4ai/blob/main/async_webcrawler.ex.md) |
|
|
|
--- |
|
|
|
## 4. Basic Crawling & Simple Extraction |
|
|
|
```python |
|
async with AsyncWebCrawler(config=browser_config) as crawler: |
|
result = await crawler.arun("https://news.example.com/article", config=run_config) |
|
print(result.markdown) # Basic markdown content |
|
``` |
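
`result` also carries status fields you can check before using the output. A small sketch (the `success`, `status_code`, and `error_message` attributes are taken from the `CrawlResult` model; confirm against your installed version):

```python
async with AsyncWebCrawler(config=browser_config) as crawler:
    result = await crawler.arun("https://news.example.com/article", config=run_config)
    if result.success:
        print("HTTP status:", result.status_code)
        print(result.markdown[:500])   # preview the first 500 characters
    else:
        print("Crawl failed:", result.error_message)
```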
|
|
|
**More info:** [See /docs/browser_context_page](#) or [4_browser_context_page.ex.md](https://github.com/unclecode/crawl4ai/blob/main/browser_context_page.ex.md) |
|
|
|
--- |
|
|
|
## 5. Markdown Generation & AI-Optimized Output |
|
|
|
After crawling, `result.markdown_v2` provides: |
|
|
|
- `raw_markdown`: Unfiltered markdown |
|
- `markdown_with_citations`: Links as references at the bottom |
|
- `references_markdown`: A separate list of reference links |
|
- `fit_markdown`: Filtered, relevant markdown (e.g., after BM25) |
|
- `fit_html`: The HTML used to produce `fit_markdown` |
|
|
|
**Example:** |
|
|
|
```python |
|
print("RAW:", result.markdown_v2.raw_markdown[:200]) |
|
print("CITED:", result.markdown_v2.markdown_with_citations[:200]) |
|
print("REFERENCES:", result.markdown_v2.references_markdown) |
|
print("FIT MARKDOWN:", result.markdown_v2.fit_markdown) |
|
``` |
|
|
|
For AI training, `fit_markdown` focuses on the most relevant content. |
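
To influence what ends up in `fit_markdown`, you can attach a markdown generator with a content filter in the run config. A hedged sketch (the module paths and the `DefaultMarkdownGenerator`/`BM25ContentFilter` names reflect my reading of the library and may differ between versions):

```python
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
from crawl4ai.content_filter_strategy import BM25ContentFilter

# Keep only content that scores well against the query, then derive fit_markdown from it
run_config.markdown_generator = DefaultMarkdownGenerator(
    content_filter=BM25ContentFilter(user_query="machine learning tutorials")
)
```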
|
|
|
**More info:** [See /docs/markdown_generation](#) or [5_markdown_generation.ex.md](https://github.com/unclecode/crawl4ai/blob/main/markdown_generation.ex.md) |
|
|
|
--- |
|
|
|
## 6. Structured Data Extraction (CSS, XPath, LLM) |
|
|
|
Extract JSON data without LLMs: |
|
|
|
**CSS:** |
|
|
|
```python |
|
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy |
|
|
|
schema = { |
|
"name": "Products", |
|
"baseSelector": ".product", |
|
"fields": [ |
|
{"name": "title", "selector": "h2", "type": "text"}, |
|
{"name": "price", "selector": ".price", "type": "text"} |
|
] |
|
} |
|
run_config.extraction_strategy = JsonCssExtractionStrategy(schema) |
|
``` |
|
|
|
**XPath:** |
|
|
|
```python |
|
from crawl4ai.extraction_strategy import JsonXPathExtractionStrategy |
|
|
|
xpath_schema = { |
|
"name": "Articles", |
|
"baseSelector": "//div[@class='article']", |
|
"fields": [ |
|
{"name":"headline","selector":".//h1","type":"text"}, |
|
{"name":"summary","selector":".//p[@class='summary']","type":"text"} |
|
] |
|
} |
|
run_config.extraction_strategy = JsonXPathExtractionStrategy(xpath_schema) |
|
``` |
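
With either strategy, the structured output is returned in `result.extracted_content` as a JSON string. A minimal sketch of reading it back (the product URL is illustrative):

```python
import json

async with AsyncWebCrawler(config=browser_config) as crawler:
    result = await crawler.arun("https://shop.example.com/products", config=run_config)
    if result.extracted_content:
        items = json.loads(result.extracted_content)   # list of dicts matching the schema
        print(f"Extracted {len(items)} items")
```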
|
|
|
**More info:** [See /docs/extraction_strategies](#) or [7_extraction_strategies.ex.md](https://github.com/unclecode/crawl4ai/blob/main/extraction_strategies.ex.md) |
|
|
|
--- |
|
|
|
## 7. Advanced Extraction: LLM & Open-Source Models |
|
|
|
Use `LLMExtractionStrategy` for complex extraction tasks. It works with OpenAI or open-source models (e.g., served via Ollama).
|
|
|
```python |
|
from pydantic import BaseModel |
|
from crawl4ai.extraction_strategy import LLMExtractionStrategy |
|
|
|
class TravelData(BaseModel): |
|
destination: str |
|
attractions: list |
|
|
|
run_config.extraction_strategy = LLMExtractionStrategy( |
|
provider="ollama/nemotron", |
|
schema=TravelData.schema(), |
|
instruction="Extract destination and top attractions." |
|
) |
|
``` |
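
For hosted models, the same strategy takes a provider string plus an API token. A sketch assuming the `provider`/`api_token` parameters shown in the project examples (newer releases may expect an `LLMConfig` object instead):

```python
import os

run_config.extraction_strategy = LLMExtractionStrategy(
    provider="openai/gpt-4o-mini",           # LiteLLM-style provider string
    api_token=os.getenv("OPENAI_API_KEY"),   # read the key from the environment
    schema=TravelData.schema(),
    instruction="Extract destination and top attractions."
)
```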
|
|
|
**More info:** [See /docs/extraction_strategies](#) or [7_extraction_strategies.ex.md](https://github.com/unclecode/crawl4ai/blob/main/extraction_strategies.ex.md) |
|
|
|
--- |
|
|
|
## 8. Page Interactions, JS Execution, & Dynamic Content |
|
|
|
Insert `js_code` and use `wait_for` to ensure content loads. Example: |
|
|
|
```python |
|
run_config.js_code = """ |
|
(async () => { |
|
document.querySelector('.load-more')?.click(); |
|
await new Promise(r => setTimeout(r, 2000)); |
|
})(); |
|
""" |
|
run_config.wait_for = "css:.item-loaded" |
|
``` |
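
For multi-step interactions, such as clicking "load more" several times, you can keep the same tab open across calls with a session. A sketch assuming the `session_id` and `js_only` options behave as described in the project docs:

```python
first_page = CrawlerRunConfig(session_id="news_feed", wait_for="css:.item-loaded")

next_page = CrawlerRunConfig(
    session_id="news_feed",                                       # reuse the same browser tab
    js_code="document.querySelector('.load-more')?.click();",
    wait_for="css:.item-loaded",
    js_only=True                                                  # run JS in place, no re-navigation
)

async with AsyncWebCrawler(config=browser_config) as crawler:
    batch_1 = await crawler.arun("https://news.example.com", config=first_page)
    batch_2 = await crawler.arun("https://news.example.com", config=next_page)
```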
|
|
|
**More info:** [See /docs/page_interaction](#) or [11_page_interaction.md](https://github.com/unclecode/crawl4ai/blob/main/page_interaction.md) |
|
|
|
--- |
|
|
|
## 9. Media, Links, & Metadata Handling |
|
|
|
`result.media["images"]`: List of images with `src`, `score`, `alt`. Score indicates relevance. |
|
|
|
`result.media["videos"]`, `result.media["audios"]` similarly hold media info. |
|
|
|
`result.links["internal"]`, `result.links["external"]`, `result.links["social"]`: Categorized links. Each link has `href`, `text`, `context`, `type`. |
|
|
|
`result.metadata`: Title, description, keywords, author. |
|
|
|
**Example:** |
|
|
|
```python |
|
# Images |
|
for img in result.media["images"]: |
|
print("Image:", img["src"], "Score:", img["score"], "Alt:", img.get("alt","N/A")) |
|
|
|
# Links |
|
for link in result.links["external"]: |
|
print("External Link:", link["href"], "Text:", link["text"]) |
|
|
|
# Metadata |
|
print("Page Title:", result.metadata["title"]) |
|
print("Description:", result.metadata["description"]) |
|
``` |
|
|
|
**More info:** [See /docs/content_selection](#) or [8_content_selection.ex.md](https://github.com/unclecode/crawl4ai/blob/main/content_selection.ex.md) |
|
|
|
--- |
|
|
|
## 10. Authentication & Identity Preservation |
|
|
|
### Manual Setup via User Data Directory |
|
|
|
1. **Open Chrome with a custom user data dir:** |
|
|
|
```bash |
|
"C:\Program Files\Google\Chrome\Application\chrome.exe" --user-data-dir="C:\MyChromeProfile" |
|
``` |
|
|
|
On macOS: |
|
|
|
```bash |
|
"/Applications/Google Chrome.app/Contents/MacOS/Google Chrome" --user-data-dir="/Users/username/ChromeProfiles/MyProfile" |
|
``` |
|
|
|
2. **Log in to sites, solve CAPTCHAs, adjust settings manually.** |
|
The browser saves cookies/localStorage in that directory. |
|
|
|
3. **Use `user_data_dir` in `BrowserConfig`:** |
|
|
|
```python |
|
browser_config = BrowserConfig( |
|
headless=True, |
|
user_data_dir="/Users/username/ChromeProfiles/MyProfile" |
|
) |
|
``` |
|
|
|
Now the crawler starts with those cookies, sessions, etc. |
|
|
|
### Using `storage_state` |
|
|
|
Alternatively, export and reuse storage states: |
|
|
|
```python |
|
browser_config = BrowserConfig( |
|
headless=True, |
|
storage_state="mystate.json" # Pre-saved state |
|
) |
|
``` |
|
|
|
No repeated logins needed. |
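
To produce `mystate.json` in the first place, one option is to log in once with plain Playwright and export the context's storage state (the login flow itself is manual here):

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    context = browser.new_context()
    page = context.new_page()
    page.goto("https://example.com/login")
    input("Log in manually, then press Enter...")   # complete the login by hand
    context.storage_state(path="mystate.json")      # saves cookies + localStorage
    browser.close()
```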
|
|
|
**More info:** [See /docs/storage_state](#) or [16_storage_state.md](https://github.com/unclecode/crawl4ai/blob/main/storage_state.md) |
|
|
|
--- |
|
|
|
## 11. Proxy & Security Enhancements |
|
|
|
Use `proxy_config` for authenticated proxies: |
|
|
|
```python |
|
browser_config.proxy_config = { |
|
"server": "http://proxy.example.com:8080", |
|
"username": "proxyuser", |
|
"password": "proxypass" |
|
} |
|
``` |
|
|
|
Combine with `headers` or `ignore_https_errors` as needed. |
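
For instance, a sketch of pairing the proxy with custom headers (assuming `BrowserConfig` accepts a `headers` dict, as in the project examples):

```python
browser_config = BrowserConfig(
    headless=True,
    ignore_https_errors=True,                        # tolerate interception by the proxy
    headers={"Accept-Language": "en-US,en;q=0.9"},   # sent with every request
    proxy_config={
        "server": "http://proxy.example.com:8080",
        "username": "proxyuser",
        "password": "proxypass"
    }
)
```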
|
|
|
**More info:** [See /docs/proxy_security](#) or [14_proxy_security.md](https://github.com/unclecode/crawl4ai/blob/main/proxy_security.md) |
|
|
|
--- |
|
|
|
## 12. Screenshots, PDFs & File Downloads |
|
|
|
Enable `screenshot=True` or `pdf=True` in `CrawlerRunConfig`: |
|
|
|
```python |
|
run_config.screenshot = True |
|
run_config.pdf = True |
|
``` |
|
|
|
After crawling: |
|
|
|
```python |
|
if result.screenshot: |
|
with open("page.png", "wb") as f: |
|
f.write(result.screenshot) |
|
|
|
if result.pdf: |
|
with open("page.pdf", "wb") as f: |
|
f.write(result.pdf) |
|
``` |
|
|
|
**File Downloads:** |
|
|
|
```python |
|
browser_config.accept_downloads = True |
|
browser_config.downloads_path = "./downloads" |
|
run_config.js_code = """document.querySelector('a.download')?.click();""" |
|
|
|
# After crawl: |
|
print("Downloaded files:", result.downloaded_files) |
|
``` |
|
|
|
**More info:** [See /docs/screenshot_and_pdf_export](#) or [15_screenshot_and_pdf_export.md](https://github.com/unclecode/crawl4ai/blob/main/screenshot_and_pdf_export.md) |
|
Also [10_file_download.md](https://github.com/unclecode/crawl4ai/blob/main/file_download.md) |
|
|
|
--- |
|
|
|
## 13. Caching & Performance Optimization |
|
|
|
Set `cache_mode` to reuse fetch results: |
|
|
|
```python |
|
from crawl4ai import CacheMode |
|
run_config.cache_mode = CacheMode.ENABLED |
|
``` |
|
|
|
Adjust delays, increase concurrency, or use `text_mode=True` for faster extraction. |
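
For concurrency, `arun_many()` crawls a batch of URLs with the same configuration. A sketch (rate-limiting and dispatcher options vary by version, so treat this as the minimal form):

```python
urls = [
    "https://example.com/page1",
    "https://example.com/page2",
    "https://example.com/page3"
]

async with AsyncWebCrawler(config=browser_config) as crawler:
    results = await crawler.arun_many(urls, config=run_config)
    for r in results:
        print(r.url, "->", "ok" if r.success else r.error_message)
```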
|
|
|
**More info:** [See /docs/cache_modes](#) or [9_cache_modes.md](https://github.com/unclecode/crawl4ai/blob/main/cache_modes.md) |
|
|
|
--- |
|
|
|
## 14. Hooks for Custom Logic |
|
|
|
Hooks let you run custom code at specific points in the crawler lifecycle, without creating pages manually in `on_browser_created`.

Use `on_page_context_created` to apply routing rules or otherwise adjust the page and context before the URL is crawled:
|
|
|
**Example Hook:** |
|
|
|
```python |
|
async def on_page_context_created_hook(context, page, **kwargs): |
|
# Block all images to speed up load |
|
await context.route("**/*.{png,jpg,jpeg}", lambda route: route.abort()) |
|
print("[HOOK] Image requests blocked") |
|
|
|
async with AsyncWebCrawler(config=browser_config) as crawler: |
|
crawler.crawler_strategy.set_hook("on_page_context_created", on_page_context_created_hook) |
|
result = await crawler.arun("https://imageheavy.example.com", config=run_config) |
|
print("Crawl finished with images blocked.") |
|
``` |
|
|
|
This hook doesn't create a separate page; it simply adjusts the current context and page before the crawl proceeds.
|
|
|
**More info:** [See /docs/hooks_auth](#) or [13_hooks_auth.md](https://github.com/unclecode/crawl4ai/blob/main/hooks_auth.md) |
|
|
|
--- |
|
|
|
## 15. Dockerization & Scaling |
|
|
|
Use Docker images: |
|
|
|
- AMD64 basic: |
|
|
|
```bash |
|
docker pull unclecode/crawl4ai:basic-amd64 |
|
docker run -p 11235:11235 unclecode/crawl4ai:basic-amd64 |
|
``` |
|
|
|
- ARM64 for M1/M2: |
|
|
|
```bash |
|
docker pull unclecode/crawl4ai:basic-arm64 |
|
docker run -p 11235:11235 unclecode/crawl4ai:basic-arm64 |
|
``` |
|
|
|
- GPU support: |
|
|
|
```bash |
|
docker pull unclecode/crawl4ai:gpu-amd64 |
|
docker run --gpus all -p 11235:11235 unclecode/crawl4ai:gpu-amd64 |
|
``` |
|
|
|
Scale with load balancers or Kubernetes. |
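
The container exposes a REST API on port 11235. A rough sketch of submitting a job from Python (the `/crawl` endpoint and payload shape are assumptions based on older releases; check the README for the image you pulled):

```python
import requests

base_url = "http://localhost:11235"

# Endpoint and payload are assumptions; verify against the Docker README for your version
resp = requests.post(f"{base_url}/crawl", json={"urls": "https://example.com"})
print(resp.status_code, resp.json())
```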
|
|
|
**More info:** See the Docker instructions in the [project README](https://github.com/unclecode/crawl4ai).
|
|
|
--- |
|
|
|
## 16. Troubleshooting & Common Pitfalls |
|
|
|
- Empty results? Relax filters (e.g., lower `word_count_threshold`) and double-check your selectors.

- Timeouts? Increase `page_timeout` or refine `wait_for`.

- CAPTCHAs? Use `user_data_dir` or `storage_state` after solving them manually.

- JS errors? Run in headful mode to debug visually (see the sketch below).
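
A quick headful debugging sketch (reusing the `run_config` from earlier sections):

```python
debug_browser = BrowserConfig(
    headless=False,   # watch the page render in a real window
    verbose=True      # print crawler progress details
)

async with AsyncWebCrawler(config=debug_browser) as crawler:
    result = await crawler.arun("https://problematic.example.com", config=run_config)
```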
|
|
|
Check [examples](https://github.com/unclecode/crawl4ai/tree/main/docs/examples) & [quickstart_async.config.py](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/quickstart_async.config.py) for more code. |
|
|
|
--- |
|
|
|
## 17. Comprehensive End-to-End Example |
|
|
|
Combine hooks, JS execution, PDF saving, LLM extraction—see [quickstart_async.config.py](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/quickstart_async.config.py) for a full example. |
|
|
|
--- |
|
|
|
## 18. Further Resources & Community |
|
|
|
- **Docs:** [https://crawl4ai.com](https://crawl4ai.com) |
|
- **Issues & PRs:** [https://github.com/unclecode/crawl4ai/issues](https://github.com/unclecode/crawl4ai/issues) |
|
|
|
Follow [@unclecode](https://x.com/unclecode) for news & community updates. |
|
|
|
**Happy Crawling!** |
|
Leverage Crawl4AI to feed your AI models with clean, structured web data today. |
|
|