# Crawl4AI Quick Start Guide: Your All-in-One AI-Ready Web Crawling & AI Integration Solution

Crawl4AI, the **#1 trending GitHub repository**, streamlines web content extraction into AI-ready formats. Perfect for AI assistants, semantic search engines, or data pipelines, Crawl4AI transforms raw HTML into structured Markdown or JSON effortlessly. Integrate it with LLMs, open-source models, or your own retrieval-augmented generation workflows.

**What Crawl4AI is not:**

Crawl4AI is not a replacement for traditional web scraping libraries, Selenium, or Playwright, and it is not designed as a general-purpose web automation tool. Instead, Crawl4AI has a specific, focused goal:

- To generate perfect, AI-friendly data (particularly for LLMs) from web content
- To maximize speed and efficiency in data extraction and processing
- To operate at scale, from a Raspberry Pi to cloud infrastructure

Crawl4AI is engineered with a "scale-first" mindset, aiming to handle millions of links while maintaining exceptional performance. It is optimized to:

1. Transform raw web content into structured, LLM-ready formats (Markdown/JSON)
2. Implement intelligent extraction strategies to reduce reliance on costly API calls
3. Provide a streamlined pipeline for AI data preparation and ingestion

In essence, Crawl4AI bridges the gap between web content and AI systems, focusing on delivering high-quality, processed data rather than offering broad web automation capabilities.

**Key Links:**

- **Website:** [https://crawl4ai.com](https://crawl4ai.com)
- **GitHub:** [https://github.com/unclecode/crawl4ai](https://github.com/unclecode/crawl4ai)
- **Colab Notebook:** [Try on Google Colab](https://colab.research.google.com/drive/1SgRPrByQLzjRfwoRNq1wSGE9nYY_EE8C?usp=sharing)
- **Quickstart Code Example:** [quickstart_async.config.py](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/quickstart_async.config.py)
- **Examples Folder:** [Crawl4AI Examples](https://github.com/unclecode/crawl4ai/tree/main/docs/examples)

---
## Table of Contents

- [Crawl4AI Quick Start Guide: Your All-in-One AI-Ready Web Crawling \& AI Integration Solution](#crawl4ai-quick-start-guide-your-all-in-one-ai-ready-web-crawling--ai-integration-solution)
- [Table of Contents](#table-of-contents)
- [1. Introduction \& Key Concepts](#1-introduction--key-concepts)
- [2. Installation \& Environment Setup](#2-installation--environment-setup)
  - [Test Your Installation](#test-your-installation)
- [3. Core Concepts \& Configuration](#3-core-concepts--configuration)
- [4. Basic Crawling \& Simple Extraction](#4-basic-crawling--simple-extraction)
- [5. Markdown Generation \& AI-Optimized Output](#5-markdown-generation--ai-optimized-output)
- [6. Structured Data Extraction (CSS, XPath, LLM)](#6-structured-data-extraction-css-xpath-llm)
- [7. Advanced Extraction: LLM \& Open-Source Models](#7-advanced-extraction-llm--open-source-models)
- [8. Page Interactions, JS Execution, \& Dynamic Content](#8-page-interactions-js-execution--dynamic-content)
- [9. Media, Links, \& Metadata Handling](#9-media-links--metadata-handling)
- [10. Authentication \& Identity Preservation](#10-authentication--identity-preservation)
  - [Manual Setup via User Data Directory](#manual-setup-via-user-data-directory)
  - [Using `storage_state`](#using-storage_state)
- [11. Proxy \& Security Enhancements](#11-proxy--security-enhancements)
- [12. Screenshots, PDFs \& File Downloads](#12-screenshots-pdfs--file-downloads)
- [13. Caching \& Performance Optimization](#13-caching--performance-optimization)
- [14. Hooks for Custom Logic](#14-hooks-for-custom-logic)
- [15. Dockerization \& Scaling](#15-dockerization--scaling)
- [16. Troubleshooting \& Common Pitfalls](#16-troubleshooting--common-pitfalls)
- [17. Comprehensive End-to-End Example](#17-comprehensive-end-to-end-example)
- [18. Further Resources \& Community](#18-further-resources--community)

---
## 1. Introduction & Key Concepts

Crawl4AI transforms websites into structured, AI-friendly data. It efficiently handles large-scale crawling, integrates with both proprietary and open-source LLMs, and optimizes content for semantic search or RAG pipelines.

**Quick Test:**

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def test_run():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com")
        print(result.markdown)

asyncio.run(test_run())
```

If you see Markdown output, everything is working!

**More info:** [See /docs/introduction](#) or [1_introduction.ex.md](https://github.com/unclecode/crawl4ai/blob/main/introduction.ex.md)

---
## 2. Installation & Environment Setup

```bash
# Install the package
pip install crawl4ai
crawl4ai-setup

# Install Playwright with system dependencies (recommended)
playwright install --with-deps           # Installs all browsers

# Or install specific browsers:
playwright install --with-deps chrome    # Recommended for Colab/Linux
playwright install --with-deps firefox
playwright install --with-deps webkit
playwright install --with-deps chromium

# Keep Playwright updated periodically
playwright install
```

> **Note**: For Google Colab and some Linux environments, use `chrome` instead of `chromium`; it tends to work more reliably.

### Test Your Installation

Try these one-liners:

```bash
# Visible browser test
python -c "from playwright.sync_api import sync_playwright; p = sync_playwright().start(); browser = p.chromium.launch(headless=False); page = browser.new_page(); page.goto('https://example.com'); input('Press Enter to close...')"

# Headless test (for servers/CI)
python -c "from playwright.sync_api import sync_playwright; p = sync_playwright().start(); browser = p.chromium.launch(headless=True); page = browser.new_page(); page.goto('https://example.com'); print(f'Title: {page.title()}'); browser.close()"
```

You should see a browser window (in the visible test) loading example.com. If you get errors, try Firefox with `playwright install --with-deps firefox`.

**Try in Colab:**
[Open Colab Notebook](https://colab.research.google.com/drive/1SgRPrByQLzjRfwoRNq1wSGE9nYY_EE8C?usp=sharing)

**More info:** [See /docs/configuration](#) or [2_configuration.md](https://github.com/unclecode/crawl4ai/blob/main/configuration.md)

---
## 3. Core Concepts & Configuration

Use `AsyncWebCrawler`, `CrawlerRunConfig`, and `BrowserConfig` to control crawling.

**Example config:**

```python
from crawl4ai import CacheMode
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig

browser_config = BrowserConfig(
    headless=True,
    verbose=True,
    viewport_width=1080,
    viewport_height=600,
    text_mode=False,
    ignore_https_errors=True,
    java_script_enabled=True
)

run_config = CrawlerRunConfig(
    css_selector="article.main",
    word_count_threshold=50,
    excluded_tags=['nav', 'footer'],
    exclude_external_links=True,
    wait_for="css:.article-loaded",
    page_timeout=60000,
    delay_before_return_html=1.0,
    mean_delay=0.1,
    max_range=0.3,
    process_iframes=True,
    remove_overlay_elements=True,
    js_code="""
    (async () => {
        window.scrollTo(0, document.body.scrollHeight);
        await new Promise(r => setTimeout(r, 2000));
        document.querySelector('.load-more')?.click();
    })();
    """
)

# Cache modes: ENABLED, DISABLED, BYPASS, READ_ONLY, WRITE_ONLY
# run_config.cache_mode = CacheMode.ENABLED
```

**Prefixes:**

- `http://` or `https://` for live pages
- `file://local.html` for local files
- `raw:<html>` for raw HTML strings
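
The `raw:` and `file://` prefixes let the same pipeline run on HTML you already have on hand, with no network request. A minimal sketch (the inline HTML and the file name are placeholders):

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def crawl_local_sources():
    async with AsyncWebCrawler() as crawler:
        # Crawl a raw HTML string directly
        raw_result = await crawler.arun("raw:<html><body><h1>Hello</h1><p>Inline HTML</p></body></html>")
        print(raw_result.markdown)

        # Crawl a local HTML file (path is a placeholder, mirroring the file://local.html form above)
        file_result = await crawler.arun("file://saved_page.html")
        print(file_result.markdown)

asyncio.run(crawl_local_sources())
```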
**More info:** [See /docs/async_webcrawler](#) or [3_async_webcrawler.ex.md](https://github.com/unclecode/crawl4ai/blob/main/async_webcrawler.ex.md)

---
## 4. Basic Crawling & Simple Extraction

```python
async with AsyncWebCrawler(config=browser_config) as crawler:
    result = await crawler.arun("https://news.example.com/article", config=run_config)
    print(result.markdown)  # Basic markdown content
```
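
In practice, check that the crawl succeeded before using the output. A small sketch continuing the fragment above (it assumes the result object exposes `success` and `error_message`, as in the project's examples; the URL is a placeholder):

```python
async with AsyncWebCrawler(config=browser_config) as crawler:
    result = await crawler.arun("https://news.example.com/article", config=run_config)
    if result.success:
        print(result.markdown[:300])   # preview the first few hundred characters
    else:
        print("Crawl failed:", result.error_message)
```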
**More info:** [See /docs/browser_context_page](#) or [4_browser_context_page.ex.md](https://github.com/unclecode/crawl4ai/blob/main/browser_context_page.ex.md)

---
## 5. Markdown Generation & AI-Optimized Output

After crawling, `result.markdown_v2` provides:

- `raw_markdown`: Unfiltered markdown
- `markdown_with_citations`: Links as references at the bottom
- `references_markdown`: A separate list of reference links
- `fit_markdown`: Filtered, relevant markdown (e.g., after BM25)
- `fit_html`: The HTML used to produce `fit_markdown`

**Example:**

```python
print("RAW:", result.markdown_v2.raw_markdown[:200])
print("CITED:", result.markdown_v2.markdown_with_citations[:200])
print("REFERENCES:", result.markdown_v2.references_markdown)
print("FIT MARKDOWN:", result.markdown_v2.fit_markdown)
```

For AI training, `fit_markdown` focuses on the most relevant content.
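
To populate `fit_markdown`, attach a markdown generator with a content filter to the run config. A hedged sketch using the BM25 filtering mentioned above (class and module names follow the project's markdown generation docs; verify them against your installed version, and the query is a placeholder):

```python
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
from crawl4ai.content_filter_strategy import BM25ContentFilter

# Score the page content against a query before markdown conversion,
# so fit_markdown keeps only the passages relevant to that query.
run_config.markdown_generator = DefaultMarkdownGenerator(
    content_filter=BM25ContentFilter(user_query="best travel destinations")
)
```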
**More info:** [See /docs/markdown_generation](#) or [5_markdown_generation.ex.md](https://github.com/unclecode/crawl4ai/blob/main/markdown_generation.ex.md)

---
## 6. Structured Data Extraction (CSS, XPath, LLM)

Extract JSON data without LLMs:

**CSS:**

```python
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

schema = {
    "name": "Products",
    "baseSelector": ".product",
    "fields": [
        {"name": "title", "selector": "h2", "type": "text"},
        {"name": "price", "selector": ".price", "type": "text"}
    ]
}
run_config.extraction_strategy = JsonCssExtractionStrategy(schema)
```

**XPath:**

```python
from crawl4ai.extraction_strategy import JsonXPathExtractionStrategy

xpath_schema = {
    "name": "Articles",
    "baseSelector": "//div[@class='article']",
    "fields": [
        {"name": "headline", "selector": ".//h1", "type": "text"},
        {"name": "summary", "selector": ".//p[@class='summary']", "type": "text"}
    ]
}
run_config.extraction_strategy = JsonXPathExtractionStrategy(xpath_schema)
```
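
Either way, the extracted records come back on the result as a JSON string in `extracted_content`, so a quick parse gives you plain Python objects. A short sketch continuing the fragment style above (the URL is a placeholder):

```python
import json

async with AsyncWebCrawler(config=browser_config) as crawler:
    result = await crawler.arun("https://shop.example.com/catalog", config=run_config)
    products = json.loads(result.extracted_content)  # list of dicts matching the schema fields
    for product in products[:5]:
        print(product["title"], "-", product["price"])
```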
**More info:** [See /docs/extraction_strategies](#) or [7_extraction_strategies.ex.md](https://github.com/unclecode/crawl4ai/blob/main/extraction_strategies.ex.md)

---
## 7. Advanced Extraction: LLM & Open-Source Models

Use `LLMExtractionStrategy` for complex tasks. It works with OpenAI or open-source models (e.g., via Ollama).

```python
from pydantic import BaseModel
from crawl4ai.extraction_strategy import LLMExtractionStrategy

class TravelData(BaseModel):
    destination: str
    attractions: list

run_config.extraction_strategy = LLMExtractionStrategy(
    provider="ollama/nemotron",
    schema=TravelData.schema(),
    instruction="Extract destination and top attractions."
)
```
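
The same strategy can point at a hosted model; in that case you also pass an API key. A hedged sketch (the provider string follows the same `provider/model` pattern as the Ollama example above, but the exact model name and environment variable are assumptions):

```python
import os

# Swap the local Ollama model for a hosted one; api_token is required for hosted providers.
run_config.extraction_strategy = LLMExtractionStrategy(
    provider="openai/gpt-4o-mini",                # assumed model name
    api_token=os.getenv("OPENAI_API_KEY"),        # assumed to be set in your environment
    schema=TravelData.schema(),
    instruction="Extract destination and top attractions."
)
```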
**More info:** [See /docs/extraction_strategies](#) or [7_extraction_strategies.ex.md](https://github.com/unclecode/crawl4ai/blob/main/extraction_strategies.ex.md)

---
## 8. Page Interactions, JS Execution, & Dynamic Content

Insert `js_code` and use `wait_for` to ensure content loads. Example:

```python
run_config.js_code = """
(async () => {
    document.querySelector('.load-more')?.click();
    await new Promise(r => setTimeout(r, 2000));
})();
"""
run_config.wait_for = "css:.item-loaded"
```
**More info:** [See /docs/page_interaction](#) or [11_page_interaction.md](https://github.com/unclecode/crawl4ai/blob/main/page_interaction.md)

---
## 9. Media, Links, & Metadata Handling | |
`result.media["images"]`: List of images with `src`, `score`, `alt`. Score indicates relevance. | |
`result.media["videos"]`, `result.media["audios"]` similarly hold media info. | |
`result.links["internal"]`, `result.links["external"]`, `result.links["social"]`: Categorized links. Each link has `href`, `text`, `context`, `type`. | |
`result.metadata`: Title, description, keywords, author. | |
**Example:** | |
```python | |
# Images | |
for img in result.media["images"]: | |
print("Image:", img["src"], "Score:", img["score"], "Alt:", img.get("alt","N/A")) | |
# Links | |
for link in result.links["external"]: | |
print("External Link:", link["href"], "Text:", link["text"]) | |
# Metadata | |
print("Page Title:", result.metadata["title"]) | |
print("Description:", result.metadata["description"]) | |
``` | |
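
You can also prune media and links at crawl time instead of filtering afterwards. A hedged sketch using exclusion flags on the run config (flag names follow the content selection docs; verify them against your installed version, and the domain is a placeholder):

```python
run_config.exclude_external_images = True         # drop images hosted on other domains
run_config.exclude_social_media_links = True      # drop links to known social platforms
run_config.exclude_domains = ["ads.example.com"]  # drop links to specific domains
```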
**More info:** [See /docs/content_selection](#) or [8_content_selection.ex.md](https://github.com/unclecode/crawl4ai/blob/main/content_selection.ex.md)

---
## 10. Authentication & Identity Preservation

### Manual Setup via User Data Directory

1. **Open Chrome with a custom user data dir:**

   ```bash
   "C:\Program Files\Google\Chrome\Application\chrome.exe" --user-data-dir="C:\MyChromeProfile"
   ```

   On macOS:

   ```bash
   "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome" --user-data-dir="/Users/username/ChromeProfiles/MyProfile"
   ```

2. **Log in to sites, solve CAPTCHAs, adjust settings manually.**
   The browser saves cookies/localStorage in that directory.

3. **Use `user_data_dir` in `BrowserConfig`:**

   ```python
   browser_config = BrowserConfig(
       headless=True,
       user_data_dir="/Users/username/ChromeProfiles/MyProfile"
   )
   ```

   Now the crawler starts with those cookies, sessions, etc.

### Using `storage_state`

Alternatively, export and reuse storage states:

```python
browser_config = BrowserConfig(
    headless=True,
    storage_state="mystate.json"  # Pre-saved state
)
```

No repeated logins needed.
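
If you don't already have a `mystate.json`, one way to produce it is a one-off Playwright session where you log in manually and then save the context's storage state (this uses Playwright's standard `storage_state` API; the login URL is a placeholder):

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    context = browser.new_context()
    page = context.new_page()
    page.goto("https://example.com/login")
    input("Log in in the browser window, then press Enter here...")
    context.storage_state(path="mystate.json")  # cookies + localStorage saved for reuse
    browser.close()
```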
**More info:** [See /docs/storage_state](#) or [16_storage_state.md](https://github.com/unclecode/crawl4ai/blob/main/storage_state.md)

---
## 11. Proxy & Security Enhancements

Use `proxy_config` for authenticated proxies:

```python
browser_config.proxy_config = {
    "server": "http://proxy.example.com:8080",
    "username": "proxyuser",
    "password": "proxypass"
}
```

Combine with `headers` or `ignore_https_errors` as needed.

**More info:** [See /docs/proxy_security](#) or [14_proxy_security.md](https://github.com/unclecode/crawl4ai/blob/main/proxy_security.md)

---
## 12. Screenshots, PDFs & File Downloads

Enable `screenshot=True` or `pdf=True` in `CrawlerRunConfig`:

```python
run_config.screenshot = True
run_config.pdf = True
```

After crawling:

```python
if result.screenshot:
    with open("page.png", "wb") as f:
        f.write(result.screenshot)
if result.pdf:
    with open("page.pdf", "wb") as f:
        f.write(result.pdf)
```

**File Downloads:**

```python
browser_config.accept_downloads = True
browser_config.downloads_path = "./downloads"
run_config.js_code = """document.querySelector('a.download')?.click();"""
# After the crawl:
print("Downloaded files:", result.downloaded_files)
```

**More info:** [See /docs/screenshot_and_pdf_export](#) or [15_screenshot_and_pdf_export.md](https://github.com/unclecode/crawl4ai/blob/main/screenshot_and_pdf_export.md)
Also see [10_file_download.md](https://github.com/unclecode/crawl4ai/blob/main/file_download.md)

---
## 13. Caching & Performance Optimization

Set `cache_mode` to reuse fetched results:

```python
from crawl4ai import CacheMode

run_config.cache_mode = CacheMode.ENABLED
```

Adjust delays, increase concurrency, or use `text_mode=True` for faster extraction.
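
A quick way to see caching pay off is to crawl the same URL twice with `CacheMode.ENABLED` and compare wall-clock times: the first run fills the cache, the second is served from it. A small sketch (timings are only indicative):

```python
import asyncio
import time
from crawl4ai import AsyncWebCrawler, CacheMode
from crawl4ai.async_configs import CrawlerRunConfig

async def compare_cache():
    async with AsyncWebCrawler() as crawler:
        for label in ("first (fills cache)", "second (served from cache)"):
            start = time.perf_counter()
            await crawler.arun("https://example.com",
                               config=CrawlerRunConfig(cache_mode=CacheMode.ENABLED))
            print(f"{label}: {time.perf_counter() - start:.2f}s")

asyncio.run(compare_cache())
```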
**More info:** [See /docs/cache_modes](#) or [9_cache_modes.md](https://github.com/unclecode/crawl4ai/blob/main/cache_modes.md)

---
## 14. Hooks for Custom Logic

Hooks let you run code at specific points in the crawler lifecycle. Avoid creating pages manually in `on_browser_created`; instead, use `on_page_context_created` to apply routing or modify the page/context before the URL is crawled.

**Example Hook:**

```python
async def on_page_context_created_hook(context, page, **kwargs):
    # Block all images to speed up page loads
    await context.route("**/*.{png,jpg,jpeg}", lambda route: route.abort())
    print("[HOOK] Image requests blocked")

async with AsyncWebCrawler(config=browser_config) as crawler:
    crawler.crawler_strategy.set_hook("on_page_context_created", on_page_context_created_hook)
    result = await crawler.arun("https://imageheavy.example.com", config=run_config)
    print("Crawl finished with images blocked.")
```

This hook is clean and doesn't create a separate page itself; it just modifies the current context/page setup.
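
Other lifecycle hooks follow the same `set_hook(name, fn)` pattern. As one more hedged example, a `before_goto`-style hook can attach extra headers right before navigation (the hook name and signature here follow the project's hooks examples, so verify them against your version; the header value is illustrative):

```python
async def before_goto_hook(page, context, url, **kwargs):
    # Set a custom header before the crawler navigates to the URL
    await page.set_extra_http_headers({"X-Crawl-Source": "crawl4ai-quickstart"})
    print(f"[HOOK] Headers set before visiting {url}")

crawler.crawler_strategy.set_hook("before_goto", before_goto_hook)
```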
**More info:** [See /docs/hooks_auth](#) or [13_hooks_auth.md](https://github.com/unclecode/crawl4ai/blob/main/hooks_auth.md)

---
## 15. Dockerization & Scaling

Use the Docker images:

- AMD64 basic:

  ```bash
  docker pull unclecode/crawl4ai:basic-amd64
  docker run -p 11235:11235 unclecode/crawl4ai:basic-amd64
  ```

- ARM64 for M1/M2:

  ```bash
  docker pull unclecode/crawl4ai:basic-arm64
  docker run -p 11235:11235 unclecode/crawl4ai:basic-arm64
  ```

- GPU support:

  ```bash
  docker pull unclecode/crawl4ai:gpu-amd64
  docker run --gpus all -p 11235:11235 unclecode/crawl4ai:gpu-amd64
  ```

Scale with load balancers or Kubernetes.
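
Once a container is listening on port 11235, other services can talk to it over HTTP. The sketch below assumes REST routes like those described in the project README; the `/health` and `/crawl` paths and the request payload are assumptions, so check the README of the image you run for the exact API:

```python
import requests

BASE_URL = "http://localhost:11235"  # the port published by `docker run -p 11235:11235`

# Assumed health-check endpoint: confirms the containerized service is up.
print(requests.get(f"{BASE_URL}/health", timeout=10).json())

# Assumed crawl endpoint: submit a URL and read back the structured result.
response = requests.post(f"{BASE_URL}/crawl", json={"urls": "https://example.com"}, timeout=60)
print(response.json())
```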
**More info:** See the Docker instructions in the [project README](https://github.com/unclecode/crawl4ai)

---
## 16. Troubleshooting & Common Pitfalls

- Empty results? Relax filters and double-check your selectors.
- Timeouts? Increase `page_timeout` or refine `wait_for`.
- CAPTCHAs? Use `user_data_dir` or `storage_state` after solving them manually.
- JS errors? Debug in headful mode with verbose logging, as in the sketch below.
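
A minimal debug configuration for that last point: run the browser visibly with verbose output so you can watch the page and the crawler logs side by side (both flags appear in the configuration example in Section 3):

```python
from crawl4ai.async_configs import BrowserConfig

# Headful + verbose: watch the page render while the crawler prints what it is doing.
debug_browser_config = BrowserConfig(
    headless=False,
    verbose=True
)
```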
Check the [examples](https://github.com/unclecode/crawl4ai/tree/main/docs/examples) & [quickstart_async.config.py](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/quickstart_async.config.py) for more code.

---
## 17. Comprehensive End-to-End Example

Combine hooks, JS execution, PDF export, and LLM extraction in a single run; a condensed sketch follows, and [quickstart_async.config.py](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/quickstart_async.config.py) has a full example.
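
Below is a hedged, condensed sketch wiring several earlier sections together (the URL, selector, schema, and Ollama model are placeholders; each piece mirrors the per-section examples above):

```python
import asyncio
import json
from pydantic import BaseModel
from crawl4ai import AsyncWebCrawler, CacheMode
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy

class Article(BaseModel):
    title: str
    key_points: list

async def main():
    browser_config = BrowserConfig(headless=True, verbose=True)
    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.ENABLED,
        wait_for="css:.article-loaded",
        js_code="window.scrollTo(0, document.body.scrollHeight);",
        pdf=True,
        extraction_strategy=LLMExtractionStrategy(
            provider="ollama/nemotron",
            schema=Article.schema(),
            instruction="Extract the article title and its key points."
        ),
    )
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun("https://news.example.com/article", config=run_config)
        if result.success:
            print(json.loads(result.extracted_content))  # structured extraction output
            if result.pdf:
                with open("article.pdf", "wb") as f:       # PDF export from the same run
                    f.write(result.pdf)
        else:
            print("Crawl failed:", result.error_message)

asyncio.run(main())
```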
---
## 18. Further Resources & Community

- **Docs:** [https://crawl4ai.com](https://crawl4ai.com)
- **Issues & PRs:** [https://github.com/unclecode/crawl4ai/issues](https://github.com/unclecode/crawl4ai/issues)

Follow [@unclecode](https://x.com/unclecode) for news & community updates.

**Happy Crawling!**

Leverage Crawl4AI to feed your AI models with clean, structured web data today.