# Crawl4AI Quick Start Guide: Your All-in-One AI-Ready Web Crawling & AI Integration Solution

Crawl4AI, the **#1 trending GitHub repository**, streamlines web content extraction into AI-ready formats. Perfect for AI assistants, semantic search engines, or data pipelines, Crawl4AI transforms raw HTML into structured Markdown or JSON effortlessly. Integrate with LLMs, open-source models, or your own retrieval-augmented generation workflows.

**What Crawl4AI is not:**

Crawl4AI is not a replacement for traditional web scraping libraries, Selenium, or Playwright. It's not designed as a general-purpose web automation tool. Instead, Crawl4AI has a specific, focused goal:

-   To generate perfect, AI-friendly data (particularly for LLMs) from web content
-   To maximize speed and efficiency in data extraction and processing
-   To operate at scale, from Raspberry Pi to cloud infrastructures

Crawl4AI is engineered with a "scale-first" mindset, aiming to handle millions of links while maintaining exceptional performance. It is optimized to:

1. Transform raw web content into structured, LLM-ready formats (Markdown/JSON)
2. Implement intelligent extraction strategies to reduce reliance on costly API calls
3. Provide a streamlined pipeline for AI data preparation and ingestion

In essence, Crawl4AI bridges the gap between web content and AI systems, focusing on delivering high-quality, processed data rather than offering broad web automation capabilities.

**Key Links:**

-   **Website:** [https://crawl4ai.com](https://crawl4ai.com)
-   **GitHub:** [https://github.com/unclecode/crawl4ai](https://github.com/unclecode/crawl4ai)
-   **Colab Notebook:** [Try on Google Colab](https://colab.research.google.com/drive/1SgRPrByQLzjRfwoRNq1wSGE9nYY_EE8C?usp=sharing)
-   **Quickstart Code Example:** [quickstart_async.config.py](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/quickstart_async.config.py)
-   **Examples Folder:** [Crawl4AI Examples](https://github.com/unclecode/crawl4ai/tree/main/docs/examples)

---

## Table of Contents

- [Crawl4AI Quick Start Guide: Your All-in-One AI-Ready Web Crawling \& AI Integration Solution](#crawl4ai-quick-start-guide-your-all-in-one-ai-ready-web-crawling--ai-integration-solution)
  - [Table of Contents](#table-of-contents)
  - [1. Introduction \& Key Concepts](#1-introduction--key-concepts)
  - [2. Installation \& Environment Setup](#2-installation--environment-setup)
    - [Test Your Installation](#test-your-installation)
  - [3. Core Concepts \& Configuration](#3-core-concepts--configuration)
  - [4. Basic Crawling \& Simple Extraction](#4-basic-crawling--simple-extraction)
  - [5. Markdown Generation \& AI-Optimized Output](#5-markdown-generation--ai-optimized-output)
  - [6. Structured Data Extraction (CSS, XPath, LLM)](#6-structured-data-extraction-css-xpath-llm)
  - [7. Advanced Extraction: LLM \& Open-Source Models](#7-advanced-extraction-llm--open-source-models)
  - [8. Page Interactions, JS Execution, \& Dynamic Content](#8-page-interactions-js-execution--dynamic-content)
  - [9. Media, Links, \& Metadata Handling](#9-media-links--metadata-handling)
  - [10. Authentication \& Identity Preservation](#10-authentication--identity-preservation)
    - [Manual Setup via User Data Directory](#manual-setup-via-user-data-directory)
    - [Using `storage_state`](#using-storage_state)
  - [11. Proxy \& Security Enhancements](#11-proxy--security-enhancements)
  - [12. Screenshots, PDFs \& File Downloads](#12-screenshots-pdfs--file-downloads)
  - [13. Caching \& Performance Optimization](#13-caching--performance-optimization)
  - [14. Hooks for Custom Logic](#14-hooks-for-custom-logic)
  - [15. Dockerization \& Scaling](#15-dockerization--scaling)
  - [16. Troubleshooting \& Common Pitfalls](#16-troubleshooting--common-pitfalls)
  - [17. Comprehensive End-to-End Example](#17-comprehensive-end-to-end-example)
  - [18. Further Resources \& Community](#18-further-resources--community)

---

## 1. Introduction & Key Concepts

Crawl4AI transforms websites into structured, AI-friendly data. It efficiently handles large-scale crawling, integrates with both proprietary and open-source LLMs, and optimizes content for semantic search or RAG pipelines.

**Quick Test:**

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def test_run():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com")
        print(result.markdown)

asyncio.run(test_run())
```

If you see Markdown output, everything is working!

**More info:** [See /docs/introduction](#) or [1_introduction.ex.md](https://github.com/unclecode/crawl4ai/blob/main/introduction.ex.md)

---

## 2. Installation & Environment Setup

```bash
# Install the package
pip install crawl4ai
crawl4ai-setup

# Install Playwright with system dependencies (recommended)
playwright install --with-deps  # Installs all browsers

# Or install specific browsers:
playwright install --with-deps chrome  # Recommended for Colab/Linux
playwright install --with-deps firefox
playwright install --with-deps webkit
playwright install --with-deps chromium

# Keep Playwright updated periodically
playwright install
```

> **Note**: For Google Colab and some Linux environments, use `chrome` instead of `chromium`; it tends to work more reliably.

### Test Your Installation
Try these one-liners:

```bash
# Visible browser test
python -c "from playwright.sync_api import sync_playwright; p = sync_playwright().start(); browser = p.chromium.launch(headless=False); page = browser.new_page(); page.goto('https://example.com'); input('Press Enter to close...')"

# Headless test (for servers/CI)
python -c "from playwright.sync_api import sync_playwright; p = sync_playwright().start(); browser = p.chromium.launch(headless=True); page = browser.new_page(); page.goto('https://example.com'); print(f'Title: {page.title()}'); browser.close()"
```

You should see a browser window (in the visible test) loading example.com. If you get errors, try Firefox via `playwright install --with-deps firefox`.


**Try in Colab:**  
[Open Colab Notebook](https://colab.research.google.com/drive/1SgRPrByQLzjRfwoRNq1wSGE9nYY_EE8C?usp=sharing)

**More info:** [See /docs/configuration](#) or [2_configuration.md](https://github.com/unclecode/crawl4ai/blob/main/configuration.md)

---

## 3. Core Concepts & Configuration

Use `AsyncWebCrawler`, `CrawlerRunConfig`, and `BrowserConfig` to control crawling.

**Example config:**

```python
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig

browser_config = BrowserConfig(
    headless=True,
    verbose=True,
    viewport_width=1080,
    viewport_height=600,
    text_mode=False,
    ignore_https_errors=True,
    java_script_enabled=True
)

run_config = CrawlerRunConfig(
    css_selector="article.main",
    word_count_threshold=50,
    excluded_tags=['nav','footer'],
    exclude_external_links=True,
    wait_for="css:.article-loaded",
    page_timeout=60000,
    delay_before_return_html=1.0,
    mean_delay=0.1,
    max_range=0.3,
    process_iframes=True,
    remove_overlay_elements=True,
    js_code="""
        (async () => {
            window.scrollTo(0, document.body.scrollHeight);
            await new Promise(r => setTimeout(r, 2000));
            document.querySelector('.load-more')?.click();
        })();
    """
)

# Cache control (requires: from crawl4ai import CacheMode)
# Options: ENABLED, DISABLED, BYPASS, READ_ONLY, WRITE_ONLY
# run_config.cache_mode = CacheMode.ENABLED
```

**Prefixes:**

-   `http://` or `https://` for live pages
-   `file://` plus a local path for HTML files on disk
-   `raw:<html>` for raw HTML strings
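
For example, the `raw:` prefix turns an HTML string you already have into the same result object. A minimal sketch (the local-file path is illustrative):

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        # Crawl a raw HTML string instead of a live URL.
        result = await crawler.arun("raw:<html><body><h1>Hello</h1></body></html>")
        print(result.markdown)
        # Local-file variant (illustrative path):
        # result = await crawler.arun("file:///path/to/page.html")

asyncio.run(main())
```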

**More info:** [See /docs/async_webcrawler](#) or [3_async_webcrawler.ex.md](https://github.com/unclecode/crawl4ai/blob/main/async_webcrawler.ex.md)

---

## 4. Basic Crawling & Simple Extraction

```python
async with AsyncWebCrawler(config=browser_config) as crawler:
    result = await crawler.arun("https://news.example.com/article", config=run_config)
    print(result.markdown) # Basic markdown content
```

**More info:** [See /docs/browser_context_page](#) or [4_browser_context_page.ex.md](https://github.com/unclecode/crawl4ai/blob/main/browser_context_page.ex.md)

---

## 5. Markdown Generation & AI-Optimized Output

After crawling, `result.markdown_v2` provides:

-   `raw_markdown`: Unfiltered markdown
-   `markdown_with_citations`: Links as references at the bottom
-   `references_markdown`: A separate list of reference links
-   `fit_markdown`: Filtered, relevant markdown (e.g., after BM25)
-   `fit_html`: The HTML used to produce `fit_markdown`

**Example:**

```python
print("RAW:", result.markdown_v2.raw_markdown[:200])
print("CITED:", result.markdown_v2.markdown_with_citations[:200])
print("REFERENCES:", result.markdown_v2.references_markdown)
print("FIT MARKDOWN:", result.markdown_v2.fit_markdown)
```

For AI training, `fit_markdown` focuses on the most relevant content.
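
A minimal sketch of driving `fit_markdown` with a relevance filter, assuming the `DefaultMarkdownGenerator` and `BM25ContentFilter` classes (module paths may differ across versions):

```python
from crawl4ai.content_filter_strategy import BM25ContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

# Keep only content that scores well against the query, then render markdown.
run_config.markdown_generator = DefaultMarkdownGenerator(
    content_filter=BM25ContentFilter(user_query="product pricing")
)
# After arun(), result.markdown_v2.fit_markdown holds the filtered output.
```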

**More info:** [See /docs/markdown_generation](#) or [5_markdown_generation.ex.md](https://github.com/unclecode/crawl4ai/blob/main/markdown_generation.ex.md)

---

## 6. Structured Data Extraction (CSS, XPath, LLM)

Extract JSON data without LLMs:

**CSS:**

```python
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

schema = {
  "name": "Products",
  "baseSelector": ".product",
  "fields": [
    {"name": "title", "selector": "h2", "type": "text"},
    {"name": "price", "selector": ".price", "type": "text"}
  ]
}
run_config.extraction_strategy = JsonCssExtractionStrategy(schema)
```

**XPath:**

```python
from crawl4ai.extraction_strategy import JsonXPathExtractionStrategy

xpath_schema = {
  "name": "Articles",
  "baseSelector": "//div[@class='article']",
  "fields": [
    {"name":"headline","selector":".//h1","type":"text"},
    {"name":"summary","selector":".//p[@class='summary']","type":"text"}
  ]
}
run_config.extraction_strategy = JsonXPathExtractionStrategy(xpath_schema)
```
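
With either schema strategy, the extracted records come back as a JSON string in `result.extracted_content`. A short usage sketch, reusing the configs from Section 3 (the URL is a placeholder):

```python
import json

async with AsyncWebCrawler(config=browser_config) as crawler:
    result = await crawler.arun("https://shop.example.com", config=run_config)
    products = json.loads(result.extracted_content)  # JSON string -> list of dicts
    for product in products:
        print(product["title"], product["price"])
```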

**More info:** [See /docs/extraction_strategies](#) or [7_extraction_strategies.ex.md](https://github.com/unclecode/crawl4ai/blob/main/extraction_strategies.ex.md)

---

## 7. Advanced Extraction: LLM & Open-Source Models

Use `LLMExtractionStrategy` for complex extraction tasks. It works with OpenAI or open-source models (e.g., Ollama).

```python
from pydantic import BaseModel
from crawl4ai.extraction_strategy import LLMExtractionStrategy

class TravelData(BaseModel):
    destination: str
    attractions: list

run_config.extraction_strategy = LLMExtractionStrategy(
    provider="ollama/nemotron",
    schema=TravelData.schema(),
    instruction="Extract destination and top attractions."
)
```
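
For hosted providers, supply an API token as well. A sketch reusing the `TravelData` model above and assuming an OpenAI-style provider string (the model name is illustrative):

```python
import os

run_config.extraction_strategy = LLMExtractionStrategy(
    provider="openai/gpt-4o-mini",          # illustrative model name
    api_token=os.getenv("OPENAI_API_KEY"),  # read the key from the environment
    schema=TravelData.schema(),
    instruction="Extract destination and top attractions."
)
```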

**More info:** [See /docs/extraction_strategies](#) or [7_extraction_strategies.ex.md](https://github.com/unclecode/crawl4ai/blob/main/extraction_strategies.ex.md)

---

## 8. Page Interactions, JS Execution, & Dynamic Content

Insert `js_code` and use `wait_for` to ensure content loads. Example:

```python
run_config.js_code = """
(async () => {
   document.querySelector('.load-more')?.click();
   await new Promise(r => setTimeout(r, 2000));
})();
"""
run_config.wait_for = "css:.item-loaded"
```
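
`wait_for` also accepts a `js:` prefix with a JavaScript predicate, which helps when no single CSS selector signals readiness. A minimal sketch (the selector is illustrative):

```python
# Wait until more than ten result items have rendered.
run_config.wait_for = "js:() => document.querySelectorAll('.item').length > 10"
```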

**More info:** [See /docs/page_interaction](#) or [11_page_interaction.md](https://github.com/unclecode/crawl4ai/blob/main/page_interaction.md)

---

## 9. Media, Links, & Metadata Handling

`result.media["images"]`: List of images with `src`, `score`, `alt`. Score indicates relevance.

`result.media["videos"]`, `result.media["audios"]` similarly hold media info.

`result.links["internal"]`, `result.links["external"]`, `result.links["social"]`: Categorized links. Each link has `href`, `text`, `context`, `type`.

`result.metadata`: Title, description, keywords, author.

**Example:**

```python
# Images
for img in result.media["images"]:
    print("Image:", img["src"], "Score:", img["score"], "Alt:", img.get("alt","N/A"))

# Links
for link in result.links["external"]:
    print("External Link:", link["href"], "Text:", link["text"])

# Metadata
print("Page Title:", result.metadata["title"])
print("Description:", result.metadata["description"])
```

**More info:** [See /docs/content_selection](#) or [8_content_selection.ex.md](https://github.com/unclecode/crawl4ai/blob/main/content_selection.ex.md)

---

## 10. Authentication & Identity Preservation

### Manual Setup via User Data Directory

1. **Open Chrome with a custom user data dir:**

    ```bash
    "C:\Program Files\Google\Chrome\Application\chrome.exe" --user-data-dir="C:\MyChromeProfile"
    ```

    On macOS:

    ```bash
    "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome" --user-data-dir="/Users/username/ChromeProfiles/MyProfile"
    ```

2. **Log in to sites, solve CAPTCHAs, adjust settings manually.**  
   The browser saves cookies/localStorage in that directory.

3. **Use `user_data_dir` in `BrowserConfig`:**

    ```python
    browser_config = BrowserConfig(
        headless=True,
        user_data_dir="/Users/username/ChromeProfiles/MyProfile"
    )
    ```

    Now the crawler starts with those cookies, sessions, etc.

### Using `storage_state`

Alternatively, export and reuse storage states:

```python
browser_config = BrowserConfig(
    headless=True,
    storage_state="mystate.json"  # Pre-saved state
)
```

No repeated logins needed.
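
One way to produce `mystate.json` is a short Playwright script: log in by hand in the visible window, then save the context's storage state (the login URL is illustrative):

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    context = browser.new_context()
    page = context.new_page()
    page.goto("https://example.com/login")  # illustrative login page
    input("Press Enter once you are logged in...")
    context.storage_state(path="mystate.json")  # saves cookies + localStorage
    browser.close()
```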

**More info:** [See /docs/storage_state](#) or [16_storage_state.md](https://github.com/unclecode/crawl4ai/blob/main/storage_state.md)

---

## 11. Proxy & Security Enhancements

Use `proxy_config` for authenticated proxies:

```python
browser_config.proxy_config = {
    "server": "http://proxy.example.com:8080",
    "username": "proxyuser",
    "password": "proxypass"
}
```

Combine with `headers` or `ignore_https_errors` as needed.
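
Both are plain `BrowserConfig` options. A minimal sketch (the header value is illustrative):

```python
# Send extra HTTP headers and tolerate self-signed certificates.
browser_config.headers = {"Accept-Language": "en-US,en;q=0.9"}
browser_config.ignore_https_errors = True
```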

**More info:** [See /docs/proxy_security](#) or [14_proxy_security.md](https://github.com/unclecode/crawl4ai/blob/main/proxy_security.md)

---

## 12. Screenshots, PDFs & File Downloads

Enable `screenshot=True` or `pdf=True` in `CrawlerRunConfig`:

```python
run_config.screenshot = True
run_config.pdf = True
```

After crawling:

```python
if result.screenshot:
    with open("page.png", "wb") as f:
        f.write(result.screenshot)

if result.pdf:
    with open("page.pdf", "wb") as f:
        f.write(result.pdf)
```

**File Downloads:**

```python
browser_config.accept_downloads = True
browser_config.downloads_path = "./downloads"
run_config.js_code = """document.querySelector('a.download')?.click();"""

# After crawl:
print("Downloaded files:", result.downloaded_files)
```

**More info:** [See /docs/screenshot_and_pdf_export](#) or [15_screenshot_and_pdf_export.md](https://github.com/unclecode/crawl4ai/blob/main/screenshot_and_pdf_export.md)  
Also [10_file_download.md](https://github.com/unclecode/crawl4ai/blob/main/file_download.md)

---

## 13. Caching & Performance Optimization

Set `cache_mode` to reuse fetch results:

```python
from crawl4ai import CacheMode
run_config.cache_mode = CacheMode.ENABLED
```

Adjust delays, increase concurrency, or use `text_mode=True` for faster extraction.
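
For concurrency, `arun_many` crawls a list of URLs with a single crawler instance. A minimal sketch, reusing the configs from Section 3 (the URLs are placeholders):

```python
urls = [
    "https://example.com/a",
    "https://example.com/b",
    "https://example.com/c",
]

async with AsyncWebCrawler(config=browser_config) as crawler:
    results = await crawler.arun_many(urls, config=run_config)
    for r in results:
        print(r.url, len(r.markdown or ""))
```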

**More info:** [See /docs/cache_modes](#) or [9_cache_modes.md](https://github.com/unclecode/crawl4ai/blob/main/cache_modes.md)

---

## 14. Hooks for Custom Logic

Hooks let you run custom code at specific lifecycle events. Avoid creating pages manually in `on_browser_created`.

Instead, use `on_page_context_created` to apply routing or modify the page context before the URL is crawled.

**Example Hook:**

```python
async def on_page_context_created_hook(context, page, **kwargs):
    # Block all images to speed up load
    await context.route("**/*.{png,jpg,jpeg}", lambda route: route.abort())
    print("[HOOK] Image requests blocked")

async with AsyncWebCrawler(config=browser_config) as crawler:
    crawler.crawler_strategy.set_hook("on_page_context_created", on_page_context_created_hook)
    result = await crawler.arun("https://imageheavy.example.com", config=run_config)
    print("Crawl finished with images blocked.")
```

This hook stays clean: it doesn't create a separate page itself, it just modifies the current context/page setup.

**More info:** [See /docs/hooks_auth](#) or [13_hooks_auth.md](https://github.com/unclecode/crawl4ai/blob/main/hooks_auth.md)

---

## 15. Dockerization & Scaling

Use Docker images:

-   AMD64 basic:

```bash
docker pull unclecode/crawl4ai:basic-amd64
docker run -p 11235:11235 unclecode/crawl4ai:basic-amd64
```

-   ARM64 for M1/M2:

```bash
docker pull unclecode/crawl4ai:basic-arm64
docker run -p 11235:11235 unclecode/crawl4ai:basic-arm64
```

-   GPU support:

```bash
docker pull unclecode/crawl4ai:gpu-amd64
docker run --gpus all -p 11235:11235 unclecode/crawl4ai:gpu-amd64
```

Scale with load balancers or Kubernetes.
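
The server listens on port 11235. Below is a sketch of submitting a crawl over its REST API, assuming the task endpoints (`POST /crawl`, then polling `GET /task/<id>`) documented for these image tags; verify the routes and response shape against your image's README:

```python
import time

import requests

# Assumption: the legacy task API returns a task_id to poll.
resp = requests.post(
    "http://localhost:11235/crawl",
    json={"urls": "https://example.com"},
)
task_id = resp.json()["task_id"]

while True:
    status = requests.get(f"http://localhost:11235/task/{task_id}").json()
    if status.get("status") == "completed":
        print(status["result"]["markdown"][:200])
        break
    time.sleep(1)
```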

**More info:** [See the Docker instructions in the project README](#)

---

## 16. Troubleshooting & Common Pitfalls

-   Empty results? Relax filters, check selectors.
-   Timeouts? Increase `page_timeout` or refine `wait_for`.
-   CAPTCHAs? Use `user_data_dir` or `storage_state` after manual solving.
-   JS errors? Debug in headful mode, as shown below.
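
A quick headful debug run, reusing `run_config` from earlier (a minimal sketch):

```python
# Watch the crawl in a visible browser window with verbose logging.
debug_browser = BrowserConfig(headless=False, verbose=True)

async with AsyncWebCrawler(config=debug_browser) as crawler:
    result = await crawler.arun("https://example.com", config=run_config)
```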

Check [examples](https://github.com/unclecode/crawl4ai/tree/main/docs/examples) & [quickstart_async.config.py](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/quickstart_async.config.py) for more code.

---

## 17. Comprehensive End-to-End Example

Combine hooks, JS execution, PDF saving, LLM extraction—see [quickstart_async.config.py](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/quickstart_async.config.py) for a full example.

---

## 18. Further Resources & Community

-   **Docs:** [https://crawl4ai.com](https://crawl4ai.com)
-   **Issues & PRs:** [https://github.com/unclecode/crawl4ai/issues](https://github.com/unclecode/crawl4ai/issues)

Follow [@unclecode](https://x.com/unclecode) for news & community updates.

**Happy Crawling!**  
Leverage Crawl4AI to feed your AI models with clean, structured web data today.