# AsyncWebCrawler Basics

In this tutorial, you’ll learn how to:

1. Create and configure an `AsyncWebCrawler` instance  
2. Understand the `CrawlResult` object returned by `arun()`  
3. Use basic `BrowserConfig` and `CrawlerRunConfig` options to tailor your crawl

> **Prerequisites**  
> - You’ve already completed the [Getting Started](./getting-started.md) tutorial (or have equivalent knowledge).  
> - You have **Crawl4AI** installed and configured with Playwright.

---

## 1. What is `AsyncWebCrawler`?

`AsyncWebCrawler` is the central class for running asynchronous crawling operations in Crawl4AI. It manages browser sessions, handles dynamic pages (if needed), and provides you with a structured result object for each crawl. Essentially, it’s your high-level interface for collecting page data.

```python
from crawl4ai import AsyncWebCrawler

async with AsyncWebCrawler() as crawler:
    result = await crawler.arun("https://example.com")
    print(result)
```

---

## 2. Creating a Basic `AsyncWebCrawler` Instance

Below is a simple code snippet showing how to create and use `AsyncWebCrawler`. This goes one step beyond the minimal example you saw in [Getting Started](./getting-started.md).

```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai import BrowserConfig, CrawlerRunConfig

async def main():
    # 1. Set up configuration objects (optional; defaults are used if you skip them)
    browser_config = BrowserConfig(
        browser_type="chromium",
        headless=True,
        verbose=True
    )
    crawler_config = CrawlerRunConfig(
        page_timeout=30000,     # 30 seconds
        wait_for_images=True,
        verbose=True
    )

    # 2. Initialize AsyncWebCrawler with your chosen browser config
    async with AsyncWebCrawler(config=browser_config) as crawler:
        # 3. Run a single crawl
        url_to_crawl = "https://example.com"
        result = await crawler.arun(url=url_to_crawl, config=crawler_config)
        
        # 4. Inspect the result
        if result.success:
            print(f"Successfully crawled: {result.url}")
            print(f"HTML length: {len(result.html)}")
            print(f"Markdown snippet: {result.markdown[:200]}...")
        else:
            print(f"Failed to crawl {result.url}. Error: {result.error_message}")

if __name__ == "__main__":
    asyncio.run(main())
```

### Key Points

1. **`BrowserConfig`** is optional, but it’s the place to specify browser-related settings (e.g., `headless`, `browser_type`).
2. **`CrawlerRunConfig`** deals with how you want the crawler to behave for this particular run (timeouts, waiting for images, etc.).
3. **`arun()`** is the main method to crawl a single URL. We’ll see how `arun_many()` works in later tutorials.

---

## 3. Understanding `CrawlResult`

When you call `arun()`, you get back a `CrawlResult` object containing all the relevant data from that crawl attempt. Some common fields include:

```python
class CrawlResult(BaseModel):
    url: str
    html: str
    success: bool
    cleaned_html: Optional[str] = None
    media: Dict[str, List[Dict]] = {}
    links: Dict[str, List[Dict]] = {}
    screenshot: Optional[str] = None  # base64-encoded screenshot if requested
    pdf: Optional[bytes] = None       # binary PDF data if requested
    markdown: Optional[Union[str, MarkdownGenerationResult]] = None
    markdown_v2: Optional[MarkdownGenerationResult] = None
    error_message: Optional[str] = None
    # ... plus other fields like status_code, ssl_certificate, extracted_content, etc.
```

### Commonly Used Fields

- **`success`**: `True` if the crawl succeeded, `False` otherwise.  
- **`html`**: The raw HTML (or final rendered state if JavaScript was executed).  
- **`markdown` / `markdown_v2`**: Contains the automatically generated Markdown representation of the page.  
- **`media`**: A dictionary with lists of extracted images, videos, or audio elements.  
- **`links`**: A dictionary with lists of “internal” and “external” link objects.  
- **`error_message`**: If `success` is `False`, this often contains a description of the error.

**Example**:

```python
if result.success:
    print("Page Title or snippet of HTML:", result.html[:200])
    if result.markdown:
        print("Markdown snippet:", result.markdown[:200])
    print("Links found:", len(result.links.get("internal", [])), "internal links")
else:
    print("Error crawling:", result.error_message)
```
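
The `media` and `links` dictionaries follow the same pattern. Below is a minimal sketch of walking through them; the `"images"` key and the per-entry fields (`"src"`, `"alt"`, `"href"`) are assumptions about the entry structure, so check them against your own results.

```python
# Inspect extracted media and links from a CrawlResult.
# The "images" key and the "src"/"alt"/"href" fields are assumed entry
# shapes; adjust them to whatever your results actually contain.
if result.success:
    images = result.media.get("images", [])
    print(f"Found {len(images)} images")
    for img in images[:3]:
        print("  image:", img.get("src"), "| alt:", img.get("alt"))

    for link in result.links.get("external", [])[:5]:
        print("  external link:", link.get("href"))
```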

---

## 4. Relevant Basic Parameters

Below are a few `BrowserConfig` and `CrawlerRunConfig` parameters you might tweak early on. We’ll cover more advanced ones (like proxies, PDF, or screenshots) in later tutorials.

### 4.1 `BrowserConfig` Essentials

| Parameter          | Description                                               | Default        |
|--------------------|-----------------------------------------------------------|----------------|
| `browser_type`     | Which browser engine to use: `"chromium"`, `"firefox"`, `"webkit"` | `"chromium"`   |
| `headless`         | Run the browser without a visible window. Set to `False` to watch the browser while it crawls. | `True`         |
| `verbose`          | Print extra logs for debugging.                          | `True`         |
| `java_script_enabled` | Toggle JavaScript. When `False`, you might speed up loads but lose dynamic content. | `True`         |
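
For example, to watch the crawl in a visible window and skip JavaScript (a quick way to check how a page looks without dynamic content), a config using only the parameters above might look like this minimal sketch:

```python
from crawl4ai import BrowserConfig

# Visible Firefox window, JavaScript disabled, quieter logs.
# Only parameters from the table above are used here.
browser_config = BrowserConfig(
    browser_type="firefox",
    headless=False,             # show the browser window
    java_script_enabled=False,  # faster loads, but dynamic content is lost
    verbose=False
)
```

Pass it to `AsyncWebCrawler(config=browser_config)` exactly as in the earlier examples.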

### 4.2 `CrawlerRunConfig` Essentials

| Parameter             | Description                                                  | Default            |
|-----------------------|--------------------------------------------------------------|--------------------|
| `page_timeout`        | Maximum time in ms to wait for the page (and its scripts) to finish loading. | `30000` (30s)      |
| `wait_for_images`     | Wait for images to fully load. Good for accurate rendering.  | `True`             |
| `css_selector`        | Target only certain elements for extraction.                | `None`             |
| `excluded_tags`       | Skip certain HTML tags (like `nav`, `footer`, etc.).         | `None`             |
| `verbose`             | Print logs for debugging.                                    | `True`             |
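
As a quick sketch, here is a run config that narrows extraction to the main content area and skips common boilerplate tags; only parameters from the table above are used, and the selector and tag names are illustrative:

```python
from crawl4ai import CrawlerRunConfig

# Narrow extraction to the page's main content and drop boilerplate markup.
# The selector and tag names below are illustrative placeholders.
run_config = CrawlerRunConfig(
    page_timeout=60000,               # allow up to 60 seconds for slow pages
    wait_for_images=False,            # skip image loading for text-only crawls
    css_selector="main",              # extract only content inside <main>
    excluded_tags=["nav", "footer"],  # drop navigation and footer tags
    verbose=False
)
```

Pass it to `arun(url, config=run_config)` just like `crawler_config` in the earlier example.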

> **Tip**: Don’t worry if you see lots of parameters. You’ll learn them gradually in later tutorials.

---

## 5. Windows-Specific Configuration

When using `AsyncWebCrawler` on Windows, you might encounter a `NotImplementedError` from `asyncio.create_subprocess_exec`. This is a known Windows-specific issue: it occurs when the asyncio event loop in use (a `SelectorEventLoop`) doesn't support the subprocess operations needed to launch the browser.

To resolve this, Crawl4AI provides a utility function to configure Windows to use the ProactorEventLoop. Call this function before running any async operations:

```python
from crawl4ai.utils import configure_windows_event_loop

# Call this before any async operations if you're on Windows
configure_windows_event_loop()

# Your AsyncWebCrawler code here
```
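
Put together with the earlier examples, a Windows-friendly entry point might look like the sketch below. The `sys.platform` guard is just a convenience (not something Crawl4AI requires) so the same script also runs unchanged on macOS and Linux:

```python
import asyncio
import sys

from crawl4ai import AsyncWebCrawler
from crawl4ai.utils import configure_windows_event_loop

# Switch Windows to the ProactorEventLoop before any async code runs.
if sys.platform == "win32":
    configure_windows_event_loop()

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com")
        print("Success:", result.success)

if __name__ == "__main__":
    asyncio.run(main())
```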

---

## 6. Putting It All Together

Here’s a slightly more in-depth example that shows off a few key config parameters at once:

```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai import BrowserConfig, CrawlerRunConfig

async def main():
    browser_cfg = BrowserConfig(
        browser_type="chromium",
        headless=True,
        java_script_enabled=True,
        verbose=False
    )

    crawler_cfg = CrawlerRunConfig(
        page_timeout=30000,  # wait up to 30 seconds
        wait_for_images=True,
        css_selector=".article-body",  # only extract content under this CSS selector
        verbose=True
    )

    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        result = await crawler.arun("https://news.example.com", config=crawler_cfg)

        if result.success:
            print("[OK] Crawled:", result.url)
            print("HTML length:", len(result.html))
            print("Extracted Markdown:", result.markdown_v2.raw_markdown[:300])
        else:
            print("[ERROR]", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())
```

**Key Observations**:
- `css_selector=".article-body"` ensures we only focus on the main content region.  
- `page_timeout=30000` helps if the site is slow.  
- We turned off `verbose` logs for the browser but kept them on for the crawler config.  

---

## 7. Next Steps

- **Smart Crawling Techniques**: Learn to handle iframes, advanced caching, and selective extraction in the [next tutorial](./smart-crawling.md).
- **Hooks & Custom Code**: See how to inject custom logic before and after navigation in a dedicated [Hooks Tutorial](./hooks-custom.md).
- **Reference**: For a complete list of every parameter in `BrowserConfig` and `CrawlerRunConfig`, check out the [Reference section](../../reference/configuration.md).

---

## Summary

You now know the basics of **AsyncWebCrawler**:
- How to create it with optional browser/crawler configs
- How `arun()` works for single-page crawls
- Where to find your crawled data in `CrawlResult`
- A handful of frequently used configuration parameters

From here, you can refine your crawler to handle more advanced scenarios, like focusing on specific content or dealing with dynamic elements. Let’s move on to **[Smart Crawling Techniques](./smart-crawling.md)** to learn how to handle iframes, advanced caching, and more.

---

**Last updated**: 2024-XX-XX

Keep exploring! If you get stuck, remember to check out the [How-To Guides](../../how-to/) for targeted solutions or the [Explanations](../../explanations/) for deeper conceptual background.