# Deploying with Docker (Quickstart)

> **⚠️ WARNING: Experimental & Legacy**  
> Our current Docker solution for Crawl4AI is **not stable** and **will be discontinued** soon. A more robust Docker/Orchestration strategy is in development, with a planned stable release in **2025**. If you choose to use this Docker approach, please proceed cautiously and avoid production deployment without thorough testing.

Crawl4AI is **open-source** and under **active development**. We appreciate your interest, but strongly recommend you make **informed decisions** if you need a production environment. Expect breaking changes in future versions.

---

## 1. Installation & Environment Setup (Outside Docker)

Before we jump into Docker usage, here’s a quick reminder of how to install Crawl4AI locally for **non-Docker** deployments or local development:

```bash
# 1. Install the package
pip install crawl4ai
crawl4ai-setup

# 2. Install playwright dependencies (all browsers or specific ones)
playwright install --with-deps
# or
playwright install --with-deps chromium
# or
playwright install --with-deps chrome
```

**Testing** your installation:

```bash
# Visible browser test
python -c "from playwright.sync_api import sync_playwright; p = sync_playwright().start(); browser = p.chromium.launch(headless=False); page = browser.new_page(); page.goto('https://example.com'); input('Press Enter to close...')"
```

---

## 2. Docker Overview

This Docker approach allows you to run a **Crawl4AI** service via REST API. You can:

1. **POST** a request (e.g., URLs, extraction config)  
2. **Retrieve** your results from a task-based endpoint  

> **Note**: This Docker solution is **temporary**. We plan a more robust, stable Docker approach in the near future. For now, you can experiment, but do not rely on it for mission-critical production.

---

## 3. Pulling and Running the Image

### Basic Run

```bash
docker pull unclecode/crawl4ai:basic
docker run -p 11235:11235 unclecode/crawl4ai:basic
```

This starts a container on port `11235`. You can `POST` requests to `http://localhost:11235/crawl`.
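For a quick sanity check from the command line, you can exercise the same endpoints with `curl`. This is a sketch that assumes the container is running locally on the default port with no API token configured; `<task_id>` is a placeholder for the ID returned by the first call:

```shell
# Submit a crawl task to the /crawl endpoint
curl -X POST http://localhost:11235/crawl \
  -H "Content-Type: application/json" \
  -d '{"urls": "https://example.com", "priority": 10}'

# The response JSON contains a task_id; check its status:
curl http://localhost:11235/task/<task_id>
```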

### Using an API Token

```bash
docker run -p 11235:11235 \
  -e CRAWL4AI_API_TOKEN=your_secret_token \
  unclecode/crawl4ai:basic
```

If **`CRAWL4AI_API_TOKEN`** is set, you must include `Authorization: Bearer <token>` in your requests. Otherwise, the service is open to anyone.

---

## 4. Docker Compose for Multi-Container Workflows

You can also use **Docker Compose** to manage multiple services. Below is an **experimental** snippet:

```yaml
version: '3.8'

services:
  crawl4ai:
    image: unclecode/crawl4ai:basic
    ports:
      - "11235:11235"
    environment:
      - CRAWL4AI_API_TOKEN=${CRAWL4AI_API_TOKEN:-}
      - OPENAI_API_KEY=${OPENAI_API_KEY:-}
    # Additional env variables as needed
    volumes:
      - /dev/shm:/dev/shm
```

To run:

```bash
docker-compose up -d
```

And to stop:

```bash
docker-compose down
```

**Troubleshooting**:

- **Check logs**: `docker-compose logs -f crawl4ai`
- **Remove orphan containers**: `docker-compose down --remove-orphans`
- **Remove networks**: `docker network rm <network_name>`

---

## 5. Making Requests to the Container

**Base URL**: `http://localhost:11235`

### Example: Basic Crawl

```python
import requests

task_request = {
    "urls": "https://example.com",
    "priority": 10
}

response = requests.post("http://localhost:11235/crawl", json=task_request)
task_id = response.json()["task_id"]

# Check the task status once (repeat until it reports "completed")
status_url = f"http://localhost:11235/task/{task_id}"
status = requests.get(status_url).json()
print(status)
```
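Because the task endpoint may report an in-progress state for a while, a small polling loop with a timeout is more robust than a single `GET`. Below is a minimal sketch; the `get` callable is injected (pass `requests.get`, or a partial with your auth headers) so the helper stays easy to test, and the `"completed"`/`"failed"` status values follow the fields used elsewhere in this guide:

```python
import time

def wait_for_task(get, status_url, timeout=60.0, interval=2.0):
    """Poll a Crawl4AI task endpoint until it reaches a terminal state.

    `get` is any callable returning a response with a .json() method,
    e.g. requests.get, or functools.partial(requests.get, headers=...)
    when an API token is configured.
    """
    deadline = time.monotonic() + timeout
    while True:
        status = get(status_url).json()
        # Stop on either terminal state reported by the service
        if status.get("status") in ("completed", "failed"):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Task not finished after {timeout}s: {status_url}")
        time.sleep(interval)
```

Usage: `status = wait_for_task(requests.get, f"http://localhost:11235/task/{task_id}")`.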

If you used an API token, do:

```python
headers = {"Authorization": "Bearer your_secret_token"}
response = requests.post(
    "http://localhost:11235/crawl",
    headers=headers,
    json=task_request
)
```

---

## 6. Docker + New Crawler Config Approach

### Using `BrowserConfig` & `CrawlerRunConfig` in Requests

The Docker-based solution accepts **crawler configurations** in the request JSON. Older docs may show direct top-level parameters, but embedding them in `crawler_params` and `extra` aligns with the new `BrowserConfig`/`CrawlerRunConfig` approach. For example:

```python
import requests

request_data = {
    "urls": "https://www.nbcnews.com/business",
    "crawler_params": {
        "headless": True,
        "browser_type": "chromium",
        "verbose": True,
        "page_timeout": 30000,
        # ... any other BrowserConfig-like fields
    },
    "extra": {
        "word_count_threshold": 50,
        "bypass_cache": True
    }
}

response = requests.post("http://localhost:11235/crawl", json=request_data)
task_id = response.json()["task_id"]
```

This is the recommended style if you want to replicate `BrowserConfig` and `CrawlerRunConfig` settings in Docker mode.

---

## 7. Example: JSON Extraction in Docker

```python
import requests
import json

# Define a schema for CSS extraction
schema = {
    "name": "Coinbase Crypto Prices",
    "baseSelector": ".cds-tableRow-t45thuk",
    "fields": [
        {
            "name": "crypto",
            "selector": "td:nth-child(1) h2",
            "type": "text"
        },
        {
            "name": "symbol",
            "selector": "td:nth-child(1) p",
            "type": "text"
        },
        {
            "name": "price",
            "selector": "td:nth-child(2)",
            "type": "text"
        }
    ]
}

request_data = {
    "urls": "https://www.coinbase.com/explore",
    "extraction_config": {
        "type": "json_css",
        "params": {"schema": schema}
    },
    "crawler_params": {
        "headless": True,
        "verbose": True
    }
}

resp = requests.post("http://localhost:11235/crawl", json=request_data)
task_id = resp.json()["task_id"]

# Check the task status once (in practice, poll until it leaves the in-progress state)
status = requests.get(f"http://localhost:11235/task/{task_id}").json()
if status["status"] == "completed":
    extracted_content = status["result"]["extracted_content"]
    data = json.loads(extracted_content)
    print("Extracted:", len(data), "entries")
else:
    print("Task still in progress or failed.")
```

---

## 8. Why This Docker Is Temporary

**We are building a new, stable approach**:

- The current Docker container is **experimental** and might break with future releases.  
- We plan a stable release in **2025** with a more robust API, versioning, and orchestration.  
- If you use this Docker in production, do so at your own risk and be prepared for **breaking changes**.

**Community**: Because Crawl4AI is open-source, you can track progress or contribute to the new Docker approach. Check the [GitHub repository](https://github.com/unclecode/crawl4ai) for roadmaps and updates.

---

## 9. Known Limitations & Next Steps

1. **Not Production-Ready**: This Docker approach lacks extensive security, logging, or advanced config for large-scale usage.  
2. **Ongoing Changes**: Expect API changes. The official stable version is targeted for **2025**.  
3. **LLM Integrations**: Docker images become large if you bundle GPU support or multiple model providers. We might unify these in a future build.  
4. **Performance**: For concurrency or large crawls, you may need to tune resources (memory, CPU) and watch out for ephemeral storage.  
5. **Version Pinning**: If you must deploy, pin your Docker tag to a specific version (e.g., `:basic-0.3.7`) to avoid surprise updates.
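To illustrate point 5, pinning looks like this; `:basic-0.3.7` is the example tag from the list above, so substitute whichever version you have actually tested:

```shell
# Pin to an exact tag instead of the mutable :basic tag
docker pull unclecode/crawl4ai:basic-0.3.7
docker run -p 11235:11235 unclecode/crawl4ai:basic-0.3.7
```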

### Next Steps

- **Watch the Repository**: For announcements on the new Docker architecture.  
- **Experiment**: Use this Docker for test or dev environments, but keep an eye out for breakage.  
- **Contribute**: If you have ideas or improvements, open a PR or discussion.  
- **Check Roadmaps**: See our [GitHub issues](https://github.com/unclecode/crawl4ai/issues) or [Roadmap doc](https://github.com/unclecode/crawl4ai/blob/main/ROADMAP.md) to find upcoming releases.

---

## 10. Summary

**Deploying with Docker** can simplify running Crawl4AI as a service. However:

- **This Docker** approach is **legacy** and subject to removal/overhaul.  
- For production, please weigh the risks carefully.  
- A detailed “new Docker approach” is coming in **2025**.

We hope this guide helps you do a quick spin-up of Crawl4AI in Docker for **experimental** usage. Stay tuned for the fully-supported version!