
Deploying with Docker (Quickstart)

⚠️ WARNING: Experimental & Legacy
Our current Docker solution for Crawl4AI is not stable and will be discontinued soon. A more robust Docker/Orchestration strategy is in development, with a planned stable release in 2025. If you choose to use this Docker approach, please proceed cautiously and avoid production deployment without thorough testing.

Crawl4AI is open-source and under active development. We appreciate your interest, but strongly recommend you make informed decisions if you need a production environment. Expect breaking changes in future versions.


1. Installation & Environment Setup (Outside Docker)

Before we jump into Docker usage, here’s a quick reminder (from the legacy docs) of how to install Crawl4AI locally for non-Docker deployments or local development:

# 1. Install the package
pip install crawl4ai
crawl4ai-setup

# 2. Install playwright dependencies (all browsers or specific ones)
playwright install --with-deps
# or
playwright install --with-deps chromium
# or
playwright install --with-deps chrome

Testing your installation:

# Visible browser test
python -c "from playwright.sync_api import sync_playwright; p = sync_playwright().start(); browser = p.chromium.launch(headless=False); page = browser.new_page(); page.goto('https://example.com'); input('Press Enter to close...'); browser.close(); p.stop()"
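
Once that browser check passes, a minimal Crawl4AI smoke test exercises the library itself. This is a short sketch using the async API (see the main docs for full configuration options):

import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    # Crawl a page and print the start of the generated markdown
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com")
        print(result.markdown[:300])

asyncio.run(main())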

2. Docker Overview

This Docker approach allows you to run a Crawl4AI service via REST API. You can:

  1. POST a request (e.g., URLs, extraction config)
  2. Retrieve your results from a task-based endpoint

Note: This Docker solution is temporary. We plan a more robust, stable Docker approach in the near future. For now, you can experiment, but do not rely on it for mission-critical production.


3. Pulling and Running the Image

Basic Run

docker pull unclecode/crawl4ai:basic
docker run -p 11235:11235 unclecode/crawl4ai:basic

This starts a container on port 11235. You can POST requests to http://localhost:11235/crawl.
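
Before sending crawl jobs, you can confirm the service is reachable. The legacy image exposes a simple health endpoint; if your image version differs, treat the exact path as an assumption:

import requests

# Liveness check (the /health path follows the legacy Docker docs)
resp = requests.get("http://localhost:11235/health")
print(resp.status_code, resp.text)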

Using an API Token

docker run -p 11235:11235 \
  -e CRAWL4AI_API_TOKEN=your_secret_token \
  unclecode/crawl4ai:basic

If CRAWL4AI_API_TOKEN is set, you must include Authorization: Bearer <token> in your requests. Otherwise, the service is open to anyone.
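
A quick way to verify the token is enforced is to send the same request with and without the header (a sketch; the exact rejection status code depends on the server version):

import requests

payload = {"urls": "https://example.com", "priority": 10}

# Without the token: expect a rejection (e.g., 401/403, version-dependent)
print(requests.post("http://localhost:11235/crawl", json=payload).status_code)

# With the token: the task should be accepted
headers = {"Authorization": "Bearer your_secret_token"}
print(requests.post("http://localhost:11235/crawl", json=payload, headers=headers).status_code)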


4. Docker Compose for Multi-Container Workflows

You can also use Docker Compose to manage multiple services. Below is an experimental snippet:

version: '3.8'

services:
  crawl4ai:
    image: unclecode/crawl4ai:basic
    ports:
      - "11235:11235"
    environment:
      - CRAWL4AI_API_TOKEN=${CRAWL4AI_API_TOKEN:-}
      - OPENAI_API_KEY=${OPENAI_API_KEY:-}
    # Additional env variables as needed
    volumes:
      - /dev/shm:/dev/shm

To run:

docker-compose up -d

And to stop:

docker-compose down
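
The ${VAR:-} defaults in the snippet above let Compose read values from your shell or from a .env file next to docker-compose.yml. A minimal example (both values are placeholders):

# .env (same directory as docker-compose.yml)
CRAWL4AI_API_TOKEN=your_secret_token
OPENAI_API_KEY=your_openai_key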

Troubleshooting:

  • Check logs: docker-compose logs -f crawl4ai
  • Remove orphan containers: docker-compose down --remove-orphans
  • Remove networks: docker network rm <network_name>

5. Making Requests to the Container

Base URL: http://localhost:11235

Example: Basic Crawl

import requests
import time

task_request = {
    "urls": "https://example.com",
    "priority": 10
}

response = requests.post("http://localhost:11235/crawl", json=task_request)
task_id = response.json()["task_id"]

# Poll the task endpoint until the crawl completes (give up after ~60s)
status_url = f"http://localhost:11235/task/{task_id}"
for _ in range(60):
    status = requests.get(status_url).json()
    if status["status"] == "completed":
        break
    time.sleep(1)
print(status)

If you used an API token, do:

headers = {"Authorization": "Bearer your_secret_token"}
response = requests.post(
    "http://localhost:11235/crawl",
    headers=headers,
    json=task_request
)
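
Since every request follows the same submit-then-poll pattern, it can help to wrap the polling in a small helper. This is a hypothetical sketch (wait_for_task is not part of the API, and the "failed" status value is an assumption alongside the "completed" value shown in section 7):

import time
import requests

BASE_URL = "http://localhost:11235"

def wait_for_task(task_id, headers=None, timeout=60):
    # Hypothetical helper: poll /task/<id> until the task finishes
    deadline = time.time() + timeout
    while time.time() < deadline:
        status = requests.get(f"{BASE_URL}/task/{task_id}", headers=headers).json()
        if status["status"] in ("completed", "failed"):  # "failed" is assumed
            return status
        time.sleep(1)
    raise TimeoutError(f"Task {task_id} did not finish within {timeout}s")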

6. Docker + New Crawler Config Approach

Using BrowserConfig & CrawlerRunConfig in Requests

The Docker-based solution can accept crawler configurations in the request JSON. Older docs may show direct top-level parameters, but the recommended style is to embed them under crawler_params or extra so they align with the new BrowserConfig and CrawlerRunConfig approach. For example:

import requests

request_data = {
    "urls": "https://www.nbcnews.com/business",
    "crawler_params": {
        "headless": True,
        "browser_type": "chromium",
        "verbose": True,
        "page_timeout": 30000,
        # ... any other BrowserConfig-like fields
    },
    "extra": {
        "word_count_threshold": 50,
        "bypass_cache": True
    }
}

response = requests.post("http://localhost:11235/crawl", json=request_data)
task_id = response.json()["task_id"]

This is the recommended style if you want to replicate BrowserConfig and CrawlerRunConfig settings in Docker mode.
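
For comparison, the same settings in a local (non-Docker) run map onto BrowserConfig and CrawlerRunConfig roughly like this. This is a sketch: field names follow the config docs, and cache_mode stands in for the legacy bypass_cache flag:

import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

browser_cfg = BrowserConfig(browser_type="chromium", headless=True, verbose=True)
run_cfg = CrawlerRunConfig(
    word_count_threshold=50,
    page_timeout=30000,
    cache_mode=CacheMode.BYPASS,  # replaces the legacy bypass_cache=True
)

async def main():
    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        result = await crawler.arun("https://www.nbcnews.com/business", config=run_cfg)
        print(result.markdown[:300])

asyncio.run(main())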


7. Example: JSON Extraction in Docker

import requests
import json
import time

# Define a schema for CSS extraction
schema = {
    "name": "Coinbase Crypto Prices",
    "baseSelector": ".cds-tableRow-t45thuk",
    "fields": [
        {
            "name": "crypto",
            "selector": "td:nth-child(1) h2",
            "type": "text"
        },
        {
            "name": "symbol",
            "selector": "td:nth-child(1) p",
            "type": "text"
        },
        {
            "name": "price",
            "selector": "td:nth-child(2)",
            "type": "text"
        }
    ]
}

request_data = {
    "urls": "https://www.coinbase.com/explore",
    "extraction_config": {
        "type": "json_css",
        "params": {"schema": schema}
    },
    "crawler_params": {
        "headless": True,
        "verbose": True
    }
}

resp = requests.post("http://localhost:11235/crawl", json=request_data)
task_id = resp.json()["task_id"]

# Poll until the task completes (give up after ~60s)
for _ in range(60):
    status = requests.get(f"http://localhost:11235/task/{task_id}").json()
    if status["status"] == "completed":
        break
    time.sleep(1)

if status["status"] == "completed":
    extracted_content = status["result"]["extracted_content"]
    data = json.loads(extracted_content)
    print("Extracted:", len(data), "entries")
else:
    print("Task did not complete in time or failed.")

8. Why This Docker Is Temporary

We are building a new, stable approach:

  • The current Docker container is experimental and might break with future releases.
  • We plan a stable release in 2025 with a more robust API, versioning, and orchestration.
  • If you use this Docker in production, do so at your own risk and be prepared for breaking changes.

Community: Because Crawl4AI is open-source, you can track progress or contribute to the new Docker approach. Check the GitHub repository for roadmaps and updates.


9. Known Limitations & Next Steps

  1. Not Production-Ready: This Docker approach lacks extensive security, logging, or advanced config for large-scale usage.
  2. Ongoing Changes: Expect API changes. The official stable version is targeted for 2025.
  3. LLM Integrations: Docker images are big if you want GPU or multiple model providers. We might unify these in a future build.
  4. Performance: For concurrency or large crawls, you may need to tune resources (memory, CPU) and watch out for ephemeral storage.
  5. Version Pinning: If you must deploy, pin your Docker tag to a specific version (e.g., :basic-0.3.7) to avoid surprise updates.

Next Steps

  • Watch the Repository: For announcements on the new Docker architecture.
  • Experiment: Use this Docker for test or dev environments, but keep an eye out for breakage.
  • Contribute: If you have ideas or improvements, open a PR or discussion.
  • Check Roadmaps: See our GitHub issues or Roadmap doc to find upcoming releases.

10. Summary

Deploying with Docker can simplify running Crawl4AI as a service. However:

  • This Docker approach is legacy and subject to removal/overhaul.
  • For production, please weigh the risks carefully.
  • A detailed “new Docker approach” is coming in 2025.

We hope this guide helps you quickly spin up Crawl4AI in Docker for experimental use. Stay tuned for the fully supported version!