	# Getting Started with Crawl4AI

	Welcome to Crawl4AI, an open-source, LLM-friendly web crawler and scraper. In this tutorial, you’ll:

	1. Install Crawl4AI via pip (Docker and advanced setups are covered in the Installation guide).
	2. Run your first crawl using minimal configuration.
	3. Generate Markdown output (and learn how it’s influenced by content filters).
	4. Experiment with a simple CSS-based extraction strategy.
	5. See a glimpse of LLM-based extraction (including open-source and closed-source model options).

	---

	## 1. Introduction

	Crawl4AI provides:
	- An asynchronous crawler, `AsyncWebCrawler`.
	- Configurable browser and run settings via `BrowserConfig` and `CrawlerRunConfig`.
	- Automatic HTML-to-Markdown conversion via `DefaultMarkdownGenerator` (supports additional filters).
	- Multiple extraction strategies (LLM-based or “traditional” CSS/XPath-based).

	By the end of this guide, you’ll have installed Crawl4AI, performed a basic crawl, generated Markdown, and tried out two extraction strategies.

	---

	## 2. Installation

	### 2.1 Python + Playwright

	#### Basic Pip Installation

	```bash
	pip install crawl4ai
	crawl4ai-setup

	# Verify your installation
	crawl4ai-doctor
	```

	If you encounter any browser-related issues, you can install the browsers manually:
	```bash
	python -m playwright install --with-deps chrome chromium
	```

	- `crawl4ai-setup` installs and configures Playwright (Chromium by default).

	We cover advanced installation and Docker in the [Installation](#installation) section.

	---

	## 3. Your First Crawl

	Here’s a minimal Python script that creates an `AsyncWebCrawler`, fetches a webpage, and prints the first 300 characters of its Markdown output:

	```python
	import asyncio
	from crawl4ai import AsyncWebCrawler

	async def main():
	    async with AsyncWebCrawler() as crawler:
	        result = await crawler.arun("https://example.com")
	        print(result.markdown[:300])  # Print first 300 chars

	if __name__ == "__main__":
	    asyncio.run(main())
	```

	What’s happening?
	- `AsyncWebCrawler` launches a headless browser (Chromium by default).
	- It fetches `https://example.com`.
	- Crawl4AI automatically converts the HTML into Markdown.

	You now have a simple, working crawl!

	---

	## 4. Basic Configuration (Light Introduction)

	Crawl4AI’s crawler can be heavily customized using two main classes:

	1. `BrowserConfig`: Controls browser behavior (headless or full UI, user agent, JavaScript toggles, etc.).
	2. `CrawlerRunConfig`: Controls how each crawl runs (caching, extraction, timeouts, hooking, etc.).

	Below is an example with minimal usage:

	```python
	import asyncio
	from crawl4ai import AsyncWebCrawler, BrowserConfig, CacheMode, CrawlerRunConfig

	async def main():
	    browser_conf = BrowserConfig(headless=True)  # or False to see the browser
	    # Skip the cache so the page is always fetched fresh
	    run_conf = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)

	    async with AsyncWebCrawler(config=browser_conf) as crawler:
	        result = await crawler.arun(
	            url="https://example.com",
	            config=run_conf
	        )
	        print(result.markdown)

	if __name__ == "__main__":
	    asyncio.run(main())
	```

	We’ll explore more advanced config in later tutorials (like enabling proxies, PDF output, multi-tab sessions, etc.). For now, just note how you pass these objects to manage crawling.
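Because the run config is passed per call, one config object can be reused across several `arun` calls. The following is a minimal sketch of that pattern, not part of Crawl4AI itself: `crawl_all` and `StubCrawler` are hypothetical names, and the stub stands in for `AsyncWebCrawler` so the snippet runs without a browser. It assumes only that the crawler exposes `arun(url=..., config=...)` and that each result carries a `markdown` attribute.

```python
import asyncio
from types import SimpleNamespace


class StubCrawler:
    """Hypothetical stand-in for AsyncWebCrawler (no browser, no network)."""

    async def arun(self, url, config=None):
        # Echo the inputs so we can see which config each call received.
        return SimpleNamespace(url=url, config=config, markdown=f"# Page at {url}")


async def crawl_all(crawler, urls, config):
    """Crawl each URL in order, passing the same run config every time."""
    results = []
    for url in urls:
        results.append(await crawler.arun(url=url, config=config))
    return results


async def main():
    shared_config = {"cache_mode": "bypass"}  # placeholder for a CrawlerRunConfig
    urls = ["https://example.com", "https://example.org"]
    results = await crawl_all(StubCrawler(), urls, shared_config)
    for result in results:
        print(result.url, "->", len(result.markdown), "chars")
    return results


if __name__ == "__main__":
    asyncio.run(main())
```

With the real library, you would pass an `AsyncWebCrawler` instance (inside its `async with` block) and a `CrawlerRunConfig` in place of the stub and the placeholder dict.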

	---

	## 5. Generating Markdown Output

	By default, Crawl4AI automatically generates Markdown from each crawled page. However, the exact output depends on whether you specify a markdown generator or content filter.

	- `result.markdown`:
	  The direct HTML-to-Markdown conversion.
	- `result.markdown.fit_markdown`:
	  The same content after applying any configured content filter (e.g., `PruningContentFilter`).

	### Example: Using a Filter with `DefaultMarkdownGenerator`

	```python
	import asyncio
	from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
	from crawl4ai.content_filter_strategy import PruningContentFilter
	from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

	async def main():
	    # Attach a pruning filter so the generator also produces "fit" Markdown
	    md_generator = DefaultMarkdownGenerator(
	        content_filter=PruningContentFilter(threshold=0.4, threshold_type="fixed")
	    )
	    config = CrawlerRunConfig(markdown_generator=md_generator)

	    async with AsyncWebCrawler() as crawler:
	        result = await crawler.arun("https://news.ycombinator.com", config=config)
	        print("Raw Markdown length:", len(result.markdown.raw_markdown))
	        print("Fit Markdown length:", len(result.markdown.fit_markdown))

	if __name__ == "__main__":
	    asyncio.run(main())
	```

	Note: If you do not specify a content filter or markdown generator, you’ll typically see only the raw Markdown. We’ll dive deeper into these strategies in a dedicated Markdown Generation tutorial.
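To judge how aggressively a filter pruned a page, you can compare the two strings. The helper below is a small sketch in plain Python: `pruning_ratio` is a hypothetical name, and the sample strings are made up to stand in for `result.markdown.raw_markdown` and `result.markdown.fit_markdown`.

```python
def pruning_ratio(raw_markdown: str, fit_markdown: str) -> float:
    """Return the fraction of characters removed by the content filter (0.0 to 1.0)."""
    if not raw_markdown:
        return 0.0  # nothing to prune on an empty page
    removed = len(raw_markdown) - len(fit_markdown)
    # Clamp at 0 in case the filtered text is somehow longer than the raw text.
    return max(removed, 0) / len(raw_markdown)


# Sample strings (made up for illustration, not real crawl output).
raw = "# Title\n\nMain article text.\n\n[Home](/) | [Login](/login)\n"
fit = "# Title\n\nMain article text.\n"

print(f"Pruned {pruning_ratio(raw, fit):.0%} of the raw Markdown")
```

A ratio near zero suggests the filter removed little, while a ratio near one suggests the threshold may be too strict for that page.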

	---

	## 6. Simple Data Extraction (CSS-based)

	Crawl4AI can also extract structured data (JSON) using CSS or XPath selectors. Below is a minimal CSS-based example:

	```python
	import asyncio
	import json
	from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
	from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

	async def main():
	    schema = {
	        "name": "Example Items",
	        "baseSelector": "div.item",
	        "fields": [
	            {"name": "title", "selector": "h2", "type": "text"},
	            {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"}
	        ]
	    }

	    async with AsyncWebCrawler() as crawler:
	        result = await crawler.arun(
	            url="https://example.com/items",
	            config=CrawlerRunConfig(
	                extraction_strategy=JsonCssExtractionStrategy(schema)
	            )
	        )
	        # The JSON output is stored in 'extracted_content'
	        data = json.loads(result.extracted_content)
	        print(data)

	if __name__ == "__main__":
	    asyncio.run(main())
	```

	Why is this helpful?
	- Great for repetitive page structures (e.g., item listings, articles).
	- No AI usage or costs.
	- The crawler returns a JSON string you can parse or store.
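Because `extracted_content` is a JSON string, post-processing is ordinary Python. As a sketch, the snippet below parses such a string and resolves relative links against the page URL. The sample payload and the `normalize_items` helper are hypothetical; the field names `title` and `link` follow the schema above, and the snippet assumes the strategy returns a JSON array with one object per matched `div.item`.

```python
import json
from urllib.parse import urljoin


def normalize_items(extracted_content: str, base_url: str) -> list[dict]:
    """Parse the JSON string and turn relative links into absolute URLs."""
    items = json.loads(extracted_content)
    cleaned = []
    for item in items:
        title = (item.get("title") or "").strip()
        link = item.get("link")
        if not title:
            continue  # skip entries where the title selector matched nothing
        cleaned.append({
            "title": title,
            "link": urljoin(base_url, link) if link else None,
        })
    return cleaned


# Made-up payload shaped like the output of the schema above.
sample = json.dumps([
    {"title": " First item ", "link": "/items/1"},
    {"title": "Second item", "link": "https://example.com/items/2"},
    {"title": "", "link": "/items/3"},
])

for row in normalize_items(sample, "https://example.com/items"):
    print(row)
```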

	---

	## 7. Simple Data Extraction (LLM-based)

	For more complex or irregular pages, a language model can parse text intelligently into a structure you define. Crawl4AI supports open-source or closed-source providers:

	- Open-source models (e.g., `ollama/llama3.3`, with `api_token="no_token"`)
	- OpenAI models (e.g., `openai/gpt-4`, which requires an `api_token`)
	- Or any provider supported by the underlying library

	The example below defines both an open-source strategy (no token) and a closed-source one, then runs the open-source strategy:

	```python
	import os
	import json
	import asyncio
	from pydantic import BaseModel, Field
	from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
	from crawl4ai.extraction_strategy import LLMExtractionStrategy

	class PricingInfo(BaseModel):
	    model_name: str = Field(..., description="Name of the AI model")
	    input_fee: str = Field(..., description="Fee for input tokens")
	    output_fee: str = Field(..., description="Fee for output tokens")

	async def main():
	    # 1) Open-Source usage: no token required
	    llm_strategy_open_source = LLMExtractionStrategy(
	        provider="ollama/llama3.3",  # or "any-other-local-model"
	        api_token="no_token",  # for local models, no API key is typically required
	        schema=PricingInfo.schema(),
	        extraction_type="schema",
	        instruction="""
	        From this page, extract all AI model pricing details in JSON format.
	        Each entry should have 'model_name', 'input_fee', and 'output_fee'.
	        """,
	        temperature=0
	    )

	    # 2) Closed-Source usage: API key for OpenAI, for example
	    openai_token = os.getenv("OPENAI_API_KEY", "sk-YOUR_API_KEY")
	    llm_strategy_openai = LLMExtractionStrategy(
	        provider="openai/gpt-4",
	        api_token=openai_token,
	        schema=PricingInfo.schema(),
	        extraction_type="schema",
	        instruction="""
	        From this page, extract all AI model pricing details in JSON format.
	        Each entry should have 'model_name', 'input_fee', and 'output_fee'.
	        """,
	        temperature=0
	    )

	    # We'll demo the open-source approach here
	    config = CrawlerRunConfig(extraction_strategy=llm_strategy_open_source)

	    async with AsyncWebCrawler() as crawler:
	        result = await crawler.arun(
	            url="https://example.com/pricing",
	            config=config
	        )
	        print("LLM-based extraction JSON:", result.extracted_content)

	if __name__ == "__main__":
	    asyncio.run(main())
	```

	What’s happening?
	- We define a Pydantic schema (`PricingInfo`) describing the fields we want.
	- The LLM extraction strategy uses that schema and your instructions to transform raw text into structured JSON.
	- Depending on the `provider` and `api_token`, you can use local models or a remote API.
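One way to switch between the two setups is to pick the `provider` and `api_token` values from the environment. The sketch below is an assumption about how you might organize that choice, not a Crawl4AI feature: `choose_llm_settings` is a hypothetical helper, and it simply returns the provider and token values used in the example above.

```python
import os


def choose_llm_settings(env=None):
    """Return (provider, api_token): OpenAI if a key is set, else a local Ollama model."""
    env = os.environ if env is None else env
    token = env.get("OPENAI_API_KEY", "").strip()
    if token:
        return "openai/gpt-4", token
    # Local models need no real key; the tutorial passes the literal "no_token".
    return "ollama/llama3.3", "no_token"


provider, api_token = choose_llm_settings()
print("Using provider:", provider)
```

The returned pair can then be passed as `provider=` and `api_token=` when constructing an `LLMExtractionStrategy`.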

	---

	## 8. Next Steps

	Congratulations! You have:
	1. Installed Crawl4AI (via pip, with Docker as an option).
	2. Performed a simple crawl and printed Markdown.
	3. Seen how adding a markdown generator + content filter can produce “fit” Markdown.
	4. Experimented with CSS-based extraction for repetitive data.
	5. Learned the basics of LLM-based extraction (open-source and closed-source).

	If you are ready for more, check out:

	- Installation: Learn more on how to install Crawl4AI and set up Playwright.
	- Focus on Configuration: Learn to customize browser settings, caching modes, advanced timeouts, etc.
	- Markdown Generation Basics: Dive deeper into content filtering and “fit markdown” usage.
	- Dynamic Pages & Hooks: Tackle sites with “Load More” buttons, login forms, or JavaScript complexities.
	- Deployment: Run Crawl4AI in Docker containers and scale across multiple nodes.
	- Explanations & How-To Guides: Explore browser contexts, identity-based crawling, hooking, performance, and more.

	Crawl4AI is a powerful tool for extracting data and generating Markdown from virtually any website. Enjoy exploring, and we hope you build amazing AI-powered applications with it!