# Simple Crawling

This guide covers the basics of web crawling with Crawl4AI. You'll learn how to set up a crawler, make your first request, and understand the response.

## Basic Usage

Set up a simple crawl using `BrowserConfig` and `CrawlerRunConfig`:

```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig

async def main():
    browser_config = BrowserConfig()  # Default browser configuration
    run_config = CrawlerRunConfig()   # Default crawl run configuration

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://example.com",
            config=run_config
        )
        print(result.markdown)  # Print clean markdown content

if __name__ == "__main__":
    asyncio.run(main())
```
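
The `async with` block starts the browser when the block is entered and shuts it down when it exits. If you want to reuse one browser across several calls instead, newer releases also expose explicit lifecycle methods; a minimal sketch, assuming `start()` and `close()` are available in your installed version:

```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig

async def main():
    crawler = AsyncWebCrawler(config=BrowserConfig())
    await crawler.start()  # Explicit lifecycle; check that your version supports this
    try:
        result = await crawler.arun(url="https://example.com", config=CrawlerRunConfig())
        print(result.markdown)
    finally:
        await crawler.close()  # Always release the browser, even on errors

asyncio.run(main())
```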

## Understanding the Response

The `arun()` method returns a `CrawlResult` object with several useful properties. Here's a quick overview (see [CrawlResult](../api/crawl-result.md) for complete details). Note that `fit_markdown` is only populated when the run is configured with a content filter; with a default `CrawlerRunConfig` it stays empty (see the sketch below):

```python
result = await crawler.arun(
    url="https://example.com",
    config=CrawlerRunConfig()
)

# Different content formats
print(result.html)          # Raw HTML
print(result.cleaned_html)  # Cleaned HTML
print(result.markdown)      # Markdown version
print(result.fit_markdown)  # Most relevant content in markdown (requires a content filter)

# Check success status
print(result.success)       # True if crawl succeeded
print(result.status_code)   # HTTP status code (e.g., 200, 404)

# Access extracted media and links
print(result.media)         # Dictionary of found media (images, videos, audio)
print(result.links)         # Dictionary of internal and external links
```
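
To populate `fit_markdown`, attach a content filter through a markdown generator. A minimal sketch, assuming the `DefaultMarkdownGenerator` and `PruningContentFilter` import paths below match your installed version (the `threshold` value is an illustrative starting point, not a recommendation):

```python
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.async_configs import CrawlerRunConfig

run_config = CrawlerRunConfig(
    markdown_generator=DefaultMarkdownGenerator(
        # Prune low-value blocks (navigation, boilerplate) before markdown generation
        content_filter=PruningContentFilter(threshold=0.48, threshold_type="fixed")
    )
)

result = await crawler.arun(url="https://example.com", config=run_config)
# Depending on your version, the filtered output is result.fit_markdown
# or result.markdown.fit_markdown
print(result.fit_markdown)
```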

## Adding Basic Options

Customize your crawl using `CrawlerRunConfig`:

```python
run_config = CrawlerRunConfig(
    word_count_threshold=10,      # Minimum words per content block
    exclude_external_links=True,  # Remove external links
    remove_overlay_elements=True, # Remove popups/modals
    process_iframes=True          # Process iframe content
)

result = await crawler.arun(
    url="https://example.com",
    config=run_config
)
```
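
Caching is controlled the same way. `CacheMode` (imported in the complete example below) decides whether a run may reuse previously fetched pages; a short sketch of the two most common settings:

```python
from crawl4ai.async_configs import CrawlerRunConfig, CacheMode

# Always fetch fresh content, ignoring any cached copy
fresh_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)

# Reuse cached results when available
cached_config = CrawlerRunConfig(cache_mode=CacheMode.ENABLED)
```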

## Handling Errors

Always check if the crawl was successful:

```python
run_config = CrawlerRunConfig()
result = await crawler.arun(url="https://example.com", config=run_config)

if not result.success:
    print(f"Crawl failed: {result.error_message}")
    print(f"Status code: {result.status_code}")
```
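
A failed fetch normally comes back as a `CrawlResult` with `success=False` rather than as an exception, but errors raised before a result exists (an invalid URL, a browser that fails to launch) can still propagate. A defensive sketch combining both checks; the `try`/`except` here is plain Python, not a Crawl4AI-specific API:

```python
try:
    result = await crawler.arun(url="https://example.com", config=CrawlerRunConfig())
    if result.success:
        print(result.markdown)
    else:
        # The crawl ran, but the page could not be processed
        print(f"Crawl failed: {result.error_message} (status {result.status_code})")
except Exception as exc:
    # Raised before a CrawlResult was produced
    print(f"Crawler raised an exception: {exc}")
```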

## Logging and Debugging

Enable verbose logging in `BrowserConfig`:

```python
browser_config = BrowserConfig(verbose=True)

async with AsyncWebCrawler(config=browser_config) as crawler:
    run_config = CrawlerRunConfig()
    result = await crawler.arun(url="https://example.com", config=run_config)
```
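
`CrawlerRunConfig` accepts its own `verbose` flag as well, so you can keep browser startup quiet and only log per-run activity; a sketch, assuming both flags behave independently in your version:

```python
# Quiet browser lifecycle, chatty crawl runs
browser_config = BrowserConfig(verbose=False)
run_config = CrawlerRunConfig(verbose=True)

async with AsyncWebCrawler(config=browser_config) as crawler:
    result = await crawler.arun(url="https://example.com", config=run_config)
```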

## Complete Example

Here's a more comprehensive example demonstrating common usage patterns:

```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig, CacheMode

async def main():
    browser_config = BrowserConfig(verbose=True)
    run_config = CrawlerRunConfig(
        # Content filtering
        word_count_threshold=10,
        excluded_tags=['form', 'header'],
        exclude_external_links=True,

        # Content processing
        process_iframes=True,
        remove_overlay_elements=True,

        # Cache control
        cache_mode=CacheMode.ENABLED  # Use cache if available
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://example.com",
            config=run_config
        )

        if result.success:
            # Print clean content
            print("Content:", result.markdown[:500])  # First 500 chars

            # Process images
            for image in result.media["images"]:
                print(f"Found image: {image['src']}")

            # Process links
            for link in result.links["internal"]:
                print(f"Internal link: {link['href']}")
        else:
            print(f"Crawl failed: {result.error_message}")

if __name__ == "__main__":
    asyncio.run(main())
```
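
Once a single-page crawl works, the same configuration scales to batches. A sketch using `arun_many()`, which crawls several URLs with one crawler instance and returns one result per URL (concurrency options vary by version; `browser_config` and `run_config` are the ones defined above):

```python
urls = [
    "https://example.com",
    "https://example.com/about",
]

async with AsyncWebCrawler(config=browser_config) as crawler:
    results = await crawler.arun_many(urls, config=run_config)
    for res in results:
        status = "ok" if res.success else f"failed: {res.error_message}"
        print(f"{res.url}: {status}")
```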