# Extracting JSON (LLM)

In some cases, you need to extract **complex or unstructured** information from a webpage that a simple CSS/XPath schema cannot easily parse. Or you want **AI**-driven insights, classification, or summarization. For these scenarios, Crawl4AI provides an **LLM-based extraction strategy** that:

1. Works with **any** large language model supported by [LiteLLM](https://github.com/BerriAI/litellm) (Ollama, OpenAI, Claude, and more).
2. Automatically splits content into chunks (if desired) to handle token limits, then combines results.
3. Lets you define a **schema** (like a Pydantic model) or a simpler “block” extraction approach.

**Important**: LLM-based extraction can be slower and costlier than schema-based approaches. If your page data is highly structured, consider using [`JsonCssExtractionStrategy`](./json-extraction-basic.md) or [`JsonXPathExtractionStrategy`](./json-extraction-basic.md) first. But if you need AI to interpret or reorganize content, read on!
---

## 1. Why Use an LLM?

- **Complex Reasoning**: If the site’s data is unstructured, scattered, or full of natural language context.
- **Semantic Extraction**: Summaries, knowledge graphs, or relational data that require comprehension.
- **Flexible**: You can pass instructions to the model to do more advanced transformations or classification.

---
## 2. Provider-Agnostic via LiteLLM

Crawl4AI uses a “provider string” (e.g., `"openai/gpt-4o"`, `"ollama/llama2"`, `"aws/titan"`) to identify your LLM. **Any** model that LiteLLM supports is fair game. You just provide:

- **`provider`**: The `<provider>/<model_name>` identifier (e.g., `"openai/gpt-4"`, `"ollama/llama2"`, `"huggingface/google-flan"`, etc.).
- **`api_token`**: If needed (for OpenAI, HuggingFace, etc.); local models or Ollama might not require it.
- **`api_base`** (optional): If your provider has a custom endpoint.

This means you **aren’t locked** into a single LLM vendor. Switch or experiment easily, as in the sketch below.
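For instance, here is the same strategy pointed at two different backends (a minimal sketch; the API key and the Ollama endpoint URL are placeholders you would substitute for your own setup):

```python
from crawl4ai.extraction_strategy import LLMExtractionStrategy

# Hosted model: needs an API token
openai_strategy = LLMExtractionStrategy(
    provider="openai/gpt-4o-mini",
    api_token="YOUR_OPENAI_KEY",
    instruction="Extract the page's key facts as JSON.",
)

# Local model via Ollama: typically no token, optional custom endpoint
ollama_strategy = LLMExtractionStrategy(
    provider="ollama/llama2",
    api_token=None,
    api_base="http://localhost:11434",  # placeholder; only needed if non-default
    instruction="Extract the page's key facts as JSON.",
)
```

Swapping providers is just a matter of changing these two or three arguments; the rest of your crawl configuration stays the same.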
---

## 3. How LLM Extraction Works

### 3.1 Flow

1. **Chunking** (optional): The HTML or markdown is split into smaller segments if it’s very long (based on `chunk_token_threshold`, overlap, etc.).
2. **Prompt Construction**: For each chunk, the library forms a prompt that includes your **`instruction`** (and possibly schema or examples).
3. **LLM Inference**: Each chunk is sent to the model in parallel or sequentially (depending on your concurrency).
4. **Combining**: The results from each chunk are merged and parsed into JSON.
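To make those four steps concrete, here is a self-contained toy sketch of the flow (not the library’s internal code; `fake_llm` and `extract` are hypothetical names, and the real strategy also handles overlap, schemas, and error cases):

```python
import json

def fake_llm(prompt: str) -> str:
    # Stand-in for a real provider call; a real model would return JSON text
    return "[]"

def extract(text: str, instruction: str, chunk_words: int = 900) -> list:
    # 1. Chunking: naive fixed-size word windows (no overlap, for brevity)
    words = text.split()
    chunks = [" ".join(words[i:i + chunk_words])
              for i in range(0, len(words), chunk_words)] or [""]
    merged = []
    for chunk in chunks:
        # 2. Prompt construction: instruction + chunk content
        prompt = f"{instruction}\n\nCONTENT:\n{chunk}"
        # 3. Inference: one call per chunk (could also run in parallel)
        merged.extend(json.loads(fake_llm(prompt)))
    # 4. Combining: merged JSON records from all chunks
    return merged
```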
### 3.2 `extraction_type`

- **`"schema"`**: The model tries to return JSON conforming to your Pydantic-based schema.
- **`"block"`**: The model returns freeform text, or smaller JSON structures, which the library collects.

For structured data, `"schema"` is recommended. You provide `schema=YourPydanticModel.model_json_schema()`.
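For comparison, here are the two modes configured side by side (a sketch; the provider and token are placeholders):

```python
from pydantic import BaseModel
from crawl4ai.extraction_strategy import LLMExtractionStrategy

class Product(BaseModel):
    name: str
    price: str

# "block": freeform text or small JSON fragments, collected by the library
block_strategy = LLMExtractionStrategy(
    provider="openai/gpt-4o-mini",
    api_token="YOUR_KEY",
    extraction_type="block",
    instruction="Pull out the key product facts as short text blocks.",
)

# "schema": output constrained to a Pydantic-derived JSON schema
schema_strategy = LLMExtractionStrategy(
    provider="openai/gpt-4o-mini",
    api_token="YOUR_KEY",
    extraction_type="schema",
    schema=Product.model_json_schema(),
    instruction="Extract all products with 'name' and 'price' fields.",
)
```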
---

## 4. Key Parameters

Below is an overview of important LLM extraction parameters. All are typically set inside `LLMExtractionStrategy(...)`. You then put that strategy in your `CrawlerRunConfig(..., extraction_strategy=...)`.

1. **`provider`** (str): e.g., `"openai/gpt-4"`, `"ollama/llama2"`.
2. **`api_token`** (str): The API key or token for that model. May not be needed for local models.
3. **`schema`** (dict): A JSON schema describing the fields you want. Usually generated by `YourModel.model_json_schema()`.
4. **`extraction_type`** (str): `"schema"` or `"block"`.
5. **`instruction`** (str): Prompt text telling the LLM what you want extracted. E.g., “Extract these fields as a JSON array.”
6. **`chunk_token_threshold`** (int): Maximum tokens per chunk. If your content is huge, you can break it up for the LLM.
7. **`overlap_rate`** (float): Overlap ratio between adjacent chunks. E.g., `0.1` means 10% of each chunk is repeated to preserve context continuity.
8. **`apply_chunking`** (bool): Set `True` to chunk automatically. If you want a single pass, set `False`.
9. **`input_format`** (str): Determines **which** crawler result is passed to the LLM. Options include:
   - `"markdown"`: The raw markdown (default).
   - `"fit_markdown"`: The filtered “fit” markdown if you used a content filter.
   - `"html"`: The cleaned or raw HTML.
10. **`extra_args`** (dict): Additional LLM parameters like `temperature`, `max_tokens`, `top_p`, etc.
11. **`show_usage()`**: A method you can call to print out usage info (token usage per chunk, total cost if known).
**Example**:

```python
extraction_strategy = LLMExtractionStrategy(
    provider="openai/gpt-4",
    api_token="YOUR_OPENAI_KEY",
    schema=MyModel.model_json_schema(),
    extraction_type="schema",
    instruction="Extract a list of items from the text with 'name' and 'price' fields.",
    chunk_token_threshold=1200,
    overlap_rate=0.1,
    apply_chunking=True,
    input_format="html",
    extra_args={"temperature": 0.1, "max_tokens": 1000},
    verbose=True
)
```
---

## 5. Putting It in `CrawlerRunConfig`

**Important**: In Crawl4AI, all strategy definitions should go inside the `CrawlerRunConfig`, not directly as a param in `arun()`. Here’s a full example:
```python
import os
import asyncio
import json
from pydantic import BaseModel
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import LLMExtractionStrategy

class Product(BaseModel):
    name: str
    price: str

async def main():
    # 1. Define the LLM extraction strategy
    llm_strategy = LLMExtractionStrategy(
        provider="openai/gpt-4o-mini",  # e.g. "ollama/llama2"
        api_token=os.getenv('OPENAI_API_KEY'),
        schema=Product.model_json_schema(),  # JSON schema dict from the Pydantic model
        extraction_type="schema",
        instruction="Extract all product objects with 'name' and 'price' from the content.",
        chunk_token_threshold=1000,
        overlap_rate=0.0,
        apply_chunking=True,
        input_format="markdown",  # or "html", "fit_markdown"
        extra_args={"temperature": 0.0, "max_tokens": 800}
    )

    # 2. Build the crawler config
    crawl_config = CrawlerRunConfig(
        extraction_strategy=llm_strategy,
        cache_mode=CacheMode.BYPASS
    )

    # 3. Create a browser config if needed
    browser_cfg = BrowserConfig(headless=True)

    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        # 4. Let's say we want to crawl a single page
        result = await crawler.arun(
            url="https://example.com/products",
            config=crawl_config
        )

        if result.success:
            # 5. The extracted content is presumably JSON
            data = json.loads(result.extracted_content)
            print("Extracted items:", data)

            # 6. Show usage stats
            llm_strategy.show_usage()  # prints token usage
        else:
            print("Error:", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())
```
---

## 6. Chunking Details

### 6.1 `chunk_token_threshold`

If your page is large, you might exceed your LLM’s context window. **`chunk_token_threshold`** sets the approximate max tokens per chunk. The library estimates token counts from word counts using `word_token_rate` (often ~0.75 by default). If chunking is enabled (`apply_chunking=True`), the text is split into segments.

### 6.2 `overlap_rate`

To keep context continuous across chunks, we can overlap them. E.g., `overlap_rate=0.1` means each subsequent chunk includes 10% of the previous chunk’s text. This is helpful if your needed info might straddle chunk boundaries.
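To see how these two settings interact, here is a small standalone sketch of the arithmetic (illustrative only; it assumes `word_token_rate` means roughly that many words per token, so it is not the library’s exact splitting code):

```python
def chunk_boundaries(total_words: int, chunk_token_threshold: int,
                     overlap_rate: float, word_token_rate: float = 0.75):
    # Approximate words per chunk from the token budget
    words_per_chunk = int(chunk_token_threshold * word_token_rate)
    # Each new chunk starts before the previous one ends, by overlap_rate
    step = int(words_per_chunk * (1 - overlap_rate))
    return [(start, min(start + words_per_chunk, total_words))
            for start in range(0, total_words, step)]

# A 2,000-word page, 1,200-token threshold, 10% overlap:
print(chunk_boundaries(2000, 1200, 0.1))
# -> [(0, 900), (810, 1710), (1620, 2000)]
```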
### 6.3 Performance & Parallelism

By chunking, you can potentially process multiple chunks in parallel (depending on your concurrency settings and the LLM provider). This reduces total time if the site is huge or has many sections.
---

## 7. Input Format

By default, **LLMExtractionStrategy** uses `input_format="markdown"`, meaning the **crawler’s final markdown** is fed to the LLM. You can change to:

- **`html`**: The cleaned HTML or raw HTML (depending on your crawler config) goes into the LLM.
- **`fit_markdown`**: If you used, for instance, `PruningContentFilter`, the “fit” version of the markdown is used. This can drastically reduce tokens if you trust the filter.
- **`markdown`**: Standard markdown output from the crawler’s `markdown_generator`.

This setting is crucial: if the LLM instructions rely on HTML tags, pick `"html"`. If you prefer a text-based approach, pick `"markdown"`.
```python
LLMExtractionStrategy(
    # ...
    input_format="html",  # Instead of "markdown" or "fit_markdown"
)
```
---

## 8. Token Usage & Show Usage

To keep track of tokens and cost, each chunk is processed with an LLM call. We record usage in:

- **`usages`** (list): token usage per chunk or call.
- **`total_usage`**: sum of all chunk calls.
- **`show_usage()`**: prints a usage report (if the provider returns usage data).

```python
llm_strategy = LLMExtractionStrategy(...)
# ...
llm_strategy.show_usage()
# e.g. "Total usage: 1241 tokens across 2 chunk calls"
```

If your model provider doesn’t return usage info, these fields might be partial or empty.
---

## 9. Example: Building a Knowledge Graph

Below is a snippet combining **`LLMExtractionStrategy`** with a Pydantic schema for a knowledge graph. Notice how we pass an **`instruction`** telling the model what to parse.
```python
import os
import asyncio
from typing import List
from pydantic import BaseModel
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import LLMExtractionStrategy

class Entity(BaseModel):
    name: str
    description: str

class Relationship(BaseModel):
    entity1: Entity
    entity2: Entity
    description: str
    relation_type: str

class KnowledgeGraph(BaseModel):
    entities: List[Entity]
    relationships: List[Relationship]

async def main():
    # LLM extraction strategy
    llm_strat = LLMExtractionStrategy(
        provider="openai/gpt-4",
        api_token=os.getenv('OPENAI_API_KEY'),
        schema=KnowledgeGraph.model_json_schema(),
        extraction_type="schema",
        instruction="Extract entities and relationships from the content. Return valid JSON.",
        chunk_token_threshold=1400,
        apply_chunking=True,
        input_format="html",
        extra_args={"temperature": 0.1, "max_tokens": 1500}
    )

    crawl_config = CrawlerRunConfig(
        extraction_strategy=llm_strat,
        cache_mode=CacheMode.BYPASS
    )

    async with AsyncWebCrawler(config=BrowserConfig(headless=True)) as crawler:
        # Example page
        url = "https://www.nbcnews.com/business"
        result = await crawler.arun(url=url, config=crawl_config)

        if result.success:
            with open("kb_result.json", "w", encoding="utf-8") as f:
                f.write(result.extracted_content)
            llm_strat.show_usage()
        else:
            print("Crawl failed:", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())
```
**Key Observations**:

- **`extraction_type="schema"`** ensures we get JSON fitting our `KnowledgeGraph`.
- **`input_format="html"`** means we feed HTML to the model.
- **`instruction`** guides the model to output a structured knowledge graph.

---
## 10. Best Practices & Caveats

1. **Cost & Latency**: LLM calls can be slow or expensive. Consider chunking or smaller coverage if you only need partial data.
2. **Model Token Limits**: If your page + instruction exceed the context window, chunking is essential.
3. **Instruction Engineering**: Well-crafted instructions can drastically improve output reliability.
4. **Schema Strictness**: `"schema"` extraction tries to parse the model output as JSON. If the model returns invalid JSON, partial extraction might happen, or you might get an error.
5. **Parallel vs. Serial**: The library can process multiple chunks in parallel, but you must watch out for rate limits on certain providers.
6. **Check Output**: Sometimes, an LLM might omit fields or produce extraneous text. You may want to post-validate with Pydantic or do additional cleanup, as in the sketch below.
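A minimal post-validation sketch (assuming the `Product` model from Section 5; the helper name is ours, not the library’s):

```python
import json
from typing import List
from pydantic import BaseModel, ValidationError

class Product(BaseModel):
    name: str
    price: str

def validate_products(extracted_content: str) -> List[Product]:
    # First, guard against outright malformed JSON from the model
    try:
        raw_items = json.loads(extracted_content)
    except json.JSONDecodeError as e:
        print(f"Model returned malformed JSON: {e}")
        return []

    # Then re-validate each item against the schema, skipping bad ones
    valid = []
    for item in raw_items:
        try:
            valid.append(Product.model_validate(item))
        except ValidationError as e:
            print(f"Skipping invalid item {item!r}: {e}")
    return valid
```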
---

## 11. Conclusion

**LLM-based extraction** in Crawl4AI is **provider-agnostic**, letting you choose from hundreds of models via LiteLLM. It’s perfect for **semantically complex** tasks or generating advanced structures like knowledge graphs. However, it’s **slower** and potentially costlier than schema-based approaches. Keep these tips in mind:

- Put your LLM strategy **in `CrawlerRunConfig`**.
- Use **`input_format`** to pick which form (markdown, HTML, fit_markdown) the LLM sees.
- Tweak **`chunk_token_threshold`**, **`overlap_rate`**, and **`apply_chunking`** to handle large content efficiently.
- Monitor token usage with `show_usage()`.

If your site’s data is consistent or repetitive, consider [`JsonCssExtractionStrategy`](./json-extraction-basic.md) first for speed and simplicity. But if you need an **AI-driven** approach, `LLMExtractionStrategy` offers a flexible, multi-provider solution for extracting structured JSON from any website.
**Next Steps**:

1. **Experiment with Different Providers**
   - Try switching the `provider` (e.g., `"ollama/llama2"`, `"openai/gpt-4o"`, etc.) to see differences in speed, accuracy, or cost.
   - Pass different `extra_args` like `temperature`, `top_p`, and `max_tokens` to fine-tune your results.
2. **Combine With Other Strategies**
   - Use [content filters](../../how-to/content-filters.md) like BM25 or Pruning prior to LLM extraction to remove noise and reduce token usage.
   - Apply a [CSS or XPath extraction strategy](./json-extraction-basic.md) first for obvious, structured data, then send only the tricky parts to the LLM.
3. **Performance Tuning**
   - If pages are large, tweak `chunk_token_threshold`, `overlap_rate`, or `apply_chunking` to optimize throughput.
   - Check the usage logs with `show_usage()` to keep an eye on token consumption and identify potential bottlenecks.
4. **Validate Outputs**
   - If using `extraction_type="schema"`, parse the LLM’s JSON with a Pydantic model for a final validation step.
   - Log or handle any parse errors gracefully, especially if the model occasionally returns malformed JSON.
5. **Explore Hooks & Automation**
   - Integrate LLM extraction with [hooks](./hooks-custom.md) for complex pre/post-processing.
   - Use a multi-step pipeline: crawl, filter, LLM-extract, then store or index results for further analysis.
6. **Scale and Deploy**
   - Combine your LLM extraction setup with [Docker or other deployment solutions](./docker-quickstart.md) to run at scale.
   - Monitor memory usage and concurrency if you call LLMs frequently.

**Last Updated**: 2024-XX-XX

---

That’s it for **Extracting JSON (LLM)**. Now you can harness AI to parse, classify, or reorganize data on the web. Happy crawling!