# Extraction Strategies Overview

Crawl4AI provides powerful extraction strategies to help you get structured data from web pages. Each strategy is designed for specific use cases and offers different approaches to data extraction.

## Available Strategies

### [LLM-Based Extraction](llm.md)

`LLMExtractionStrategy` uses Language Models to extract structured data from web content. This approach is highly flexible and can understand content semantically.

```python
import asyncio
from pydantic import BaseModel
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy

class Product(BaseModel):
    name: str
    price: float
    description: str

strategy = LLMExtractionStrategy(
    provider="ollama/llama2",
    schema=Product.schema(),  # on Pydantic v2, use Product.model_json_schema()
    instruction="Extract product details from the page"
)

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com/product",
            extraction_strategy=strategy
        )
        print(result.extracted_content)

asyncio.run(main())
```

**Best for:**
- Complex data structures
- Content requiring interpretation
- Flexible content formats
- Natural language processing

### [CSS-Based Extraction](css.md)

`JsonCssExtractionStrategy` extracts data using CSS selectors. This is fast, reliable, and perfect for consistently structured pages.

```python
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

schema = {
    "name": "Product Listing",
    "baseSelector": ".product-card",
    "fields": [
        {"name": "title", "selector": "h2", "type": "text"},
        {"name": "price", "selector": ".price", "type": "text"},
        {"name": "image", "selector": "img", "type": "attribute", "attribute": "src"}
    ]
}

strategy = JsonCssExtractionStrategy(schema)

result = await crawler.arun(
    url="https://example.com/products",
    extraction_strategy=strategy
)
```

**Best for:**
- E-commerce product listings
- News article collections
- Structured content pages
- High-performance needs

### [Cosine Strategy](cosine.md)

`CosineStrategy` uses similarity-based clustering to identify and extract relevant content sections.

```python
from crawl4ai.extraction_strategy import CosineStrategy

strategy = CosineStrategy(
    semantic_filter="product reviews",    # Content focus
    word_count_threshold=10,             # Minimum words per cluster
    sim_threshold=0.3,                   # Similarity threshold
    max_dist=0.2,                        # Maximum cluster distance
    top_k=3                             # Number of top clusters to extract
)

result = await crawler.arun(
    url="https://example.com/reviews",
    extraction_strategy=strategy
)
```

**Best for:**
- Content similarity analysis
- Topic clustering
- Relevant content extraction
- Pattern recognition in text

## Strategy Selection Guide

Choose your strategy based on these factors:

1. **Content Structure**
   - Well-structured HTML → Use CSS Strategy
   - Natural language text → Use LLM Strategy
   - Mixed/Complex content → Use Cosine Strategy

2. **Performance Requirements**
   - Fastest: CSS Strategy
   - Moderate: Cosine Strategy
   - Variable: LLM Strategy (depends on provider)

3. **Accuracy Needs**
   - Highest structure accuracy: CSS Strategy
   - Best semantic understanding: LLM Strategy
   - Best content relevance: Cosine Strategy
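
In practice the choice often reduces to a couple of questions about the page. As a rough illustration of the guide above (this helper is hypothetical, not part of the Crawl4AI API), you could wrap the decision in a small function:

```python
from crawl4ai.extraction_strategy import (
    JsonCssExtractionStrategy,
    LLMExtractionStrategy,
    CosineStrategy,
)

def choose_strategy(structured_html: bool, needs_interpretation: bool, css_schema: dict):
    """Illustrative only: apply the selection guide above."""
    if structured_html:
        # Fast and precise when selectors are stable
        return JsonCssExtractionStrategy(css_schema)
    if needs_interpretation:
        # Flexible semantic extraction; speed depends on the provider
        return LLMExtractionStrategy(
            provider="ollama/llama2",
            instruction="Extract the key fields from the page"
        )
    # Otherwise cluster by similarity and keep the most relevant sections
    return CosineStrategy(semantic_filter="main content", top_k=3)
```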

## Combining Strategies

You can combine strategies for more powerful extraction:

```python
# First use CSS strategy for initial structure
css_result = await crawler.arun(
    url="https://example.com",
    extraction_strategy=css_strategy
)

# Then use LLM for semantic analysis
llm_result = await crawler.arun(
    url="https://example.com",
    extraction_strategy=llm_strategy
)
```
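
Combining typically ends with merging the two outputs. A minimal sketch, assuming both strategies return JSON in `result.extracted_content` (the merged field names here are illustrative):

```python
import json

# Structured fields from the CSS pass, semantic analysis from the LLM pass
products = json.loads(css_result.extracted_content)
analysis = json.loads(llm_result.extracted_content)

combined = {
    "products": products,
    "analysis": analysis,
}
```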

## Common Use Cases

1. **E-commerce Scraping**
   ```python
   # CSS Strategy for product listings
   schema = {
       "name": "Products",
       "baseSelector": ".product",
       "fields": [
           {"name": "name", "selector": ".title", "type": "text"},
           {"name": "price", "selector": ".price", "type": "text"}
       ]
   }
   ```

2. **News Article Extraction**
   ```python
   # LLM Strategy for article content
   class Article(BaseModel):
       title: str
       content: str
       author: str
       date: str

   strategy = LLMExtractionStrategy(
       provider="ollama/llama2",
       schema=Article.schema()
   )
   ```

3. **Content Analysis**
   ```python
   # Cosine Strategy for topic analysis
   strategy = CosineStrategy(
       semantic_filter="technology trends",
       top_k=5
   )
   ```


## Input Formats
All extraction strategies support different input formats to give you more control over how content is processed:

- **markdown** (default): Uses the raw markdown conversion of the HTML content. Best for general text extraction where HTML structure isn't critical.
- **html**: Uses the raw HTML content. Useful when you need to preserve HTML structure or extract data from specific HTML elements.
- **fit_markdown**: Uses the cleaned and filtered markdown content. Best for extracting relevant content while removing noise. Requires a markdown generator with content filter to be configured.

To specify an input format:
```python
strategy = LLMExtractionStrategy(
    input_format="html",  # or "markdown" or "fit_markdown"
    provider="openai/gpt-4",
    instruction="Extract product information"
)
```

Note: When using "fit_markdown", ensure your CrawlerRunConfig includes a markdown generator with content filter:
```python
from crawl4ai import CrawlerRunConfig
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
from crawl4ai.content_filter_strategy import PruningContentFilter

config = CrawlerRunConfig(
    extraction_strategy=strategy,
    markdown_generator=DefaultMarkdownGenerator(
        content_filter=PruningContentFilter()  # Content filter goes here for fit_markdown
    )
)
```

If fit_markdown is requested but not available (no markdown generator or content filter), the system will automatically fall back to raw markdown with a warning.

## Best Practices

1. **Choose the Right Strategy**
   - Start with CSS for structured data
   - Use LLM for complex interpretation
   - Try Cosine for content relevance

2. **Optimize Performance**
   - Cache LLM results
   - Keep CSS selectors specific
   - Tune similarity thresholds
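
   One way to avoid repeated LLM calls is to let the crawler serve repeated URLs from its cache. A minimal sketch, assuming your Crawl4AI version exposes `CacheMode` and the `cache_mode` option on `CrawlerRunConfig`:

   ```python
   from crawl4ai import CacheMode, CrawlerRunConfig

   config = CrawlerRunConfig(
       extraction_strategy=strategy,     # e.g. the LLM strategy defined earlier
       cache_mode=CacheMode.ENABLED      # reuse cached results for repeated URLs
   )
   ```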

3. **Handle Errors**
   ```python
   import json

   result = await crawler.arun(
       url="https://example.com",
       extraction_strategy=strategy
   )
   
   if not result.success:
       print(f"Extraction failed: {result.error_message}")
   else:
       data = json.loads(result.extracted_content)
   ```

Each strategy has its strengths and optimal use cases. Explore the detailed documentation for each strategy to learn more about their specific features and configurations.