Crawl4AI

Runtime error

App Files Files Community

Crawl4AI / docs /md_v2 /extraction /cosine.md

amaye15

test

03c0888 3 months ago

preview code

raw

history blame contribute delete

6.43 kB

	# Cosine Strategy

	The Cosine Strategy in Crawl4AI uses similarity-based clustering to identify and extract relevant content sections from web pages. This strategy is particularly useful when you need to find and extract content based on semantic similarity rather than structural patterns.

	## How It Works

	The Cosine Strategy:
	1. Breaks down page content into meaningful chunks
	2. Converts text into vector representations
	3. Calculates similarity between chunks
	4. Clusters similar content together
	5. Ranks and filters content based on relevance

	## Basic Usage

	```python
	from crawl4ai.extraction_strategy import CosineStrategy

	strategy = CosineStrategy(
	semantic_filter="product reviews", # Target content type
	word_count_threshold=10, # Minimum words per cluster
	sim_threshold=0.3 # Similarity threshold
	)

	async with AsyncWebCrawler() as crawler:
	result = await crawler.arun(
	url="https://example.com/reviews",
	extraction_strategy=strategy
	)

	content = result.extracted_content
	```

	## Configuration Options

	### Core Parameters

	```python
	CosineStrategy(
	# Content Filtering
	semantic_filter: str = None, # Keywords/topic for content filtering
	word_count_threshold: int = 10, # Minimum words per cluster
	sim_threshold: float = 0.3, # Similarity threshold (0.0 to 1.0)

	# Clustering Parameters
	max_dist: float = 0.2, # Maximum distance for clustering
	linkage_method: str = 'ward', # Clustering linkage method
	top_k: int = 3, # Number of top categories to extract

	# Model Configuration
	model_name: str = 'sentence-transformers/all-MiniLM-L6-v2', # Embedding model

	verbose: bool = False # Enable logging
	)
	```

	### Parameter Details

	1. semantic_filter
	- Sets the target topic or content type
	- Use keywords relevant to your desired content
	- Example: "technical specifications", "user reviews", "pricing information"

	2. sim_threshold
	- Controls how similar content must be to be grouped together
	- Higher values (e.g., 0.8) mean stricter matching
	- Lower values (e.g., 0.3) allow more variation
	```python
	# Strict matching
	strategy = CosineStrategy(sim_threshold=0.8)

	# Loose matching
	strategy = CosineStrategy(sim_threshold=0.3)
	```

	3. word_count_threshold
	- Filters out short content blocks
	- Helps eliminate noise and irrelevant content
	```python
	# Only consider substantial paragraphs
	strategy = CosineStrategy(word_count_threshold=50)
	```

	4. top_k
	- Number of top content clusters to return
	- Higher values return more diverse content
	```python
	# Get top 5 most relevant content clusters
	strategy = CosineStrategy(top_k=5)
	```

	## Use Cases

	### 1. Article Content Extraction
	```python
	strategy = CosineStrategy(
	semantic_filter="main article content",
	word_count_threshold=100, # Longer blocks for articles
	top_k=1 # Usually want single main content
	)

	result = await crawler.arun(
	url="https://example.com/blog/post",
	extraction_strategy=strategy
	)
	```

	### 2. Product Review Analysis
	```python
	strategy = CosineStrategy(
	semantic_filter="customer reviews and ratings",
	word_count_threshold=20, # Reviews can be shorter
	top_k=10, # Get multiple reviews
	sim_threshold=0.4 # Allow variety in review content
	)
	```

	### 3. Technical Documentation
	```python
	strategy = CosineStrategy(
	semantic_filter="technical specifications documentation",
	word_count_threshold=30,
	sim_threshold=0.6, # Stricter matching for technical content
	max_dist=0.3 # Allow related technical sections
	)
	```

	## Advanced Features

	### Custom Clustering
	```python
	strategy = CosineStrategy(
	linkage_method='complete', # Alternative clustering method
	max_dist=0.4, # Larger clusters
	model_name='sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2' # Multilingual support
	)
	```

	### Content Filtering Pipeline
	```python
	strategy = CosineStrategy(
	semantic_filter="pricing plans features",
	word_count_threshold=15,
	sim_threshold=0.5,
	top_k=3
	)

	async def extract_pricing_features(url: str):
	async with AsyncWebCrawler() as crawler:
	result = await crawler.arun(
	url=url,
	extraction_strategy=strategy
	)

	if result.success:
	content = json.loads(result.extracted_content)
	return {
	'pricing_features': content,
	'clusters': len(content),
	'similarity_scores': [item['score'] for item in content]
	}
	```

	## Best Practices

	1. Adjust Thresholds Iteratively
	- Start with default values
	- Adjust based on results
	- Monitor clustering quality

	2. Choose Appropriate Word Count Thresholds
	- Higher for articles (100+)
	- Lower for reviews/comments (20+)
	- Medium for product descriptions (50+)

	3. Optimize Performance
	```python
	strategy = CosineStrategy(
	word_count_threshold=10, # Filter early
	top_k=5, # Limit results
	verbose=True # Monitor performance
	)
	```

	4. Handle Different Content Types
	```python
	# For mixed content pages
	strategy = CosineStrategy(
	semantic_filter="product features",
	sim_threshold=0.4, # More flexible matching
	max_dist=0.3, # Larger clusters
	top_k=3 # Multiple relevant sections
	)
	```

	## Error Handling

	```python
	try:
	result = await crawler.arun(
	url="https://example.com",
	extraction_strategy=strategy
	)

	if result.success:
	content = json.loads(result.extracted_content)
	if not content:
	print("No relevant content found")
	else:
	print(f"Extraction failed: {result.error_message}")

	except Exception as e:
	print(f"Error during extraction: {str(e)}")
	```

	The Cosine Strategy is particularly effective when:
	- Content structure is inconsistent
	- You need semantic understanding
	- You want to find similar content blocks
	- Structure-based extraction (CSS/XPath) isn't reliable

	It works well with other strategies and can be used as a pre-processing step for LLM-based extraction.