first add
- README.md +156 -12
- app.py +45 -0
- config.ini +4 -0
- config.py +130 -0
- file_handler.py +190 -0
- requirements.txt +7 -0
- save_report.py +27 -0
- search_utils.py +103 -0
- web_search.py +1109 -0
README.md
CHANGED
@@ -1,12 +1,156 @@
# Web Research and Report Generation System

An advanced AI-powered system for automated web research and report generation. This system uses AI agents to search, analyze, and compile comprehensive reports on any given topic.

## Features

- **Intelligent Web Search**
  - Multi-source search using DuckDuckGo and Google
  - Smart retry mechanism with rate limit handling
  - Configurable search depth and result limits
  - Domain filtering for trusted sources

- **Advanced Report Generation**
  - Beautiful HTML reports with modern styling
  - Automatic keyword extraction
  - Source validation and relevance scoring
  - Comprehensive logging of the research process

- **Smart Caching**
  - Caches search results for faster repeat queries
  - Configurable cache directory
  - Cache invalidation management

- **Error Handling**
  - Graceful fallback between search engines
  - Rate limit detection and backoff
  - Detailed error logging
  - Automatic retry mechanisms

## Installation

1. Clone the repository:
   ```bash
   git clone [repository-url]
   cd phidata_analyst
   ```

2. Install dependencies:
   ```bash
   pip install -r requirements.txt
   ```

3. Set up your API keys:
   - Create a `.env` file in the project root
   - Add your API keys:
     ```
     NVIDIA_API_KEY=your-nvidia-api-key
     GOOGLE_API_KEY=your-google-api-key
     ```

## Usage

### Basic Usage

```python
from web_search import create_blog_post_workflow

# Create a workflow instance
workflow = create_blog_post_workflow()

# Generate a report
for response in workflow.run("Your research topic"):
    print(response.message)
```

### Advanced Usage

```python
from web_search import BlogPostGenerator, SqlWorkflowStorage
from phi.agent import Agent
from phi.llm import Nvidia
from phi.tools import DuckDuckGo, GoogleSearch

# Configure custom agents
searcher = Agent(
    model=Nvidia(
        id="meta/llama-3.2-3b-instruct",
        temperature=0.3,
        top_p=0.1
    ),
    tools=[DuckDuckGo(fixed_max_results=10)]
)

# Initialize with custom configuration
generator = BlogPostGenerator(
    searcher=searcher,
    storage=SqlWorkflowStorage(
        table_name="custom_workflows",
        db_file="path/to/db.sqlite"
    )
)

# Run with caching enabled
for response in generator.run("topic", use_cache=True):
    print(response.message)
```

## Output

The system generates:

1. Professional HTML reports with:
   - Executive summary
   - Detailed analysis
   - Source citations
   - Generation timestamp
2. Detailed logs of:
   - Search process
   - Keyword extraction
   - Source relevance
   - Download attempts

Reports are saved in:
- Default: `./reports/YYYY-MM-DD-HH-MM-SS/`
- Custom: configurable via `file_handler`

## Configuration

Key configuration options:

```python
DUCK_DUCK_GO_FIXED_MAX_RESULTS = 10  # Max results from DuckDuckGo
DEFAULT_TEMPERATURE = 0.3            # Model temperature
TOP_P = 0.1                          # Top-p sampling parameter
```

Trusted domains can be configured in `BlogPostGenerator.trusted_domains`.

## Logging

The system uses `phi.utils.log` for comprehensive logging:
- Search progress and results
- Keyword extraction details
- File downloads and failures
- Report generation status

Logs are color-coded for easy monitoring:
- INFO: Normal operations
- WARNING: Non-critical issues
- ERROR: Critical failures

## Contributing

1. Fork the repository
2. Create your feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request

## License

This project is licensed under the MIT License - see the LICENSE file for details.

## Acknowledgments

- Built with [Phi](https://github.com/phidatahq/phidata)
- Uses NVIDIA AI models
- Search powered by DuckDuckGo and Google
app.py
ADDED
@@ -0,0 +1,45 @@
```python
import streamlit as st
import subprocess
import configparser


config = configparser.ConfigParser()

# Streamlit page for user inputs
def user_input_page():
    st.title("Research Topic and Websites Input")

    # Input for research topic
    topic = st.text_input("Enter the research topic:")

    # Input for list of websites
    websites = st.text_area("Enter the list of websites (one per line):")
    websites = websites.splitlines()

    config['DEFAULT'] = {'DEFAULT_TOPIC': "\"{0}\"".format(topic),
                         'INITIAL_WEBSITES': websites}

    with open('config.ini', 'w') as configfile:
        config.write(configfile)

    # Button to load and run web_search.py
    if st.button("Execute Web Research"):
        # Execute web_search.py and capture any errors
        process = subprocess.run(["python3", "web_search.py"], stderr=subprocess.PIPE, text=True)
        error_message = process.stderr

        # Stream the output in real-time
        # for line in process.stdout:
        #     st.write(line)  # Display each line of output as it is produced

        # Wait for the process to complete
        # process.wait()

        # Check for any errors
        if process.returncode != 0:
            st.error(f"Error occurred: {error_message}")
        else:
            st.success("Web search executed successfully!")

# Call the user input page function
user_input_page()
```
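The commented-out block above hints at streaming the script's output into the page while it runs. A minimal sketch of that idea, assuming `subprocess.Popen` with line-buffered text output (this helper is an illustration, not part of the committed app):

```python
import subprocess

import streamlit as st


def run_and_stream(cmd: list[str]) -> int:
    """Run a command and echo its stdout lines into the Streamlit page as they arrive."""
    process = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                               stderr=subprocess.STDOUT, text=True, bufsize=1)
    for line in process.stdout:   # yields lines as the child process prints them
        st.write(line.rstrip())
    return process.wait()         # return code, 0 on success


# Example: stream web_search.py instead of waiting for it to finish.
# if st.button("Execute Web Research"):
#     if run_and_stream(["python3", "web_search.py"]) != 0:
#         st.error("web_search.py exited with an error")
#     else:
#         st.success("Web search executed successfully!")
```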
config.ini
ADDED
@@ -0,0 +1,4 @@
```ini
[DEFAULT]
default_topic = "Is there a process of establishment of Israeli Military or Offensive Cyber Industry in Australia?"
initial_websites = ['https://www.bellingcat.com', 'https://worldview.stratfor.com', 'https://thesoufancenter.org', 'https://www.globalsecurity.org', 'https://www.defenseone.com']
```
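These two keys are read back by `web_search.py` with `configparser`. Because `initial_websites` is stored as a Python-style list literal, `config.get()` returns it as a plain string, so it has to be parsed before it can be used as a list. A minimal sketch (the `ast.literal_eval` call here is an illustration of one way to do that):

```python
import ast
import configparser

config = configparser.ConfigParser()
config.read("config.ini")

# Values come back verbatim, including the surrounding quotes written by app.py.
default_topic = config.get("DEFAULT", "default_topic").strip('"')

# initial_websites is a list literal stored as text; literal_eval turns it back into a list.
initial_websites = ast.literal_eval(config.get("DEFAULT", "initial_websites"))

print(default_topic)
print(initial_websites[0])  # e.g. https://www.bellingcat.com
```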
config.py
ADDED
@@ -0,0 +1,130 @@
```python
"""Configuration settings for the web search and report generation system."""

import os

from phi.model.groq import Groq
from phi.model.together import Together
from phi.model.huggingface import HuggingFaceChat

# API keys read from the environment. (These assignments are assumed: the original
# file referenced GROQ_API_KEY / TOGETHER_API_KEY without defining them, and
# web_search.py imports GROQ_API_KEY and NVIDIA_API_KEY from this module.)
GROQ_API_KEY = os.getenv("GROQ_API_KEY")
TOGETHER_API_KEY = os.getenv("TOGETHER_API_KEY")
NVIDIA_API_KEY = os.getenv("NVIDIA_API_KEY")

# DEFAULT_TOPIC = "Is there a process of establishment of Israeli Military or Offensive Cyber Industry in Australia?"

# # Initial websites for crawling
# INITIAL_WEBSITES = [
#     "https://www.bellingcat.com/",
#     "https://worldview.stratfor.com/",
#     "https://thesoufancenter.org/",
#     "https://www.globalsecurity.org/",
#     "https://www.defenseone.com/"
# ]

# Searcher model configuration
SEARCHER_MODEL_CONFIG = {
    "id": "Trelis/Meta-Llama-3-70B-Instruct-function-calling",
    "temperature": 0.4,
    "top_p": 0.3,
    "repetition_penalty": 1
}

# Writer model configuration
WRITER_MODEL_CONFIG = {
    "id": "Trelis/Meta-Llama-3-70B-Instruct-function-calling",
    "temperature": 0.2,
    "top_p": 0.2,
    "repetition_penalty": 1
}

# Review criteria thresholds
REVIEW_THRESHOLDS = {
    "min_word_count": 2000,
    "min_score": 7,
    "min_avg_score": 8,
    "max_iterations": 5
}

# Crawler settings
CRAWLER_CONFIG = {
    "max_pages_per_site": 10,
    "min_relevance_score": 0.5
}


def get_hf_model(purpose: str) -> HuggingFaceChat:
    """
    Factory function to create HuggingFaceChat models with specific configurations.

    Args:
        purpose: Either 'searcher' or 'writer' to determine which configuration to use

    Returns:
        Configured HuggingFaceChat model instance
    """
    if purpose == 'searcher':
        return HuggingFaceChat(
            id=SEARCHER_MODEL_CONFIG["id"],
            api_key=os.getenv("HF_API_KEY"),
            temperature=SEARCHER_MODEL_CONFIG["temperature"],
            top_p=SEARCHER_MODEL_CONFIG["top_p"],
        )
    elif purpose == 'writer':
        return HuggingFaceChat(
            id=WRITER_MODEL_CONFIG["id"],
            api_key=os.getenv("HF_API_KEY"),
            temperature=WRITER_MODEL_CONFIG["temperature"],
            top_p=WRITER_MODEL_CONFIG["top_p"]
        )
    else:
        raise ValueError(f"Unknown purpose: {purpose}. Must be 'searcher' or 'writer'")


def get_together_model(purpose: str) -> Together:
    """
    Factory function to create Together models with specific configurations.

    Args:
        purpose: Either 'searcher' or 'writer' to determine which configuration to use

    Returns:
        Configured Together model instance
    """
    if purpose == 'searcher':
        return Together(
            id=SEARCHER_MODEL_CONFIG["id"],
            api_key=TOGETHER_API_KEY,
            temperature=SEARCHER_MODEL_CONFIG["temperature"],
            top_p=SEARCHER_MODEL_CONFIG["top_p"],
            repetition_penalty=SEARCHER_MODEL_CONFIG["repetition_penalty"]
        )
    elif purpose == 'writer':
        return Together(
            id=WRITER_MODEL_CONFIG["id"],
            api_key=TOGETHER_API_KEY,
            temperature=WRITER_MODEL_CONFIG["temperature"],
            top_p=WRITER_MODEL_CONFIG["top_p"],
            repetition_penalty=WRITER_MODEL_CONFIG["repetition_penalty"]
        )
    else:
        raise ValueError(f"Unknown purpose: {purpose}. Must be 'searcher' or 'writer'")


def get_groq_model(purpose: str) -> Groq:
    """
    Factory function to create Groq models with specific configurations.

    Args:
        purpose: Either 'searcher' or 'writer' to determine which configuration to use

    Returns:
        Configured Groq model instance
    """
    if purpose == 'searcher':
        return Groq(
            id=SEARCHER_MODEL_CONFIG["id"],
            api_key=GROQ_API_KEY,
            temperature=SEARCHER_MODEL_CONFIG["temperature"],
            top_p=SEARCHER_MODEL_CONFIG["top_p"]
        )
    elif purpose == 'writer':
        return Groq(
            id=WRITER_MODEL_CONFIG["id"],
            api_key=GROQ_API_KEY,
            temperature=WRITER_MODEL_CONFIG["temperature"],
            top_p=WRITER_MODEL_CONFIG["top_p"]
        )
    else:
        raise ValueError(f"Unknown purpose: {purpose}. Must be 'searcher' or 'writer'")
```
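A short usage sketch for these factory functions. The call pattern mirrors how `web_search.py` obtains its searcher and writer models; the Groq and Together variants follow the same shape:

```python
from config import get_hf_model, get_groq_model

# Build the two model roles used by the workflow agents.
searcher_model = get_hf_model("searcher")   # temperature 0.4, top_p 0.3
writer_model = get_hf_model("writer")       # temperature 0.2, top_p 0.2

# Any other purpose fails fast with a ValueError.
try:
    get_groq_model("reviewer")
except ValueError as err:
    print(err)
```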
file_handler.py
ADDED
@@ -0,0 +1,190 @@
```python
import os
import requests
from typing import Optional, List, Set
from urllib.parse import urlparse, unquote
from pathlib import Path
from datetime import datetime
from save_report import save_markdown_report
from phi.utils.log import logger


class FileHandler:
    """Handler for downloading and saving files discovered during web crawling."""

    SUPPORTED_EXTENSIONS = {
        'pdf': 'application/pdf',
        'xlsx': 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',
        'csv': 'text/csv'
    }

    # Common browser headers
    HEADERS = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.9',
        'Accept-Encoding': 'gzip, deflate, br',
        'DNT': '1',
        'Connection': 'keep-alive',
        'Upgrade-Insecure-Requests': '1'
    }

    def __init__(self):
        # Get the report directory for the current date
        self.report_dir, _ = save_markdown_report()
        self.downloaded_files: Set[str] = set()
        self.file_metadata: List[dict] = []
        self.failed_downloads: List[dict] = []  # Track failed downloads

        # Create a subdirectory for downloaded files
        self.downloads_dir = os.path.join(self.report_dir, 'downloads')
        os.makedirs(self.downloads_dir, exist_ok=True)

        # Create a metadata file to track downloaded files
        self.metadata_file = os.path.join(self.downloads_dir, 'files_metadata.md')

    def is_supported_file(self, url: str) -> bool:
        """Check if the URL points to a supported file type."""
        parsed_url = urlparse(url)
        extension = os.path.splitext(parsed_url.path)[1].lower().lstrip('.')
        return extension in self.SUPPORTED_EXTENSIONS

    def get_filename_from_url(self, url: str, content_type: Optional[str] = None) -> str:
        """Generate a safe filename from the URL."""
        # Get the filename from the URL
        parsed_url = urlparse(url)
        filename = os.path.basename(unquote(parsed_url.path))

        # If no filename in URL, create one based on content type
        if not filename:
            extension = next(
                (ext for ext, mime in self.SUPPORTED_EXTENSIONS.items()
                 if mime == content_type),
                'unknown'
            )
            filename = f"downloaded_file.{extension}"

        # Ensure filename is safe and unique
        safe_filename = "".join(c for c in filename if c.isalnum() or c in '._-')
        base, ext = os.path.splitext(safe_filename)

        # Add number suffix if file exists
        counter = 1
        while os.path.exists(os.path.join(self.downloads_dir, safe_filename)):
            safe_filename = f"{base}_{counter}{ext}"
            counter += 1

        return safe_filename

    def download_file(self, url: str, source_page: str = None) -> Optional[str]:
        """
        Download a file from the URL and save it to the downloads directory.
        Returns the path to the saved file if successful, None otherwise.
        """
        if url in self.downloaded_files:
            logger.info(f"File already downloaded: {url}")
            return None

        try:
            # Create a session to maintain headers across redirects
            session = requests.Session()
            session.headers.update(self.HEADERS)

            # First make a HEAD request to check content type and size
            head_response = session.head(url, timeout=10, allow_redirects=True)
            head_response.raise_for_status()

            content_type = head_response.headers.get('content-type', '').lower().split(';')[0]
            content_length = int(head_response.headers.get('content-length', 0))

            # Check if content type is supported and size is reasonable (less than 100MB)
            if not any(mime in content_type for mime in self.SUPPORTED_EXTENSIONS.values()):
                logger.warning(f"Unsupported content type: {content_type} for URL: {url}")
                return None

            if content_length > 100 * 1024 * 1024:  # 100MB limit
                logger.warning(f"File too large ({content_length} bytes) for URL: {url}")
                return None

            # Make the actual download request
            response = session.get(url, timeout=30, stream=True)
            response.raise_for_status()

            # Generate safe filename
            filename = self.get_filename_from_url(url, content_type)
            file_path = os.path.join(self.downloads_dir, filename)

            # Save the file
            with open(file_path, 'wb') as f:
                for chunk in response.iter_content(chunk_size=8192):
                    if chunk:
                        f.write(chunk)

            # Record metadata
            metadata = {
                'filename': filename,
                'source_url': url,
                'source_page': source_page,
                'content_type': content_type,
                'download_time': datetime.now().isoformat(),
                'file_size': os.path.getsize(file_path)
            }
            self.file_metadata.append(metadata)

            # Update metadata file
            self._update_metadata_file()

            self.downloaded_files.add(url)
            logger.info(f"Successfully downloaded: {url} to {file_path}")
            return file_path

        except requests.RequestException as e:
            error_info = {
                'url': url,
                'source_page': source_page,
                'error': str(e),
                'time': datetime.now().isoformat()
            }
            self.failed_downloads.append(error_info)
            self._update_metadata_file()  # Update metadata including failed downloads
            logger.error(f"Error downloading file from {url}: {str(e)}")
            return None
        except Exception as e:
            logger.error(f"Unexpected error while downloading {url}: {str(e)}")
            return None

    def _update_metadata_file(self):
        """Update the metadata markdown file with information about downloaded files."""
        try:
            with open(self.metadata_file, 'w', encoding='utf-8') as f:
                f.write("# Downloaded Files Metadata\n\n")

                # Successful downloads
                if self.file_metadata:
                    f.write("## Successfully Downloaded Files\n\n")
                    for metadata in self.file_metadata:
                        f.write(f"### {metadata['filename']}\n")
                        f.write(f"- Source URL: {metadata['source_url']}\n")
                        if metadata['source_page']:
                            f.write(f"- Found on page: {metadata['source_page']}\n")
                        f.write(f"- Content Type: {metadata['content_type']}\n")
                        f.write(f"- Download Time: {metadata['download_time']}\n")
                        f.write(f"- File Size: {metadata['file_size']} bytes\n\n")

                # Failed downloads
                if self.failed_downloads:
                    f.write("## Failed Downloads\n\n")
                    for failed in self.failed_downloads:
                        f.write(f"### {failed['url']}\n")
                        if failed['source_page']:
                            f.write(f"- Found on page: {failed['source_page']}\n")
                        f.write(f"- Error: {failed['error']}\n")
                        f.write(f"- Time: {failed['time']}\n\n")

        except Exception as e:
            logger.error(f"Error updating metadata file: {str(e)}")

    def get_downloaded_files(self) -> List[str]:
        """Return a list of all downloaded file paths."""
        return [os.path.join(self.downloads_dir, f)
                for f in os.listdir(self.downloads_dir)
                if os.path.isfile(os.path.join(self.downloads_dir, f))]
```
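A minimal usage sketch for `FileHandler`. The URL below is a placeholder; the handler creates its report and downloads directories when constructed, and only accepts the PDF/XLSX/CSV content types listed in `SUPPORTED_EXTENSIONS`:

```python
from file_handler import FileHandler

handler = FileHandler()

# Placeholder URL for illustration only.
url = "https://example.com/some-report.pdf"
if handler.is_supported_file(url):
    saved_path = handler.download_file(url, source_page="https://example.com/reports")
    print("Saved to:", saved_path)

# Paths of everything downloaded so far (also recorded in downloads/files_metadata.md).
print(handler.get_downloaded_files())
```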
requirements.txt
ADDED
@@ -0,0 +1,7 @@
```text
phidata
beautifulsoup4
requests
pydantic
duckduckgo-search
tenacity
streamlit
```
save_report.py
ADDED
@@ -0,0 +1,27 @@
```python
import os
from datetime import datetime
import shutil

def save_markdown_report():
    # Get current date in YYYY-MM-DD format
    current_date = datetime.now().strftime('%Y-%m-%d')

    # Create directory name
    report_dir = f"report_{current_date}"

    # Create full path
    base_path = os.path.dirname(os.path.abspath(__file__))
    report_path = os.path.join(base_path, report_dir)

    # Create directory if it doesn't exist
    os.makedirs(report_path, exist_ok=True)

    # Create markdown file path
    report_file = os.path.join(report_path, f"report_{current_date}.md")

    return report_path, report_file

if __name__ == "__main__":
    report_path, report_file = save_markdown_report()
    print(f"Report directory created at: {report_path}")
    print(f"Report file path: {report_file}")
```
search_utils.py
ADDED
@@ -0,0 +1,103 @@
```python
import time
import logging
import random
import threading
from typing import Optional, Dict, Any
from duckduckgo_search.exceptions import RatelimitException

logger = logging.getLogger(__name__)

class RateLimitedSearch:
    """Rate limited search implementation with exponential backoff."""

    def __init__(self):
        self.last_request_time = 0
        self.min_delay = 30   # Minimum delay between requests: 30 seconds
        self.max_delay = 300  # Maximum delay of 5 minutes
        self.jitter = 5       # Jitter range in seconds
        self.consecutive_failures = 0
        self.max_consecutive_failures = 5  # Max failures before giving up
        self._delay_lock = threading.Lock()  # Thread safety for delay bookkeeping

    def _add_jitter(self, delay: float) -> float:
        """Add randomized jitter to delay."""
        return delay + random.uniform(-self.jitter, self.jitter)

    def _wait_for_rate_limit(self):
        """Wait for rate limit with exponential backoff."""
        with self._delay_lock:
            current_time = time.time()
            elapsed = current_time - self.last_request_time

            # Calculate delay based on consecutive failures
            if self.consecutive_failures > 0:
                delay = min(
                    self.max_delay,
                    self.min_delay * (2 ** (self.consecutive_failures - 1))
                )
            else:
                delay = self.min_delay

            # Add jitter to prevent synchronized requests
            jitter = random.uniform(-self.jitter, self.jitter)
            delay = max(0, delay + jitter)

            # If not enough time has elapsed, wait
            if elapsed < delay:
                time.sleep(delay - elapsed)

            self.last_request_time = time.time()

    def execute_with_retry(self,
                           search_func: callable,
                           max_retries: int = 3,
                           **kwargs) -> Optional[Dict[str, Any]]:
        """Execute search with retries and exponential backoff."""

        for attempt in range(max_retries):
            try:
                # Enforce rate limiting
                self._wait_for_rate_limit()

                # Execute search
                result = search_func(**kwargs)

                # Reset consecutive failures on success
                self.consecutive_failures = 0
                return result

            except RatelimitException:
                self.consecutive_failures += 1

                # Calculate backoff time
                backoff = min(
                    self.max_delay,
                    self.min_delay * (2 ** attempt)
                )
                backoff = self._add_jitter(backoff)

                if attempt == max_retries - 1:
                    logger.error(f"Rate limit exceeded after {max_retries} retries")
                    raise

                logger.warning(f"Rate limit hit, attempt {attempt + 1}/{max_retries}. "
                               f"Waiting {backoff:.2f} seconds...")
                time.sleep(backoff)

                # If we've hit too many consecutive failures, raise an exception
                if self.consecutive_failures >= self.max_consecutive_failures:
                    logger.error("Too many consecutive rate limit failures")
                    raise RatelimitException("Persistent rate limiting detected")
                continue

            except Exception as e:
                logger.error(f"Search error on attempt {attempt + 1}: {str(e)}")
                if attempt == max_retries - 1:
                    raise

                backoff = self.min_delay * (2 ** attempt)
                backoff = self._add_jitter(backoff)
                logger.info(f"Retrying in {backoff:.2f} seconds...")
                time.sleep(backoff)

        return None
```
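A minimal sketch of wrapping a search call with this helper. The `DDGS().text(...)` call is an assumption about the `duckduckgo-search` client API and is not how the committed code performs searches; any callable that raises `RatelimitException` when throttled works the same way:

```python
from duckduckgo_search import DDGS  # assumed client API from the duckduckgo-search package
from search_utils import RateLimitedSearch

limiter = RateLimitedSearch()

def ddg_search(query: str, max_results: int = 10):
    # Assumed to return a list of result dicts (title/href/body).
    with DDGS() as ddgs:
        return list(ddgs.text(query, max_results=max_results))

# execute_with_retry enforces the 30-second minimum spacing and backs off on rate limits.
results = limiter.execute_with_retry(ddg_search, max_retries=3,
                                     query="offensive cyber industry Australia",
                                     max_results=5)
print(results[:2] if results else "no results")
```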
web_search.py
ADDED
@@ -0,0 +1,1109 @@
1 |
+
import json
|
2 |
+
import re
|
3 |
+
import time
|
4 |
+
import os
|
5 |
+
import concurrent.futures
|
6 |
+
from typing import Optional, Iterator, List, Set, Dict, Any
|
7 |
+
from urllib.parse import urlparse, urljoin
|
8 |
+
import requests
|
9 |
+
from bs4 import BeautifulSoup
|
10 |
+
from pydantic import BaseModel, Field
|
11 |
+
from datetime import datetime
|
12 |
+
|
13 |
+
# Phi imports
|
14 |
+
from phi.workflow import Workflow, RunResponse, RunEvent
|
15 |
+
from phi.storage.workflow.sqlite import SqlWorkflowStorage
|
16 |
+
from phi.agent import Agent
|
17 |
+
from phi.model.groq import Groq
|
18 |
+
from phi.tools.duckduckgo import DuckDuckGo
|
19 |
+
from phi.tools.googlesearch import GoogleSearch
|
20 |
+
from phi.utils.pprint import pprint_run_response
|
21 |
+
from phi.utils.log import logger
|
22 |
+
|
23 |
+
# Error handling imports
|
24 |
+
from duckduckgo_search.exceptions import RatelimitException
|
25 |
+
from tenacity import retry, wait_exponential, stop_after_attempt, retry_if_exception_type
|
26 |
+
from requests.exceptions import HTTPError
|
27 |
+
|
28 |
+
from config import GROQ_API_KEY, NVIDIA_API_KEY, SEARCHER_MODEL_CONFIG, WRITER_MODEL_CONFIG, get_hf_model
|
29 |
+
import configparser
|
30 |
+
|
31 |
+
DUCK_DUCK_GO_FIXED_MAX_RESULTS = 10
|
32 |
+
|
33 |
+
config = configparser.ConfigParser()
|
34 |
+
config.read('config.ini')
|
35 |
+
DEFAULT_TOPIC = config.get('DEFAULT', 'default_topic')
|
36 |
+
INITIAL_WEBSITES = config.get('DEFAULT', 'initial_websites')
|
37 |
+
|
38 |
+
# The topic to generate a blog post on
|
39 |
+
topic = DEFAULT_TOPIC
|
40 |
+
|
41 |
+
class NewsArticle(BaseModel):
|
42 |
+
"""Article data model containing title, URL and description."""
|
43 |
+
title: str = Field(..., description="Title of the article.")
|
44 |
+
url: str = Field(..., description="Link to the article.")
|
45 |
+
description: Optional[str] = Field(None, description="Summary of the article if available.")
|
46 |
+
|
47 |
+
|
48 |
+
class SearchResults(BaseModel):
|
49 |
+
"""Container for search results containing a list of articles."""
|
50 |
+
articles: List[NewsArticle]
|
51 |
+
|
52 |
+
|
53 |
+
class BlogPostGenerator(Workflow):
|
54 |
+
"""Workflow for generating blog posts based on web research."""
|
55 |
+
searcher: Agent = Field(...)
|
56 |
+
backup_searcher: Agent = Field(...)
|
57 |
+
writer: Agent = Field(...)
|
58 |
+
initial_websites: List[str] = Field(default_factory=lambda: INITIAL_WEBSITES)
|
59 |
+
file_handler: Optional[Any] = Field(None)
|
60 |
+
|
61 |
+
def __init__(
|
62 |
+
self,
|
63 |
+
session_id: str,
|
64 |
+
searcher: Agent,
|
65 |
+
backup_searcher: Agent,
|
66 |
+
writer: Agent,
|
67 |
+
file_handler: Optional[Any] = None,
|
68 |
+
storage: Optional[SqlWorkflowStorage] = None,
|
69 |
+
):
|
70 |
+
super().__init__(
|
71 |
+
session_id=session_id,
|
72 |
+
searcher=searcher,
|
73 |
+
backup_searcher=backup_searcher,
|
74 |
+
writer=writer,
|
75 |
+
storage=storage,
|
76 |
+
)
|
77 |
+
self.file_handler = file_handler
|
78 |
+
|
79 |
+
# Configure search instructions
|
80 |
+
search_instructions = [
|
81 |
+
"Given a topic, search for 20 articles and return the 15 most relevant articles.",
|
82 |
+
"For each article, provide:",
|
83 |
+
"- title: The article title",
|
84 |
+
"- url: The article URL",
|
85 |
+
"- description: A brief description or summary of the article",
|
86 |
+
"Return the results in a structured format with these exact field names."
|
87 |
+
]
|
88 |
+
|
89 |
+
# Primary searcher using DuckDuckGo
|
90 |
+
self.searcher = Agent(
|
91 |
+
model=get_hf_model('searcher'),
|
92 |
+
tools=[DuckDuckGo(fixed_max_results=DUCK_DUCK_GO_FIXED_MAX_RESULTS)],
|
93 |
+
instructions=search_instructions,
|
94 |
+
response_model=SearchResults
|
95 |
+
)
|
96 |
+
|
97 |
+
|
98 |
+
# Backup searcher using Google Search
|
99 |
+
self.backup_searcher = Agent(
|
100 |
+
model=get_hf_model('searcher'),
|
101 |
+
tools=[GoogleSearch()],
|
102 |
+
instructions=search_instructions,
|
103 |
+
response_model=SearchResults
|
104 |
+
)
|
105 |
+
|
106 |
+
|
107 |
+
# Writer agent configuration
|
108 |
+
writer_instructions = [
|
109 |
+
"You are a professional research analyst tasked with creating a comprehensive report on the given topic.",
|
110 |
+
"The sources provided include both general web search results and specialized intelligence/security websites.",
|
111 |
+
"Carefully analyze and cross-reference information from all sources to create a detailed report.",
|
112 |
+
"",
|
113 |
+
"Report Structure:",
|
114 |
+
"1. Executive Summary (2-3 paragraphs)",
|
115 |
+
" - Provide a clear, concise overview of the main findings",
|
116 |
+
" - Address the research question directly",
|
117 |
+
" - Highlight key discoveries and implications",
|
118 |
+
"",
|
119 |
+
"2. Detailed Analysis (Multiple sections)",
|
120 |
+
" - Break down the topic into relevant themes or aspects",
|
121 |
+
" - For each theme:",
|
122 |
+
" * Present detailed findings from multiple sources",
|
123 |
+
" * Cross-reference information between general and specialized sources",
|
124 |
+
" * Analyze trends, patterns, and developments",
|
125 |
+
" * Discuss implications and potential impacts",
|
126 |
+
"",
|
127 |
+
"3. Source Analysis and Credibility",
|
128 |
+
" For each major source:",
|
129 |
+
" - Evaluate source credibility and expertise",
|
130 |
+
" - Note if from specialized intelligence/security website",
|
131 |
+
" - Assess potential biases or limitations",
|
132 |
+
" - Key findings and unique contributions",
|
133 |
+
"",
|
134 |
+
"4. Key Takeaways and Strategic Implications",
|
135 |
+
" - Synthesize findings from all sources",
|
136 |
+
" - Compare/contrast general media vs specialized analysis",
|
137 |
+
" - Discuss broader geopolitical implications",
|
138 |
+
" - Address potential future developments",
|
139 |
+
"",
|
140 |
+
"5. References",
|
141 |
+
" - Group sources by type (specialized websites vs general media)",
|
142 |
+
" - List all sources with full citations",
|
143 |
+
" - Include URLs as clickable markdown links [Title](URL)",
|
144 |
+
" - Ensure every major claim has at least one linked source",
|
145 |
+
"",
|
146 |
+
"Important Guidelines:",
|
147 |
+
"- Prioritize information from specialized intelligence/security sources",
|
148 |
+
"- Cross-validate claims between multiple sources when possible",
|
149 |
+
"- Maintain a professional, analytical tone",
|
150 |
+
"- Support all claims with evidence",
|
151 |
+
"- Include specific examples and data points",
|
152 |
+
"- Use direct quotes for significant statements",
|
153 |
+
"- Address potential biases in reporting",
|
154 |
+
"- Ensure the report directly answers the research question",
|
155 |
+
"",
|
156 |
+
"Format the report with clear markdown headings (# ## ###), subheadings, and paragraphs.",
|
157 |
+
"Each major section should contain multiple paragraphs with detailed analysis."
|
158 |
+
]
|
159 |
+
|
160 |
+
self.writer = Agent(
|
161 |
+
model=get_hf_model('writer'),
|
162 |
+
instructions=writer_instructions,
|
163 |
+
structured_outputs=True
|
164 |
+
)
|
165 |
+
|
166 |
+
|
167 |
+
def _parse_search_response(self, response) -> Optional[SearchResults]:
|
168 |
+
"""Parse and validate search response into SearchResults model."""
|
169 |
+
try:
|
170 |
+
if isinstance(response, str):
|
171 |
+
# Clean up markdown code blocks and extract JSON
|
172 |
+
content = response.strip()
|
173 |
+
if '```' in content:
|
174 |
+
# Extract content between code block markers
|
175 |
+
match = re.search(r'```(?:json)?\n(.*?)\n```', content, re.DOTALL)
|
176 |
+
if match:
|
177 |
+
content = match.group(1).strip()
|
178 |
+
else:
|
179 |
+
# If no proper code block found, remove all ``` markers
|
180 |
+
content = re.sub(r'```(?:json)?\n?', '', content)
|
181 |
+
content = content.strip()
|
182 |
+
|
183 |
+
# Try to parse JSON response
|
184 |
+
try:
|
185 |
+
# Clean up any trailing commas before closing brackets/braces
|
186 |
+
content = re.sub(r',(\s*[}\]])', r'\1', content)
|
187 |
+
# Fix invalid escape sequences
|
188 |
+
content = re.sub(r'\\([^"\\\/bfnrtu])', r'\1', content) # Remove invalid escapes
|
189 |
+
content = content.replace('\t', ' ') # Replace tabs with spaces
|
190 |
+
# Handle any remaining unicode escapes
|
191 |
+
content = re.sub(r'\\u([0-9a-fA-F]{4})', lambda m: chr(int(m.group(1), 16)), content)
|
192 |
+
|
193 |
+
data = json.loads(content)
|
194 |
+
|
195 |
+
if isinstance(data, dict) and 'articles' in data:
|
196 |
+
articles = []
|
197 |
+
for article in data['articles']:
|
198 |
+
if isinstance(article, dict):
|
199 |
+
# Ensure all required fields are strings
|
200 |
+
article = {
|
201 |
+
'title': str(article.get('title', '')).strip(),
|
202 |
+
'url': str(article.get('url', '')).strip(),
|
203 |
+
'description': str(article.get('description', '')).strip()
|
204 |
+
}
|
205 |
+
if article['title'] and article['url']: # Only add if has required fields
|
206 |
+
articles.append(NewsArticle(**article))
|
207 |
+
|
208 |
+
if articles:
|
209 |
+
logger.info(f"Successfully parsed {len(articles)} articles from JSON")
|
210 |
+
return SearchResults(articles=articles)
|
211 |
+
|
212 |
+
except json.JSONDecodeError as e:
|
213 |
+
logger.warning(f"Failed to parse JSON response: {str(e)}, attempting to extract data manually")
|
214 |
+
|
215 |
+
# Fallback to regex extraction if JSON parsing fails
|
216 |
+
urls = re.findall(r'https?://[^\s<>"]+|www\.[^\s<>"]+', content)
|
217 |
+
titles = re.findall(r'"title":\s*"([^"]+)"', content)
|
218 |
+
descriptions = re.findall(r'"description":\s*"([^"]+)"', content)
|
219 |
+
|
220 |
+
if not urls: # Try alternative patterns
|
221 |
+
urls = re.findall(r'(?<=\()http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+(?=\))', content)
|
222 |
+
|
223 |
+
if urls:
|
224 |
+
articles = []
|
225 |
+
for i, url in enumerate(urls):
|
226 |
+
title = titles[i] if i < len(titles) else f"Article {i+1}"
|
227 |
+
description = descriptions[i] if i < len(descriptions) else ""
|
228 |
+
# Clean up extracted data
|
229 |
+
title = title.strip().replace('\\"', '"')
|
230 |
+
url = url.strip().replace('\\"', '"')
|
231 |
+
description = description.strip().replace('\\"', '"')
|
232 |
+
|
233 |
+
if url: # Only add if URL exists
|
234 |
+
articles.append(NewsArticle(
|
235 |
+
title=title,
|
236 |
+
url=url,
|
237 |
+
description=description
|
238 |
+
))
|
239 |
+
|
240 |
+
if articles:
|
241 |
+
logger.info(f"Successfully extracted {len(articles)} articles using regex")
|
242 |
+
return SearchResults(articles=articles)
|
243 |
+
|
244 |
+
logger.warning("No valid articles found in response")
|
245 |
+
return None
|
246 |
+
|
247 |
+
elif isinstance(response, dict):
|
248 |
+
# Handle dictionary response
|
249 |
+
if 'articles' in response:
|
250 |
+
articles = []
|
251 |
+
for article in response['articles']:
|
252 |
+
if isinstance(article, dict):
|
253 |
+
# Ensure all fields are strings
|
254 |
+
article = {
|
255 |
+
'title': str(article.get('title', '')).strip(),
|
256 |
+
'url': str(article.get('url', '')).strip(),
|
257 |
+
'description': str(article.get('description', '')).strip()
|
258 |
+
}
|
259 |
+
if article['title'] and article['url']:
|
260 |
+
articles.append(NewsArticle(**article))
|
261 |
+
elif isinstance(article, NewsArticle):
|
262 |
+
articles.append(article)
|
263 |
+
|
264 |
+
if articles:
|
265 |
+
logger.info(f"Successfully processed {len(articles)} articles from dict")
|
266 |
+
return SearchResults(articles=articles)
|
267 |
+
return None
|
268 |
+
|
269 |
+
elif isinstance(response, SearchResults):
|
270 |
+
# Already in correct format
|
271 |
+
return response
|
272 |
+
|
273 |
+
elif isinstance(response, RunResponse):
|
274 |
+
# Extract from RunResponse
|
275 |
+
if response.content:
|
276 |
+
return self._parse_search_response(response.content)
|
277 |
+
return None
|
278 |
+
|
279 |
+
logger.error(f"Unsupported response type: {type(response)}")
|
280 |
+
return None
|
281 |
+
|
282 |
+
except Exception as e:
|
283 |
+
logger.error(f"Error parsing search response: {str(e)}")
|
284 |
+
return None
|
285 |
+
|
286 |
+
def _search_with_retry(self, topic: str, use_backup: bool = False, max_retries: int = 3) -> Optional[SearchResults]:
|
287 |
+
"""Execute search with retries and rate limit handling."""
|
288 |
+
searcher = self.backup_searcher if use_backup else self.searcher
|
289 |
+
source = "backup" if use_backup else "primary"
|
290 |
+
|
291 |
+
# Initialize rate limit tracking
|
292 |
+
rate_limited_sources = set()
|
293 |
+
|
294 |
+
for attempt in range(max_retries):
|
295 |
+
try:
|
296 |
+
if source in rate_limited_sources:
|
297 |
+
logger.warning(f"{source} search is rate limited, switching to alternative method")
|
298 |
+
if not use_backup:
|
299 |
+
# Try backup search if primary is rate limited
|
300 |
+
backup_results = self._search_with_retry(topic, use_backup=True, max_retries=max_retries)
|
301 |
+
if backup_results:
|
302 |
+
return backup_results
|
303 |
+
# If both sources are rate limited, use longer backoff
|
304 |
+
backoff_time = min(3600, 60 * (2 ** attempt)) # Max 1 hour backoff
|
305 |
+
logger.info(f"All search methods rate limited. Waiting {backoff_time} seconds before retry...")
|
306 |
+
time.sleep(backoff_time)
|
307 |
+
|
308 |
+
logger.info(f"\nAttempting {source} search (attempt {attempt + 1}/{max_retries})...")
|
309 |
+
|
310 |
+
# Try different search prompts to improve results
|
311 |
+
search_prompts = [
|
312 |
+
f"""Search for detailed articles about: {topic}
|
313 |
+
Return only high-quality, relevant sources.
|
314 |
+
Format the results as a JSON object with an 'articles' array containing:
|
315 |
+
- title: The article title
|
316 |
+
- url: The article URL
|
317 |
+
- description: A brief description or summary
|
318 |
+
""",
|
319 |
+
f"""Find comprehensive articles and research papers about: {topic}
|
320 |
+
Focus on authoritative sources and recent publications.
|
321 |
+
Return results in JSON format with 'articles' array.
|
322 |
+
""",
|
323 |
+
f"""Locate detailed analysis and reports discussing: {topic}
|
324 |
+
Prioritize academic, industry, and news sources.
|
325 |
+
Return structured JSON with article details.
|
326 |
+
"""
|
327 |
+
]
|
328 |
+
|
329 |
+
# Try each prompt until we get results
|
330 |
+
for prompt in search_prompts:
|
331 |
+
try:
|
332 |
+
response = searcher.run(prompt, stream=False)
|
333 |
+
results = self._parse_search_response(response)
|
334 |
+
if results and results.articles:
|
335 |
+
logger.info(f"Found {len(results.articles)} articles from {source} search")
|
336 |
+
return results
|
337 |
+
except Exception as e:
|
338 |
+
if any(err in str(e).lower() for err in ["rate", "limit", "quota", "exhausted"]):
|
339 |
+
rate_limited_sources.add(source)
|
340 |
+
raise
|
341 |
+
logger.warning(f"Search prompt failed: {str(e)}")
|
342 |
+
continue
|
343 |
+
|
344 |
+
logger.warning(f"{source.title()} search returned no valid results")
|
345 |
+
|
346 |
+
except Exception as e:
|
347 |
+
error_msg = str(e).lower()
|
348 |
+
if any(err in error_msg for err in ["rate", "limit", "quota", "exhausted"]):
|
349 |
+
rate_limited_sources.add(source)
|
350 |
+
logger.error(f"{source} search rate limited: {str(e)}")
|
351 |
+
# Try alternative source immediately
|
352 |
+
if not use_backup:
|
353 |
+
backup_results = self._search_with_retry(topic, use_backup=True, max_retries=max_retries)
|
354 |
+
if backup_results:
|
355 |
+
return backup_results
|
356 |
+
else:
|
357 |
+
logger.error(f"Error during {source} search (attempt {attempt + 1}): {str(e)}")
|
358 |
+
|
359 |
+
if attempt < max_retries - 1:
|
360 |
+
backoff_time = 2 ** attempt
|
361 |
+
if source in rate_limited_sources:
|
362 |
+
backoff_time = min(3600, 60 * (2 ** attempt)) # Longer backoff for rate limits
|
363 |
+
logger.info(f"Waiting {backoff_time} seconds before retry...")
|
364 |
+
time.sleep(backoff_time)
|
365 |
+
|
366 |
+
return None
|
367 |
+
|
368 |
+
def _validate_content(self, content: str) -> bool:
|
369 |
+
"""Validate that the generated content is readable and properly formatted."""
|
370 |
+
if not content or len(content.strip()) < 100:
|
371 |
+
logger.warning("Content too short or empty")
|
372 |
+
return False
|
373 |
+
|
374 |
+
# Check for basic structure
|
375 |
+
if not any(marker in content for marker in ['#', '\n\n']):
|
376 |
+
logger.warning("Content lacks proper structure (headers or paragraphs)")
|
377 |
+
return False
|
378 |
+
|
379 |
+
# Check for reasonable paragraph lengths
|
380 |
+
paragraphs = [p.strip() for p in content.split('\n\n') if p.strip()]
|
381 |
+
if not paragraphs:
|
382 |
+
logger.warning("No valid paragraphs found")
|
383 |
+
return False
|
384 |
+
|
385 |
+
# Common words that are allowed to repeat frequently
|
386 |
+
common_words = {
|
387 |
+
'the', 'a', 'an', 'and', 'or', 'but', 'in', 'on', 'at', 'to', 'for', 'of', 'with', 'by',
|
388 |
+
'this', 'that', 'these', 'those', 'it', 'its', 'is', 'are', 'was', 'were', 'be', 'been',
|
389 |
+
'has', 'have', 'had', 'would', 'could', 'should', 'will', 'can'
|
390 |
+
}
|
391 |
+
|
392 |
+
# Track word frequencies across paragraphs
|
393 |
+
word_frequencies = {}
|
394 |
+
total_words = 0
|
395 |
+
|
396 |
+
# Validate each paragraph
|
397 |
+
for para in paragraphs:
|
398 |
+
# Skip headers and references
|
399 |
+
if para.startswith('#') or para.startswith('http'):
|
400 |
+
continue
|
401 |
+
|
402 |
+
# Calculate word statistics
|
403 |
+
words = para.split()
|
404 |
+
if len(words) < 3:
|
405 |
+
continue # Skip very short paragraphs
|
406 |
+
|
407 |
+
# Calculate word statistics
|
408 |
+
word_lengths = [len(word) for word in words]
|
409 |
+
avg_word_length = sum(word_lengths) / len(word_lengths)
|
410 |
+
|
411 |
+
# More nuanced word length validation
|
412 |
+
long_words = [w for w in words if len(w) > 15]
|
413 |
+
long_word_ratio = len(long_words) / len(words) if words else 0
|
414 |
+
|
415 |
+
# Allow higher average length if the text contains URLs or technical terms
|
416 |
+
contains_url = any(word.startswith(('http', 'www')) for word in words)
|
417 |
+
contains_technical = any(word.lower().endswith(('tion', 'ment', 'ology', 'ware', 'tech')) for word in words)
|
418 |
+
|
419 |
+
# Adjust thresholds based on content type
|
420 |
+
max_avg_length = 12 # Base maximum average word length
|
421 |
+
if contains_url:
|
422 |
+
max_avg_length = 20 # Allow longer average for content with URLs
|
423 |
+
elif contains_technical:
|
424 |
+
max_avg_length = 15 # Allow longer average for technical content
|
425 |
+
|
426 |
+
# Fail only if multiple indicators of problematic text
|
427 |
+
if (avg_word_length > max_avg_length and long_word_ratio > 0.3) or avg_word_length > 25:
|
428 |
+
logger.warning(f"Suspicious word lengths: avg={avg_word_length:.1f}, long_ratio={long_word_ratio:.1%}")
|
429 |
+
return False
|
430 |
+
|
431 |
+
# Check for excessive punctuation or special characters
|
432 |
+
special_char_ratio = len(re.findall(r'[^a-zA-Z0-9\s.,!?()"-]', para)) / len(para)
|
433 |
+
if special_char_ratio > 0.15: # Increased threshold slightly
|
434 |
+
logger.warning(f"Too many special characters: {special_char_ratio}")
|
435 |
+
return False
|
436 |
+
|
437 |
+
# Check for coherent sentence structure
|
438 |
+
sentences = [s.strip() for s in re.split(r'[.!?]+', para) if s.strip()]
|
439 |
+
weak_sentences = 0
|
440 |
+
for sentence in sentences:
|
441 |
+
words = sentence.split()
|
442 |
+
if len(words) < 3: # Skip very short sentences
|
443 |
+
continue
|
444 |
+
|
445 |
+
# More lenient grammar check
|
446 |
+
structure_indicators = [
|
447 |
+
any(word[0].isupper() for word in words), # Has some capitalization
|
448 |
+
any(word.lower() in common_words for word in words), # Has common words
|
449 |
+
len(words) >= 3, # Reasonable length
|
450 |
+
any(len(word) > 3 for word in words), # Has some non-trivial words
|
451 |
+
]
|
452 |
+
|
453 |
+
# Only fail if less than 2 indicators are present
|
454 |
+
if sum(structure_indicators) < 2:
|
455 |
+
logger.warning(f"Weak sentence structure: {sentence}")
|
456 |
+
weak_sentences += 1
|
457 |
+
if weak_sentences > len(sentences) / 2: # Fail if more than half are weak
|
458 |
+
logger.warning("Too many poorly structured sentences")
|
459 |
+
return False
|
460 |
+
|
461 |
+
# Update word frequencies
|
462 |
+
for word in words:
|
463 |
+
word = word.lower()
|
464 |
+
if word not in common_words and len(word) > 2: # Only track non-common words
|
465 |
+
word_frequencies[word] = word_frequencies.get(word, 0) + 1
|
466 |
+
total_words += 1
|
467 |
+
|
468 |
+
# Check for excessive repetition
|
469 |
+
if total_words > 0:
|
470 |
+
for word, count in word_frequencies.items():
|
471 |
+
# Calculate the frequency as a percentage
|
472 |
+
frequency = count / total_words
|
473 |
+
|
474 |
+
# Allow up to 10% frequency for any word
|
475 |
+
if frequency > 0.1 and count > 3:
|
476 |
+
logger.warning(f"Word '{word}' appears too frequently ({count} times, {frequency:.1%})")
|
477 |
+
return False
|
478 |
+
|
479 |
+
# Content seems valid
|
480 |
+
return True
|
481 |
+
|
482 |
+
    def _save_markdown(self, topic: str, content: str) -> str:
        """Save the content as an HTML file."""
        try:
            # Get or create report directory
            report_dir = None
            if hasattr(self, 'file_handler') and self.file_handler:
                report_dir = self.file_handler.report_dir
            else:
                # Create a default report directory if no file handler
                report_dir = os.path.join(os.path.dirname(__file__), f"report_{datetime.now().strftime('%Y-%m-%d')}")
                os.makedirs(report_dir, exist_ok=True)
                logger.info(f"Created report directory: {report_dir}")

            # Create filename from topic
            filename = re.sub(r'[^\w\s-]', '', topic.lower())  # Remove special chars
            filename = re.sub(r'[-\s]+', '-', filename)  # Replace spaces with hyphens
            filename = f"{filename}.html"
            file_path = os.path.join(report_dir, filename)

            # Convert markdown to HTML with styling
            html_content = f"""
            <!DOCTYPE html>
            <html lang="en">
            <head>
                <meta charset="UTF-8">
                <meta name="viewport" content="width=device-width, initial-scale=1.0">
                <title>{topic}</title>
                <style>
                    body {{
                        font-family: Arial, sans-serif;
                        line-height: 1.6;
                        color: #333;
                        max-width: 1200px;
                        margin: 0 auto;
                        padding: 20px;
                    }}
                    h1 {{
                        color: #2c3e50;
                        border-bottom: 2px solid #3498db;
                        padding-bottom: 10px;
                    }}
                    h2 {{
                        color: #34495e;
                        margin-top: 30px;
                    }}
                    h3 {{
                        color: #455a64;
                    }}
                    a {{
                        color: #3498db;
                        text-decoration: none;
                    }}
                    a:hover {{
                        text-decoration: underline;
                    }}
                    .executive-summary {{
                        background-color: #f8f9fa;
                        border-left: 4px solid #3498db;
                        padding: 20px;
                        margin: 20px 0;
                    }}
                    .analysis-section {{
                        margin: 30px 0;
                    }}
                    .source-section {{
                        background-color: #f8f9fa;
                        padding: 15px;
                        margin: 10px 0;
                        border-radius: 5px;
                    }}
                    .references {{
                        margin-top: 40px;
                        border-top: 2px solid #ecf0f1;
                        padding-top: 20px;
                    }}
                    .timestamp {{
                        color: #7f8c8d;
                        font-size: 0.9em;
                        margin-top: 40px;
                        text-align: right;
                    }}
                    blockquote {{
                        border-left: 3px solid #3498db;
                        margin: 20px 0;
                        padding-left: 20px;
                        color: #555;
                    }}
                    code {{
                        background-color: #f7f9fa;
                        padding: 2px 5px;
                        border-radius: 3px;
                        font-family: monospace;
                    }}
                </style>
            </head>
            <body>
                <div class="content">
                    {self._markdown_to_html(content)}
                </div>
                <div class="timestamp">
                    Generated on: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}
                </div>
            </body>
            </html>
            """

            # Write the HTML file
            with open(file_path, 'w', encoding='utf-8') as f:
                f.write(html_content)

            logger.info(f"Successfully saved HTML report: {file_path}")
            return file_path

        except Exception as e:
            logger.error(f"Failed to save HTML file: {str(e)}")
            return None

    def _markdown_to_html(self, markdown_content: str) -> str:
        """Convert markdown content to HTML with basic formatting."""
        # Headers
        html = markdown_content
        html = re.sub(r'^# (.*?)$', r'<h1>\1</h1>', html, flags=re.MULTILINE)
        html = re.sub(r'^## (.*?)$', r'<h2>\1</h2>', html, flags=re.MULTILINE)
        html = re.sub(r'^### (.*?)$', r'<h3>\1</h3>', html, flags=re.MULTILINE)

        # Lists
        html = re.sub(r'^\* (.*?)$', r'<li>\1</li>', html, flags=re.MULTILINE)
        html = re.sub(r'(<li>.*?</li>\n)+', r'<ul>\n\g<0></ul>', html, flags=re.DOTALL)

        # Links
        html = re.sub(r'\[(.*?)\]\((.*?)\)', r'<a href="\2">\1</a>', html)

        # Emphasis
        html = re.sub(r'\*\*(.*?)\*\*', r'<strong>\1</strong>', html)
        html = re.sub(r'\*(.*?)\*', r'<em>\1</em>', html)

        # Paragraphs
        html = re.sub(r'\n\n(.*?)\n\n', r'\n<p>\1</p>\n', html, flags=re.DOTALL)

        # Blockquotes
        html = re.sub(r'^\> (.*?)$', r'<blockquote>\1</blockquote>', html, flags=re.MULTILINE)

        # Code blocks
        html = re.sub(r'```(.*?)```', r'<pre><code>\1</code></pre>', html, flags=re.DOTALL)
        html = re.sub(r'`(.*?)`', r'<code>\1</code>', html)

        return html

    def run(self, topic: str, use_cache: bool = True) -> Iterator[RunResponse]:
        """Run the blog post generation workflow."""
        logger.info(f"Starting blog post generation for topic: {topic}")

        # Extract keywords from topic
        keywords = topic.lower().split()
        keywords = [w for w in keywords if len(w) > 3 and w not in {'what', 'where', 'when', 'how', 'why', 'is', 'are', 'was', 'were', 'will', 'would', 'could', 'should', 'the', 'and', 'but', 'or', 'for', 'with'}]

        all_articles = []
        existing_urls = set()

        # First, try web search
        logger.info("Starting web search...")
        search_results = self._search_with_retry(topic)
        if search_results and search_results.articles:
            for article in search_results.articles:
                if article.url not in existing_urls:
                    all_articles.append(article)
                    existing_urls.add(article.url)
            logger.info(f"Found {len(search_results.articles)} articles from web search")

        # Then, crawl initial websites
        logger.info("Starting website crawl...")
        from file_handler import FileHandler
        crawler = WebsiteCrawler(max_pages_per_site=10)
        crawler.file_handler = FileHandler()  # Initialize file handler

        # Get the report directory from the file handler
        report_dir = crawler.file_handler.report_dir

        crawled_results = crawler.crawl_all_websites(self.initial_websites, keywords)

        # Save the relevance log to the report directory
        crawler.save_relevance_log(report_dir)

        if crawled_results:
            for result in crawled_results:
                if result['url'] not in existing_urls:
                    article = NewsArticle(**result)
                    all_articles.append(article)
                    existing_urls.add(result['url'])
            logger.info(f"Found {len(crawled_results)} articles from website crawl")

        # If we still need more results, try backup search
        if len(all_articles) < 10:
            logger.info("Supplementing with backup search...")
            backup_results = self._search_with_retry(topic, use_backup=True)
            if backup_results and backup_results.articles:
                for article in backup_results.articles:
                    if article.url not in existing_urls:
                        all_articles.append(article)
                        existing_urls.add(article.url)
                logger.info(f"Found {len(backup_results.articles)} articles from backup search")

        # Create final search results
        search_results = SearchResults(articles=all_articles)

        if len(search_results.articles) < 5:  # Reduced minimum requirement
            error_msg = f"Failed to gather sufficient sources. Only found {len(search_results.articles)} valid sources."
            logger.error(error_msg)
            yield RunResponse(
                event=RunEvent.run_completed,
                message=error_msg
            )
            return

        logger.info(f"Successfully gathered {len(search_results.articles)} unique sources for analysis")

        # Writing phase
        print("\nGenerating report from search results...")
        writer_response = self.writer.run(
            f"""Generate a comprehensive research report on: {topic}
            Use the following articles as sources:
            {json.dumps([{'title': a.title, 'url': a.url, 'description': a.description} for a in search_results.articles], indent=2)}

            Format the output in markdown with:
            1. Clear section headers using #, ##, ###
            2. Proper paragraph spacing
            3. Bullet points where appropriate
            4. Links to sources
            5. A references section at the end

            Focus on readability and proper markdown formatting.""",
            stream=False
        )

        if isinstance(writer_response, RunResponse):
            content = writer_response.content
        else:
            content = writer_response

        # Validate content
        if not self._validate_content(content):
            print("\nFirst attempt produced invalid content, trying again...")
            # Try one more time with a more structured prompt
            writer_response = self.writer.run(
                f"""Generate a clear, well-structured research report on: {topic}
                Format the output in proper markdown with:
                1. A main title using #
                2. Section headers using ##
                3. Subsection headers using ###
                4. Well-formatted paragraphs
                5. Bullet points for lists
                6. A references section at the end

                Source articles:
                {json.dumps([{'title': a.title, 'url': a.url} for a in search_results.articles], indent=2)}""",
                stream=False
            )

            if isinstance(writer_response, RunResponse):
                content = writer_response.content
            else:
                content = writer_response

            if not self._validate_content(content):
                yield RunResponse(
                    event=RunEvent.run_completed,
                    message="Failed to generate readable content. Please try again."
                )
                return

        # Save as HTML
        html_file = self._save_markdown(topic, content)

        if not html_file:
            yield RunResponse(
                event=RunEvent.run_completed,
                message="Failed to save HTML file. Please try again."
            )
            return

        # Print the report to console and yield response
        print("\n=== Generated Report ===\n")
        print(content)
        print("\n=====================\n")

        yield RunResponse(
            event=RunEvent.run_completed,
            message=f"Report generated successfully. HTML saved as: {html_file}",
            content=content
        )

        return

class WebsiteCrawler:
    """Crawler to extract relevant information from specified websites."""

    def __init__(self, max_pages_per_site: int = 10):
        self.max_pages_per_site = max_pages_per_site
        self.visited_urls: Set[str] = set()
        self.results: Dict[str, List[dict]] = {}
        self.file_handler = None

        # Set up logging
        self.relevance_log = []  # Store relevance decisions

    def _check_relevance(self, text: str, keywords: List[str]) -> tuple[bool, dict]:
        """
        Check if the page content is relevant based on keywords.
        Returns a tuple of (is_relevant, relevance_info).
        """
        text_lower = text.lower()
        keyword_matches = {}

        # Check each keyword and count occurrences
        for keyword in keywords:
            keyword_lower = keyword.lower()
            count = text_lower.count(keyword_lower)
            keyword_matches[keyword] = count

        # Page is relevant if any keyword is found
        is_relevant = any(count > 0 for count in keyword_matches.values())

        # Prepare relevance information
        relevance_info = {
            'is_relevant': is_relevant,
            'keyword_matches': keyword_matches,
            'total_matches': sum(keyword_matches.values()),
            'matching_keywords': [k for k, v in keyword_matches.items() if v > 0],
            'text_length': len(text)
        }

        return is_relevant, relevance_info

    def crawl_page(self, url: str, keywords: List[str]) -> List[dict]:
        """Crawl a single page and extract relevant information."""
        try:
            # Skip if already visited
            if url in self.visited_urls:
                logger.debug(f"Skipping already visited URL: {url}")
                return []

            self.visited_urls.add(url)
            logger.info(f"Crawling page: {url}")

            # Fetch and parse the page
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            soup = BeautifulSoup(response.text, 'html.parser')

            # Get page title
            title = soup.title.string if soup.title else url

            # Extract text content
            text = ' '.join([p.get_text() for p in soup.find_all(['p', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6'])])

            # Check relevance and get detailed information
            is_relevant, relevance_info = self._check_relevance(text, keywords)

            # Log relevance decision
            log_entry = {
                'url': url,
                'title': title,
                'timestamp': datetime.now().isoformat(),
                'relevance_info': relevance_info
            }
            self.relevance_log.append(log_entry)

            # Log the decision with details
            if is_relevant:
                logger.info(
                    f"Page is RELEVANT: {url}\n"
                    f"- Title: {title}\n"
                    f"- Matching keywords: {relevance_info['matching_keywords']}\n"
                    f"- Total matches: {relevance_info['total_matches']}"
                )
            else:
                logger.info(
                    f"Page is NOT RELEVANT: {url}\n"
                    f"- Title: {title}\n"
                    f"- Checked keywords: {keywords}\n"
                    f"- No keyword matches found in {relevance_info['text_length']} characters of text"
                )

            results = []
            if is_relevant:
                # Extract links for further crawling
                links = []
                for link in soup.find_all('a', href=True):
                    href = link['href']
                    absolute_url = urljoin(url, href)
                    if self.is_valid_url(absolute_url):
                        links.append(absolute_url)

                # If page is relevant, process and download any supported files
                if self.file_handler:
                    for link in soup.find_all('a', href=True):
                        href = link['href']
                        absolute_url = urljoin(url, href)
                        if self.file_handler.is_supported_file(absolute_url):
                            downloaded_path = self.file_handler.download_file(absolute_url, source_page=url)
                            if downloaded_path:
                                logger.info(f"Downloaded file from relevant page: {absolute_url} to {downloaded_path}")

                # Store the relevant page information
                results.append({
                    'url': url,
                    'text': text,
                    'title': title,
                    'links': links,
                    'relevance_info': relevance_info
                })

            return results

        except Exception as e:
            logger.error(f"Error crawling {url}: {str(e)}")
            return []

    def save_relevance_log(self, output_dir: str):
        """Save the relevance log to a markdown file."""
        try:
            log_file = os.path.join(output_dir, 'crawl_relevance_log.md')
            with open(log_file, 'w', encoding='utf-8') as f:
                f.write("# Web Crawling Relevance Log\n\n")

                # Summary statistics
                total_pages = len(self.relevance_log)
                relevant_pages = sum(1 for entry in self.relevance_log if entry['relevance_info']['is_relevant'])

                f.write(f"## Summary\n")
                f.write(f"- Total pages crawled: {total_pages}\n")
                f.write(f"- Relevant pages found: {relevant_pages}\n")
                f.write(f"- Non-relevant pages: {total_pages - relevant_pages}\n\n")

                # Relevant pages
                f.write("## Relevant Pages\n\n")
                for entry in self.relevance_log:
                    if entry['relevance_info']['is_relevant']:
                        f.write(f"### {entry['title']}\n")
                        f.write(f"- URL: {entry['url']}\n")
                        f.write(f"- Matching keywords: {entry['relevance_info']['matching_keywords']}\n")
                        f.write(f"- Total matches: {entry['relevance_info']['total_matches']}\n")
                        f.write(f"- Crawled at: {entry['timestamp']}\n\n")

                # Non-relevant pages
                f.write("## Non-Relevant Pages\n\n")
                for entry in self.relevance_log:
                    if not entry['relevance_info']['is_relevant']:
                        f.write(f"### {entry['title']}\n")
                        f.write(f"- URL: {entry['url']}\n")
                        f.write(f"- Text length: {entry['relevance_info']['text_length']} characters\n")
                        f.write(f"- Crawled at: {entry['timestamp']}\n\n")

        except Exception as e:
            logger.error(f"Error saving relevance log: {str(e)}")

    def is_valid_url(self, url: str) -> bool:
        """Check if URL is valid and belongs to allowed domains."""
        try:
            parsed = urlparse(url)
            return bool(parsed.netloc and parsed.scheme in {'http', 'https'})
        except Exception:
            return False

    def extract_text_and_links(self, url: str, soup: BeautifulSoup):
        """Extract outbound links (as absolute URLs) from a parsed page."""
        links = []
        for link in soup.find_all('a', href=True):
            href = link['href']
            absolute_url = urljoin(url, href)
            links.append(absolute_url)
        return links

    def crawl_website(self, base_url: str, keywords: List[str]) -> List[dict]:
        """Crawl a website starting from the base URL."""
        to_visit = {base_url}
        results = []
        visited_count = 0

        while to_visit and visited_count < self.max_pages_per_site:
            url = to_visit.pop()
            page_results = self.crawl_page(url, keywords)
            results.extend(page_results)

            # Reuse the links already collected by crawl_page instead of fetching
            # and parsing the page a second time. crawl_page only records links
            # for relevant pages, so the crawl stays focused on matching content.
            links = [link for page in page_results for link in page.get('links', [])]

            # Add new same-domain links to visit
            domain = urlparse(base_url).netloc
            new_links = {link for link in links
                         if urlparse(link).netloc == domain
                         and link not in self.visited_urls}
            to_visit.update(new_links)
            visited_count += 1

        return results

    def crawl_all_websites(self, websites: List[str], keywords: List[str]) -> List[dict]:
        """Crawl multiple websites in parallel."""
        all_results = []

        if isinstance(websites, str):
            # Remove the brackets and split by comma
            websites = websites.strip('[]').replace('"', '').replace(" ", "").split(',')
            # Clean up any whitespace
            websites = [url.strip("'") for url in websites]

        with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
            future_to_url = {
                executor.submit(self.crawl_website, url, keywords): url
                for url in websites
            }

            for future in concurrent.futures.as_completed(future_to_url):
                url = future_to_url[future]
                try:
                    results = future.result()
                    all_results.extend(results)
                    logger.info(f"Completed crawling {url}, found {len(results)} relevant pages")
                except Exception as e:
                    logger.error(f"Failed to crawl {url}: {str(e)}")

        return all_results

# Create the workflow
searcher = Agent(
    model=get_hf_model('searcher'),
    tools=[DuckDuckGo(fixed_max_results=DUCK_DUCK_GO_FIXED_MAX_RESULTS)],
    instructions=[
        "Given a topic, search for 20 articles and return the 15 most relevant articles.",
        "For each article, provide:",
        "- title: The article title",
        "- url: The article URL",
        "- description: A brief description or summary",
        "Return the results in a structured format with these exact field names."
    ],
    response_model=SearchResults,
    structured_outputs=True
)

backup_searcher = Agent(
    model=get_hf_model('searcher'),
    tools=[GoogleSearch()],
    instructions=[
        "Given a topic, search for 20 articles and return the 15 most relevant articles.",
        "For each article, provide:",
        "- title: The article title",
        "- url: The article URL",
        "- description: A brief description or summary",
        "Return the results in a structured format with these exact field names."
    ],
    response_model=SearchResults,
    structured_outputs=True
)

writer = Agent(
    model=get_hf_model('writer'),
    instructions=[
        "You are a professional research analyst tasked with creating a comprehensive report on the given topic.",
        "The sources provided include both general web search results and specialized intelligence/security websites.",
        "Carefully analyze and cross-reference information from all sources to create a detailed report.",
        "",
        "Report Structure:",
        "1. Executive Summary (2-3 paragraphs)",
        "   - Provide a clear, concise overview of the main findings",
        "   - Address the research question directly",
        "   - Highlight key discoveries and implications",
        "",
        "2. Detailed Analysis (Multiple sections)",
        "   - Break down the topic into relevant themes or aspects",
        "   - For each theme:",
        "     * Present detailed findings from multiple sources",
        "     * Cross-reference information between general and specialized sources",
        "     * Analyze trends, patterns, and developments",
        "     * Discuss implications and potential impacts",
        "",
        "3. Source Analysis and Credibility",
        "   For each major source:",
        "   - Evaluate source credibility and expertise",
        "   - Note if from specialized intelligence/security website",
        "   - Assess potential biases or limitations",
        "   - Key findings and unique contributions",
        "",
        "4. Key Takeaways and Strategic Implications",
        "   - Synthesize findings from all sources",
        "   - Compare/contrast general media vs specialized analysis",
        "   - Discuss broader geopolitical implications",
        "   - Address potential future developments",
        "",
        "5. References",
        "   - Group sources by type (specialized websites vs general media)",
        "   - List all sources with full citations",
        "   - Include URLs as clickable markdown links [Title](URL)",
        "   - Ensure every major claim has at least one linked source",
        "",
        "Important Guidelines:",
        "- Prioritize information from specialized intelligence/security sources",
        "- Cross-validate claims between multiple sources when possible",
        "- Maintain a professional, analytical tone",
        "- Support all claims with evidence",
        "- Include specific examples and data points",
        "- Use direct quotes for significant statements",
        "- Address potential biases in reporting",
        "- Ensure the report directly answers the research question",
        "",
        "Format the report with clear markdown headings (# ## ###), subheadings, and paragraphs.",
        "Each major section should contain multiple paragraphs with detailed analysis."
    ],
    structured_outputs=True
)

generate_blog_post = BlogPostGenerator(
    session_id=f"generate-blog-post-on-{topic}",
    searcher=searcher,
    backup_searcher=backup_searcher,
    writer=writer,
    file_handler=None,  # Initialize with None
    storage=SqlWorkflowStorage(
        table_name="generate_blog_post_workflows",
        db_file="tmp/workflows.db",
    ),
)

# Run workflow
blog_post: Iterator[RunResponse] = generate_blog_post.run(topic=topic, use_cache=False)

# Print the response
pprint_run_response(blog_post, markdown=True)