tamirgz committed on
Commit
1be3350
·
1 Parent(s): 6aa0021
Files changed (9)
  1. README.md +156 -12
  2. app.py +45 -0
  3. config.ini +4 -0
  4. config.py +130 -0
  5. file_handler.py +190 -0
  6. requirements.txt +7 -0
  7. save_report.py +27 -0
  8. search_utils.py +103 -0
  9. web_search.py +1109 -0
README.md CHANGED
@@ -1,12 +1,156 @@
1
- ---
2
- title: Phidata
3
- emoji: 🌍
4
- colorFrom: indigo
5
- colorTo: green
6
- sdk: streamlit
7
- sdk_version: 1.41.1
8
- app_file: app.py
9
- pinned: false
10
- ---
11
-
12
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
1
+ # Web Research and Report Generation System
2
+
3
+ An advanced AI-powered system for automated web research and report generation. This system uses AI agents to search, analyze, and compile comprehensive reports on any given topic.
4
+
5
+ ## Features
6
+
7
+ - **Intelligent Web Search**
8
+ - Multi-source search using DuckDuckGo and Google
9
+ - Smart retry mechanism with rate limit handling
10
+ - Configurable search depth and result limits
11
+ - Domain filtering for trusted sources
12
+
13
+ - **Advanced Report Generation**
14
+ - Beautiful HTML reports with modern styling
15
+ - Automatic keyword extraction
16
+ - Source validation and relevance scoring
17
+ - Comprehensive logging of research process
18
+
19
+ - **Smart Caching**
20
+ - Caches search results for faster repeat queries
21
+ - Configurable cache directory
22
+ - Cache invalidation management
23
+
24
+ - **Error Handling**
25
+ - Graceful fallback between search engines
26
+ - Rate limit detection and backoff (see the sketch after this list)
27
+ - Detailed error logging
28
+ - Automatic retry mechanisms
29
+
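+ The retry and backoff behaviour lives in `search_utils.RateLimitedSearch`. A minimal usage sketch (the `my_search` callable is a hypothetical stand-in for any search function):
+
+ ```python
+ from search_utils import RateLimitedSearch
+
+ def my_search(query: str) -> dict:
+     # Stand-in for a real search call; returns a dict of results
+     return {"query": query, "results": []}
+
+ limiter = RateLimitedSearch()
+ # Retries on DuckDuckGo rate-limit errors with exponential backoff and jitter
+ results = limiter.execute_with_retry(my_search, max_retries=3, query="example topic")
+ ```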
30
+ ## Installation
31
+
32
+ 1. Clone the repository:
33
+ ```bash
34
+ git clone [repository-url]
35
+ cd phidata_analyst
36
+ ```
37
+
38
+ 2. Install dependencies:
39
+ ```bash
40
+ pip install -r requirements.txt
41
+ ```
42
+
43
+ 3. Set up your API keys:
44
+ - Create a `.env` file in the project root
45
+ - Add your API keys:
46
+ ```
47
+ NVIDIA_API_KEY=your-nvidia-api-key
48
+ GOOGLE_API_KEY=your-google-api-key
49
+ ```
50
+
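+ Before running the workflow, a quick sanity check that the keys are visible to the Python process (a minimal sketch; it assumes the variables are exported in your shell or loaded from `.env` by your environment):
+
+ ```python
+ import os
+
+ # Fail early if a required key is missing
+ assert os.getenv("NVIDIA_API_KEY"), "NVIDIA_API_KEY is not set"
+ assert os.getenv("GOOGLE_API_KEY"), "GOOGLE_API_KEY is not set"
+ ```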
51
+ ## Usage
52
+
53
+ ### Basic Usage
54
+
55
+ ```python
56
+ from web_search import create_blog_post_workflow
57
+
58
+ # Create a workflow instance
59
+ workflow = create_blog_post_workflow()
60
+
61
+ # Generate a report
62
+ for response in workflow.run("Your research topic"):
63
+ print(response.message)
64
+ ```
65
+
66
+ ### Advanced Usage
67
+
68
+ ```python
69
+ from web_search import BlogPostGenerator, SqlWorkflowStorage
70
+ from phi.llm import Nvidia
71
+ from phi.tools import DuckDuckGo, GoogleSearch
72
+
73
+ # Configure custom agents
74
+ searcher = Agent(
75
+ model=Nvidia(
76
+ id="meta/llama-3.2-3b-instruct",
77
+ temperature=0.3,
78
+ top_p=0.1
79
+ ),
80
+ tools=[DuckDuckGo(fixed_max_results=10)]
81
+ )
82
+
83
+ # Initialize with custom configuration
84
+ generator = BlogPostGenerator(
85
+ searcher=searcher,
86
+ storage=SqlWorkflowStorage(
87
+ table_name="custom_workflows",
88
+ db_file="path/to/db.sqlite"
89
+ )
90
+ )
91
+
92
+ # Run with caching enabled
93
+ for response in generator.run("topic", use_cache=True):
94
+ print(response.message)
95
+ ```
96
+
97
+ ## Output
98
+
99
+ The system generates:
100
+ 1. Professional HTML reports with:
101
+ - Executive summary
102
+ - Detailed analysis
103
+ - Source citations
104
+ - Generation timestamp
105
+ 2. Detailed logs of:
106
+ - Search process
107
+ - Keyword extraction
108
+ - Source relevance
109
+ - Download attempts
110
+
111
+ Reports are saved in:
+ - Default: `./report_YYYY-MM-DD/` (created by `save_report.save_markdown_report()`; see the sketch below)
+ - Custom: Configurable via `file_handler`
114
+
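+ The date-stamped directory comes from `save_report.save_markdown_report()`, which creates the folder and returns its path together with a markdown file path:
+
+ ```python
+ from save_report import save_markdown_report
+
+ report_dir, report_file = save_markdown_report()
+ print(report_dir)   # e.g. .../report_2025-01-01
+ print(report_file)  # e.g. .../report_2025-01-01/report_2025-01-01.md
+ ```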
115
+ ## Configuration
116
+
117
+ Key configuration options:
118
+
119
+ ```python
120
+ DUCK_DUCK_GO_FIXED_MAX_RESULTS = 10 # Max results from DuckDuckGo
121
+ DEFAULT_TEMPERATURE = 0.3 # Model temperature
122
+ TOP_P = 0.1 # Top-p sampling parameter
123
+ ```
124
+
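+ These constants feed the search agent construction in `web_search.py`; a condensed sketch of that setup:
+
+ ```python
+ from phi.agent import Agent
+ from phi.tools.duckduckgo import DuckDuckGo
+ from config import get_hf_model
+
+ DUCK_DUCK_GO_FIXED_MAX_RESULTS = 10  # max results per DuckDuckGo query
+
+ searcher = Agent(
+     model=get_hf_model('searcher'),  # temperature/top_p come from SEARCHER_MODEL_CONFIG
+     tools=[DuckDuckGo(fixed_max_results=DUCK_DUCK_GO_FIXED_MAX_RESULTS)],
+ )
+ ```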
125
+ Trusted domains can be configured in `BlogPostGenerator.trusted_domains`.
126
+
127
+ ## Logging
128
+
129
+ The system uses `phi.utils.log` for comprehensive logging (see the example below):
130
+ - Search progress and results
131
+ - Keyword extraction details
132
+ - File downloads and failures
133
+ - Report generation status
134
+
135
+ Logs are color-coded for easy monitoring:
136
+ - INFO: Normal operations
137
+ - WARNING: Non-critical issues
138
+ - ERROR: Critical failures
139
+
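+ The same logger can be reused from any custom module:
+
+ ```python
+ from phi.utils.log import logger
+
+ logger.info("Search progress update")             # normal operations
+ logger.warning("Falling back to backup search")   # non-critical issue
+ logger.error("Report generation failed")          # critical failure
+ ```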
140
+ ## Contributing
141
+
142
+ 1. Fork the repository
143
+ 2. Create your feature branch (`git checkout -b feature/amazing-feature`)
144
+ 3. Commit your changes (`git commit -m 'Add amazing feature'`)
145
+ 4. Push to the branch (`git push origin feature/amazing-feature`)
146
+ 5. Open a Pull Request
147
+
148
+ ## License
149
+
150
+ This project is licensed under the MIT License - see the LICENSE file for details.
151
+
152
+ ## Acknowledgments
153
+
154
+ - Built with [Phi](https://github.com/phidatahq/phidata)
155
+ - Uses NVIDIA AI models
156
+ - Search powered by DuckDuckGo and Google
app.py ADDED
@@ -0,0 +1,45 @@
1
+ import streamlit as st
2
+ import subprocess
3
+ import configparser
4
+
5
+
6
+ config = configparser.ConfigParser()
7
+
8
+ # Streamlit page for user inputs
9
+ def user_input_page():
10
+ st.title("Research Topic and Websites Input")
11
+
12
+ # Input for research topic
13
+ topic = st.text_input("Enter the research topic:")
14
+
15
+ # Input for list of websites
16
+ websites = st.text_area("Enter the list of websites (one per line):")
17
+ websites = websites.splitlines()
18
+
19
+ config['DEFAULT'] = {'DEFAULT_TOPIC': "\"{0}\"".format(topic),
20
+ 'INITIAL_WEBSITES': websites}
21
+
22
+ with open('config.ini', 'w') as configfile:
23
+ config.write(configfile)
24
+
25
+ # Button to load and run web_search.py
26
+ if st.button("Execute Web Research"):
27
+ # Execute web_search.py and stream output
28
+ process = subprocess.run(["python3", "web_search.py"], stderr=subprocess.PIPE, text=True)
29
+ error_message = process.stderr
30
+
31
+ # Stream the output in real-time
32
+ # for line in process.stdout:
33
+ # st.write(line) # Display each line of output as it is produced
34
+
35
+ # Wait for the process to complete
36
+ # process.wait()
37
+
38
+ # Report the outcome based on the exit code
+ if process.returncode != 0:
+ st.error(f"Error occurred: {error_message}")
+ else:
+ st.success("Web search executed successfully!")
43
+
44
+ # Call the user input page function
45
+ user_input_page()
config.ini ADDED
@@ -0,0 +1,4 @@
1
+ [DEFAULT]
2
+ default_topic = "Is there a process of establishment of Israeli Military or Offensive Cyber Industry in Australia?"
3
+ initial_websites = ['https://www.bellingcat.com', 'https://worldview.stratfor.com', 'https://thesoufancenter.org', 'https://www.globalsecurity.org', 'https://www.defenseone.com']
4
+
config.py ADDED
@@ -0,0 +1,130 @@
1
+ """Configuration settings for the web search and report generation system."""
2
+
3
+ import os
+
+ from phi.model.groq import Groq
+ from phi.model.together import Together
+ from phi.model.huggingface import HuggingFaceChat
+
+ # API keys are read from environment variables (set in the shell or via a .env file)
+ GROQ_API_KEY = os.getenv("GROQ_API_KEY")
+ TOGETHER_API_KEY = os.getenv("TOGETHER_API_KEY")
+ NVIDIA_API_KEY = os.getenv("NVIDIA_API_KEY")
6
+
7
+ # DEFAULT_TOPIC = "Is there a process of establishment of Israeli Military or Offensive Cyber Industry in Australia?"
8
+
9
+ # # Initial websites for crawling
10
+ # INITIAL_WEBSITES = [
11
+ # "https://www.bellingcat.com/",
12
+ # "https://worldview.stratfor.com/",
13
+ # "https://thesoufancenter.org/",
14
+ # "https://www.globalsecurity.org/",
15
+ # "https://www.defenseone.com/"
16
+ # ]
17
+
18
+ # Model configuration
19
+ SEARCHER_MODEL_CONFIG = {
20
+ "id": "Trelis/Meta-Llama-3-70B-Instruct-function-calling",
21
+ "temperature": 0.4,
22
+ "top_p": 0.3,
23
+ "repetition_penalty": 1
24
+ }
25
+
26
+ # Model configuration
27
+ WRITER_MODEL_CONFIG = {
28
+ "id": "Trelis/Meta-Llama-3-70B-Instruct-function-calling",
29
+ "temperature": 0.2,
30
+ "top_p": 0.2,
31
+ "repetition_penalty": 1
32
+ }
33
+
34
+ # Review criteria thresholds
35
+ REVIEW_THRESHOLDS = {
36
+ "min_word_count": 2000,
37
+ "min_score": 7,
38
+ "min_avg_score": 8,
39
+ "max_iterations": 5
40
+ }
41
+
42
+ # Crawler settings
43
+ CRAWLER_CONFIG = {
44
+ "max_pages_per_site": 10,
45
+ "min_relevance_score": 0.5
46
+ }
47
+
48
+ def get_hf_model(purpose: str) -> HuggingFaceChat:
49
+ """
50
+ Factory function to create HuggingFaceChat models with specific configurations.
51
+
52
+ Args:
53
+ purpose: Either 'searcher' or 'writer' to determine which configuration to use
54
+
55
+ Returns:
56
+ Configured HuggingFaceChat model instance
57
+ """
58
+ if purpose == 'searcher':
59
+ return HuggingFaceChat(
60
+ id=SEARCHER_MODEL_CONFIG["id"],
61
+ api_key=os.getenv("HF_API_KEY"),
62
+ temperature=SEARCHER_MODEL_CONFIG["temperature"],
63
+ top_p=SEARCHER_MODEL_CONFIG["top_p"],
64
+ )
65
+ elif purpose == 'writer':
66
+ return HuggingFaceChat(
67
+ id=WRITER_MODEL_CONFIG["id"],
68
+ api_key=os.getenv("HF_API_KEY"),
69
+ temperature=WRITER_MODEL_CONFIG["temperature"],
70
+ top_p=WRITER_MODEL_CONFIG["top_p"]
71
+ )
72
+ else:
73
+ raise ValueError(f"Unknown purpose: {purpose}. Must be 'searcher' or 'writer'")
74
+
75
+ def get_together_model(purpose: str) -> Together:
76
+ """
77
+ Factory function to create Together models with specific configurations.
78
+
79
+ Args:
80
+ purpose: Either 'searcher' or 'writer' to determine which configuration to use
81
+
82
+ Returns:
83
+ Configured Together model instance
84
+ """
85
+ if purpose == 'searcher':
86
+ return Together(
87
+ id=SEARCHER_MODEL_CONFIG["id"],
88
+ api_key=TOGETHER_API_KEY,
89
+ temperature=SEARCHER_MODEL_CONFIG["temperature"],
90
+ top_p=SEARCHER_MODEL_CONFIG["top_p"],
91
+ repetition_penalty=SEARCHER_MODEL_CONFIG["repetition_penalty"]
92
+ )
93
+ elif purpose == 'writer':
94
+ return Together(
95
+ id=WRITER_MODEL_CONFIG["id"],
96
+ api_key=TOGETHER_API_KEY,
97
+ temperature=WRITER_MODEL_CONFIG["temperature"],
98
+ top_p=WRITER_MODEL_CONFIG["top_p"],
99
+ repetition_penalty=WRITER_MODEL_CONFIG["repetition_penalty"]
100
+ )
101
+ else:
102
+ raise ValueError(f"Unknown purpose: {purpose}. Must be 'searcher' or 'writer'")
103
+
104
+
105
+ def get_groq_model(purpose: str) -> Groq:
106
+ """
107
+ Factory function to create Groq models with specific configurations.
108
+
109
+ Args:
110
+ purpose: Either 'searcher' or 'writer' to determine which configuration to use
111
+
112
+ Returns:
113
+ Configured Groq model instance
114
+ """
115
+ if purpose == 'searcher':
116
+ return Groq(
117
+ id=SEARCHER_MODEL_CONFIG["id"],
118
+ api_key=GROQ_API_KEY,
119
+ temperature=SEARCHER_MODEL_CONFIG["temperature"],
120
+ top_p=SEARCHER_MODEL_CONFIG["top_p"]
121
+ )
122
+ elif purpose == 'writer':
123
+ return Groq(
124
+ id=WRITER_MODEL_CONFIG["id"],
125
+ api_key=GROQ_API_KEY,
126
+ temperature=WRITER_MODEL_CONFIG["temperature"],
127
+ top_p=WRITER_MODEL_CONFIG["top_p"]
128
+ )
129
+ else:
130
+ raise ValueError(f"Unknown purpose: {purpose}. Must be 'searcher' or 'writer'")
file_handler.py ADDED
@@ -0,0 +1,190 @@
1
+ import os
2
+ import requests
3
+ from typing import Optional, List, Set
4
+ from urllib.parse import urlparse, unquote
5
+ from pathlib import Path
6
+ from datetime import datetime
7
+ from save_report import save_markdown_report
8
+ from phi.utils.log import logger
9
+
10
+
11
+ class FileHandler:
12
+ """Handler for downloading and saving files discovered during web crawling."""
13
+
14
+ SUPPORTED_EXTENSIONS = {
15
+ 'pdf': 'application/pdf',
16
+ 'xlsx': 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',
17
+ 'csv': 'text/csv'
18
+ }
19
+
20
+ # Common browser headers
21
+ HEADERS = {
22
+ 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
23
+ 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
24
+ 'Accept-Language': 'en-US,en;q=0.9',
25
+ 'Accept-Encoding': 'gzip, deflate, br',
26
+ 'DNT': '1',
27
+ 'Connection': 'keep-alive',
28
+ 'Upgrade-Insecure-Requests': '1'
29
+ }
30
+
31
+ def __init__(self):
32
+ # Get the report directory for the current date
33
+ self.report_dir, _ = save_markdown_report()
34
+ self.downloaded_files: Set[str] = set()
35
+ self.file_metadata: List[dict] = []
36
+ self.failed_downloads: List[dict] = [] # Track failed downloads
37
+
38
+ # Create a subdirectory for downloaded files
39
+ self.downloads_dir = os.path.join(self.report_dir, 'downloads')
40
+ os.makedirs(self.downloads_dir, exist_ok=True)
41
+
42
+ # Create a metadata file to track downloaded files
43
+ self.metadata_file = os.path.join(self.downloads_dir, 'files_metadata.md')
44
+
45
+ def is_supported_file(self, url: str) -> bool:
46
+ """Check if the URL points to a supported file type."""
47
+ parsed_url = urlparse(url)
48
+ extension = os.path.splitext(parsed_url.path)[1].lower().lstrip('.')
49
+ return extension in self.SUPPORTED_EXTENSIONS
50
+
51
+ def get_filename_from_url(self, url: str, content_type: Optional[str] = None) -> str:
52
+ """Generate a safe filename from the URL."""
53
+ # Get the filename from the URL
54
+ parsed_url = urlparse(url)
55
+ filename = os.path.basename(unquote(parsed_url.path))
56
+
57
+ # If no filename in URL, create one based on content type
58
+ if not filename:
59
+ extension = next(
60
+ (ext for ext, mime in self.SUPPORTED_EXTENSIONS.items()
61
+ if mime == content_type),
62
+ 'unknown'
63
+ )
64
+ filename = f"downloaded_file.{extension}"
65
+
66
+ # Ensure filename is safe and unique
67
+ safe_filename = "".join(c for c in filename if c.isalnum() or c in '._-')
68
+ base, ext = os.path.splitext(safe_filename)
69
+
70
+ # Add number suffix if file exists
71
+ counter = 1
72
+ while os.path.exists(os.path.join(self.downloads_dir, safe_filename)):
73
+ safe_filename = f"{base}_{counter}{ext}"
74
+ counter += 1
75
+
76
+ return safe_filename
77
+
78
+ def download_file(self, url: str, source_page: str = None) -> Optional[str]:
79
+ """
80
+ Download a file from the URL and save it to the downloads directory.
81
+ Returns the path to the saved file if successful, None otherwise.
82
+ """
83
+ if url in self.downloaded_files:
84
+ logger.info(f"File already downloaded: {url}")
85
+ return None
86
+
87
+ try:
88
+ # Create a session to maintain headers across redirects
89
+ session = requests.Session()
90
+ session.headers.update(self.HEADERS)
91
+
92
+ # First make a HEAD request to check content type and size
93
+ head_response = session.head(url, timeout=10, allow_redirects=True)
94
+ head_response.raise_for_status()
95
+
96
+ content_type = head_response.headers.get('content-type', '').lower().split(';')[0]
97
+ content_length = int(head_response.headers.get('content-length', 0))
98
+
99
+ # Check if content type is supported and size is reasonable (less than 100MB)
100
+ if not any(mime in content_type for mime in self.SUPPORTED_EXTENSIONS.values()):
101
+ logger.warning(f"Unsupported content type: {content_type} for URL: {url}")
102
+ return None
103
+
104
+ if content_length > 100 * 1024 * 1024: # 100MB limit
105
+ logger.warning(f"File too large ({content_length} bytes) for URL: {url}")
106
+ return None
107
+
108
+ # Make the actual download request
109
+ response = session.get(url, timeout=30, stream=True)
110
+ response.raise_for_status()
111
+
112
+ # Generate safe filename
113
+ filename = self.get_filename_from_url(url, content_type)
114
+ file_path = os.path.join(self.downloads_dir, filename)
115
+
116
+ # Save the file
117
+ with open(file_path, 'wb') as f:
118
+ for chunk in response.iter_content(chunk_size=8192):
119
+ if chunk:
120
+ f.write(chunk)
121
+
122
+ # Record metadata
123
+ metadata = {
124
+ 'filename': filename,
125
+ 'source_url': url,
126
+ 'source_page': source_page,
127
+ 'content_type': content_type,
128
+ 'download_time': datetime.now().isoformat(),
129
+ 'file_size': os.path.getsize(file_path)
130
+ }
131
+ self.file_metadata.append(metadata)
132
+
133
+ # Update metadata file
134
+ self._update_metadata_file()
135
+
136
+ self.downloaded_files.add(url)
137
+ logger.info(f"Successfully downloaded: {url} to {file_path}")
138
+ return file_path
139
+
140
+ except requests.RequestException as e:
141
+ error_info = {
142
+ 'url': url,
143
+ 'source_page': source_page,
144
+ 'error': str(e),
145
+ 'time': datetime.now().isoformat()
146
+ }
147
+ self.failed_downloads.append(error_info)
148
+ self._update_metadata_file() # Update metadata including failed downloads
149
+ logger.error(f"Error downloading file from {url}: {str(e)}")
150
+ return None
151
+ except Exception as e:
152
+ logger.error(f"Unexpected error while downloading {url}: {str(e)}")
153
+ return None
154
+
155
+ def _update_metadata_file(self):
156
+ """Update the metadata markdown file with information about downloaded files."""
157
+ try:
158
+ with open(self.metadata_file, 'w', encoding='utf-8') as f:
159
+ f.write("# Downloaded Files Metadata\n\n")
160
+
161
+ # Successful downloads
162
+ if self.file_metadata:
163
+ f.write("## Successfully Downloaded Files\n\n")
164
+ for metadata in self.file_metadata:
165
+ f.write(f"### {metadata['filename']}\n")
166
+ f.write(f"- Source URL: {metadata['source_url']}\n")
167
+ if metadata['source_page']:
168
+ f.write(f"- Found on page: {metadata['source_page']}\n")
169
+ f.write(f"- Content Type: {metadata['content_type']}\n")
170
+ f.write(f"- Download Time: {metadata['download_time']}\n")
171
+ f.write(f"- File Size: {metadata['file_size']} bytes\n\n")
172
+
173
+ # Failed downloads
174
+ if self.failed_downloads:
175
+ f.write("## Failed Downloads\n\n")
176
+ for failed in self.failed_downloads:
177
+ f.write(f"### {failed['url']}\n")
178
+ if failed['source_page']:
179
+ f.write(f"- Found on page: {failed['source_page']}\n")
180
+ f.write(f"- Error: {failed['error']}\n")
181
+ f.write(f"- Time: {failed['time']}\n\n")
182
+
183
+ except Exception as e:
184
+ logger.error(f"Error updating metadata file: {str(e)}")
185
+
186
+ def get_downloaded_files(self) -> List[str]:
187
+ """Return a list of all downloaded file paths."""
188
+ return [os.path.join(self.downloads_dir, f)
189
+ for f in os.listdir(self.downloads_dir)
190
+ if os.path.isfile(os.path.join(self.downloads_dir, f))]
requirements.txt ADDED
@@ -0,0 +1,7 @@
1
+ phidata
2
+ beautifulsoup4
3
+ requests
4
+ pydantic
5
+ duckduckgo-search
6
+ tenacity
7
+ streamlit
save_report.py ADDED
@@ -0,0 +1,27 @@
1
+ import os
2
+ from datetime import datetime
3
+ import shutil
4
+
5
+ def save_markdown_report():
6
+ # Get current date in YYYY-MM-DD format
7
+ current_date = datetime.now().strftime('%Y-%m-%d')
8
+
9
+ # Create directory name
10
+ report_dir = f"report_{current_date}"
11
+
12
+ # Create full path
13
+ base_path = os.path.dirname(os.path.abspath(__file__))
14
+ report_path = os.path.join(base_path, report_dir)
15
+
16
+ # Create directory if it doesn't exist
17
+ os.makedirs(report_path, exist_ok=True)
18
+
19
+ # Create markdown file path
20
+ report_file = os.path.join(report_path, f"report_{current_date}.md")
21
+
22
+ return report_path, report_file
23
+
24
+ if __name__ == "__main__":
25
+ report_path, report_file = save_markdown_report()
26
+ print(f"Report directory created at: {report_path}")
27
+ print(f"Report file path: {report_file}")
search_utils.py ADDED
@@ -0,0 +1,103 @@
1
+ import time
2
+ import logging
3
+ import random
4
+ import threading
5
+ from typing import Optional, Dict, Any
6
+ from duckduckgo_search.exceptions import RatelimitException
7
+
8
+ logger = logging.getLogger(__name__)
9
+
10
+ class RateLimitedSearch:
11
+ """Rate limited search implementation with exponential backoff."""
12
+
13
+ def __init__(self):
14
+ self.last_request_time = 0
15
+ self.min_delay = 30 # Increased minimum delay between requests to 30 seconds
16
+ self.max_delay = 300 # Maximum delay of 5 minutes
17
+ self.jitter = 5 # Added more jitter range
18
+ self.consecutive_failures = 0
19
+ self.max_consecutive_failures = 5 # Increased max failures before giving up
20
+ self._delay_lock = threading.Lock() # Add thread safety
21
+
22
+ def _add_jitter(self, delay: float) -> float:
23
+ """Add randomized jitter to delay."""
24
+ return delay + random.uniform(-self.jitter, self.jitter)
25
+
26
+ def _wait_for_rate_limit(self):
27
+ """Wait for rate limit with exponential backoff."""
28
+ with self._delay_lock:
29
+ current_time = time.time()
30
+ elapsed = current_time - self.last_request_time
31
+
32
+ # Calculate delay based on consecutive failures
33
+ if self.consecutive_failures > 0:
34
+ delay = min(
35
+ self.max_delay,
36
+ self.min_delay * (2 ** (self.consecutive_failures - 1))
37
+ )
38
+ else:
39
+ delay = self.min_delay
40
+
41
+ # Add jitter to prevent synchronized requests
42
+ jitter = random.uniform(-self.jitter, self.jitter)
43
+ delay = max(0, delay + jitter)
44
+
45
+ # If not enough time has elapsed, wait
46
+ if elapsed < delay:
47
+ time.sleep(delay - elapsed)
48
+
49
+ self.last_request_time = time.time()
50
+
51
+ def execute_with_retry(self,
52
+ search_func: callable,
53
+ max_retries: int = 3,
54
+ **kwargs) -> Optional[Dict[str, Any]]:
55
+ """Execute search with retries and exponential backoff."""
56
+
57
+ for attempt in range(max_retries):
58
+ try:
59
+ # Enforce rate limiting
60
+ self._wait_for_rate_limit()
61
+
62
+ # Execute search
63
+ result = search_func(**kwargs)
64
+
65
+ # Reset consecutive failures on success
66
+ self.consecutive_failures = 0
67
+ return result
68
+
69
+ except RatelimitException as e:
70
+ self.consecutive_failures += 1
71
+
72
+ # Calculate backoff time
73
+ backoff = min(
74
+ self.max_delay,
75
+ self.min_delay * (2 ** attempt)
76
+ )
77
+ backoff = self._add_jitter(backoff)
78
+
79
+ if attempt == max_retries - 1:
80
+ logger.error(f"Rate limit exceeded after {max_retries} retries")
81
+ raise
82
+
83
+ logger.warning(f"Rate limit hit, attempt {attempt + 1}/{max_retries}. "
84
+ f"Waiting {backoff:.2f} seconds...")
85
+ time.sleep(backoff)
86
+
87
+ # If we've hit too many consecutive failures, raise an exception
88
+ if self.consecutive_failures >= self.max_consecutive_failures:
89
+ logger.error("Too many consecutive rate limit failures")
90
+ raise RatelimitException("Persistent rate limiting detected")
91
+ continue
92
+
93
+ except Exception as e:
94
+ logger.error(f"Search error on attempt {attempt + 1}: {str(e)}")
95
+ if attempt == max_retries - 1:
96
+ raise
97
+
98
+ backoff = self.min_delay * (2 ** attempt)
99
+ backoff = self._add_jitter(backoff)
100
+ logger.info(f"Retrying in {backoff:.2f} seconds...")
101
+ time.sleep(backoff)
102
+
103
+ return None
web_search.py ADDED
@@ -0,0 +1,1109 @@
1
+ import ast
+ import json
2
+ import re
3
+ import time
4
+ import os
5
+ import concurrent.futures
6
+ from typing import Optional, Iterator, List, Set, Dict, Any
7
+ from urllib.parse import urlparse, urljoin
8
+ import requests
9
+ from bs4 import BeautifulSoup
10
+ from pydantic import BaseModel, Field
11
+ from datetime import datetime
12
+
13
+ # Phi imports
14
+ from phi.workflow import Workflow, RunResponse, RunEvent
15
+ from phi.storage.workflow.sqlite import SqlWorkflowStorage
16
+ from phi.agent import Agent
17
+ from phi.model.groq import Groq
18
+ from phi.tools.duckduckgo import DuckDuckGo
19
+ from phi.tools.googlesearch import GoogleSearch
20
+ from phi.utils.pprint import pprint_run_response
21
+ from phi.utils.log import logger
22
+
23
+ # Error handling imports
24
+ from duckduckgo_search.exceptions import RatelimitException
25
+ from tenacity import retry, wait_exponential, stop_after_attempt, retry_if_exception_type
26
+ from requests.exceptions import HTTPError
27
+
28
+ from config import GROQ_API_KEY, NVIDIA_API_KEY, SEARCHER_MODEL_CONFIG, WRITER_MODEL_CONFIG, get_hf_model
29
+ import configparser
30
+
31
+ DUCK_DUCK_GO_FIXED_MAX_RESULTS = 10
32
+
33
+ config = configparser.ConfigParser()
34
+ config.read('config.ini')
35
+ DEFAULT_TOPIC = config.get('DEFAULT', 'default_topic').strip('"')  # strip the quotes written by app.py
+ INITIAL_WEBSITES = ast.literal_eval(config.get('DEFAULT', 'initial_websites'))  # stored as a Python list literal
37
+
38
+ # The topic to generate a blog post on
39
+ topic = DEFAULT_TOPIC
40
+
41
+ class NewsArticle(BaseModel):
42
+ """Article data model containing title, URL and description."""
43
+ title: str = Field(..., description="Title of the article.")
44
+ url: str = Field(..., description="Link to the article.")
45
+ description: Optional[str] = Field(None, description="Summary of the article if available.")
46
+
47
+
48
+ class SearchResults(BaseModel):
49
+ """Container for search results containing a list of articles."""
50
+ articles: List[NewsArticle]
51
+
52
+
53
+ class BlogPostGenerator(Workflow):
54
+ """Workflow for generating blog posts based on web research."""
55
+ searcher: Agent = Field(...)
56
+ backup_searcher: Agent = Field(...)
57
+ writer: Agent = Field(...)
58
+ initial_websites: List[str] = Field(default_factory=lambda: INITIAL_WEBSITES)
59
+ file_handler: Optional[Any] = Field(None)
60
+
61
+ def __init__(
62
+ self,
63
+ session_id: str,
64
+ searcher: Agent,
65
+ backup_searcher: Agent,
66
+ writer: Agent,
67
+ file_handler: Optional[Any] = None,
68
+ storage: Optional[SqlWorkflowStorage] = None,
69
+ ):
70
+ super().__init__(
71
+ session_id=session_id,
72
+ searcher=searcher,
73
+ backup_searcher=backup_searcher,
74
+ writer=writer,
75
+ storage=storage,
76
+ )
77
+ self.file_handler = file_handler
78
+
79
+ # Configure search instructions
80
+ search_instructions = [
81
+ "Given a topic, search for 20 articles and return the 15 most relevant articles.",
82
+ "For each article, provide:",
83
+ "- title: The article title",
84
+ "- url: The article URL",
85
+ "- description: A brief description or summary of the article",
86
+ "Return the results in a structured format with these exact field names."
87
+ ]
88
+
89
+ # Primary searcher using DuckDuckGo
90
+ self.searcher = Agent(
91
+ model=get_hf_model('searcher'),
92
+ tools=[DuckDuckGo(fixed_max_results=DUCK_DUCK_GO_FIXED_MAX_RESULTS)],
93
+ instructions=search_instructions,
94
+ response_model=SearchResults
95
+ )
96
+
97
+
98
+ # Backup searcher using Google Search
99
+ self.backup_searcher = Agent(
100
+ model=get_hf_model('searcher'),
101
+ tools=[GoogleSearch()],
102
+ instructions=search_instructions,
103
+ response_model=SearchResults
104
+ )
105
+
106
+
107
+ # Writer agent configuration
108
+ writer_instructions = [
109
+ "You are a professional research analyst tasked with creating a comprehensive report on the given topic.",
110
+ "The sources provided include both general web search results and specialized intelligence/security websites.",
111
+ "Carefully analyze and cross-reference information from all sources to create a detailed report.",
112
+ "",
113
+ "Report Structure:",
114
+ "1. Executive Summary (2-3 paragraphs)",
115
+ " - Provide a clear, concise overview of the main findings",
116
+ " - Address the research question directly",
117
+ " - Highlight key discoveries and implications",
118
+ "",
119
+ "2. Detailed Analysis (Multiple sections)",
120
+ " - Break down the topic into relevant themes or aspects",
121
+ " - For each theme:",
122
+ " * Present detailed findings from multiple sources",
123
+ " * Cross-reference information between general and specialized sources",
124
+ " * Analyze trends, patterns, and developments",
125
+ " * Discuss implications and potential impacts",
126
+ "",
127
+ "3. Source Analysis and Credibility",
128
+ " For each major source:",
129
+ " - Evaluate source credibility and expertise",
130
+ " - Note if from specialized intelligence/security website",
131
+ " - Assess potential biases or limitations",
132
+ " - Key findings and unique contributions",
133
+ "",
134
+ "4. Key Takeaways and Strategic Implications",
135
+ " - Synthesize findings from all sources",
136
+ " - Compare/contrast general media vs specialized analysis",
137
+ " - Discuss broader geopolitical implications",
138
+ " - Address potential future developments",
139
+ "",
140
+ "5. References",
141
+ " - Group sources by type (specialized websites vs general media)",
142
+ " - List all sources with full citations",
143
+ " - Include URLs as clickable markdown links [Title](URL)",
144
+ " - Ensure every major claim has at least one linked source",
145
+ "",
146
+ "Important Guidelines:",
147
+ "- Prioritize information from specialized intelligence/security sources",
148
+ "- Cross-validate claims between multiple sources when possible",
149
+ "- Maintain a professional, analytical tone",
150
+ "- Support all claims with evidence",
151
+ "- Include specific examples and data points",
152
+ "- Use direct quotes for significant statements",
153
+ "- Address potential biases in reporting",
154
+ "- Ensure the report directly answers the research question",
155
+ "",
156
+ "Format the report with clear markdown headings (# ## ###), subheadings, and paragraphs.",
157
+ "Each major section should contain multiple paragraphs with detailed analysis."
158
+ ]
159
+
160
+ self.writer = Agent(
161
+ model=get_hf_model('writer'),
162
+ instructions=writer_instructions,
163
+ structured_outputs=True
164
+ )
165
+
166
+
167
+ def _parse_search_response(self, response) -> Optional[SearchResults]:
168
+ """Parse and validate search response into SearchResults model."""
169
+ try:
170
+ if isinstance(response, str):
171
+ # Clean up markdown code blocks and extract JSON
172
+ content = response.strip()
173
+ if '```' in content:
174
+ # Extract content between code block markers
175
+ match = re.search(r'```(?:json)?\n(.*?)\n```', content, re.DOTALL)
176
+ if match:
177
+ content = match.group(1).strip()
178
+ else:
179
+ # If no proper code block found, remove all ``` markers
180
+ content = re.sub(r'```(?:json)?\n?', '', content)
181
+ content = content.strip()
182
+
183
+ # Try to parse JSON response
184
+ try:
185
+ # Clean up any trailing commas before closing brackets/braces
186
+ content = re.sub(r',(\s*[}\]])', r'\1', content)
187
+ # Fix invalid escape sequences
188
+ content = re.sub(r'\\([^"\\\/bfnrtu])', r'\1', content) # Remove invalid escapes
189
+ content = content.replace('\t', ' ') # Replace tabs with spaces
190
+ # Handle any remaining unicode escapes
191
+ content = re.sub(r'\\u([0-9a-fA-F]{4})', lambda m: chr(int(m.group(1), 16)), content)
192
+
193
+ data = json.loads(content)
194
+
195
+ if isinstance(data, dict) and 'articles' in data:
196
+ articles = []
197
+ for article in data['articles']:
198
+ if isinstance(article, dict):
199
+ # Ensure all required fields are strings
200
+ article = {
201
+ 'title': str(article.get('title', '')).strip(),
202
+ 'url': str(article.get('url', '')).strip(),
203
+ 'description': str(article.get('description', '')).strip()
204
+ }
205
+ if article['title'] and article['url']: # Only add if has required fields
206
+ articles.append(NewsArticle(**article))
207
+
208
+ if articles:
209
+ logger.info(f"Successfully parsed {len(articles)} articles from JSON")
210
+ return SearchResults(articles=articles)
211
+
212
+ except json.JSONDecodeError as e:
213
+ logger.warning(f"Failed to parse JSON response: {str(e)}, attempting to extract data manually")
214
+
215
+ # Fallback to regex extraction if JSON parsing fails
216
+ urls = re.findall(r'https?://[^\s<>"]+|www\.[^\s<>"]+', content)
217
+ titles = re.findall(r'"title":\s*"([^"]+)"', content)
218
+ descriptions = re.findall(r'"description":\s*"([^"]+)"', content)
219
+
220
+ if not urls: # Try alternative patterns
221
+ urls = re.findall(r'(?<=\()http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+(?=\))', content)
222
+
223
+ if urls:
224
+ articles = []
225
+ for i, url in enumerate(urls):
226
+ title = titles[i] if i < len(titles) else f"Article {i+1}"
227
+ description = descriptions[i] if i < len(descriptions) else ""
228
+ # Clean up extracted data
229
+ title = title.strip().replace('\\"', '"')
230
+ url = url.strip().replace('\\"', '"')
231
+ description = description.strip().replace('\\"', '"')
232
+
233
+ if url: # Only add if URL exists
234
+ articles.append(NewsArticle(
235
+ title=title,
236
+ url=url,
237
+ description=description
238
+ ))
239
+
240
+ if articles:
241
+ logger.info(f"Successfully extracted {len(articles)} articles using regex")
242
+ return SearchResults(articles=articles)
243
+
244
+ logger.warning("No valid articles found in response")
245
+ return None
246
+
247
+ elif isinstance(response, dict):
248
+ # Handle dictionary response
249
+ if 'articles' in response:
250
+ articles = []
251
+ for article in response['articles']:
252
+ if isinstance(article, dict):
253
+ # Ensure all fields are strings
254
+ article = {
255
+ 'title': str(article.get('title', '')).strip(),
256
+ 'url': str(article.get('url', '')).strip(),
257
+ 'description': str(article.get('description', '')).strip()
258
+ }
259
+ if article['title'] and article['url']:
260
+ articles.append(NewsArticle(**article))
261
+ elif isinstance(article, NewsArticle):
262
+ articles.append(article)
263
+
264
+ if articles:
265
+ logger.info(f"Successfully processed {len(articles)} articles from dict")
266
+ return SearchResults(articles=articles)
267
+ return None
268
+
269
+ elif isinstance(response, SearchResults):
270
+ # Already in correct format
271
+ return response
272
+
273
+ elif isinstance(response, RunResponse):
274
+ # Extract from RunResponse
275
+ if response.content:
276
+ return self._parse_search_response(response.content)
277
+ return None
278
+
279
+ logger.error(f"Unsupported response type: {type(response)}")
280
+ return None
281
+
282
+ except Exception as e:
283
+ logger.error(f"Error parsing search response: {str(e)}")
284
+ return None
285
+
286
+ def _search_with_retry(self, topic: str, use_backup: bool = False, max_retries: int = 3) -> Optional[SearchResults]:
287
+ """Execute search with retries and rate limit handling."""
288
+ searcher = self.backup_searcher if use_backup else self.searcher
289
+ source = "backup" if use_backup else "primary"
290
+
291
+ # Initialize rate limit tracking
292
+ rate_limited_sources = set()
293
+
294
+ for attempt in range(max_retries):
295
+ try:
296
+ if source in rate_limited_sources:
297
+ logger.warning(f"{source} search is rate limited, switching to alternative method")
298
+ if not use_backup:
299
+ # Try backup search if primary is rate limited
300
+ backup_results = self._search_with_retry(topic, use_backup=True, max_retries=max_retries)
301
+ if backup_results:
302
+ return backup_results
303
+ # If both sources are rate limited, use longer backoff
304
+ backoff_time = min(3600, 60 * (2 ** attempt)) # Max 1 hour backoff
305
+ logger.info(f"All search methods rate limited. Waiting {backoff_time} seconds before retry...")
306
+ time.sleep(backoff_time)
307
+
308
+ logger.info(f"\nAttempting {source} search (attempt {attempt + 1}/{max_retries})...")
309
+
310
+ # Try different search prompts to improve results
311
+ search_prompts = [
312
+ f"""Search for detailed articles about: {topic}
313
+ Return only high-quality, relevant sources.
314
+ Format the results as a JSON object with an 'articles' array containing:
315
+ - title: The article title
316
+ - url: The article URL
317
+ - description: A brief description or summary
318
+ """,
319
+ f"""Find comprehensive articles and research papers about: {topic}
320
+ Focus on authoritative sources and recent publications.
321
+ Return results in JSON format with 'articles' array.
322
+ """,
323
+ f"""Locate detailed analysis and reports discussing: {topic}
324
+ Prioritize academic, industry, and news sources.
325
+ Return structured JSON with article details.
326
+ """
327
+ ]
328
+
329
+ # Try each prompt until we get results
330
+ for prompt in search_prompts:
331
+ try:
332
+ response = searcher.run(prompt, stream=False)
333
+ results = self._parse_search_response(response)
334
+ if results and results.articles:
335
+ logger.info(f"Found {len(results.articles)} articles from {source} search")
336
+ return results
337
+ except Exception as e:
338
+ if any(err in str(e).lower() for err in ["rate", "limit", "quota", "exhausted"]):
339
+ rate_limited_sources.add(source)
340
+ raise
341
+ logger.warning(f"Search prompt failed: {str(e)}")
342
+ continue
343
+
344
+ logger.warning(f"{source.title()} search returned no valid results")
345
+
346
+ except Exception as e:
347
+ error_msg = str(e).lower()
348
+ if any(err in error_msg for err in ["rate", "limit", "quota", "exhausted"]):
349
+ rate_limited_sources.add(source)
350
+ logger.error(f"{source} search rate limited: {str(e)}")
351
+ # Try alternative source immediately
352
+ if not use_backup:
353
+ backup_results = self._search_with_retry(topic, use_backup=True, max_retries=max_retries)
354
+ if backup_results:
355
+ return backup_results
356
+ else:
357
+ logger.error(f"Error during {source} search (attempt {attempt + 1}): {str(e)}")
358
+
359
+ if attempt < max_retries - 1:
360
+ backoff_time = 2 ** attempt
361
+ if source in rate_limited_sources:
362
+ backoff_time = min(3600, 60 * (2 ** attempt)) # Longer backoff for rate limits
363
+ logger.info(f"Waiting {backoff_time} seconds before retry...")
364
+ time.sleep(backoff_time)
365
+
366
+ return None
367
+
368
+ def _validate_content(self, content: str) -> bool:
369
+ """Validate that the generated content is readable and properly formatted."""
370
+ if not content or len(content.strip()) < 100:
371
+ logger.warning("Content too short or empty")
372
+ return False
373
+
374
+ # Check for basic structure
375
+ if not any(marker in content for marker in ['#', '\n\n']):
376
+ logger.warning("Content lacks proper structure (headers or paragraphs)")
377
+ return False
378
+
379
+ # Check for reasonable paragraph lengths
380
+ paragraphs = [p.strip() for p in content.split('\n\n') if p.strip()]
381
+ if not paragraphs:
382
+ logger.warning("No valid paragraphs found")
383
+ return False
384
+
385
+ # Common words that are allowed to repeat frequently
386
+ common_words = {
387
+ 'the', 'a', 'an', 'and', 'or', 'but', 'in', 'on', 'at', 'to', 'for', 'of', 'with', 'by',
388
+ 'this', 'that', 'these', 'those', 'it', 'its', 'is', 'are', 'was', 'were', 'be', 'been',
389
+ 'has', 'have', 'had', 'would', 'could', 'should', 'will', 'can'
390
+ }
391
+
392
+ # Track word frequencies across paragraphs
393
+ word_frequencies = {}
394
+ total_words = 0
395
+
396
+ # Validate each paragraph
397
+ for para in paragraphs:
398
+ # Skip headers and references
399
+ if para.startswith('#') or para.startswith('http'):
400
+ continue
401
+
402
+ # Calculate word statistics
403
+ words = para.split()
404
+ if len(words) < 3:
405
+ continue # Skip very short paragraphs
406
+
407
+ # Calculate word statistics
408
+ word_lengths = [len(word) for word in words]
409
+ avg_word_length = sum(word_lengths) / len(word_lengths)
410
+
411
+ # More nuanced word length validation
412
+ long_words = [w for w in words if len(w) > 15]
413
+ long_word_ratio = len(long_words) / len(words) if words else 0
414
+
415
+ # Allow higher average length if the text contains URLs or technical terms
416
+ contains_url = any(word.startswith(('http', 'www')) for word in words)
417
+ contains_technical = any(word.lower().endswith(('tion', 'ment', 'ology', 'ware', 'tech')) for word in words)
418
+
419
+ # Adjust thresholds based on content type
420
+ max_avg_length = 12 # Base maximum average word length
421
+ if contains_url:
422
+ max_avg_length = 20 # Allow longer average for content with URLs
423
+ elif contains_technical:
424
+ max_avg_length = 15 # Allow longer average for technical content
425
+
426
+ # Fail only if multiple indicators of problematic text
427
+ if (avg_word_length > max_avg_length and long_word_ratio > 0.3) or avg_word_length > 25:
428
+ logger.warning(f"Suspicious word lengths: avg={avg_word_length:.1f}, long_ratio={long_word_ratio:.1%}")
429
+ return False
430
+
431
+ # Check for excessive punctuation or special characters
432
+ special_char_ratio = len(re.findall(r'[^a-zA-Z0-9\s.,!?()"-]', para)) / len(para)
433
+ if special_char_ratio > 0.15: # Increased threshold slightly
434
+ logger.warning(f"Too many special characters: {special_char_ratio}")
435
+ return False
436
+
437
+ # Check for coherent sentence structure
438
+ sentences = [s.strip() for s in re.split(r'[.!?]+', para) if s.strip()]
439
+ weak_sentences = 0
440
+ for sentence in sentences:
441
+ words = sentence.split()
442
+ if len(words) < 3: # Skip very short sentences
443
+ continue
444
+
445
+ # More lenient grammar check
446
+ structure_indicators = [
447
+ any(word[0].isupper() for word in words), # Has some capitalization
448
+ any(word.lower() in common_words for word in words), # Has common words
449
+ len(words) >= 3, # Reasonable length
450
+ any(len(word) > 3 for word in words), # Has some non-trivial words
451
+ ]
452
+
453
+ # Only fail if less than 2 indicators are present
454
+ if sum(structure_indicators) < 2:
455
+ logger.warning(f"Weak sentence structure: {sentence}")
456
+ weak_sentences += 1
457
+ if weak_sentences > len(sentences) / 2: # Fail if more than half are weak
458
+ logger.warning("Too many poorly structured sentences")
459
+ return False
460
+
461
+ # Update word frequencies
462
+ for word in words:
463
+ word = word.lower()
464
+ if word not in common_words and len(word) > 2: # Only track non-common words
465
+ word_frequencies[word] = word_frequencies.get(word, 0) + 1
466
+ total_words += 1
467
+
468
+ # Check for excessive repetition
469
+ if total_words > 0:
470
+ for word, count in word_frequencies.items():
471
+ # Calculate the frequency as a percentage
472
+ frequency = count / total_words
473
+
474
+ # Allow up to 10% frequency for any word
475
+ if frequency > 0.1 and count > 3:
476
+ logger.warning(f"Word '{word}' appears too frequently ({count} times, {frequency:.1%})")
477
+ return False
478
+
479
+ # Content seems valid
480
+ return True
481
+
482
+ def _save_markdown(self, topic: str, content: str) -> str:
483
+ """Save the content as an HTML file."""
484
+ try:
485
+ # Get or create report directory
486
+ report_dir = None
487
+ if hasattr(self, 'file_handler') and self.file_handler:
488
+ report_dir = self.file_handler.report_dir
489
+ else:
490
+ # Create a default report directory if no file handler
491
+ report_dir = os.path.join(os.path.dirname(__file__), f"report_{datetime.now().strftime('%Y-%m-%d')}")
492
+ os.makedirs(report_dir, exist_ok=True)
493
+ logger.info(f"Created report directory: {report_dir}")
494
+
495
+ # Create filename from topic
496
+ filename = re.sub(r'[^\w\s-]', '', topic.lower()) # Remove special chars
497
+ filename = re.sub(r'[-\s]+', '-', filename) # Replace spaces with hyphens
498
+ filename = f"{filename}.html"
499
+ file_path = os.path.join(report_dir, filename)
500
+
501
+ # Convert markdown to HTML with styling
502
+ html_content = f"""
503
+ <!DOCTYPE html>
504
+ <html lang="en">
505
+ <head>
506
+ <meta charset="UTF-8">
507
+ <meta name="viewport" content="width=device-width, initial-scale=1.0">
508
+ <title>{topic}</title>
509
+ <style>
510
+ body {{
511
+ font-family: Arial, sans-serif;
512
+ line-height: 1.6;
513
+ color: #333;
514
+ max-width: 1200px;
515
+ margin: 0 auto;
516
+ padding: 20px;
517
+ }}
518
+ h1 {{
519
+ color: #2c3e50;
520
+ border-bottom: 2px solid #3498db;
521
+ padding-bottom: 10px;
522
+ }}
523
+ h2 {{
524
+ color: #34495e;
525
+ margin-top: 30px;
526
+ }}
527
+ h3 {{
528
+ color: #455a64;
529
+ }}
530
+ a {{
531
+ color: #3498db;
532
+ text-decoration: none;
533
+ }}
534
+ a:hover {{
535
+ text-decoration: underline;
536
+ }}
537
+ .executive-summary {{
538
+ background-color: #f8f9fa;
539
+ border-left: 4px solid #3498db;
540
+ padding: 20px;
541
+ margin: 20px 0;
542
+ }}
543
+ .analysis-section {{
544
+ margin: 30px 0;
545
+ }}
546
+ .source-section {{
547
+ background-color: #f8f9fa;
548
+ padding: 15px;
549
+ margin: 10px 0;
550
+ border-radius: 5px;
551
+ }}
552
+ .references {{
553
+ margin-top: 40px;
554
+ border-top: 2px solid #ecf0f1;
555
+ padding-top: 20px;
556
+ }}
557
+ .timestamp {{
558
+ color: #7f8c8d;
559
+ font-size: 0.9em;
560
+ margin-top: 40px;
561
+ text-align: right;
562
+ }}
563
+ blockquote {{
564
+ border-left: 3px solid #3498db;
565
+ margin: 20px 0;
566
+ padding-left: 20px;
567
+ color: #555;
568
+ }}
569
+ code {{
570
+ background-color: #f7f9fa;
571
+ padding: 2px 5px;
572
+ border-radius: 3px;
573
+ font-family: monospace;
574
+ }}
575
+ </style>
576
+ </head>
577
+ <body>
578
+ <div class="content">
579
+ {self._markdown_to_html(content)}
580
+ </div>
581
+ <div class="timestamp">
582
+ Generated on: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}
583
+ </div>
584
+ </body>
585
+ </html>
586
+ """
587
+
588
+ # Write the HTML file
589
+ with open(file_path, 'w', encoding='utf-8') as f:
590
+ f.write(html_content)
591
+
592
+ logger.info(f"Successfully saved HTML report: {file_path}")
593
+ return file_path
594
+
595
+ except Exception as e:
596
+ logger.error(f"Failed to save HTML file: {str(e)}")
597
+ return None
598
+
599
+ def _markdown_to_html(self, markdown_content: str) -> str:
600
+ """Convert markdown content to HTML with basic formatting."""
601
+ # Headers
602
+ html = markdown_content
603
+ html = re.sub(r'^# (.*?)$', r'<h1>\1</h1>', html, flags=re.MULTILINE)
604
+ html = re.sub(r'^## (.*?)$', r'<h2>\1</h2>', html, flags=re.MULTILINE)
605
+ html = re.sub(r'^### (.*?)$', r'<h3>\1</h3>', html, flags=re.MULTILINE)
606
+
607
+ # Lists
608
+ html = re.sub(r'^\* (.*?)$', r'<li>\1</li>', html, flags=re.MULTILINE)
609
+ html = re.sub(r'(<li>.*?</li>\n)+', r'<ul>\n\g<0></ul>', html, flags=re.DOTALL)
610
+
611
+ # Links
612
+ html = re.sub(r'\[(.*?)\]\((.*?)\)', r'<a href="\2">\1</a>', html)
613
+
614
+ # Emphasis
615
+ html = re.sub(r'\*\*(.*?)\*\*', r'<strong>\1</strong>', html)
616
+ html = re.sub(r'\*(.*?)\*', r'<em>\1</em>', html)
617
+
618
+ # Paragraphs
619
+ html = re.sub(r'\n\n(.*?)\n\n', r'\n<p>\1</p>\n', html, flags=re.DOTALL)
620
+
621
+ # Blockquotes
622
+ html = re.sub(r'^\> (.*?)$', r'<blockquote>\1</blockquote>', html, flags=re.MULTILINE)
623
+
624
+ # Code blocks
625
+ html = re.sub(r'```(.*?)```', r'<pre><code>\1</code></pre>', html, flags=re.DOTALL)
626
+ html = re.sub(r'`(.*?)`', r'<code>\1</code>', html)
627
+
628
+ return html
629
+
630
+ def run(self, topic: str, use_cache: bool = True) -> Iterator[RunResponse]:
631
+ """Run the blog post generation workflow."""
632
+ logger.info(f"Starting blog post generation for topic: {topic}")
633
+
634
+ # Extract keywords from topic
635
+ keywords = topic.lower().split()
636
+ keywords = [w for w in keywords if len(w) > 3 and w not in {'what', 'where', 'when', 'how', 'why', 'is', 'are', 'was', 'were', 'will', 'would', 'could', 'should', 'the', 'and', 'but', 'or', 'for', 'with'}]
637
+
638
+ all_articles = []
639
+ existing_urls = set()
640
+
641
+ # First, try web search
642
+ logger.info("Starting web search...")
643
+ search_results = self._search_with_retry(topic)
644
+ if search_results and search_results.articles:
645
+ for article in search_results.articles:
646
+ if article.url not in existing_urls:
647
+ all_articles.append(article)
648
+ existing_urls.add(article.url)
649
+ logger.info(f"Found {len(search_results.articles)} articles from web search")
650
+
651
+ # Then, crawl initial websites
652
+ logger.info("Starting website crawl...")
653
+ from file_handler import FileHandler
654
+ crawler = WebsiteCrawler(max_pages_per_site=10)
655
+ crawler.file_handler = FileHandler() # Initialize file handler
656
+
657
+ # Get the report directory from the file handler
658
+ report_dir = crawler.file_handler.report_dir
659
+
660
+ crawled_results = crawler.crawl_all_websites(self.initial_websites, keywords)
661
+
662
+ # Save the relevance log to the report directory
663
+ crawler.save_relevance_log(report_dir)
664
+
665
+ if crawled_results:
666
+ for result in crawled_results:
667
+ if result['url'] not in existing_urls:
668
+ article = NewsArticle(**result)
669
+ all_articles.append(article)
670
+ existing_urls.add(result['url'])
671
+ logger.info(f"Found {len(crawled_results)} articles from website crawl")
672
+
673
+ # If we still need more results, try backup search
674
+ if len(all_articles) < 10:
675
+ logger.info("Supplementing with backup search...")
676
+ backup_results = self._search_with_retry(topic, use_backup=True)
677
+ if backup_results and backup_results.articles:
678
+ for article in backup_results.articles:
679
+ if article.url not in existing_urls:
680
+ all_articles.append(article)
681
+ existing_urls.add(article.url)
682
+ logger.info(f"Found {len(backup_results.articles)} articles from backup search")
683
+
684
+ # Create final search results
685
+ search_results = SearchResults(articles=all_articles)
686
+
687
+ if len(search_results.articles) < 5: # Reduced minimum requirement
688
+ error_msg = f"Failed to gather sufficient sources. Only found {len(search_results.articles)} valid sources."
689
+ logger.error(error_msg)
690
+ yield RunResponse(
691
+ event=RunEvent.run_completed,
692
+ message=error_msg
693
+ )
694
+ return
695
+
696
+ logger.info(f"Successfully gathered {len(search_results.articles)} unique sources for analysis")
697
+
698
+ # Writing phase
699
+ print("\nGenerating report from search results...")
700
+ writer_response = self.writer.run(
701
+ f"""Generate a comprehensive research report on: {topic}
702
+ Use the following articles as sources:
703
+ {json.dumps([{'title': a.title, 'url': a.url, 'description': a.description} for a in search_results.articles], indent=2)}
704
+
705
+ Format the output in markdown with:
706
+ 1. Clear section headers using #, ##, ###
707
+ 2. Proper paragraph spacing
708
+ 3. Bullet points where appropriate
709
+ 4. Links to sources
710
+ 5. A references section at the end
711
+
712
+ Focus on readability and proper markdown formatting.""",
713
+ stream=False
714
+ )
715
+
716
+ if isinstance(writer_response, RunResponse):
717
+ content = writer_response.content
718
+ else:
719
+ content = writer_response
720
+
721
+ # Validate content
722
+ if not self._validate_content(content):
723
+ print("\nFirst attempt produced invalid content, trying again...")
724
+ # Try one more time with a more structured prompt
725
+ writer_response = self.writer.run(
726
+ f"""Generate a clear, well-structured research report on: {topic}
727
+ Format the output in proper markdown with:
728
+ 1. A main title using #
729
+ 2. Section headers using ##
730
+ 3. Subsection headers using ###
731
+ 4. Well-formatted paragraphs
732
+ 5. Bullet points for lists
733
+ 6. A references section at the end
734
+
735
+ Source articles:
736
+ {json.dumps([{'title': a.title, 'url': a.url} for a in search_results.articles], indent=2)}""",
737
+ stream=False
738
+ )
739
+
740
+ if isinstance(writer_response, RunResponse):
741
+ content = writer_response.content
742
+ else:
743
+ content = writer_response
744
+
745
+ if not self._validate_content(content):
746
+ yield RunResponse(
747
+ event=RunEvent.run_completed,
748
+ message="Failed to generate readable content. Please try again."
749
+ )
750
+ return
751
+
752
+ # Save as HTML
753
+ html_file = self._save_markdown(topic, content)
754
+
755
+ if not html_file:
756
+ yield RunResponse(
757
+ event=RunEvent.run_completed,
758
+ message="Failed to save HTML file. Please try again."
759
+ )
760
+ return
761
+
762
+ # Print the report to console and yield response
763
+ print("\n=== Generated Report ===\n")
764
+ print(content)
765
+ print("\n=====================\n")
766
+
767
+ yield RunResponse(
768
+ event=RunEvent.run_completed,
769
+ message=f"Report generated successfully. HTML saved as: {html_file}",
770
+ content=content
771
+ )
772
+
773
+ return
774
+
775
+ class WebsiteCrawler:
776
+ """Crawler to extract relevant information from specified websites."""
777
+
778
+ def __init__(self, max_pages_per_site: int = 10):
779
+ self.max_pages_per_site = max_pages_per_site
780
+ self.visited_urls: Set[str] = set()
781
+ self.results: Dict[str, List[dict]] = {}
782
+ self.file_handler = None
783
+
784
+ # Set up logging
785
+ self.relevance_log = [] # Store relevance decisions
786
+
787
+ def _check_relevance(self, text: str, keywords: List[str]) -> tuple[bool, dict]:
788
+ """
789
+ Check if the page content is relevant based on keywords.
790
+ Returns a tuple of (is_relevant, relevance_info).
791
+ """
792
+ text_lower = text.lower()
793
+ keyword_matches = {}
794
+
795
+ # Check each keyword and count occurrences
796
+ for keyword in keywords:
797
+ keyword_lower = keyword.lower()
798
+ count = text_lower.count(keyword_lower)
799
+ keyword_matches[keyword] = count
800
+
801
+ # Page is relevant if any keyword is found
802
+ is_relevant = any(count > 0 for count in keyword_matches.values())
803
+
804
+ # Prepare relevance information
805
+ relevance_info = {
806
+ 'is_relevant': is_relevant,
807
+ 'keyword_matches': keyword_matches,
808
+ 'total_matches': sum(keyword_matches.values()),
809
+ 'matching_keywords': [k for k, v in keyword_matches.items() if v > 0],
810
+ 'text_length': len(text)
811
+ }
812
+
813
+ return is_relevant, relevance_info
814
+
815
+ def crawl_page(self, url: str, keywords: List[str]) -> List[dict]:
+ """Crawl a single page and extract relevant information."""
+ try:
+ # Skip if already visited
+ if url in self.visited_urls:
+ logger.debug(f"Skipping already visited URL: {url}")
+ return []
+
+ self.visited_urls.add(url)
+ logger.info(f"Crawling page: {url}")
+
+ # Fetch and parse the page
+ response = requests.get(url, timeout=10)
+ response.raise_for_status()
+ soup = BeautifulSoup(response.text, 'html.parser')
+
+ # Get page title (fall back to the URL if the title tag is missing or empty)
+ title = (soup.title.get_text(strip=True) if soup.title else "") or url
+
+ # Extract text content
+ text = ' '.join([p.get_text() for p in soup.find_all(['p', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6'])])
+
+ # Check relevance and get detailed information
+ is_relevant, relevance_info = self._check_relevance(text, keywords)
+
+ # Log relevance decision
+ log_entry = {
+ 'url': url,
+ 'title': title,
+ 'timestamp': datetime.now().isoformat(),
+ 'relevance_info': relevance_info
+ }
+ self.relevance_log.append(log_entry)
+
+ # Log the decision with details
+ if is_relevant:
+ logger.info(
+ f"Page is RELEVANT: {url}\n"
+ f"- Title: {title}\n"
+ f"- Matching keywords: {relevance_info['matching_keywords']}\n"
+ f"- Total matches: {relevance_info['total_matches']}"
+ )
+ else:
+ logger.info(
+ f"Page is NOT RELEVANT: {url}\n"
+ f"- Title: {title}\n"
+ f"- Checked keywords: {keywords}\n"
+ f"- No keyword matches found in {relevance_info['text_length']} characters of text"
+ )
+
+ results = []
+ if is_relevant:
+ # Extract links for further crawling
+ links = []
+ for link in soup.find_all('a', href=True):
+ href = link['href']
+ absolute_url = urljoin(url, href)
+ if self.is_valid_url(absolute_url):
+ links.append(absolute_url)
+
+ # If page is relevant, process and download any supported files
+ if self.file_handler:
+ for link in soup.find_all('a', href=True):
+ href = link['href']
+ absolute_url = urljoin(url, href)
+ if self.file_handler.is_supported_file(absolute_url):
+ downloaded_path = self.file_handler.download_file(absolute_url, source_page=url)
+ if downloaded_path:
+ logger.info(f"Downloaded file from relevant page: {absolute_url} to {downloaded_path}")
+
+ # Store the relevant page information
+ results.append({
+ 'url': url,
+ 'text': text,
+ 'title': title,
+ 'links': links,
+ 'relevance_info': relevance_info
+ })
+
+ return results
+
+ except Exception as e:
+ logger.error(f"Error crawling {url}: {str(e)}")
+ return []
+
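A quick way to exercise `crawl_page` in isolation, assuming the module-level imports (`requests`, `BeautifulSoup`, `logger`) used elsewhere in this file are configured; the URL and keyword list are placeholders, not values from the commit:

```python
# Hypothetical single-page crawl; the URL and keyword list are placeholders.
crawler = WebsiteCrawler(max_pages_per_site=5)
pages = crawler.crawl_page("https://example.com", keywords=["security", "intelligence"])

# crawl_page returns an empty list when the page is not relevant or the fetch failed
for page in pages:
    print(page["title"], page["url"], len(page["links"]))
```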
+ def save_relevance_log(self, output_dir: str):
+ """Save the relevance log to a markdown file."""
+ try:
+ log_file = os.path.join(output_dir, 'crawl_relevance_log.md')
+ with open(log_file, 'w', encoding='utf-8') as f:
+ f.write("# Web Crawling Relevance Log\n\n")
+
+ # Summary statistics
+ total_pages = len(self.relevance_log)
+ relevant_pages = sum(1 for entry in self.relevance_log if entry['relevance_info']['is_relevant'])
+
+ f.write(f"## Summary\n")
+ f.write(f"- Total pages crawled: {total_pages}\n")
+ f.write(f"- Relevant pages found: {relevant_pages}\n")
+ f.write(f"- Non-relevant pages: {total_pages - relevant_pages}\n\n")
+
+ # Relevant pages
+ f.write("## Relevant Pages\n\n")
+ for entry in self.relevance_log:
+ if entry['relevance_info']['is_relevant']:
+ f.write(f"### {entry['title']}\n")
+ f.write(f"- URL: {entry['url']}\n")
+ f.write(f"- Matching keywords: {entry['relevance_info']['matching_keywords']}\n")
+ f.write(f"- Total matches: {entry['relevance_info']['total_matches']}\n")
+ f.write(f"- Crawled at: {entry['timestamp']}\n\n")
+
+ # Non-relevant pages
+ f.write("## Non-Relevant Pages\n\n")
+ for entry in self.relevance_log:
+ if not entry['relevance_info']['is_relevant']:
+ f.write(f"### {entry['title']}\n")
+ f.write(f"- URL: {entry['url']}\n")
+ f.write(f"- Text length: {entry['relevance_info']['text_length']} characters\n")
+ f.write(f"- Crawled at: {entry['timestamp']}\n\n")
+
+ except Exception as e:
+ logger.error(f"Error saving relevance log: {str(e)}")
+
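Continuing the sketch above, the relevance log can be written alongside the other report artifacts once crawling is done. Note that `save_relevance_log` does not create the output directory itself; the path below is a placeholder:

```python
import os

output_dir = "tmp/crawl_logs"           # placeholder path
os.makedirs(output_dir, exist_ok=True)  # the method expects the directory to exist
crawler.save_relevance_log(output_dir)  # writes tmp/crawl_logs/crawl_relevance_log.md
```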
+ def is_valid_url(self, url: str) -> bool:
+ """Check that a URL is well-formed and uses an http(s) scheme."""
+ try:
+ parsed = urlparse(url)
+ return bool(parsed.netloc and parsed.scheme in {'http', 'https'})
+ except Exception:
+ return False
+
+ def extract_text_and_links(self, url: str, soup: BeautifulSoup):
+ """Extract absolute outbound links from a parsed page."""
+ links = []
+ for link in soup.find_all('a', href=True):
+ href = link['href']
+ absolute_url = urljoin(url, href)
+ links.append(absolute_url)
+ return links
+
+ def crawl_website(self, base_url: str, keywords: List[str]) -> List[dict]:
+ """Crawl a website starting from the base URL."""
+ to_visit = {base_url}
+ results = []
+ visited_count = 0
+
+ while to_visit and visited_count < self.max_pages_per_site:
+ url = to_visit.pop()
+ page_results = self.crawl_page(url, keywords)
+ results.extend(page_results)
+
+ # Fetch the page again to collect outbound links; guard the request so a
+ # single failure does not abort the whole site crawl
+ try:
+ response = requests.get(url, timeout=10)
+ response.raise_for_status()
+ links = self.extract_text_and_links(url, BeautifulSoup(response.text, 'html.parser'))
+ except Exception as e:
+ logger.warning(f"Could not extract links from {url}: {str(e)}")
+ links = []
+
+ # Add new links to visit, restricted to the same domain as the base URL
+ domain = urlparse(base_url).netloc
+ new_links = {link for link in links
+ if urlparse(link).netloc == domain
+ and link not in self.visited_urls}
+ to_visit.update(new_links)
+ visited_count += 1
+
+ return results
+
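`crawl_website` only queues links on the same domain as `base_url` and stops after `max_pages_per_site` fetches, so a single large site cannot dominate a run. A hypothetical invocation, with placeholder domain and keywords:

```python
# Hypothetical bounded crawl of one site; domain and keywords are placeholders.
crawler = WebsiteCrawler(max_pages_per_site=10)
relevant_pages = crawler.crawl_website("https://www.example.org", keywords=["cyber", "threat"])
print(f"{len(relevant_pages)} relevant pages out of {len(crawler.visited_urls)} visited")
```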
+ def crawl_all_websites(self, websites: List[str], keywords: List[str]) -> List[dict]:
+ """Crawl multiple websites in parallel."""
+ all_results = []
+
+ if isinstance(websites, str):
+ # Remove the brackets, double quotes and spaces, then split by comma
+ websites = websites.strip('[]').replace('"', '').replace(" ","").split(',')
+ # Strip any remaining single quotes around each URL
+ websites = [url.strip("'") for url in websites]
+
+ with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
+ future_to_url = {
+ executor.submit(self.crawl_website, url, keywords): url
+ for url in websites
+ }
+
+ for future in concurrent.futures.as_completed(future_to_url):
+ url = future_to_url[future]
+ try:
+ results = future.result()
+ all_results.extend(results)
+ logger.info(f"Completed crawling {url}, found {len(results)} relevant pages")
+ except Exception as e:
+ logger.error(f"Failed to crawl {url}: {str(e)}")
+
+ return all_results
+
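The string branch above unpicks a stringified list by stripping brackets and quotes. When the input is actually JSON-encoded, `json.loads` with a fallback to the manual split would be a slightly more forgiving variant; this is a sketch of an alternative, not what the commit implements:

```python
import json

def parse_website_list(websites):
    """Accept a real list, a JSON-encoded list, or a loosely quoted comma-separated string."""
    if not isinstance(websites, str):
        return list(websites)
    try:
        parsed = json.loads(websites)
        if isinstance(parsed, list):
            return [str(url).strip() for url in parsed]
    except json.JSONDecodeError:
        pass
    # Fall back to the same bracket/quote stripping used above
    return [url.strip(" '\"") for url in websites.strip("[]").split(",") if url.strip()]
```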
+ # Create the workflow
+ searcher = Agent(
+ model=get_hf_model('searcher'),
+ tools=[DuckDuckGo(fixed_max_results=DUCK_DUCK_GO_FIXED_MAX_RESULTS)],
+ instructions=[
+ "Given a topic, search for 20 articles and return the 15 most relevant articles.",
+ "For each article, provide:",
+ "- title: The article title",
+ "- url: The article URL",
+ "- description: A brief description or summary",
+ "Return the results in a structured format with these exact field names."
+ ],
+ response_model=SearchResults,
+ structured_outputs=True
+ )
+
+ backup_searcher = Agent(
+ model=get_hf_model('searcher'),
+ tools=[GoogleSearch()],
+ instructions=[
+ "Given a topic, search for 20 articles and return the 15 most relevant articles.",
+ "For each article, provide:",
+ "- title: The article title",
+ "- url: The article URL",
+ "- description: A brief description or summary",
+ "Return the results in a structured format with these exact field names."
+ ],
+ response_model=SearchResults,
+ structured_outputs=True
+ )
+
+ writer = Agent(
+ model=get_hf_model('writer'),
+ instructions=[
+ "You are a professional research analyst tasked with creating a comprehensive report on the given topic.",
+ "The sources provided include both general web search results and specialized intelligence/security websites.",
+ "Carefully analyze and cross-reference information from all sources to create a detailed report.",
+ "",
+ "Report Structure:",
+ "1. Executive Summary (2-3 paragraphs)",
+ " - Provide a clear, concise overview of the main findings",
+ " - Address the research question directly",
+ " - Highlight key discoveries and implications",
+ "",
+ "2. Detailed Analysis (Multiple sections)",
+ " - Break down the topic into relevant themes or aspects",
+ " - For each theme:",
+ " * Present detailed findings from multiple sources",
+ " * Cross-reference information between general and specialized sources",
+ " * Analyze trends, patterns, and developments",
+ " * Discuss implications and potential impacts",
+ "",
+ "3. Source Analysis and Credibility",
+ " For each major source:",
+ " - Evaluate source credibility and expertise",
+ " - Note if from specialized intelligence/security website",
+ " - Assess potential biases or limitations",
+ " - Key findings and unique contributions",
+ "",
+ "4. Key Takeaways and Strategic Implications",
+ " - Synthesize findings from all sources",
+ " - Compare/contrast general media vs specialized analysis",
+ " - Discuss broader geopolitical implications",
+ " - Address potential future developments",
+ "",
+ "5. References",
+ " - Group sources by type (specialized websites vs general media)",
+ " - List all sources with full citations",
+ " - Include URLs as clickable markdown links [Title](URL)",
+ " - Ensure every major claim has at least one linked source",
+ "",
+ "Important Guidelines:",
+ "- Prioritize information from specialized intelligence/security sources",
+ "- Cross-validate claims between multiple sources when possible",
+ "- Maintain a professional, analytical tone",
+ "- Support all claims with evidence",
+ "- Include specific examples and data points",
+ "- Use direct quotes for significant statements",
+ "- Address potential biases in reporting",
+ "- Ensure the report directly answers the research question",
+ "",
+ "Format the report with clear markdown headings (# ## ###), subheadings, and paragraphs.",
+ "Each major section should contain multiple paragraphs with detailed analysis."
+ ],
+ structured_outputs=True
+ )
+
+ generate_blog_post = BlogPostGenerator(
+ session_id=f"generate-blog-post-on-{topic}",
+ searcher=searcher,
+ backup_searcher=backup_searcher,
+ writer=writer,
+ file_handler=None, # No file handler configured by default
+ storage=SqlWorkflowStorage(
+ table_name="generate_blog_post_workflows",
+ db_file="tmp/workflows.db",
+ ),
+ )
+
+ # Run workflow
+ blog_post: Iterator[RunResponse] = generate_blog_post.run(topic=topic, use_cache=False)
+
+ # Print the response
+ pprint_run_response(blog_post, markdown=True)
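The construction above relies on a `topic` variable defined earlier in the file (outside this hunk). If the same wiring needed to be reused for arbitrary topics, it could be wrapped in a small factory like the sketch below; the function name and signature are illustrative assumptions, not part of this commit:

```python
# Hypothetical wrapper around the construction above; assumes the searcher,
# backup_searcher and writer agents defined earlier in this module.
def generate_report(topic: str, use_cache: bool = False):
    """Build a BlogPostGenerator for the given topic and return its response stream."""
    workflow = BlogPostGenerator(
        session_id=f"generate-blog-post-on-{topic}",
        searcher=searcher,
        backup_searcher=backup_searcher,
        writer=writer,
        file_handler=None,
        storage=SqlWorkflowStorage(
            table_name="generate_blog_post_workflows",
            db_file="tmp/workflows.db",
        ),
    )
    return workflow.run(topic=topic, use_cache=use_cache)


# Example usage:
# pprint_run_response(generate_report("example research topic"), markdown=True)
```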