AdityaAdaki committed on
Commit
4ac113f
·
1 Parent(s): 9d37152

ui updates and readme add

Files changed (4)
  1. README.md +129 -12
  2. app.py +57 -88
  3. f1_scraper.py +294 -0
  4. image.png +0 -0
README.md CHANGED
@@ -1,12 +1,129 @@
- ---
- title: F1 Ai
- emoji: 🏃
- colorFrom: green
- colorTo: indigo
- sdk: streamlit
- sdk_version: 1.43.2
- app_file: app.py
- pinned: false
- ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # F1-AI: Formula 1 RAG Application
+
+ F1-AI is a Retrieval-Augmented Generation (RAG) application specifically designed for Formula 1 information. It features an intelligent web scraper that automatically discovers and extracts Formula 1-related content from the web, stores it in a vector database, and enables natural language querying of the stored information.
+
+ ## Features
+
+ ![Example](image.png)
+
+ - Web scraping of Formula 1 content with automatic content extraction
+ - Vector database storage using Pinecone for efficient similarity search
+ - OpenRouter integration with the Mistral-7B-Instruct model for advanced LLM capabilities
+ - HuggingFace embeddings for improved semantic understanding
+ - RAG-powered question answering with contextual understanding and source citations
+ - Command-line interface for automation and scripting
+ - User-friendly Streamlit web interface with chat history
+ - Asynchronous data ingestion and processing for improved performance
+
+ ## Architecture
+
+ F1-AI is built on a modern tech stack:
+
+ - **LangChain**: Orchestrates the RAG pipeline and manages interactions between components
+ - **Pinecone**: Vector database for storing and retrieving embeddings
+ - **OpenRouter**: Primary LLM provider, serving the Mistral-7B-Instruct model
+ - **HuggingFace**: Provides the all-MiniLM-L6-v2 embeddings model
+ - **Playwright**: Handles web scraping with JavaScript support
+ - **BeautifulSoup4**: Processes HTML content and extracts relevant information
+ - **Streamlit**: Provides an interactive web interface with chat functionality
+
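+ As a rough illustration of how these pieces can fit together (the package imports, index name, and OpenRouter model slug below are assumptions for the sketch, not taken from this repository; the project's actual pipeline lives in `f1_ai.py` and `llm_manager.py`):
+
+ ```python
+ # Illustrative sketch only -- see f1_ai.py / llm_manager.py for the real implementation.
+ import os
+ from langchain_huggingface import HuggingFaceEmbeddings   # assumed packaging
+ from langchain_pinecone import PineconeVectorStore
+ from langchain_openai import ChatOpenAI  # OpenRouter exposes an OpenAI-compatible API
+
+ embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
+ vectorstore = PineconeVectorStore(index_name="f1-ai", embedding=embeddings)  # index name assumed
+ retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
+
+ llm = ChatOpenAI(
+     base_url="https://openrouter.ai/api/v1",
+     api_key=os.environ["OPENROUTER_API_KEY"],
+     model="mistralai/mistral-7b-instruct",  # model slug assumed
+ )
+
+ question = "Who won the 2023 F1 World Championship?"
+ context = "\n\n".join(doc.page_content for doc in retriever.invoke(question))
+ answer = llm.invoke(f"Answer from this context only:\n{context}\n\nQuestion: {question}")
+ print(answer.content)
+ ```
+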
+ ## Prerequisites
+
+ - Python 3.8 or higher
+ - OpenRouter API key (set as the OPENROUTER_API_KEY environment variable)
+ - Pinecone API key (set as the PINECONE_API_KEY environment variable)
+ - 8 GB RAM minimum (16 GB recommended)
+ - Internet connection for web scraping
+
+ ## Installation
+
+ 1. Clone the repository:
+ ```bash
+ git clone <repository-url>
+ cd f1-ai
+ ```
+
+ 2. Install the required dependencies:
+ ```bash
+ pip install -r requirements.txt
+ ```
+
+ 3. Install Playwright browsers:
+ ```bash
+ playwright install chromium
+ ```
+
+ 4. Set up environment variables by creating a `.env` file with:
+ ```
+ OPENROUTER_API_KEY=your_api_key_here   # Required for LLM functionality
+ PINECONE_API_KEY=your_api_key_here     # Required for vector storage
+ ```
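+
+ `f1_scraper.py` loads these variables with `python-dotenv` (`load_dotenv()`); a quick sanity check along those lines (a sketch, not part of the repository):
+
+ ```python
+ import os
+ from dotenv import load_dotenv
+
+ load_dotenv()  # reads the .env file from the current directory
+ for key in ("OPENROUTER_API_KEY", "PINECONE_API_KEY"):
+     print(key, "is set" if os.getenv(key) else "is MISSING")
+ ```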
62
+
63
+ ## Usage
64
+
65
+ ### Command Line Interface
66
+
67
+ 1. Scrape and ingest F1 content:
68
+ ```bash
69
+ python f1_scraper.py --start-urls https://www.formula1.com/ --max-pages 100 --depth 2 --ingest
70
+ ```
71
+ Options:
72
+ - `--start-urls`: Space-separated list of URLs to start crawling from
73
+ - `--max-pages`: Maximum number of pages to crawl (default: 100)
74
+ - `--depth`: Maximum crawl depth (default: 2)
75
+ - `--ingest`: Flag to ingest discovered content into RAG system
76
+ - `--max-chunks`: Maximum chunks per URL for ingestion (default: 50)
77
+ - `--llm-provider`: Choose LLM provider (openrouter)
78
+
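+ The crawler can also be driven from Python rather than the CLI; a minimal sketch using the `F1Scraper` class defined in `f1_scraper.py`:
+
+ ```python
+ import asyncio
+ from f1_scraper import F1Scraper
+
+ async def run():
+     scraper = F1Scraper(max_pages=50, depth=2)                    # crawl budget
+     urls = await scraper.crawl(["https://www.formula1.com/"])     # discover F1-related pages
+     scraper.save_urls_to_file("f1_urls.txt")                      # persist the discovered URLs
+     await scraper.ingest_discovered_urls(max_chunks_per_url=50)   # push them into the RAG index
+     print(f"Ingested {len(urls)} F1-related pages")
+
+ asyncio.run(run())
+ ```
+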
+ 2. Ask questions about Formula 1:
+ ```bash
+ python f1_ai.py ask "Who won the 2023 F1 World Championship?"
+ ```
+
+ ### Streamlit Interface
+
+ Run the Streamlit app:
+ ```bash
+ streamlit run app.py
+ ```
+
+ This will open a web interface where you can:
+ - Ask questions about Formula 1
+ - View responses in a chat-like interface
+ - See source citations for answers
+ - Track conversation history
+ - Get real-time updates on response generation
+
+ ## Project Structure
+
+ - `f1_scraper.py`: Intelligent web crawler implementation
+   - Automatically discovers F1-related content using keyword scoring
+   - Handles content relevance detection with priority paths
+   - Manages crawling depth and limits
+   - Implements domain-specific filtering
+ - `f1_ai.py`: Core RAG application implementation
+   - Handles data ingestion and chunking
+   - Manages vector database operations
+   - Implements question-answering logic with source tracking
+   - Provides robust error handling
+ - `llm_manager.py`: LLM provider management
+   - Integrates with OpenRouter for advanced LLM capabilities
+   - Manages HuggingFace embeddings generation
+   - Implements rate limiting and error recovery
+   - Handles async API interactions
+ - `app.py`: Streamlit web interface
+   - Provides chat-based UI with message history
+   - Manages conversation state
+   - Handles async operations with progress tracking
+   - Implements error handling and user feedback
+
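+ For reference, `app.py` consumes the core RAG class roughly like this (a trimmed sketch of the call pattern; the `answer`/`sources` response shape is the one `app.py` renders):
+
+ ```python
+ import asyncio
+ from f1_ai import F1AI
+
+ f1_ai = F1AI(llm_provider="openrouter")
+ response = asyncio.run(f1_ai.ask_question("Who won the 2023 F1 World Championship?"))
+
+ print(response["answer"])
+ for source in response["sources"]:   # each source records the page the context was retrieved from
+     print("-", source["url"])
+ ```
+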
+ ## Contributing
+
+ Contributions are welcome! Please follow these steps:
+
+ 1. Fork the repository
+ 2. Create a feature branch
+ 3. Commit your changes
+ 4. Push to the branch
+ 5. Submit a Pull Request
app.py CHANGED
@@ -24,98 +24,67 @@ st.markdown("""
  This application uses Retrieval-Augmented Generation (RAG) to answer questions about Formula 1.
  """)

- # Add tabs
- tab1, tab2 = st.tabs(["Chat", "Add Content"])
-
- with tab1:
-     # Custom CSS for better styling
-     st.markdown("""
-     <style>
-     .stChatMessage {
-         padding: 1rem;
-         border-radius: 0.5rem;
-         margin-bottom: 1rem;
-         box-shadow: 0 2px 4px rgba(0,0,0,0.1);
-     }
-     .stChatMessage.user {
-         background-color: #f0f2f6;
-     }
-     .stChatMessage.assistant {
-         background-color: #ffffff;
-     }
-     .source-link {
-         font-size: 0.8rem;
-         color: #666;
-         text-decoration: none;
-     }
-     </style>
-     """, unsafe_allow_html=True)
-
-     # Display chat history with enhanced formatting
-     for message in st.session_state.chat_history:
-         with st.chat_message(message["role"]):
-             if message["role"] == "assistant" and isinstance(message["content"], dict):
-                 st.markdown(message["content"]["answer"])
-                 if message["content"]["sources"]:
-                     st.markdown("---")
-                     st.markdown("**Sources:**")
-                     for source in message["content"]["sources"]:
-                         st.markdown(f"- [{source['url']}]({source['url']})")
-             else:
-                 st.markdown(message["content"])

-     # Question input
-     if question := st.chat_input("Ask a question about Formula 1"):
-         # Add user question to chat history
-         st.session_state.chat_history.append({"role": "user", "content": question})
-
-         # Display user question
-         with st.chat_message("user"):
-             st.write(question)
-
-         # Generate and display response with enhanced formatting
-         with st.chat_message("assistant"):
-             with st.spinner("🤔 Analyzing Formula 1 knowledge..."):
-                 response = asyncio.run(st.session_state.f1_ai.ask_question(question))
-                 st.markdown(response["answer"])
-
-                 # Display sources if available
-                 if response["sources"]:
-                     st.markdown("---")
-                     st.markdown("**Sources:**")
-                     for source in response["sources"]:
-                         st.markdown(f"- [{source['url']}]({source['url']})")
-
-                 # Add assistant response to chat history
-                 st.session_state.chat_history.append({"role": "assistant", "content": response})

- with tab2:
-     st.header("Add Content to Knowledge Base")
-
-     urls_input = st.text_area("Enter URLs (one per line)",
-                               placeholder="https://en.wikipedia.org/wiki/Formula_One\nhttps://www.formula1.com/en/latest/article....")

-     max_chunks = st.slider("Maximum chunks per URL", min_value=10, max_value=500, value=100, step=10)

-     if st.button("Ingest Data"):
-         if urls_input:
-             urls = [url.strip() for url in urls_input.split("\n") if url.strip()]
-             if urls:
-                 with st.spinner(f"Ingesting data from {len(urls)} URLs... This may take several minutes."):
-                     progress_bar = st.progress(0)
-
-                     # Process URLs one by one for better UI feedback
-                     for i, url in enumerate(urls):
-                         st.write(f"Processing: {url}")
-                         asyncio.run(st.session_state.f1_ai.ingest([url], max_chunks_per_url=max_chunks))
-                         progress_bar.progress((i + 1) / len(urls))
-
-                     st.success("✅ Data ingestion complete!")
-             else:
-                 st.error("Please enter at least one valid URL.")
-         else:
-             st.error("Please enter at least one URL to ingest.")

  # Add a footer with credits
  st.markdown("---")
- st.markdown("F1-AI: A Formula 1 RAG Application • Powered by Hugging Face, Pinecone, and LangChain")
  This application uses Retrieval-Augmented Generation (RAG) to answer questions about Formula 1.
  """)

+ # Custom CSS for better styling
+ st.markdown("""
+ <style>
+ .stChatMessage {
+     padding: 1rem;
+     border-radius: 0.5rem;
+     margin-bottom: 1rem;
+     box-shadow: 0 2px 4px rgba(0,0,0,0.1);
+ }
+ .stChatMessage.user {
+     background-color: #f0f2f6;
+ }
+ .stChatMessage.assistant {
+     background-color: #ffffff;
+ }
+ .source-link {
+     font-size: 0.8rem;
+     color: #666;
+     text-decoration: none;
+ }
+ </style>
+ """, unsafe_allow_html=True)

+ # Display chat history with enhanced formatting
+ for message in st.session_state.chat_history:
+     with st.chat_message(message["role"]):
+         if message["role"] == "assistant" and isinstance(message["content"], dict):
+             st.markdown(message["content"]["answer"])
+             if message["content"]["sources"]:
+                 st.markdown("---")
+                 st.markdown("**Sources:**")
+                 for source in message["content"]["sources"]:
+                     st.markdown(f"- [{source['url']}]({source['url']})")
+         else:
+             st.markdown(message["content"])

+ # Question input
+ if question := st.chat_input("Ask a question about Formula 1"):
+     # Add user question to chat history
+     st.session_state.chat_history.append({"role": "user", "content": question})

+     # Display user question
+     with st.chat_message("user"):
+         st.write(question)

+     # Generate and display response with enhanced formatting
+     with st.chat_message("assistant"):
+         with st.spinner("🤔 Analyzing Formula 1 knowledge..."):
+             response = asyncio.run(st.session_state.f1_ai.ask_question(question))
+             st.markdown(response["answer"])
+
+             # Display sources if available
+             if response["sources"]:
+                 st.markdown("---")
+                 st.markdown("**Sources:**")
+                 for source in response["sources"]:
+                     st.markdown(f"- [{source['url']}]({source['url']})")
+
+             # Add assistant response to chat history
+             st.session_state.chat_history.append({"role": "assistant", "content": response})

  # Add a footer with credits
  st.markdown("---")
+ st.markdown("F1-AI: A Formula 1 RAG Application")
f1_scraper.py ADDED
@@ -0,0 +1,294 @@
+ import os
+ import asyncio
+ import argparse
+ import logging
+ from datetime import datetime
+ from urllib.parse import urlparse, urljoin
+ from typing import List, Dict, Set, Optional, Any
+ from rich.console import Console
+ from rich.progress import Progress
+ from playwright.async_api import async_playwright, TimeoutError
+ from bs4 import BeautifulSoup
+ from dotenv import load_dotenv
+
+ # Import our custom F1AI class
+ from f1_ai import F1AI
+
+ # Configure logging
+ logging.basicConfig(level=logging.INFO)
+ logger = logging.getLogger(__name__)
+ console = Console()
+
+ # Load environment variables
+ load_dotenv()
+
+ class F1Scraper:
+     def __init__(self, max_pages: int = 100, depth: int = 2, f1_ai: Optional[F1AI] = None):
+         """
+         Initialize the F1 web scraper.
+
+         Args:
+             max_pages (int): Maximum number of pages to scrape
+             depth (int): Maximum depth for crawling
+             f1_ai (F1AI): Optional F1AI instance to use for ingestion
+         """
+         self.max_pages = max_pages
+         self.depth = depth
+         self.visited_urls: Set[str] = set()
+         self.f1_urls: List[str] = []
+         self.f1_ai = f1_ai if f1_ai else F1AI(llm_provider="openrouter")
+
+         # Define F1-related keywords to identify relevant pages
+         self.f1_keywords = [
+             "formula 1", "formula one", "f1", "grand prix", "gp", "race", "racing",
+             "driver", "team", "championship", "qualifying", "podium", "ferrari",
+             "mercedes", "red bull", "mclaren", "williams", "alpine", "aston martin",
+             "haas", "alfa romeo", "alphatauri", "fia", "pirelli", "drs", "pit stop",
+             "verstappen", "hamilton", "leclerc", "sainz", "norris", "perez",
+             "russell", "alonso", "track", "circuit", "lap", "pole position"
+         ]
+
+         # Core F1 websites to target
+         self.f1_core_sites = [
+             "formula1.com",
+             "autosport.com",
+             "motorsport.com",
+             "f1i.com",
+             "racefans.net",
+             "crash.net/f1",
+             "espn.com/f1",
+             "bbc.com/sport/formula1",
+             "skysports.com/f1"
+         ]
+
+     def is_f1_related(self, url: str, content: Optional[str] = None) -> bool:
+         """Determine if a URL and its content are F1-related."""
+         # Check if URL is from a core F1 site
+         parsed_url = urlparse(url)
+         domain = parsed_url.netloc
+
+         for core_site in self.f1_core_sites:
+             if core_site in domain:
+                 return True
+
+         # High-priority paths that are definitely F1-related
+         priority_paths = [
+             "/racing/", "/drivers/", "/teams/", "/results/",
+             "/grands-prix/", "/championship/", "/races/",
+             "/season/", "/standings/", "/stats/", "/calendar/",
+             "/schedule/"
+         ]
+
+         # Skip these paths even if they contain F1-related terms
+         skip_paths = [
+             "/privacy/", "/terms/", "/legal/", "/contact/",
+             "/cookie/", "/account/", "/login/", "/register/",
+             "/admin/", "/about/", "/careers/", "/press/",
+             "/media-centre/", "/corporate/", "/investors/",
+             "/f1store", "f1authentics", "/articles/", "/news/",
+             "/blog/", "/videos/", "/photos/", "/gallery/", "/photoshoot/"
+         ]
+
+         url_lower = url.lower()
+
+         # Check if URL is in skip paths
+         if any(path in url_lower for path in skip_paths):
+             return False
+
+         # Priority paths are always considered F1-related
+         if any(path in url_lower for path in priority_paths):
+             return True
+
+         # Check URL path for F1 keywords
+         url_path = parsed_url.path.lower()
+         for keyword in self.f1_keywords:
+             if keyword in url_path:
+                 return True
+
+         # If content provided, check for F1 keywords
+         if content:
+             content_lower = content.lower()
+             # Count keyword occurrences to determine relevance
+             keyword_count = sum(1 for keyword in self.f1_keywords if keyword in content_lower)
+             # If many keywords are found, it's likely F1-related
+             if keyword_count >= 3:
+                 return True
+
+         return False
+
+     async def extract_links(self, url: str) -> List[str]:
+         """Extract links from a webpage."""
+         links = []
+         try:
+             async with async_playwright() as p:
+                 browser = await p.chromium.launch()
+                 page = await browser.new_page()
+
+                 try:
+                     await page.goto(url, timeout=30000)
+                     html_content = await page.content()
+                     soup = BeautifulSoup(html_content, 'html.parser')
+
+                     # Get base domain for domain restriction
+                     parsed_url = urlparse(url)
+                     base_domain = parsed_url.netloc
+
+                     # Find all links
+                     for a_tag in soup.find_all('a', href=True):
+                         href = a_tag['href']
+                         # Convert relative URLs to absolute
+                         if href.startswith('/'):
+                             href = urljoin(url, href)
+
+                         # Skip non-http(s) URLs
+                         if not href.startswith(('http://', 'https://')):
+                             continue
+
+                         # Only include links from formula1.com if it's the default start URL
+                         if base_domain == 'www.formula1.com':
+                             parsed_href = urlparse(href)
+                             if parsed_href.netloc != 'www.formula1.com':
+                                 continue
+
+                         links.append(href)
+
+                     # Check if content is F1 related before returning
+                     text_content = soup.get_text(separator=' ', strip=True)
+                     if self.is_f1_related(url, text_content):
+                         self.f1_urls.append(url)
+                         logger.info(f"✅ F1-related content found: {url}")
+
+                 except TimeoutError:
+                     logger.error(f"Timeout while loading {url}")
+                 finally:
+                     await browser.close()
+
+                 return links
+         except Exception as e:
+             logger.error(f"Error extracting links from {url}: {str(e)}")
+             return []
+
+     async def crawl(self, start_urls: List[str]) -> List[str]:
+         """
+         Crawl F1-related websites starting from the provided URLs.
+
+         Args:
+             start_urls (List[str]): Starting URLs for crawling
+
+         Returns:
+             List[str]: List of discovered F1-related URLs
+         """
+         to_visit = start_urls.copy()
+         current_depth = 0
+
+         with Progress() as progress:
+             task = progress.add_task("[green]Crawling F1 websites...", total=self.max_pages)
+
+             while to_visit and len(self.visited_urls) < self.max_pages and current_depth <= self.depth:
+                 current_depth += 1
+                 next_level = []
+
+                 for url in to_visit:
+                     if url in self.visited_urls:
+                         continue
+
+                     self.visited_urls.add(url)
+                     progress.update(task, advance=1, description=f"[green]Crawling: {url[:50]}...")
+
+                     links = await self.extract_links(url)
+                     next_level.extend([link for link in links if link not in self.visited_urls])
+
+                     # Update progress
+                     progress.update(task, completed=len(self.visited_urls), total=self.max_pages)
+                     if len(self.visited_urls) >= self.max_pages:
+                         break
+
+                 to_visit = next_level
+                 logger.info(f"Completed depth {current_depth}, discovered {len(self.f1_urls)} F1-related URLs")
+
+         # Deduplicate and return results
+         self.f1_urls = list(set(self.f1_urls))
+         return self.f1_urls
+
+     async def ingest_discovered_urls(self, max_chunks_per_url: int = 50) -> None:
+         """
+         Ingest discovered F1-related URLs into the RAG system.
+
+         Args:
+             max_chunks_per_url (int): Maximum chunks to extract per URL
+         """
+         if not self.f1_urls:
+             logger.warning("No F1-related URLs to ingest. Run crawl() first.")
+             return
+
+         logger.info(f"Ingesting {len(self.f1_urls)} F1-related URLs into RAG system...")
+         await self.f1_ai.ingest(self.f1_urls, max_chunks_per_url=max_chunks_per_url)
+         logger.info("✅ Ingestion complete!")
+
+     def save_urls_to_file(self, filename: str = "f1_urls.txt") -> None:
+         """
+         Save discovered F1 URLs to a text file.
+
+         Args:
+             filename (str): Name of the output file
+         """
+         if not self.f1_urls:
+             logger.warning("No F1-related URLs to save. Run crawl() first.")
+             return
+
+         with open(filename, "w") as f:
+             f.write(f"# F1-related URLs discovered on {datetime.now().isoformat()}\n")
+             f.write(f"# Total URLs: {len(self.f1_urls)}\n\n")
+             for url in self.f1_urls:
+                 f.write(f"{url}\n")
+
+         logger.info(f"✅ Saved {len(self.f1_urls)} URLs to {filename}")
+
+ async def main():
+     """Main function to run the F1 scraper."""
+     parser = argparse.ArgumentParser(description="F1 Web Scraper to discover and ingest F1-related content")
+     parser.add_argument("--start-urls", nargs="+", default=["https://www.formula1.com/"],
+                         help="Starting URLs for crawling")
+     parser.add_argument("--max-pages", type=int, default=100,
+                         help="Maximum number of pages to crawl")
+     parser.add_argument("--depth", type=int, default=2,
+                         help="Maximum crawl depth")
+     parser.add_argument("--ingest", action="store_true",
+                         help="Ingest discovered URLs into RAG system")
+     parser.add_argument("--max-chunks", type=int, default=50,
+                         help="Maximum chunks per URL for ingestion")
+     parser.add_argument("--output", type=str, default="f1_urls.txt",
+                         help="Output file for discovered URLs")
+     parser.add_argument("--llm-provider", choices=["ollama", "openrouter"], default="openrouter",
+                         help="Provider for LLM (default: openrouter)")
+
+     args = parser.parse_args()
+
+     # Initialize F1AI if needed
+     f1_ai = None
+     if args.ingest:
+         f1_ai = F1AI(llm_provider=args.llm_provider)
+
+     # Initialize and run the scraper
+     scraper = F1Scraper(
+         max_pages=args.max_pages,
+         depth=args.depth,
+         f1_ai=f1_ai
+     )
+
+     # Crawl to discover F1-related URLs
+     console.print("[bold blue]Starting F1 web crawler[/bold blue]")
+     discovered_urls = await scraper.crawl(args.start_urls)
+     console.print(f"[bold green]Discovered {len(discovered_urls)} F1-related URLs[/bold green]")
+
+     # Save URLs to file
+     scraper.save_urls_to_file(args.output)
+
+     # Ingest if requested
+     if args.ingest:
+         console.print("[bold yellow]Starting ingestion into RAG system...[/bold yellow]")
+         await scraper.ingest_discovered_urls(max_chunks_per_url=args.max_chunks)
+         console.print("[bold green]Ingestion complete![/bold green]")
+
+ if __name__ == "__main__":
+     asyncio.run(main())
image.png ADDED