AdityaAdaki committed · Commit 4ac113f
Parent(s): 9d37152
ui updates and readme add
README.md
CHANGED
@@ -1,12 +1,129 @@

# F1-AI: Formula 1 RAG Application

F1-AI is a Retrieval-Augmented Generation (RAG) application specifically designed for Formula 1 information. It features an intelligent web scraper that automatically discovers and extracts Formula 1-related content from the web, stores it in a vector database, and enables natural language querying of the stored information.

## Features



- Web scraping of Formula 1 content with automatic content extraction
- Vector database storage using Pinecone for efficient similarity search
- OpenRouter integration with the Mistral-7B-Instruct model for advanced LLM capabilities
- HuggingFace embeddings for improved semantic understanding
- RAG-powered question answering with contextual understanding and source citations
- Command-line interface for automation and scripting
- User-friendly Streamlit web interface with chat history
- Asynchronous data ingestion and processing for improved performance

## Architecture

F1-AI is built on a modern tech stack:

- **LangChain**: Orchestrates the RAG pipeline and manages interactions between components
- **Pinecone**: Vector database for storing and retrieving embeddings
- **OpenRouter**: Primary LLM provider with the Mistral-7B-Instruct model
- **HuggingFace**: Provides the all-MiniLM-L6-v2 embeddings model
- **Playwright**: Handles web scraping with JavaScript support
- **BeautifulSoup4**: Processes HTML content and extracts relevant information
- **Streamlit**: Provides an interactive web interface with chat functionality
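
The README does not spell out how these pieces are wired together, so the following is only an illustrative sketch, not the project's actual code: the package names, the `f1-ai` index name, and the exact LangChain APIs are assumptions.

```python
# Illustrative RAG wiring sketch -- not the project's actual implementation.
import os

from langchain_huggingface import HuggingFaceEmbeddings
from langchain_openai import ChatOpenAI
from langchain_pinecone import PineconeVectorStore

# HuggingFace sentence-transformer used for embeddings
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# Pinecone-backed vector store (index name "f1-ai" is assumed; PINECONE_API_KEY must be set)
vectorstore = PineconeVectorStore(index_name="f1-ai", embedding=embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

# OpenRouter exposes an OpenAI-compatible API, so an OpenAI-style chat client
# pointed at openrouter.ai can serve Mistral-7B-Instruct
llm = ChatOpenAI(
    model="mistralai/mistral-7b-instruct",
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

# Retrieve context for a question and answer from it
question = "Who won the 2023 F1 World Championship?"
docs = retriever.invoke(question)
context = "\n\n".join(doc.page_content for doc in docs)
answer = llm.invoke(f"Answer using only this context:\n{context}\n\nQuestion: {question}")
print(answer.content)
```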

## Prerequisites

- Python 3.8 or higher
- OpenRouter API key (set as the OPENROUTER_API_KEY environment variable)
- Pinecone API key (set as the PINECONE_API_KEY environment variable)
- 8GB RAM minimum (16GB recommended)
- Internet connection for web scraping

## Installation

1. Clone the repository:
```bash
git clone <repository-url>
cd f1-ai
```

2. Install the required dependencies:
```bash
pip install -r requirements.txt
```

3. Install Playwright browsers:
```bash
playwright install chromium
```

4. Set up environment variables. Create a `.env` file with:
```
OPENROUTER_API_KEY=your_api_key_here  # Required for LLM functionality
PINECONE_API_KEY=your_api_key_here    # Required for vector storage
```
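
The scripts read these values with `python-dotenv` at startup (`f1_scraper.py` does exactly this; the other entry points are assumed to follow the same pattern):

```python
# Load the .env values before using the LLM or vector store.
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the working directory

openrouter_key = os.getenv("OPENROUTER_API_KEY")
pinecone_key = os.getenv("PINECONE_API_KEY")
if not openrouter_key or not pinecone_key:
    raise RuntimeError("Set OPENROUTER_API_KEY and PINECONE_API_KEY in .env or the environment")
```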

## Usage

### Command Line Interface

1. Scrape and ingest F1 content:
```bash
python f1_scraper.py --start-urls https://www.formula1.com/ --max-pages 100 --depth 2 --ingest
```

Options:
- `--start-urls`: Space-separated list of URLs to start crawling from
- `--max-pages`: Maximum number of pages to crawl (default: 100)
- `--depth`: Maximum crawl depth (default: 2)
- `--ingest`: Flag to ingest discovered content into the RAG system
- `--max-chunks`: Maximum chunks per URL for ingestion (default: 50)
- `--llm-provider`: LLM provider to use (`ollama` or `openrouter`; default: `openrouter`)

2. Ask questions about Formula 1:
```bash
python f1_ai.py ask "Who won the 2023 F1 World Championship?"
```

### Streamlit Interface

Run the Streamlit app:
```bash
streamlit run app.py
```

This will open a web interface where you can:
- Ask questions about Formula 1
- View responses in a chat-like interface
- See source citations for answers
- Track conversation history
- Get real-time updates on response generation
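
Under the hood, the chat UI keeps the conversation and the RAG engine in `st.session_state` (`chat_history` and `f1_ai`). The exact bootstrap is not shown in this commit's app.py hunk, so the following is only a hypothetical sketch of what it presumably looks like:

```python
# Hypothetical session bootstrap; names match what app.py references,
# but this exact code is not shown in the commit.
import streamlit as st
from f1_ai import F1AI

if "f1_ai" not in st.session_state:
    st.session_state.f1_ai = F1AI(llm_provider="openrouter")
if "chat_history" not in st.session_state:
    st.session_state.chat_history = []
```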

## Project Structure

- `f1_scraper.py`: Intelligent web crawler implementation
  - Automatically discovers F1-related content using keyword scoring
  - Handles content relevance detection with priority paths
  - Manages crawling depth and limits
  - Implements domain-specific filtering
- `f1_ai.py`: Core RAG application implementation
  - Handles data ingestion and chunking (see the sketch after this list)
  - Manages vector database operations
  - Implements question-answering logic with source tracking
  - Provides robust error handling
- `llm_manager.py`: LLM provider management
  - Integrates with OpenRouter for advanced LLM capabilities
  - Manages HuggingFace embeddings generation
  - Implements rate limiting and error recovery (see the sketch after this list)
  - Handles async API interactions
- `app.py`: Streamlit web interface
  - Provides chat-based UI with message history
  - Manages conversation state
  - Handles async operations with progress tracking
  - Implements error handling and user feedback
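
Neither `f1_ai.py` nor `llm_manager.py` is included in this commit, so the two sketches below are hedged illustrations of the chunking and rate-limiting steps described above, not the files' actual code.

```python
# Illustrative chunking step (f1_ai.py's real logic may differ).
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)

def chunk_page(text: str, max_chunks: int = 50) -> list[str]:
    """Split scraped page text into chunks, capped per URL (cf. --max-chunks)."""
    return splitter.split_text(text)[:max_chunks]
```

For the rate limiting mentioned under `llm_manager.py`, one common pattern is a semaphore around the async LLM call (again, an assumption rather than the file's actual implementation):

```python
# Illustrative rate limiting for async OpenRouter calls.
import asyncio

_llm_semaphore = asyncio.Semaphore(2)  # concurrency cap; the value is assumed

async def ask_llm(llm, prompt: str) -> str:
    # Limit concurrent requests so OpenRouter rate limits are not exceeded.
    async with _llm_semaphore:
        result = await llm.ainvoke(prompt)  # works with LangChain chat models
        return result.content
```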

## Contributing

Contributions are welcome! Please follow these steps:

1. Fork the repository
2. Create a feature branch
3. Commit your changes
4. Push to the branch
5. Submit a Pull Request

app.py
CHANGED
@@ -24,98 +24,67 @@ st.markdown("""
    This application uses Retrieval-Augmented Generation (RAG) to answer questions about Formula 1.
""")

# Custom CSS for better styling
st.markdown("""
<style>
    .stChatMessage {
        padding: 1rem;
        border-radius: 0.5rem;
        margin-bottom: 1rem;
        box-shadow: 0 2px 4px rgba(0,0,0,0.1);
    }
    .stChatMessage.user {
        background-color: #f0f2f6;
    }
    .stChatMessage.assistant {
        background-color: #ffffff;
    }
    .source-link {
        font-size: 0.8rem;
        color: #666;
        text-decoration: none;
    }
</style>
""", unsafe_allow_html=True)

# Display chat history with enhanced formatting
for message in st.session_state.chat_history:
    with st.chat_message(message["role"]):
        if message["role"] == "assistant" and isinstance(message["content"], dict):
            st.markdown(message["content"]["answer"])
            if message["content"]["sources"]:
                st.markdown("---")
                st.markdown("**Sources:**")
                for source in message["content"]["sources"]:
                    st.markdown(f"- [{source['url']}]({source['url']})")
        else:
            st.markdown(message["content"])

# Question input
if question := st.chat_input("Ask a question about Formula 1"):
    # Add user question to chat history
    st.session_state.chat_history.append({"role": "user", "content": question})

    # Display user question
    with st.chat_message("user"):
        st.write(question)

    # Generate and display response with enhanced formatting
    with st.chat_message("assistant"):
        with st.spinner("Analyzing Formula 1 knowledge..."):
            response = asyncio.run(st.session_state.f1_ai.ask_question(question))
            st.markdown(response["answer"])

            # Display sources if available
            if response["sources"]:
                st.markdown("---")
                st.markdown("**Sources:**")
                for source in response["sources"]:
                    st.markdown(f"- [{source['url']}]({source['url']})")

            # Add assistant response to chat history
            st.session_state.chat_history.append({"role": "assistant", "content": response})

# Add a footer with credits
st.markdown("---")
st.markdown("F1-AI: A Formula 1 RAG Application")

f1_scraper.py
ADDED
@@ -0,0 +1,294 @@
import os
import asyncio
import argparse
import logging
from datetime import datetime
from urllib.parse import urlparse, urljoin
from typing import List, Dict, Set, Optional, Any
from rich.console import Console
from rich.progress import Progress
from playwright.async_api import async_playwright, TimeoutError
from bs4 import BeautifulSoup
from dotenv import load_dotenv

# Import our custom F1AI class
from f1_ai import F1AI

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
console = Console()

# Load environment variables
load_dotenv()

class F1Scraper:
    def __init__(self, max_pages: int = 100, depth: int = 2, f1_ai: Optional[F1AI] = None):
        """
        Initialize the F1 web scraper.

        Args:
            max_pages (int): Maximum number of pages to scrape
            depth (int): Maximum depth for crawling
            f1_ai (F1AI): Optional F1AI instance to use for ingestion
        """
        self.max_pages = max_pages
        self.depth = depth
        self.visited_urls: Set[str] = set()
        self.f1_urls: List[str] = []
        self.f1_ai = f1_ai if f1_ai else F1AI(llm_provider="openrouter")

        # Define F1-related keywords to identify relevant pages
        self.f1_keywords = [
            "formula 1", "formula one", "f1", "grand prix", "gp", "race", "racing",
            "driver", "team", "championship", "qualifying", "podium", "ferrari",
            "mercedes", "red bull", "mclaren", "williams", "alpine", "aston martin",
            "haas", "alfa romeo", "alphatauri", "fia", "pirelli", "drs", "pit stop",
            "verstappen", "hamilton", "leclerc", "sainz", "norris", "perez",
            "russell", "alonso", "track", "circuit", "lap", "pole position"
        ]

        # Core F1 websites to target
        self.f1_core_sites = [
            "formula1.com",
            "autosport.com",
            "motorsport.com",
            "f1i.com",
            "racefans.net",
            "crash.net/f1",
            "espn.com/f1",
            "bbc.com/sport/formula1",
            "skysports.com/f1"
        ]

    def is_f1_related(self, url: str, content: Optional[str] = None) -> bool:
        """Determine if a URL and its content are F1-related."""
        # Check if URL is from a core F1 site
        parsed_url = urlparse(url)
        domain = parsed_url.netloc

        for core_site in self.f1_core_sites:
            if core_site in domain:
                return True

        # High-priority paths that are definitely F1-related
        priority_paths = [
            "/racing/", "/drivers/", "/teams/", "/results/",
            "/grands-prix/", "/championship/", "/races/",
            "/season/", "/standings/", "/stats/", "/calendar/",
            "/schedule/"
        ]

        # Skip these paths even if they contain F1-related terms
        skip_paths = [
            "/privacy/", "/terms/", "/legal/", "/contact/",
            "/cookie/", "/account/", "/login/", "/register/",
            "/admin/", "/about/", "/careers/", "/press/",
            "/media-centre/", "/corporate/", "/investors/",
            "/f1store", "f1authentincs", "/articles/", "/news/",
            "/blog/", "/videos/", "/photos/", "/gallery/", "/photoshoot/"
        ]

        url_lower = url.lower()

        # Check if URL is in skip paths
        if any(path in url_lower for path in skip_paths):
            return False

        # Priority paths are always considered F1-related
        if any(path in url_lower for path in priority_paths):
            return True

        # Check URL path for F1 keywords
        url_path = parsed_url.path.lower()
        for keyword in self.f1_keywords:
            if keyword in url_path:
                return True

        # If content provided, check for F1 keywords
        if content:
            content_lower = content.lower()
            # Count keyword occurrences to determine relevance
            keyword_count = sum(1 for keyword in self.f1_keywords if keyword in content_lower)
            # If many keywords are found, it's likely F1-related
            if keyword_count >= 3:
                return True

        return False

    async def extract_links(self, url: str) -> List[str]:
        """Extract links from a webpage."""
        links = []
        try:
            async with async_playwright() as p:
                browser = await p.chromium.launch()
                page = await browser.new_page()

                try:
                    await page.goto(url, timeout=30000)
                    html_content = await page.content()
                    soup = BeautifulSoup(html_content, 'html.parser')

                    # Get base domain for domain restriction
                    parsed_url = urlparse(url)
                    base_domain = parsed_url.netloc

                    # Find all links
                    for a_tag in soup.find_all('a', href=True):
                        href = a_tag['href']
                        # Convert relative URLs to absolute
                        if href.startswith('/'):
                            href = urljoin(url, href)

                        # Skip non-http(s) URLs
                        if not href.startswith(('http://', 'https://')):
                            continue

                        # Only include links from formula1.com if it's the default start URL
                        if base_domain == 'www.formula1.com':
                            parsed_href = urlparse(href)
                            if parsed_href.netloc != 'www.formula1.com':
                                continue

                        links.append(href)

                    # Check if content is F1 related before returning
                    text_content = soup.get_text(separator=' ', strip=True)
                    if self.is_f1_related(url, text_content):
                        self.f1_urls.append(url)
                        logger.info(f"✅ F1-related content found: {url}")

                except TimeoutError:
                    logger.error(f"Timeout while loading {url}")
                finally:
                    await browser.close()

            return links
        except Exception as e:
            logger.error(f"Error extracting links from {url}: {str(e)}")
            return []

    async def crawl(self, start_urls: List[str]) -> List[str]:
        """
        Crawl F1-related websites starting from the provided URLs.

        Args:
            start_urls (List[str]): Starting URLs for crawling

        Returns:
            List[str]: List of discovered F1-related URLs
        """
        to_visit = start_urls.copy()
        current_depth = 0

        with Progress() as progress:
            task = progress.add_task("[green]Crawling F1 websites...", total=self.max_pages)

            while to_visit and len(self.visited_urls) < self.max_pages and current_depth <= self.depth:
                current_depth += 1
                next_level = []

                for url in to_visit:
                    if url in self.visited_urls:
                        continue

                    self.visited_urls.add(url)
                    progress.update(task, advance=1, description=f"[green]Crawling: {url[:50]}...")

                    links = await self.extract_links(url)
                    next_level.extend([link for link in links if link not in self.visited_urls])

                    # Update progress
                    progress.update(task, completed=len(self.visited_urls), total=self.max_pages)
                    if len(self.visited_urls) >= self.max_pages:
                        break

                to_visit = next_level
                logger.info(f"Completed depth {current_depth}, discovered {len(self.f1_urls)} F1-related URLs")

        # Deduplicate and return results
        self.f1_urls = list(set(self.f1_urls))
        return self.f1_urls

    async def ingest_discovered_urls(self, max_chunks_per_url: int = 50) -> None:
        """
        Ingest discovered F1-related URLs into the RAG system.

        Args:
            max_chunks_per_url (int): Maximum chunks to extract per URL
        """
        if not self.f1_urls:
            logger.warning("No F1-related URLs to ingest. Run crawl() first.")
            return

        logger.info(f"Ingesting {len(self.f1_urls)} F1-related URLs into RAG system...")
        await self.f1_ai.ingest(self.f1_urls, max_chunks_per_url=max_chunks_per_url)
        logger.info("✅ Ingestion complete!")

    def save_urls_to_file(self, filename: str = "f1_urls.txt") -> None:
        """
        Save discovered F1 URLs to a text file.

        Args:
            filename (str): Name of the output file
        """
        if not self.f1_urls:
            logger.warning("No F1-related URLs to save. Run crawl() first.")
            return

        with open(filename, "w") as f:
            f.write(f"# F1-related URLs discovered on {datetime.now().isoformat()}\n")
            f.write(f"# Total URLs: {len(self.f1_urls)}\n\n")
            for url in self.f1_urls:
                f.write(f"{url}\n")

        logger.info(f"✅ Saved {len(self.f1_urls)} URLs to {filename}")

async def main():
    """Main function to run the F1 scraper."""
    parser = argparse.ArgumentParser(description="F1 Web Scraper to discover and ingest F1-related content")
    parser.add_argument("--start-urls", nargs="+", default=["https://www.formula1.com/"],
                        help="Starting URLs for crawling")
    parser.add_argument("--max-pages", type=int, default=100,
                        help="Maximum number of pages to crawl")
    parser.add_argument("--depth", type=int, default=2,
                        help="Maximum crawl depth")
    parser.add_argument("--ingest", action="store_true",
                        help="Ingest discovered URLs into RAG system")
    parser.add_argument("--max-chunks", type=int, default=50,
                        help="Maximum chunks per URL for ingestion")
    parser.add_argument("--output", type=str, default="f1_urls.txt",
                        help="Output file for discovered URLs")
    parser.add_argument("--llm-provider", choices=["ollama", "openrouter"], default="openrouter",
                        help="Provider for LLM (default: openrouter)")

    args = parser.parse_args()

    # Initialize F1AI if needed
    f1_ai = None
    if args.ingest:
        f1_ai = F1AI(llm_provider=args.llm_provider)

    # Initialize and run the scraper
    scraper = F1Scraper(
        max_pages=args.max_pages,
        depth=args.depth,
        f1_ai=f1_ai
    )

    # Crawl to discover F1-related URLs
    console.print("[bold blue]Starting F1 web crawler[/bold blue]")
    discovered_urls = await scraper.crawl(args.start_urls)
    console.print(f"[bold green]Discovered {len(discovered_urls)} F1-related URLs[/bold green]")

    # Save URLs to file
    scraper.save_urls_to_file(args.output)

    # Ingest if requested
    if args.ingest:
        console.print("[bold yellow]Starting ingestion into RAG system...[/bold yellow]")
        await scraper.ingest_discovered_urls(max_chunks_per_url=args.max_chunks)
        console.print("[bold green]Ingestion complete![/bold green]")

if __name__ == "__main__":
    asyncio.run(main())

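For reference, the class above can also be driven programmatically rather than through the CLI. A minimal example, assuming the same environment variables are set:

```python
# Minimal programmatic use of F1Scraper, mirroring the CLI defaults above.
import asyncio
from f1_scraper import F1Scraper

async def run() -> None:
    scraper = F1Scraper(max_pages=20, depth=1)
    urls = await scraper.crawl(["https://www.formula1.com/"])
    print(f"Discovered {len(urls)} F1-related URLs")
    scraper.save_urls_to_file("f1_urls.txt")

if __name__ == "__main__":
    asyncio.run(run())
```
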
image.png
ADDED