---
title: F1-AI
emoji: 🏎️
colorFrom: red
colorTo: gray
sdk: streamlit
sdk_version: "1.27.2"
app_file: app.py
pinned: false
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

# F1-AI: Formula 1 RAG Application

F1-AI is a Retrieval-Augmented Generation (RAG) application for Formula 1 information. Its web scraper discovers and extracts Formula 1-related content from the web; the application then stores that content in a vector database and answers natural-language questions against it.

## Features

![Example](image.png)

- Web scraping of Formula 1 content with automatic content extraction
- Vector database storage using Pinecone for efficient similarity search
- OpenRouter integration, using the Mistral-7B-Instruct model for answer generation
- HuggingFace embeddings (all-MiniLM-L6-v2) for semantic search
- RAG-powered question answering with contextual understanding and source citations
- Command-line interface for automation and scripting
- User-friendly Streamlit web interface with chat history
- Asynchronous data ingestion and processing for improved performance

## Architecture

F1-AI is built from the following components:

- **LangChain**: Orchestrates the RAG pipeline and manages interactions between components
- **Pinecone**: Vector database for storing and retrieving embeddings
- **OpenRouter**: LLM provider, serving the Mistral-7B-Instruct model
- **HuggingFace**: Provides the all-MiniLM-L6-v2 embedding model
- **Playwright**: Handles web scraping with JavaScript support
- **BeautifulSoup4**: Processes HTML content and extracts relevant information
- **Streamlit**: Provides an interactive web interface with chat functionality
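
To make the retrieval step of this pipeline concrete, here is a minimal, dependency-free sketch of similarity search over embeddings. It is an illustration only: the three-dimensional vectors, the toy `store`, and the `top_k` helper are hypothetical stand-ins. In the application itself, embeddings come from all-MiniLM-L6-v2 and the search is delegated to Pinecone.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=2):
    """Return the k (text, score) pairs most similar to query_vec."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in store]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy "vector store": (chunk text, made-up embedding).
store = [
    ("Race report: Monaco Grand Prix", [0.9, 0.1, 0.0]),
    ("Driver standings after round 5", [0.1, 0.9, 0.0]),
    ("Tyre compound explainer", [0.0, 0.2, 0.9]),
]

# A made-up query embedding that points mostly along the first axis.
results = top_k([0.8, 0.2, 0.0], store, k=2)
for text, score in results:
    print(f"{score:.3f}  {text}")
```

The retrieved chunks would then be passed to the LLM as context, which is what allows answers to cite their sources.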

## Prerequisites

- Python 3.8 or higher
- OpenRouter API key (set as the `OPENROUTER_API_KEY` environment variable)
- Pinecone API key (set as the `PINECONE_API_KEY` environment variable)
- 8 GB RAM minimum (16 GB recommended)
- Internet connection for web scraping

## Installation

1. Clone the repository:
   ```bash
   git clone <repository-url>
   cd f1-ai
   ```

2. Install the required dependencies:
   ```bash
   pip install -r requirements.txt
   ```

3. Install Playwright browsers:
   ```bash
   playwright install chromium
   ```

4. Set up environment variables:
   Create a `.env` file with:
   ```
   # Required for LLM functionality
   OPENROUTER_API_KEY=your_api_key_here
   # Required for vector storage
   PINECONE_API_KEY=your_api_key_here
   ```
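
Before running anything, it can help to confirm that both keys are actually visible to Python. The snippet below is a small sketch, not part of the project: `REQUIRED_KEYS` and `missing_keys` are hypothetical names, and the check only reads the process environment, so it assumes the `.env` values have already been exported or loaded (for example by a loader such as python-dotenv, if the project uses one).

```python
import os

# The two variables named in the setup step above.
REQUIRED_KEYS = ("OPENROUTER_API_KEY", "PINECONE_API_KEY")


def missing_keys(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_KEYS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_keys()
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
    else:
        print("All required API keys are set.")
```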

## Usage

### Command Line Interface

1. Scrape and ingest F1 content:
   ```bash
   python f1_scraper.py --start-urls https://www.formula1.com/ --max-pages 100 --depth 2 --ingest
   ```
   Options:
   - `--start-urls`: Space-separated list of URLs to start crawling from
   - `--max-pages`: Maximum number of pages to crawl (default: 100)
   - `--depth`: Maximum crawl depth (default: 2)
   - `--ingest`: Flag to ingest discovered content into the RAG system
   - `--max-chunks`: Maximum chunks per URL for ingestion (default: 50)
   - `--llm-provider`: LLM provider to use (`openrouter`)

2. Ask questions about Formula 1:
   ```bash
   python f1_ai.py ask "Who won the 2023 F1 World Championship?"
   ```
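
The scraper options listed above map naturally onto Python's `argparse`. The following is a sketch of how such a parser could be declared, not the actual contents of `f1_scraper.py`. The flag names and the documented defaults come from the list above; the `build_parser` helper, making `--start-urls` required, and defaulting `--llm-provider` to `openrouter` are assumptions made for illustration.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented scraper options."""
    parser = argparse.ArgumentParser(
        description="Crawl F1 pages and optionally ingest them."
    )
    # Space-separated list of seed URLs (required here by assumption).
    parser.add_argument("--start-urls", nargs="+", required=True)
    parser.add_argument("--max-pages", type=int, default=100)
    parser.add_argument("--depth", type=int, default=2)
    # Boolean flag: present means "ingest", absent means "crawl only".
    parser.add_argument("--ingest", action="store_true")
    parser.add_argument("--max-chunks", type=int, default=50)
    # Default value is an assumption; the README lists only this provider.
    parser.add_argument(
        "--llm-provider", choices=["openrouter"], default="openrouter"
    )
    return parser


# Parse the same arguments as the example command above.
args = build_parser().parse_args(
    [
        "--start-urls", "https://www.formula1.com/",
        "--max-pages", "100",
        "--depth", "2",
        "--ingest",
    ]
)
print(args.start_urls, args.max_pages, args.depth, args.ingest, args.max_chunks)
```

Because `--start-urls` uses `nargs="+"`, several seed URLs can be passed after the flag, matching the "space-separated list" described above.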

### Streamlit Interface

Run the Streamlit app:
```bash
streamlit run app.py
```

This will open a web interface where you can:
- Ask questions about Formula 1
- View responses in a chat-like interface
- See source citations for answers
- Track conversation history
- Get real-time updates on response generation

## Project Structure

- `f1_scraper.py`: Intelligent web crawler implementation
  - Automatically discovers F1-related content using keyword scoring
  - Handles content relevance detection with priority paths
  - Manages crawling depth and limits
  - Implements domain-specific filtering
- `f1_ai.py`: Core RAG application implementation
  - Handles data ingestion and chunking
  - Manages vector database operations
  - Implements question-answering logic with source tracking
  - Provides robust error handling
- `llm_manager.py`: LLM provider management
  - Integrates with OpenRouter for advanced LLM capabilities
  - Manages HuggingFace embeddings generation
  - Implements rate limiting and error recovery
  - Handles async API interactions
- `app.py`: Streamlit web interface
  - Provides chat-based UI with message history
  - Manages conversation state
  - Handles async operations with progress tracking
  - Implements error handling and user feedback
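
As an illustration of the keyword scoring and priority paths mentioned for `f1_scraper.py`, here is one way such a relevance check could look. This is a sketch under stated assumptions rather than the project's actual logic: the keyword weights, the priority paths, the bonus, the threshold, and the function names are all hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical keyword weights and priority paths, chosen for illustration.
KEYWORDS = {"formula 1": 3, "grand prix": 2, "qualifying": 1, "pit stop": 1}
PRIORITY_PATHS = ("/results", "/drivers", "/teams")
PRIORITY_BONUS = 2


def relevance_score(url, text):
    """Sum weighted keyword occurrences, plus a bonus for priority paths."""
    lowered = text.lower()
    score = sum(weight * lowered.count(kw) for kw, weight in KEYWORDS.items())
    # str.startswith accepts a tuple and matches any of its prefixes.
    if urlparse(url).path.startswith(PRIORITY_PATHS):
        score += PRIORITY_BONUS
    return score


def is_relevant(url, text, threshold=3):
    """Decide whether a page is worth keeping (threshold is an assumption)."""
    return relevance_score(url, text) >= threshold


page = "Formula 1 qualifying for the Monaco Grand Prix starts on Saturday."
print(relevance_score("https://example.com/results/monaco", page))
print(is_relevant("https://example.com/recipes/pasta", "A quick weeknight pasta."))
```

A crawler built this way would follow links only from pages that pass `is_relevant`, which keeps the crawl within the page and depth limits while favouring F1 content.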

## Contributing

Contributions are welcome! Please follow these steps:

1. Fork the repository
2. Create a feature branch
3. Commit your changes
4. Push to the branch
5. Submit a Pull Request