File size: 7,562 Bytes
8f3ce4a
0e9b799
 
 
 
 
5c2f4c9
0e9b799
 
8f3ce4a
 
 
b8a0c9a
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
---
title: SearXNG Web Search
emoji: 💬
colorFrom: yellow
colorTo: purple
sdk: gradio
sdk_version: 5.3.0
app_file: app.py
pinned: false
license: apache-2.0
---

An example chatbot using [Gradio](https://gradio.app), [`huggingface_hub`](https://huggingface.co/docs/huggingface_hub/v0.22.2/en/index), and the [Hugging Face Inference API](https://huggingface.co/docs/api-inference/index).

# Web Scraper for Financial News with Sentinel AI

## Table of Contents
1. [Overview](#overview)
2. [Core Components](#core-components)
   - [Search Engine Integration](#search-engine-integration)
   - [AI Models Integration](#ai-models-integration)
   - [Content Processing](#content-processing)
3. [Key Features](#key-features)
   - [Intelligent Query Processing](#intelligent-query-processing)
   - [Content Analysis](#content-analysis)
   - [Search Optimization](#search-optimization)
4. [Architecture](#architecture)
   - [User Interface (UI)](#user-interface-ui)
   - [Query Processing](#query-processing)
   - [Search Engine](#search-engine)
   - [Content Analysis](#content-analysis)
   - [Ranking System](#ranking-system)
   - [Response Generation](#response-generation)
   - [Core Classes](#core-classes)
5. [Main Functions](#main-functions)
6. [API Integration](#api-integration)
8. [Advanced Parameters](#advanced-parameters)

## 1. Overview
This application is a sophisticated web scraper and AI-powered chat interface specifically designed for financial news analysis. It combines web scraping capabilities with multiple Language Learning Models (LLMs) to provide intelligent, context-aware responses to user queries about financial information.

## 2. Core Components

### Search Engine Integration
- Uses SearXNG as the primary search meta-engine
- Supports multiple search engines (Google, Bing, DuckDuckGo, etc.)
- Implements custom retry mechanisms and timeout handling

### AI Models Integration
- Supports multiple LLM providers: Hugging Face (Mistral-Small-Instruct), Groq (Llama-3.1-70b), Mistral AI (Open-Mistral-Nemo)
- Implements semantic similarity using Sentence-Transformer

### Content Processing
- PDF processing with PyPDF2
- Web content scraping with Newspaper3k
- BM25 ranking algorithm implementation
- Document deduplication and relevance assessment

## 3. Key Features

### Intelligent Query Processing
- Query type determination (knowledge base vs. web search)
- Query rephrasing for optimal search results
- Entity recognition
- Time-aware query modification

### Content Analysis
- Relevance assessment
- Content summarization
- Semantic similarity comparison
- Document deduplication
- Priority-based content ranking

### Search Optimization
- Custom retry mechanism
- Rate limiting
- Error handling
- Content filtering and validation

## 4. Architecture

### User Interface (UI)
- You start by interacting with a Gradio Chat Interface.

### Query Processing
- Your query is sent to the Query Analysis (QA) section.
- The system then determines the type of query (DT).
- If it's a type that can use a Knowledge Base, it generates an AI response (KB).
- If it requires web searching, it rephrases the query (QR) for web search.
- The system extracts the entity domain (ED) from the rephrased query.

### Search Engine
- The extracted entity domain is sent to the SearXNG Search Engine (SE).
- The search engine returns the search results (SR).

### Content Analysis
- The search results are processed by web scraping (WS).
- If the content is in PDF format, it is scraped using PDF Scraping (PDF).
- If in HTML format, it's scraped using Newspaper3k Scraping (NEWS).
- Relevant content is summarized (DS) and checked for uniqueness (UC).

### Ranking System
- Content is ranked (DR) based on:
  - **BM25 Scoring (BM):** A scoring method to rank documents.
  - **Semantic Similarity (SS):** How similar the content is to the query.
- The scores are combined (CS) to produce a final ranking (FR).

### Response Generation
- The final ranking is summarized again (FS) to create a final summary.
- The AI-generated response (KB) and final summary (FS) are combined to form the final response.

### Completion
- The final response is sent back to the Gradio Chat Interface (UI) for you to see.

### Core Classes
- **BM25:** Custom implementation for document ranking
- **Search and Scrape Pipeline:** Handles query processing, web search, content scraping, document analysis, and content summarization.

## 5. Main Functions
- **`determine_query_type(query, chat_history, llm_client)`**: Determines whether to use knowledge base or web search based on context.
- **`search_and_scrape(query, chat_history, ...)`**: Main function for web search and content aggregation.
- **`rerank_documents_with_priority(query, documents, entity_domain, ...)`**: Hybrid ranking using BM25 and semantic similarity.
- **`llm_summarize(json_input, model, temperature)`**: Generates summaries using the specified LLM and handles citation and formatting.

## 6. API Integration
- **Required API Keys**: Hugging Face, Groq, Mistral, SearXNG
- **Environment Variables Setup**: Use dotenv to load environment variables

## 8. Advanced Parameters

| **Parameter**            | **Description**                                               | **Range/Options**                          | **Default**   | **Usage**                                                     |
|--------------------------|---------------------------------------------------------------|--------------------------------------------|---------------|---------------------------------------------------------------|
| **Number of Results**     | Number of search results retrieved.                           | 5 to 20                                    | 5             | Controls number of links/articles fetched from web searches.   |
| **Maximum Characters**    | Limits characters per document processed.                     | 500 to 10,000                              | 3000          | Truncates long documents, focusing on relevant information.    |
| **Time Range**            | Specifies the time period for search results.                 | day, week, month, year                     | month         | Filters results based on recent or historical data.            |
| **Language Selection**    | Filters search results by language.                           | `en`, `fr`, `es`, etc.                     | `en`          | Retrieves content in a specified language.                     |
| **LLM Temperature**       | Controls randomness in responses from LLM.                    | 0.0 to 1.0                                 | 0.2           | Low values for factual responses; higher for creative ones.    |
| **Search Engines**        | Specifies search engines used for scraping.                   | Google, Bing, DuckDuckGo, etc.             | All engines   | Choose specific search engines for better or private results.  |
| **Safe Search Level**     | Filters explicit/inappropriate content.                       | 0: No filter, 1: Moderate, 2: Strict       | 2 (Strict)    | Ensures family-friendly or professional content.               |
| **Model Selection**       | Chooses the LLM for summaries or responses.                   | Mistral, GPT-4, Groq                       | Varies        | Select models based on performance or speed.                   |
| **PDF Processing Toggle** | Enables/disables PDF document processing.                     | `True` (process) or `False` (skip)         | `False`       | Processes PDFs, useful for reports but may slow down speed.    |