Shreyas094
committed on
Update README.md
README.md
CHANGED
@@ -10,4 +10,132 @@ pinned: false
license: apache-2.0
---

An example chatbot using [Gradio](https://gradio.app), [`huggingface_hub`](https://huggingface.co/docs/huggingface_hub/v0.22.2/en/index), and the [Hugging Face Inference API](https://huggingface.co/docs/api-inference/index).

# Web Scraper for Financial News with Sentinel AI

## Table of Contents
1. [Overview](#overview)
2. [Core Components](#core-components)
   - [Search Engine Integration](#search-engine-integration)
   - [AI Models Integration](#ai-models-integration)
   - [Content Processing](#content-processing)
3. [Key Features](#key-features)
   - [Intelligent Query Processing](#intelligent-query-processing)
   - [Content Analysis](#content-analysis)
   - [Search Optimization](#search-optimization)
4. [Architecture](#architecture)
   - [User Interface (UI)](#user-interface-ui)
   - [Query Processing](#query-processing)
   - [Search Engine](#search-engine)
   - [Content Analysis](#content-analysis)
   - [Ranking System](#ranking-system)
   - [Response Generation](#response-generation)
   - [Core Classes](#core-classes)
5. [Main Functions](#main-functions)
6. [API Integration](#api-integration)
8. [Advanced Parameters](#advanced-parameters)

## 1. Overview
This application is a web scraper and AI-powered chat interface designed for financial news analysis. It combines web scraping with multiple large language models (LLMs) to give context-aware responses to user queries about financial information.

## 2. Core Components

### Search Engine Integration
- Uses SearXNG as the primary metasearch engine
- Supports multiple search engines (Google, Bing, DuckDuckGo, etc.)
- Implements custom retry mechanisms and timeout handling
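
The retry handling mentioned above could look roughly like the sketch below. This is a minimal illustration, not the project's actual code: the function name `call_with_retry`, the exponential backoff policy, and the default values are all assumptions.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(fn: Callable[[], T], attempts: int = 3, base_delay: float = 0.5,
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn(), retrying on any exception with exponential backoff.

    Hypothetical helper: the delay doubles after each failed attempt
    (base_delay, 2 * base_delay, ...). The last failure is re-raised.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the original error
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

A search request would be wrapped in a zero-argument callable, for example a lambda around an HTTP call that also passes its own timeout, so that slow responses fail fast and trigger a retry.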

### AI Models Integration
- Supports multiple LLM providers: Hugging Face (Mistral-Small-Instruct), Groq (Llama-3.1-70b), Mistral AI (Open-Mistral-Nemo)
- Computes semantic similarity with Sentence-Transformers

### Content Processing
- PDF processing with PyPDF2
- Web content scraping with Newspaper3k
- BM25 ranking algorithm implementation
- Document deduplication and relevance assessment
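
As an illustration of the deduplication step, the sketch below drops documents whose text is identical after whitespace and case normalization. The README does not say how duplicates are detected, so exact matching on a hash is an assumption, as are the `deduplicate` name and the `content` key.

```python
import hashlib
import re
from typing import Dict, List


def normalize(text: str) -> str:
    """Collapse runs of whitespace and lowercase the text."""
    return re.sub(r"\s+", " ", text).strip().lower()


def deduplicate(documents: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep the first document for each distinct normalized content.

    Assumes each document is a dict with a "content" key (hypothetical schema).
    """
    seen = set()
    unique = []
    for doc in documents:
        digest = hashlib.sha256(normalize(doc["content"]).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique
```

Near-duplicates (the same story reworded by two outlets) would need a fuzzier comparison, such as the semantic similarity listed under AI Models Integration.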

## 3. Key Features

### Intelligent Query Processing
- Query type determination (knowledge base vs. web search)
- Query rephrasing for optimal search results
- Entity recognition
- Time-aware query modification
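
One plausible reading of time-aware query modification is sketched below: when a query asks for recent information, the current year is appended so the search engine favors fresh results. The trigger words, the function name `add_time_context`, and the choice to append the year are assumptions made for illustration.

```python
from datetime import date
from typing import Optional

# Hypothetical trigger words that suggest the user wants recent information.
RECENCY_WORDS = {"latest", "recent", "today", "current", "now"}


def add_time_context(query: str, today: Optional[date] = None) -> str:
    """Append the current year to queries that ask for recent information."""
    today = today or date.today()
    year = str(today.year)
    words = {w.strip(".,?!").lower() for w in query.split()}
    if words & RECENCY_WORDS and year not in query:
        return f"{query} {year}"
    return query
```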

### Content Analysis
- Relevance assessment
- Content summarization
- Semantic similarity comparison
- Document deduplication
- Priority-based content ranking

### Search Optimization
- Custom retry mechanism
- Rate limiting
- Error handling
- Content filtering and validation
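
Rate limiting can be sketched as a minimum interval between outgoing requests. The class below is a hypothetical example, not the project's implementation; the clock and sleep functions are injectable so the behavior can be checked without real waiting.

```python
import time
from typing import Callable, Optional


class MinIntervalLimiter:
    """Space calls at least `interval` seconds apart by sleeping when needed."""

    def __init__(self, interval: float,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.interval = interval
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None  # time of the previous permitted call

    def wait(self) -> float:
        """Block until the next call is allowed; return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.interval - (now - self._last))
        if delay > 0:
            self._sleep(delay)
        self._last = now + delay
        return delay
```

Calling `limiter.wait()` before each search or scrape request keeps the request rate below one per `interval` seconds.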

## 4. Architecture

### User Interface (UI)
- You start by interacting with a Gradio Chat Interface.

### Query Processing
- Your query is sent to the Query Analysis (QA) section.
- The system then determines the type of query (DT).
- If the query can be answered from the knowledge base, the system generates an AI response (KB).
- If the query requires a web search, the system rephrases it (QR) for that search.
- The system extracts the entity domain (ED) from the rephrased query.
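
The routing steps above can be sketched as a small dispatcher. Every name here is hypothetical: the callables stand in for the LLM-backed steps (DT, KB, QR, ED), and the route labels `"knowledge_base"` and `"web_search"` are assumed values.

```python
from typing import Any, Callable, Dict, Optional


def route_query(query: str,
                classify: Callable[[str], str],
                answer_from_knowledge: Callable[[str], str],
                rephrase: Callable[[str], str],
                extract_entity_domain: Callable[[str], Optional[str]]) -> Dict[str, Any]:
    """Decide between a knowledge-base answer and a web search."""
    query_type = classify(query)  # DT: determine the query type
    if query_type == "knowledge_base":
        # KB: answer directly, no search needed
        return {"route": "knowledge_base", "response": answer_from_knowledge(query)}
    rephrased = rephrase(query)  # QR: rephrase for web search
    return {
        "route": "web_search",
        "rephrased_query": rephrased,
        "entity_domain": extract_entity_domain(rephrased),  # ED
    }
```

Passing the steps in as callables keeps the control flow testable with plain functions, without contacting any model provider.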

### Search Engine
- The extracted entity domain is sent to the SearXNG Search Engine (SE).
- The search engine returns the search results (SR).
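
A request to a SearXNG instance is typically a GET on its `/search` endpoint. The sketch below only builds the URL; the instance address and the function name `build_searxng_url` are assumptions, and JSON output must be enabled on the instance for `format=json` to work.

```python
from typing import Iterable, Optional
from urllib.parse import urlencode


def build_searxng_url(base_url: str, query: str, time_range: str = "month",
                      language: str = "en", safesearch: int = 2,
                      engines: Optional[Iterable[str]] = None) -> str:
    """Build a SearXNG search URL that asks for JSON results."""
    params = {
        "q": query,
        "format": "json",
        "time_range": time_range,
        "language": language,
        "safesearch": safesearch,
    }
    if engines:
        # SearXNG accepts a comma-separated list of engine names.
        params["engines"] = ",".join(engines)
    return f"{base_url.rstrip('/')}/search?{urlencode(params)}"
```

For example, `build_searxng_url("https://searx.example.org", "Apple earnings", engines=["google", "bing"])` yields a URL that can be fetched with any HTTP client; `searx.example.org` is a placeholder host.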

### Content Analysis
- The search results are processed by web scraping (WS).
- If the content is a PDF, it is scraped with the PDF scraper (PDF).
- If the content is HTML, it is scraped with Newspaper3k (NEWS).
- Relevant content is summarized (DS) and checked for uniqueness (UC).
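
The choice between the PDF and HTML scrapers could be made from the response's content type or, failing that, the URL path. This is a minimal sketch under that assumption; the function name and the returned labels are hypothetical.

```python
from urllib.parse import urlparse


def choose_scraper(url: str, content_type: str = "") -> str:
    """Return "pdf" for PDF documents and "html" for everything else."""
    if "application/pdf" in content_type.lower():
        return "pdf"
    # Fall back to the file extension when no content type is available.
    if urlparse(url).path.lower().endswith(".pdf"):
        return "pdf"
    return "html"
```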

### Ranking System
- Content is ranked (DR) based on:
  - **BM25 Scoring (BM):** A keyword-based relevance score that weighs term frequency against document length.
  - **Semantic Similarity (SS):** How similar the content is to the query.
- The scores are combined (CS) to produce a final ranking (FR).
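
The sketch below shows one way to compute BM25 scores and blend them with semantic similarity scores. It is not the project's implementation: the whitespace tokenizer, the `k1` and `b` defaults, the min-max normalization, and the equal weighting are all assumptions. The semantic scores are taken as given, for example cosine similarities between query and document embeddings.

```python
import math
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    return text.lower().split()


def bm25_scores(query: str, documents: List[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score each document against the query with the standard BM25 formula."""
    docs = [tokenize(d) for d in documents]
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n if n else 0.0
    df = Counter(term for d in docs for term in set(d))  # document frequency
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def min_max(values: List[float]) -> List[float]:
    """Rescale values to [0, 1]; constant or empty input maps to zeros."""
    if not values:
        return []
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def combine_scores(bm25: List[float], semantic: List[float], weight: float = 0.5) -> List[float]:
    """Blend normalized BM25 and semantic scores; `weight` applies to BM25."""
    return [weight * x + (1 - weight) * y for x, y in zip(min_max(bm25), min_max(semantic))]
```

Sorting documents by the combined score in descending order gives the final ranking. Normalizing first matters because raw BM25 scores are unbounded while cosine similarities are not.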

### Response Generation
- The final ranking is summarized again (FS) to create a final summary.
- The AI-generated response (KB) and final summary (FS) are combined to form the final response.

### Completion
- The final response is sent back to the Gradio Chat Interface (UI) for you to see.

### Core Classes
- **BM25:** Custom implementation for document ranking
- **Search and Scrape Pipeline:** Handles query processing, web search, content scraping, document analysis, and content summarization.

## 5. Main Functions
- **`determine_query_type(query, chat_history, llm_client)`**: Determines from context whether to use the knowledge base or a web search.
- **`search_and_scrape(query, chat_history, ...)`**: Main function for web search and content aggregation.
- **`rerank_documents_with_priority(query, documents, entity_domain, ...)`**: Hybrid ranking using BM25 and semantic similarity.
- **`llm_summarize(json_input, model, temperature)`**: Generates summaries using the specified LLM and handles citation and formatting.

## 6. API Integration
- **Required API Keys**: Hugging Face, Groq, Mistral, SearXNG
- **Environment Variables Setup**: Use `python-dotenv` to load environment variables from a `.env` file
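
A startup check for the required keys might look like the sketch below. The variable names in `REQUIRED_KEYS` are placeholders, since the README does not state the actual names; `load_dotenv` comes from the `python-dotenv` package and is treated as optional here.

```python
import os
from typing import List, Mapping, Optional, Sequence

try:
    from dotenv import load_dotenv  # provided by the python-dotenv package
except ImportError:  # fall back to variables already set in the environment
    load_dotenv = None

# Placeholder names: the source does not state the real variable names.
REQUIRED_KEYS = ("HF_TOKEN", "GROQ_API_KEY", "MISTRAL_API_KEY", "SEARXNG_KEY")


def missing_keys(env: Optional[Mapping[str, str]] = None,
                 required: Sequence[str] = REQUIRED_KEYS) -> List[str]:
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


def load_config() -> None:
    """Load a .env file if python-dotenv is installed, then validate the keys."""
    if load_dotenv is not None:
        load_dotenv()
    missing = missing_keys()
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
```

Failing at startup with the list of missing names is usually easier to debug than an authentication error from a provider halfway through a request.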

## 8. Advanced Parameters

| **Parameter** | **Description** | **Range/Options** | **Default** | **Usage** |
|---|---|---|---|---|
| **Number of Results** | Number of search results retrieved. | 5 to 20 | 5 | Controls number of links/articles fetched from web searches. |
| **Maximum Characters** | Limits characters per document processed. | 500 to 10,000 | 3000 | Truncates long documents, focusing on relevant information. |
| **Time Range** | Specifies the time period for search results. | day, week, month, year | month | Filters results based on recent or historical data. |
| **Language Selection** | Filters search results by language. | `en`, `fr`, `es`, etc. | `en` | Retrieves content in a specified language. |
| **LLM Temperature** | Controls randomness in responses from LLM. | 0.0 to 1.0 | 0.2 | Low values for factual responses; higher for creative ones. |
| **Search Engines** | Specifies search engines used for scraping. | Google, Bing, DuckDuckGo, etc. | All engines | Choose specific search engines for better or private results. |
| **Safe Search Level** | Filters explicit/inappropriate content. | 0: No filter, 1: Moderate, 2: Strict | 2 (Strict) | Ensures family-friendly or professional content. |
| **Model Selection** | Chooses the LLM for summaries or responses. | Mistral, GPT-4, Groq | Varies | Select models based on performance or speed. |
| **PDF Processing Toggle** | Enables/disables PDF document processing. | `True` (process) or `False` (skip) | `False` | Processes PDFs, useful for reports but may slow down speed. |
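
The defaults and ranges in the table can be captured in a small settings object that rejects out-of-range values. The class and field names below are hypothetical; only the ranges and defaults come from the table.

```python
from dataclasses import dataclass


@dataclass
class SearchSettings:
    """Advanced parameters with the defaults listed in the table."""
    num_results: int = 5
    max_chars: int = 3000
    time_range: str = "month"
    language: str = "en"
    temperature: float = 0.2
    safesearch: int = 2
    process_pdfs: bool = False

    def __post_init__(self) -> None:
        # Validate each value against the range given in the table.
        if not 5 <= self.num_results <= 20:
            raise ValueError("num_results must be between 5 and 20")
        if not 500 <= self.max_chars <= 10_000:
            raise ValueError("max_chars must be between 500 and 10,000")
        if self.time_range not in {"day", "week", "month", "year"}:
            raise ValueError("time_range must be day, week, month, or year")
        if not 0.0 <= self.temperature <= 1.0:
            raise ValueError("temperature must be between 0.0 and 1.0")
        if self.safesearch not in (0, 1, 2):
            raise ValueError("safesearch must be 0, 1, or 2")
```

`SearchSettings()` gives the documented defaults, while something like `SearchSettings(num_results=10, time_range="week")` overrides individual values and still runs the range checks.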