Shreyas094 committed on
Commit
b8a0c9a
·
verified ·
1 Parent(s): 8f3ce4a

Update README.md

  1. README.md +129 -1
README.md CHANGED
@@ -10,4 +10,132 @@ pinned: false
10
  license: apache-2.0
11
  ---
12
 
13
- An example chatbot using [Gradio](https://gradio.app), [`huggingface_hub`](https://huggingface.co/docs/huggingface_hub/v0.22.2/en/index), and the [Hugging Face Inference API](https://huggingface.co/docs/api-inference/index).
+ An example chatbot using [Gradio](https://gradio.app), [`huggingface_hub`](https://huggingface.co/docs/huggingface_hub/v0.22.2/en/index), and the [Hugging Face Inference API](https://huggingface.co/docs/api-inference/index).
14
+
15
+ # Web Scraper for Financial News with Sentinel AI
16
+
17
+ ## Table of Contents
18
+ 1. [Overview](#overview)
19
+ 2. [Core Components](#core-components)
20
+ - [Search Engine Integration](#search-engine-integration)
21
+ - [AI Models Integration](#ai-models-integration)
22
+ - [Content Processing](#content-processing)
23
+ 3. [Key Features](#key-features)
24
+ - [Intelligent Query Processing](#intelligent-query-processing)
25
+ - [Content Analysis](#content-analysis)
26
+ - [Search Optimization](#search-optimization)
27
+ 4. [Architecture](#architecture)
28
+ - [User Interface (UI)](#user-interface-ui)
29
+ - [Query Processing](#query-processing)
30
+ - [Search Engine](#search-engine)
31
+ - [Content Analysis](#content-analysis)
32
+ - [Ranking System](#ranking-system)
33
+ - [Response Generation](#response-generation)
34
+ - [Core Classes](#core-classes)
35
+ 5. [Main Functions](#main-functions)
36
+ 6. [API Integration](#api-integration)
37
+ 8. [Advanced Parameters](#advanced-parameters)
38
+
39
+ ## 1. Overview
40
+ This application is a web scraper and AI-powered chat interface designed for financial news analysis. It combines web scraping with multiple large language models (LLMs) to give context-aware answers to user queries about financial information.
41
+
42
+ ## 2. Core Components
43
+
44
+ ### Search Engine Integration
45
+ - Uses SearXNG as the primary search meta-engine
46
+ - Supports multiple search engines (Google, Bing, DuckDuckGo, etc.)
47
+ - Implements custom retry mechanisms and timeout handling
48
+
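The retry and timeout handling mentioned above could look roughly like the following. This is a minimal sketch, not the project's actual code: the helper name `call_with_retry`, the exponential backoff policy, and the default values are all assumptions made for illustration.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    fn: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying with exponential backoff when it raises (hypothetical helper)."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            # Wait base_delay, then 2x, then 4x, ... before the next try.
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

In practice `fn` would wrap a single search request that carries its own timeout, so that a hung connection raises and triggers the next attempt.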
49
+ ### AI Models Integration
50
+ - Supports multiple LLM providers: Hugging Face (Mistral-Small-Instruct), Groq (Llama-3.1-70b), Mistral AI (Open-Mistral-Nemo)
51
+ - Implements semantic similarity using Sentence-Transformers
52
+
53
+ ### Content Processing
54
+ - PDF processing with PyPDF2
55
+ - Web content scraping with Newspaper3k
56
+ - BM25 ranking algorithm implementation
57
+ - Document deduplication and relevance assessment
58
+
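For readers unfamiliar with BM25, here is a compact sketch of the standard Okapi BM25 formula. It is not the repository's `BM25` class; the function name, the whitespace tokenizer, and the parameter defaults (`k1=1.5`, `b=0.75`) are assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Return one Okapi BM25 score per document (whitespace tokenization)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    if n_docs == 0:
        return []
    avg_len = sum(len(doc) for doc in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in tokenized for term in set(doc))
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Smoothed inverse document frequency (always positive).
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

Rare query terms contribute more than common ones, and repeated occurrences of a term yield diminishing returns, which is why a document matching two distinct query terms can outrank one that repeats a single term.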
59
+ ## 3. Key Features
60
+
61
+ ### Intelligent Query Processing
62
+ - Query type determination (knowledge base vs. web search)
63
+ - Query rephrasing for optimal search results
64
+ - Entity recognition
65
+ - Time-aware query modification
66
+
67
+ ### Content Analysis
68
+ - Relevance assessment
69
+ - Content summarization
70
+ - Semantic similarity comparison
71
+ - Document deduplication
72
+ - Priority-based content ranking
73
+
74
+ ### Search Optimization
75
+ - Custom retry mechanism
76
+ - Rate limiting
77
+ - Error handling
78
+ - Content filtering and validation
79
+
80
+ ## 4. Architecture
81
+
82
+ ### User Interface (UI)
83
+ - You start by interacting with a Gradio Chat Interface.
84
+
85
+ ### Query Processing
86
+ - Your query is sent to the Query Analysis (QA) section.
87
+ - The system then determines the type of query (DT).
88
+ - If the query can be answered from the knowledge base, the system generates an AI response (KB).
89
+ - If it requires web searching, it rephrases the query (QR) for web search.
90
+ - The system extracts the entity domain (ED) from the rephrased query.
91
+
92
+ ### Search Engine
93
+ - The extracted entity domain is sent to the SearXNG Search Engine (SE).
94
+ - The search engine returns the search results (SR).
95
+
96
+ ### Content Analysis
97
+ - The search results are processed by web scraping (WS).
98
+ - PDF content is extracted with the PDF scraper (PDF).
98
+ - HTML content is extracted with Newspaper3k (NEWS).
100
+ - Relevant content is summarized (DS) and checked for uniqueness (UC).
101
+
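The format dispatch and uniqueness check described above might be sketched as follows. Both helpers are hypothetical: `pick_scraper` guesses the format from the URL or a Content-Type header, and `is_unique` uses `difflib` string similarity as a simple stand-in for the semantic comparison the application performs. The 0.9 threshold is an assumption.

```python
from difflib import SequenceMatcher
from urllib.parse import urlparse


def pick_scraper(url, content_type=""):
    """Return "pdf" or "html" for a search result (hypothetical helper)."""
    if "application/pdf" in content_type.lower():
        return "pdf"
    # Look only at the URL path, so query strings do not hide the extension.
    if urlparse(url).path.lower().endswith(".pdf"):
        return "pdf"
    return "html"


def is_unique(summary, kept, threshold=0.9):
    """True when summary is not a near-duplicate of any already kept summary."""
    return all(
        SequenceMatcher(None, summary.lower(), other.lower()).ratio() < threshold
        for other in kept
    )
```

A caller would route each result to the PDF or Newspaper3k scraper based on `pick_scraper`, then append a summary to the kept list only when `is_unique` returns `True`.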
102
+ ### Ranking System
103
+ - Content is ranked (DR) based on:
104
+ - **BM25 Scoring (BM):** A scoring method to rank documents.
105
+ - **Semantic Similarity (SS):** How similar the content is to the query.
106
+ - The scores are combined (CS) to produce a final ranking (FR).
107
+
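One plausible way to combine the two signals is a weighted sum of min-max normalized scores, sketched below. The README does not state how the combination is actually done, so the normalization, the `bm25_weight` parameter, and the function names are assumptions.

```python
def min_max(values):
    """Scale values to [0, 1]; all-equal inputs map to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def combine_scores(bm25, semantic, bm25_weight=0.4):
    """Weighted sum of normalized BM25 and semantic scores (assumed scheme)."""
    if len(bm25) != len(semantic):
        raise ValueError("score lists must have the same length")
    if not bm25:
        return []
    bm25_n, semantic_n = min_max(bm25), min_max(semantic)
    return [
        bm25_weight * b + (1 - bm25_weight) * s
        for b, s in zip(bm25_n, semantic_n)
    ]


def final_ranking(combined):
    """Return document indices ordered from best to worst combined score."""
    return sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)
```

Normalizing first matters because raw BM25 scores are unbounded while cosine similarities are not, so an unnormalized sum would let BM25 dominate.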
108
+ ### Response Generation
109
+ - The top-ranked content is summarized again (FS) to produce the final summary.
110
+ - The AI-generated response (KB) and final summary (FS) are combined to form the final response.
111
+
112
+ ### Completion
113
+ - The final response is sent back to the Gradio Chat Interface (UI) for you to see.
114
+
115
+ ### Core Classes
116
+ - **BM25:** Custom implementation for document ranking
117
+ - **Search and Scrape Pipeline:** Handles query processing, web search, content scraping, document analysis, and content summarization.
118
+
119
+ ## 5. Main Functions
120
+ - **`determine_query_type(query, chat_history, llm_client)`**: Determines whether to use knowledge base or web search based on context.
121
+ - **`search_and_scrape(query, chat_history, ...)`**: Main function for web search and content aggregation.
122
+ - **`rerank_documents_with_priority(query, documents, entity_domain, ...)`**: Hybrid ranking using BM25 and semantic similarity.
123
+ - **`llm_summarize(json_input, model, temperature)`**: Generates summaries using the specified LLM and handles citation and formatting.
124
+
125
+ ## 6. API Integration
126
+ - **Required API Keys**: Hugging Face, Groq, Mistral, SearXNG
127
+ - **Environment Variables Setup**: Use dotenv to load environment variables
128
+
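A hedged sketch of the environment setup: `load_dotenv` comes from the `python-dotenv` package, but the variable names below are hypothetical, and the project may name its settings differently. The sketch also assumes SearXNG is configured through an instance URL rather than a key.

```python
import os

try:
    from dotenv import load_dotenv  # provided by the python-dotenv package
except ImportError:
    load_dotenv = None  # fall back to variables already set in the shell

# Hypothetical names; check the application code for the real ones.
REQUIRED_VARS = ("HF_TOKEN", "GROQ_API_KEY", "MISTRAL_API_KEY", "SEARXNG_URL")


def load_env_file():
    """Load a .env file from the working directory when python-dotenv is available."""
    if load_dotenv is not None:
        load_dotenv()


def missing_settings(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Calling `load_env_file()` followed by `missing_settings()` at startup gives a clear error list instead of a failure deep inside an API call.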
129
+ ## 8. Advanced Parameters
130
+
131
+ | **Parameter** | **Description** | **Range/Options** | **Default** | **Usage** |
132
+ |--------------------------|---------------------------------------------------------------|--------------------------------------------|---------------|---------------------------------------------------------------|
133
+ | **Number of Results** | Number of search results retrieved. | 5 to 20 | 5 | Controls number of links/articles fetched from web searches. |
134
+ | **Maximum Characters** | Limits characters per document processed. | 500 to 10,000 | 3000 | Truncates long documents, focusing on relevant information. |
135
+ | **Time Range** | Specifies the time period for search results. | day, week, month, year | month | Filters results based on recent or historical data. |
136
+ | **Language Selection** | Filters search results by language. | `en`, `fr`, `es`, etc. | `en` | Retrieves content in a specified language. |
137
+ | **LLM Temperature** | Controls randomness in responses from LLM. | 0.0 to 1.0 | 0.2 | Low values for factual responses; higher for creative ones. |
138
+ | **Search Engines** | Specifies search engines used for scraping. | Google, Bing, DuckDuckGo, etc. | All engines | Choose specific search engines for better or private results. |
139
+ | **Safe Search Level** | Filters explicit/inappropriate content. | 0: No filter, 1: Moderate, 2: Strict | 2 (Strict) | Ensures family-friendly or professional content. |
140
+ | **Model Selection** | Chooses the LLM for summaries or responses. | Mistral, GPT-4, Groq | Varies | Select models based on performance or speed. |
141
+ | **PDF Processing Toggle** | Enables/disables PDF document processing. | `True` (process) or `False` (skip) | `False` | Processes PDFs, useful for reports but may slow down speed. |
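The defaults and ranges in the table can be captured in a small validated settings object. This is an illustrative sketch: the class and field names are hypothetical, and only the ranges and defaults are taken from the table above.

```python
from dataclasses import dataclass

TIME_RANGES = ("day", "week", "month", "year")


@dataclass
class SearchSettings:
    """Hypothetical container mirroring the table's defaults and ranges."""

    num_results: int = 5
    max_chars: int = 3000
    time_range: str = "month"
    language: str = "en"
    temperature: float = 0.2
    safe_search: int = 2
    process_pdfs: bool = False

    def __post_init__(self):
        # Reject values outside the ranges listed in the table.
        if not 5 <= self.num_results <= 20:
            raise ValueError("num_results must be between 5 and 20")
        if not 500 <= self.max_chars <= 10_000:
            raise ValueError("max_chars must be between 500 and 10,000")
        if self.time_range not in TIME_RANGES:
            raise ValueError(f"time_range must be one of {TIME_RANGES}")
        if not 0.0 <= self.temperature <= 1.0:
            raise ValueError("temperature must be between 0.0 and 1.0")
        if self.safe_search not in (0, 1, 2):
            raise ValueError("safe_search must be 0, 1, or 2")
```

For example, `SearchSettings(num_results=10, time_range="week")` keeps every other default, while `SearchSettings(temperature=1.5)` raises `ValueError`.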