Shreyas094 committed on
Commit 0e9b799 · verified
1 Parent(s): 47fc73d

Update README.md

Files changed (1)
  1. README.md +9 -127
README.md CHANGED
@@ -1,127 +1,9 @@
- # Web Scraper for Financial News with Sentinel AI
-
- ## Table of Contents
- 1. [Overview](#overview)
- 2. [Core Components](#core-components)
-    - [Search Engine Integration](#search-engine-integration)
-    - [AI Models Integration](#ai-models-integration)
-    - [Content Processing](#content-processing)
- 3. [Key Features](#key-features)
-    - [Intelligent Query Processing](#intelligent-query-processing)
-    - [Content Analysis](#content-analysis)
-    - [Search Optimization](#search-optimization)
- 4. [Architecture](#architecture)
-    - [User Interface (UI)](#user-interface-ui)
-    - [Query Processing](#query-processing)
-    - [Search Engine](#search-engine)
-    - [Content Analysis](#content-analysis)
-    - [Ranking System](#ranking-system)
-    - [Response Generation](#response-generation)
-    - [Core Classes](#core-classes)
- 5. [Main Functions](#main-functions)
- 6. [API Integration](#api-integration)
- 8. [Advanced Parameters](#advanced-parameters)
-
- ## 1. Overview
- This application is a web scraper and AI-powered chat interface designed for financial news analysis. It combines web scraping with multiple large language models (LLMs) to provide context-aware responses to user queries about financial information.
-
- ## 2. Core Components
-
- ### Search Engine Integration
- - Uses SearXNG as the primary search meta-engine
- - Supports multiple search engines (Google, Bing, DuckDuckGo, etc.)
- - Implements custom retry mechanisms and timeout handling
-
- ### AI Models Integration
- - Supports multiple LLM providers: Hugging Face (Mistral-Small-Instruct), Groq (Llama-3.1-70b), Mistral AI (Open-Mistral-Nemo)
- - Implements semantic similarity using Sentence-Transformer
-
- ### Content Processing
- - PDF processing with PyPDF2
- - Web content scraping with Newspaper3k
- - BM25 ranking algorithm implementation
- - Document deduplication and relevance assessment
-
- ## 3. Key Features
-
- ### Intelligent Query Processing
- - Query type determination (knowledge base vs. web search)
- - Query rephrasing for optimal search results
- - Entity recognition
- - Time-aware query modification
-
- ### Content Analysis
- - Relevance assessment
- - Content summarization
- - Semantic similarity comparison
- - Document deduplication
- - Priority-based content ranking
-
- ### Search Optimization
- - Custom retry mechanism
- - Rate limiting
- - Error handling
- - Content filtering and validation
-
- ## 4. Architecture
-
- ### User Interface (UI)
- - You start by interacting with a Gradio Chat Interface.
-
- ### Query Processing
- - Your query is sent to the Query Analysis (QA) section.
- - The system then determines the type of query (DT).
- - If it's a type that can use a Knowledge Base, it generates an AI response (KB).
- - If it requires web searching, it rephrases the query (QR) for web search.
- - The system extracts the entity domain (ED) from the rephrased query.
-
- ### Search Engine
- - The extracted entity domain is sent to the SearXNG Search Engine (SE).
- - The search engine returns the search results (SR).
-
- ### Content Analysis
- - The search results are processed by web scraping (WS).
- - If the content is in PDF format, it is scraped using PDF Scraping (PDF).
- - If in HTML format, it's scraped using Newspaper3k Scraping (NEWS).
- - Relevant content is summarized (DS) and checked for uniqueness (UC).
-
- ### Ranking System
- - Content is ranked (DR) based on:
-   - **BM25 Scoring (BM):** A scoring method to rank documents.
-   - **Semantic Similarity (SS):** How similar the content is to the query.
- - The scores are combined (CS) to produce a final ranking (FR).
-
- ### Response Generation
- - The final ranking is summarized again (FS) to create a final summary.
- - The AI-generated response (KB) and final summary (FS) are combined to form the final response.
-
- ### Completion
- - The final response is sent back to the Gradio Chat Interface (UI) for you to see.
-
- ### Core Classes
- - **BM25:** Custom implementation for document ranking
- - **Search and Scrape Pipeline:** Handles query processing, web search, content scraping, document analysis, and content summarization.
-
- ## 5. Main Functions
- - **`determine_query_type(query, chat_history, llm_client)`**: Determines whether to use knowledge base or web search based on context.
- - **`search_and_scrape(query, chat_history, ...)`**: Main function for web search and content aggregation.
- - **`rerank_documents_with_priority(query, documents, entity_domain, ...)`**: Hybrid ranking using BM25 and semantic similarity.
- - **`llm_summarize(json_input, model, temperature)`**: Generates summaries using the specified LLM and handles citation and formatting.
-
- ## 6. API Integration
- - **Required API Keys**: Hugging Face, Groq, Mistral, SearXNG
- - **Environment Variables Setup**: Use dotenv to load environment variables
-
- ## 8. Advanced Parameters
-
- | **Parameter** | **Description** | **Range/Options** | **Default** | **Usage** |
- |--------------------------|---------------------------------------------------------------|--------------------------------------------|---------------|---------------------------------------------------------------|
- | **Number of Results** | Number of search results retrieved. | 5 to 20 | 5 | Controls number of links/articles fetched from web searches. |
- | **Maximum Characters** | Limits characters per document processed. | 500 to 10,000 | 3000 | Truncates long documents, focusing on relevant information. |
- | **Time Range** | Specifies the time period for search results. | day, week, month, year | month | Filters results based on recent or historical data. |
- | **Language Selection** | Filters search results by language. | `en`, `fr`, `es`, etc. | `en` | Retrieves content in a specified language. |
- | **LLM Temperature** | Controls randomness in responses from LLM. | 0.0 to 1.0 | 0.2 | Low values for factual responses; higher for creative ones. |
- | **Search Engines** | Specifies search engines used for scraping. | Google, Bing, DuckDuckGo, etc. | All engines | Choose specific search engines for better or private results. |
- | **Safe Search Level** | Filters explicit/inappropriate content. | 0: No filter, 1: Moderate, 2: Strict | 2 (Strict) | Ensures family-friendly or professional content. |
- | **Model Selection** | Chooses the LLM for summaries or responses. | Mistral, GPT-4, Groq | Varies | Select models based on performance or speed. |
- | **PDF Processing Toggle** | Enables/disables PDF document processing. | `True` (process) or `False` (skip) | `False` | Processes PDFs, useful for reports but may slow down speed. |
 
+ title: SearXNG Web Search
+ emoji: 💬
+ colorFrom: yellow
+ colorTo: purple
+ sdk: gradio
+ sdk_version: 4.36.1
+ app_file: app.py
+ pinned: false
+ license: apache-2.0
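
The removed README describes a ranking step that combines BM25 scoring with semantic similarity into a final ranking, but the diff does not show how `rerank_documents_with_priority` does this. The following is a minimal sketch of that idea, not the project's actual code. The function names (`bm25_scores`, `cosine_similarity`, `hybrid_rank`), the whitespace tokenizer, the `k1`/`b` defaults, and the equal weighting are all assumptions made for illustration. A bag-of-words cosine stands in for the Sentence-Transformer embeddings the README mentions, so that the example runs without downloading a model.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (a simplification for this sketch)."""
    return text.lower().split()


def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25."""
    docs = [tokenize(d) for d in documents]
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n if n else 0.0
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization penalizes documents longer than average.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def cosine_similarity(a, b):
    """Bag-of-words cosine; an assumed stand-in for embedding similarity."""
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(
        sum(v * v for v in vb.values())
    )
    return dot / norm if norm else 0.0


def hybrid_rank(query, documents, bm25_weight=0.5):
    """Return (document, combined_score) pairs, best match first."""
    bm25 = bm25_scores(query, documents)
    top = max(bm25, default=0.0)
    # Scale BM25 into [0, 1] so it is comparable with cosine similarity.
    bm25_norm = [s / top if top > 0 else 0.0 for s in bm25]
    sims = [cosine_similarity(query, d) for d in documents]
    combined = [
        bm25_weight * bm + (1 - bm25_weight) * sim
        for bm, sim in zip(bm25_norm, sims)
    ]
    order = sorted(range(len(documents)), key=lambda i: combined[i], reverse=True)
    return [(documents[i], combined[i]) for i in order]


documents = [
    "Federal Reserve raises interest rates",
    "Local bakery wins award",
    "Interest rates and inflation outlook for banks",
]
ranked = hybrid_rank("interest rates", documents)
for doc, score in ranked:
    print(f"{score:.3f}  {doc}")
```

Normalizing BM25 by its maximum before mixing is one simple way to keep either signal from dominating; a real pipeline might instead use rank fusion or weights tuned on labeled queries.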