SearXNG-WebSearch-Agent

Shreyas094 committed on Oct 18, 2024

Commit b8a0c9a (verified) · 1 Parent(s): 8f3ce4a

Update README.md

Files changed (1)

README.md +129 -1

@@ -10,4 +10,132 @@ pinned: false
 license: apache-2.0
 ---
-An example chatbot using [Gradio](https://gradio.app), [`huggingface_hub`](https://huggingface.co/docs/huggingface_hub/v0.22.2/en/index), and the [Hugging Face Inference API](https://huggingface.co/docs/api-inference/index).

+An example chatbot using [Gradio](https://gradio.app), [`huggingface_hub`](https://huggingface.co/docs/huggingface_hub/v0.22.2/en/index), and the [Hugging Face Inference API](https://huggingface.co/docs/api-inference/index).
+# Web Scraper for Financial News with Sentinel AI
+## Table of Contents
+1. [Overview](#overview)
+2. [Core Components](#core-components)
+   - [Search Engine Integration](#search-engine-integration)
+   - [AI Models Integration](#ai-models-integration)
+   - [Content Processing](#content-processing)
+3. [Key Features](#key-features)
+   - [Intelligent Query Processing](#intelligent-query-processing)
+   - [Content Analysis](#content-analysis)
+   - [Search Optimization](#search-optimization)
+4. [Architecture](#architecture)
+   - [User Interface (UI)](#user-interface-ui)
+   - [Query Processing](#query-processing)
+   - [Search Engine](#search-engine)
+   - [Content Analysis](#content-analysis)
+   - [Ranking System](#ranking-system)
+   - [Response Generation](#response-generation)
+   - [Core Classes](#core-classes)
+5. [Main Functions](#main-functions)
+6. [API Integration](#api-integration)
+8. [Advanced Parameters](#advanced-parameters)
+## 1. Overview
+This application is a web scraper and AI-powered chat interface designed for financial news analysis. It combines web scraping with multiple Large Language Models (LLMs) to give context-aware answers to user queries about financial information.
+## 2. Core Components
+### Search Engine Integration
+- Uses SearXNG as the primary search meta-engine
+- Supports multiple search engines (Google, Bing, DuckDuckGo, etc.)
+- Implements custom retry mechanisms and timeout handling
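The retry and timeout handling mentioned above could look roughly like the sketch below. It is not the project's actual code: the function name `call_with_retry`, the exponential backoff policy, and the default values are all assumptions chosen for illustration. The network call itself is passed in as a callable, so a real caller might wrap something like `requests.get(url, timeout=10)` in it.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    fetch: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `fetch`, retrying with exponential backoff when it raises.

    Hypothetical helper: the delay doubles after each failed attempt
    (base_delay, 2 * base_delay, 4 * base_delay, ...).
    """
    last_error = None
    for attempt in range(max_attempts):
        try:
            return fetch()
        except Exception as exc:  # a real pipeline would catch narrower errors
            last_error = exc
            # Wait before the next attempt, but not after the final one.
            if attempt < max_attempts - 1:
                sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"gave up after {max_attempts} attempts") from last_error
```

Injecting `sleep` keeps the helper easy to exercise without real waiting; in production the default `time.sleep` is used.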
+### AI Models Integration
+- Supports multiple LLM providers: Hugging Face (Mistral-Small-Instruct), Groq (Llama-3.1-70b), Mistral AI (Open-Mistral-Nemo)
+- Computes semantic similarity using Sentence-Transformers
+### Content Processing
+- PDF processing with PyPDF2
+- Web content scraping with Newspaper3k
+- BM25 ranking algorithm implementation
+- Document deduplication and relevance assessment
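For readers unfamiliar with BM25, here is a minimal sketch of the standard Okapi BM25 formula. It is not the README's own `BM25` class, whose code is not shown here: the class name `BM25Sketch`, whitespace tokenization, the `k1` and `b` defaults, and the "+1 inside the logarithm" IDF variant are assumptions.

```python
import math
from collections import Counter


class BM25Sketch:
    """Illustrative Okapi BM25 scorer over a small in-memory corpus."""

    def __init__(self, corpus, k1=1.5, b=0.75):
        self.k1 = k1
        self.b = b
        # Naive tokenization: lowercase and split on whitespace.
        self.docs = [doc.lower().split() for doc in corpus]
        self.doc_count = len(self.docs)
        total_len = sum(len(d) for d in self.docs)
        self.avg_len = total_len / self.doc_count if total_len else 1.0
        self.term_freqs = [Counter(d) for d in self.docs]
        # Document frequency: in how many documents each term appears.
        doc_freq = Counter(term for d in self.docs for term in set(d))
        self.idf = {
            term: math.log((self.doc_count - n + 0.5) / (n + 0.5) + 1.0)
            for term, n in doc_freq.items()
        }

    def score(self, query, index):
        """BM25 score of document `index` for `query`."""
        freqs = self.term_freqs[index]
        # Longer-than-average documents are penalized through this factor.
        length_norm = 1 - self.b + self.b * len(self.docs[index]) / self.avg_len
        total = 0.0
        for term in query.lower().split():
            tf = freqs.get(term, 0)
            if tf == 0:
                continue
            total += self.idf[term] * tf * (self.k1 + 1) / (tf + self.k1 * length_norm)
        return total

    def rank(self, query):
        """Document indices ordered from highest to lowest score."""
        scores = [self.score(query, i) for i in range(self.doc_count)]
        return sorted(range(self.doc_count), key=lambda i: scores[i], reverse=True)
```

With two documents that both contain the query terms, the shorter one scores higher because of the length normalization controlled by `b`.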
+## 3. Key Features
+### Intelligent Query Processing
+- Query type determination (knowledge base vs. web search)
+- Query rephrasing for optimal search results
+- Entity recognition
+- Time-aware query modification
+### Content Analysis
+- Relevance assessment
+- Content summarization
+- Semantic similarity comparison
+- Document deduplication
+- Priority-based content ranking
+### Search Optimization
+- Custom retry mechanism
+- Rate limiting
+- Error handling
+- Content filtering and validation
+## 4. Architecture
+### User Interface (UI)
+- You start by interacting with a Gradio Chat Interface.
+### Query Processing
+- Your query is sent to the Query Analysis (QA) section.
+- The system then determines the type of query (DT).
+- If the query can be answered from the knowledge base, the system generates an AI response (KB).
+- If it requires web searching, it rephrases the query (QR) for web search.
+- The system extracts the entity domain (ED) from the rephrased query.
+### Search Engine
+- The extracted entity domain is sent to the SearXNG Search Engine (SE).
+- The search engine returns the search results (SR).
+### Content Analysis
+- The search results are processed by web scraping (WS).
+- If the content is in PDF format, it is scraped using PDF Scraping (PDF).
+- If in HTML format, it's scraped using Newspaper3k Scraping (NEWS).
+- Relevant content is summarized (DS) and checked for uniqueness (UC).
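The uniqueness check (UC) can be illustrated with a small near-duplicate filter. This sketch uses Jaccard similarity over word sets as a simple stand-in; the project itself may compare embeddings instead. The function names and the `0.8` threshold are assumptions.

```python
def jaccard(a, b):
    """Jaccard similarity between the lowercase word sets of two texts."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0  # two empty texts are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


def drop_near_duplicates(summaries, threshold=0.8):
    """Keep each summary only if it is not too similar to one already kept."""
    unique = []
    for text in summaries:
        if all(jaccard(text, kept) < threshold for kept in unique):
            unique.append(text)
    return unique
```

Earlier items win: when two summaries are near-duplicates, the first one seen is kept.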
+### Ranking System
+- Content is ranked (DR) based on:
+  - **BM25 Scoring (BM):** A scoring method to rank documents.
+  - **Semantic Similarity (SS):** How similar the content is to the query.
+- The scores are combined (CS) to produce a final ranking (FR).
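One plausible way to combine the two signals is a weighted sum of min-max-normalized BM25 scores and cosine similarity. The README does not state the actual weighting, so `bm25_weight=0.4` and the function names below are assumptions. The vectors stand in for embeddings that a Sentence-Transformers model would produce.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def min_max_normalize(scores):
    """Rescale scores to [0, 1]; all-equal (non-empty) input maps to zeros."""
    low, high = min(scores), max(scores)
    if high == low:
        return [0.0 for _ in scores]
    return [(s - low) / (high - low) for s in scores]


def combine_scores(bm25_scores, query_vec, doc_vecs, bm25_weight=0.4):
    """Return (ranking, combined scores) for a hybrid lexical/semantic ranking."""
    lexical = min_max_normalize(bm25_scores)
    semantic = [cosine_similarity(query_vec, v) for v in doc_vecs]
    combined = [
        bm25_weight * lex + (1 - bm25_weight) * sem
        for lex, sem in zip(lexical, semantic)
    ]
    order = sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)
    return order, combined
```

Normalizing BM25 first matters because raw BM25 scores are unbounded, while cosine similarity lies in [-1, 1]; without it the lexical term could dominate the sum.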
+### Response Generation
+- The top-ranked content is summarized again (FS) to produce the final summary.
+- The AI-generated response (KB) and final summary (FS) are combined to form the final response.
+### Completion
+- The final response is sent back to the Gradio Chat Interface (UI) for you to see.
+### Core Classes
+- **BM25:** Custom implementation for document ranking
+- **Search and Scrape Pipeline:** Handles query processing, web search, content scraping, document analysis, and content summarization.
+## 5. Main Functions
+- **`determine_query_type(query, chat_history, llm_client)`**: Determines whether to use knowledge base or web search based on context.
+- **`search_and_scrape(query, chat_history, ...)`**: Main function for web search and content aggregation.
+- **`rerank_documents_with_priority(query, documents, entity_domain, ...)`**: Hybrid ranking using BM25 and semantic similarity.
+- **`llm_summarize(json_input, model, temperature)`**: Generates summaries using the specified LLM and handles citation and formatting.
+## 6. API Integration
+- **Required API Keys**: Hugging Face, Groq, Mistral, SearXNG
+- **Environment Variables Setup**: Use dotenv to load environment variables
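A minimal sketch of loading and checking the keys follows. The variable names (`HF_TOKEN`, `GROQ_API_KEY`, `MISTRAL_API_KEY`, `SEARXNG_KEY`) and both helper functions are hypothetical, since the README does not list the exact names. `load_dotenv` comes from the `python-dotenv` package and reads a local `.env` file into the process environment.

```python
import os

# Hypothetical variable names; use whatever the application actually reads.
REQUIRED_KEYS = ("HF_TOKEN", "GROQ_API_KEY", "MISTRAL_API_KEY", "SEARXNG_KEY")


def load_env_file():
    """Load a local .env file if python-dotenv is installed."""
    try:
        from dotenv import load_dotenv
    except ImportError:
        return False
    load_dotenv()
    return True


def load_api_keys(env=None, required=REQUIRED_KEYS):
    """Return the required keys, failing early with a list of missing names."""
    env = os.environ if env is None else env
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in required}
```

At startup, one would call `load_env_file()` and then `load_api_keys()`, so a missing key surfaces immediately rather than at the first API call.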
+## 8. Advanced Parameters
+| **Parameter**            | **Description**                                               | **Range/Options**                          | **Default**   | **Usage**                                                     |
+|--------------------------|---------------------------------------------------------------|--------------------------------------------|---------------|---------------------------------------------------------------|
+| **Number of Results**     | Number of search results retrieved.                           | 5 to 20                                    | 5             | Controls number of links/articles fetched from web searches.   |
+| **Maximum Characters**    | Limits characters per document processed.                     | 500 to 10,000                              | 3000          | Truncates long documents, focusing on relevant information.    |
+| **Time Range**            | Specifies the time period for search results.                 | day, week, month, year                     | month         | Filters results based on recent or historical data.            |
+| **Language Selection**    | Filters search results by language.                           | `en`, `fr`, `es`, etc.                     | `en`          | Retrieves content in a specified language.                     |
+| **LLM Temperature**       | Controls randomness in responses from LLM.                    | 0.0 to 1.0                                 | 0.2           | Low values for factual responses; higher for creative ones.    |
+| **Search Engines**        | Specifies search engines used for scraping.                   | Google, Bing, DuckDuckGo, etc.             | All engines   | Choose specific search engines for better or private results.  |
+| **Safe Search Level**     | Filters explicit/inappropriate content.                       | 0: No filter, 1: Moderate, 2: Strict       | 2 (Strict)    | Ensures family-friendly or professional content.               |
+| **Model Selection**       | Chooses the LLM for summaries or responses.                   | Mistral, GPT-4, Groq                       | Varies        | Select models based on performance or speed.                   |
+| **PDF Processing Toggle** | Enables/disables PDF document processing.                     | `True` (process) or `False` (skip)         | `False`       | Processes PDFs, useful for reports but may slow down speed.    |
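The defaults and ranges in the table can be captured in a small settings object. This is only a sketch: the class name `SearchSettings` and its field names are assumptions, while the defaults and limits are taken from the table above. Search engine and model selection are left out because the table gives no fixed default for them.

```python
from dataclasses import dataclass


@dataclass
class SearchSettings:
    """Illustrative container for the advanced parameters."""

    num_results: int = 5        # 5 to 20
    max_chars: int = 3000       # 500 to 10,000
    time_range: str = "month"   # day, week, month, year
    language: str = "en"
    temperature: float = 0.2    # 0.0 to 1.0
    safesearch: int = 2         # 0: no filter, 1: moderate, 2: strict
    process_pdfs: bool = False

    def validate(self):
        """Return a list of human-readable problems (empty if all is valid)."""
        problems = []
        if not 5 <= self.num_results <= 20:
            problems.append("num_results must be between 5 and 20")
        if not 500 <= self.max_chars <= 10_000:
            problems.append("max_chars must be between 500 and 10000")
        if self.time_range not in ("day", "week", "month", "year"):
            problems.append("time_range must be day, week, month, or year")
        if not 0.0 <= self.temperature <= 1.0:
            problems.append("temperature must be between 0.0 and 1.0")
        if self.safesearch not in (0, 1, 2):
            problems.append("safesearch must be 0, 1, or 2")
        return problems


def truncate(text, settings):
    """Apply the Maximum Characters limit to one document."""
    return text[: settings.max_chars]
```

Returning a list of problems instead of raising on the first one lets a UI report every invalid field at once.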