SearXNG-WebSearch-Agent

Shreyas094

	---
	title: SearXNG Web Search
	emoji: 🌍
	colorFrom: yellow
	colorTo: purple
	sdk: gradio
	sdk_version: 4.44.1
	app_file: app.py
	pinned: true
	license: apache-2.0
	short_description: Web Search AI
	---

	An example chatbot using [Gradio](https://gradio.app), [`huggingface_hub`](https://huggingface.co/docs/huggingface_hub/v0.22.2/en/index), and the [Hugging Face Inference API](https://huggingface.co/docs/api-inference/index).

	# Web Scraper for Financial News with Sentinel AI

	## Table of Contents
	1. [Overview](#overview)
	2. [Core Components](#core-components)
	   - [Search Engine Integration](#search-engine-integration)
	   - [AI Models Integration](#ai-models-integration)
	   - [Content Processing](#content-processing)
	3. [Key Features](#key-features)
	   - [Intelligent Query Processing](#intelligent-query-processing)
	   - [Content Analysis](#content-analysis)
	   - [Search Optimization](#search-optimization)
	4. [Architecture](#architecture)
	   - [User Interface (UI)](#user-interface-ui)
	   - [Query Processing](#query-processing)
	   - [Search Engine](#search-engine)
	   - [Content Analysis](#content-analysis)
	   - [Ranking System](#ranking-system)
	   - [Response Generation](#response-generation)
	   - [Completion](#completion)
	   - [Core Classes](#core-classes)
	5. [Main Functions](#main-functions)
	6. [API Integration](#api-integration)
	8. [Advanced Parameters](#advanced-parameters)

	## 1. Overview
	This application is a web scraper and AI-powered chat interface designed for financial news analysis. It combines web scraping with multiple large language models (LLMs) to give context-aware answers to user queries about financial information.

	## 2. Core Components

	### Search Engine Integration
	- Uses SearXNG as the primary search meta-engine
	- Supports multiple search engines (Google, Bing, DuckDuckGo, etc.)
	- Implements custom retry mechanisms and timeout handling
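
The retry and timeout handling mentioned above could look roughly like the following. This is a minimal sketch, not the app's actual code: the function names `build_search_params` and `with_retries`, the default engine list, and the backoff values are all assumptions. The query parameters (`q`, `format`, `engines`, `language`, `time_range`, `safesearch`) are the ones SearXNG's search API accepts.

```python
import time
from typing import Any, Callable, Dict, Sequence


def build_search_params(
    query: str,
    engines: Sequence[str] = ("google", "bing", "duckduckgo"),
    language: str = "en",
    time_range: str = "month",
    safesearch: int = 2,
) -> Dict[str, Any]:
    """Build query parameters for a SearXNG /search request."""
    return {
        "q": query,
        "format": "json",
        "engines": ",".join(engines),
        "language": language,
        "time_range": time_range,
        "safesearch": safesearch,
    }


def with_retries(
    fetch: Callable[[Dict[str, Any]], Any],
    params: Dict[str, Any],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Call fetch(params), retrying with exponential backoff on any exception."""
    last_error = None
    for attempt in range(max_attempts):
        try:
            return fetch(params)
        except Exception as exc:  # e.g. a timeout or connection error
            last_error = exc
            if attempt < max_attempts - 1:
                sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"search failed after {max_attempts} attempts") from last_error


# Hypothetical usage with the `requests` library and a placeholder instance URL:
#   fetch = lambda p: requests.get("https://searx.example/search", params=p, timeout=10).json()
#   results = with_retries(fetch, build_search_params("tesla earnings"))
```

Passing `fetch` and `sleep` in as arguments keeps the retry logic independent of any HTTP client, which also makes it easy to exercise without network access.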

	### AI Models Integration
	- Supports multiple LLM providers: Hugging Face (Mistral-Small-Instruct), Groq (Llama-3.1-70b), Mistral AI (Open-Mistral-Nemo)
	- Implements semantic similarity using Sentence-Transformers

	### Content Processing
	- PDF processing with PyPDF2
	- Web content scraping with Newspaper3k
	- BM25 ranking algorithm implementation
	- Document deduplication and relevance assessment
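
Document deduplication can be illustrated with a simple token-overlap check. The README does not say how the app detects duplicates, so the Jaccard measure, the `0.8` threshold, and the function names below are assumptions chosen for illustration; an embedding-based comparison would fit the same shape.

```python
import re
from typing import List, Set


def _tokens(text: str) -> Set[str]:
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the token sets of two texts."""
    ta, tb = _tokens(a), _tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def deduplicate(documents: List[str], threshold: float = 0.8) -> List[str]:
    """Keep each document only if it is not too similar to one already kept."""
    unique: List[str] = []
    for doc in documents:
        if all(jaccard(doc, kept) < threshold for kept in unique):
            unique.append(doc)
    return unique
```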

	## 3. Key Features

	### Intelligent Query Processing
	- Query type determination (knowledge base vs. web search)
	- Query rephrasing for optimal search results
	- Entity recognition
	- Time-aware query modification

	### Content Analysis
	- Relevance assessment
	- Content summarization
	- Semantic similarity comparison
	- Document deduplication
	- Priority-based content ranking

	### Search Optimization
	- Custom retry mechanism
	- Rate limiting
	- Error handling
	- Content filtering and validation

	## 4. Architecture

	### User Interface (UI)
	- You start by interacting with a Gradio Chat Interface.

	### Query Processing
	- Your query is sent to the Query Analysis (QA) section.
	- The system then determines the type of query (DT).
	- If it's a type that can use a Knowledge Base, it generates an AI response (KB).
	- If it requires web searching, it rephrases the query (QR) for web search.
	- The system extracts the entity domain (ED) from the rephrased query.

	### Search Engine
	- The extracted entity domain is sent to the SearXNG Search Engine (SE).
	- The search engine returns the search results (SR).

	### Content Analysis
	- The search results are processed by web scraping (WS).
	- If the content is in PDF format, it is scraped using PDF Scraping (PDF).
	- If in HTML format, it's scraped using Newspaper3k Scraping (NEWS).
	- Relevant content is summarized (DS) and checked for uniqueness (UC).
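
The PDF/HTML branching described above can be sketched as a small dispatch function. The name `choose_scraper`, the `"skip"` outcome, and the detection rule (URL suffix or `Content-Type` header) are assumptions; the README only states that PDFs go to PyPDF2 and HTML pages to Newspaper3k, and that PDF processing can be toggled.

```python
from urllib.parse import urlparse


def choose_scraper(url: str, content_type: str = "", process_pdfs: bool = False) -> str:
    """Return "pdf", "html", or "skip" for a search result.

    A result is treated as a PDF when the Content-Type header says so or the
    URL path ends in .pdf. PDFs are skipped unless process_pdfs is enabled.
    """
    path = urlparse(url).path.lower()
    is_pdf = "application/pdf" in content_type.lower() or path.endswith(".pdf")
    if is_pdf:
        return "pdf" if process_pdfs else "skip"
    return "html"


# The caller would then hand "pdf" results to a PyPDF2-based extractor and
# "html" results to a Newspaper3k-based one, truncating the extracted text
# to the configured maximum number of characters.
```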

	### Ranking System
	- Content is ranked (DR) based on:
	   - BM25 Scoring (BM): A scoring method to rank documents.
	   - Semantic Similarity (SS): How similar the content is to the query.
	- The scores are combined (CS) to produce a final ranking (FR).
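
One way to combine the two scores is to normalize each to the range 0 to 1 and take a weighted sum. The README does not specify the combination rule, so the min-max normalization, the equal default weighting, and the function names below are assumptions.

```python
from typing import List, Sequence


def min_max(values: Sequence[float]) -> List[float]:
    """Scale values to [0, 1]; a constant list maps to all zeros."""
    if not values:
        return []
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def combine_scores(
    bm25_scores: Sequence[float],
    semantic_scores: Sequence[float],
    bm25_weight: float = 0.5,
) -> List[float]:
    """Weighted sum of normalized BM25 and semantic-similarity scores."""
    if len(bm25_scores) != len(semantic_scores):
        raise ValueError("score lists must have the same length")
    nb, ns = min_max(bm25_scores), min_max(semantic_scores)
    return [bm25_weight * b + (1 - bm25_weight) * s for b, s in zip(nb, ns)]


def rank(documents: Sequence[str], bm25_scores, semantic_scores, bm25_weight: float = 0.5) -> List[str]:
    """Return documents ordered from highest to lowest combined score."""
    combined = combine_scores(bm25_scores, semantic_scores, bm25_weight)
    order = sorted(range(len(documents)), key=lambda i: combined[i], reverse=True)
    return [documents[i] for i in order]
```

Normalizing first matters because raw BM25 scores are unbounded while cosine similarities are not, so an unnormalized sum would let BM25 dominate.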

	### Response Generation
	- The final ranking is summarized again (FS) to create a final summary.
	- The AI-generated response (KB) and final summary (FS) are combined to form the final response.

	### Completion
	- The final response is sent back to the Gradio Chat Interface (UI) for you to see.

	### Core Classes
	- BM25: Custom implementation for document ranking
	- Search and Scrape Pipeline: Handles query processing, web search, content scraping, document analysis, and content summarization.
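
For readers unfamiliar with BM25, a compact version of such a class is sketched below. It follows the standard Okapi BM25 formula with a smoothed IDF; the tokenizer, the parameter defaults (`k1=1.5`, `b=0.75`), and the method names are assumptions and may differ from the class in `app.py`.

```python
import math
import re
from collections import Counter
from typing import List


class BM25:
    """Minimal Okapi BM25 scorer over a fixed corpus of strings."""

    def __init__(self, corpus: List[str], k1: float = 1.5, b: float = 0.75):
        self.k1 = k1
        self.b = b
        self.docs = [self.tokenize(d) for d in corpus]
        self.doc_lens = [len(d) for d in self.docs]
        self.avgdl = sum(self.doc_lens) / len(self.docs) if self.docs else 0.0
        self.term_freqs = [Counter(d) for d in self.docs]
        # Document frequency: number of documents containing each term.
        df = Counter()
        for d in self.docs:
            df.update(set(d))
        n = len(self.docs)
        # Smoothed IDF that stays positive even for very common terms.
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    @staticmethod
    def tokenize(text: str) -> List[str]:
        return re.findall(r"[a-z0-9]+", text.lower())

    def score(self, query: str, index: int) -> float:
        """BM25 score of the document at `index` for the query."""
        tf = self.term_freqs[index]
        # Length normalization: longer-than-average documents are penalized.
        norm = 1 - self.b + self.b * self.doc_lens[index] / (self.avgdl or 1.0)
        total = 0.0
        for term in self.tokenize(query):
            f = tf.get(term, 0)
            if f:
                total += self.idf[term] * f * (self.k1 + 1) / (f + self.k1 * norm)
        return total

    def get_scores(self, query: str) -> List[float]:
        return [self.score(query, i) for i in range(len(self.docs))]
```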

	## 5. Main Functions
	- `determine_query_type(query, chat_history, llm_client)`: Determines whether to answer from the knowledge base or run a web search, based on the query and chat context.
	- `search_and_scrape(query, chat_history, ...)`: Main function for web search and content aggregation.
	- `rerank_documents_with_priority(query, documents, entity_domain, ...)`: Hybrid ranking using BM25 and semantic similarity.
	- `llm_summarize(json_input, model, temperature)`: Generates summaries using the specified LLM and handles citation and formatting.

	## 6. API Integration
	- Required API Keys: Hugging Face, Groq, Mistral, SearXNG
	- Environment Variables Setup: Use dotenv to load environment variables
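
A minimal sketch of the environment setup follows. The variable names in `REQUIRED_KEYS` are hypothetical, since the README does not list them; check `app.py` for the names the app actually reads. `load_dotenv` comes from the `python-dotenv` package and is skipped here if that package is not installed.

```python
import os
from typing import Dict, Mapping, Optional

# Hypothetical variable names; the real ones are defined by the application.
REQUIRED_KEYS = ("HF_TOKEN", "GROQ_API_KEY", "MISTRAL_API_KEY", "SEARXNG_KEY")


def read_api_keys(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return the required API keys, raising KeyError if any are missing."""
    if env is None:
        try:
            from dotenv import load_dotenv  # provided by python-dotenv
            load_dotenv()  # copies entries from a local .env file into os.environ
        except ImportError:
            pass  # fall back to variables already set in the environment
        env = os.environ
    missing = [k for k in REQUIRED_KEYS if not env.get(k)]
    if missing:
        raise KeyError(f"missing environment variables: {', '.join(missing)}")
    return {k: env[k] for k in REQUIRED_KEYS}
```

Failing at startup with the list of missing names is usually easier to debug than a provider error raised in the middle of a request.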

	## 8. Advanced Parameters

	| Parameter | Description | Range/Options | Default | Usage |
	|---|---|---|---|---|
	| Number of Results | Number of search results retrieved. | 5 to 20 | 5 | Controls the number of links/articles fetched from web searches. |
	| Maximum Characters | Limits characters per document processed. | 500 to 10,000 | 3000 | Truncates long documents, focusing on relevant information. |
	| Time Range | Specifies the time period for search results. | day, week, month, year | month | Filters results based on recent or historical data. |
	| Language Selection | Filters search results by language. | `en`, `fr`, `es`, etc. | `en` | Retrieves content in a specified language. |
	| LLM Temperature | Controls randomness in LLM responses. | 0.0 to 1.0 | 0.2 | Low values for factual responses; higher for creative ones. |
	| Search Engines | Specifies search engines used for scraping. | Google, Bing, DuckDuckGo, etc. | All engines | Choose specific search engines for better or more private results. |
	| Safe Search Level | Filters explicit/inappropriate content. | 0: No filter, 1: Moderate, 2: Strict | 2 (Strict) | Ensures family-friendly or professional content. |
	| Model Selection | Chooses the LLM for summaries or responses. | Mistral, GPT-4, Groq | Varies | Select models based on performance or speed. |
	| PDF Processing Toggle | Enables/disables PDF document processing. | `True` (process) or `False` (skip) | `False` | Processes PDFs; useful for reports but may slow down processing. |
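
The ranges and defaults in the table can be captured in a small settings object that rejects out-of-range values. This is only a sketch: the class name `SearchSettings` and its field names are assumptions, and the search engine and model selections are left out because the table does not give fixed option lists for them.

```python
from dataclasses import dataclass

TIME_RANGES = ("day", "week", "month", "year")


@dataclass
class SearchSettings:
    """Advanced parameters with the defaults listed in the table above."""

    num_results: int = 5
    max_chars: int = 3000
    time_range: str = "month"
    language: str = "en"
    temperature: float = 0.2
    safesearch: int = 2
    process_pdfs: bool = False

    def __post_init__(self) -> None:
        if not 5 <= self.num_results <= 20:
            raise ValueError("num_results must be between 5 and 20")
        if not 500 <= self.max_chars <= 10_000:
            raise ValueError("max_chars must be between 500 and 10,000")
        if self.time_range not in TIME_RANGES:
            raise ValueError(f"time_range must be one of {TIME_RANGES}")
        if not 0.0 <= self.temperature <= 1.0:
            raise ValueError("temperature must be between 0.0 and 1.0")
        if self.safesearch not in (0, 1, 2):
            raise ValueError("safesearch must be 0, 1, or 2")


# Example: a wider, more recent search with PDF processing enabled.
settings = SearchSettings(num_results=10, time_range="week", process_pdfs=True)
```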