metadata

title: SearXNG Web Search
emoji: 🌍
colorFrom: yellow
colorTo: purple
sdk: gradio
sdk_version: 4.44.1
app_file: app.py
pinned: true
license: apache-2.0
short_description: Web Search AI

An example chatbot using Gradio, huggingface_hub, and the Hugging Face Inference API.

Web Scraper for Financial News with Sentinel AI

Overview
Core Components
Key Features
Architecture
Main Functions
API Integration
Advanced Parameters

1. Overview

This application is a sophisticated web scraper and AI-powered chat interface specifically designed for financial news analysis. It combines web scraping capabilities with multiple Language Learning Models (LLMs) to provide intelligent, context-aware responses to user queries about financial information.

2. Core Components

Search Engine Integration

Uses SearXNG as the primary search meta-engine
Supports multiple search engines (Google, Bing, DuckDuckGo, etc.)
Implements custom retry mechanisms and timeout handling

AI Models Integration

Supports multiple LLM providers: Hugging Face (Mistral-Small-Instruct), Groq (Llama-3.1-70b), Mistral AI (Open-Mistral-Nemo)
Implements semantic similarity using Sentence-Transformer

Content Processing

PDF processing with PyPDF2
Web content scraping with Newspaper3k
BM25 ranking algorithm implementation
Document deduplication and relevance assessment

3. Key Features

Intelligent Query Processing

Query type determination (knowledge base vs. web search)
Query rephrasing for optimal search results
Entity recognition
Time-aware query modification

Content Analysis

Relevance assessment
Content summarization
Semantic similarity comparison
Document deduplication
Priority-based content ranking

Search Optimization

Custom retry mechanism
Rate limiting
Error handling
Content filtering and validation

4. Architecture

User Interface (UI)

You start by interacting with a Gradio Chat Interface.

Query Processing

Your query is sent to the Query Analysis (QA) section.
The system then determines the type of query (DT).
If it's a type that can use a Knowledge Base, it generates an AI response (KB).
If it requires web searching, it rephrases the query (QR) for web search.
The system extracts the entity domain (ED) from the rephrased query.

Search Engine

The extracted entity domain is sent to the SearXNG Search Engine (SE).
The search engine returns the search results (SR).

Content Analysis

The search results are processed by web scraping (WS).
If the content is in PDF format, it is scraped using PDF Scraping (PDF).
If in HTML format, it's scraped using Newspaper3k Scraping (NEWS).
Relevant content is summarized (DS) and checked for uniqueness (UC).

Ranking System

Content is ranked (DR) based on:
- BM25 Scoring (BM): A scoring method to rank documents.
- Semantic Similarity (SS): How similar the content is to the query.
The scores are combined (CS) to produce a final ranking (FR).

Response Generation

The final ranking is summarized again (FS) to create a final summary.
The AI-generated response (KB) and final summary (FS) are combined to form the final response.

Completion

The final response is sent back to the Gradio Chat Interface (UI) for you to see.

Core Classes

BM25: Custom implementation for document ranking
Search and Scrape Pipeline: Handles query processing, web search, content scraping, document analysis, and content summarization.

5. Main Functions

determine_query_type(query, chat_history, llm_client): Determines whether to use knowledge base or web search based on context.
search_and_scrape(query, chat_history, ...): Main function for web search and content aggregation.
rerank_documents_with_priority(query, documents, entity_domain, ...): Hybrid ranking using BM25 and semantic similarity.
llm_summarize(json_input, model, temperature): Generates summaries using the specified LLM and handles citation and formatting.

6. API Integration

Required API Keys: Hugging Face, Groq, Mistral, SearXNG
Environment Variables Setup: Use dotenv to load environment variables

8. Advanced Parameters

Parameter	Description	Range/Options	Default	Usage
Number of Results	Number of search results retrieved.	5 to 20	5	Controls number of links/articles fetched from web searches.
Maximum Characters	Limits characters per document processed.	500 to 10,000	3000	Truncates long documents, focusing on relevant information.
Time Range	Specifies the time period for search results.	day, week, month, year	month	Filters results based on recent or historical data.
Language Selection	Filters search results by language.	`en`, `fr`, `es`, etc.	`en`	Retrieves content in a specified language.
LLM Temperature	Controls randomness in responses from LLM.	0.0 to 1.0	0.2	Low values for factual responses; higher for creative ones.
Search Engines	Specifies search engines used for scraping.	Google, Bing, DuckDuckGo, etc.	All engines	Choose specific search engines for better or private results.
Safe Search Level	Filters explicit/inappropriate content.	0: No filter, 1: Moderate, 2: Strict	2 (Strict)	Ensures family-friendly or professional content.
Model Selection	Chooses the LLM for summaries or responses.	Mistral, GPT-4, Groq	Varies	Select models based on performance or speed.
PDF Processing Toggle	Enables/disables PDF document processing.	`True` (process) or `False` (skip)	`False`	Processes PDFs, useful for reports but may slow down speed.

Spaces:

Duplicated from Shreyas094/SearXNG-WebSearch-Test-4

Shreyas094
/

SearXNG-WebSearch-Agent

Running

Web Scraper for Financial News with Sentinel AI

Table of Contents

1. Overview

2. Core Components

Search Engine Integration

AI Models Integration

Content Processing

3. Key Features

Intelligent Query Processing

Content Analysis

Search Optimization

4. Architecture

User Interface (UI)

Query Processing

Search Engine

Content Analysis

Ranking System

Response Generation

Completion

Core Classes

5. Main Functions

6. API Integration

8. Advanced Parameters