Shreyas094's picture
Update README.md
5c2f4c9 verified
|
raw
history blame
7.56 kB
metadata
title: SearXNG Web Search
emoji: 💬
colorFrom: yellow
colorTo: purple
sdk: gradio
sdk_version: 5.3.0
app_file: app.py
pinned: false
license: apache-2.0

An example chatbot using Gradio, huggingface_hub, and the Hugging Face Inference API.

Web Scraper for Financial News with Sentinel AI

Table of Contents

  1. Overview
  2. Core Components
  3. Key Features
  4. Architecture
  5. Main Functions
  6. API Integration
  7. Advanced Parameters

1. Overview

This application is a sophisticated web scraper and AI-powered chat interface specifically designed for financial news analysis. It combines web scraping capabilities with multiple Language Learning Models (LLMs) to provide intelligent, context-aware responses to user queries about financial information.

2. Core Components

Search Engine Integration

  • Uses SearXNG as the primary search meta-engine
  • Supports multiple search engines (Google, Bing, DuckDuckGo, etc.)
  • Implements custom retry mechanisms and timeout handling

AI Models Integration

  • Supports multiple LLM providers: Hugging Face (Mistral-Small-Instruct), Groq (Llama-3.1-70b), Mistral AI (Open-Mistral-Nemo)
  • Implements semantic similarity using Sentence-Transformer

Content Processing

  • PDF processing with PyPDF2
  • Web content scraping with Newspaper3k
  • BM25 ranking algorithm implementation
  • Document deduplication and relevance assessment

3. Key Features

Intelligent Query Processing

  • Query type determination (knowledge base vs. web search)
  • Query rephrasing for optimal search results
  • Entity recognition
  • Time-aware query modification

Content Analysis

  • Relevance assessment
  • Content summarization
  • Semantic similarity comparison
  • Document deduplication
  • Priority-based content ranking

Search Optimization

  • Custom retry mechanism
  • Rate limiting
  • Error handling
  • Content filtering and validation

4. Architecture

User Interface (UI)

  • You start by interacting with a Gradio Chat Interface.

Query Processing

  • Your query is sent to the Query Analysis (QA) section.
  • The system then determines the type of query (DT).
  • If it's a type that can use a Knowledge Base, it generates an AI response (KB).
  • If it requires web searching, it rephrases the query (QR) for web search.
  • The system extracts the entity domain (ED) from the rephrased query.

Search Engine

  • The extracted entity domain is sent to the SearXNG Search Engine (SE).
  • The search engine returns the search results (SR).

Content Analysis

  • The search results are processed by web scraping (WS).
  • If the content is in PDF format, it is scraped using PDF Scraping (PDF).
  • If in HTML format, it's scraped using Newspaper3k Scraping (NEWS).
  • Relevant content is summarized (DS) and checked for uniqueness (UC).

Ranking System

  • Content is ranked (DR) based on:
    • BM25 Scoring (BM): A scoring method to rank documents.
    • Semantic Similarity (SS): How similar the content is to the query.
  • The scores are combined (CS) to produce a final ranking (FR).

Response Generation

  • The final ranking is summarized again (FS) to create a final summary.
  • The AI-generated response (KB) and final summary (FS) are combined to form the final response.

Completion

  • The final response is sent back to the Gradio Chat Interface (UI) for you to see.

Core Classes

  • BM25: Custom implementation for document ranking
  • Search and Scrape Pipeline: Handles query processing, web search, content scraping, document analysis, and content summarization.

5. Main Functions

  • determine_query_type(query, chat_history, llm_client): Determines whether to use knowledge base or web search based on context.
  • search_and_scrape(query, chat_history, ...): Main function for web search and content aggregation.
  • rerank_documents_with_priority(query, documents, entity_domain, ...): Hybrid ranking using BM25 and semantic similarity.
  • llm_summarize(json_input, model, temperature): Generates summaries using the specified LLM and handles citation and formatting.

6. API Integration

  • Required API Keys: Hugging Face, Groq, Mistral, SearXNG
  • Environment Variables Setup: Use dotenv to load environment variables

8. Advanced Parameters

Parameter Description Range/Options Default Usage
Number of Results Number of search results retrieved. 5 to 20 5 Controls number of links/articles fetched from web searches.
Maximum Characters Limits characters per document processed. 500 to 10,000 3000 Truncates long documents, focusing on relevant information.
Time Range Specifies the time period for search results. day, week, month, year month Filters results based on recent or historical data.
Language Selection Filters search results by language. en, fr, es, etc. en Retrieves content in a specified language.
LLM Temperature Controls randomness in responses from LLM. 0.0 to 1.0 0.2 Low values for factual responses; higher for creative ones.
Search Engines Specifies search engines used for scraping. Google, Bing, DuckDuckGo, etc. All engines Choose specific search engines for better or private results.
Safe Search Level Filters explicit/inappropriate content. 0: No filter, 1: Moderate, 2: Strict 2 (Strict) Ensures family-friendly or professional content.
Model Selection Chooses the LLM for summaries or responses. Mistral, GPT-4, Groq Varies Select models based on performance or speed.
PDF Processing Toggle Enables/disables PDF document processing. True (process) or False (skip) False Processes PDFs, useful for reports but may slow down speed.