CyranoB committed
Commit 2e147cb · Parent: 7406911

Update README with improved documentation and usage examples

- Revamped README content with more detailed project description
- Reformatted command-line options into a markdown table for better readability
- Updated example commands to showcase new model and embedding options
- Clarified content extraction and vectorization process in project overview
- Enhanced documentation for search agent functionality and flexibility

Files changed (1): README.md (+20 -19)
README.md CHANGED
@@ -25,9 +25,9 @@ The main functionality of the script can be summarized as follows:

  1. **Query Optimization**: The user's input query is optimized for web search by identifying the key information requested and transforming it into a concise search string using the language model's capabilities.
  2. **Web Search**: The optimized search query is used to fetch search results from the Brave Search API. The script allows limiting the search to a specific domain and setting the maximum number of pages to retrieve.
- 3. **Content Extraction**: The script fetches the content of the retrieved search results, handling both HTML and PDF documents. It extracts the main text content from web pages and text from PDF files.
- 4. **Vectorization**: The extracted content is split into smaller text chunks using a RecursiveCharacterTextSplitter and vectorized using the specified embedding model. The vectorized data is stored in a FAISS vector store for efficient retrieval.
- 5. **Query Answering**: The user's original query is answered by retrieving the most relevant text chunks from the vector store. The language model generates an informative answer by synthesizing the retrieved information, citing the sources used, and formatting the response in Markdown.
+ 3. **Content Extraction**: The script fetches the content of the retrieved search results, handling both HTML and PDF documents. It uses the `trafilatura` library to extract the main text content from web pages and `pdfplumber` to extract text from PDF files. For complex web pages, Selenium can optionally be used to ensure accurate content retrieval. The extracted content is then prepared for further processing.
+ 4. **Vectorization**: The extracted content is split into smaller text chunks using a RecursiveCharacterTextSplitter. These chunks are then vectorized using the specified embedding model, if one is provided; otherwise spaCy is used for semantic search. The vectorized data is stored in a FAISS vector store for efficient retrieval.
+ 5. **Query Answering**: The user's original query is answered with a Retrieval-Augmented Generation (RAG) approach: the most relevant text chunks are retrieved from the vector store, and the language model generates an informative answer by synthesizing the retrieved information, citing the sources used, and formatting the response in Markdown.

  The script supports various options for customization, such as specifying the language model provider (OpenAI, Anthropic, Groq, or Ollama), temperature for language model generation, and output format (text or Markdown).

@@ -68,19 +68,20 @@ python search_agent.py [OPTIONS] SEARCH_QUERY
  ```

  ### Options:
-
- -h --help                          Show this screen.
- --version                          Show version.
- -c --copywrite                     First produce a draft, review it and rewrite for a final text
- -d domain --domain=domain          Limit search to a specific domain
- -t temp --temperature=temp         Set the temperature of the LLM [default: 0.0]
- -m model --model=model             Use a specific model [default: hf:Qwen/Qwen2.5-72B-Instruct]
- -e model --embedding_model=model   Use an embedding model
- -n num --max_pages=num             Max number of pages to retrieve [default: 10]
- -x num --max_extracts=num          Max number of page extract to consider [default: 7]
- -b --use_browser                   Use browser to fetch content from the web [default: False]
- -o text --output=text              Output format (choices: text, markdown) [default: markdown]
- -v --verbose                       Print verbose output [default: False]
+ | Option | Description |
+ |--------|-------------|
+ | `-h --help` | Show this screen |
+ | `--version` | Show version |
+ | `-c --copywrite` | First produce a draft, then review and rewrite it into a final text |
+ | `-d domain --domain=domain` | Limit search to a specific domain |
+ | `-t temp --temperature=temp` | Set the temperature of the LLM [default: 0.0] |
+ | `-m model --model=model` | Use a specific model [default: hf:Qwen/Qwen2.5-72B-Instruct] |
+ | `-e model --embedding_model=model` | Use an embedding model |
+ | `-n num --max_pages=num` | Max number of pages to retrieve [default: 10] |
+ | `-x num --max_extracts=num` | Max number of page extracts to consider [default: 7] |
+ | `-b --use_browser` | Use a browser to fetch content from the web [default: False] |
+ | `-o text --output=text` | Output format (choices: text, markdown) [default: markdown] |
+ | `-v --verbose` | Print verbose output [default: False] |

  The model can be a language model provider and a model name separated by a colon, e.g. `openai:gpt-4o-mini`.
  If an embedding model is not specified, spaCy will be used for semantic search.
@@ -93,15 +94,15 @@ python search_agent.py 'What is the radioactive anomaly in the Pacific Ocean?'
  ```

  ```bash
- python search_agent.py -m openai:gpt-4o-mini "Write a linked post about the current state of M&A for startups. Write in the style of Russ from Silicon Valley TV show."
+ python search_agent.py "Write a linked post about the current state of M&A for startups. Write in the style of Russ from Silicon Valley TV show." -m openai:gpt-4o-mini
  ```

  ```bash
- python search_agent.py -m groq:llama-3.1-70b-versatile -e ollama:nomic-embed-text:latest -t 0.7 -n 20 -x 15 "Write a linked post about the state of M&A for startups in 2024. Write in the style of Russ from TV show Silicon Valley" -s
+ python search_agent.py "Write a linked post about the state of M&A for startups in 2025. Write in the style of Russ from TV show Silicon Valley" -b -m groq:llama-3.1-8b-instant -e cohere:embed-multilingual-v3.0 -t 0.7 -n 20 -x 15
  ```

  ```bash
- python search_agent.py -m groq -e openai "Write an engaging long linked post about the state of M&A for startups in 2024"
+ python search_agent.py "Write an engaging long linked post about the state of M&A for startups in 2025" -m bedrock:claude-3-5-haiku-20241022-v1:0 -e bedrock:amazon.titan-embed-text-v2:0
  ```

  ## License
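For readers skimming the diff, the content-extraction step (step 3 in the new README text) can be pictured roughly as follows. This is a minimal sketch assuming `trafilatura` and `pdfplumber` are installed; `extract_main_text` is a hypothetical helper, not code from this commit.

```python
# Rough sketch of step 3 (content extraction). trafilatura and pdfplumber
# are the libraries the README names; extract_main_text is a hypothetical
# helper, not a function from the actual script.
import io

import pdfplumber
import trafilatura


def extract_main_text(payload: bytes, content_type: str) -> str:
    """Return the main text of a fetched document (HTML or PDF)."""
    if "application/pdf" in content_type:
        # pdfplumber reads the PDF from an in-memory buffer, page by page.
        with pdfplumber.open(io.BytesIO(payload)) as pdf:
            return "\n".join(page.extract_text() or "" for page in pdf.pages)
    # trafilatura strips navigation and boilerplate, keeping the main body.
    html = payload.decode("utf-8", errors="ignore")
    return trafilatura.extract(html) or ""
```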
 
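Step 4 (vectorization) could look roughly like this with the LangChain APIs the README names. The chunk sizes, the OpenAI embedding choice, and `build_vector_store` are illustrative assumptions, not values from the script.

```python
# Rough sketch of step 4 (vectorization). Chunk sizes and the OpenAI
# embedding are illustrative assumptions, not values from the script.
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter


def build_vector_store(texts: list[str]) -> FAISS:
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
    chunks = splitter.create_documents(texts)
    # With -e openai an OpenAI embedding model would be used here; without
    # -e, the README says spaCy handles semantic search instead.
    return FAISS.from_documents(chunks, OpenAIEmbeddings())
```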
 
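Step 5 (query answering) then reduces to retrieving the top chunks and prompting the model. Again a hedged sketch: the prompt wording and the `answer` helper are invented for illustration, and `k` plausibly mirrors the `-x/--max_extracts` option.

```python
# Rough sketch of step 5 (RAG-style query answering). The prompt wording
# is invented; k mirrors the -x/--max_extracts option [default: 7].
from langchain_openai import ChatOpenAI


def answer(store, query: str, max_extracts: int = 7) -> str:
    docs = store.similarity_search(query, k=max_extracts)
    context = "\n\n".join(doc.page_content for doc in docs)
    prompt = (
        "Answer the question using only the sources below. Cite the "
        "sources you use and format the answer in Markdown.\n\n"
        f"Sources:\n{context}\n\nQuestion: {query}"
    )
    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.0)
    return llm.invoke(prompt).content
```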