CyranoB committed
Commit 2e147cb · Parent: 7406911

Update README with improved documentation and usage examples

- Revamped README content with more detailed project description
- Reformatted command-line options into a markdown table for better readability
- Updated example commands to showcase new model and embedding options
- Clarified content extraction and vectorization process in project overview
- Enhanced documentation for search agent functionality and flexibility

Files changed (1): README.md (+20 -19)
README.md CHANGED
@@ -25,9 +25,9 @@ The main functionality of the script can be summarized as follows:

  1. **Query Optimization**: The user's input query is optimized for web search by identifying the key information requested and transforming it into a concise search string using the language model's capabilities.
  2. **Web Search**: The optimized search query is used to fetch search results from the Brave Search API. The script allows limiting the search to a specific domain and setting the maximum number of pages to retrieve.
- 3. **Content Extraction**: The script fetches the content of the retrieved search results, handling both HTML and PDF documents. It extracts the main text content from web pages and text from PDF files.
- 4. **Vectorization**: The extracted content is split into smaller text chunks using a RecursiveCharacterTextSplitter and vectorized using the specified embedding model. The vectorized data is stored in a FAISS vector store for efficient retrieval.
- 5. **Query Answering**: The user's original query is answered by retrieving the most relevant text chunks from the vector store. The language model generates an informative answer by synthesizing the retrieved information, citing the sources used, and formatting the response in Markdown.
+ 3. **Content Extraction**: The script fetches the content of the retrieved search results, handling both HTML and PDF documents. It uses the `trafilatura` library to extract the main text content from web pages and `pdfplumber` to extract text from PDF files. For complex web pages, Selenium can optionally be used to ensure accurate content retrieval. The extracted content is then prepared for further processing.
+ 4. **Vectorization**: The extracted content is split into smaller text chunks using a RecursiveCharacterTextSplitter. These chunks are then vectorized using the specified embedding model, if one is provided; otherwise spaCy is used for semantic search. The vectorized data is stored in a FAISS vector store for efficient retrieval.
+ 5. **Query Answering**: The user's original query is answered with a Retrieval-Augmented Generation (RAG) approach: the most relevant text chunks are retrieved from the vector store, and the language model generates an informative answer by synthesizing the retrieved information, citing the sources used, and formatting the response in Markdown.

  The script supports various options for customization, such as specifying the language model provider (OpenAI, Anthropic, Groq, or Ollama), temperature for language model generation, and output format (text or Markdown).

@@ -68,19 +68,20 @@ python search_agent.py [OPTIONS] SEARCH_QUERY
  ```

  ### Options:
-
- -h --help                          Show this screen.
- --version                          Show version.
- -c --copywrite                     First produce a draft, review it and rewrite for a final text
- -d domain --domain=domain          Limit search to a specific domain
- -t temp --temperature=temp         Set the temperature of the LLM [default: 0.0]
- -m model --model=model             Use a specific model [default: hf:Qwen/Qwen2.5-72B-Instruct]
- -e model --embedding_model=model   Use an embedding model
- -n num --max_pages=num             Max number of pages to retrieve [default: 10]
- -x num --max_extracts=num          Max number of page extract to consider [default: 7]
- -b --use_browser                   Use browser to fetch content from the web [default: False]
- -o text --output=text              Output format (choices: text, markdown) [default: markdown]
- -v --verbose                       Print verbose output [default: False]
+ | Option | Description |
+ |--------|-------------|
+ | `-h --help` | Show this screen |
+ | `--version` | Show version |
+ | `-c --copywrite` | First produce a draft, then review and rewrite it into a final text |
+ | `-d domain --domain=domain` | Limit search to a specific domain |
+ | `-t temp --temperature=temp` | Set the temperature of the LLM [default: 0.0] |
+ | `-m model --model=model` | Use a specific model [default: hf:Qwen/Qwen2.5-72B-Instruct] |
+ | `-e model --embedding_model=model` | Use an embedding model |
+ | `-n num --max_pages=num` | Max number of pages to retrieve [default: 10] |
+ | `-x num --max_extracts=num` | Max number of page extracts to consider [default: 7] |
+ | `-b --use_browser` | Use a browser to fetch content from the web [default: False] |
+ | `-o text --output=text` | Output format (choices: text, markdown) [default: markdown] |
+ | `-v --verbose` | Print verbose output [default: False] |

  The model can be a language model provider and a model name separated by a colon, e.g. `openai:gpt-4o-mini`.
  If an embedding model is not specified, spaCy will be used for semantic search.
@@ -93,15 +94,15 @@ python search_agent.py 'What is the radioactive anomaly in the Pacific Ocean?'
  ```

  ```bash
- python search_agent.py -m openai:gpt-4o-mini "Write a linked post about the current state of M&A for startups. Write in the style of Russ from Silicon Valley TV show."
+ python search_agent.py "Write a linked post about the current state of M&A for startups. Write in the style of Russ from Silicon Valley TV show." -m openai:gpt-4o-mini
  ```

  ```bash
- python search_agent.py -m groq:llama-3.1-70b-versatile -e ollama:nomic-embed-text:latest -t 0.7 -n 20 -x 15 "Write a linked post about the state of M&A for startups in 2024. Write in the style of Russ from TV show Silicon Valley" -s
+ python search_agent.py "Write a linked post about the state of M&A for startups in 2025. Write in the style of Russ from TV show Silicon Valley" -b -m groq:llama-3.1-8b-instant -e cohere:embed-multilingual-v3.0 -t 0.7 -n 20 -x 15
  ```

  ```bash
- python search_agent.py -m groq -e openai "Write an engaging long linked post about the state of M&A for startups in 2024"
+ python search_agent.py "Write an engaging long linked post about the state of M&A for startups in 2025" -m bedrock:claude-3-5-haiku-20241022-v1:0 -e bedrock:amazon.titan-embed-text-v2:0
  ```

  ## License
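For readers skimming the diff, the content-extraction step (step 3 in the new README text) can be pictured roughly as follows. This is a minimal sketch assuming `trafilatura` and `pdfplumber` are installed; `extract_main_text` is a hypothetical helper, not code from this commit.

```python
# Rough sketch of step 3 (content extraction). trafilatura and pdfplumber
# are the libraries the README names; extract_main_text is a hypothetical
# helper, not a function from the actual script.
import io

import pdfplumber
import trafilatura


def extract_main_text(payload: bytes, content_type: str) -> str:
    """Return the main text of a fetched document (HTML or PDF)."""
    if "application/pdf" in content_type:
        # pdfplumber reads the PDF from an in-memory buffer, page by page.
        with pdfplumber.open(io.BytesIO(payload)) as pdf:
            return "\n".join(page.extract_text() or "" for page in pdf.pages)
    # trafilatura strips navigation and boilerplate, keeping the main body.
    html = payload.decode("utf-8", errors="ignore")
    return trafilatura.extract(html) or ""
```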
 
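Step 4 (vectorization) could look roughly like this with the LangChain APIs the README names. The chunk sizes, the OpenAI embedding choice, and `build_vector_store` are illustrative assumptions, not values from the script.

```python
# Rough sketch of step 4 (vectorization). Chunk sizes and the OpenAI
# embedding are illustrative assumptions, not values from the script.
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter


def build_vector_store(texts: list[str]) -> FAISS:
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
    chunks = splitter.create_documents(texts)
    # With -e openai an OpenAI embedding model would be used here; without
    # -e, the README says spaCy handles semantic search instead.
    return FAISS.from_documents(chunks, OpenAIEmbeddings())
```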
 
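Step 5 (query answering) then reduces to retrieving the top chunks and prompting the model. Again a hedged sketch: the prompt wording and the `answer` helper are invented for illustration, and `k` plausibly mirrors the `-x/--max_extracts` option.

```python
# Rough sketch of step 5 (RAG-style query answering). The prompt wording
# is invented; k mirrors the -x/--max_extracts option [default: 7].
from langchain_openai import ChatOpenAI


def answer(store, query: str, max_extracts: int = 7) -> str:
    docs = store.similarity_search(query, k=max_extracts)
    context = "\n\n".join(doc.page_content for doc in docs)
    prompt = (
        "Answer the question using only the sources below. Cite the "
        "sources you use and format the answer in Markdown.\n\n"
        f"Sources:\n{context}\n\nQuestion: {query}"
    )
    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.0)
    return llm.invoke(prompt).content
```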