Building a Retrieval-Augmented Question-Answering System with FastAPI and LangChain
This project is a simple question-answering web service that uses Retrieval-Augmented Generation (RAG) to answer questions from a set of provided documents. It is built with Python, FastAPI, and LangChain.
Installation
- Clone or download this repository
git clone https://github.com/shamim237/artech_med_bot.git
- Create a virtual environment (optional but recommended):
python -m venv venv
source venv/bin/activate # On Windows use `venv\Scripts\activate`
- Install the required packages:
pip install -r requirements.txt
Run the system
To run the system on your machine, follow the steps shown below:
preprocess.py:
This script preprocesses MSD (Excel) and CBIP (CSV) data files, applying the same standardized cleaning and transformation operations to both formats.
Features
- Processes MSD Excel files and CBIP CSV files
- Standardizes text to lowercase
- Removes empty columns
- Handles missing values
- Eliminates duplicate rows
- Preserves original file structure
- Comprehensive logging
Usage
python preprocess.py --msd-input "path/to/msd.xlsx" --cbip-input "path/to/cbip/directory"
Arguments
- --msd-input: Path to the MSD Excel file
- --cbip-input: Path to the directory containing CBIP CSV files
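For readers who want to see how two options like these are typically declared, here is a minimal sketch using the standard-library argparse module. Only the option names and help texts come from this README; the function name build_parser is hypothetical, and whether the real script marks the options as required or gives them defaults is not stated here.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the two documented options (illustrative only)."""
    parser = argparse.ArgumentParser(
        description="Preprocess MSD (Excel) and CBIP (CSV) data files."
    )
    parser.add_argument("--msd-input", help="Path to the MSD Excel file")
    parser.add_argument(
        "--cbip-input",
        help="Path to the directory containing CBIP CSV files",
    )
    return parser


# argparse exposes "--msd-input" as the attribute "msd_input".
args = build_parser().parse_args(
    ["--msd-input", "path/to/msd.xlsx", "--cbip-input", "path/to/cbip/directory"]
)
print(args.msd_input, args.cbip_input)
```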
Output
The script creates a processed_data directory in your current working directory with the following structure:
processed_data/
├── msd/
│   └── msd_processed.csv
└── cbip/
    └── [processed_csv_files]
Data Processing Steps
- Text Standardization: Converts all text to lowercase
- Column Cleaning: Removes columns that are completely empty
- Missing Value Handling: Fills NaN values with empty strings
- Duplicate Removal: Removes duplicate rows from the dataset
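The four steps above can be sketched in plain Python on a list of row dictionaries. This is a dependency-free illustration, not the script's actual code: the real script may well use a dataframe library, the function name clean_rows is hypothetical, and the order in which the steps run (in particular, removing duplicates after lowercasing) is an assumption.

```python
def clean_rows(rows):
    """Apply the four documented cleaning steps to a list of dict rows.

    Assumes every row has the same keys. None stands in for a missing value.
    """
    # Column cleaning: keep only columns with at least one non-missing value.
    columns = list(rows[0].keys()) if rows else []
    kept = [c for c in columns if any(r.get(c) not in (None, "") for r in rows)]

    cleaned, seen = [], set()
    for row in rows:
        new_row = {}
        for col in kept:
            value = row.get(col)
            if value is None:
                value = ""  # missing value handling
            elif isinstance(value, str):
                value = value.lower()  # text standardization
            new_row[col] = value
        key = tuple(new_row.items())
        if key not in seen:  # duplicate removal
            seen.add(key)
            cleaned.append(new_row)
    return cleaned


sample = [
    {"name": "Aspirin", "dose": None, "note": None},
    {"name": "ASPIRIN", "dose": None, "note": None},
    {"name": "Ibuprofen", "dose": "200 MG", "note": None},
]
print(clean_rows(sample))
```

In this sample the note column is dropped because it is empty in every row, and the two aspirin rows collapse into one because they become identical after lowercasing.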
Error Handling
- The script includes comprehensive error handling and logging
- Errors are logged with timestamps and detailed messages
- Processing continues even if individual files fail
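One common way to get the "continue on failure" behavior described above is to wrap each file in its own try/except and log the failure with a timestamp. The sketch below assumes that pattern; the names process_all and process_one are hypothetical, and the log format is only an example of a timestamped layout.

```python
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s",  # timestamped messages
)
logger = logging.getLogger("preprocess")


def process_all(paths, process_one):
    """Run process_one on each path; log failures and keep going."""
    succeeded, failed = [], []
    for path in paths:
        try:
            process_one(path)
            succeeded.append(path)
        except Exception:
            # logger.exception records the message plus the traceback.
            logger.exception("Failed to process %s", path)
            failed.append(path)
    return succeeded, failed
```

Returning both lists lets the caller report a summary at the end instead of stopping at the first bad file.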
vectorize.py:
This script processes CSV documents and creates FAISS vector stores using LangChain and Hugging Face embeddings. It's designed to handle both MSD (Master Service Data) and medical data sources, converting them into efficient searchable vector representations.
Features
- CSV document loading with support for multiple files
- Text chunking with configurable size and overlap
- FAISS vector store creation and persistence
- Comprehensive error handling and logging
- Support for Hugging Face embedding models
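To illustrate what "chunking with configurable size and overlap" means, here is a simplified character-window sketch. It is not the script's implementation: LangChain's text splitters also split on separators, and the function name chunk_text and the default values below are assumptions made for the example.

```python
def chunk_text(text, chunk_size=500, chunk_overlap=50):
    """Split text into windows of chunk_size characters.

    Consecutive chunks share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
        start += step
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1))
```

The overlap means that a sentence cut at a chunk boundary still appears, at least in part, in the neighboring chunk, which tends to help retrieval.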
Configuration
The script uses the following default configuration:
- MSD Data Path: ./processed_data/msd/msd_processed.csv
- Medical CSV Path: ./processed_data/cbip/*.csv
- MSD Vector Output: ./vectors_data/msd_data_vec
- Medical Vector Output: ./vectors_data/med_data_vec
- Embedding Model: sentence-transformers/all-MiniLM-L12-v2
Usage
- Run the script as-is to build the vector stores from the default paths:
python vectorize.py
- Or change the dataset paths first if your data lives elsewhere.
rag.py:
This script implements a Retrieval-Augmented Generation (RAG) system using LangChain, FAISS vector store, and OpenAI's GPT-3.5 model. The system combines medical and general data sources to provide informed answers to user queries.
Features
- Dual vector store integration (medical and general data)
- HuggingFace embeddings using the all-MiniLM-L12-v2 model
- OpenAI GPT-3.5 for answer generation
- Comprehensive error handling and logging
- Environment variable support for API keys
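This README does not say how results from the two vector stores are combined. One simple possibility, shown here purely as an assumption, is to interleave the two ranked result lists round-robin and drop duplicates. The function name merge_results is hypothetical, and plain strings stand in for retrieved documents.

```python
from itertools import zip_longest


def merge_results(*ranked_lists):
    """Interleave ranked result lists round-robin, dropping duplicates."""
    merged, seen = [], set()
    for group in zip_longest(*ranked_lists):
        for doc in group:
            # zip_longest pads shorter lists with None; skip the padding.
            if doc is not None and doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged


medical_hits = ["m1", "m2"]
general_hits = ["g1", "m1", "g3"]
print(merge_results(medical_hits, general_hits))
```

Interleaving keeps the top hit from each store near the front, so neither source crowds out the other in the context passed to the model.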
Prerequisites
- OpenAI API key
  - Create a .env file in the project root and add your OpenAI API key: OPENAI_API_KEY=your_api_key_here
- Required vector stores in the vectors_data directory:
  - msd_data_vec/: General data vector store
  - med_data_vec/: Medical data vector store
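A quick way to confirm these prerequisites before starting is a small check like the sketch below. The function name missing_prerequisites is hypothetical. It only inspects the process environment, so it assumes the .env file has already been loaded by whatever mechanism the project uses.

```python
import os
from pathlib import Path

REQUIRED_STORES = ("msd_data_vec", "med_data_vec")


def missing_prerequisites(env=None, vectors_dir="vectors_data"):
    """Return a list of human-readable problems; empty means ready to run."""
    env = os.environ if env is None else env
    problems = []
    if not env.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set")
    for name in REQUIRED_STORES:
        store = Path(vectors_dir) / name
        if not store.is_dir():
            problems.append(f"missing vector store: {store}")
    return problems


for problem in missing_prerequisites():
    print("Prerequisite check:", problem)
```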
Usage
python rag.py
app.py:
This script is a FastAPI-based REST API that generates answers to questions using RAG (Retrieval-Augmented Generation) technology.
Features
- Question answering endpoint with RAG integration
- Request ID tracking for all API calls
- Comprehensive error handling and logging
- Health check endpoint
- CORS support
- API documentation (Swagger UI and ReDoc)
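As an illustration of request ID tracking, the sketch below generates a unique ID and prefixes every log message with it using the standard library. How app.py actually generates and propagates its IDs is not described here, so the names new_request_id and RequestIdAdapter, and the choice of UUIDs, are assumptions.

```python
import logging
import uuid


def new_request_id() -> str:
    """Return a fresh unique ID for one API call (UUID4 is an assumption)."""
    return str(uuid.uuid4())


class RequestIdAdapter(logging.LoggerAdapter):
    """Prefix each log message with the request ID stored in `extra`."""

    def process(self, msg, kwargs):
        return f"[{self.extra['request_id']}] {msg}", kwargs


logging.basicConfig(level=logging.INFO)
log = RequestIdAdapter(
    logging.getLogger("app"), {"request_id": new_request_id()}
)
log.info("Received question")
```

With the ID in every line, all log entries belonging to one call can be found with a single search.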
Usage
uvicorn app:app --reload
The server will start on http://localhost:8000
API Endpoints
1. Question Answering
- Endpoint: /answer
- Method: POST
- Request Body:
{
"question": "What is an overactive bladder?"
}
- Response:
{
"answer": "The generated answer..."
}
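A minimal client for this endpoint, using only the standard library, might look like the sketch below. The URL, path, method, and JSON shapes come from this README; the function names build_answer_request and ask are hypothetical, and the server must already be running for ask to succeed.

```python
import json
import urllib.request


def build_answer_request(question, base_url="http://localhost:8000"):
    """Build the POST request for /answer without sending it."""
    body = json.dumps({"question": question}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/answer",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(question, base_url="http://localhost:8000"):
    """Send the question and return the "answer" field of the response."""
    request = build_answer_request(question, base_url)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))["answer"]


# Example (requires the server started with uvicorn):
# print(ask("What is an overactive bladder?"))
```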
test_rag.py:
The test suite validates the functionality of:
- Individual data retrievers (medicine and general data)
- Combined retriever functionality
- Answer generation system
- Error handling for edge cases
Test Cases
The test suite includes the following test cases:
- test_data_retriever: Tests retrieval from general data store
- test_med_retriever: Tests retrieval from medical data store
- test_combined_retriever: Tests the merged retriever functionality
- test_generate_answer: Validates answer generation
- test_empty_query: Tests error handling for invalid inputs
Usage
python -m unittest test_rag.py
Vector Store Setup
The system expects two FAISS indices in the vectors_data/ directory:
- msd_data_vec: General knowledge vector store
- med_data_vec: Medical knowledge vector store
Both indices use the sentence-transformers/all-MiniLM-L12-v2 embedding model.
Notes
- Ensure all vector stores are properly initialized before running tests
- The system uses the MiniLM-L12-v2 model for embeddings
- Empty or whitespace-only queries will raise ValueError exceptions
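The last note can be expressed as a unittest case along the following lines. The generate_answer function here is a stand-in written for this example, not the project's real function; only the behavior it demonstrates (ValueError on empty or whitespace-only queries) is taken from this README.

```python
import unittest


def generate_answer(query):
    """Stand-in for the project's answer function (hypothetical)."""
    if not query or not query.strip():
        raise ValueError("Query must not be empty")
    return f"stub answer for: {query}"


class TestEmptyQuery(unittest.TestCase):
    def test_empty_query(self):
        # Empty and whitespace-only inputs should all be rejected.
        for bad in ("", "   ", "\n\t"):
            with self.subTest(query=repr(bad)):
                with self.assertRaises(ValueError):
                    generate_answer(bad)

    def test_valid_query(self):
        self.assertIn("bladder", generate_answer("What is an overactive bladder?"))
```

Saved in a test module, a case like this runs with the same python -m unittest command shown above.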
test_app.py:
The test suite (test_app.py) validates the /answer endpoint's response to different types of requests, ensuring proper handling of both valid and invalid inputs.
Test Cases
The test suite includes the following test cases:
Valid Question Test
- Verifies that the endpoint correctly processes a valid question
- Expects a 200 status code and an answer in the response
Empty Question Test
- Validates handling of empty string inputs
- Expects a 422 status code (Pydantic validation error)
Whitespace Question Test
- Checks handling of whitespace-only inputs
- Expects a 500 status code with an error message
Missing Question Field Test
- Verifies behavior when the question field is omitted
- Expects a 422 status code (FastAPI validation error)
Invalid JSON Test
- Tests handling of malformed JSON requests
- Expects a 422 status code (FastAPI validation error)
Usage
python -m unittest test_app.py
Assumptions and Trade-offs:
I generated and stored vector embeddings separately for disease/MSD data and medicine/CBIP data, believing that this separation would enhance the LLM's performance.
Comments:
The quality of responses from this RAG-based LLM can be further strengthened through the following steps:
- Organizing the disease-related dataset more systematically.
- Structuring the medicine-related dataset more effectively.
- Enhancing disease-treatment and drug recommendations through better-organized mappings.