title: CSRD GPT
emoji: πΏ
colorFrom: blue
colorTo: green
sdk: gradio
python_version: 3.10.0
sdk_version: 3.22.1
app_file: app.py
pinned: true
Introduction
This app uses Python 3.10.0.
Built With
- Gradio - Main server and interactive components
- OpenAI API - Main LLM engine used in the app
- HuggingFace Sentence Transformers - Used as the default embedding model
Requirements
NOTE: Before installing the requirements, rename the file .env.example to .env and put your OpenAI API key there.
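To check that the key in .env is readable before launching the app, a small standard-library sketch like the one below may help. The variable name OPENAI_API_KEY and both helper functions are assumptions made for illustration; check .env.example for the name the app actually expects.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Parse simple KEY=VALUE lines from a .env file into a dict."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_openai_key(path=".env"):
    """Return the key from the environment, falling back to the .env file."""
    # OPENAI_API_KEY is an assumed variable name, not confirmed by this README.
    return os.environ.get("OPENAI_API_KEY") or load_env_file(path).get("OPENAI_API_KEY")
```

If get_openai_key() returns None, the file is probably missing, misnamed, or uses a different variable name.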
We suggest creating a separate virtual environment running Python 3 for this app and installing all of the required dependencies there. Run in Terminal/Command Prompt:
git clone https://github.com/Nexialog/RegGPT.git
cd RegGPT/
python -m venv venv
On UNIX systems:
source venv/bin/activate
On Windows:
venv\Scripts\activate
To install all of the required packages in this environment, run:
pip install -r requirements.txt
Once the packages are installed, the app is ready to run.
Usage of run_script.py
This script is used for processing PDF documents and generating text embeddings. You can specify different modes and parameters via command-line arguments.
Process Documents
To process PDF documents and extract paragraphs and metadata, use the following command:
python run_script.py --type process_documents
You can also use optional arguments to specify the folder containing PDFs, the output data folder, minimum paragraph length, and merge length.
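The README does not spell out how the minimum paragraph length and merge length interact, so the sketch below shows only one plausible reading: consecutive paragraphs are joined until adding the next one would exceed merge_length, and any resulting chunk shorter than min_length is dropped. The function name, the use of character counts, and the merging rule are all assumptions, not the behavior of run_script.py.

```python
def merge_and_filter(paragraphs, min_length=300, merge_length=700):
    """Merge consecutive paragraphs up to merge_length, then drop short chunks.

    Lengths are measured in characters (an assumption for this sketch).
    """
    merged = []
    buffer = ""
    for paragraph in paragraphs:
        text = paragraph.strip()
        if not text:
            continue
        candidate = f"{buffer} {text}".strip()
        if buffer and len(candidate) > merge_length:
            # Adding this paragraph would exceed the limit: flush the buffer.
            merged.append(buffer)
            buffer = text
        else:
            buffer = candidate
    if buffer:
        merged.append(buffer)
    # Keep only chunks that meet the minimum length.
    return [chunk for chunk in merged if len(chunk) >= min_length]
```

Under this reading, a larger merge length produces fewer, longer chunks, while a larger minimum length discards more short fragments such as headers or captions.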
Generate Embeddings
To generate text embeddings from the processed paragraphs, use the following command:
python run_script.py --type generate_embeddings
This command uses the default embedding model, but you can specify another model with the --embedding_model argument.
Process Documents and Generate Embeddings
To perform both document processing and embedding generation, use:
python run_script.py --type all
Command Line Arguments
- --type: Specifies the operation type. Choices are all, process_documents, or generate_embeddings. (required)
- --pdf_folder: Path to the folder containing PDF documents. Default is pdf_data/. (optional)
- --data_folder: Path to the folder where processed data and embeddings will be saved. Default is data/. (optional)
- --embedding_model: Specifies the model to be used for generating embeddings. Default is sentence-transformers/multi-qa-mpnet-base-dot-v1. (optional)
- --device: Specifies the device to be used (CPU or GPU). Choices are cpu or cuda. Default is cpu. (optional)
- --min_length: Specifies the minimum paragraph length for inclusion. Default is 300. (optional)
- --merge_length: Specifies the merge length for paragraphs. Default is 700. (optional)
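For reference, the arguments listed above could be declared with argparse roughly as follows. This is a sketch reconstructed from the documented names, choices, and defaults, not the actual parser in run_script.py; the build_parser function and the integer types for the two length arguments are assumptions.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented command-line arguments."""
    parser = argparse.ArgumentParser(
        description="Process PDF documents and generate text embeddings."
    )
    parser.add_argument(
        "--type",
        required=True,
        choices=["all", "process_documents", "generate_embeddings"],
    )
    parser.add_argument("--pdf_folder", default="pdf_data/")
    parser.add_argument("--data_folder", default="data/")
    parser.add_argument(
        "--embedding_model",
        default="sentence-transformers/multi-qa-mpnet-base-dot-v1",
    )
    parser.add_argument("--device", choices=["cpu", "cuda"], default="cpu")
    # Integer types are assumed; the README only gives the default values.
    parser.add_argument("--min_length", type=int, default=300)
    parser.add_argument("--merge_length", type=int, default=700)
    return parser
```

Calling build_parser().parse_args(["--type", "all"]) would then yield the documented defaults for every optional argument.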
Examples
python run_script.py --type process_documents --pdf_folder my_pdf_folder/ --merge_length 800
python run_script.py --type generate_embeddings --device cuda
How to use Colab's GPU
- Create your own deploy key on GitHub
- Upload the key to Google Drive at the path: drive/MyDrive/ssh_key_github/
- Upload the notebook notebooks/generate_embeddings.ipynb into a Colab session (or use this link)
- Upload the PDF files to the same Colab session at the path: pdf_data/
- Run the notebook in GPU mode and download the folder data/ containing the embeddings and chunks
How to Configure a New BOT
- Put all PDF files in a folder in the same repository (we recommend the folder name 'pdf_data')
- Run the Python script 'run_script.py' as explained above.
- Configure the BOT in config.py by following the steps below:
To configure the chatbot, modify the config.py file, which contains the CFG_APP class. Here is what each attribute in the class means:
Basic Settings
- DEBUG: Debugging mode
- K_TOTAL: The total number of retrieved docs
- THRESHOLD: Threshold of retrieval by embeddings
- DEVICE: Device for computation
- BOT_NAME: The name of the bot
- MODEL_NAME: The name of the model
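To illustrate how K_TOTAL and THRESHOLD might work together during retrieval, here is a minimal pure-Python sketch. It assumes that documents are scored by the dot product of embeddings (suggested by the "dot" in the default model name), that THRESHOLD is a minimum score, and that K_TOTAL caps the number of results; the retrieve function is hypothetical and the app's actual logic may differ.

```python
def retrieve(query_vec, doc_vecs, k_total=3, threshold=0.0):
    """Return up to k_total (index, score) pairs with score >= threshold."""
    scored = []
    for index, doc_vec in enumerate(doc_vecs):
        # Dot-product similarity between the query and one document embedding.
        score = sum(q * d for q, d in zip(query_vec, doc_vec))
        if score >= threshold:
            scored.append((index, score))
    # Highest-scoring documents first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k_total]
```

With this reading, raising THRESHOLD makes the bot stricter about which passages count as relevant, while K_TOTAL bounds how much context is passed to the language model.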
Language and Data
- DEFAULT_LANGUAGE: Default language
- DATA_FOLDER: Path to the data folder
- EMBEDDING_MODEL: Embedding model
Tokens and Prompts
- MAX_TOKENS_REF_QUESTION: Maximum tokens in the reformulated question
- MAX_TOKENS_ANSWER: Maximum tokens in answers
- INIT_PROMPT: Initial prompt
- SOURCES_PROMPT: Sources prompt for responses
Default Questions
- DEFAULT_QUESTIONS: Tuple of default questions
Reformulation Prompt
- REFORMULATION_PROMPT: Prompt for reformulating questions
Metadata Path
- DOC_METADATA_PATH: Path to document metadata
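Putting the attributes together, a config.py for a new bot might look like the sketch below. Only the attribute names, the class name CFG_APP, and the default embedding model come from this README; every value, type, and the metadata file name are placeholders chosen for illustration and should be replaced with your own.

```python
class CFG_APP:
    # All values are illustrative placeholders, not the project's defaults.

    # Basic settings
    DEBUG = False
    K_TOTAL = 5                # number of retrieved docs (assumed value)
    THRESHOLD = 0.3            # retrieval threshold (assumed value)
    DEVICE = "cpu"             # "cpu" or "cuda"
    BOT_NAME = "MyBot"
    MODEL_NAME = "your-openai-model-name"

    # Language and data
    DEFAULT_LANGUAGE = "English"
    DATA_FOLDER = "data/"
    EMBEDDING_MODEL = "sentence-transformers/multi-qa-mpnet-base-dot-v1"

    # Tokens and prompts
    MAX_TOKENS_REF_QUESTION = 128
    MAX_TOKENS_ANSWER = 1024
    INIT_PROMPT = "You are an assistant answering questions about the documents."
    SOURCES_PROMPT = "Cite the sources you used."

    # Default questions shown to the user
    DEFAULT_QUESTIONS = (
        "What does this document cover?",
        "Which requirements apply to my case?",
    )

    # Reformulation prompt
    REFORMULATION_PROMPT = "Rewrite the question so it stands on its own."

    # Metadata path (hypothetical file name)
    DOC_METADATA_PATH = "data/doc_metadata.json"
```

Keeping DATA_FOLDER and EMBEDDING_MODEL consistent with the values passed to run_script.py matters, since embeddings generated with one model cannot be queried meaningfully with another.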
How to Use This BOT
Run this app locally with:
python app.py
Open http://127.0.0.1:7860 in your browser, and you will see the bot.