README.md · Nexialog/CSRD

metadata

title: CSRD GPT
emoji: 🌿
colorFrom: blue
colorTo: green
sdk: gradio
python_version: 3.10.0
sdk_version: 3.22.1
app_file: app.py
pinned: true

title: CSRD GPT emoji: 📊 colorFrom: red colorTo: green sdk: gradio sdk_version: 4.13.0 app_file: app.py pinned: false

Introduction

This app uses Python 3.10.0.

Built With

Gradio - Main server and interactive components
OpenAI API - Main LLM engine used in the app
HuggingFace Sentence Transformers - Used as the default embedding model

Requirements

NOTE: Before installing the requirements, rename the file .env.example to .env and add your OpenAI API key to it.
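As a rough illustration of how a key stored in .env can reach the app, here is a minimal standard-library sketch. The variable name OPENAI_API_KEY and the helper load_env_file are assumptions made for this example; check .env.example for the name the app actually expects.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Parse simple KEY=VALUE lines and export them to os.environ.

    Hypothetical helper for illustration; returns the values read from the file.
    """
    values = {}
    env_path = Path(path)
    if not env_path.is_file():
        return values
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    # Variables already set in the shell take precedence over the file.
    for key, value in values.items():
        os.environ.setdefault(key, value)
    return values
```

Calling load_env_file() once at startup and then reading os.environ.get("OPENAI_API_KEY") would be enough to confirm that the key is visible before the app starts.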

We suggest creating a separate Python 3 virtual environment for this app and installing all of the required dependencies there. Run in a terminal or command prompt:

git clone https://github.com/Nexialog/RegGPT.git
cd RegGPT/
python -m venv venv

On UNIX systems:

source venv/bin/activate

On Windows:

venv\Scripts\activate

To install all of the required packages in this environment, run:

pip install -r requirements.txt

Once all of the required pip packages are installed, the app is ready to run.

Usage of run_script.py

This script is used for processing PDF documents and generating text embeddings. You can specify different modes and parameters via command-line arguments.

Process Documents

To process PDF documents and extract paragraphs and metadata, use the following command:

python run_script.py --type process_documents

You can also use optional arguments to specify the folder containing PDFs, the output data folder, minimum paragraph length, and merge length.
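To make the two length parameters more concrete, the sketch below shows one plausible reading of them. It is an assumption, not the script's confirmed behavior: lengths are counted in characters, consecutive paragraphs are concatenated until a chunk reaches merge_length, and chunks shorter than min_length are dropped. The function name merge_and_filter is hypothetical.

```python
def merge_and_filter(paragraphs, min_length=300, merge_length=700):
    """Merge consecutive paragraphs into chunks, then drop chunks that are too short.

    Assumed semantics for illustration only; run_script.py may differ.
    """
    chunks = []
    buffer = ""
    for paragraph in paragraphs:
        text = paragraph.strip()
        if not text:
            continue
        # Append the paragraph to the chunk currently being built.
        buffer = f"{buffer} {text}" if buffer else text
        # Close the chunk once it is long enough.
        if len(buffer) >= merge_length:
            chunks.append(buffer)
            buffer = ""
    # Keep whatever is left over as a final chunk.
    if buffer:
        chunks.append(buffer)
    # Discard chunks shorter than the minimum length.
    return [chunk for chunk in chunks if len(chunk) >= min_length]
```

Under this reading, a larger merge_length yields fewer, longer chunks, while a larger min_length filters out more short fragments such as headers and footers.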

Generate Embeddings

To generate text embeddings from the processed paragraphs, use the following command:

python run_script.py --type generate_embeddings

This command will use the default embedding model, but you can specify another model using the --embedding_model argument.

Process Documents and Generate Embeddings

To perform both document processing and embedding generation, use:

python run_script.py --type all

Command Line Arguments

--type: Specifies the operation type. Choices are all, process_documents, or generate_embeddings. (required)
--pdf_folder: Path to the folder containing PDF documents. Default is pdf_data/. (optional)
--data_folder: Path to the folder where processed data and embeddings will be saved. Default is data/. (optional)
--embedding_model: Specifies the model to be used for generating embeddings. Default is sentence-transformers/multi-qa-mpnet-base-dot-v1. (optional)
--device: Specifies the device to be used (CPU or GPU). Choices are cpu or cuda. Default is cpu. (optional)
--min_length: Specifies the minimum paragraph length for inclusion. Default is 300. (optional)
--merge_length: Specifies the merge length for paragraphs. Default is 700. (optional)
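The arguments listed above could be declared with argparse roughly as follows. This is a sketch reconstructed from the descriptions in this README, not a copy of run_script.py; the function name build_parser and the help strings are assumptions.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the documented command-line arguments."""
    parser = argparse.ArgumentParser(
        description="Process PDF documents and generate text embeddings."
    )
    parser.add_argument(
        "--type",
        required=True,
        choices=["all", "process_documents", "generate_embeddings"],
        help="Operation to run.",
    )
    parser.add_argument("--pdf_folder", default="pdf_data/",
                        help="Folder containing the PDF documents.")
    parser.add_argument("--data_folder", default="data/",
                        help="Folder for processed data and embeddings.")
    parser.add_argument(
        "--embedding_model",
        default="sentence-transformers/multi-qa-mpnet-base-dot-v1",
        help="Model used to generate embeddings.",
    )
    parser.add_argument("--device", choices=["cpu", "cuda"], default="cpu",
                        help="Device used for computation.")
    parser.add_argument("--min_length", type=int, default=300,
                        help="Minimum paragraph length for inclusion.")
    parser.add_argument("--merge_length", type=int, default=700,
                        help="Merge length for paragraphs.")
    return parser
```

For instance, build_parser().parse_args(["--type", "generate_embeddings", "--device", "cuda"]) returns a namespace in which every omitted option keeps the default documented above.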

Examples

python run_script.py --type process_documents --pdf_folder my_pdf_folder/ --merge_length 800

python run_script.py --type generate_embeddings --device cuda

How to use Colab's GPU

Create your own deploy key on GitHub
Upload the key to Google Drive at the path: drive/MyDrive/ssh_key_github/
Upload the notebook notebooks/generate_embeddings.ipynb to a Colab session (or use this link)
Upload the PDF files to the same Colab session at the path: pdf_data/
Run the notebook in GPU mode and download the data/ folder, which contains the embeddings and chunks

How to Configure a New BOT

Put all PDF files in a folder in the same repository (we recommend the folder name 'pdf_data')
Run the Python script 'run_script.py' as explained above
Configure the bot in config.py by following the steps below:

In order to configure the chatbot, you need to modify the config.py file that contains the CFG_APP class. Here's what each attribute in the class means:

Basic Settings

DEBUG: Debugging mode
K_TOTAL: The total number of retrieved docs
THRESHOLD: Threshold of retrieval by embeddings
DEVICE: Device for computation
BOT_NAME: The name of the bot
MODEL_NAME: The name of the model
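To illustrate how K_TOTAL and THRESHOLD might interact during retrieval, here is a small pure-Python sketch. It assumes that similarity is a dot product between embeddings (in line with the default multi-qa-mpnet-base-dot-v1 model), that THRESHOLD is a lower bound on the score, and that K_TOTAL caps the number of results; the app's actual retrieval logic may differ. The function name retrieve and its default values are hypothetical.

```python
def retrieve(query_vector, doc_vectors, k_total=5, threshold=0.0):
    """Return up to k_total (index, score) pairs with score >= threshold.

    Scores are dot products; results are sorted from best to worst.
    """
    scored = []
    for index, doc_vector in enumerate(doc_vectors):
        score = sum(q * d for q, d in zip(query_vector, doc_vector))
        # Discard documents that are not similar enough to the query.
        if score >= threshold:
            scored.append((index, score))
    # Highest score first, then keep only the top k_total entries.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k_total]
```

With this reading, raising THRESHOLD trades recall for precision, while K_TOTAL bounds how much retrieved text is passed on to the language model.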

Language and Data

DEFAULT_LANGUAGE: Default language
DATA_FOLDER: Path to the data folder
EMBEDDING_MODEL: Embedding model

Tokens and Prompts

MAX_TOKENS_REF_QUESTION: Maximum tokens in the reformulated question
MAX_TOKENS_ANSWER: Maximum tokens in answers
INIT_PROMPT: Initial prompt
SOURCES_PROMPT: Sources prompt for responses

Default Questions

DEFAULT_QUESTIONS: Tuple of default questions

Reformulation Prompt

REFORMULATION_PROMPT: Prompt for reformulating questions

Metadata Path

DOC_METADATA_PATH: Path to document metadata
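Put together, a CFG_APP class with the attributes described above might look like the sketch below. Only the attribute names, the data folder, and the embedding model come from this README; every other value (numbers, model name, prompts, questions, metadata file name) is a placeholder chosen for illustration and should be replaced with your own.

```python
class CFG_APP:
    # Basic settings (placeholder values)
    DEBUG = False
    K_TOTAL = 10                  # total number of retrieved docs
    THRESHOLD = 0.3               # retrieval threshold on embedding similarity
    DEVICE = "cpu"                # "cpu" or "cuda"
    BOT_NAME = "CSRD GPT"
    MODEL_NAME = "gpt-3.5-turbo"  # placeholder OpenAI model name

    # Language and data
    DEFAULT_LANGUAGE = "English"  # placeholder
    DATA_FOLDER = "data/"
    EMBEDDING_MODEL = "sentence-transformers/multi-qa-mpnet-base-dot-v1"

    # Tokens and prompts (placeholder values)
    MAX_TOKENS_REF_QUESTION = 128
    MAX_TOKENS_ANSWER = 1024
    INIT_PROMPT = "You are an assistant that answers questions about the CSRD."
    SOURCES_PROMPT = "Answer using only the sources provided below."

    # Default questions shown in the interface (placeholders)
    DEFAULT_QUESTIONS = (
        "What is the CSRD?",
        "Which companies are in scope?",
    )

    # Prompt used to reformulate the user's question (placeholder)
    REFORMULATION_PROMPT = "Rewrite the question so that it is self-contained."

    # Path to document metadata (hypothetical file name)
    DOC_METADATA_PATH = "data/doc_metadata.json"
```

Keeping EMBEDDING_MODEL identical to the model used when generating embeddings is a sensible precaution, since query and document vectors need to come from the same model to be comparable.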

How to Use This BOT

Run the app locally with:

python app.py

Open http://127.0.0.1:7860 in your browser, and you will see the bot.