title: CSRD GPT
emoji: 🌿
colorFrom: blue
colorTo: green
sdk: gradio
python_version: 3.10.0
sdk_version: 3.22.1
app_file: app.py
pinned: true

Introduction

This app was built with Python 3.10.0.

Built With

Requirements

NOTE: Before installing the requirements, rename the file .env.example to .env and add your OpenAI API key to it.
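As a rough illustration of how a key stored in .env can reach the app, here is a minimal sketch that reads KEY=VALUE lines into the process environment using only the standard library. The helper load_env_file is hypothetical, and the variable name OPENAI_API_KEY is an assumption: check .env.example for the name the app actually expects. The app itself may well rely on a library for this step.

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env") -> dict:
    """Parse KEY=VALUE lines from a file and export them to os.environ.

    Hypothetical helper for illustration; values already present in the
    environment are not overridden.
    """
    loaded = {}
    env_path = Path(path)
    if not env_path.exists():
        return loaded
    for raw in env_path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        value = value.strip().strip('"').strip("'")
        loaded[key] = value
        os.environ.setdefault(key, value)
    return loaded


load_env_file()
# OPENAI_API_KEY is an assumed name; confirm it against .env.example.
if not os.environ.get("OPENAI_API_KEY"):
    print("OPENAI_API_KEY is not set; check your .env file")
```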

We suggest creating a separate Python 3 virtual environment for this app and installing all of the required dependencies there. Run in a terminal or Command Prompt:

git clone https://github.com/Nexialog/RegGPT.git
cd RegGPT/
python -m venv venv

On a UNIX system:

source venv/bin/activate

On Windows:

venv\Scripts\activate

To install the required packages into this environment, run:

pip install -r requirements.txt

Once the packages are installed, the app is ready to run.

Usage of run_script.py

This script is used for processing PDF documents and generating text embeddings. You can specify different modes and parameters via command-line arguments.

Process Documents

To process PDF documents and extract paragraphs and metadata, use the following command:

python run_script.py --type process_documents 

Optional arguments let you set the folder containing the PDFs (--pdf_folder), the output data folder (--data_folder), the minimum paragraph length (--min_length), and the merge length (--merge_length).

Generate Embeddings

To generate text embeddings from the processed paragraphs, use the following command:

python run_script.py --type generate_embeddings

This command will use the default embedding model, but you can specify another model using the --embedding_model argument.

Process Documents and Generate Embeddings

To perform both document processing and embedding generation, use:

python run_script.py --type all

Command Line Arguments

  • --type: Specifies the operation type. Choices are all, process_documents, or generate_embeddings. (required)
  • --pdf_folder: Path to the folder containing PDF documents. Default is pdf_data/. (optional)
  • --data_folder: Path to the folder where processed data and embeddings will be saved. Default is data/. (optional)
  • --embedding_model: Specifies the model to be used for generating embeddings. Default is sentence-transformers/multi-qa-mpnet-base-dot-v1. (optional)
  • --device: Specifies the device to be used (CPU or GPU). Choices are cpu or cuda. Default is cpu. (optional)
  • --min_length: Specifies the minimum paragraph length for inclusion. Default is 300. (optional)
  • --merge_length: Specifies the merge length for paragraphs. Default is 700. (optional)
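For reference, the arguments listed above could be declared with argparse roughly as follows. This is a sketch that mirrors the documented flags, choices, and defaults; it is not the actual contents of run_script.py. The build_parser function name and the int types for --min_length and --merge_length are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Return a parser mirroring the documented flags (illustrative only)."""
    parser = argparse.ArgumentParser(
        description="Process PDF documents and generate text embeddings."
    )
    parser.add_argument(
        "--type",
        required=True,
        choices=["all", "process_documents", "generate_embeddings"],
        help="Operation type.",
    )
    parser.add_argument("--pdf_folder", default="pdf_data/",
                        help="Folder containing PDF documents.")
    parser.add_argument("--data_folder", default="data/",
                        help="Folder for processed data and embeddings.")
    parser.add_argument(
        "--embedding_model",
        default="sentence-transformers/multi-qa-mpnet-base-dot-v1",
        help="Model used for generating embeddings.",
    )
    parser.add_argument("--device", choices=["cpu", "cuda"], default="cpu",
                        help="Device used for computation.")
    # Integer types are assumed here; the README only gives the defaults.
    parser.add_argument("--min_length", type=int, default=300,
                        help="Minimum paragraph length for inclusion.")
    parser.add_argument("--merge_length", type=int, default=700,
                        help="Merge length for paragraphs.")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(
    ["--type", "process_documents", "--merge_length", "800"]
)
print(args.type, args.pdf_folder, args.merge_length)
```

Because --type is declared with choices, any value other than the three documented ones is rejected with a usage error before the script does any work.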

Examples

python run_script.py --type process_documents --pdf_folder my_pdf_folder/ --merge_length 800
python run_script.py --type generate_embeddings --device cuda

How to use Colab's GPU

  1. Create your own deploy key on GitHub
  2. Upload the key to Google Drive at the path: drive/MyDrive/ssh_key_github/
  3. Upload the notebook notebooks/generate_embeddings.ipynb to a Colab session (or use this link)
  4. Upload the PDF files to the same Colab session at the path: pdf_data/
  5. Run the notebook in GPU mode and download the data/ folder containing the embeddings and chunks

How to Configure a New BOT

  1. Put all PDF files in a folder in the same repository (we recommend the folder name pdf_data)
  2. Run the Python script run_script.py as explained above
  3. Configure the BOT in config.py by following the steps below:

In order to configure the chatbot, you need to modify the config.py file that contains the CFG_APP class. Here's what each attribute in the class means:

Basic Settings

  • DEBUG: Debugging mode
  • K_TOTAL: The total number of retrieved docs
  • THRESHOLD: Threshold of retrieval by embeddings
  • DEVICE: Device for computation
  • BOT_NAME: The name of the bot
  • MODEL_NAME: The name of the model
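How K_TOTAL and THRESHOLD interact is not spelled out here. One plausible reading, sketched below, is that paragraphs whose similarity score falls below THRESHOLD are dropped and at most K_TOTAL of the remaining ones are kept, best first. The retrieve function, the toy vectors, and the use of a plain dot product are assumptions for illustration; the app's actual retrieval code may differ.

```python
from typing import List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def retrieve(
    query_vec: Sequence[float],
    doc_vecs: Sequence[Sequence[float]],
    k_total: int,
    threshold: float,
) -> List[Tuple[int, float]]:
    """Return up to k_total (index, score) pairs with score >= threshold."""
    scored = [(i, dot(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    kept = [(i, score) for i, score in scored if score >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)  # best match first
    return kept[:k_total]


# Toy 2-dimensional "embeddings"; real ones come from the embedding model.
docs = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
hits = retrieve([1.0, 0.0], docs, k_total=2, threshold=0.5)
print([index for index, _ in hits])  # [0, 1]
```

Under this reading, raising THRESHOLD makes the bot stricter about which paragraphs count as relevant, while K_TOTAL caps how much context is passed along.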

Language and Data

  • DEFAULT_LANGUAGE: Default language
  • DATA_FOLDER: Path to the data folder
  • EMBEDDING_MODEL: Embedding model

Tokens and Prompts

  • MAX_TOKENS_REF_QUESTION: Maximum tokens in the reformulated question
  • MAX_TOKENS_ANSWER: Maximum tokens in answers
  • INIT_PROMPT: Initial prompt
  • SOURCES_PROMPT: Sources prompt for responses

Default Questions

  • DEFAULT_QUESTIONS: Tuple of default questions

Reformulation Prompt

  • REFORMULATION_PROMPT: Prompt for reformulating questions

Metadata Path

  • DOC_METADATA_PATH: Path to document metadata
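Put together, a CFG_APP class with the attributes described above might look like the sketch below. Only the attribute names come from this README. Every value is a placeholder chosen for illustration, including the model name, the language, the prompt texts, and the metadata file name, so replace them with the settings for your own bot.

```python
class CFG_APP:
    """Illustrative configuration; all values below are placeholders."""

    # Basic settings
    DEBUG = False
    K_TOTAL = 10           # total number of retrieved docs (placeholder)
    THRESHOLD = 0.3        # retrieval threshold (placeholder)
    DEVICE = "cpu"         # "cpu" or "cuda"
    BOT_NAME = "CSRD GPT"
    MODEL_NAME = "gpt-3.5-turbo"  # placeholder model name

    # Language and data
    DEFAULT_LANGUAGE = "English"  # placeholder
    DATA_FOLDER = "data/"
    EMBEDDING_MODEL = "sentence-transformers/multi-qa-mpnet-base-dot-v1"

    # Tokens and prompts
    MAX_TOKENS_REF_QUESTION = 128  # placeholder
    MAX_TOKENS_ANSWER = 1024       # placeholder
    INIT_PROMPT = "You are an assistant answering questions about the CSRD."
    SOURCES_PROMPT = "Answer using only the sources provided."

    # Default questions
    DEFAULT_QUESTIONS = (
        "What is the CSRD?",
        "Which companies does the CSRD apply to?",
    )

    # Reformulation prompt
    REFORMULATION_PROMPT = "Rewrite the question so it can be understood on its own."

    # Metadata path (hypothetical file name)
    DOC_METADATA_PATH = "data/doc_metadata.json"


print(CFG_APP.BOT_NAME, CFG_APP.EMBEDDING_MODEL)
```

Keep EMBEDDING_MODEL consistent with the model passed to run_script.py, since query embeddings and stored paragraph embeddings need to come from the same model to be comparable.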

How to Use This BOT

Run this app locally with:

python app.py

Open http://127.0.0.1:7860 in your browser, and you will see the bot.