---
title: CSRD GPT
emoji: 🌿
colorFrom: blue
colorTo: green
sdk: gradio
python_version: 3.10.0
sdk_version: 3.22.1
app_file: app.py
pinned: true
---

## Introduction

Python version used: 3.10.0

## Built With

- [Gradio](https://www.gradio.app/docs/interface) - Main server and interactive components
- [OpenAI API](https://platform.openai.com/docs/api-reference) - Main LLM engine used in the app
- [HuggingFace Sentence Transformers](https://huggingface.co/docs/hub/sentence-transformers) - Used as the default embedding model

## Requirements

> **_NOTE:_** Before installing the requirements, rename the file `.env.example` to `.env` and add your OpenAI API key there!

We suggest creating a separate Python 3 virtual environment for this app and installing all of the required dependencies there. Run in a terminal/command prompt:

```bash
git clone https://github.com/Nexialog/RegGPT.git
cd RegGPT/
python -m venv venv
```

On UNIX systems:

```bash
source venv/bin/activate
```

On Windows:

```bash
venv\Scripts\activate
```

To install all of the required packages into this environment, run:

```bash
pip install -r requirements.txt
```

All of the required `pip` packages will then be installed, and the app will be ready to run.

## Usage of run_script.py

This script is used for processing PDF documents and generating text embeddings. You can specify different modes and parameters via command-line arguments.

### Process Documents
To process PDF documents and extract paragraphs and metadata, use the following command:

```bash
python run_script.py --type process_documents 
```

You can also use optional arguments to specify the folder containing PDFs, the output data folder, minimum paragraph length, and merge length.
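For intuition, the filtering and merging controlled by `--min_length` and `--merge_length` can be sketched as follows. This is a hypothetical helper, not the script's actual implementation; the real logic lives in `run_script.py`:

```python
def merge_paragraphs(paragraphs, min_length=300, merge_length=700):
    """Drop paragraphs shorter than min_length, then merge consecutive
    paragraphs as long as the result stays within merge_length.

    Illustrative sketch only; run_script.py's extraction may differ.
    """
    # Keep only paragraphs long enough to be meaningful on their own.
    kept = [p for p in paragraphs if len(p) >= min_length]

    merged, buffer = [], ""
    for para in kept:
        candidate = f"{buffer} {para}".strip() if buffer else para
        if len(candidate) <= merge_length:
            buffer = candidate          # still under the merge budget
        else:
            if buffer:
                merged.append(buffer)   # flush the current chunk
            buffer = para
    if buffer:
        merged.append(buffer)
    return merged
```

With the default values, two 300-character paragraphs would be merged into one ~600-character chunk, while a third one would start a new chunk.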

### Generate Embeddings
To generate text embeddings from the processed paragraphs, use the following command:

```bash
python run_script.py --type generate_embeddings
```

This command will use the default embedding model, but you can specify another model using the `--embedding_model` argument.

### Process Documents and Generate Embeddings
To perform both document processing and embedding generation, use:

```bash
python run_script.py --type all
```

### Command Line Arguments

- `--type`: Specifies the operation type. Choices are `all`, `process_documents`, or `generate_embeddings`. (required)
- `--pdf_folder`: Path to the folder containing PDF documents. Default is `pdf_data/`. (optional)
- `--data_folder`: Path to the folder where processed data and embeddings will be saved. Default is `data/`. (optional)
- `--embedding_model`: Specifies the model to be used for generating embeddings. Default is `sentence-transformers/multi-qa-mpnet-base-dot-v1`. (optional)
- `--device`: Specifies the device to be used (CPU or GPU). Choices are `cpu` or `cuda`. Default is `cpu`. (optional)
- `--min_length`: Specifies the minimum paragraph length for inclusion. Default is `300`. (optional)
- `--merge_length`: Specifies the merge length for paragraphs. Default is `700`. (optional)
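A parser with these options could be defined as below. This is a sketch based on the documented defaults, not necessarily the exact code in `run_script.py`:

```python
import argparse

def build_parser():
    """Build an argparse parser matching the documented CLI options."""
    parser = argparse.ArgumentParser(
        description="Process PDF documents and generate text embeddings."
    )
    parser.add_argument("--type", required=True,
                        choices=["all", "process_documents", "generate_embeddings"],
                        help="Operation type.")
    parser.add_argument("--pdf_folder", default="pdf_data/",
                        help="Folder containing PDF documents.")
    parser.add_argument("--data_folder", default="data/",
                        help="Folder for processed data and embeddings.")
    parser.add_argument("--embedding_model",
                        default="sentence-transformers/multi-qa-mpnet-base-dot-v1",
                        help="Model used for generating embeddings.")
    parser.add_argument("--device", choices=["cpu", "cuda"], default="cpu",
                        help="Device used for computation.")
    parser.add_argument("--min_length", type=int, default=300,
                        help="Minimum paragraph length for inclusion.")
    parser.add_argument("--merge_length", type=int, default=700,
                        help="Merge length for paragraphs.")
    return parser
```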

### Examples

```bash
python run_script.py --type process_documents --pdf_folder my_pdf_folder/ --merge_length 800
```

```bash
python run_script.py --type generate_embeddings --device cuda
```

### How to use Colab's GPU

1. Create your own [deploy key on GitHub](https://github.com/Nexialog/RegGPT/settings/keys)
2. Upload the key to Google Drive at the path `drive/MyDrive/ssh_key_github/`
3. Upload the notebook `notebooks/generate_embeddings.ipynb` into a Colab session (or use this [link](https://colab.research.google.com/drive/1E7uHJF7gH_36O9ylIgWhiAjHpRJRyvnv?usp=sharing))
4. Upload the PDF files to the same Colab session at the path `pdf_data/`
5. Run the notebook in GPU mode, then download the folder `data/` containing the embeddings and chunks

## How to Configure a New BOT

1. Put all PDF files in a folder in the repository root (we recommend the folder name `pdf_data/`)
2. Run the Python script `run_script.py` as explained above
3. Configure the bot in `config.py` by following the steps below

To configure the chatbot, modify the `config.py` file, which contains the `CFG_APP` class. Here is what each attribute in the class means:

### Basic Settings

- `DEBUG`: Enables debugging mode
- `K_TOTAL`: Total number of documents to retrieve
- `THRESHOLD`: Similarity threshold for retrieval by embeddings
- `DEVICE`: Device used for computation (`cpu` or `cuda`)
- `BOT_NAME`: Name of the bot
- `MODEL_NAME`: Name of the LLM used
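For intuition, retrieval governed by `K_TOTAL` and `THRESHOLD` typically works like the sketch below: score every document against the query embedding, discard scores below the threshold, and keep the top-k. This is an illustrative sketch, not the app's actual code (the app uses dot-product embeddings from `multi-qa-mpnet-base-dot-v1`):

```python
def retrieve(query_emb, doc_embs, k_total=5, threshold=0.4):
    """Score documents by dot product against the query embedding,
    keep those above the threshold, and return the indices of the
    top-k matches (illustrative only)."""
    scores = [
        sum(q * d for q, d in zip(query_emb, doc_emb))
        for doc_emb in doc_embs
    ]
    # Rank all documents by score, best first.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Apply the similarity threshold, then cap at k_total results.
    return [i for i in ranked if scores[i] >= threshold][:k_total]
```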

### Language and Data

- `DEFAULT_LANGUAGE`: Default language
- `DATA_FOLDER`: Path to the data folder
- `EMBEDDING_MODEL`: Embedding model

### Tokens and Prompts

- `MAX_TOKENS_REF_QUESTION`: Maximum tokens in the reformulated question
- `MAX_TOKENS_ANSWER`: Maximum tokens in answers
- `INIT_PROMPT`: Initial prompt
- `SOURCES_PROMPT`: Sources prompt for responses

### Default Questions

- `DEFAULT_QUESTIONS`: Tuple of default questions

### Reformulation Prompt

- `REFORMULATION_PROMPT`: Prompt for reformulating questions

### Metadata Path

- `DOC_METADATA_PATH`: Path to document metadata
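Putting these attributes together, a `CFG_APP` class might look like the sketch below. All values are placeholders chosen for illustration; adapt them to your own bot and documents:

```python
class CFG_APP:
    # Basic settings
    DEBUG = False
    K_TOTAL = 10                  # total number of documents to retrieve
    THRESHOLD = 0.4               # similarity threshold for retrieval
    DEVICE = "cpu"                # "cpu" or "cuda"
    BOT_NAME = "CSRD GPT"
    MODEL_NAME = "gpt-3.5-turbo"

    # Language and data
    DEFAULT_LANGUAGE = "English"
    DATA_FOLDER = "data/"
    EMBEDDING_MODEL = "sentence-transformers/multi-qa-mpnet-base-dot-v1"

    # Tokens and prompts
    MAX_TOKENS_REF_QUESTION = 128
    MAX_TOKENS_ANSWER = 512
    INIT_PROMPT = "You are an assistant answering questions about the CSRD."
    SOURCES_PROMPT = "Cite the sources used in your answer."

    # Default questions shown in the UI
    DEFAULT_QUESTIONS = (
        "What is the CSRD?",
        "Who has to report under the CSRD?",
    )

    # Prompt used to reformulate the user's question
    REFORMULATION_PROMPT = "Reformulate the question so it is self-contained."

    # Metadata path
    DOC_METADATA_PATH = "data/doc_metadata.json"
```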

## How to Use This BOT

Run the app locally with:

```bash
python app.py
```

Open [http://127.0.0.1:7860](http://127.0.0.1:7860) in your browser, and you will see the bot.