AxelFritz1 committed
Commit 7013379 · Parent(s): e1571d6

first commit
.gitattributes CHANGED
@@ -33,3 +33,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+data/index.faiss filter=lfs diff=lfs merge=lfs -text
+data/ filter=lfs diff=lfs merge=lfs -text
.gitignore ADDED
@@ -0,0 +1,14 @@
+__pycache__/
+.vscode/
+.chainlit/
+.idea/
+
+.env
+env/
+venv/
+
+pdf_data*
+reg_gpt_*
+rma_gpt_v1/
+
+app.log
Images/Reg-GPT.png ADDED
README.md CHANGED
@@ -1,3 +1,15 @@
+---
+title: RegGPT
+emoji: 🚀
+colorFrom: indigo
+colorTo: red
+sdk: gradio
+python_version: 3.10.0
+sdk_version: 3.22.1
+app_file: app.py
+pinned: true
+---
+
 ---
 title: CSRD GPT
 emoji: 📊
@@ -9,4 +21,152 @@ app_file: app.py
 pinned: false
 ---
 
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+## Introduction
+
+The Python version used is 3.10.0.
+
+## Built With
+
+- [Gradio](https://www.gradio.app/docs/interface) - Main server and interactive components
+- [OpenAI API](https://platform.openai.com/docs/api-reference) - Main LLM engine used in the app
+- [HuggingFace Sentence Transformers](https://huggingface.co/docs/hub/sentence-transformers) - Used as the default embedding model
+
+## Requirements
+
+> **_NOTE:_** Before installing the requirements, rename the file `.env.example` to `.env` and put your OpenAI API key there!
+
+We suggest creating a separate virtual environment running Python 3 for this app and installing all of the required dependencies there. Run in Terminal/Command Prompt:
+
+```bash
+git clone https://github.com/Nexialog/RegGPT.git
+cd RegGPT/
+python -m venv env
+```
+
+On UNIX systems:
+
+```bash
+source env/bin/activate
+```
+
+On Windows:
+
+```bash
+env\Scripts\activate
+```
+
+To install all of the required packages into this environment, simply run:
+
+```bash
+pip install -r requirements.txt
+```
+
+All of the required `pip` packages will then be installed, and the app will be ready to run.
+
+## Usage of `run_script.py`
+
+This script processes PDF documents and generates text embeddings. You can specify different modes and parameters via command-line arguments.
+
+### Process Documents
+To process PDF documents and extract paragraphs and metadata, use the following command:
+
+```bash
+python run_script.py --type process_documents
+```
+
+You can also use optional arguments to specify the folder containing PDFs, the output data folder, the minimum paragraph length, and the merge length.
+
+### Generate Embeddings
+To generate text embeddings from the processed paragraphs, use the following command:
+
+```bash
+python run_script.py --type generate_embeddings
+```
+
+This command uses the default embedding model, but you can specify another model with the `--embedding_model` argument.
+
+### Process Documents and Generate Embeddings
+To perform both document processing and embedding generation, use:
+
+```bash
+python run_script.py --type all
+```
+
+### Command Line Arguments
+
+- `--type`: Specifies the operation type. Choices are `all`, `process_documents`, or `generate_embeddings`. (required)
+- `--pdf_folder`: Path to the folder containing PDF documents. Default is `pdf_data/`. (optional)
+- `--data_folder`: Path to the folder where processed data and embeddings will be saved. Default is `data/`. (optional)
+- `--embedding_model`: Model used for generating embeddings. Default is `sentence-transformers/multi-qa-mpnet-base-dot-v1`. (optional)
+- `--device`: Device to use for computation. Choices are `cpu` or `cuda`. Default is `cpu`. (optional)
+- `--min_length`: Minimum paragraph length for inclusion. Default is `300`. (optional)
+- `--merge_length`: Merge length for paragraphs. Default is `700`. (optional)
+
+### Examples
+
+```bash
+python run_script.py --type process_documents --pdf_folder my_pdf_folder/ --merge_length 800
+```
+
+```bash
+python run_script.py --type generate_embeddings --device cuda
+```
+
+### How to use Colab's GPU
+
+1. Create your own [deploy key on GitHub](https://github.com/Nexialog/RegGPT/settings/keys)
+2. Upload the key to Google Drive under the path `drive/MyDrive/ssh_key_github/`
+3. Upload the notebook `notebooks/generate_embeddings.ipynb` into a Colab session (or use this [link](https://colab.research.google.com/drive/1E7uHJF7gH_36O9ylIgWhiAjHpRJRyvnv?usp=sharing))
+4. Upload the PDF files to the same Colab session under the path `pdf_data/`
+5. Run the notebook in GPU mode and download the folder `data/` containing the embeddings and chunks
+
+## How to Configure a New BOT
+
+1. Put all PDF files in a folder in the same repository (we recommend the folder name `pdf_data`)
+2. Run the Python script `run_script.py` as explained above.
+3. Configure the bot in the config by following the steps below:
+
+In order to configure the chatbot, modify the `config.py` file that contains the `CFG_APP` class. Here is what each attribute in the class means:
+
+### Basic Settings
+
+- `DEBUG`: Debugging mode
+- `K_TOTAL`: Total number of retrieved documents
+- `THRESHOLD`: Similarity threshold for retrieval by embeddings
+- `DEVICE`: Device for computation
+- `BOT_NAME`: The name of the bot
+- `MODEL_NAME`: The name of the model
+
+### Language and Data
+
+- `DEFAULT_LANGUAGE`: Default language
+- `DATA_FOLDER`: Path to the data folder
+- `EMBEDDING_MODEL`: Embedding model
+
+### Tokens and Prompts
+
+- `MAX_TOKENS_REF_QUESTION`: Maximum number of tokens in the reformulated question
+- `MAX_TOKENS_ANSWER`: Maximum number of tokens in answers
+- `INIT_PROMPT`: Initial prompt
+- `SOURCES_PROMPT`: Sources prompt for responses
+
+### Default Questions
+
+- `DEFAULT_QUESTIONS`: Tuple of default questions
+
+### Reformulation Prompt
+
+- `REFORMULATION_PROMPT`: Prompt for reformulating questions
+
+### Metadata Path
+
+- `DOC_METADATA_PATH`: Path to document metadata
+
+## How to Use This BOT
+
+Run this app locally with:
+
+```bash
+python app.py
+```
+
+Open [http://127.0.0.1:7860](http://127.0.0.1:7860) in your browser, and you will see the bot.
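As an editorial illustration of the flags documented in the README above, here is a minimal `argparse` sketch. It is not the repository's actual parser (`run_script.py`'s real implementation may differ); names, choices, and defaults are taken only from the README's argument list.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Mirrors the CLI documented in the README; run_script.py's real parser may differ.
    parser = argparse.ArgumentParser(
        description="Process PDF documents and generate text embeddings"
    )
    parser.add_argument(
        "--type",
        required=True,
        choices=["all", "process_documents", "generate_embeddings"],
    )
    parser.add_argument("--pdf_folder", default="pdf_data/")
    parser.add_argument("--data_folder", default="data/")
    parser.add_argument(
        "--embedding_model",
        default="sentence-transformers/multi-qa-mpnet-base-dot-v1",
    )
    parser.add_argument("--device", choices=["cpu", "cuda"], default="cpu")
    parser.add_argument("--min_length", type=int, default=300)
    parser.add_argument("--merge_length", type=int, default=700)
    return parser


# Parse the README's first example invocation.
args = build_parser().parse_args(
    ["--type", "process_documents", "--merge_length", "800"]
)
```

Unspecified flags fall back to the documented defaults, so `args.device` is `"cpu"` here even though it was not passed.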
app.py ADDED
@@ -0,0 +1,164 @@
+import os
+import openai
+import gradio as gr
+from dotenv import load_dotenv
+from utils import chat
+from config import CFG_APP
+
+# Load the API key
+try:
+    load_dotenv()
+except Exception:
+    pass
+openai.api_key = os.environ["OPENAI_API_KEY"]
+
+
+# System template
+system_template = {
+    "role": "system",
+    "content": CFG_APP.INIT_PROMPT,
+}
+
+# App
+theme = gr.themes.Monochrome(
+    font=[gr.themes.GoogleFont("Kanit"), "sans-serif"],
+)
+
+with gr.Blocks(title=CFG_APP.BOT_NAME, css="assets/style.css", theme=theme) as demo:
+    gr.Markdown(f"<h1><center>{CFG_APP.BOT_NAME} 🤖</center></h1>")
+
+    with gr.Row():
+        with gr.Column(scale=2):
+            chatbot = gr.Chatbot(
+                elem_id="chatbot", label=f"{CFG_APP.BOT_NAME} chatbot", show_label=False
+            )
+            state = gr.State([system_template])
+
+
+            with gr.Row():
+                ask = gr.Textbox(
+                    show_label=False,
+                    placeholder="Ask your question here and press Enter",
+                )
+
+            ask_examples_hidden = gr.Textbox(elem_id="hidden-message")
+
+            examples_questions = gr.Examples(
+                [*CFG_APP.DEFAULT_QUESTIONS],
+                [ask_examples_hidden],
+                examples_per_page=15,
+            )
+
+        with gr.Column(scale=1, variant="panel"):
+            sources_textbox = gr.Markdown(show_label=False)
+
+    ask.submit(
+        fn=chat,
+        inputs=[ask, state],
+        outputs=[chatbot, state, sources_textbox],
+    )
+    ask.submit(lambda: gr.update(value=""), [], [ask])
+
+    ask_examples_hidden.change(
+        fn=chat,
+        inputs=[ask_examples_hidden, state],
+        outputs=[chatbot, state, sources_textbox],
+    )
+    demo.queue(concurrency_count=16)
+    gr.Markdown(
+        """
+
+### 🎯 Understanding CSRD_GPT's Purpose
+
+In an era marked by a growing emphasis on Environmental, Social, and Governance (ESG) considerations, staying well informed about the intricate landscape of ESG and CSRD regulations can be a challenging endeavor. The evolving nature of these regulations and the wealth of information available can make it difficult to extract precise insights.
+
+
+\n CSRD_GPT, a chatbot-style conversational tool, offers an effective solution to this challenge. CSRD_GPT is specifically designed to address queries related to CSRD regulations. The tool draws its insights solely from documents published by official European regulatory sources, which assures the reliability and pertinence of its responses. By strictly focusing on these documents, CSRD_GPT ensures that it does not reference irrelevant sources, maintaining a high standard of precision in its responses. This tool harnesses the power of conversational AI to help users navigate the complex world of environmental regulations, simplifying the task and promoting compliance efficiency.
+
+"""
+    )
+
+    gr.Markdown(
+        """
+
+### 📃 Inputs and functionalities
+
+In its initial release, Version 0, CSRD_GPT uses the following 5 documents as the basis for its answers:
+\n
+|Document|Link|
+|:----|:----|
+|CSRD|https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=CELEX:32022L2464|
+|CSRD - Delegated Act|https://webgate.ec.europa.eu/regdel/web/delegatedActs/2111/documents/latest?lang=en|
+|ESRS – CSRD DA annex1|https://ec.europa.eu/finance/docs/level-2-measures/csrd-delegated-act-2023-5303-annex-1_en.pdf|
+|ESRS – CSRD DA annex2|https://ec.europa.eu/finance/docs/level-2-measures/csrd-delegated-act-2023-5303-annex-2_en.pdf|
+|Q&A on the Adoption of European Sustainability Reporting Standards|https://ec.europa.eu/commission/presscorner/detail/en/qanda_23_4043|
+
+"""
+    )
+
+    gr.Markdown(
+        """
+
+CSRD_GPT lets users input queries via a dedicated prompt area, much like the one used in OpenAI's ChatGPT. If you're unsure of what to ask, examples of potential questions are displayed below the query bar. Simply click on one of these and the tool will generate the corresponding response.
+
+
+\n When a query is submitted to the model, 10 sources are extracted from the previously mentioned documents to provide a comprehensive answer. These sources are quoted within the generated answer to ensure accuracy and reliability. For easy reference, exact passages can be located quickly by clicking on the link icon 🔗 beneath each excerpt, which will guide you directly to the relevant section within the document.
+
+"""
+    )
+
+    gr.Markdown(
+        """
+
+### 💬 Prompt Initialization
+
+To limit the model's responses to only the 10 proposed sources, a set of prompts has been designed to serve as instructions to the GPT API. This design decision ensures that the model's output is reliably grounded in the selected documents, contributing to the overall accuracy and reliability of the tool. The structured guidance provided by these prompts enables the GPT API to navigate the wealth of information contained within the ten sources more effectively, delivering highly relevant and concise responses to users' queries.
+
+<u>Prompts used to initialize CSRD_GPT:</u>
+
+- "You are CSRD_GPT, an expert in CSRD regulations, an AI Assistant by Nexialog Consulting."
+- "You are given a question and extracted parts of regulation reports."
+- "Provide a clear and structured answer based only on the context provided."
+- "When relevant, use bullet points and lists to structure your answers."
+- "When relevant, use facts and numbers from the following documents in your answer."
+- "Whenever you use information from a document, reference it at the end of the sentence (ex: [doc 2])."
+- "You don't have to use all documents, only if it makes sense in the conversation."
+- "Don't make up new sources and references that don't exist."
+- "If no relevant information to answer the question is present in the documents, just say you don't have enough information to answer."
+
+
+
+"""
+    )
+
+    gr.Markdown(
+        """
+
+### ⚙️ Technical features
+
+CSRD_GPT operates through two core modules, the GPT API from OpenAI and an embedding model. The functioning of these components is integrated into a seamless workflow, summarized in the figure below:
+
+
+<div style="display:flex; justify-content:center;">
+    <img src="file/Images/Reg-GPT.png" width="800" height="800" />
+</div>
+
+
+- OpenAI API model: gpt-3.5-turbo
+- Embedding model: https://huggingface.co/sentence-transformers/multi-qa-mpnet-base-dot-v1
+
+
+
+
+"""
+    )
+    gr.Markdown(
+        "<h1><center>Disclaimer ⚠️</center></h1>\n"
+        + """
+- Please be aware that this is Version 0 of our application. You may encounter certain errors or glitches as we continue to refine and enhance its functionality. You might experience some nonsensical answers, similar to those experienced when using ChatGPT. If you encounter any issues, don't hesitate to reach out to us at [email protected].
+- Our application relies on an external API provided by OpenAI. There may be instances where errors occur due to high demand on the API. If you encounter such an issue, we recommend that you refresh the page and retry your query, or try again a little later.
+- When using our application, we urge you to ask clear and explicit questions that adhere to the scope of CSRD regulations. This will ensure that you receive the most accurate and relevant responses from the system.
+"""
+    )
+
+demo.launch()
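An editorial note on the key loading above: app.py indexes `os.environ["OPENAI_API_KEY"]` directly, which raises a bare `KeyError` when the `.env` step was skipped. A slightly more defensive sketch (not the repository's code; the error message and placeholder key are illustrative):

```python
import os


def read_api_key(name: str = "OPENAI_API_KEY") -> str:
    # Fail with an actionable message instead of a bare KeyError.
    key = os.environ.get(name)
    if not key:
        raise RuntimeError(
            f"{name} is not set; rename .env.example to .env and put your key there."
        )
    return key


# Placeholder value for demonstration only, not a real key.
os.environ["OPENAI_API_KEY"] = "sk-dummy-for-demo"
api_key = read_api_key()
```

`os.environ.get` returns `None` for a missing variable, so both an unset and an empty key trigger the explicit error.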
assets/style.css ADDED
@@ -0,0 +1,215 @@
+.warning-box {
+    background-color: #fff3cd;
+    border: 1px solid #ffeeba;
+    border-radius: 4px;
+    padding: 15px 20px;
+    font-size: 14px;
+    color: #856404;
+    display: inline-block;
+    margin-bottom: 15px;
+}
+
+
+.tip-box {
+    background-color: #f0f9ff;
+    border: 1px solid #80d4fa;
+    border-radius: 4px;
+    margin-top: 20px;
+    padding: 15px 20px;
+    font-size: 14px;
+    color: #006064;
+    display: inline-block;
+    margin-bottom: 15px;
+    width: auto;
+}
+
+.tip-box-title {
+    font-weight: bold;
+    font-size: 14px;
+    margin-bottom: 5px;
+}
+
+.light-bulb {
+    display: inline;
+    margin-right: 5px;
+}
+
+.gr-box {
+    border-color: #d6c37c;
+}
+
+#hidden-message {
+    display: none;
+}
+
+.message {
+    font-size: 14px !important;
+}
+
+
+a {
+    text-decoration: none;
+    color: inherit;
+}
+
+.card {
+    background-color: #233f55;
+    border-radius: 10px;
+    box-shadow: 0 4px 6px rgba(0, 0, 0, 0.1);
+    overflow: hidden;
+    display: flex;
+    flex-direction: column;
+    margin: 20px;
+}
+
+.card-content {
+    padding: 20px;
+}
+
+.card-content h2 {
+    font-size: 14px !important;
+    font-weight: bold;
+    margin-bottom: 10px;
+    margin-top: 0px !important;
+    color: #577b9b !important;
+}
+
+.card-content p {
+    font-size: 12px;
+    margin-bottom: 0;
+}
+
+.card-footer {
+    background-color: #f4f4f4;
+    font-size: 10px;
+    padding: 10px;
+    display: flex;
+    justify-content: space-between;
+    align-items: center;
+}
+
+.card-footer span {
+    flex-grow: 1;
+    text-align: left;
+    color: #999 !important;
+}
+
+.pdf-link {
+    display: inline-flex;
+    align-items: center;
+    margin-left: auto;
+    text-decoration: none !important;
+    font-size: 14px;
+}
+
+
+
+.message.user {
+    background-color: #b20032 !important;
+    border: none;
+    color: white !important;
+}
+
+.message.bot {
+    /* background-color: #f2f2f7 !important; */
+    border: none;
+}
+
+.gallery-item>div:hover {
+    background-color: #7494b0 !important;
+    color: white !important;
+}
+
+.gallery-item:hover {
+    border-color: #7494b0 !important;
+}
+
+.gallery-item>div {
+    background-color: white !important;
+    color: #577b9b !important;
+}
+
+.label {
+    color: #577b9b !important;
+}
+
+.paginate {
+    color: #577b9b !important;
+}
+
+
+label>span {
+    color: #577b9b !important;
+}
+
+/* Pseudo-element for the circularly cropped picture */
+.message.bot::before {
+    content: '';
+    position: absolute;
+    top: -10px;
+    left: -10px;
+    width: 30px;
+    height: 30px;
+    background-image: url('data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAACAAAAAgCAYAAABzenr0AAACnUlEQVR4AcXXA4wkQRiG4eHZtm3btm3btm3bDs+2bdvm2vPfm6Qu6XRuOz3LSp7xdH3ltCU8Za+lsA1JYLVER6HiOFiFXkgGa1QHiIvzCMQplI2uAKJMiY4A50ILwHs7bGG9eFqUQgx3A2gq74X+SAGrO5U7MQvfsAKF4XAzQD68QSDOoLbp3lAt/wxR3mMGssNmFEDTgAUQJQTTYDO7ticgEKLhwhMMRVpYDQIUwyeI8hhZzbbeipQYgNsIhmgE4xraIqk+AGJiNUQJwjCD1hsGSYfheIgQiIYXJuASRJmM8vgBUa4hdXi328yYgGdwQZSvuq4ehi0QxR9dYTVTUWIUQmEDtbESbzRBXBB4Yyb+QJTjSGx22U3DD/wMxQ+8xxXswRt8wjUInuKsboiamG19aXyBuCEQC9AIP/AZPhC4sBVxzVQeG2vgDR8YCYDgG1YhNZxoiWsIgi/2IA/iwojTwkMsFEN5VAhFRYzAc7hwFbXggBX5sB1+8MRNnNc5p3MAxcyuhOJ4ppvdX9ABuXET4qbtZocoLnZBFG+ch+AeNsED9/AFIRAY+YSSZjejBvCCKCdwGoJA+CII97EAA9Efg3SGYBRGoxkcZgIkwTGI8ge98RqCYHhClACcQRskMlqCZlvfCQEQZScqwQMCH6yFN0TDD0fRFAnCGiANrkKUH6iICvDRBKiOAZpe0fLBftRFXHf3/yG6k3ADYkIfoDzsKICV+ArR8cQGJDYbIBseQ5TP/2bt/wJo/hcD5bADHhCNrYhtNkA5PIILgiVwGgbQ7a6oh8PwxUeUdHcIcmABrqGAhWIygPY6CdEefY2XnfEpmQ52gwAVTKwmmyW8xTBAVBZ1yt2DK7oC2JAdc/EM5aPrztiJEkgXnuv8BdWTESwwR9FxAAAAAElFTkSuQmCC');
+    background-color: #fff;
+    background-size: cover;
+    background-position: center;
+    border-radius: 50%;
+    z-index: 10;
+}
+
+
+
+.user.svelte-6roggh.svelte-6roggh {
+    padding: 17px 24px;
+    text-align: justify;
+}
+
+.gallery.svelte-1ayixqk {
+    text-align: left;
+}
+
+.card-content p,
+.card-content ul li {
+    color: #fff !important;
+}
+
+
+.message.bot, .bot.svelte-6roggh.svelte-6roggh {
+    background: #233f55 !important;
+    padding: 17px 24px !important;
+    text-align: justify !important;
+    color: #fff !important;
+}
+
+#chatbot {
+    height: auto !important;
+    max-height: 1000px;
+}
+
+#type-emb label {
+    background: #ebeaea;
+}
+
+.source {
+    background-color: #f8f9fa;
+    border: 1px solid #ddd;
+    padding: 15px;
+    margin-bottom: 10px;
+    border-radius: 4px;
+}
+
+.title {
+    font-size: 18px;
+    color: #333;
+    margin-bottom: 5px;
+    font-weight: bold;
+}
+
+.wrap.svelte-6roggh.svelte-6roggh {
+    padding: var(--block-padding);
+    height: 100%;
+    overflow-y: auto;
+    max-height: 1000px;
+}
config.py ADDED
@@ -0,0 +1,73 @@
+class CFG_APP:
+    DEBUG = True
+    K_TOTAL = 10  # Number of retrieved paragraphs
+    THRESHOLD = 0.3
+    DEVICE = "cpu"
+    BOT_NAME = "CSRD_GPT"
+    MODEL_NAME = "gpt-3.5-turbo"
+    DEFAULT_LANGUAGE = "English"
+
+    DATA_FOLDER = "data/"
+    EMBEDDING_MODEL = "sentence-transformers/multi-qa-mpnet-base-dot-v1"
+
+    MAX_TOKENS_REF_QUESTION = 128  # Number of tokens in the reformulated question
+    MAX_TOKENS_ANSWER = 1024  # Number of tokens in answers
+    MAX_TOKENS_API = 3100
+    INIT_PROMPT = (
+        f"You are {BOT_NAME}, an expert in CSRD regulations, an AI Assistant by Nexialog Consulting. "
+        "You are given a question and extracted parts of regulation reports. "
+        "Provide a clear and structured answer based only on the context provided. "
+        "When relevant, use bullet points and lists to structure your answers."
+    )
+    SOURCES_PROMPT = (
+        "When relevant, use facts and numbers from the following documents in your answer. "
+        "Whenever you use information from a document, reference it at the end of the sentence by naming it Exc (ex: [exc 2]). "
+        "Very important! Never use the word Document or Doc for referencing an excerpt, always exc. "
+        "You don't have to use all documents, only if it makes sense in the conversation. "
+        "If no relevant information to answer the question is present in the documents, "
+        "just say you don't have enough information to answer."
+    )
+
+    DEFAULT_QUESTIONS = (
+        "What are the key requirements of the CSRD for companies in the EU?",
+        "How does the CSRD differ from the previous Non-Financial Reporting Directive (NFRD)?",
+        "Which types of companies are affected by the CSRD, and what are the thresholds for compliance?",
+        "How will the CSRD impact the way companies report on sustainability and environmental issues?",
+        "Are there specific guidelines or standards that companies must follow under the CSRD for reporting sustainability measures?",
+        "Comment la CSRD va-t-elle influencer la transparence et la responsabilité des entreprises en matière de pratiques durables ?",
+        "Quelles sont les conséquences pour les entreprises qui ne respectent pas les normes de la CSRD ?",
+        "La CSRD exige-t-elle des entreprises de rapporter sur des indicateurs spécifiques de durabilité environnementale et sociale ?",
+        "Comment la mise en œuvre de la CSRD peut-elle bénéficier à la performance globale des entreprises ?",
+        "Quel rôle jouent les auditeurs et les conseillers en matière de conformité aux exigences de la CSRD ?",
+    )
+
+    REFORMULATION_PROMPT = """
+    Important! Give the output as a standalone question followed by the detected language, whatever the form of the query.
+    Reformulate the following user message to be a short standalone question in English, in the context of an educational discussion about regulations in banks. Then detect the language of the query.
+    Sometimes, explanations of some abbreviations will be given in parentheses; keep them.
+    ---
+    query: C'est quoi les règles que les banques américaines doivent suivre ?
+    standalone question: What are the key regulations that banks in the United States must follow?
+    language: French
+    ---
+    query: what are the main effects of bank regulations?
+    standalone question: What are the main causes of bank regulations change in the last century?
+    language: English
+    ---
+    query: UL (Unexpected Loss)
+    standalone question: What does UL (Unexpected Loss) stand for?
+    language: English
+    """
+
+    HYDE_PROMPT = """
+    Important! Give the output as an answer to the query. First translate the query into English, then answer it, in English, in 2 sentences maximum, using the right vocabulary for the context of the query.
+    Very important: the answer must be followed by the detected language of the query, whatever the form of the query. You must keep the question at the beginning of the answer.
+    Here is an example of the template you must follow to create your answer:
+    ---
+    query: C'est quoi les règles que les banques américaines doivent suivre ?
+    output: What are the rules that American banks must follow? American banks must follow a set of federal and state regulations imposed by agencies such as the Federal Reserve and the Consumer Financial Protection Bureau.
+    language: French
+    """
+
+
+    DOC_METADATA_PATH = f"{DATA_FOLDER}/doc_metadata.json"
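An editorial aside on `K_TOTAL` and `THRESHOLD` above: together they bound what the bot retrieves, with scores below the threshold dropped and at most `K_TOTAL` passages kept. A minimal sketch of that gating, assuming similarity scores have already been computed (the FAISS search and `select_passages` helper are illustrative, not the repository's code):

```python
def select_passages(scored, k_total=10, threshold=0.3):
    # scored: list of (passage_id, similarity_score) pairs, e.g. from a FAISS search.
    # Keep only passages at or above the threshold, best matches first, capped at k_total.
    kept = [(pid, score) for pid, score in scored if score >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:k_total]


# Toy scores: "b" falls below THRESHOLD and is dropped; the rest are ranked.
hits = select_passages(
    [("a", 0.9), ("b", 0.2), ("c", 0.5)], k_total=2, threshold=0.3
)
```

With `k_total=2` and `threshold=0.3`, the toy call keeps `("a", 0.9)` and `("c", 0.5)` in that order.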
data/cache-f1c6bb9d30103bcf.arrow ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:87c87020f08505f5c44170014acaae9ae57ab85c2f04f5a11c34ac5823a4bd33
+size 5024312
data/data-00000-of-00001.arrow ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:74e5cbe1254f5911d3eb8b9645487a72c4caaecf65aea2bae78e149f2fa69bb3
+size 1388056
data/dataset_info.json ADDED
@@ -0,0 +1,52 @@
+{
+  "citation": "",
+  "description": "",
+  "features": {
+    "id": {
+      "dtype": "string",
+      "_type": "Value"
+    },
+    "document_id": {
+      "dtype": "string",
+      "_type": "Value"
+    },
+    "content_type": {
+      "dtype": "string",
+      "_type": "Value"
+    },
+    "content": {
+      "dtype": "string",
+      "_type": "Value"
+    },
+    "length": {
+      "dtype": "int64",
+      "_type": "Value"
+    },
+    "idx_block": {
+      "dtype": "int64",
+      "_type": "Value"
+    },
+    "page_number": {
+      "dtype": "int64",
+      "_type": "Value"
+    },
+    "x0": {
+      "dtype": "float64",
+      "_type": "Value"
+    },
+    "y0": {
+      "dtype": "float64",
+      "_type": "Value"
+    },
+    "x1": {
+      "dtype": "float64",
+      "_type": "Value"
+    },
+    "y1": {
+      "dtype": "float64",
+      "_type": "Value"
+    }
+  },
+  "homepage": "",
+  "license": ""
+}
data/doc_metadata.json ADDED
@@ -0,0 +1,5 @@
+[{"id": "f20d9592e5232f906ca26833bb63fdab", "title": "", "author": "", "subject": "", "creation_date": "D:20230731101034+02'00'", "modification_date": "D:20230731101034+02'00'", "n_pages": 13, "url": "https://webgate.ec.europa.eu/regdel/web/delegatedActs/2111/documents/latest?lang=en", "file_name": "CSRD - Delegated Act.pdf", "short_name": "CSRD - Delegated Act.pdf", "release_date": "", "report_type": "", "source": ""},
+{"id": "ba860baaf00570c824351b7e2be3ed09", "title": "", "author": "BOTTAZZI Giulia (FISMA)", "subject": "", "creation_date": "D:20230731101209+02'00'", "modification_date": "D:20230731101209+02'00'", "n_pages": 245, "url": "https://ec.europa.eu/finance/docs/level-2-measures/csrd-delegated-act-2023-5303-annex-1_en.pdf", "file_name": "ESRS \u2013 CSRD DA annex1.pdf", "short_name": "ESRS \u2013 CSRD DA annex1.pdf", "release_date": "", "report_type": "", "source": ""},
+{"id": "42e11a73824e7da1206260d851c0409d", "title": "", "author": "", "subject": "", "creation_date": "D:20230811124305+02'00'", "modification_date": "", "n_pages": 5, "url": "https://ec.europa.eu/commission/presscorner/detail/en/qanda_23_4043", "file_name": "Q&A on the Adoption of European Sustainability Reporting Standards.pdf", "short_name": "Q&A on the Adoption of European Sustainability Reporting Standards.pdf", "release_date": "", "report_type": "", "source": ""},
+{"id": "60d2167e44781f7a3679e3dc4fd877a9", "title": "Publications Office", "author": "Publications Office", "subject": "", "creation_date": "D:20221215163241+01'00'", "modification_date": "D:20221215163241+01'00'", "n_pages": 66, "url": "https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=CELEX:32022L2464", "file_name": "CSRD.pdf", "short_name": "CSRD.pdf", "release_date": "", "report_type": "", "source": ""},
+{"id": "52fe10867100623bad10c1370eda2e6d", "title": "", "author": "BOTTAZZI Giulia (FISMA)", "subject": "", "creation_date": "D:20230731101109+02'00'", "modification_date": "D:20230731101109+02'00'", "n_pages": 34, "url": "https://ec.europa.eu/finance/docs/level-2-measures/csrd-delegated-act-2023-5303-annex-2_en.pdf", "file_name": "ESRS \u2013 CSRD DA annex2.pdf", "short_name": "ESRS \u2013 CSRD DA annex2.pdf", "release_date": "", "report_type": "", "source": ""}]
data/index.faiss ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:932f188d2abad7ec11b296960084c933328e55f3a6591448adc5275e502b774b
+size 3631149
data/state.json ADDED
@@ -0,0 +1,13 @@
+{
+  "_data_files": [
+    {
+      "filename": "data-00000-of-00001.arrow"
+    }
+  ],
+  "_fingerprint": "2c6b766609c8e690",
+  "_format_columns": null,
+  "_format_kwargs": {},
+  "_format_type": null,
+  "_output_all_columns": false,
+  "_split": null
+}
glossary.json ADDED
@@ -0,0 +1,403 @@
+ {
+     "ABoR": "Administrative Board of Review",
+     "ABS": "asset-backed security",
+     "ABSPP": "asset-backed securities purchase programme",
+     "ACH": "automated clearing house",
+     "AIF": "alternative investment fund",
+     "AMA": "advanced measurement approach",
+     "AMC": "asset management company",
+     "AMI-Pay": "Advisory Group on Market Infrastructures for Payments",
+     "AMI-SeCo": "Advisory Group on Market Infrastructures for Securities and Collateral",
+     "AML": "anti-money laundering",
+     "API": "application programming interface",
+     "APP": "asset purchase programme",
+     "ASC": "Advisory Scientific Committee",
+     "ASLP": "automated security lending programme",
+     "AT1": "Additional Tier 1",
+     "ATC": "Advisory Technical Committee",
+     "ATM": "automated teller machine",
+     "b.o.p.": "balance of payments",
+     "BCBS": "Basel Committee on Banking Supervision",
+     "BCPs": "Basel Core Principles",
+     "BEPGs": "Broad Economic Policy Guidelines",
+     "BIC": "Business Identifier Code",
+     "BIS": "Bank for International Settlements",
+     "BPM6": "Balance of Payments and International Investment Position Manual",
+     "bps": "basis points",
+     "BRM": "breach reporting mechanism",
+     "BRRD": "Bank Recovery and Resolution Directive",
+     "c.i.f.": "Cost, insurance and freight at the importer’s border",
+     "CAPE": "cyclically adjusted price/earnings (ratio)",
+     "CAPM": "capital asset pricing model",
+     "CAS": "capital adequacy statement",
+     "CBOE": "Chicago Board Options Exchange",
+     "CBPP": "covered bond purchase programme",
+     "CBR": "combined buffer requirement",
+     "CCBM": "correspondent central banking model",
+     "CCBM2": "Collateral Central Bank Management",
+     "CCoB": "capital conservation buffer",
+     "CCP": "central counterparty",
+     "CCyB": "countercyclical capital buffer",
+     "CDS": "credit default swap",
+     "CESR": "Committee of European Securities Regulators",
+     "CET1": "Common Equity Tier 1",
+     "CGFS": "Committee on the Global Financial System",
+     "CGO": "Compliance and Governance Office",
+     "CISS": "composite indicator of systemic stress",
+     "CJEU": "Court of Justice of the European Union",
+     "CMU": "capital markets union",
+     "CO2": "carbon dioxide",
+     "COGESI": "Contact Group on Euro Securities Infrastructures",
+     "COI": "Centralised On-Site Inspections Division",
+     "COREP": "common reporting",
+     "CPI": "consumer price index",
+     "CPMI": "Committee on Payments and Market Infrastructures",
+     "CPSIPS": "Core Principles for Systemically Important Payment Systems",
+     "CRD": "Capital Requirements Directive",
+     "CRE": "commercial real estate",
+     "CRR": "Capital Requirements Regulation",
+     "CSD": "central securities depository",
+     "CSPP": "corporate sector purchase programme",
+     "D-SIB": "domestic systemically important bank",
+     "DFR": "deposit facility rate",
+     "DG ECFIN": "Directorate General for Economic and Financial Affairs, European Commission",
+     "DGS": "deposit guarantee scheme",
+     "DLT": "distributed ledger technology",
+     "DNSH": "Do No Significant Harm",
+     "DSR": "debt service ratio",
+     "DSTI": "debt service-to-income",
+     "DTA": "deferred tax asset",
+     "DTI": "debt-to-income",
+     "DvD": "delivery versus delivery",
+     "DvP": "delivery versus payment",
+     "EAD": "exposure at default",
+     "EBA": "European Banking Authority",
+     "EBITDA": "earnings before interest, taxes, depreciation and amortisation",
+     "EBP": "excess bond premium",
+     "EBPP": "Electronic Bill Presentment and Payment",
+     "ECA": "European Court of Auditors",
+     "ECAF": "Eurosystem credit assessment framework",
+     "ECB": "European Central Bank",
+     "ECL": "expected credit loss",
+     "ECOFIN": "Economic and Financial Affairs Council, Council of the European Union",
+     "ECU": "European Currency Unit",
+     "EDF": "expected default frequency",
+     "EDI": "electronic data interchange",
+     "EDIS": "European Deposit Insurance Scheme",
+     "EDP": "excessive deficit procedure",
+     "EDW": "European Data Warehouse",
+     "EEA": "European Economic Area",
+     "EER": "effective exchange rate",
+     "EFC": "Economic and Financial Committee",
+     "EFSF": "European Financial Stability Facility",
+     "EFSM": "European Financial Stabilisation Mechanism",
+     "EIOPA": "European Insurance and Occupational Pensions Authority",
+     "EL": "Expected Loss",
+     "ELB": "effective lower bound",
+     "ELBE": "Expected Loss Best Estimate",
+     "ELMI": "electronic money institution",
+     "EMIR": "European Market Infrastructure Regulation",
+     "EMMS": "Euro Money Market Survey",
+     "EMS": "European Monetary System",
+     "EMU": "Economic and Monetary Union",
+     "EONIA": "euro overnight index average",
+     "ERM II": "exchange rate mechanism II",
+     "ERPB": "Euro Retail Payments Board",
+     "ESA": "European Supervisory Authority",
+     "ESA 2010": "European System of Accounts 2010",
+     "ESA 95": "European System of Accounts 1995",
+     "ESCB": "European System of Central Banks",
+     "ESCG": "European Systemic Cyber Group",
+     "ESFS": "European System of Financial Supervision",
+     "ESM": "European Stability Mechanism",
+     "ESMA": "European Securities and Markets Authority",
+     "ESRB": "European Systemic Risk Board",
+     "ETF": "exchange-traded fund",
+     "EUCLID": "European centralised infrastructure for supervisory data",
+     "EURIBOR": "euro interbank offered rate",
+     "€STR": "euro short-term rate",
+     "EVE": "economic value of equity",
+     "f.o.b.": "Free on board at the exporter’s border",
+     "FINREP": "financial reporting",
+     "FMI": "financial market infrastructure",
+     "FOLTF": "failing or likely to fail",
+     "FOMC": "Federal Open Market Committee",
+     "FRA": "forward rate agreement",
+     "FSB": "Financial Stability Board",
+     "FSR": "Financial Stability Review",
+     "FTS": "funds transfer system",
+     "FVA": "fair value accounting",
+     "FVC": "financial vehicle corporation",
+     "FX": "foreign exchange",
+     "G-SIB": "global systemically important bank",
+     "G-SII": "global systemically important institution",
+     "GAAP": "generally accepted accounting principles",
+     "GDP": "gross domestic product",
+     "HICP": "Harmonised Index of Consumer Prices",
+     "HLEG": "High-Level Expert Group on Sustainable Finance",
+     "HoM": "Head of Mission",
+     "HQLA": "high-quality liquid asset",
+     "i.i.p.": "international investment position",
+     "IAIG": "internationally active insurance group",
+     "IAIS": "International Association of Insurance Supervisors",
+     "IAS": "International Accounting Standards",
+     "IBAN": "International Bank Account Number",
+     "IC": "internal capital",
+     "ICAAP": "Internal Capital Adequacy Assessment Process",
+     "ICMA": "International Capital Market Association",
+     "ICPFs": "insurance corporations and pension funds",
+     "ICR": "interest coverage ratio",
+     "ICS": "Insurance Capital Standard",
+     "ICSD": "international central securities depository",
+     "IF": "investment fund",
+     "IFRS": "International Financial Reporting Standards",
+     "IFTS": "interbank funds transfer system",
+     "ILAAP": "Internal Liquidity Adequacy Assessment Process",
+     "ILO": "International Labour Organization",
+     "ILS": "inflation-linked swap",
+     "IMAS": "SSM Information Management System",
+     "IMF": "International Monetary Fund",
+     "IMI": "internal model investigation",
+     "IOSCO": "International Organization of Securities Commissions",
+     "IPS": "institutional protection scheme",
+     "IRB": "internal ratings-based",
+     "IRBA": "internal ratings-based approach",
+     "IRR": "internal rate of return",
+     "IRRBB": "interest rate risk in the banking book",
+     "IRT": "Internal Resolution Team",
+     "ISIN": "International Securities Identification Number",
+     "ITS": "Implementing Technical Standards",
+     "JSS": "Joint Supervisory Standards",
+     "JST": "Joint Supervisory Team",
+     "JSTC": "Joint Supervisory Team coordinator",
+     "KRI": "key risk indicator",
+     "LCBG": "large and complex banking group",
+     "LCR": "liquidity coverage ratio",
+     "LGD": "loss-given-default",
+     "LSI": "less significant institution",
+     "LSTI": "loan service-to-income",
+     "LTD": "loan-to-deposit",
+     "LTG": "long-term guarantee",
+     "LTI": "loan-to-income",
+     "LTRO": "longer-term refinancing operation",
+     "LTSF": "loan-to-stable-funding",
+     "LTV": "loan-to-value",
+     "M&A": "mergers and acquisitions",
+     "MDA": "maximum distributable amount",
+     "MFI": "monetary financial institution",
+     "MiFID": "Markets in Financial Instruments Directive",
+     "MiFIR": "Markets in Financial Instruments Regulation",
+     "MIP": "macroeconomic imbalance procedure",
+     "MMF": "money market fund",
+     "MMS": "money market statistics",
+     "MMSR": "money market statistical reporting",
+     "MPC": "Monetary Policy Committee",
+     "MREL": "minimum requirement for own funds and eligible liabilities",
+     "MSC": "merchant service charge",
+     "NAV": "net asset value",
+     "NBNI": "non-bank, non-insurance",
+     "NCA": "national competent authority",
+     "NCB": "national central bank",
+     "NDA": "national designated authority",
+     "NFC": "non-financial corporation",
+     "NFCI": "net fee and commission income",
+     "NII": "net interest income",
+     "NIRP": "negative interest rate policy",
+     "NPE": "non-performing exposure",
+     "NPLs": "non-performing loans",
+     "NRA": "national resolution authority",
+     "NSA": "national supervisory authority",
+     "NSFR": "net stable funding ratio",
+     "O&D": "options and discretions",
+     "O-SII": "other systemically important institution",
+     "OECD": "Organisation for Economic Co-operation and Development",
+     "OFI": "other financial institution",
+     "OIS": "overnight index swap",
+     "OJ": "Official Journal of the European Union",
+     "ORC": "overall recovery capacity",
+     "OSI": "on-site inspection",
+     "OTC": "over-the-counter",
+     "P&L": "profit and loss",
+     "P/E": "price/earnings (ratio)",
+     "P2G": "Pillar 2 guidance",
+     "P2P payment": "peer-to-peer payment",
+     "P2R": "Pillar 2 requirement",
+     "PCE": "personal consumption expenditure",
+     "PD": "probability of default",
+     "PE-ACH": "pan-European automated clearing house",
+     "PIN": "personal identification number",
+     "PPI": "prudential policy index",
+     "PPP": "purchasing power parity",
+     "PQD": "public quantitative disclosure",
+     "PSPP": "public sector purchase programme",
+     "PvP": "payment versus payment",
+     "QE": "quantitative easing",
+     "RAROC": "risk-adjusted return on capital",
+     "RAS": "risk appetite statement",
+     "repo": "repurchase agreement, repurchase operation",
+     "ROA": "return on assets",
+     "ROE": "return on equity",
+     "RORAC": "return on risk-adjusted capital",
+     "RRE": "residential real estate",
+     "RTGS system": "real-time gross settlement system",
+     "RTS": "Regulatory Technical Standards",
+     "RWA": "risk-weighted asset",
+     "S&P": "Standard & Poor’s",
+     "SBBS": "sovereign bond-backed security",
+     "SCR": "Solvency Capital Requirement",
+     "SDR": "special drawing right",
+     "SEP": "Supervisory Examination Programme",
+     "SEPA": "Single Euro Payments Area",
+     "SFT": "securities financing transaction",
+     "SGP": "Stability and Growth Pact",
+     "SI": "significant institution",
+     "SII": "systemically important institution",
+     "SIPS": "systemically important payment system",
+     "SMEs": "small and medium-sized enterprises",
+     "SPV": "special-purpose vehicle",
+     "SQA": "Supervisory Quality Assurance",
+     "SRB": "systemic risk buffer",
+     "SREP": "Supervisory Review and Evaluation Process",
+     "SRF": "Single Resolution Fund",
+     "SRM": "Single Resolution Mechanism",
+     "SRMR": "Single Resolution Mechanism Regulation",
+     "SSG": "SSM Simplification Group",
+     "SSM": "Single Supervisory Mechanism",
+     "SSMR": "Single Supervisory Mechanism Regulation",
+     "SSS": "securities settlement system",
+     "STE": "Short Term Exercise",
+     "STP": "straight-through processing",
+     "T2": "Tier 2",
+     "T2S": "TARGET2-Securities",
+     "TFEU": "Treaty on the Functioning of the European Union",
+     "TIPS": "TARGET instant payment settlement",
+     "TLAC": "total loss-absorbing capacity",
+     "TLTRO": "targeted longer-term refinancing operation",
+     "TREA": "total risk exposure amount",
+     "TRIM": "targeted review of internal models",
+     "TRN": "transaction reference number",
+     "TSCG": "Treaty on Stability, Coordination and Governance in the Economic and Monetary Union",
+     "UL": "Unexpected Loss",
+     "TSCR": "total SREP capital requirement (P1R+P2R)",
+     "UCITS": "undertaking for collective investment in transferable securities",
+     "ULCM": "unit labour costs in the manufacturing sector",
+     "ULCT": "unit labour costs in the total economy",
+     "VaR": "value at risk",
+     "VIX": "Chicago Board Options Exchange’s Volatility Index",
+     "XML": "Extensible Markup Language",
+     "AMS": "Automated Measuring Systems",
+     "AQI": "Air Quality Indices",
+     "AR": "Application Requirements",
+     "AWS": "Alliance for Water Stewardship",
+     "BAT": "Best Available Technique",
+     "BAT-AEL": "Best Available Technique-Associated Emission Level",
+     "BAT-AEPL": "Best Available Technique-Associated Environmental Performance Level",
+     "BREFs": "Best Available Techniques Reference Documents",
+     "Btu": "British Thermal Units",
+     "CapEx": "Capital Expenditure",
+     "CBD": "Convention on Biological Diversity",
+     "CDDA": "Common Database on Designated Areas",
+     "CEN": "European Committee for Standardization",
+     "CENELEC": "European Committee for Electrotechnical Standardization",
+     "CH4": "Methane",
+     "CICES": "Common International Classification of Ecosystem Services",
+     "C02": "Carbon Dioxide",
+     "DEGURBA": "Degree of Urbanisation",
+     "DR BP-1": "Disclosure Requirement - General basis for preparation of the sustainability statements",
+     "DR BP-2": "Disclosure Requirement - Disclosures in relation to specific circumstances",
+     "DR GOV-1": "Disclosure Requirement - The role of the administrative, management and supervisory bodies",
+     "DR GOV-2": "Disclosure Requirement - Information provided to and sustainability matters addressed by the undertaking's administrative, management and supervisory bodies",
+     "DR GOV-3": "Disclosure Requirement - Integration of sustainability-related performance in incentive schemes",
+     "DR GOV-4": "Disclosure Requirement - Statement on sustainability due diligence",
+     "DR GOV-5": "Disclosure Requirement - Risk management and internal controls over sustainability reporting",
+     "DR SBM-1": "Disclosure Requirement - Market position, strategy, business model(s) and value chain",
+     "DR SBM-2": "Disclosure Requirement - Interests and views of stakeholders",
+     "DR SBM-3": "Disclosure Requirement - Material impacts, risks and opportunities and their interaction with strategy and business model(s)",
+     "DR IRO-1": "Disclosure Requirement - Description of the processes to identify and assess material impacts, risks and opportunities",
+     "DR IRO-2": "Disclosure Requirements in ESRS covered by the undertaking's sustainability statements",
+     "DR": "Disclosure Requirements",
+     "EC": "European Commission",
+     "EFRAG": "European Financial Reporting Advisory Group",
+     "EFRAG SRB": "European Financial Reporting Advisory Group Sustainability Reporting Board",
+     "EIA": "Environmental Impact Assessment",
+     "EMAS": "Eco-Management and Audit Scheme",
+     "EPC": "Energy Performance Certificate",
+     "E-PRTR": "European Pollutant Release and Transfer Register",
+     "ESRS": "European Sustainability Reporting Standards",
+     "ESRS 1": "European Sustainability Reporting Standard 1 General requirements",
+     "ESRS 2": "European Sustainability Reporting Standard 2 General disclosures",
+     "ESRS E1": "European Sustainability Reporting Standard E1 Climate change",
+     "ESRS E2": "European Sustainability Reporting Standard E2 Pollution",
+     "ESRS E3": "European Sustainability Reporting Standard E3 Water and marine resources",
+     "ESRS E4": "European Sustainability Reporting Standard E4 Biodiversity and ecosystems",
+     "ESRS E5": "European Sustainability Reporting Standard E5 Resource use and circular economy",
+     "ESRS G1": "European Sustainability Reporting Standard G1 Business conduct",
+     "ESRS S1": "European Sustainability Reporting Standard S1 Own workforce",
+     "ESRS S2": "European Sustainability Reporting Standard S2 Workers in the value chain",
+     "ESRS S3": "European Sustainability Reporting Standard S3 Affected communities",
+     "ESRS S4": "European Sustainability Reporting Standard S4 Consumers & end-users",
+     "EU": "European Union",
+     "EU ETS": "European Union Emissions Trading System",
+     "EWC": "European Works Council",
+     "FPIC": "Free, Prior and Informed Consent",
+     "FTE": "Full-time equivalent",
+     "GHG": "Greenhouse Gas",
+     "GJ": "Giga-Joules",
+     "GRI": "Global Reporting Initiative",
+     "GWP": "Global Warming Potential",
+     "HFCs": "Hydrofluorocarbons",
+     "IED": "Directive 2010/75/EU of the European Parliament and of the Council (Industrial Emissions Directive)",
+     "IFC": "International Finance Corporation",
+     "IPBES": "Intergovernmental Science-Policy Platform on Biodiversity and Ecosystem Services",
+     "IPCC": "Intergovernmental Panel on Climate Change",
+     "ISEAL": "International Social and Environmental Accreditation and Labelling Alliance",
+     "ISO": "International Organization for Standardization",
+     "ISSB": "International Sustainability Standards Board",
+     "IUCN": "International Union for Conservation of Nature",
+     "KBA": "Key Biodiversity Areas",
+     "Kg": "Kilogram",
+     "lb": "Pounds",
+     "LEAP": "Locate Evaluate Assess Prepare",
+     "LGBTQI": "Lesbian, Gay, Bisexual, Transgender, Queer, Intersex",
+     "MDR": "Minimum Disclosure Requirement",
+     "MWh": "Mega-Watt-hours",
+     "N2O": "Nitrous Oxide",
+     "NACE": "Statistical Classification of Economic Activities in the European Community",
+     "NF3": "Nitrogen trifluoride",
+     "NGOs": "Non-Governmental Organisations",
+     "NH3": "Ammonia",
+     "NOX": "Nitrogen oxides",
+     "NUTS": "Nomenclature of Territorial Units of Statistics",
+     "ODS": "Ozone-depleting substance",
+     "OECM": "One Earth Climate Model",
+     "OpEX": "Operating Expenditure",
+     "PBTS": "Persistent, bioaccumulative and toxic substances",
+     "PCAF": "Partnership for Carbon Accounting Financials",
+     "PCFs": "Perfluorocarbons",
+     "PM": "Particulate Matter",
+     "PMTs": "Persistent, Mobile and Toxic Substances",
+     "POPs": "Persistent organic pollutants",
+     "REACH": "Registration, Evaluation, Authorisation and Restriction of Chemicals",
+     "SBTi": "Science Based Targets Initiative",
+     "SBTN": "Science Based Targets Network",
+     "SCE": "Societas Cooperativa Europaea",
+     "SDA": "Sectoral Decarbonisation Approach",
+     "SDGs": "Sustainable Development Goals",
+     "SDPI": "Sustainable Development Performance Indicator",
+     "SE": "Societas Europaea",
+     "SEEA": "System of Environmental-Economic Accounting",
+     "SEEA EA": "System of Environmental-Economic Accounting Ecosystem Accounting",
+     "SFDR": "Regulation (EU) 2019/2088 of the European Parliament and of the Council (Sustainable Finance Disclosures Regulation)",
+     "SOX": "Sulphur oxides",
+     "SVHC": "Substances of Very High Concern",
+     "TCFD": "Task Force on Climate-Related Financial Disclosures",
+     "TNFD": "Taskforce on Nature-related Financial Disclosures",
+     "UN": "United Nations",
+     "UNEP": "United Nations Environment Programme",
+     "UNESCO": "United Nations Educational, Scientific and Cultural Organization",
+     "vPvBs": "Very persistent and very bioaccumulative substances",
+     "vPvMs": "Very persistent and very mobile substances",
+     "WDPA": "World Database of Protected Areas",
+     "WRI": "World Resources Institute",
+     "WWF": "World-Wide Fund for Nature"
+ }
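This glossary is consumed by the `parse_glossary` helper in `utils.py`, which appends each acronym's definition to the user query before reformulation so the retriever sees the expanded term. A minimal sketch of that expansion, using a trimmed inline subset of the glossary instead of reading `glossary.json` from disk:

```python
# A trimmed, inlined subset of glossary.json so the sketch is self-contained.
glossary = {
    "LCR": "liquidity coverage ratio",
    "NPLs": "non-performing loans",
}


def expand_acronyms(query: str, glossary: dict) -> str:
    """Append the glossary definition after each acronym in the query,
    matched case-insensitively on whole words, as utils.parse_glossary does."""
    lowered = {key.lower(): value for key, value in glossary.items()}
    words = query.split(" ")
    for i, word in enumerate(words):
        if word.lower() in lowered:
            words[i] = f"{word} ({lowered[word.lower()]})"
    return " ".join(words)


expanded = expand_acronyms("What is the LCR requirement?", glossary)
print(expanded)  # What is the LCR (liquidity coverage ratio) requirement?
```

Note that, like the original helper, this only matches acronyms that stand alone as whitespace-separated words; "LCR?" with trailing punctuation attached would not be expanded.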
requirements.txt ADDED
@@ -0,0 +1,12 @@
+ altair==4.2.2
+ datasets==2.12.0
+ faiss-cpu==1.7.4
+ gradio==3.39.0
+ gradio_client==0.3.0
+ openai==0.27.0
+ PyMuPDF==1.22.3
+ python-dotenv==1.0.0
+ sentence-transformers==2.2.2
+ torch==2.0.1
+ matplotlib==3.7.1
+ tiktoken==0.4.0
text_embedder.py ADDED
@@ -0,0 +1,132 @@
+ from abc import ABC, abstractmethod
+
+ import pandas as pd
+ import torch
+ from datasets import load_from_disk
+ from sentence_transformers import SentenceTransformer
+
+ # from finbert_embedding.embedding import FinbertEmbedding
+
+
+ class TextEmbedder(ABC):
+     def __init__(self, model_name, paragraphs_path, device, load_existing_index=False):
+         """Initialize an instance of the TextEmbedder class.
+         Args:
+             model_name (str): The name of the SentenceTransformer model to be used for embeddings.
+             paragraphs_path (str): The path to the dataset of paragraphs to be embedded.
+             device (str): The target device to run the model ('cpu' or 'cuda').
+             load_existing_index (bool): If True, load an existing Faiss index, if available.
+         Returns:
+             None
+         """
+         self.dataset = load_from_disk(paragraphs_path)
+         self.model = self._load_model(model_name, device)
+
+         assert len(self.dataset) > 0, "The loaded dataset is empty!"
+
+         if load_existing_index:
+             self.dataset.load_faiss_index(
+                 "embeddings", f"{paragraphs_path}/index.faiss"
+             )
+
+     # Generate embeddings for each paragraph
+     def generate_paragraphs_embedding(self):
+         """Generate embeddings for paragraphs in the dataset.
+         This function computes embeddings for each paragraph's content in the dataset and adds
+         the embeddings as a new column named "embeddings" to the dataset.
+         Args:
+             None
+         Returns:
+             None
+         """
+         self.dataset = self.dataset.map(
+             lambda x: {"embeddings": self._generate_embeddings(x["content"])}
+         )
+
+     # Save embeddings
+     def save_embeddings(self, output_path):
+         """Save the Faiss embeddings index to a specified output path.
+         Args:
+             output_path (str): The path to save the Faiss embeddings index.
+         Returns:
+             None
+         """
+         self.dataset.add_faiss_index(column="embeddings")
+         self.dataset.save_faiss_index("embeddings", f"{output_path}/index.faiss")
+
+     # Allows the search
+     def retrieve_faiss(self, query: str, k_total: int, threshold: float):
+         """Retrieve passages using Faiss similarity search.
+         Args:
+             query (str): The query for which similar passages are to be retrieved.
+             k_total (int): The total number of passages to retrieve.
+             threshold (float): The minimum similarity score threshold for passages to be considered.
+         Returns:
+             Tuple[List[Dict[str, Any]], np.ndarray]:
+                 A tuple containing:
+                 - List of dictionaries, each representing a passage with 'content' (str) and 'meta' (dict) fields.
+                 - Numpy array of similarity scores for the retrieved passages.
+         """
+         question_embedding = self._generate_embeddings(query)
+         scores, samples = self.dataset.get_nearest_examples(
+             "embeddings", question_embedding, k=k_total
+         )
+         passages_df = pd.DataFrame(samples)
+         passages_df["scores"] = scores / 100
+         passages_df = passages_df[passages_df["scores"] > threshold]
+         passages_df = passages_df.sort_values(by=["scores"], ascending=False)
+
+         if len(passages_df) == 0:
+             return [], []
+
+         contents = passages_df["content"].tolist()
+         meta = passages_df.drop(columns=["content"]).to_dict(orient="records")
+         passages = []
+         for i in range(len(contents)):
+             passages.append({"content": contents[i], "meta": meta[i]})
+         return passages, passages_df["scores"].values
+
+     def retrieve_elastic(self, query: str, k_total: int, threshold: float):
+         raise NotImplementedError
+
+     @abstractmethod
+     def _load_model(self, model_name: str, device: str):
+         pass
+
+     @abstractmethod
+     def _generate_embeddings(self, text: str):
+         pass
+
+
+ class SentenceTransformersTextEmbedder(TextEmbedder):
+     def _load_model(self, model_name: str, device: str):
+         """Load a SentenceTransformer model onto the specified device.
+         Args:
+             model_name (str): The name of the SentenceTransformer model to be loaded.
+             device (str): The target device to move the model to ('cpu' or 'cuda').
+         Returns:
+             SentenceTransformer: The loaded SentenceTransformer model placed on the specified device.
+         """
+         model = SentenceTransformer(model_name)
+         torch_device = torch.device(device)
+         model.to(torch_device)
+         return model
+
+     def _generate_embeddings(self, text: str):
+         """Generate embeddings for a given text using the loaded model.
+         Args:
+             text (str): The input text for which embeddings are to be generated.
+         Returns:
+             np.ndarray: An array representing the embeddings of the input text.
+         """
+         return self.model.encode(text)
+
+
+ # class FinBertTextEmbedder(TextEmbedder):
+ #     def _load_model(self, model_name: str, device: str):
+ #         model = FinbertEmbedding(device=device)
+ #         return model
+
+ #     def _generate_embeddings(self, text: str):
+ #         output = self.model.sentence_vector(text)
+ #         return output.cpu().numpy()
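The heart of `retrieve_faiss` is not the FAISS call itself but the post-processing: the raw scores are rescaled by 100, passages below the threshold are dropped, and the survivors are sorted by descending score. That step can be sketched without FAISS or pandas (the sample passages and scores below are made up for illustration):

```python
def filter_passages(samples, scores, threshold):
    """Mimic retrieve_faiss post-processing: rescale raw FAISS scores by 100,
    keep passages strictly above the threshold, sort by descending score."""
    scaled = [score / 100 for score in scores]
    kept = [
        (passage, score)
        for passage, score in zip(samples, scaled)
        if score > threshold
    ]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    if not kept:
        return [], []
    passages = [{"content": p["content"], "meta": p["meta"]} for p, _ in kept]
    return passages, [score for _, score in kept]


# Hypothetical retrieval results: two passages with raw FAISS scores.
samples = [
    {"content": "The LCR is defined as...", "meta": {"page_number": 4}},
    {"content": "Unrelated boilerplate...", "meta": {"page_number": 9}},
]
passages, kept_scores = filter_passages(samples, [80.0, 40.0], threshold=0.56)
print(len(passages))  # 1 (only the 0.8-scored passage clears the threshold)
```

This is why the `chat` docstring in `utils.py` warns against raising the threshold past ~0.568: with scores rescaled into roughly the 0-1 range, a slightly higher cutoff can discard every passage and leave the model with no sources.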
utils.py ADDED
@@ -0,0 +1,381 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import json
2
+ from collections import defaultdict
3
+ import openai
4
+ import re
5
+ from config import CFG_APP
6
+ from text_embedder import SentenceTransformersTextEmbedder
7
+ from datetime import datetime
8
+ import tiktoken
9
+
10
+ doc_metadata = json.load(open(CFG_APP.DOC_METADATA_PATH, "r"))
11
+ # Embedding Model
12
+ if "sentence-transformers" in CFG_APP.EMBEDDING_MODEL:
13
+ text_embedder = SentenceTransformersTextEmbedder(
14
+ model_name=CFG_APP.EMBEDDING_MODEL,
15
+ paragraphs_path=CFG_APP.DATA_FOLDER,
16
+ device=CFG_APP.DEVICE,
17
+ load_existing_index=True,
18
+ )
19
+ else:
20
+ raise ValueError("Embedding model not found !")
21
+
22
+
23
+ # Util Functions
24
+ def retrieve_doc_metadata(doc_metadata, doc_id):
25
+ for meta in doc_metadata:
26
+ if meta["id"] == doc_id:
27
+ return meta
28
+
29
+
30
+ def get_reformulation_prompt(query: str) -> list:
31
+ return [
32
+ {
33
+ "role": "user",
34
+ "content": f"""{CFG_APP.REFORMULATION_PROMPT}
35
+ ---
36
+ query: {query}
37
+ standalone question: """,
38
+ }
39
+ ]
40
+
41
+ def get_hyde_prompt(query: str) -> list:
42
+ return [
43
+ {
44
+ "role": "user",
45
+ "content": f"""{CFG_APP.HYDE_PROMPT}
46
+ ---
47
+ query: {query}
48
+ output: """,
49
+ }
50
+ ]
51
+
52
+
53
+ def make_pairs(lst):
54
+ """From a list of even lenght, make tupple pairs
55
+ Args:
56
+ lst (list): a list of even lenght
57
+ Returns:
58
+ list: the list as tupple pairs
59
+ """
60
+ assert not (l := len(lst) % 2), f"your list is of lenght {l} which is not even"
61
+ return [(lst[i], lst[i + 1]) for i in range(0, len(lst), 2)]
62
+
63
+
64
+ def make_html_source(paragraph, meta_doc, i):
65
+ content = paragraph["content"]
66
+ meta_paragraph = paragraph["meta"]
67
+ return f"""
68
+ <div class="card" id="document-{i}">
69
+ <div class="card-content">
70
+ <h2>Excerpts {i} - Document {meta_doc['num_doc']} - Page {meta_paragraph['page_number']}</h2>
71
+ <p>{content}</p>
72
+ </div>
73
+ <div class="card-footer">
74
+ <span>{meta_doc['short_name']}</span>
75
+ <a href="{meta_doc['url']}#page={meta_paragraph['page_number']}" target="_blank" class="pdf-link">
76
+ <span role="img" aria-label="Open PDF">πŸ”—</span>
77
+ </a>
78
+ </div>
79
+ </div>
80
+ """
81
+
82
+ def make_citations_source(citation_dic, query, Hyde: False):
83
+ citation_list = [f'Doc {values[0]} - {keys} (excerpts {values[1]})' for keys, values in citation_dic.items()]
84
+
85
+ html_output = '<div class="source">\n'
86
+ html_output += ' <div class="title">Sources</div>\n'
87
+ if Hyde :
88
+ html_output += f' <div>Query used for retrieval (with the HyDE technique after no response): {query}</div>\n'
89
+ else :
90
+ html_output += f' <div>Query used for retrieval: {query}</div>\n'
91
+ html_output += ' <br>\n'
92
+ html_output += ' <ul>\n'
93
+
94
+ for row in citation_list :
95
+ html_output += f'<li>{row}</li>'
96
+
97
+ html_output += ' </ul>\n'
98
+ html_output += '</div>\n'
99
+
100
+ return html_output
101
+
102
+
103
+ def preprocess_message(text: str, docs_url: dict) -> str:
104
+ return re.sub(
105
+ r"\[doc (\d+)\]",
106
+ lambda match: f'<a href="{docs_url[match.group(1)]}" target="_blank" class="pdf-link">{match.group(0)}</a>',
107
+ text,
108
+ )
109
+
110
+
111
+ def parse_glossary(query):
112
+ file = "glossary.json"
113
+ glossary = json.load(open(file, "r"))
114
+ words_query = query.split(" ")
115
+ for i, word in enumerate(words_query):
116
+ for key in glossary.keys():
117
+ if word.lower() == key.lower():
118
+ words_query[i] = words_query[i] + f" ({glossary[key]})"
119
+ return " ".join(words_query)
120
+
121
+
122
+ def num_tokens_from_string(string: str, encoding_name: str) -> int:
123
+ encoding = tiktoken.encoding_for_model(encoding_name)
124
+ num_tokens = len(encoding.encode(string))
125
+ return num_tokens
126
+
127
+
128
+ def chat(
129
+ query: str,
130
+ history: list,
131
+ threshold: float = CFG_APP.THRESHOLD,
132
+ k_total: int = CFG_APP.K_TOTAL,
133
+ ) -> tuple:
134
+ """retrieve relevant documents in the document store then query gpt-turbo
135
+ Args:
136
+ query (str): user message.
137
+ history (list, optional): history of the conversation. Defaults to [system_template].
138
+ report_type (str, optional): should be "All available" or "IPCC only". Defaults to "All available".
139
+ threshold (float, optional): similarity threshold, don't increase more than 0.568. Defaults to 0.56.
140
+ Yields:
141
+ tuple: chat gradio format, chat openai format, sources used.
142
+ """
143
+
144
+     reformulated_query = openai.ChatCompletion.create(
+         model=CFG_APP.MODEL_NAME,
+         messages=get_reformulation_prompt(parse_glossary(query)),
+         temperature=0,
+         max_tokens=CFG_APP.MAX_TOKENS_REF_QUESTION,
+     )
+     reformulated_query = reformulated_query["choices"][0]["message"]["content"]
+
+     # The reformulation prompt returns the query and, optionally, its language on a second line.
+     if len(reformulated_query.split("\n")) == 2:
+         reformulated_query, language = reformulated_query.split("\n")
+         language = language.split(":")[1].strip()
+     else:
+         reformulated_query = reformulated_query.split("\n")[0]
+         language = "English"
+
+     sources, scores = text_embedder.retrieve_faiss(
+         reformulated_query,
+         k_total=k_total,
+         threshold=threshold,
+     )
+
+     if CFG_APP.DEBUG:
+         print("Scores : \n", scores)
+
+     messages = history + [{"role": "user", "content": query}]
+     docs_url = defaultdict(str)
+
+     if len(sources) > 0:
+         docs_string = []
+         docs_html = []
+         citations = {}
+
+         # Pack as many retrieved passages as fit in the API token budget.
+         num_tokens = num_tokens_from_string(CFG_APP.SOURCES_PROMPT, CFG_APP.MODEL_NAME)
+         num_doc = 1
+
+         for i, data in enumerate(sources, 1):
+             meta_doc = retrieve_doc_metadata(doc_metadata, data["meta"]["document_id"])
+             doc_content = f"📃 Doc {i}: \n{data['content']}"
+             num_tokens_doc = num_tokens_from_string(doc_content, CFG_APP.MODEL_NAME)
+             if num_tokens + num_tokens_doc > CFG_APP.MAX_TOKENS_API:
+                 break
+             num_tokens += num_tokens_doc
+             docs_string.append(doc_content)
+
+             # Group passages coming from the same source document under one citation.
+             if meta_doc["short_name"] in citations:
+                 citations[meta_doc["short_name"]][1] += f", {i}"
+             else:
+                 citations[meta_doc["short_name"]] = [num_doc, f"{i}"]
+                 num_doc += 1
+
+             meta_doc["num_doc"] = citations[meta_doc["short_name"]][0]
+             docs_html.append(make_html_source(data, meta_doc, i))
+
+             url_doc = f'<a href="{meta_doc["url"]}#page={data["meta"]["page_number"]}" target="_blank" class="pdf-link">'
+             docs_url[i] = url_doc
+
+         html_cit = [make_citations_source(citations, reformulated_query, Hyde=False)]
+         docs_string = "\n\n".join([f"Query used for retrieval:\n{reformulated_query}"] + docs_string)
+         docs_html = "\n\n".join(html_cit + docs_html)
+
+         messages.append(
+             {
+                 "role": "system",
+                 "content": f"{CFG_APP.SOURCES_PROMPT}\n\n{docs_string}\n\nAnswer in {language}:",
+             }
+         )
+
+         if CFG_APP.DEBUG:
+             print(f" 👨‍💻 question asked by the user : {query}")
+             print(f" 🕛 time : {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
+             print(" 🔌 messages sent to the API :")
+             api_messages = [
+                 {"role": "system", "content": CFG_APP.INIT_PROMPT},
+                 {"role": "user", "content": reformulated_query},
+                 {
+                     "role": "system",
+                     "content": f"{CFG_APP.SOURCES_PROMPT}\n\n{docs_string}\n\nAnswer in {language}:",
+                 },
+             ]
+             for message in api_messages:
+                 print(f"length : {len(message['content'])}, content : {message['content']}")
+
+         response = openai.ChatCompletion.create(
+             model=CFG_APP.MODEL_NAME,
+             messages=[
+                 {"role": "system", "content": CFG_APP.INIT_PROMPT},
+                 {"role": "user", "content": reformulated_query},
+                 {
+                     "role": "system",
+                     "content": f"{CFG_APP.SOURCES_PROMPT}\n\nVery important : Answer in {language}.\n\n{docs_string}:",
+                 },
+             ],
+             temperature=0,  # deterministic
+             stream=True,
+             max_tokens=CFG_APP.MAX_TOKENS_ANSWER,
+         )
+
+         # Stream the answer chunk by chunk, updating the last assistant message in place.
+         complete_response = ""
+         messages.pop()
+         messages.append({"role": "assistant", "content": complete_response})
+         for chunk in response:
+             chunk_message = chunk["choices"][0]["delta"].get("content")
+             if chunk_message:
+                 complete_response += chunk_message
+                 complete_response = preprocess_message(complete_response, docs_url)
+                 messages[-1]["content"] = complete_response
+                 gradio_format = make_pairs([a["content"] for a in messages[1:]])
+                 yield gradio_format, messages, docs_html
+
+     else:
+         # No passages above the threshold: retry retrieval using a HyDE-style
+         # hypothetical answer as the query.
+         reformulated_query = openai.ChatCompletion.create(
+             model=CFG_APP.MODEL_NAME,
+             messages=get_hyde_prompt(parse_glossary(query)),
+             temperature=0,
+             max_tokens=CFG_APP.MAX_TOKENS_REF_QUESTION,
+         )
+         reformulated_query = reformulated_query["choices"][0]["message"]["content"]
+
+         if len(reformulated_query.split("\n")) == 2:
+             reformulated_query, language = reformulated_query.split("\n")
+             language = language.split(":")[1].strip()
+         else:
+             reformulated_query = reformulated_query.split("\n")[0]
+             language = "English"
+
+         sources, scores = text_embedder.retrieve_faiss(
+             reformulated_query,
+             k_total=k_total,
+             threshold=threshold,
+         )
+
+         if CFG_APP.DEBUG:
+             print("Scores : \n", scores)
+
+         if len(sources) > 0:
+             docs_string = []
+             docs_html = []
+             citations = {}
+
+             # Pack as many retrieved passages as fit in the API token budget.
+             num_tokens = num_tokens_from_string(CFG_APP.SOURCES_PROMPT, CFG_APP.MODEL_NAME)
+             num_doc = 1
+
+             for i, data in enumerate(sources, 1):
+                 meta_doc = retrieve_doc_metadata(doc_metadata, data["meta"]["document_id"])
+                 doc_content = f"📃 Doc {i}: \n{data['content']}"
+                 num_tokens_doc = num_tokens_from_string(doc_content, CFG_APP.MODEL_NAME)
+                 if num_tokens + num_tokens_doc > CFG_APP.MAX_TOKENS_API:
+                     break
+                 num_tokens += num_tokens_doc
+                 docs_string.append(doc_content)
+
+                 # Group passages coming from the same source document under one citation.
+                 if meta_doc["short_name"] in citations:
+                     citations[meta_doc["short_name"]][1] += f", {i}"
+                 else:
+                     citations[meta_doc["short_name"]] = [num_doc, f"{i}"]
+                     num_doc += 1
+
+                 meta_doc["num_doc"] = citations[meta_doc["short_name"]][0]
+                 docs_html.append(make_html_source(data, meta_doc, i))
+
+                 url_doc = f'<a href="{meta_doc["url"]}#page={data["meta"]["page_number"]}" target="_blank" class="pdf-link">'
+                 docs_url[i] = url_doc
+
+             html_cit = [make_citations_source(citations, reformulated_query, Hyde=True)]
+             docs_string = "\n\n".join([f"Query used for retrieval:\n{reformulated_query}"] + docs_string)
+             docs_html = "\n\n".join(html_cit + docs_html)
+
+             messages.append(
+                 {
+                     "role": "system",
+                     "content": f"{CFG_APP.SOURCES_PROMPT}\n\n{docs_string}\n\nAnswer in {language}:",
+                 }
+             )
+
+             if CFG_APP.DEBUG:
+                 print(f" 👨‍💻 question asked by the user : {query}")
+                 print(f" 🕛 time : {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
+                 print(" 🔌 messages sent to the API :")
+                 api_messages = [
+                     {"role": "system", "content": CFG_APP.INIT_PROMPT},
+                     {"role": "user", "content": reformulated_query},
+                     {
+                         "role": "system",
+                         "content": f"{CFG_APP.SOURCES_PROMPT}\n\nVery important : Answer in {language}.\n\n{docs_string}:",
+                     },
+                 ]
+                 for message in api_messages:
+                     print(f"length : {len(message['content'])}, content : {message['content']}")
+
+             response = openai.ChatCompletion.create(
+                 model=CFG_APP.MODEL_NAME,
+                 messages=[
+                     {"role": "system", "content": CFG_APP.INIT_PROMPT},
+                     {"role": "user", "content": reformulated_query},
+                     {
+                         "role": "system",
+                         "content": f"{CFG_APP.SOURCES_PROMPT}\n\nVery important : Answer in {language}.\n\n{docs_string}:",
+                     },
+                 ],
+                 temperature=0,  # deterministic
+                 stream=True,
+                 max_tokens=CFG_APP.MAX_TOKENS_ANSWER,
+             )
+
+             # Stream the answer chunk by chunk, updating the last assistant message in place.
+             complete_response = ""
+             messages.pop()
+             messages.append({"role": "assistant", "content": complete_response})
+             for chunk in response:
+                 chunk_message = chunk["choices"][0]["delta"].get("content")
+                 if chunk_message:
+                     complete_response += chunk_message
+                     complete_response = preprocess_message(complete_response, docs_url)
+                     messages[-1]["content"] = complete_response
+                     gradio_format = make_pairs([a["content"] for a in messages[1:]])
+                     yield gradio_format, messages, docs_html
+
+         else:
+             docs_string = "⚠️ No relevant passages found in this report"
+             complete_response = "**⚠️ No relevant passages found in this report, you may want to ask a more specific question.**"
+             messages.append({"role": "assistant", "content": complete_response})
+             gradio_format = make_pairs([a["content"] for a in messages[1:]])
+             yield gradio_format, messages, docs_string
+
+
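A note on `make_pairs`, which is called in the streaming loops above but not part of this diff: it converts the flat list of OpenAI-format message contents into gradio's chatbot format. The sketch below is an assumption about what such a helper likely does (gradio's `Chatbot` component expects a list of `(user_message, assistant_message)` pairs), not the repository's actual implementation:

```python
from typing import List, Tuple


def make_pairs(contents: List[str]) -> List[Tuple[str, ...]]:
    # Hypothetical sketch: group the flat list of message contents
    # two by two into (user, assistant) tuples for gradio's Chatbot.
    # An odd-length list (answer still streaming) yields a final
    # one-element tuple.
    return [tuple(contents[i:i + 2]) for i in range(0, len(contents), 2)]


print(make_pairs(["What is the CSRD?", "The CSRD is..."]))
# prints [('What is the CSRD?', 'The CSRD is...')]
```

This is why the code yields `make_pairs([a["content"] for a in messages[1:]])`: the system prompt at index 0 is dropped, leaving an alternating user/assistant sequence.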