MachineLearningReply/q-and-a-tool-custom-logo

hkoppen committed on Jun 25, 2024

Commit d84e800 (verified) · 1 Parent(s): 7ca3dfa

Delete NLP_QA_Tool

Files changed (23)

NLP_QA_Tool/.DS_Store +0 -0
NLP_QA_Tool/.github/workflows/main.yml +0 -19
NLP_QA_Tool/.gitignore +0 -47
NLP_QA_Tool/.streamlit/config.toml +0 -6
NLP_QA_Tool/.vscode/settings.json +0 -11
NLP_QA_Tool/Dockerfile +0 -29
NLP_QA_Tool/README.md +0 -108
NLP_QA_Tool/__pycache__/document_qa_engine.cpython-310.pyc +0 -0
NLP_QA_Tool/__pycache__/utils.cpython-310.pyc +0 -0
NLP_QA_Tool/app.py +0 -241
NLP_QA_Tool/authenticator_config.yaml +0 -15
NLP_QA_Tool/document_qa_engine.py +0 -141
NLP_QA_Tool/requirements.txt +0 -18
NLP_QA_Tool/resources/ml_logo.png +0 -0
NLP_QA_Tool/resources/puma.png +0 -0
NLP_QA_Tool/utils.py +0 -56
NLP_QA_Tool/utils/__pycache__/config.cpython-38.pyc +0 -0
NLP_QA_Tool/utils/__pycache__/haystack.cpython-38.pyc +0 -0
NLP_QA_Tool/utils/__pycache__/ui.cpython-38.pyc +0 -0
NLP_QA_Tool/utils/check_pydantic_version.py +0 -26
NLP_QA_Tool/utils/config.py +0 -43
NLP_QA_Tool/utils/haystack.py +0 -124
NLP_QA_Tool/utils/ui.py +0 -16

NLP_QA_Tool/.DS_Store DELETED

Binary file (10.2 kB)

NLP_QA_Tool/.github/workflows/main.yml DELETED

@@ -1,19 +0,0 @@
-name: Sync to Hugging Face hub
-on:
-  push:
-    branches: [puma_demo]
-  # to run this workflow manually from the Actions tab
-  workflow_dispatch:
-jobs:
-  sync-to-hub:
-    runs-on: ubuntu-latest
-    steps:
-      - uses: actions/checkout@v3
-        with:
-          fetch-depth: 0
-          lfs: true
-      - name: Push to hub
-        env:
-          HF_TOKEN: ${{ secrets.HF_TOKEN }}
-        run: git push https://hkoppen:$HF_TOKEN@huggingface.co/spaces/MachineLearningReply/q-and-a-tool-custom-logo puma_demo

NLP_QA_Tool/.gitignore DELETED

@@ -1,47 +0,0 @@
-# See https://help.github.com/articles/ignoring-files/ for more about ignoring files.
-# dependencies
-node_modules
-.pnp
-.pnp.js
-# testing
-coverage
-# next.js
-.next/
-out/
-build
-# misc
-.DS_Store
-*.pem
-# debug
-npm-debug.log*
-yarn-debug.log*
-yarn-error.log*
-.pnpm-debug.log*
-# local env files
-.env.local
-.env.development.local
-.env.test.local
-.env.production.local
-# turbo
-.turbo
-.contentlayer
-.env
-.vercel
-.vscode
-# JetBrains
-.idea
-# VSCode
-__pycache__/*
-# datasets directory is used for local development
-/datasets/

NLP_QA_Tool/.streamlit/config.toml DELETED

@@ -1,6 +0,0 @@
-[theme]
-primaryColor = "#E694FF"
-backgroundColor = "#FFFFFF"
-secondaryBackgroundColor = "#F0F0F0"
-textColor = "#262730"
-font = "sans serif"

NLP_QA_Tool/.vscode/settings.json DELETED

@@ -1,11 +0,0 @@
-{
-    "python.languageServer": "Pylance",
-    "python.analysis.typeCheckingMode": "basic",
-    "typescript.tsserver.maxTsServerMemory": 3072,
-    "typescript.tsserver.watchOptions": {
-        "watchFile": "dynamicPriorityPolling"
-    },
-    "javascript.suggest.includeAutomaticOptionalChainCompletions": false,
-    "debug.saveBeforeStart": "none",
-    "c3.welcome.showFeatureHighlight": false
-}

NLP_QA_Tool/Dockerfile DELETED

@@ -1,29 +0,0 @@
-FROM python:3.10-slim
-WORKDIR /app
-RUN apt-get update && apt-get install -y \
-    build-essential \
-    curl \
-    software-properties-common \
-    git \
-    && rm -rf /var/lib/apt/lists/*
-COPY requirements.txt .
-RUN pip3 install -r requirements.txt
-COPY . .
-# extract version
-COPY .git ./.git
-RUN git rev-parse --short HEAD > revision.txt
-RUN rm -rf ./.git
-EXPOSE 8501
-HEALTHCHECK CMD curl --fail http://localhost:8501/_stcore/health
-ENV PYTHONPATH="${PYTHONPATH}:."
-ENTRYPOINT ["streamlit", "run", "app.py", "--server.port=8501", "--server.address=0.0.0.0"]

NLP_QA_Tool/README.md DELETED

@@ -1,108 +0,0 @@
----
-title: NLP Q&A Tool
-emoji: 👑
-colorFrom: indigo
-colorTo: indigo
-sdk: streamlit
-sdk_version: 1.32.2
-app_file: app.py
-pinned: false
----
-# Document Insights - Extractive & Generative Methods using Haystack
-This template [Streamlit](https://docs.streamlit.io/) app is set up for
-simple [Haystack search applications](https://docs.haystack.deepset.ai/docs/semantic_search). The template is ready to
-do QA with **Retrieval Augmented Generation** or **Extractive QA**.
-Below you will also find instructions on how you
-could [push this to Hugging Face Spaces 🤗](#pushing-to-hugging-face-spaces-).
-## Installation and Running
-### Local development
-To run the bare application which does _nothing_:
-1. Install requirements: `pip install -r requirements.txt`
-2. Run the streamlit app: `streamlit run app.py`
-This will start up the app on `localhost:8501` where you will find a simple search bar. Before you start editing, you'll
-notice that the app will only show you instructions on what to edit.
-### Docker
-To run the app in a Docker container:
-1. Build the Docker image: `docker build -t haystack-streamlit .`
-2. Run the Docker container: `docker run -p 8501:8501 haystack-streamlit` (make sure to bind any other ports you need)
-3. Open your browser and go to `http://localhost:8501`
-### Repo structure
-- `./utils`: This is where we have 3 files:
-    - `config.py`: This file extracts all of the configuration settings from a `.env` file. For some config settings, it
-      uses default values. An example of this is
-      in [this demo project](https://github.com/TuanaCelik/should-i-follow/blob/main/utils/config.py).
-    - `haystack.py`: Here you will find some functions already set up for you to start creating your Haystack search
-      pipeline. It includes 2 main functions called `start_haystack()` which is what we use to create a pipeline and
-      cache it, and `query()` which is the function called by `app.py` once a user query is received.
-    - `ui.py`: Use this file for any UI and initial value setups.
-- `app.py`: This is the main Streamlit application file that we will run. In its current state it has a simple search
-  bar, a 'Run' button, and a response that you can highlight answers with.
-- `requirements.txt`: This file includes the required libraries to run the Streamlit app.
-- `document_qa_engine.py`: This file includes the QA pipeline with Haystack.
-### What to edit?
-There are default pipelines in both `start_haystack_extractive()` and `start_haystack_rag()`:
-- Change the pipelines to use the embedding models, extractive or generative models as you need.
-- If using the `rag` task, change the `default_prompt_template` to use one of our available ones
-  on [PromptHub](https://prompthub.deepset.ai) or create your own `PromptTemplate`.
-### Using local LLM models
-To use the `local LLM` mode you can use [LM Studio](https://lmstudio.ai/) or [Ollama](https://ollama.com/).
-For more info on how to run the app with a local LLM model please refer to the documentation of the tool you are using.
-The `local_llm` mode expects an API available at `http://localhost:1234/v1`.
-## Pushing to Hugging Face Spaces 🤗
-Below is an example GitHub action that will let you push your Streamlit app straight to the Hugging Face Hub as a Space.
-A few things to pay attention to:
-1. Create a New Space on Hugging Face with the Streamlit SDK.
-2. Create a Hugging Face token on your HF account.
-3. Create a secret on your GitHub repo called `HF_TOKEN` and put your Hugging Face token here.
-4. If you're using DocumentStores or APIs that require some keys/tokens, make sure these are provided as a secret for
-   your HF Space too!
-5. This readme is set up to tell HF Spaces that it's using Streamlit and that the app is running on `app.py`. Make any
-   changes to the frontmatter of this readme to display the title, emoji, etc. you desire.
-6. Create a file in `.github/workflows/hf_sync.yml`. Here's an example that you can change with your own information,
-   and an [example workflow](https://github.com/TuanaCelik/should-i-follow/blob/main/.github/workflows/hf_sync.yml)
-   working for the [Should I Follow demo](https://huggingface.co/spaces/deepset/should-i-follow).
-```yaml
-name: Sync to Hugging Face hub
-on:
-  push:
-    branches: [ main ]
-  # to run this workflow manually from the Actions tab
-  workflow_dispatch:
-jobs:
-  sync-to-hub:
-    runs-on: ubuntu-latest
-    steps:
-      - uses: actions/checkout@v2
-        with:
-          fetch-depth: 0
-          lfs: true
-      - name: Push to hub
-        env:
-          HF_TOKEN: ${{ secrets.HF_TOKEN }}
-        run: git push --force https://{YOUR_HF_USERNAME}:$HF_TOKEN@{YOUR_HF_SPACE_REPO} main
-```

NLP_QA_Tool/__pycache__/document_qa_engine.cpython-310.pyc DELETED

Binary file (5.11 kB)

NLP_QA_Tool/__pycache__/utils.cpython-310.pyc DELETED

Binary file (2.61 kB)

NLP_QA_Tool/app.py DELETED

@@ -1,241 +0,0 @@
-from dotenv import load_dotenv
-import pandas as pd
-import streamlit as st
-import streamlit_authenticator as stauth
-from streamlit_modal import Modal
-from utils import new_file, clear_memory, append_documentation_to_sidebar, load_authenticator_config, init_qa, \
-    append_header
-from haystack.document_stores.in_memory import InMemoryDocumentStore
-from haystack import Document
-load_dotenv()
-OPENAI_MODELS = ['gpt-3.5-turbo',
-                 "gpt-4",
-                 "gpt-4-1106-preview"]
-OPEN_MODELS = [
-    'mistralai/Mistral-7B-Instruct-v0.1',
-    'HuggingFaceH4/zephyr-7b-beta'
-]
-def reset_chat_memory():
-    st.button(
-        'Reset chat memory',
-        key="reset-memory-button",
-        on_click=clear_memory,
-        help="Clear the conversational memory. Currently implemented to retain the 4 most recent messages.",
-        disabled=False)
-def manage_files(modal, document_store):
-    open_modal = st.sidebar.button("Manage Files", use_container_width=True)
-    if open_modal:
-        modal.open()
-    if modal.is_open():
-        with modal.container():
-            uploaded_file = st.file_uploader(
-                "Upload a CV in PDF format",
-                type=("pdf",),
-                on_change=new_file,  # pass the callback itself; new_file() would run it immediately and pass None
-                disabled=st.session_state['document_qa_model'] is None,
-                label_visibility="collapsed",
-                help="The document is used to answer your questions. The system will process the document and store it in a RAG to answer your questions.",
-            )
-            edited_df = st.data_editor(use_container_width=True, data=st.session_state['files'],
-                                       num_rows='dynamic',
-                                       column_order=['name', 'size', 'is_active'],
-                                       column_config={'name': {'editable': False}, 'size': {'editable': False},
-                                                      'is_active': {'editable': True, 'type': 'checkbox',
-                                                                    'width': 100}}
-                                       )
-            st.session_state['files'] = pd.DataFrame(columns=['name', 'content', 'size', 'is_active'])
-            if uploaded_file:
-                st.session_state['file_uploaded'] = True
-                st.session_state['files'] = pd.concat([st.session_state['files'], edited_df])
-                with st.spinner('Processing the CV content...'):
-                    store_file_in_table(document_store, uploaded_file)
-                    ingest_document(uploaded_file)
-def ingest_document(uploaded_file):
-    if not st.session_state['document_qa_model']:
-        st.warning('Please select a model to start asking questions')
-    else:
-        try:
-            st.session_state['document_qa_model'].ingest_pdf(uploaded_file)
-            st.success('Document processed successfully')
-        except Exception as e:
-            st.error(f"Error processing the document: {e}")
-            st.session_state['file_uploaded'] = False
-def store_file_in_table(document_store, uploaded_file):
-    pdf_content = uploaded_file.getvalue()
-    st.session_state['pdf_content'] = pdf_content
-    st.session_state.messages = []
-    document = Document(content=pdf_content, meta={"name": uploaded_file.name})
-    df = pd.DataFrame(st.session_state['files'])
-    df['is_active'] = False
-    st.session_state['files'] = pd.concat([df, pd.DataFrame(
-        [{"name": uploaded_file.name, "content": pdf_content, "size": len(pdf_content),
-          "is_active": True}])])
-    document_store.write_documents([document])
-def init_session_state():
-    st.session_state.setdefault('files', pd.DataFrame(columns=['name', 'content', 'size', 'is_active']))
-    st.session_state.setdefault('models', [])
-    st.session_state.setdefault('api_keys', {})
-    st.session_state.setdefault('current_selected_model', 'gpt-3.5-turbo')
-    st.session_state.setdefault('current_api_key', '')
-    st.session_state.setdefault('messages', [])
-    st.session_state.setdefault('pdf_content', None)
-    st.session_state.setdefault('memory', None)
-    st.session_state.setdefault('pdf', None)
-    st.session_state.setdefault('document_qa_model', None)
-    st.session_state.setdefault('file_uploaded', False)
-def set_page_config():
-    st.set_page_config(
-        page_title="CV Insights AI Assistant",
-        page_icon=":shark:",
-        initial_sidebar_state="expanded",
-        layout="wide",
-        menu_items={
-            'Get Help': 'https://www.extremelycoolapp.com/help',
-            'Report a bug': "https://www.extremelycoolapp.com/bug",
-            'About': "# This is a header. This is an *extremely* cool app!"
-        }
-    )
-def update_running_model(api_key, model):
-    st.session_state['api_keys'][model] = api_key
-    st.session_state['document_qa_model'] = init_qa(model, api_key)
-def init_api_key_dict():
-    st.session_state['models'] = OPENAI_MODELS + list(OPEN_MODELS) + ['local LLM']
-    for model_name in OPENAI_MODELS:
-        st.session_state['api_keys'][model_name] = None
-def display_chat_messages(chat_box, chat_input):
-    with chat_box:
-        if chat_input:
-            for message in st.session_state.messages:
-                with st.chat_message(message["role"]):
-                    st.markdown(message["content"], unsafe_allow_html=True)
-            st.chat_message("user").markdown(chat_input)
-            with st.chat_message("assistant"):
-                # process user input and generate response
-                response = st.session_state['document_qa_model'].inference(chat_input, st.session_state.messages)
-                st.markdown(response)
-                st.session_state.messages.append({"role": "user", "content": chat_input})
-                st.session_state.messages.append({"role": "assistant", "content": response})
-def setup_model_selection():
-    model = st.selectbox(
-        "Model:",
-        options=st.session_state['models'],
-        index=0,  # default to the first model in the list gpt-3.5-turbo
-        placeholder="Select model",
-        help="Select an LLM:"
-    )
-    if model:
-        if model != st.session_state['current_selected_model']:
-            st.session_state['current_selected_model'] = model
-            if model == 'local LLM':
-                st.session_state['document_qa_model'] = init_qa(model)
-    api_key = st.sidebar.text_input("Enter LLM-authorization Key:", type="password",
-                                    disabled=st.session_state['current_selected_model'] == 'local LLM')
-    if api_key and api_key != st.session_state['current_api_key']:
-        update_running_model(api_key, model)
-        st.session_state['current_api_key'] = api_key
-    return model
-def setup_task_selection(model):
-    # enable extractive and generative tasks if we're using a local LLM or an OpenAI model with an API key
-    if model == 'local LLM' or st.session_state['api_keys'].get(model):
-        task_options = ['Extractive', 'Generative']
-    else:
-        task_options = ['Extractive']
-    task_selection = st.sidebar.radio('Select the task:', task_options)
-    # TODO: Add the task selection logic here (initializing the model based on the task)
-def setup_page_body():
-    chat_box = st.container(height=350, border=False)
-    chat_input = st.chat_input(
-        placeholder="Upload a document to start asking questions...",
-        disabled=not st.session_state['file_uploaded'],
-    )
-    if st.session_state['file_uploaded']:
-        display_chat_messages(chat_box, chat_input)
-class StreamlitApp:
-    def __init__(self):
-        self.authenticator_config = load_authenticator_config()
-        self.document_store = InMemoryDocumentStore()
-        set_page_config()
-        self.authenticator = self.init_authenticator()
-        init_session_state()
-        init_api_key_dict()
-    def init_authenticator(self):
-        return stauth.Authenticate(
-            self.authenticator_config['credentials'],
-            self.authenticator_config['cookie']['name'],
-            self.authenticator_config['cookie']['key'],
-            self.authenticator_config['cookie']['expiry_days']
-        )
-    def setup_sidebar(self):
-        with st.sidebar:
-            st.sidebar.image("resources/puma.png", use_column_width=True)
-            # Sidebar for Task Selection
-            st.sidebar.header('Options:')
-            model = setup_model_selection()
-            setup_task_selection(model)
-            st.divider()
-            self.authenticator.logout()
-            reset_chat_memory()
-            modal = Modal("Manage Files", key="demo-modal")
-            manage_files(modal, self.document_store)
-            st.divider()
-            append_documentation_to_sidebar()
-    def run(self):
-        name, authentication_status, username = self.authenticator.login()
-        if authentication_status:
-            self.run_authenticated_app()
-        elif st.session_state["authentication_status"] is False:
-            st.error('Username/password is incorrect')
-        elif st.session_state["authentication_status"] is None:
-            st.warning('Please enter your username and password')
-    def run_authenticated_app(self):
-        self.setup_sidebar()
-        append_header()
-        setup_page_body()
-app = StreamlitApp()
-app.run()
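
The help text of the "Reset chat memory" button says the conversation retains the 4 most recent messages, but no trimming appears in the deleted `app.py`. A minimal sketch of such trimming, assuming the `{"role": ..., "content": ...}` dicts stored in `st.session_state.messages`; `trim_history` and its `keep` parameter are hypothetical names, not part of the deleted code:

```python
from typing import Dict, List

Message = Dict[str, str]


def trim_history(messages: List[Message], keep: int = 4) -> List[Message]:
    """Return the `keep` most recent messages, oldest first (hypothetical helper)."""
    if keep <= 0:
        return []
    # Slicing returns a new list, so the caller's history is left untouched.
    return messages[-keep:]


history = [{"role": "user", "content": f"question {i}"} for i in range(6)]
recent = trim_history(history)
print([m["content"] for m in recent])
```

Such a helper could be applied to the history just before it is handed to the QA engine, so that the prompt stays bounded regardless of how long the chat runs.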

NLP_QA_Tool/authenticator_config.yaml DELETED

@@ -1,15 +0,0 @@
-credentials:
-  usernames:
-    mlreply:
-      email: [email protected]
-      failed_login_attempts: 0 # Will be managed automatically
-      logged_in: False # Will be managed automatically
-      name: ML Reply
-      password: mlreply # Will be hashed automatically
-cookie:
-  expiry_days: 1
-  key: some_signature_key # Must be string
-  name: some_cookie_name
-#pre-authorized:
-#  emails:
-#    - [email protected]

NLP_QA_Tool/document_qa_engine.py DELETED

@@ -1,141 +0,0 @@
-from typing import List
-from haystack.dataclasses import ChatMessage
-from pypdf import PdfReader
-from haystack.utils import Secret
-from haystack import Pipeline, Document, component
-from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter
-from haystack.components.writers import DocumentWriter
-from haystack.components.embedders import SentenceTransformersDocumentEmbedder, SentenceTransformersTextEmbedder
-from haystack.document_stores.in_memory import InMemoryDocumentStore
-from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
-from haystack.components.builders import DynamicChatPromptBuilder
-from haystack.components.generators.chat import OpenAIChatGenerator, HuggingFaceTGIChatGenerator
-from haystack.document_stores.types import DuplicatePolicy
-SENTENCE_RETREIVER_MODEL = "sentence-transformers/all-MiniLM-L6-v2"
-MAX_TOKENS = 500
-template = """
-As a professional HR recruiter given the following information, answer the question shortly and concisely in 1 or 2 sentences.
-Context:
-{% for document in documents %}
-    {{ document.content }}
-{% endfor %}
-Question: {{question}}
-Answer:
-"""
-@component
-class UploadedFileConverter:
-    """
-    A component to convert uploaded PDF files to Documents
-    """
-    @component.output_types(documents=List[Document])
-    def run(self, uploaded_file):
-        pdf = PdfReader(uploaded_file)
-        documents = []
-        # uploaded file name without its extension; "_" and the page number are appended per page
-        name = uploaded_file.name.rsplit('.', 1)[0]
-        for page in pdf.pages:
-            documents.append(
-                Document(
-                    content=page.extract_text(),
-                    meta={'name': name + f"_{page.page_number}"}))
-        return {"documents": documents}
-def create_ingestion_pipeline(document_store):
-    doc_embedder = SentenceTransformersDocumentEmbedder(model=SENTENCE_RETREIVER_MODEL)
-    doc_embedder.warm_up()
-    pipeline = Pipeline()
-    pipeline.add_component("converter", UploadedFileConverter())
-    pipeline.add_component("cleaner", DocumentCleaner())
-    pipeline.add_component("splitter",
-                           DocumentSplitter(split_by="passage", split_length=100, split_overlap=10))
-    pipeline.add_component("embedder", doc_embedder)
-    pipeline.add_component("writer",
-                           DocumentWriter(document_store=document_store, policy=DuplicatePolicy.OVERWRITE))
-    pipeline.connect("converter", "cleaner")
-    pipeline.connect("cleaner", "splitter")
-    pipeline.connect("splitter", "embedder")
-    pipeline.connect("embedder", "writer")
-    return pipeline
-def create_inference_pipeline(document_store, model_name, api_key):
-    if model_name == "local LLM":
-        generator = OpenAIChatGenerator(api_key=Secret.from_token("<local LLM doesn't need an API key>"),
-                                        model=model_name,
-                                        api_base_url="http://localhost:1234/v1",
-                                        generation_kwargs={"max_tokens": MAX_TOKENS}
-                                        )
-    elif "gpt" in model_name:
-        generator = OpenAIChatGenerator(api_key=Secret.from_token(api_key), model=model_name,
-                                        generation_kwargs={"max_tokens": MAX_TOKENS, "stream": False}
-                                        )
-    else:
-        generator = HuggingFaceTGIChatGenerator(token=Secret.from_token(api_key), model=model_name,
-                                                generation_kwargs={"max_new_tokens": MAX_TOKENS}
-                                                )
-    pipeline = Pipeline()
-    pipeline.add_component("text_embedder",
-                           SentenceTransformersTextEmbedder(model=SENTENCE_RETREIVER_MODEL))
-    pipeline.add_component("retriever", InMemoryEmbeddingRetriever(document_store, top_k=3))
-    pipeline.add_component("prompt_builder",
-                           DynamicChatPromptBuilder(runtime_variables=["query", "documents"]))
-    pipeline.add_component("llm", generator)
-    pipeline.connect("text_embedder.embedding", "retriever.query_embedding")
-    pipeline.connect("retriever.documents", "prompt_builder.documents")
-    pipeline.connect("prompt_builder.prompt", "llm.messages")
-    return pipeline
-class DocumentQAEngine:
-    def __init__(self,
-                 model_name,
-                 api_key=None
-                 ):
-        self.api_key = api_key
-        self.model_name = model_name
-        document_store = InMemoryDocumentStore()
-        self.chunks = []
-        self.inference_pipeline = create_inference_pipeline(document_store, model_name, api_key)
-        self.pdf_ingestion_pipeline = create_ingestion_pipeline(document_store)
-    def ingest_pdf(self, uploaded_file):
-        self.pdf_ingestion_pipeline.run({"converter": {"uploaded_file": uploaded_file}})
-    def inference(self, query, input_messages: List[dict]):
-        system_message = ChatMessage.from_system(
-            "You are a professional HR recruiter who answers questions based on the content of the uploaded CV in 1 or 2 sentences.")
-        messages = [system_message]
-        # replay the chat history with each message under its original role
-        for message in input_messages:
-            if message["role"] == "user":
-                messages.append(ChatMessage.from_user(message["content"]))
-            else:
-                messages.append(
-                    ChatMessage.from_assistant(message["content"]))
-        messages.append(ChatMessage.from_user("""
-        Relevant information from the uploaded CV:
-            {% for doc in documents %}
-                {{ doc.content }}
-            {% endfor %}
-            \nQuestion: {{query}}
-            \nAnswer:
-        """))
-        res = self.inference_pipeline.run(data={"text_embedder": {"text": query},
-                                                "prompt_builder": {"prompt_source": messages,
-                                                                   "query": query
-                                                                   }})
-        return res["llm"]["replies"][0].content
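
Deriving the per-page document name in `UploadedFileConverter` calls for removing a file extension, which `str.rstrip` does not do: it strips any trailing characters from the given set. A minimal sketch of the difference, using only the standard library; `page_document_name` is a hypothetical helper introduced for illustration:

```python
from pathlib import PurePath


def page_document_name(file_name: str, page_number: int) -> str:
    """Build '<stem>_<page>' from an uploaded file name (hypothetical helper)."""
    # PurePath.stem drops only the final extension, regardless of its case.
    stem = PurePath(file_name).stem
    return f"{stem}_{page_number}"


# str.rstrip treats its argument as a set of characters, not as a suffix.
stripped = "APP.PDF".rstrip(".PDF")
print(stripped)                          # A
print(page_document_name("APP.PDF", 0))  # APP_0
print(page_document_name("cv.pdf", 2))   # cv_2
```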

NLP_QA_Tool/requirements.txt DELETED

@@ -1,18 +0,0 @@
-# Streamlit
-streamlit~=1.32.2
-streamlit-modal==0.1.2
-streamlit-authenticator==0.3.2
-streamlit-pdf-viewer==0.0.9
-# LLM
-haystack-ai~=2.0.0
-sentence_transformers~=2.6.0
-# Utils
-pandas~=2.2.1
-pypdf~=4.2.0
-pytest~=8.1.1
-python-dotenv~=1.0.1
-# Dev Utils
-watchdog

NLP_QA_Tool/resources/ml_logo.png DELETED

Binary file (28.7 kB)

NLP_QA_Tool/resources/puma.png DELETED

Binary file (18 kB)

NLP_QA_Tool/utils.py DELETED

@@ -1,56 +0,0 @@
-from document_qa_engine import DocumentQAEngine
-import streamlit as st
-import logging
-from yaml import load, SafeLoader, YAMLError
-def load_authenticator_config(file_path='authenticator_config.yaml'):
-    try:
-        with open(file_path, 'r') as file:
-            authenticator_config = load(file, Loader=SafeLoader)
-            return authenticator_config
-    except FileNotFoundError:
-        logging.error(f"File {file_path} not found.")
-    except YAMLError as error:
-        logging.error(f"Error parsing YAML file: {error}")
-def new_file():
-    st.session_state['loaded_embeddings'] = None
-    st.session_state['doc_id'] = None
-    st.session_state['uploaded'] = True
-    clear_memory()
-def clear_memory():
-    if st.session_state['memory']:
-        st.session_state['memory'].clear()
-def init_qa(model, api_key=None):
-    print(f"Initializing QA with model: {model}")  # never log the API key
-    return DocumentQAEngine(model, api_key=api_key)
-def append_header():
-    st.header('📄 Document Insights :rainbow[AI] Assistant 📚', divider='rainbow')
-    st.text("📥 Upload documents in PDF format. Get insights.. ask questions..")
-def append_documentation_to_sidebar():
-    with st.expander("Disclaimer"):
-        st.markdown(
-            """
-            :warning: Do not upload sensitive data. We **temporarily** store text from the uploaded PDF documents solely
-            for the purpose of processing your request, and we **do not assume responsibility** for any subsequent use
-            or handling of the data submitted to third-party LLMs.
-            """)
-    with st.expander("Documentation"):
-        st.markdown(
-            """
-            Upload a CV as a PDF document. Once the spinner stops, you can proceed to ask your questions. The answers will
-            be displayed in the right column. The system will answer your questions using the content of the document
-            and mark references over the PDF viewer.
-            """)
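
Since `init_qa` receives the user's API key, any diagnostic output around it should avoid exposing the key in full. A minimal sketch of masking a secret before logging; `mask_secret` and its `visible` parameter are hypothetical and not part of the deleted `utils.py`:

```python
from typing import Optional


def mask_secret(secret: Optional[str], visible: int = 4) -> str:
    """Return a log-safe rendering of a secret (hypothetical helper)."""
    if not secret:
        return "<none>"
    if len(secret) <= visible:
        # Too short to reveal anything safely: hide every character.
        return "*" * len(secret)
    # Keep only the last `visible` characters so keys can still be told apart.
    return "*" * (len(secret) - visible) + secret[-visible:]


print(mask_secret("sk-abcdef1234"))  # *********1234
print(mask_secret(None))             # <none>
```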

NLP_QA_Tool/utils/__pycache__/config.cpython-38.pyc DELETED

Binary file (1.47 kB)

NLP_QA_Tool/utils/__pycache__/haystack.cpython-38.pyc DELETED

Binary file (3.59 kB)

NLP_QA_Tool/utils/__pycache__/ui.cpython-38.pyc DELETED

Binary file (733 Bytes)

NLP_QA_Tool/utils/check_pydantic_version.py DELETED

@@ -1,26 +0,0 @@
-import pydantic
-import os
-import fileinput
-def replace_string_in_files(folder_path, old_str, new_str):
-    for subdir, dirs, files in os.walk(folder_path):
-        for file in files:
-            file_path = os.path.join(subdir, file)
-            # Check if the file is a text file (you can modify this condition based on your needs)
-            if file.endswith(".txt") or file.endswith(".py"):
-                # Open the file in place for editing
-                with fileinput.FileInput(file_path, inplace=True) as f:
-                    for line in f:
-                        # Replace the old string with the new string
-                        print(line.replace(old_str, new_str), end='')
-def use_pydantic_v1():
-    module_file_path = pydantic.__file__
-    module_file_path = module_file_path.split('pydantic')[0] + 'haystack'
-    with open(module_file_path+'/schema.py','r') as f:
-        haystack_schema_file = f.read()
-    if 'from pydantic.v1' not in haystack_schema_file:
-        replace_string_in_files(module_file_path, 'from pydantic', 'from pydantic.v1')
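
`use_pydantic_v1()` guards the plain string replacement by checking a single file, because replacing `from pydantic` twice would yield `from pydantic.v1.v1`. A minimal sketch of a rewrite that is safe to repeat, assuming the same goal of redirecting imports to `pydantic.v1`; `to_pydantic_v1` is a hypothetical helper operating on source text only:

```python
import re

# Match "from pydantic" as a whole module name, unless ".v1" already follows.
_PYDANTIC_IMPORT = re.compile(r"\bfrom pydantic\b(?!\.v1\b)")


def to_pydantic_v1(source: str) -> str:
    """Redirect pydantic imports to pydantic.v1; running it again changes nothing."""
    return _PYDANTIC_IMPORT.sub("from pydantic.v1", source)


once = to_pydantic_v1("from pydantic import BaseModel")
twice = to_pydantic_v1(once)
print(once)   # from pydantic.v1 import BaseModel
print(twice)  # from pydantic.v1 import BaseModel
```

The word boundary after `pydantic` also leaves unrelated packages such as `pydantic_core` untouched, which a bare substring replacement would not.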

NLP_QA_Tool/utils/config.py DELETED

@@ -1,43 +0,0 @@
-import argparse
-import os
-from dotenv import load_dotenv
-load_dotenv()
-parser = argparse.ArgumentParser(description='This app lists animals')
-document_store_choices = ('inmemory', 'weaviate', 'milvus', 'opensearch')
-parser.add_argument('--store', choices=document_store_choices, default='inmemory', help='DocumentStore selection (default: %(default)s)')
-parser.add_argument('--name', default="Document Insights: Extractive & Generative Methods")
-model_configs = {
-    'EMBEDDING_MODEL': os.getenv("EMBEDDING_MODEL", "sentence-transformers/all-MiniLM-L12-v2"),
-    'GENERATIVE_MODEL': os.getenv("GENERATIVE_MODEL", "gpt-4"),
-    #'EXTRACTIVE_MODEL': os.getenv("EXTRACTIVE_MODEL", "deepset/roberta-base-squad2"),
-    'EXTRACTIVE_MODEL': os.getenv("EXTRACTIVE_MODEL", "deepset/gelectra-large-germanquad"),
-    #'EXTRACTIVE_MODEL': os.getenv("EXTRACTIVE_MODEL", "MachineLearningReply/bert-base-german-legal-qa"),
-    'OPENAI_KEY': os.getenv("OPENAI_KEY"),
-    'COHERE_KEY': os.getenv("COHERE_KEY"),
-}
-document_store_configs = {
-# Weaviate Config
-'WEAVIATE_HOST':  os.getenv("WEAVIATE_HOST", "http://localhost"),
-'WEAVIATE_PORT': os.getenv("WEAVIATE_PORT", 8080),
-'WEAVIATE_INDEX': os.getenv("WEAVIATE_INDEX", "Document"),
-'WEAVIATE_EMBEDDING_DIM': os.getenv("WEAVIATE_EMBEDDING_DIM", 768),
-# OpenSearch Config
-'OPENSEARCH_SCHEME': os.getenv("OPENSEARCH_SCHEME",  "https"),
-'OPENSEARCH_USERNAME': os.getenv("OPENSEARCH_USERNAME", "admin"),
-'OPENSEARCH_PASSWORD': os.getenv("OPENSEARCH_PASSWORD", "admin"),
-'OPENSEARCH_HOST': os.getenv("OPENSEARCH_HOST", "localhost"),
-'OPENSEARCH_PORT': os.getenv("OPENSEARCH_PORT", 9200),
-'OPENSEARCH_INDEX':  os.getenv("OPENSEARCH_INDEX", "document"),
-'OPENSEARCH_EMBEDDING_DIM': os.getenv("OPENSEARCH_EMBEDDING_DIM", 768),
-# Milvus Config
-'MILVUS_URI': os.getenv("MILVUS_URI", "http://localhost:19530/default"),
-'MILVUS_INDEX':  os.getenv("MILVUS_INDEX", "document"),
-'MILVUS_EMBEDDING_DIM': os.getenv("MILVUS_EMBEDDING_DIM", 768),
-}
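
The port and embedding-dimension settings above fall back to integer defaults, but `os.getenv` returns a string whenever the variable is actually set, so the resulting type depends on the environment. A minimal sketch of a typed lookup, assuming integer values are wanted in both cases; `env_int` is a hypothetical helper, not part of the deleted `config.py`:

```python
import os


def env_int(name: str, default: int) -> int:
    """Read an integer setting from the environment (hypothetical helper)."""
    raw = os.getenv(name)
    if raw is None or raw.strip() == "":
        # Unset or empty: use the default as given.
        return default
    try:
        return int(raw)
    except ValueError as exc:
        raise ValueError(f"{name} must be an integer, got {raw!r}") from exc


os.environ["OPENSEARCH_PORT"] = "9201"
port = env_int("OPENSEARCH_PORT", 9200)
print(port, type(port).__name__)  # 9201 int
```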

NLP_QA_Tool/utils/haystack.py DELETED

@@ -1,124 +0,0 @@
-import streamlit as st
-from utils.config import document_store_configs, model_configs
-from haystack import Pipeline
-from haystack.schema import Answer
-from haystack.document_stores import BaseDocumentStore
-from haystack.document_stores import InMemoryDocumentStore, OpenSearchDocumentStore, WeaviateDocumentStore
-from haystack.nodes import EmbeddingRetriever, FARMReader, PromptNode, PreProcessor
-#from haystack.nodes import TextConverter, FileTypeClassifier, PDFToTextConverter
-from milvus_haystack import MilvusDocumentStore
-#Use this file to set up your Haystack pipeline and querying
-@st.cache_resource(show_spinner=False)
-def start_preprocessor_node():
-    print('initializing preprocessor node')
-    processor = PreProcessor(
-        clean_empty_lines= True,
-        clean_whitespace=True,
-        clean_header_footer=True,
-        #remove_substrings=None,
-        split_by="word",
-        split_length=100,
-        split_respect_sentence_boundary=True,
-        #split_overlap=0,
-        #max_chars_check= 10_000
-    )
-    return processor
-    #return docs
-@st.cache_resource(show_spinner=False)
-def start_document_store(type: str):
-    #This function starts the documents store of your choice based on your command line preference
-    print('initializing document store')
-    if type == 'inmemory':
-        document_store = InMemoryDocumentStore(use_bm25=True, embedding_dim=384)
-        '''
-        documents = [
-            {
-                'content': "Pi is a super dog",
-                'meta': {'name': "pi.txt"}
-            },
-            {
-                'content': "The revenue of siemens is 5 milion Euro",
-                'meta': {'name': "siemens.txt"}
-            },
-        ]
-        document_store.write_documents(documents)
-        '''
-    elif type == 'opensearch':
-        document_store = OpenSearchDocumentStore(scheme = document_store_configs['OPENSEARCH_SCHEME'],
-                                                 username = document_store_configs['OPENSEARCH_USERNAME'],
-                                                 password = document_store_configs['OPENSEARCH_PASSWORD'],
-                                                 host = document_store_configs['OPENSEARCH_HOST'],
-                                                 port = document_store_configs['OPENSEARCH_PORT'],
-                                                 index = document_store_configs['OPENSEARCH_INDEX'],
-                                                 embedding_dim = document_store_configs['OPENSEARCH_EMBEDDING_DIM'])
-    elif type == 'weaviate':
-        document_store = WeaviateDocumentStore(host = document_store_configs['WEAVIATE_HOST'],
-                                                port = document_store_configs['WEAVIATE_PORT'],
-                                                index = document_store_configs['WEAVIATE_INDEX'],
-                                                embedding_dim = document_store_configs['WEAVIATE_EMBEDDING_DIM'])
-    elif type == 'milvus':
-        document_store = MilvusDocumentStore(uri = document_store_configs['MILVUS_URI'],
-                                            index = document_store_configs['MILVUS_INDEX'],
-                                            embedding_dim = document_store_configs['MILVUS_EMBEDDING_DIM'],
-                                            return_embedding=True)
-    return document_store
-# cached to make index and models load only at start
-@st.cache_resource(show_spinner=False)
-def start_retriever(_document_store: BaseDocumentStore):
-    print('initializing retriever')
-    retriever = EmbeddingRetriever(document_store=_document_store,
-                                   embedding_model=model_configs['EMBEDDING_MODEL'],
-                                   top_k=5)
-    #
-    #_document_store.update_embeddings(retriever)
-    return retriever
-@st.cache_resource(show_spinner=False)
-def start_reader():
-    print('initializing reader')
-    reader = FARMReader(model_name_or_path=model_configs['EXTRACTIVE_MODEL'])
-    return reader
-# cached to make index and models load only at start
-@st.cache_resource(show_spinner=False)
-def start_haystack_extractive(_document_store: BaseDocumentStore, _retriever: EmbeddingRetriever, _reader: FARMReader):
-    print('initializing pipeline')
-    pipe = Pipeline()
-    pipe.add_node(component=_retriever, name="Retriever", inputs=["Query"])
-    pipe.add_node(component= _reader, name="Reader", inputs=["Retriever"])
-    return pipe
-@st.cache_resource(show_spinner=False)
-def start_haystack_rag(_document_store: BaseDocumentStore, _retriever: EmbeddingRetriever, openai_key):
-    prompt_node = PromptNode(default_prompt_template="deepset/question-answering",
-                             model_name_or_path=model_configs['GENERATIVE_MODEL'],
-                             api_key=openai_key,
-                             max_length=500)
-    pipe = Pipeline()
-    pipe.add_node(component=_retriever, name="Retriever", inputs=["Query"])
-    pipe.add_node(component=prompt_node, name="PromptNode", inputs=["Retriever"])
-    return pipe
-#@st.cache_data(show_spinner=True)
-def query(_pipeline, question):
-    params = {}
-    results = _pipeline.run(question, params=params)
-    return results
-def initialize_pipeline(task, document_store, retriever, reader, openai_key = ""):
-    if task == 'extractive':
-        return start_haystack_extractive(document_store, retriever, reader)
-    elif task == 'rag':
-        return start_haystack_rag(document_store, retriever, openai_key)
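
Note on the dispatch functions in the deleted `utils/haystack.py`: `initialize_pipeline` implicitly returns `None` for any task other than `'extractive'` or `'rag'`. `start_document_store` raises `UnboundLocalError` for an unrecognized type, because `document_store` is never assigned. Below is a minimal sketch of a lookup that fails with an explicit message instead. `build_from_registry` and the string-returning factories are hypothetical stand-ins. In the app, the factories would wrap calls such as `start_haystack_extractive(...)` and `start_haystack_rag(...)`.

```python
from typing import Any, Callable, Dict


def build_from_registry(kind: str, registry: Dict[str, Callable[[], Any]], label: str) -> Any:
    """Call the factory registered under `kind`, failing loudly on unknown keys.

    Hypothetical helper, not part of the deleted module.
    """
    try:
        factory = registry[kind]
    except KeyError:
        known = ", ".join(sorted(registry))
        raise ValueError(f"Unknown {label} {kind!r}; expected one of: {known}") from None
    return factory()


# Stand-in factories for illustration only; they return strings, not pipelines.
pipeline_builders = {
    "extractive": lambda: "extractive-pipeline",
    "rag": lambda: "rag-pipeline",
}

pipeline = build_from_registry("rag", pipeline_builders, "task")
```

The same helper could cover the document store choice (`'inmemory'`, `'opensearch'`, `'weaviate'`, `'milvus'`) by passing a second registry.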

NLP_QA_Tool/utils/ui.py DELETED

@@ -1,16 +0,0 @@
-import streamlit as st
-def set_state_if_absent(key, value):
-    if key not in st.session_state:
-        st.session_state[key] = value
-def set_initial_state():
-    set_state_if_absent("question", "Ask something here?")
-    set_state_if_absent("results_extractive", None)
-    set_state_if_absent("results_generative", None)
-    set_state_if_absent("task", None)
-def reset_results(*args):
-    st.session_state.results_extractive = None
-    st.session_state.results_generative = None
-    st.session_state.task = None
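
Note on the state helpers in the deleted `utils/ui.py`: they set each default only when the key is missing, and resetting clears the result keys without touching the question. Below is a minimal sketch of the same pattern. It uses a plain dict as a stand-in for `st.session_state`, so it runs without Streamlit. The explicit `state` parameter and the `DEFAULTS` and `RESULT_KEYS` names are assumptions made for illustration.

```python
from typing import Any, MutableMapping

# Defaults copied from the deleted set_initial_state().
DEFAULTS = {
    "question": "Ask something here?",
    "results_extractive": None,
    "results_generative": None,
    "task": None,
}
# Keys cleared by the deleted reset_results().
RESULT_KEYS = ("results_extractive", "results_generative", "task")


def set_state_if_absent(state: MutableMapping[str, Any], key: str, value: Any) -> None:
    """Write `value` under `key` only when the key has not been set yet."""
    if key not in state:
        state[key] = value


def set_initial_state(state: MutableMapping[str, Any]) -> None:
    """Fill in every missing default without overwriting existing values."""
    for key, value in DEFAULTS.items():
        set_state_if_absent(state, key, value)


def reset_results(state: MutableMapping[str, Any]) -> None:
    """Clear the result keys and leave the current question in place."""
    for key in RESULT_KEYS:
        state[key] = None
```

Because the defaults are applied only when a key is absent, calling `set_initial_state` again does not overwrite values that are already present.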