Spaces: sami606713/RagChatbot


sami606713 committed on Jun 20

Commit

27a8994


1 Parent(s): bdb9574

Upload 17 files

Files changed (18)

.gitattributes +1 -0
.gitignore +11 -0
README.md +46 -20
agent/__pycache__/agent.cpython-311.pyc +0 -0
agent/agent.py +62 -0
app.py +34 -0
app_debug.log +3 -0
generator.py +8 -0
main.py +89 -0
my_faiss_index/index.faiss +3 -0
my_faiss_index/index.pkl +3 -0
processed_files.txt +18 -0
requirements.txt +22 -3
summerizer/imageSummerizer.py +17 -0
summerizer/textSummerizer.py +26 -0
utils/helper.py +107 -0
vectorStore/__pycache__/vectorStore.cpython-311.pyc +0 -0
vectorStore/vectorStore.py +92 -0

.gitattributes CHANGED

@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+my_faiss_index/index.faiss filter=lfs diff=lfs merge=lfs -text

.gitignore ADDED

	@@ -0,0 +1,11 @@

+# Python-generated files
+__pycache__/
+*.py[oc]
+build/
+dist/
+wheels/
+*.egg-info
+.env
+# Virtual environments
+.venv

README.md CHANGED

@@ -1,20 +1,46 @@
----
-title: RagChatbot
-emoji: 🚀
-colorFrom: red
-colorTo: red
-sdk: docker
-app_port: 8501
-tags:
-- streamlit
-pinned: false
-short_description: This Chat Bot can answer user query based on knowledegbase
-license: mit
----
-# Welcome to Streamlit!
-Edit `/src/streamlit_app.py` to customize this app to your heart's desire. :heart:
-If you have any questions, checkout our [documentation](https://docs.streamlit.io) and [community
-forums](https://discuss.streamlit.io).

+## Backend
+This folder contains the backend services for the Document Chat App.
+### `app.py`
+This file is the main entry point for the Streamlit web application. It handles the user interface, chat history management, and interacts with the `agent` to process user queries and generate responses.
+### `main.py`
+This script is responsible for processing documents. It loads and extracts data (tables, texts, images) from PDF files in the `data` directory, summarizes them using the `summerizer` module, and then chunks and adds the processed documents to the vector store. It keeps track of processed files in `processed_files.txt` to avoid reprocessing.
+### `data/`
+This directory is intended to store the raw PDF documents that need to be processed by the system.
+### `vectorStore/`
+This module contains the vector store code (`vectorStore.py`). It embeds the processed document chunks into a FAISS index saved under `my_faiss_index/`, and the `agent` queries that index to retrieve relevant context during the chat.
+### `agent/`
+This module contains the logic for the conversational agent, which uses the vector store to answer questions based on the processed documents.
+### `summerizer/`
+This module provides functionalities for summarizing different types of content (text, images) extracted from the documents.
+### `utils/`
+This module contains utility functions, such as `helper.py` for loading and extracting data from documents.
+### `tool/`
+This module likely contains tools or functions used by the agent to perform specific tasks.
+### `generator.py`
+This script sends a sample query to the agent through `RunAgent` and prints the response. It is useful for testing the agent from the command line without the Streamlit UI.
+### How to Run
+To run the backend application, you will typically run `app.py` using Streamlit after ensuring all dependencies are installed and documents are processed by `main.py`.
+```bash
+streamlit run app.py
+```
+### Running `main.py` for Document Embedding
+To process and embed documents, run the `main.py` script. This script will load PDF files from the `data` directory, extract and summarize their contents, and then add them to the vector store.
+```bash
+python main.py
+```
+Make sure that the `data` directory contains the PDF files you want to process. The script will log processed files in `processed_files.txt` to avoid reprocessing them.
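
The README describes a log file that prevents reprocessing. A minimal sketch of that bookkeeping, using only the standard library; the helper names `load_processed`, `pending_pdfs`, and `mark_processed` are hypothetical and do not appear in this commit:

```python
from pathlib import Path


def load_processed(log_path):
    """Return the set of file names already recorded in the log."""
    path = Path(log_path)
    if not path.exists():
        return set()
    return set(path.read_text(encoding="utf-8").splitlines())


def pending_pdfs(file_names, processed):
    """Keep only PDF files that are not yet in the processed set."""
    return [
        name for name in file_names
        if name.lower().endswith(".pdf") and name not in processed
    ]


def mark_processed(log_path, file_name):
    """Append one file name to the log so later runs skip it."""
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(file_name + "\n")
```

Writing the name only after a file has been embedded successfully means a failed run leaves the file eligible for the next attempt.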

agent/__pycache__/agent.cpython-311.pyc ADDED

Binary file (3.75 kB).

agent/agent.py ADDED

	@@ -0,0 +1,62 @@

+from agno.agent import Agent, RunResponse
+from agno.models.groq import Groq
+from agno.tools.reasoning import ReasoningTools
+from agno.tools.thinking import ThinkingTools
+from vectorStore.vectorStore import GetContext
+from agno.models.openai import OpenAIChat
+def RunAgent(query):
+    """
+    Run the agent on the user query and return the response text.
+    The agent retrieves context through the GetContext tool, which takes the query string.
+    """
+    try:
+        agent = Agent(
+            tools=[GetContext,
+                   ReasoningTools(add_instructions=True),
+                    ThinkingTools(add_instructions=True)
+                    ],
+            description = "This agent strictly processes user queries using ONLY the provided context. It must not use external knowledge or assumptions beyond the context. "
+            "If the exact answer is not found, it must reason based on the available information to generate a helpful response. "
+            "If reasoning is not possible from the given context, the agent must clearly state that it cannot answer the query and prompt the user to try a related query. "
+            "At no point should the agent fabricate information or rely on knowledge not present in the provided context.",
+            instructions = [
+                """
+                Role:
+                - You are an assistant representing <BOT_NAME>. Your job is to assist users strictly based on the provided context.
+                Core Rules:
+                1. Use ONLY the provided context to generate responses.
+                2. DO NOT use any external knowledge, assumptions, prior training data, or general world knowledge.
+                3. If the context does not provide a clear answer, try to infer a reasonable response *only within* the scope of the context.
+                4. If a reasonable answer cannot be formed from the context, respond exactly with:
+                "Apologies; I am not sure about that. Please head over to <SUPPORT_URL> for some additional help from our team."
+                5. NEVER fabricate, guess, or hallucinate any information not clearly supported by the context.
+                6. Do NOT suggest or imply you have access to any knowledge beyond the context.
+                7. Include the resource/document name at the end of the response.
+                Compliance:
+                - This is a ZERO-TOLERANCE instruction set.
+                - Any use of information outside the provided context is a strict violation.
+                """
+            ],
+            show_tool_calls=True,
+            markdown=True,
+            # The API key is read from the OPENAI_API_KEY environment variable; do not hardcode secrets in source
+            model=OpenAIChat(id="gpt-4o")
+        )
+        # Run Agent
+        response: RunResponse = agent.run(query, stream=False,structured_outputs=True)
+        return response.content
+    except Exception as e:
+        return str(e)
+if __name__ == "__main__":
+    query = "tell me about energy and climate changes?"
+    response = RunAgent(query=query)
+    print(">> Response: ",response)
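
The model needs an OpenAI API key at runtime. A minimal sketch of reading such a key from the environment and failing early when it is missing; `require_env` is a hypothetical helper that is not part of this commit, and the variable name `OPENAI_API_KEY` is the conventional one rather than something this code enforces:

```python
import os


def require_env(name):
    """Return the value of an environment variable or raise a clear error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value
```

A caller would use it as `api_key = require_env("OPENAI_API_KEY")`, with the value supplied through a `.env` file (already listed in `.gitignore`) or the hosting platform's secret settings.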

app.py ADDED

	@@ -0,0 +1,34 @@

+import streamlit as st
+from agent.agent import RunAgent
+# Set Streamlit layout
+st.set_page_config(page_title="Document Chat App", layout="wide")
+st.title("📄 Document Chat App")
+# Initialize session state
+if 'chat_history' not in st.session_state:
+    st.session_state.chat_history = []
+# Display chat history
+for speaker, message in st.session_state.chat_history:
+    with st.chat_message(name=speaker):
+        st.markdown(message)
+# Chat input
+user_input = st.chat_input("Ask something about your document...")
+if user_input:
+    # Show user message
+    with st.chat_message("You"):
+        st.markdown(user_input)
+    # Run agent
+    response = RunAgent(query=user_input)
+    # Show bot response
+    with st.chat_message("Bot"):
+        st.markdown(response)
+    # Save to chat history
+    st.session_state.chat_history.append(("You", user_input))
+    st.session_state.chat_history.append(("Bot", response))
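
The chat history in `app.py` is a list of `(speaker, message)` tuples that is replayed on every rerun. A minimal sketch of that pattern without Streamlit, so it can be exercised in isolation; `record_turn` and its `respond` callback are hypothetical stand-ins for the inline code and for `RunAgent`:

```python
def record_turn(history, user_input, respond):
    """Append the user message and the bot reply to the history list."""
    reply = respond(user_input)
    history.append(("You", user_input))
    history.append(("Bot", reply))
    return reply


def render(history):
    """Return the history as display lines, oldest turn first."""
    return [f"{speaker}: {message}" for speaker, message in history]
```

Keeping the turn logic in a plain function like this makes it possible to test the history handling without starting the web app.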

app_debug.log ADDED

	@@ -0,0 +1,3 @@

+2025-06-16 00:17:48,085 - INFO - Session state initialized
+2025-06-16 00:17:48,085 - INFO - Debug: App started successfully
+2025-06-16 00:17:48,087 - INFO - Debug: Directories created successfully

generator.py ADDED

	@@ -0,0 +1,8 @@

+from agent.agent import RunAgent
+if __name__ == "__main__":
+    query = """
+Explain how the crystalline structure of cellulose impacts its enzymatic hydrolysis. What methods are used to overcome this challenge?
+"""
+    response = RunAgent(query=query)
+    print(">> Response: ",response)

main.py ADDED

	@@ -0,0 +1,89 @@

+import os
+from utils.helper import LoadAndExtractData
+from summerizer.imageSummerizer import Image_Summerizer
+from summerizer.textSummerizer import TextSummerizer
+from langchain_core.documents import Document
+from langchain.text_splitter import RecursiveCharacterTextSplitter
+from vectorStore.vectorStore import add_to_vector_store
+def main():
+    try:
+        root_dir = "data"
+        processed_log_path = "processed_files.txt"
+        # Load already processed file names
+        if os.path.exists(processed_log_path):
+            with open(processed_log_path, 'r') as f:
+                processed_files = set(f.read().splitlines())
+        else:
+            processed_files = set()
+        files = os.listdir(root_dir)
+        print(">> Files: ",files)
+        print(">> Process Files: ",processed_files)
+        print(">> Processing Files ")
+        for file in files:
+            file_path = os.path.join(root_dir, file)
+            # Only process PDF files that are not already listed in the processed log
+            if file not in processed_files and file.lower().endswith('.pdf'):
+                print(f">> Processing: {file}")
+                tables, texts, images = LoadAndExtractData(file_path)
+                print(">> Generating Summaries ")
+                text_summary = TextSummerizer(data=texts)
+                tables_summary = TextSummerizer(data=tables)
+                images_summary = Image_Summerizer(data=images)
+                print("Text Summary: ",text_summary)
+                print("Table Summary: ",tables_summary)
+                print("Image Summary: ",images_summary)
+                print(">> Summary Generated")
+                print(">> Combine Each and every thing into one document")
+                # Create Document objects for text chunks
+                text_docs = [Document(page_content=str(text), metadata={"type": "text", "summary": text_summary[i], "source": file_path, "name": file}) for i, text in enumerate(texts)]
+                # Create Document objects for tables (HTML representation when available, plain text otherwise)
+                table_docs = [Document(page_content=getattr(table.metadata, "text_as_html", None) or str(table), metadata={"type": "table", "summary": tables_summary[i], "source": file_path, "name": file}) for i, table in enumerate(tables)]
+                # Create Document objects for images (page_content is the base64-encoded image)
+                image_docs = [Document(page_content=image, metadata={"type": "image", "summary": images_summary[i], "source": file_path, "name": file}) for i, image in enumerate(images)]
+                # Combine all document types into a single list
+                docs = text_docs + table_docs + image_docs
+                print(">> Splitting Documents")
+                document_splitter = RecursiveCharacterTextSplitter(
+                    chunk_size=1000,  # Example size, adjust based on your needs
+                    chunk_overlap=200,  # Example overlap, adjust based on your needs
+                    length_function=len,
+                    is_separator_regex=False,
+                )
+                # Split the documents
+                docs_chunks = document_splitter.split_documents(docs)
+                print(">> Splitting Done")
+                add_to_vector_store(docs_chunks=docs_chunks)
+                # Append to log file
+                with open(processed_log_path, 'a') as f:
+                    f.write(file + '\n')
+                print(f">> Marked {file} as processed")
+            else:
+                print(f"!! Skipping already processed or unsupported file: {file}")
+    except Exception as e:
+        print("Error is:", str(e))
+        return str(e)
+if __name__ == "__main__":
+    main()
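
`main.py` splits documents with `chunk_size=1000` and `chunk_overlap=200`. The sketch below illustrates what those two numbers mean using a simplified fixed-window splitter. It is not `RecursiveCharacterTextSplitter`, which additionally tries to break on separators such as paragraphs and sentences; `split_with_overlap` is a hypothetical name:

```python
def split_with_overlap(text, chunk_size=1000, chunk_overlap=200):
    """Cut text into windows of chunk_size characters.

    Each window starts chunk_size - chunk_overlap characters after the
    previous one, so neighbouring chunks share chunk_overlap characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks
```

The overlap exists so that a sentence cut at a chunk boundary still appears whole in at least one chunk, at the cost of embedding some text twice.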

my_faiss_index/index.faiss ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:c04a5dd5bbd2a99da72a7ab0084626a7501a920cc1e05034fc3449ecb206330e
+size 56463405

my_faiss_index/index.pkl ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:e088d4825bc2685515fe7ef79aff669292ad3652c9ffbb296f6f7d87827cf00c
+size 10708178

processed_files.txt ADDED

	@@ -0,0 +1,18 @@

+AMMONIA PRODUCTION TECHNOLOGIES CAP 8 STORAGE.pdf
+Ammonias-Double-Edged-Sword-Clean-Energy-or-Catastrophic-Risk.pdf
+Waste to Fuel in NY.pdf
+Charting-a-Greener-Course-Embrace-CCS-on-Maritime-Vessels.pdf
+wind-assisted ship propulsion.pdf
+Clean Energy Market Analysis in the US.pdf
+Clean investment US.pdf
+emission-factors_2014.pdf
+Green Hydrogen the Race to Success-Members.pdf
+GREENHOUSE GAS EMISSIONS FROM BIo ETHANOL AND BIO-DIESEL FUEL SUPPLY.pdf
+Hydrogen Bunkering at Ports by Eliseo Curcio.pdf
+Cheat Sheet Hydrogen (1).pdf
+Comparative Life Cycle Assessment of Bioethanol Production from Different Generations ofBiomass and Waste Feedstocks.pdf
+Comparison of biofuel life-cycle GHG emissions assessment tools.pdf
+Green-Hydrogen-The-Race-to-Success.pdf
+LCFS Vs RFS.pdf
+Post strategy.pdf
+Review of Second Generation Bioethanol Production.pdf

requirements.txt CHANGED

@@ -1,3 +1,22 @@
-altair
-pandas
-streamlit

+unstructured[all-docs]
+pillow
+lxml
+tiktoken
+langchain
+langchain-community
+langchain-openai
+langchain-groq
+python_dotenv
+pymilvus[model]
+pymilvus
+transformers
+torch
+poppler-utils
+tesseract
+pytesseract
+langchain-milvus
+agno
+streamlit
+faiss-cpu

summerizer/imageSummerizer.py ADDED

	@@ -0,0 +1,17 @@

+# Image Summerizer
+from utils.helper import Summarizer
+prompt_template = """Describe the image in detail. For context,
+                  the image is part of a research paper.
+                  Be specific about graphs, such as bar plots."""
+def Image_Summerizer(prompt_template=prompt_template, data=None):
+    try:
+        images_summary = Summarizer(prompt_template=prompt_template, data=data, config=False, set_messages=True)
+        return images_summary
+    except Exception as e:
+        # Report the failure instead of silently returning None
+        print("Image summarization failed:", str(e))
+        return []
+if __name__ == "__main__":
+    pass

summerizer/textSummerizer.py ADDED

	@@ -0,0 +1,26 @@

+# Text summarizer
+from utils.helper import Summarizer
+# Define the prompt
+prompt_text = """
+You are an assistant tasked with summarizing tables and text.
+Give a concise summary of the table or text.
+Respond only with the summary, no additional comment.
+Do not start your message by saying "Here is a summary" or anything like that.
+Just give the summary as it is.
+Table or text chunk: {element}
+"""
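+def TextSummerizer(prompt_template=prompt_text, data=None):
+    try:
+        text_summary = Summarizer(prompt_template=prompt_template, data=data, config=True, set_messages=False)
+        return text_summary
+    except Exception as e:
+        # Report the failure instead of silently returning None
+        print("Text summarization failed:", str(e))
+        return []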
+if __name__ == "__main__":
+    print(TextSummerizer(data="""
+            To download and use Poppler as a Python library (or make it accessible to Python), follow these steps based on your operating system. Poppler is not a Python package—it's a C++ PDF rendering library with command-line tools like pdfinfo, pdftotext, and others, which Python libraries like unstructured or pdf2image call internally.
+                   """))
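
`TextSummerizer` passes `config=True`, which makes `Summarizer` in `utils/helper.py` run its batch with `max_concurrency` set to 3. The sketch below shows the general idea of a bounded-concurrency batch with the standard library. It is an analogy under stated assumptions, not LangChain's implementation, and `batch_with_limit` is a hypothetical name:

```python
from concurrent.futures import ThreadPoolExecutor


def batch_with_limit(func, items, max_concurrency=3):
    """Apply func to every item with at most max_concurrency calls in flight.

    Results are returned in the same order as the inputs.
    """
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        return list(pool.map(func, items))
```

Capping concurrency is a common way to stay under an API provider's rate limits while still summarizing several chunks in parallel.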

utils/helper.py ADDED

	@@ -0,0 +1,107 @@

+from unstructured.partition.pdf import partition_pdf
+from langchain_openai import ChatOpenAI
+from langchain_core.prompts import ChatPromptTemplate
+from langchain_core.output_parsers import StrOutputParser
+from dotenv import load_dotenv
+load_dotenv()
+def get_images_base64(chunks):
+    images_b64 = []
+    for chunk in chunks:
+        if "CompositeElement" in str(type(chunk)):
+            chunk_els = chunk.metadata.orig_elements
+            for el in chunk_els:
+                if "Image" in str(type(el)):
+                    images_b64.append(el.metadata.image_base64)
+    return images_b64
+def LoadAndExtractData(file_path):
+    try:
+        # separate tables from texts
+        tables = []
+        texts = []
+        print(">> Extracting Data")
+        data = partition_pdf(
+            filename=file_path,
+            infer_table_structure=True,            # extract tables
+            # strategy="hi_res",                     # mandatory to infer tables
+            extract_image_block_types=["Image"],   # Add 'Tabl
+            extract_image_block_to_payload=True,   # if true, will extract base64 for API usage
+            chunking_strategy="by_title",          # or 'basic'
+            max_characters=10000,                  # defaults to 500
+            combine_text_under_n_chars=2000,       # defaults to 0
+            new_after_n_chars=6000,
+            # extract_images_in_pdf=True,          # deprecated
+        )
+        # Extract the tables and text
+        print(">> Extracting Text and tables...")
+        for chunk in data:
+            if "Table" in str(type(chunk)):
+                tables.append(chunk)
+            if "CompositeElement" in str(type((chunk))):
+                texts.append(chunk)
+        print(">> Chunks are: ",data)
+        # extract the image
+        print(">> Extracting Images...")
+        images = get_images_base64(data)
+        return tables, texts, images
+    except Exception as e:
+        print("Error is: ",str(e))
+        # Return three empty lists so callers can still unpack and iterate safely
+        return [], [], []
+# Summarizer Function
+def Summarizer(prompt_template, data, config=True, set_messages=False):
+    """
+    This function summarizes documents using a prompt template and the ChatOpenAI model.
+    Args:
+        prompt_template (str): Template string for the prompt.
+        data (List[Dict] or List[str]): Input data to be summarized.
+        config (bool): Whether to run the chain with concurrency limit.
+        set_messages (bool): Whether to set messages as chat messages with an image.
+    Returns:
+        List[str]: List of summaries.
+    """
+    try:
+        # api_key = os.getenv()
+        if set_messages:
+            messages = [
+                (
+                    "user",
+                    [
+                        {"type": "text", "text": prompt_template},
+                        {
+                            "type": "image_url",
+                            "image_url": {"url": "data:image/jpeg;base64,{image}"},
+                        },
+                    ],
+                )
+            ]
+            prompt = ChatPromptTemplate.from_messages(messages)
+            model = ChatOpenAI(temperature=0.5, model="gpt-4o-mini")
+            summarize_chain = {"image": lambda x: x} | prompt | model | StrOutputParser()
+        else:
+            prompt = ChatPromptTemplate.from_template(prompt_template)
+            model = ChatOpenAI(temperature=0.5, model="gpt-4o-mini")
+            summarize_chain = {"element": lambda x: x} | prompt | model | StrOutputParser()
+        if config:
+            return summarize_chain.batch(data, {"max_concurrency": 3})
+        else:
+            return summarize_chain.batch(data)
+    except Exception as e:
+        return str(e)
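
The image branch of `Summarizer` fills a `data:image/jpeg;base64,{image}` URL with an already-encoded string taken from the PDF elements. For readers starting from raw bytes instead, a minimal sketch of building the same kind of data URL; `to_jpeg_data_url` is a hypothetical helper, and it assumes the bytes really are JPEG data:

```python
import base64


def to_jpeg_data_url(image_bytes):
    """Encode raw image bytes as a base64 data URL with a JPEG media type."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:image/jpeg;base64,{encoded}"
```

The function does not inspect the bytes, so passing a PNG would produce a URL with a mismatched media type.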

vectorStore/__pycache__/vectorStore.cpython-311.pyc ADDED

Binary file (4.28 kB).

vectorStore/vectorStore.py ADDED

	@@ -0,0 +1,92 @@

+import os
+from typing import List
+from uuid import uuid4
+import faiss
+from dotenv import load_dotenv
+from langchain_community.docstore.in_memory import InMemoryDocstore
+from langchain_community.vectorstores import FAISS
+from langchain_core.documents import Document
+from langchain_openai import OpenAIEmbeddings
+from tqdm import tqdm
+load_dotenv()
+# OpenAIEmbeddings reads the API key from the OPENAI_API_KEY environment variable
+embeddings = OpenAIEmbeddings()
+# ============= FAISS setup ============#
+def add_to_vector_store(docs_chunks: List[Document],batch_size:int = 64,vector_store_path = "my_faiss_index"):
+    """
+    Embeds document chunks and stores them in a FAISS vector store.
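+    Args:
+        docs_chunks (List[Document]): List of LangChain Document objects.
+        batch_size (int): Number of documents embedded and inserted per batch.
+        vector_store_path (str): Folder where the FAISS index is loaded from and saved to.
+    Returns:
+        dict: Status, the vector store, and the number of documents inserted.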
+    """
+    print(f">> Starting embedding for {len(docs_chunks)} documents...\n")
+    if os.path.exists(vector_store_path):
+        print(">> Loading the index <<")
+        vector_store = FAISS.load_local(vector_store_path, embeddings,allow_dangerous_deserialization=True)
+    else:
+        print(">> Creating the index  <<")
+        # Create an index using the dimensionality of one sample embedding
+        dimension = len(embeddings.embed_query("hello world"))
+        index = faiss.IndexFlatL2(dimension)
+        # Initialize vector store
+        vector_store = FAISS(
+            embedding_function=embeddings,
+            index=index,
+            docstore=InMemoryDocstore(),
+            index_to_docstore_id={},
+        )
+    # Generate unique IDs for documents
+    uuids = [str(uuid4()) for _ in docs_chunks]
+    print(f"\n📦 Preparing to insert {len(docs_chunks)} documents into FAISS...\n")
+    # Loop over documents in batches
+    for i in tqdm(range(0, len(docs_chunks), batch_size), desc="🔍 Embedding & Inserting", unit="batch"):
+        batch_docs = docs_chunks[i:i+batch_size]
+        batch_ids = uuids[i:i+batch_size]
+        vector_store.add_documents(documents=batch_docs, ids=batch_ids)
+    vector_store.save_local(vector_store_path)
+    print("✅ Data insertion successful!\n")
+    return {
+        "status": "success",
+        "vector_store": vector_store,
+        "num_documents": len(docs_chunks)
+    }
+def GetContext(query: str):
+    """
+    Retrieve the document chunks most similar to the query from the FAISS index.
+    Args:
+        query (str): The user query to search for.
+    Returns:
+        dict: The top 2 matching documents under the "Context" key.
+    """
+    vector_store = FAISS.load_local("my_faiss_index", embeddings, allow_dangerous_deserialization=True)
+    results = vector_store.similarity_search(
+        query,
+        k=2,
+        # filter={"source": "tweet"},
+    )
+    return {"Context": results}
+if __name__ == "__main__":
+   pass
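
The index is created as `faiss.IndexFlatL2`, which performs an exact, brute-force nearest-neighbour search by L2 distance, and `GetContext` asks for the `k=2` closest chunks. A pure-Python sketch of that search, for intuition only; `squared_l2` and `top_k_l2` are hypothetical names, and real FAISS operates on NumPy arrays with optimized kernels:

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k_l2(query, vectors, k=2):
    """Return (index, squared distance) pairs for the k nearest vectors."""
    scored = [(i, squared_l2(query, v)) for i, v in enumerate(vectors)]
    # Smaller distance means more similar, so sort ascending
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]
```

Because every stored vector is compared against the query, the cost of a search grows linearly with the number of chunks in the index.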