Thomas (Tom) Gardos committed on
Commit 4bdb9ef
2 Parent(s): 81a188d 8e1bf6f

Merge pull request #90 from DL4DS/dev2main

Files changed (48)
  1. .flake8 +3 -0
  2. .gitattributes +1 -0
  3. .github/workflows/code_quality_check.yml +33 -0
  4. .gitignore +3 -1
  5. Dockerfile +8 -1
  6. README.md +18 -39
  7. code/.chainlit/config.toml +8 -6
  8. code/__init__.py +0 -1
  9. code/app.py +351 -0
  10. code/chainlit.md +1 -6
  11. code/chainlit_base.py +484 -0
  12. code/main.py +212 -67
  13. code/modules/chat/chat_model_loader.py +2 -9
  14. code/modules/chat/helpers.py +6 -4
  15. code/modules/chat/langchain/__init__.py +0 -0
  16. code/modules/chat/langchain/langchain_rag.py +16 -12
  17. code/modules/chat/langchain/utils.py +12 -34
  18. code/modules/chat/llm_tutor.py +10 -7
  19. code/modules/chat_processor/helpers.py +245 -0
  20. code/modules/chat_processor/literal_ai.py +1 -38
  21. code/modules/config/config.yml +3 -3
  22. code/modules/config/constants.py +14 -3
  23. code/modules/config/project_config.yml +7 -0
  24. code/modules/dataloader/data_loader.py +96 -55
  25. code/modules/dataloader/helpers.py +13 -6
  26. code/modules/dataloader/pdf_readers/gpt.py +27 -19
  27. code/modules/dataloader/pdf_readers/llama.py +24 -23
  28. code/modules/dataloader/webpage_crawler.py +5 -3
  29. code/modules/retriever/helpers.py +0 -1
  30. code/modules/vectorstore/colbert.py +3 -2
  31. code/modules/vectorstore/embedding_model_loader.py +1 -7
  32. code/modules/vectorstore/faiss.py +10 -7
  33. code/modules/vectorstore/raptor.py +1 -4
  34. code/modules/vectorstore/store_manager.py +21 -14
  35. code/public/avatars/{ai-tutor.png → ai_tutor.png} +0 -0
  36. code/public/space.jpg +3 -0
  37. code/public/test.css +0 -19
  38. code/templates/cooldown.html +181 -0
  39. code/templates/dashboard.html +145 -0
  40. code/templates/error.html +95 -0
  41. code/templates/error_404.html +80 -0
  42. code/templates/login.html +132 -0
  43. code/templates/logout.html +21 -0
  44. docs/README.md +0 -51
  45. docs/contribute.md +33 -0
  46. docs/setup.md +127 -0
  47. pyproject.toml +2 -0
  48. requirements.txt +12 -1
.flake8 ADDED
@@ -0,0 +1,3 @@
+[flake8]
+max-line-length = 88
+extend-ignore = E203, E266, E501, W503
.gitattributes ADDED
@@ -0,0 +1 @@
+*.jpg filter=lfs diff=lfs merge=lfs -text
.github/workflows/code_quality_check.yml ADDED
@@ -0,0 +1,33 @@
+name: Code Quality and Security Checks
+
+on:
+  push:
+    branches: [ main, dev_branch ]
+  pull_request:
+    branches: [ main, dev_branch ]
+
+jobs:
+  code-quality:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v3
+
+      - name: Set up Python
+        uses: actions/setup-python@v4
+        with:
+          python-version: '3.11'
+
+      - name: Install dependencies
+        run: |
+          python -m pip install --upgrade pip
+          pip install flake8 black bandit
+
+      - name: Run Black
+        run: black --check .
+
+      - name: Run Flake8
+        run: flake8 .
+
+      - name: Run Bandit
+        run: |
+          bandit -r .
.gitignore CHANGED
@@ -165,7 +165,9 @@ cython_debug/
 .ragatouille/*
 */__pycache__/*
 .chainlit/translations/
+code/.chainlit/translations/
 storage/logs/*
 vectorstores/*
 
 */.files/*
+code/storage/models/
Dockerfile CHANGED
@@ -26,6 +26,13 @@ WORKDIR /code/code
 
 RUN --mount=type=secret,id=HUGGINGFACEHUB_API_TOKEN,mode=0444,required=true
 RUN --mount=type=secret,id=OPENAI_API_KEY,mode=0444,required=true
+RUN --mount=type=secret,id=CHAINLIT_URL,mode=0444,required=true
+RUN --mount=type=secret,id=LITERAL_API_URL,mode=0444,required=true
+RUN --mount=type=secret,id=LLAMA_CLOUD_API_KEY,mode=0444,required=true
+RUN --mount=type=secret,id=OAUTH_GOOGLE_CLIENT_ID,mode=0444,required=true
+RUN --mount=type=secret,id=OAUTH_GOOGLE_CLIENT_SECRET,mode=0444,required=true
+RUN --mount=type=secret,id=LITERAL_API_KEY_LOGGING,mode=0444,required=true
+RUN --mount=type=secret,id=CHAINLIT_AUTH_SECRET,mode=0444,required=true
 
 # Default command to run the application
-CMD ["sh", "-c", "python -m modules.vectorstore.store_manager && chainlit run main.py --host 0.0.0.0 --port 7860"]
+CMD ["sh", "-c", "python -m modules.vectorstore.store_manager && uvicorn app:app --host 0.0.0.0 --port 7860"]
README.md CHANGED
@@ -15,10 +15,14 @@
 Hugging Face [Space](https://huggingface.co/spaces/dl4ds/dl4ds_tutor). It is pushed automatically from the `main` branch of this repo by this
 [Actions Workflow](https://github.com/DL4DS/dl4ds_tutor/blob/main/.github/workflows/push_to_hf_space.yml) upon a push to `main`.
 
-A "development" version of the Tutor is running live at [DL4DS Tutor -- Dev](https://dl4ds-tutor-dev.hf.space) from this Hugging Face
+
+A "development" version of the Tutor is running live at [DL4DS Tutor -- Dev](https://dl4ds-tutor-dev.hf.space/) from this Hugging Face
 [Space](https://huggingface.co/spaces/dl4ds/tutor_dev). It is pushed automatically from the `dev_branch` branch of this repo by this
 [Actions Workflow](https://github.com/DL4DS/dl4ds_tutor/blob/dev_branch/.github/workflows/push_to_hf_space_prototype.yml) upon a push to `dev_branch`.
 
+## Setup
+
+Please visit [setup](https://dl4ds.github.io/dl4ds_tutor/guide/setup/) for more information on setting up the project.
+
 ## Running Locally
@@ -34,7 +38,7 @@
 3. **To test Data Loading (Optional)**
    ```bash
    cd code
-   python -m modules.dataloader.data_loader
+   python -m modules.dataloader.data_loader --links "your_pdf_link"
    ```
 
 4. **Create the Vector Database**
@@ -43,47 +47,16 @@
    python -m modules.vectorstore.store_manager
    ```
    - Note: You need to run the above command when you add new data to the `storage/data` directory, or if the `storage/data/urls.txt` file is updated.
-   - Alternatively, you can set `["vectorstore"]["embedd_files"]` to `True` in the `code/modules/config/config.yaml` file, which will embed files from the storage directory every time you run the below chainlit command.
 
-5. **Run the Chainlit App**
+6. **Run the FastAPI App**
    ```bash
-   chainlit run main.py
+   cd code
+   uvicorn app:app --port 7860
    ```
 
-See the [docs](https://github.com/DL4DS/dl4ds_tutor/tree/main/docs) for more information.
-
-## File Structure
-
-```plaintext
-code/
-├── modules
-│   ├── chat            # Contains the chatbot implementation
-│   ├── chat_processor  # Contains the implementation to process and log the conversations
-│   ├── config          # Contains the configuration files
-│   ├── dataloader      # Contains the implementation to load the data from the storage directory
-│   ├── retriever       # Contains the implementation to create the retriever
-│   └── vectorstore     # Contains the implementation to create the vector database
-├── public
-│   ├── logo_dark.png   # Dark theme logo
-│   ├── logo_light.png  # Light theme logo
-│   └── test.css        # Custom CSS file
-└── main.py
-
-
-docs/           # Contains the documentation to the codebase and methods used
-
-storage/
-├── data        # Store files and URLs here
-├── logs        # Logs directory, includes logs on vector DB creation, tutor logs, and chunks logged in JSON files
-└── models      # Local LLMs are loaded from here
-
-vectorstores/   # Stores the created vector databases
-
-.env            # This needs to be created, store the API keys here
-```
-- `code/modules/vectorstore/vectorstore.py`: Instantiates the `VectorStore` class to create the vector database.
-- `code/modules/vectorstore/store_manager.py`: Instantiates the `VectorStoreManager` class to manage the vector database, and all associated methods.
-- `code/modules/retriever/retriever.py`: Instantiates the `Retriever` class to create the retriever.
+## Documentation
+
+Please visit the [docs](https://dl4ds.github.io/dl4ds_tutor/) for more information.
 
 
 ## Docker
@@ -97,4 +70,10 @@ docker run -it --rm -p 8000:8000 dev
 
 ## Contributing
 
-Please create an issue if you have any suggestions or improvements, and start working on it by creating a branch and by making a pull request to the main branch.
+Please create an issue if you have any suggestions or improvements, and start working on it by creating a branch and by making a pull request to the `dev_branch`.
+
+Please visit [contribute](https://dl4ds.github.io/dl4ds_tutor/guide/contribute/) for more information on contributing.
+
+## Future Work
+
+For more information on future work, please visit [roadmap](https://dl4ds.github.io/dl4ds_tutor/guide/readmap/).
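
The run command changes from `chainlit run main.py` to `uvicorn app:app` because the Chainlit tutor is now mounted inside a FastAPI application (see the `code/app.py` diff below). A minimal sketch of that entry point:

```python
# Minimal sketch of the new entry point; mirrors what code/app.py
# (added later in this diff) actually does.
from fastapi import FastAPI
from chainlit.utils import mount_chainlit

app = FastAPI()
# Serve the Chainlit app (main.py) under a sub-path of the FastAPI app.
mount_chainlit(app=app, target="main.py", path="/chainlit_tutor")
# Run with: uvicorn app:app --port 7860
```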
code/.chainlit/config.toml CHANGED
@@ -20,7 +20,7 @@ allow_origins = ["*"]
 
 [features]
 # Process and display HTML in messages. This can be a security risk (see https://stackoverflow.com/questions/19603097/why-is-it-dangerous-to-render-user-generated-html-or-javascript)
-unsafe_allow_html = false
+unsafe_allow_html = true
 
 # Process and display mathematical expressions. This can clash with "$" characters in messages.
 latex = true
@@ -49,6 +49,8 @@ auto_tag_thread = true
 # Sample rate of the audio
 sample_rate = 44100
 
+edit_message = true
+
 [UI]
 # Name of the assistant.
 name = "AI Tutor"
@@ -59,11 +61,11 @@ name = "AI Tutor"
 # Large size content are by default collapsed for a cleaner ui
 default_collapse_content = true
 
-# Hide the chain of thought details from the user in the UI.
-hide_cot = true
+# Chain of Thought (CoT) display mode. Can be "hidden", "tool_call" or "full".
+cot = "hidden"
 
 # Link to your github repo. This will add a github button in the UI's header.
-# github = "https://github.com/DL4DS/dl4ds_tutor"
+github = "https://github.com/DL4DS/dl4ds_tutor"
 
 # Specify a CSS file that can be used to customize the user interface.
 # The CSS file can be served from the public directory or via an external link.
@@ -85,7 +87,7 @@
 # custom_build = "./public/build"
 
 [UI.theme]
-default = "dark"
+default = "light"
 #layout = "wide"
 #font_family = "Inter, sans-serif"
 # Override default MUI light theme. (Check theme.ts)
@@ -115,4 +117,4 @@
 #secondary = "#BDBDBD"
 
 [meta]
-generated_by = "1.1.304"
+generated_by = "1.1.402"
code/__init__.py DELETED
@@ -1 +0,0 @@
-from .modules import *
code/app.py ADDED
@@ -0,0 +1,351 @@
+from fastapi import FastAPI, Request, Response, HTTPException
+from fastapi.responses import HTMLResponse, RedirectResponse
+from fastapi.templating import Jinja2Templates
+from google.oauth2 import id_token
+from google.auth.transport import requests as google_requests
+from google_auth_oauthlib.flow import Flow
+from chainlit.utils import mount_chainlit
+import secrets
+import json
+import base64
+from modules.config.constants import (
+    OAUTH_GOOGLE_CLIENT_ID,
+    OAUTH_GOOGLE_CLIENT_SECRET,
+    CHAINLIT_URL,
+    GITHUB_REPO,
+    DOCS_WEBSITE,
+    ALL_TIME_TOKENS_ALLOCATED,
+    TOKENS_LEFT,
+)
+from fastapi.middleware.cors import CORSMiddleware
+from fastapi.staticfiles import StaticFiles
+from modules.chat_processor.helpers import (
+    get_user_details,
+    get_time,
+    reset_tokens_for_user,
+    check_user_cooldown,
+    update_user_info,
+)
+
+GOOGLE_CLIENT_ID = OAUTH_GOOGLE_CLIENT_ID
+GOOGLE_CLIENT_SECRET = OAUTH_GOOGLE_CLIENT_SECRET
+GOOGLE_REDIRECT_URI = f"{CHAINLIT_URL}/auth/oauth/google/callback"
+
+app = FastAPI()
+app.mount("/public", StaticFiles(directory="public"), name="public")
+app.add_middleware(
+    CORSMiddleware,
+    allow_origins=["*"],  # Update with appropriate origins
+    allow_methods=["*"],
+    allow_headers=["*"],  # or specify the headers you want to allow
+    expose_headers=["X-User-Info"],  # Expose the custom header
+)
+
+templates = Jinja2Templates(directory="templates")
+session_store = {}
+CHAINLIT_PATH = "/chainlit_tutor"
+
+# only admin is given any additional permissions for now -- no limits on tokens
+USER_ROLES = {
+    "[email protected]": ["instructor", "bu"],
+    "[email protected]": ["admin", "instructor", "bu"],
+    "[email protected]": ["instructor", "bu"],
+    "[email protected]": ["guest"],
+    # Add more users and roles as needed
+}
+
+# Create a Google OAuth flow
+flow = Flow.from_client_config(
+    {
+        "web": {
+            "client_id": GOOGLE_CLIENT_ID,
+            "client_secret": GOOGLE_CLIENT_SECRET,
+            "auth_uri": "https://accounts.google.com/o/oauth2/auth",
+            "token_uri": "https://oauth2.googleapis.com/token",
+            "redirect_uris": [GOOGLE_REDIRECT_URI],
+            "scopes": [
+                "openid",
+                # "https://www.googleapis.com/auth/userinfo.email",
+                # "https://www.googleapis.com/auth/userinfo.profile",
+            ],
+        }
+    },
+    scopes=[
+        "openid",
+        "https://www.googleapis.com/auth/userinfo.email",
+        "https://www.googleapis.com/auth/userinfo.profile",
+    ],
+    redirect_uri=GOOGLE_REDIRECT_URI,
+)
+
+
+def get_user_role(username: str):
+    return USER_ROLES.get(username, ["guest"])  # Default to "guest" role
+
+
+async def get_user_info_from_cookie(request: Request):
+    user_info_encoded = request.cookies.get("X-User-Info")
+    if user_info_encoded:
+        try:
+            user_info_json = base64.b64decode(user_info_encoded).decode()
+            return json.loads(user_info_json)
+        except Exception as e:
+            print(f"Error decoding user info: {e}")
+            return None
+    return None
+
+
+async def del_user_info_from_cookie(request: Request, response: Response):
+    # Delete cookies from the response
+    response.delete_cookie("X-User-Info")
+    response.delete_cookie("session_token")
+    # Get the session token from the request cookies
+    session_token = request.cookies.get("session_token")
+    # Check if the session token exists in the session_store before deleting
+    if session_token and session_token in session_store:
+        del session_store[session_token]
+
+
+def get_user_info(request: Request):
+    session_token = request.cookies.get("session_token")
+    if session_token and session_token in session_store:
+        return session_store[session_token]
+    return None
+
+
+@app.get("/", response_class=HTMLResponse)
+async def login_page(request: Request):
+    user_info = await get_user_info_from_cookie(request)
+    if user_info and user_info.get("google_signed_in"):
+        return RedirectResponse("/post-signin")
+    return templates.TemplateResponse(
+        "login.html",
+        {"request": request, "GITHUB_REPO": GITHUB_REPO, "DOCS_WEBSITE": DOCS_WEBSITE},
+    )
+
+
+# @app.get("/login/guest")
+# async def login_guest():
+#     username = "guest"
+#     session_token = secrets.token_hex(16)
+#     unique_session_id = secrets.token_hex(8)
+#     username = f"{username}_{unique_session_id}"
+#     session_store[session_token] = {
+#         "email": username,
+#         "name": "Guest",
+#         "profile_image": "",
+#         "google_signed_in": False,  # Ensure guest users do not have this flag
+#     }
+#     user_info_json = json.dumps(session_store[session_token])
+#     user_info_encoded = base64.b64encode(user_info_json.encode()).decode()
+
+#     # Set cookies
+#     response = RedirectResponse(url="/post-signin", status_code=303)
+#     response.set_cookie(key="session_token", value=session_token)
+#     response.set_cookie(key="X-User-Info", value=user_info_encoded, httponly=True)
+#     return response
+
+
+@app.get("/login/google")
+async def login_google(request: Request):
+    # Clear any existing session cookies to avoid conflicts with guest sessions
+    response = RedirectResponse(url="/post-signin")
+    response.delete_cookie(key="session_token")
+    response.delete_cookie(key="X-User-Info")
+
+    user_info = await get_user_info_from_cookie(request)
+    # Check if user is already signed in using Google
+    if user_info and user_info.get("google_signed_in"):
+        return RedirectResponse("/post-signin")
+    else:
+        authorization_url, _ = flow.authorization_url(prompt="consent")
+        return RedirectResponse(authorization_url, headers=response.headers)
+
+
+@app.get("/auth/oauth/google/callback")
+async def auth_google(request: Request):
+    try:
+        flow.fetch_token(code=request.query_params.get("code"))
+        credentials = flow.credentials
+        user_info = id_token.verify_oauth2_token(
+            credentials.id_token, google_requests.Request(), GOOGLE_CLIENT_ID
+        )
+
+        email = user_info["email"]
+        name = user_info.get("name", "")
+        profile_image = user_info.get("picture", "")
+        role = get_user_role(email)
+
+        session_token = secrets.token_hex(16)
+        session_store[session_token] = {
+            "email": email,
+            "name": name,
+            "profile_image": profile_image,
+            "google_signed_in": True,  # Set this flag to True for Google-signed users
+        }
+
+        # add literalai user info to session store to be sent to chainlit
+        literalai_user = await get_user_details(email)
+        session_store[session_token]["literalai_info"] = literalai_user.to_dict()
+        session_store[session_token]["literalai_info"]["metadata"]["role"] = role
+
+        user_info_json = json.dumps(session_store[session_token])
+        user_info_encoded = base64.b64encode(user_info_json.encode()).decode()
+
+        # Set cookies
+        response = RedirectResponse(url="/post-signin", status_code=303)
+        response.set_cookie(key="session_token", value=session_token)
+        response.set_cookie(
+            key="X-User-Info", value=user_info_encoded, httponly=True
+        )  # TODO: is the flag httponly=True necessary?
+        return response
+    except Exception as e:
+        print(f"Error during Google OAuth callback: {e}")
+        return RedirectResponse(url="/", status_code=302)
+
+
+@app.get("/cooldown")
+async def cooldown(request: Request):
+    user_info = await get_user_info_from_cookie(request)
+    user_details = await get_user_details(user_info["email"])
+    current_datetime = get_time()
+    cooldown, cooldown_end_time = await check_user_cooldown(
+        user_details, current_datetime
+    )
+    print(f"User in cooldown: {cooldown}")
+    print(f"Cooldown end time: {cooldown_end_time}")
+    if cooldown and "admin" not in get_user_role(user_info["email"]):
+        return templates.TemplateResponse(
+            "cooldown.html",
+            {
+                "request": request,
+                "username": user_info["email"],
+                "role": get_user_role(user_info["email"]),
+                "cooldown_end_time": cooldown_end_time,
+                "tokens_left": user_details.metadata["tokens_left"],
+            },
+        )
+    else:
+        user_details.metadata["in_cooldown"] = False
+        await update_user_info(user_details)
+        await reset_tokens_for_user(user_details)
+        return RedirectResponse("/post-signin")
+
+
+@app.get("/post-signin", response_class=HTMLResponse)
+async def post_signin(request: Request):
+    user_info = await get_user_info_from_cookie(request)
+    if not user_info:
+        user_info = get_user_info(request)
+    user_details = await get_user_details(user_info["email"])
+    current_datetime = get_time()
+    user_details.metadata["last_login"] = current_datetime
+    # if new user, set the number of tries
+    if "tokens_left" not in user_details.metadata:
+        user_details.metadata["tokens_left"] = (
+            TOKENS_LEFT  # set the number of tokens left for the new user
+        )
+    if "last_message_time" not in user_details.metadata:
+        user_details.metadata["last_message_time"] = current_datetime
+    if "all_time_tokens_allocated" not in user_details.metadata:
+        user_details.metadata["all_time_tokens_allocated"] = ALL_TIME_TOKENS_ALLOCATED
+    if "in_cooldown" not in user_details.metadata:
+        user_details.metadata["in_cooldown"] = False
+    await update_user_info(user_details)
+
+    if "last_message_time" in user_details.metadata and "admin" not in get_user_role(
+        user_info["email"]
+    ):
+        cooldown, _ = await check_user_cooldown(user_details, current_datetime)
+        if cooldown:
+            user_details.metadata["in_cooldown"] = True
+            return RedirectResponse("/cooldown")
+        else:
+            user_details.metadata["in_cooldown"] = False
+            await reset_tokens_for_user(user_details)
+
+    if user_info:
+        username = user_info["email"]
+        role = get_user_role(username)
+        jwt_token = request.cookies.get("X-User-Info")
+        return templates.TemplateResponse(
+            "dashboard.html",
+            {
+                "request": request,
+                "username": username,
+                "role": role,
+                "jwt_token": jwt_token,
+                "tokens_left": user_details.metadata["tokens_left"],
+                "all_time_tokens_allocated": user_details.metadata[
+                    "all_time_tokens_allocated"
+                ],
+                "total_tokens_allocated": ALL_TIME_TOKENS_ALLOCATED,
+            },
+        )
+    return RedirectResponse("/")
+
+
+@app.get("/start-tutor")
+@app.post("/start-tutor")
+async def start_tutor(request: Request):
+    user_info = await get_user_info_from_cookie(request)
+    if user_info:
+        user_info_json = json.dumps(user_info)
+        user_info_encoded = base64.b64encode(user_info_json.encode()).decode()
+
+        response = RedirectResponse(CHAINLIT_PATH, status_code=303)
+        response.set_cookie(key="X-User-Info", value=user_info_encoded, httponly=True)
+        return response
+
+    return RedirectResponse(url="/")
+
+
+@app.exception_handler(HTTPException)
+async def http_exception_handler(request: Request, exc: HTTPException):
+    if exc.status_code == 404:
+        return templates.TemplateResponse(
+            "error_404.html", {"request": request}, status_code=404
+        )
+    return templates.TemplateResponse(
+        "error.html",
+        {"request": request, "error": str(exc)},
+        status_code=exc.status_code,
+    )
+
+
+@app.exception_handler(Exception)
+async def exception_handler(request: Request, exc: Exception):
+    return templates.TemplateResponse(
+        "error.html", {"request": request, "error": str(exc)}, status_code=500
+    )
+
+
+@app.get("/logout", response_class=HTMLResponse)
+async def logout(request: Request, response: Response):
+    await del_user_info_from_cookie(request=request, response=response)
+    response = RedirectResponse(url="/", status_code=302)
+    # Set cookies to empty values and expire them immediately
+    response.set_cookie(key="session_token", value="", expires=0)
+    response.set_cookie(key="X-User-Info", value="", expires=0)
+    return response
+
+
+@app.get("/get-tokens-left")
+async def get_tokens_left(request: Request):
+    try:
+        user_info = await get_user_info_from_cookie(request)
+        user_details = await get_user_details(user_info["email"])
+        await reset_tokens_for_user(user_details)
+        tokens_left = user_details.metadata["tokens_left"]
+        return {"tokens_left": tokens_left}
+    except Exception as e:
+        print(f"Error getting tokens left: {e}")
+        return {"tokens_left": 0}
+
+
+mount_chainlit(app=app, target="main.py", path=CHAINLIT_PATH)
+
+if __name__ == "__main__":
+    import uvicorn
+
+    uvicorn.run(app, host="127.0.0.1", port=7860)
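
Session identity travels between the FastAPI shell and the mounted Chainlit app as a base64-encoded JSON cookie (`X-User-Info`). A self-contained sketch of that round-trip, matching `get_user_info_from_cookie` above (the email is a placeholder):

```python
# Sketch of the X-User-Info cookie round-trip. Note that base64 is
# encoding, not encryption: the payload is readable by anyone who holds
# the cookie.
import base64
import json

user_info = {"email": "placeholder@example.com", "google_signed_in": True}

# Encode on the way out (what auth_google does when setting the cookie).
encoded = base64.b64encode(json.dumps(user_info).encode()).decode()
# Decode on the way back in (what get_user_info_from_cookie does).
decoded = json.loads(base64.b64decode(encoded).decode())
assert decoded == user_info
```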
code/chainlit.md CHANGED
@@ -1,10 +1,5 @@
 # Welcome to DL4DS Tutor! 🚀🤖
 
-Hi there, this is an LLM chatbot designed to help answer questions on the course content, built using Langchain and Chainlit.
-This is still very much a Work in Progress.
+Hi there, this is an LLM chatbot designed to help answer questions on the course content.
 
 ### --- Please wait while the Tutor loads... ---
-
-## Useful Links 🔗
-
-- **Documentation:** [Chainlit Documentation](https://docs.chainlit.io) 📚
code/chainlit_base.py ADDED
@@ -0,0 +1,484 @@
+import chainlit.data as cl_data
+import asyncio
+import yaml
+from typing import Any, Dict, no_type_check
+import chainlit as cl
+from modules.chat.llm_tutor import LLMTutor
+from modules.chat.helpers import (
+    get_sources,
+    get_history_chat_resume,
+    get_history_setup_llm,
+    get_last_config,
+)
+import copy
+from chainlit.types import ThreadDict
+import time
+from langchain_community.callbacks import get_openai_callback
+
+USER_TIMEOUT = 60_000
+SYSTEM = "System"
+LLM = "AI Tutor"
+AGENT = "Agent"
+YOU = "User"
+ERROR = "Error"
+
+with open("modules/config/config.yml", "r") as f:
+    config = yaml.safe_load(f)
+
+
+# async def setup_data_layer():
+#     """
+#     Set up the data layer for chat logging.
+#     """
+#     if config["chat_logging"]["log_chat"]:
+#         data_layer = CustomLiteralDataLayer(
+#             api_key=LITERAL_API_KEY_LOGGING, server=LITERAL_API_URL
+#         )
+#     else:
+#         data_layer = None
+
+#     return data_layer
+
+
+class Chatbot:
+    def __init__(self, config):
+        """
+        Initialize the Chatbot class.
+        """
+        self.config = config
+
+    async def _load_config(self):
+        """
+        Load the configuration from a YAML file.
+        """
+        with open("modules/config/config.yml", "r") as f:
+            return yaml.safe_load(f)
+
+    @no_type_check
+    async def setup_llm(self):
+        """
+        Set up the LLM with the provided settings. Update the configuration and initialize the LLM tutor.
+
+        #TODO: Clean this up.
+        """
+        start_time = time.time()
+
+        llm_settings = cl.user_session.get("llm_settings", {})
+        (
+            chat_profile,
+            retriever_method,
+            memory_window,
+            llm_style,
+            generate_follow_up,
+            chunking_mode,
+        ) = (
+            llm_settings.get("chat_model"),
+            llm_settings.get("retriever_method"),
+            llm_settings.get("memory_window"),
+            llm_settings.get("llm_style"),
+            llm_settings.get("follow_up_questions"),
+            llm_settings.get("chunking_mode"),
+        )
+
+        chain = cl.user_session.get("chain")
+        memory_list = cl.user_session.get(
+            "memory",
+            (
+                list(chain.store.values())[0].messages
+                if len(chain.store.values()) > 0
+                else []
+            ),
+        )
+        conversation_list = get_history_setup_llm(memory_list)
+
+        old_config = copy.deepcopy(self.config)
+        self.config["vectorstore"]["db_option"] = retriever_method
+        self.config["llm_params"]["memory_window"] = memory_window
+        self.config["llm_params"]["llm_style"] = llm_style
+        self.config["llm_params"]["llm_loader"] = chat_profile
+        self.config["llm_params"]["generate_follow_up"] = generate_follow_up
+        self.config["splitter_options"]["chunking_mode"] = chunking_mode
+
+        self.llm_tutor.update_llm(
+            old_config, self.config
+        )  # update only llm attributes that are changed
+        self.chain = self.llm_tutor.qa_bot(
+            memory=conversation_list,
+        )
+
+        cl.user_session.set("chain", self.chain)
+        cl.user_session.set("llm_tutor", self.llm_tutor)
+
+        print("Time taken to setup LLM: ", time.time() - start_time)
+
+    @no_type_check
+    async def update_llm(self, new_settings: Dict[str, Any]):
+        """
+        Update the LLM settings and reinitialize the LLM with the new settings.
+
+        Args:
+            new_settings (Dict[str, Any]): The new settings to update.
+        """
+        cl.user_session.set("llm_settings", new_settings)
+        await self.inform_llm_settings()
+        await self.setup_llm()
+
+    async def make_llm_settings_widgets(self, config=None):
+        """
+        Create and send the widgets for LLM settings configuration.
+
+        Args:
+            config: The configuration to use for setting up the widgets.
+        """
+        config = config or self.config
+        await cl.ChatSettings(
+            [
+                cl.input_widget.Select(
+                    id="chat_model",
+                    label="Model Name (Default GPT-3)",
+                    values=["local_llm", "gpt-3.5-turbo-1106", "gpt-4", "gpt-4o-mini"],
+                    initial_index=[
+                        "local_llm",
+                        "gpt-3.5-turbo-1106",
+                        "gpt-4",
+                        "gpt-4o-mini",
+                    ].index(config["llm_params"]["llm_loader"]),
+                ),
+                cl.input_widget.Select(
+                    id="retriever_method",
+                    label="Retriever (Default FAISS)",
+                    values=["FAISS", "Chroma", "RAGatouille", "RAPTOR"],
+                    initial_index=["FAISS", "Chroma", "RAGatouille", "RAPTOR"].index(
+                        config["vectorstore"]["db_option"]
+                    ),
+                ),
+                cl.input_widget.Slider(
+                    id="memory_window",
+                    label="Memory Window (Default 3)",
+                    initial=3,
+                    min=0,
+                    max=10,
+                    step=1,
+                ),
+                cl.input_widget.Switch(
+                    id="view_sources", label="View Sources", initial=False
+                ),
+                cl.input_widget.Switch(
+                    id="stream_response",
+                    label="Stream response",
+                    initial=config["llm_params"]["stream"],
+                ),
+                cl.input_widget.Select(
+                    id="chunking_mode",
+                    label="Chunking mode",
+                    values=["fixed", "semantic"],
+                    initial_index=1,
+                ),
+                cl.input_widget.Switch(
+                    id="follow_up_questions",
+                    label="Generate follow up questions",
+                    initial=False,
+                ),
+                cl.input_widget.Select(
+                    id="llm_style",
+                    label="Type of Conversation (Default Normal)",
+                    values=["Normal", "ELI5"],
+                    initial_index=0,
+                ),
+            ]
+        ).send()
+
+    @no_type_check
+    async def inform_llm_settings(self):
+        """
+        Inform the user about the updated LLM settings and display them as a message.
+        """
+        llm_settings: Dict[str, Any] = cl.user_session.get("llm_settings", {})
+        llm_tutor = cl.user_session.get("llm_tutor")
+        settings_dict = {
+            "model": llm_settings.get("chat_model"),
+            "retriever": llm_settings.get("retriever_method"),
+            "memory_window": llm_settings.get("memory_window"),
+            "num_docs_in_db": (
+                len(llm_tutor.vector_db)
+                if llm_tutor and hasattr(llm_tutor, "vector_db")
+                else 0
+            ),
+            "view_sources": llm_settings.get("view_sources"),
+            "follow_up_questions": llm_settings.get("follow_up_questions"),
+        }
+        print("Settings Dict: ", settings_dict)
+        await cl.Message(
+            author=SYSTEM,
+            content="LLM settings have been updated. You can continue with your Query!",
+            # elements=[
+            #     cl.Text(
+            #         name="settings",
+            #         display="side",
+            #         content=json.dumps(settings_dict, indent=4),
+            #         language="json",
+            #     ),
+            # ],
+        ).send()
+
+    async def set_starters(self):
+        """
+        Set starter messages for the chatbot.
+        """
+        # Return Starters only if the chat is new
+
+        try:
+            thread = cl_data._data_layer.get_thread(
+                cl.context.session.thread_id
+            )  # see if the thread has any steps
+            if thread.steps or len(thread.steps) > 0:
+                return None
+        except Exception as e:
+            print(e)
+            return [
+                cl.Starter(
+                    label="recording on CNNs?",
+                    message="Where can I find the recording for the lecture on Transformers?",
+                    icon="/public/adv-screen-recorder-svgrepo-com.svg",
+                ),
+                cl.Starter(
+                    label="where's the slides?",
+                    message="When are the lectures? I can't find the schedule.",
+                    icon="/public/alarmy-svgrepo-com.svg",
+                ),
+                cl.Starter(
+                    label="Due Date?",
+                    message="When is the final project due?",
+                    icon="/public/calendar-samsung-17-svgrepo-com.svg",
+                ),
+                cl.Starter(
+                    label="Explain backprop.",
+                    message="I didn't understand the math behind backprop, could you explain it?",
+                    icon="/public/acastusphoton-svgrepo-com.svg",
+                ),
+            ]
+
+    def rename(self, orig_author: str):
+        """
+        Rename the original author to a more user-friendly name.
+
+        Args:
+            orig_author (str): The original author's name.
+
+        Returns:
+            str: The renamed author.
+        """
+        rename_dict = {"Chatbot": LLM}
+        return rename_dict.get(orig_author, orig_author)
+
+    async def start(self, config=None):
+        """
+        Start the chatbot, initialize settings widgets,
+        and display and load previous conversation if chat logging is enabled.
+        """
+
+        start_time = time.time()
+
+        self.config = (
+            await self._load_config() if config is None else config
+        )  # Reload the configuration on chat resume
+
+        await self.make_llm_settings_widgets(self.config)  # Reload the settings widgets
+
+        user = cl.user_session.get("user")
+
+        # TODO: remove self.user with cl.user_session.get("user")
+        try:
+            self.user = {
+                "user_id": user.identifier,
+                "session_id": cl.context.session.thread_id,
+            }
+        except Exception as e:
+            print(e)
+            self.user = {
+                "user_id": "guest",
+                "session_id": cl.context.session.thread_id,
+            }
+
+        memory = cl.user_session.get("memory", [])
+        self.llm_tutor = LLMTutor(self.config, user=self.user)
+
+        self.chain = self.llm_tutor.qa_bot(
+            memory=memory,
+        )
+        self.question_generator = self.llm_tutor.question_generator
+        cl.user_session.set("llm_tutor", self.llm_tutor)
+        cl.user_session.set("chain", self.chain)
+
+        print("Time taken to start LLM: ", time.time() - start_time)
+
+    async def stream_response(self, response):
+        """
+        Stream the response from the LLM.
+
+        Args:
+            response: The response from the LLM.
+        """
+        msg = cl.Message(content="")
+        await msg.send()
+
+        output = {}
+        for chunk in response:
+            if "answer" in chunk:
+                await msg.stream_token(chunk["answer"])
+
+            for key in chunk:
+                if key not in output:
+                    output[key] = chunk[key]
+                else:
+                    output[key] += chunk[key]
+        return output
+
+    async def main(self, message):
+        """
+        Process and Display the Conversation.
+
+        Args:
+            message: The incoming chat message.
+        """
+
+        start_time = time.time()
+
+        chain = cl.user_session.get("chain")
+        token_count = 0  # initialize token count
+        if not chain:
+            await self.start()  # start the chatbot if the chain is not present
+            chain = cl.user_session.get("chain")
+
+        # update user info with last message time
+        llm_settings = cl.user_session.get("llm_settings", {})
+        view_sources = llm_settings.get("view_sources", False)
+        stream = llm_settings.get("stream_response", False)
+        stream = False  # Fix streaming
+        user_query_dict = {"input": message.content}
+        # Define the base configuration
+        cb = cl.AsyncLangchainCallbackHandler()
+        chain_config = {
+            "configurable": {
+                "user_id": self.user["user_id"],
+                "conversation_id": self.user["session_id"],
+                "memory_window": self.config["llm_params"]["memory_window"],
+            },
+            "callbacks": (
+                [cb]
+                if cl_data._data_layer and self.config["chat_logging"]["callbacks"]
+                else None
+            ),
+        }
+
+        with get_openai_callback() as token_count_cb:
+            if stream:
+                res = chain.stream(user_query=user_query_dict, config=chain_config)
+                res = await self.stream_response(res)
+            else:
+                res = await chain.invoke(
+                    user_query=user_query_dict,
+                    config=chain_config,
+                )
+        token_count += token_count_cb.total_tokens
+
+        answer = res.get("answer", res.get("result"))
+
+        answer_with_sources, source_elements, sources_dict = get_sources(
+            res, answer, stream=stream, view_sources=view_sources
+        )
+        answer_with_sources = answer_with_sources.replace("$$", "$")
+
+        print("Time taken to process the message: ", time.time() - start_time)
+
+        actions = []
+
+        if self.config["llm_params"]["generate_follow_up"]:
+            start_time = time.time()
+            cb_follow_up = cl.AsyncLangchainCallbackHandler()
+            config = {
+                "callbacks": (
+                    [cb_follow_up]
+                    if cl_data._data_layer and self.config["chat_logging"]["callbacks"]
+                    else None
+                )
+            }
+            with get_openai_callback() as token_count_cb:
+                list_of_questions = await self.question_generator.generate_questions(
+                    query=user_query_dict["input"],
+                    response=answer,
+                    chat_history=res.get("chat_history"),
+                    context=res.get("context"),
+                    config=config,
+                )

+            token_count += token_count_cb.total_tokens
+
+            for question in list_of_questions:
+                actions.append(
+                    cl.Action(
+                        name="follow up question",
+                        value="example_value",
+                        description=question,
+                        label=question,
+                    )
+                )
+
+            print("Time taken to generate questions: ", time.time() - start_time)
+        print("Total Tokens Used: ", token_count)
+
+        await cl.Message(
+            content=answer_with_sources,
+            elements=source_elements,
+            author=LLM,
+            actions=actions,
+            metadata=self.config,
+        ).send()
+
+    async def on_chat_resume(self, thread: ThreadDict):
+        thread_config = None
+        steps = thread["steps"]
+        k = self.config["llm_params"][
+            "memory_window"
+        ]  # on resume, always use the default memory window
+        conversation_list = get_history_chat_resume(steps, k, SYSTEM, LLM)
+        thread_config = get_last_config(
+            steps
+        )  # TODO: Returns None for now - which causes config to be reloaded with default values
+        cl.user_session.set("memory", conversation_list)
+        await self.start(config=thread_config)
+
+    async def on_follow_up(self, action: cl.Action):
+        user = cl.user_session.get("user")
+        message = await cl.Message(
+            content=action.description,
+            type="user_message",
+            author=user.identifier,
+        ).send()
+        async with cl.Step(
+            name="on_follow_up", type="run", parent_id=message.id
+        ) as step:
+            await self.main(message)
+            step.output = message.content
+
+
+chatbot = Chatbot(config=config)
+
+
+async def start_app():
+    # cl_data._data_layer = await setup_data_layer()
+    # chatbot.literal_client = cl_data._data_layer.client if cl_data._data_layer else None
+    cl.set_starters(chatbot.set_starters)
+    cl.author_rename(chatbot.rename)
+    cl.on_chat_start(chatbot.start)
+    cl.on_chat_resume(chatbot.on_chat_resume)
+    cl.on_message(chatbot.main)
+    cl.on_settings_update(chatbot.update_llm)
+    cl.action_callback("follow up question")(chatbot.on_follow_up)
+
+
+loop = asyncio.get_event_loop()
+if loop.is_running():
+    asyncio.ensure_future(start_app())
+else:
+    asyncio.run(start_app())
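
Both `chainlit_base.py` and the updated `main.py` meter usage with LangChain's `get_openai_callback`, which tallies tokens for OpenAI calls made inside its context manager. A minimal usage sketch (the actual chain invocation is elided):

```python
# Minimal sketch of the token-metering pattern used in main() above.
from langchain_community.callbacks import get_openai_callback

token_count = 0
with get_openai_callback() as token_count_cb:
    pass  # invoke the chain / question generator here
# total_tokens remains readable after the context exits.
token_count += token_count_cb.total_tokens
print("Total Tokens Used: ", token_count)
```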
code/main.py CHANGED
@@ -1,15 +1,12 @@
 import chainlit.data as cl_data
 import asyncio
 from modules.config.constants import (
-    LLAMA_PATH,
     LITERAL_API_KEY_LOGGING,
     LITERAL_API_URL,
 )
 from modules.chat_processor.literal_ai import CustomLiteralDataLayer
-
 import json
 import yaml
-import os
 from typing import Any, Dict, no_type_check
 import chainlit as cl
 from modules.chat.llm_tutor import LLMTutor
@@ -19,17 +16,27 @@ from modules.chat.helpers import (
     get_history_setup_llm,
     get_last_config,
 )
+from modules.chat_processor.helpers import (
+    update_user_info,
+    get_time,
+    check_user_cooldown,
+    reset_tokens_for_user,
+    get_user_details,
+)
 import copy
 from typing import Optional
 from chainlit.types import ThreadDict
 import time
+import base64
+from langchain_community.callbacks import get_openai_callback
+from datetime import datetime, timezone
 
 USER_TIMEOUT = 60_000
-SYSTEM = "System 🖥️"
-LLM = "LLM 🧠"
-AGENT = "Agent <>"
-YOU = "You 😃"
-ERROR = "Error 🚫"
+SYSTEM = "System"
+LLM = "AI Tutor"
+AGENT = "Agent"
+YOU = "User"
+ERROR = "Error"
 
 with open("modules/config/config.yml", "r") as f:
     config = yaml.safe_load(f)
@@ -49,6 +56,24 @@ async def setup_data_layer():
     return data_layer
 
 
+async def update_user_from_chainlit(user, token_count=0):
+    if "admin" not in user.metadata["role"]:
+        user.metadata["tokens_left"] = user.metadata["tokens_left"] - token_count
+        user.metadata["all_time_tokens_allocated"] = (
+            user.metadata["all_time_tokens_allocated"] - token_count
+        )
+        user.metadata["tokens_left_at_last_message"] = user.metadata[
+            "tokens_left"
+        ]  # tokens_left will keep regenerating outside of chainlit
+    user.metadata["last_message_time"] = get_time()
+    await update_user_info(user)
+
+    tokens_left = user.metadata["tokens_left"]
+    if tokens_left < 0:
+        tokens_left = 0
+    return tokens_left
+
+
 class Chatbot:
     def __init__(self, config):
         """
@@ -73,7 +98,14 @@ class Chatbot:
         start_time = time.time()
 
         llm_settings = cl.user_session.get("llm_settings", {})
-        chat_profile, retriever_method, memory_window, llm_style, generate_follow_up, chunking_mode = (
+        (
+            chat_profile,
+            retriever_method,
+            memory_window,
+            llm_style,
+            generate_follow_up,
+            chunking_mode,
+        ) = (
             llm_settings.get("chat_model"),
             llm_settings.get("retriever_method"),
             llm_settings.get("memory_window"),
@@ -106,15 +138,8 @@ class Chatbot:
         )  # update only llm attributes that are changed
         self.chain = self.llm_tutor.qa_bot(
             memory=conversation_list,
-            callbacks=(
-                [cl.LangchainCallbackHandler()]
-                if cl_data._data_layer and self.config["chat_logging"]["callbacks"]
-                else None
-            ),
         )
 
-        tags = [chat_profile, self.config["vectorstore"]["db_option"]]
-
         cl.user_session.set("chain", self.chain)
         cl.user_session.set("llm_tutor", self.llm_tutor)
 
@@ -180,7 +205,7 @@ class Chatbot:
             cl.input_widget.Select(
                 id="chunking_mode",
                 label="Chunking mode",
-                values=['fixed', 'semantic'],
+                values=["fixed", "semantic"],
                 initial_index=1,
             ),
             cl.input_widget.Switch(
@@ -216,17 +241,18 @@ class Chatbot:
             "view_sources": llm_settings.get("view_sources"),
             "follow_up_questions": llm_settings.get("follow_up_questions"),
         }
+        print("Settings Dict: ", settings_dict)
         await cl.Message(
             author=SYSTEM,
             content="LLM settings have been updated. You can continue with your Query!",
-            elements=[
-                cl.Text(
-                    name="settings",
-                    display="side",
-                    content=json.dumps(settings_dict, indent=4),
-                    language="json",
-                ),
-            ],
+            # elements=[
+            #     cl.Text(
+            #         name="settings",
+            #         display="side",
+            #         content=json.dumps(settings_dict, indent=4),
+            #         language="json",
+            #     ),
+            # ],
         ).send()
@@ -241,7 +267,8 @@ class Chatbot:
             )  # see if the thread has any steps
             if thread.steps or len(thread.steps) > 0:
                 return None
-        except:
+        except Exception as e:
+            print(e)
             return [
                 cl.Starter(
                     label="recording on CNNs?",
@@ -275,7 +302,7 @@ class Chatbot:
         Returns:
            str: The renamed author.
         """
-        rename_dict = {"Chatbot": "AI Tutor"}
+        rename_dict = {"Chatbot": LLM}
         return rename_dict.get(orig_author, orig_author)
 
     async def start(self, config=None):
@@ -292,25 +319,26 @@ class Chatbot:
 
         await self.make_llm_settings_widgets(self.config)  # Reload the settings widgets
 
-        await self.make_llm_settings_widgets(self.config)
         user = cl.user_session.get("user")
-        self.user = {
-            "user_id": user.identifier,
-            "session_id": cl.context.session.thread_id,
-        }
 
-        memory = cl.user_session.get("memory", [])
+        # TODO: remove self.user with cl.user_session.get("user")
+        try:
+            self.user = {
+                "user_id": user.identifier,
+                "session_id": cl.context.session.thread_id,
+            }
+        except Exception as e:
+            print(e)
+            self.user = {
+                "user_id": "guest",
+                "session_id": cl.context.session.thread_id,
+            }
 
-        cl.user_session.set("user", self.user)
+        memory = cl.user_session.get("memory", [])
         self.llm_tutor = LLMTutor(self.config, user=self.user)
 
         self.chain = self.llm_tutor.qa_bot(
             memory=memory,
-            callbacks=(
-                [cl.LangchainCallbackHandler()]
-                if cl_data._data_layer and self.config["chat_logging"]["callbacks"]
-                else None
-            ),
         )
         self.question_generator = self.llm_tutor.question_generator
         cl.user_session.set("llm_tutor", self.llm_tutor)
@@ -351,29 +379,98 @@ class Chatbot:
         start_time = time.time()
 
         chain = cl.user_session.get("chain")
+        token_count = 0  # initialize token count
+        if not chain:
+            await self.start()  # start the chatbot if the chain is not present
+            chain = cl.user_session.get("chain")
+
+        # update user info with last message time
+        user = cl.user_session.get("user")
+        await reset_tokens_for_user(user)
+        updated_user = await get_user_details(user.identifier)
+        user.metadata = updated_user.metadata
+        cl.user_session.set("user", user)
+
+        print("\n\n User Tokens Left: ", user.metadata["tokens_left"])
+
+        # see if user has token credits left
+        # if not, return message saying they have run out of tokens
+        if user.metadata["tokens_left"] <= 0 and "admin" not in user.metadata["role"]:
+            current_datetime = get_time()
+            cooldown, cooldown_end_time = await check_user_cooldown(
+                user, current_datetime
+            )
+            if cooldown:
+                # get time left in cooldown
+                # convert both to datetime objects
+                cooldown_end_time = datetime.fromisoformat(cooldown_end_time).replace(
+                    tzinfo=timezone.utc
+                )
+                current_datetime = datetime.fromisoformat(current_datetime).replace(
+                    tzinfo=timezone.utc
+                )
+                cooldown_time_left = cooldown_end_time - current_datetime
+                # Get the total seconds
+                total_seconds = int(cooldown_time_left.total_seconds())
+                # Calculate hours, minutes, and seconds
+                hours, remainder = divmod(total_seconds, 3600)
+                minutes, seconds = divmod(remainder, 60)
+                # Format the time as 00 hrs 00 mins 00 secs
+                formatted_time = f"{hours:02} hrs {minutes:02} mins {seconds:02} secs"
+                await cl.Message(
+                    content=(
+                        "Ah, seems like you have run out of tokens...Click "
+                        '<a href="/cooldown" style="color: #0000CD; text-decoration: none;" target="_self">here</a> for more info. Please come back after {}'.format(
+                            formatted_time
+                        )
+                    ),
+                    author=SYSTEM,
+                ).send()
+                user.metadata["in_cooldown"] = True
+                await update_user_info(user)
+                return
+            else:
+                await cl.Message(
+                    content=(
+                        "Ah, seems like you don't have any tokens left...Please wait while we regenerate your tokens. Click "
+                        '<a href="/cooldown" style="color: #0000CD; text-decoration: none;" target="_self">here</a> to view your token credits.'
+                    ),
+                    author=SYSTEM,
+                ).send()
+                return
+
+        user.metadata["in_cooldown"] = False
 
         llm_settings = cl.user_session.get("llm_settings", {})
         view_sources = llm_settings.get("view_sources", False)
         stream = llm_settings.get("stream_response", False)
-        steam = False  # Fix streaming
+        stream = False  # Fix streaming
         user_query_dict = {"input": message.content}
         # Define the base configuration
+        cb = cl.AsyncLangchainCallbackHandler()
         chain_config = {
             "configurable": {
                 "user_id": self.user["user_id"],
                 "conversation_id": self.user["session_id"],
                 "memory_window": self.config["llm_params"]["memory_window"],
-            }
+            },
+            "callbacks": (
+                [cb]
+                if cl_data._data_layer and self.config["chat_logging"]["callbacks"]
+                else None
+            ),
         }
 
-        if stream:
-            res = chain.stream(user_query=user_query_dict, config=chain_config)
-            res = await self.stream_response(res)
-        else:
-            res = await chain.invoke(
-                user_query=user_query_dict,
-                config=chain_config,
-            )
+        with get_openai_callback() as token_count_cb:
+            if stream:
+                res = chain.stream(user_query=user_query_dict, config=chain_config)
+                res = await self.stream_response(res)
+            else:
+                res = await chain.invoke(
+                    user_query=user_query_dict,
+                    config=chain_config,
+                )
+        token_count += token_count_cb.total_tokens
 
         answer = res.get("answer", res.get("result"))
 
@@ -388,15 +485,26 @@ class Chatbot:
 
         if self.config["llm_params"]["generate_follow_up"]:
             start_time = time.time()
-            list_of_questions = self.question_generator.generate_questions(
-                query=user_query_dict["input"],
-                response=answer,
-                chat_history=res.get("chat_history"),
-                context=res.get("context"),
-            )
+            cb_follow_up = cl.AsyncLangchainCallbackHandler()
+            config = {
+                "callbacks": (
+                    [cb_follow_up]
+                    if cl_data._data_layer and self.config["chat_logging"]["callbacks"]
+                    else None
+                )
+            }
+            with get_openai_callback() as token_count_cb:
+                list_of_questions = await self.question_generator.generate_questions(
+                    query=user_query_dict["input"],
+                    response=answer,
+                    chat_history=res.get("chat_history"),
+                    context=res.get("context"),
+                    config=config,
+                )
 
-            for question in list_of_questions:
+            token_count += token_count_cb.total_tokens
+
+            for question in list_of_questions:
                 actions.append(
                     cl.Action(
                         name="follow up question",
@@ -408,6 +516,15 @@ class Chatbot:
 
             print("Time taken to generate questions: ", time.time() - start_time)
 
+        # # update user info with token count
+        tokens_left = await update_user_from_chainlit(user, token_count)
+
+        answer_with_sources += (
+            '\n\n<footer><span style="font-size: 0.8em; text-align: right; display: block;">Tokens Left: '
+            + str(tokens_left)
+            + "</span></footer>\n"
+        )
+
         await cl.Message(
             content=answer_with_sources,
             elements=source_elements,
@@ -429,22 +546,46 @@ class Chatbot:
         cl.user_session.set("memory", conversation_list)
         await self.start(config=thread_config)
 
-    @cl.oauth_callback
-    def auth_callback(
-        provider_id: str,
-        token: str,
-        raw_user_data: Dict[str, str],
-        default_user: cl.User,
-    ) -> Optional[cl.User]:
-        return default_user
+    @cl.header_auth_callback
+    def header_auth_callback(headers: dict) -> Optional[cl.User]:
+        print("\n\n\nI am here\n\n\n")
+        # try:  # TODO: Add try-except block after testing
+        # TODO: Implement to get the user information from the headers (not the cookie)
+        cookie = headers.get("cookie")  # gets back a str
+        # Create a dictionary from the pairs
+        cookie_dict = {}
+        for pair in cookie.split("; "):
+            key, value = pair.split("=", 1)
 
     async def on_follow_up(self, action: cl.Action):
         message = await cl.Message(
            content=action.description,
            type="user_message",
-            author=self.user["user_id"],
        ).send()
-        await self.main(message)
 
 
 chatbot = Chatbot(config=config)
@@ -462,4 +603,8 @@ async def start_app():
     cl.action_callback("follow up question")(chatbot.on_follow_up)
 
 
-asyncio.run(start_app())
559
+ # Strip surrounding quotes if present
560
+ cookie_dict[key] = value.strip('"')
561
+
562
+ decoded_user_info = base64.b64decode(
563
+ cookie_dict.get("X-User-Info", "")
564
+ ).decode()
565
+ decoded_user_info = json.loads(decoded_user_info)
566
+
567
+ print(
568
+ f"\n\n USER ROLE: {decoded_user_info['literalai_info']['metadata']['role']} \n\n"
569
+ )
570
+
571
+ return cl.User(
572
+ id=decoded_user_info["literalai_info"]["id"],
573
+ identifier=decoded_user_info["literalai_info"]["identifier"],
574
+ metadata=decoded_user_info["literalai_info"]["metadata"],
575
+ )
576
 
577
  async def on_follow_up(self, action: cl.Action):
578
+ user = cl.user_session.get("user")
579
  message = await cl.Message(
580
  content=action.description,
581
  type="user_message",
582
+ author=user.identifier,
583
  ).send()
584
+ async with cl.Step(
585
+ name="on_follow_up", type="run", parent_id=message.id
586
+ ) as step:
587
+ await self.main(message)
588
+ step.output = message.content
589
 
590
 
591
  chatbot = Chatbot(config=config)
 
603
  cl.action_callback("follow up question")(chatbot.on_follow_up)
604
 
605
 
606
+ loop = asyncio.get_event_loop()
607
+ if loop.is_running():
608
+ asyncio.ensure_future(start_app())
609
+ else:
610
+ asyncio.run(start_app())
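Reviewer note: the `header_auth_callback` above assumes the hosting proxy sets an `X-User-Info` cookie whose value is base64-encoded JSON carrying a `literalai_info` object. A minimal sketch of the round trip the callback expects — the identifiers below are placeholders, not real accounts:

    import base64
    import json

    # Hypothetical payload with the keys header_auth_callback reads.
    user_info = {
        "literalai_info": {
            "id": "user-123",
            "identifier": "student@example.com",
            "metadata": {"role": "student", "tokens_left": 2000},
        }
    }

    # What the proxy would store in the X-User-Info cookie ...
    cookie_value = base64.b64encode(json.dumps(user_info).encode()).decode()

    # ... and what the callback recovers from it.
    decoded = json.loads(base64.b64decode(cookie_value).decode())
    assert decoded["literalai_info"]["metadata"]["role"] == "student"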
code/modules/chat/chat_model_loader.py CHANGED
@@ -1,15 +1,8 @@
 from langchain_openai import ChatOpenAI
-from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline
-from transformers import AutoTokenizer, TextStreamer
 from langchain_community.llms import LlamaCpp
-import torch
-import transformers
 import os
 from pathlib import Path
 from huggingface_hub import hf_hub_download
-from langchain.callbacks.manager import CallbackManager
-from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
-from modules.config.constants import LLAMA_PATH


 class ChatModelLoader:
@@ -35,10 +28,10 @@ class ChatModelLoader:
         elif self.config["llm_params"]["llm_loader"] == "local_llm":
             n_batch = 512  # Should be between 1 and n_ctx, consider the amount of VRAM in your GPU.
             model_path = self._verify_model_cache(
-                self.config["llm_params"]["local_llm_params"]["model"]
+                self.config["llm_params"]["local_llm_params"]["model_path"]
             )
             llm = LlamaCpp(
-                model_path=LLAMA_PATH,
+                model_path=model_path,
                 n_batch=n_batch,
                 n_ctx=2048,
                 f16_kv=True,
code/modules/chat/helpers.py CHANGED
@@ -42,7 +42,6 @@ def get_sources(res, answer, stream=True, view_sources=False):
     full_answer += answer

     if view_sources:
-
         # Then, display the sources
         # check if the answer has sources
         if len(source_dict) == 0:
@@ -51,7 +50,6 @@ def get_sources(res, answer, stream=True, view_sources=False):
         else:
             full_answer += "\n\n**Sources:**\n"
             for idx, (url_name, source_data) in enumerate(source_dict.items()):
-
                 full_answer += f"\nSource {idx + 1} (Score: {source_data['score']}): {source_data['url']}\n"

                 name = f"Source {idx + 1} Text\n"
@@ -110,6 +108,7 @@ def get_prompt(config, prompt_type):
     return prompts["openai"]["rephrase_prompt"]


+# TODO: Do this better
 def get_history_chat_resume(steps, k, SYSTEM, LLM):
     conversation_list = []
     count = 0
@@ -119,14 +118,17 @@ def get_history_chat_resume(steps, k, SYSTEM, LLM):
             conversation_list.append(
                 {"type": "user_message", "content": step["output"]}
             )
+            count += 1
         elif step["type"] == "assistant_message":
             if step["name"] == LLM:
                 conversation_list.append(
                     {"type": "ai_message", "content": step["output"]}
                 )
+            count += 1
         else:
-            raise ValueError("Invalid message type")
-        count += 1
+            pass
+            # raise ValueError("Invalid message type")
+            # count += 1
         if count >= 2 * k:  # 2 * k to account for both user and assistant messages
             break
     conversation_list = conversation_list[::-1]
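Reviewer note: with the moved `count += 1`, only messages that are actually appended advance the counter, so `2 * k` now bounds kept messages rather than scanned steps. A standalone sketch of the same windowing over hypothetical step dicts:

    steps = [  # newest-first, as consumed by the loop
        {"type": "assistant_message", "name": "AI Tutor", "output": "Pooling ..."},
        {"type": "user_message", "name": "student", "output": "And pooling?"},
        {"type": "run", "name": "retriever", "output": "..."},  # ignored, no count
        {"type": "assistant_message", "name": "AI Tutor", "output": "A CNN is ..."},
        {"type": "user_message", "name": "student", "output": "What is a CNN?"},
    ]

    k = 1  # keep the last k user/assistant exchanges
    conversation_list, count = [], 0
    for step in steps:
        if step["type"] in ("user_message", "assistant_message"):
            conversation_list.append(step["output"])
            count += 1
        if count >= 2 * k:
            break
    print(conversation_list[::-1])  # ['And pooling?', 'Pooling ...']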
code/modules/chat/langchain/__init__.py ADDED
File without changes
code/modules/chat/langchain/langchain_rag.py CHANGED
@@ -1,20 +1,24 @@
 from langchain_core.prompts import ChatPromptTemplate

-from modules.chat.langchain.utils import *
-from langchain.memory import ChatMessageHistory
+# from modules.chat.langchain.utils import
+from langchain_community.chat_message_histories import ChatMessageHistory
 from modules.chat.base import BaseRAG
 from langchain_core.prompts import PromptTemplate
-from langchain.memory import (
-    ConversationBufferWindowMemory,
-    ConversationSummaryBufferMemory,
+from langchain.memory import ConversationBufferWindowMemory
+from langchain_core.runnables.utils import ConfigurableFieldSpec
+from .utils import (
+    CustomConversationalRetrievalChain,
+    create_history_aware_retriever,
+    create_stuff_documents_chain,
+    create_retrieval_chain,
+    return_questions,
+    CustomRunnableWithHistory,
+    BaseChatMessageHistory,
+    InMemoryHistory,
 )

-import chainlit as cl
-from langchain_community.chat_models import ChatOpenAI
-

 class Langchain_RAG_V1(BaseRAG):
-
     def __init__(
         self,
         llm,
@@ -95,8 +99,8 @@ class QuestionGenerator:
     def __init__(self):
         pass

-    def generate_questions(self, query, response, chat_history, context):
-        questions = return_questions(query, response, chat_history, context)
+    def generate_questions(self, query, response, chat_history, context, config):
+        questions = return_questions(query, response, chat_history, context, config)
         return questions

@@ -199,7 +203,7 @@ class Langchain_RAG_V2(BaseRAG):
                     is_shared=True,
                 ),
             ],
-        )
+        ).with_config(run_name="Langchain_RAG_V2")

         if callbacks is not None:
             self.rag_chain = self.rag_chain.with_config(callbacks=callbacks)
code/modules/chat/langchain/utils.py CHANGED
@@ -1,56 +1,31 @@
 from typing import Any, Dict, List, Union, Tuple, Optional
-from langchain_core.messages import (
-    BaseMessage,
-    AIMessage,
-    FunctionMessage,
-    HumanMessage,
-)
-
 from langchain_core.prompts.base import BasePromptTemplate, format_document
-from langchain_core.prompts.chat import MessagesPlaceholder
 from langchain_core.output_parsers import StrOutputParser
 from langchain_core.output_parsers.base import BaseOutputParser
 from langchain_core.retrievers import BaseRetriever, RetrieverOutput
 from langchain_core.language_models import LanguageModelLike
 from langchain_core.runnables import Runnable, RunnableBranch, RunnablePassthrough
 from langchain_core.runnables.history import RunnableWithMessageHistory
-from langchain_core.runnables.utils import ConfigurableFieldSpec
 from langchain_core.chat_history import BaseChatMessageHistory
 from langchain_core.pydantic_v1 import BaseModel, Field
 from langchain.chains.combine_documents.base import (
     DEFAULT_DOCUMENT_PROMPT,
     DEFAULT_DOCUMENT_SEPARATOR,
     DOCUMENTS_KEY,
-    BaseCombineDocumentsChain,
     _validate_prompt,
 )
-from langchain.chains.llm import LLMChain
-from langchain_core.callbacks import Callbacks
-from langchain_core.documents import Document
-
-
-CHAT_TURN_TYPE = Union[Tuple[str, str], BaseMessage]
-
 from langchain_core.runnables.config import RunnableConfig
-from langchain_core.messages import BaseMessage
-
-
-from langchain_core.output_parsers import StrOutputParser
 from langchain_core.prompts import ChatPromptTemplate
 from langchain_community.chat_models import ChatOpenAI
-
-from langchain.chains import RetrievalQA, ConversationalRetrievalChain
-from langchain_core.callbacks.manager import AsyncCallbackManagerForChainRun
-
-from typing import Any, Callable, Dict, List, Optional, Tuple, Type, Union
+from langchain.chains import ConversationalRetrievalChain
 from langchain_core.callbacks.manager import AsyncCallbackManagerForChainRun
 import inspect
-from langchain.chains.conversational_retrieval.base import _get_chat_history
 from langchain_core.messages import BaseMessage

+CHAT_TURN_TYPE = Union[Tuple[str, str], BaseMessage]

-class CustomConversationalRetrievalChain(ConversationalRetrievalChain):

+class CustomConversationalRetrievalChain(ConversationalRetrievalChain):
     def _get_chat_history(self, chat_history: List[CHAT_TURN_TYPE]) -> str:
         _ROLE_MAP = {"human": "Student: ", "ai": "AI Tutor: "}
         buffer = ""
@@ -163,7 +138,6 @@ class CustomConversationalRetrievalChain(ConversationalRetrievalChain):


 class CustomRunnableWithHistory(RunnableWithMessageHistory):
-
     def _get_chat_history(self, chat_history: List[CHAT_TURN_TYPE]) -> str:
         _ROLE_MAP = {"human": "Student: ", "ai": "AI Tutor: "}
         buffer = ""
@@ -304,8 +278,8 @@ def create_retrieval_chain(
     return retrieval_chain


-def return_questions(query, response, chat_history_str, context):
-
+# TODO: Remove Hard-coded values
+async def return_questions(query, response, chat_history_str, context, config):
     system = (
         "You are someone that suggests a question based on the student's input and chat history. "
         "Generate a question that is relevant to the student's input and chat history. "
@@ -322,18 +296,22 @@
     prompt = ChatPromptTemplate.from_messages(
         [
             ("system", system),
-            ("human", "{chat_history_str}, {context}, {query}, {response}"),
+            # ("human", "{chat_history_str}, {context}, {query}, {response}"),
         ]
     )
    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
    question_generator = prompt | llm | StrOutputParser()
-   new_questions = question_generator.invoke(
+   question_generator = question_generator.with_config(
+       run_name="follow_up_question_generator"
+   )
+   new_questions = await question_generator.ainvoke(
        {
            "chat_history_str": chat_history_str,
            "context": context,
            "query": query,
            "response": response,
-       }
+       },
+       config=config,
    )

    list_of_questions = new_questions.split("...")
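Reviewer note: `return_questions` is now a coroutine, so any direct caller has to await it (the tutor does this via `QuestionGenerator.generate_questions`, which `main` awaits). A minimal usage sketch with placeholder arguments — this hits the OpenAI API, so it assumes `OPENAI_API_KEY` is set:

    import asyncio

    async def demo():
        questions = await return_questions(
            query="What is backpropagation?",
            response="Backpropagation computes gradients ...",
            chat_history_str="",
            context="",
            config={"callbacks": None},
        )
        print(questions)

    asyncio.run(demo())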
code/modules/chat/llm_tutor.py CHANGED
@@ -3,7 +3,6 @@ from modules.chat.chat_model_loader import ChatModelLoader
 from modules.vectorstore.store_manager import VectorStoreManager
 from modules.retriever.retriever import Retriever
 from modules.chat.langchain.langchain_rag import (
-    Langchain_RAG_V1,
     Langchain_RAG_V2,
     QuestionGenerator,
 )
@@ -28,9 +27,11 @@ class LLMTutor:
         self.rephrase_prompt = get_prompt(
             config, "rephrase"
         )  # Initialize rephrase_prompt
-        if self.config["vectorstore"]["embedd_files"]:
-            self.vector_db.create_database()
-            self.vector_db.save_database()
+
+        # TODO: Removed this functionality for now, don't know if we need it
+        # if self.config["vectorstore"]["embedd_files"]:
+        #     self.vector_db.create_database()
+        #     self.vector_db.save_database()

     def update_llm(self, old_config, new_config):
         """
@@ -48,9 +49,11 @@ class LLMTutor:
         self.vector_db = VectorStoreManager(
             self.config, logger=self.logger
         ).load_database()  # Reinitialize VectorStoreManager if vectorstore changes
-        if self.config["vectorstore"]["embedd_files"]:
-            self.vector_db.create_database()
-            self.vector_db.save_database()
+
+        # TODO: Removed this functionality for now, don't know if we need it
+        # if self.config["vectorstore"]["embedd_files"]:
+        #     self.vector_db.create_database()
+        #     self.vector_db.save_database()

         if "llm_params.llm_style" in changes:
             self.qa_prompt = get_prompt(
code/modules/chat_processor/helpers.py ADDED
@@ -0,0 +1,245 @@
+import os
+from literalai import AsyncLiteralClient
+from datetime import datetime, timedelta, timezone
+from modules.config.constants import COOLDOWN_TIME, TOKENS_LEFT, REGEN_TIME
+from typing_extensions import TypedDict
+import tiktoken
+from typing import Any, Generic, List, Literal, Optional, TypeVar, Union
+
+Field = TypeVar("Field")
+Operators = TypeVar("Operators")
+Value = TypeVar("Value")
+
+BOOLEAN_OPERATORS = Literal["is", "nis"]
+STRING_OPERATORS = Literal["eq", "neq", "ilike", "nilike"]
+NUMBER_OPERATORS = Literal["eq", "neq", "gt", "gte", "lt", "lte"]
+STRING_LIST_OPERATORS = Literal["in", "nin"]
+DATETIME_OPERATORS = Literal["gte", "lte", "gt", "lt"]
+
+OPERATORS = Union[
+    BOOLEAN_OPERATORS,
+    STRING_OPERATORS,
+    NUMBER_OPERATORS,
+    STRING_LIST_OPERATORS,
+    DATETIME_OPERATORS,
+]
+
+
+class Filter(Generic[Field], TypedDict, total=False):
+    field: Field
+    operator: OPERATORS
+    value: Any
+    path: Optional[str]
+
+
+class OrderBy(Generic[Field], TypedDict):
+    column: Field
+    direction: Literal["ASC", "DESC"]
+
+
+threads_filterable_fields = Literal[
+    "id",
+    "createdAt",
+    "name",
+    "stepType",
+    "stepName",
+    "stepOutput",
+    "metadata",
+    "tokenCount",
+    "tags",
+    "participantId",
+    "participantIdentifiers",
+    "scoreValue",
+    "duration",
+]
+threads_orderable_fields = Literal["createdAt", "tokenCount"]
+threads_filters = List[Filter[threads_filterable_fields]]
+threads_order_by = OrderBy[threads_orderable_fields]
+
+steps_filterable_fields = Literal[
+    "id",
+    "name",
+    "input",
+    "output",
+    "participantIdentifier",
+    "startTime",
+    "endTime",
+    "metadata",
+    "parentId",
+    "threadId",
+    "error",
+    "tags",
+]
+steps_orderable_fields = Literal["createdAt"]
+steps_filters = List[Filter[steps_filterable_fields]]
+steps_order_by = OrderBy[steps_orderable_fields]
+
+users_filterable_fields = Literal[
+    "id",
+    "createdAt",
+    "identifier",
+    "lastEngaged",
+    "threadCount",
+    "tokenCount",
+    "metadata",
+]
+users_filters = List[Filter[users_filterable_fields]]
+
+scores_filterable_fields = Literal[
+    "id",
+    "createdAt",
+    "participant",
+    "name",
+    "tags",
+    "value",
+    "type",
+    "comment",
+]
+scores_orderable_fields = Literal["createdAt"]
+scores_filters = List[Filter[scores_filterable_fields]]
+scores_order_by = OrderBy[scores_orderable_fields]
+
+generation_filterable_fields = Literal[
+    "id",
+    "createdAt",
+    "model",
+    "duration",
+    "promptLineage",
+    "promptVersion",
+    "tags",
+    "score",
+    "participant",
+    "tokenCount",
+    "error",
+]
+generation_orderable_fields = Literal[
+    "createdAt",
+    "tokenCount",
+    "model",
+    "provider",
+    "participant",
+    "duration",
+]
+generations_filters = List[Filter[generation_filterable_fields]]
+generations_order_by = OrderBy[generation_orderable_fields]
+
+literal_client = AsyncLiteralClient(api_key=os.getenv("LITERAL_API_KEY_LOGGING"))
+
+
+# For consistency, use dictionary for user_info
+def convert_to_dict(user_info):
+    # if already a dictionary, return as is
+    if isinstance(user_info, dict):
+        return user_info
+    if hasattr(user_info, "__dict__"):
+        user_info = user_info.__dict__
+    return user_info
+
+
+def get_time():
+    return datetime.now(timezone.utc).isoformat()
+
+
+async def get_user_details(user_email_id):
+    user_info = await literal_client.api.get_or_create_user(identifier=user_email_id)
+    return user_info
+
+
+async def update_user_info(user_info):
+    # if object type, convert to dictionary
+    user_info = convert_to_dict(user_info)
+    await literal_client.api.update_user(
+        id=user_info["id"],
+        identifier=user_info["identifier"],
+        metadata=user_info["metadata"],
+    )
+
+
+async def check_user_cooldown(user_info, current_time):
+    # # Check if no tokens left
+    tokens_left = user_info.metadata.get("tokens_left", 0)
+    if tokens_left > 0 and not user_info.metadata.get("in_cooldown", False):
+        return False, None
+
+    user_info = convert_to_dict(user_info)
+    last_message_time_str = user_info["metadata"].get("last_message_time")
+
+    # Convert from ISO format string to datetime object and ensure UTC timezone
+    last_message_time = datetime.fromisoformat(last_message_time_str).replace(
+        tzinfo=timezone.utc
+    )
+    current_time = datetime.fromisoformat(current_time).replace(tzinfo=timezone.utc)
+
+    # Calculate the elapsed time
+    elapsed_time = current_time - last_message_time
+    elapsed_time_in_seconds = elapsed_time.total_seconds()
+
+    # Calculate when the cooldown period ends
+    cooldown_end_time = last_message_time + timedelta(seconds=COOLDOWN_TIME)
+    cooldown_end_time_iso = cooldown_end_time.isoformat()
+
+    # Debug: Print the cooldown end time
+    print(f"Cooldown end time (ISO): {cooldown_end_time_iso}")
+
+    # Check if the user is still in cooldown
+    if elapsed_time_in_seconds < COOLDOWN_TIME:
+        return True, cooldown_end_time_iso  # Return in ISO 8601 format
+
+    user_info["metadata"]["in_cooldown"] = False
+    # If not in cooldown, regenerate tokens
+    await reset_tokens_for_user(user_info)
+
+    return False, None
+
+
+async def reset_tokens_for_user(user_info):
+    user_info = convert_to_dict(user_info)
+    last_message_time_str = user_info["metadata"].get("last_message_time")
+
+    last_message_time = datetime.fromisoformat(last_message_time_str).replace(
+        tzinfo=timezone.utc
+    )
+    current_time = datetime.fromisoformat(get_time()).replace(tzinfo=timezone.utc)
+
+    # Calculate the elapsed time since the last message
+    elapsed_time_in_seconds = (current_time - last_message_time).total_seconds()
+
+    # Current token count (can be negative)
+    current_tokens = user_info["metadata"].get("tokens_left_at_last_message", 0)
+    current_tokens = min(current_tokens, TOKENS_LEFT)
+
+    # Maximum tokens that can be regenerated
+    max_tokens = user_info["metadata"].get("max_tokens", TOKENS_LEFT)
+
+    # Calculate how many tokens should have been regenerated proportionally
+    if current_tokens < max_tokens:
+        # Calculate the regeneration rate per second based on REGEN_TIME for full regeneration
+        regeneration_rate_per_second = max_tokens / REGEN_TIME
+
+        # Calculate how many tokens should have been regenerated based on the elapsed time
+        tokens_to_regenerate = int(
+            elapsed_time_in_seconds * regeneration_rate_per_second
+        )
+
+        # Ensure the new token count does not exceed max_tokens
+        new_token_count = min(current_tokens + tokens_to_regenerate, max_tokens)
+
+        print(
+            f"\n\n Adding {tokens_to_regenerate} tokens to the user, Time elapsed: {elapsed_time_in_seconds} seconds, Tokens after regeneration: {new_token_count}, Tokens before: {current_tokens} \n\n"
+        )
+
+        # Update the user's token count
+        user_info["metadata"]["tokens_left"] = new_token_count
+
+        await update_user_info(user_info)
+
+
+async def get_thread_step_info(thread_id):
+    step = await literal_client.api.get_step(thread_id)
+    return step
+
+
+def get_num_tokens(text, model):
+    encoding = tiktoken.encoding_for_model(model)
+    tokens = encoding.encode(text)
+    return len(tokens)
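Reviewer note: regeneration is linear in elapsed time. With the defaults in `constants.py` (`TOKENS_LEFT = 2000`, `REGEN_TIME = 180`) the rate is 2000 / 180 ≈ 11.1 tokens per second, so a fully drained budget refills in three minutes. A worked sketch of the same formula:

    max_tokens = 2000  # TOKENS_LEFT
    regen_time = 180   # REGEN_TIME: seconds for a full refill

    rate = max_tokens / regen_time  # ~11.1 tokens per second
    for elapsed in (30, 90, 180, 600):
        current = 0  # tokens_left_at_last_message
        new_count = min(current + int(elapsed * rate), max_tokens)
        print(f"{elapsed:>4}s elapsed -> {new_count} tokens")
    # 30s -> 333, 90s -> 1000, 180s -> 2000, 600s -> 2000 (capped)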
code/modules/chat_processor/literal_ai.py CHANGED
@@ -1,44 +1,7 @@
-from chainlit.data import ChainlitDataLayer, queue_until_user_message
+from chainlit.data import ChainlitDataLayer


 # update custom methods here (Ref: https://github.com/Chainlit/chainlit/blob/4b533cd53173bcc24abe4341a7108f0070d60099/backend/chainlit/data/__init__.py)
 class CustomLiteralDataLayer(ChainlitDataLayer):
     def __init__(self, **kwargs):
         super().__init__(**kwargs)
-
-    @queue_until_user_message()
-    async def create_step(self, step_dict: "StepDict"):
-        metadata = dict(
-            step_dict.get("metadata", {}),
-            **{
-                "waitForAnswer": step_dict.get("waitForAnswer"),
-                "language": step_dict.get("language"),
-                "showInput": step_dict.get("showInput"),
-            },
-        )
-
-        step: LiteralStepDict = {
-            "createdAt": step_dict.get("createdAt"),
-            "startTime": step_dict.get("start"),
-            "endTime": step_dict.get("end"),
-            "generation": step_dict.get("generation"),
-            "id": step_dict.get("id"),
-            "parentId": step_dict.get("parentId"),
-            "name": step_dict.get("name"),
-            "threadId": step_dict.get("threadId"),
-            "type": step_dict.get("type"),
-            "tags": step_dict.get("tags"),
-            "metadata": metadata,
-        }
-        if step_dict.get("input"):
-            step["input"] = {"content": step_dict.get("input")}
-        if step_dict.get("output"):
-            step["output"] = {"content": step_dict.get("output")}
-        if step_dict.get("isError"):
-            step["error"] = step_dict.get("output")
-
-        # print("\n\n\n")
-        # print("Step: ", step)
-        # print("\n\n\n")
-
-        await self.client.api.send_steps([step])
code/modules/config/config.yml CHANGED
@@ -4,7 +4,7 @@ device: 'cpu' # str [cuda, cpu]

 vectorstore:
   load_from_HF: True # bool
-  embedd_files: False # bool
+  reparse_files: True # bool
   data_path: '../storage/data' # str
   url_file_path: '../storage/data/urls.txt' # str
   expand_urls: True # bool
@@ -37,14 +37,14 @@ llm_params:
     temperature: 0.7 # float
     repo_id: 'TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF' # HuggingFace repo id
     filename: 'tinyllama-1.1b-chat-v1.0.Q5_0.gguf' # Specific name of gguf file in the repo
-    pdf_reader: 'pymupdf' # str [llama, pymupdf, gpt]
+    model_path: 'storage/models/tinyllama-1.1b-chat-v1.0.Q5_0.gguf' # Path to the model file
   stream: False # bool
   pdf_reader: 'gpt' # str [llama, pymupdf, gpt]

 chat_logging:
   log_chat: True # bool
   platform: 'literalai'
-  callbacks: False # bool
+  callbacks: True # bool

 splitter_options:
   use_splitter: True # bool
code/modules/config/constants.py CHANGED
@@ -3,6 +3,15 @@ import os

 load_dotenv()

+TIMEOUT = 60
+COOLDOWN_TIME = 60
+REGEN_TIME = 180
+TOKENS_LEFT = 2000
+ALL_TIME_TOKENS_ALLOCATED = 1000000
+
+GITHUB_REPO = "https://github.com/DL4DS/dl4ds_tutor"
+DOCS_WEBSITE = "https://dl4ds.github.io/dl4ds_tutor/"
+
 # API Keys - Loaded from the .env file

 OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
@@ -10,14 +19,16 @@ LLAMA_CLOUD_API_KEY = os.getenv("LLAMA_CLOUD_API_KEY")
 HUGGINGFACE_TOKEN = os.getenv("HUGGINGFACE_TOKEN")
 LITERAL_API_KEY_LOGGING = os.getenv("LITERAL_API_KEY_LOGGING")
 LITERAL_API_URL = os.getenv("LITERAL_API_URL")
+CHAINLIT_URL = os.getenv("CHAINLIT_URL")

 OAUTH_GOOGLE_CLIENT_ID = os.getenv("OAUTH_GOOGLE_CLIENT_ID")
 OAUTH_GOOGLE_CLIENT_SECRET = os.getenv("OAUTH_GOOGLE_CLIENT_SECRET")

-opening_message = f"Hey, What Can I Help You With?\n\nYou can me ask me questions about the course logistics, course content, about the final project, or anything else!"
+opening_message = "Hey, What Can I Help You With?\n\nYou can me ask me questions about the course logistics, course content, about the final project, or anything else!"
+chat_end_message = (
+    "I hope I was able to help you. If you have any more questions, feel free to ask!"
+)

 # Model Paths

 LLAMA_PATH = "../storage/models/tinyllama"
-
-RETRIEVER_HF_PATHS = {"RAGatouille": "XThomasBU/Colbert_Index"}
code/modules/config/project_config.yml ADDED
@@ -0,0 +1,7 @@
+retriever:
+  retriever_hf_paths:
+    RAGatouille: "XThomasBU/Colbert_Index"
+
+metadata:
+  metadata_links: ["https://dl4ds.github.io/sp2024/lectures/", "https://dl4ds.github.io/sp2024/schedule/"]
+  slide_base_link: "https://dl4ds.github.io"
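Reviewer note: this file is meant to be merged into the main config at load time — the `__main__` block of `data_loader.py` below does it with `config.update(project_config)`. A minimal sketch, assuming both files are read from `code/modules/config/`:

    import yaml

    with open("code/modules/config/config.yml") as f:
        config = yaml.safe_load(f)
    with open("code/modules/config/project_config.yml") as f:
        project_config = yaml.safe_load(f)

    # Shallow merge: the retriever and metadata sections are added
    # alongside the main config's existing keys.
    config.update(project_config)
    print(config["metadata"]["slide_base_link"])  # https://dl4ds.github.io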
code/modules/dataloader/data_loader.py CHANGED
@@ -3,40 +3,26 @@ import re
 import requests
 import pysrt
 from langchain_community.document_loaders import (
-    PyMuPDFLoader,
     Docx2txtLoader,
     YoutubeLoader,
-    WebBaseLoader,
     TextLoader,
 )
-from langchain_community.document_loaders import UnstructuredMarkdownLoader
-from llama_parse import LlamaParse
 from langchain.schema import Document
 import logging
 from langchain.text_splitter import RecursiveCharacterTextSplitter
 from langchain_experimental.text_splitter import SemanticChunker
 from langchain_openai.embeddings import OpenAIEmbeddings
-from ragatouille import RAGPretrainedModel
-from langchain.chains import LLMChain
-from langchain_community.llms import OpenAI
-from langchain import PromptTemplate
 import json
 from concurrent.futures import ThreadPoolExecutor
 from urllib.parse import urljoin
 import html2text
 import bs4
-import tempfile
 import PyPDF2
 from modules.dataloader.pdf_readers.base import PDFReader
 from modules.dataloader.pdf_readers.llama import LlamaParser
 from modules.dataloader.pdf_readers.gpt import GPTParser
-
-try:
-    from modules.dataloader.helpers import get_metadata, download_pdf_from_url
-    from modules.config.constants import OPENAI_API_KEY, LLAMA_CLOUD_API_KEY
-except:
-    from dataloader.helpers import get_metadata, download_pdf_from_url
-    from config.constants import OPENAI_API_KEY, LLAMA_CLOUD_API_KEY
+from modules.dataloader.helpers import get_metadata
+from modules.config.constants import TIMEOUT

 logger = logging.getLogger(__name__)
 BASE_DIR = os.getcwd()
@@ -47,7 +33,7 @@ class HTMLReader:
         pass

     def read_url(self, url):
-        response = requests.get(url)
+        response = requests.get(url, timeout=TIMEOUT)
         if response.status_code == 200:
             return response.text
         else:
@@ -65,11 +51,13 @@
                 href = href.replace("http", "https")

             absolute_url = urljoin(base_url, href)
-            link['href'] = absolute_url
+            link["href"] = absolute_url

-            resp = requests.head(absolute_url)
+            resp = requests.head(absolute_url, timeout=TIMEOUT)
             if resp.status_code != 200:
-                logger.warning(f"Link {absolute_url} is broken. Status code: {resp.status_code}")
+                logger.warning(
+                    f"Link {absolute_url} is broken. Status code: {resp.status_code}"
+                )

         return str(soup)
@@ -85,6 +73,7 @@
         else:
             return None

+
 class FileReader:
     def __init__(self, logger, kind):
         self.logger = logger
@@ -96,7 +85,9 @@
         else:
             self.pdf_reader = PDFReader()
         self.web_reader = HTMLReader()
-        self.logger.info(f"Initialized FileReader with {kind} PDF reader and HTML reader")
+        self.logger.info(
+            f"Initialized FileReader with {kind} PDF reader and HTML reader"
+        )

     def extract_text_from_pdf(self, pdf_path):
         text = ""
@@ -137,7 +128,7 @@
         return [Document(page_content=self.web_reader.read_html(url))]

     def read_tex_from_url(self, tex_url):
-        response = requests.get(tex_url)
+        response = requests.get(tex_url, timeout=TIMEOUT)
         if response.status_code == 200:
             return [Document(page_content=response.text)]
         else:
@@ -154,17 +145,20 @@ class ChunkProcessor:
         self.document_metadata = {}
         self.document_chunks_full = []

-        if not config['vectorstore']['embedd_files']:
+        # TODO: Fix when reparse_files is False
+        if not config["vectorstore"]["reparse_files"]:
             self.load_document_data()

         if config["splitter_options"]["use_splitter"]:
             if config["splitter_options"]["chunking_mode"] == "fixed":
                 if config["splitter_options"]["split_by_token"]:
-                    self.splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
-                        chunk_size=config["splitter_options"]["chunk_size"],
-                        chunk_overlap=config["splitter_options"]["chunk_overlap"],
-                        separators=config["splitter_options"]["chunk_separators"],
-                        disallowed_special=(),
+                    self.splitter = (
+                        RecursiveCharacterTextSplitter.from_tiktoken_encoder(
+                            chunk_size=config["splitter_options"]["chunk_size"],
+                            chunk_overlap=config["splitter_options"]["chunk_overlap"],
+                            separators=config["splitter_options"]["chunk_separators"],
+                            disallowed_special=(),
+                        )
                     )
                 else:
                     self.splitter = RecursiveCharacterTextSplitter(
@@ -175,8 +169,7 @@
                     )
             else:
                 self.splitter = SemanticChunker(
-                    OpenAIEmbeddings(),
-                    breakpoint_threshold_type="percentile"
+                    OpenAIEmbeddings(), breakpoint_threshold_type="percentile"
                 )

         else:
@@ -203,7 +196,10 @@
     ):
         # TODO: Clear up this pipeline of re-adding metadata
         documents = [Document(page_content=documents, source=source, page=page)]
-        if file_type == "pdf" and self.config["splitter_options"]["chunking_mode"] == "fixed":
+        if (
+            file_type == "pdf"
+            and self.config["splitter_options"]["chunking_mode"] == "fixed"
+        ):
            document_chunks = documents
        else:
            document_chunks = self.splitter.split_documents(documents)
@@ -226,9 +222,22 @@
    def chunk_docs(self, file_reader, uploaded_files, weblinks):
        addl_metadata = get_metadata(
-           "https://dl4ds.github.io/sp2024/lectures/",
-           "https://dl4ds.github.io/sp2024/schedule/",
+           *self.config["metadata"]["metadata_links"], self.config
        )  # For any additional metadata
+
+       # remove already processed files if reparse_files is False
+       if not self.config["vectorstore"]["reparse_files"]:
+           total_documents = len(uploaded_files) + len(weblinks)
+           uploaded_files = [
+               file_path
+               for file_path in uploaded_files
+               if file_path not in self.document_data
+           ]
+           weblinks = [link for link in weblinks if link not in self.document_data]
+           print(
+               f"Total documents to process: {total_documents}, Documents already processed: {total_documents - len(uploaded_files) - len(weblinks)}"
+           )
+
        with ThreadPoolExecutor() as executor:
            executor.map(
                self.process_file,
@@ -298,6 +307,7 @@
        self.document_metadata[file_path] = file_metadata

    def process_file(self, file_path, file_index, file_reader, addl_metadata):
+       print(f"Processing file {file_index + 1} : {file_path}")
        file_name = os.path.basename(file_path)

        file_type = file_name.split(".")[-1]
@@ -314,10 +324,12 @@
            return

        try:
-
            if file_path in self.document_data:
                self.logger.warning(f"File {file_name} already processed")
-               documents = [Document(page_content=content) for content in self.document_data[file_path].values()]
+               documents = [
+                   Document(page_content=content)
+                   for content in self.document_data[file_path].values()
+               ]
            else:
                documents = read_methods[file_type](file_path)
@@ -370,22 +382,31 @@
            json.dump(self.document_metadata, json_file, indent=4)

    def load_document_data(self):
-       with open(
-           f"{self.config['log_chunk_dir']}/docs/doc_content.json", "r"
-       ) as json_file:
-           self.document_data = json.load(json_file)
-       with open(
-           f"{self.config['log_chunk_dir']}/metadata/doc_metadata.json", "r"
-       ) as json_file:
-           self.document_metadata = json.load(json_file)
-       self.logger.info(
-           f"Loaded document content from {self.config['log_chunk_dir']}/docs/doc_content.json. Total documents: {len(self.document_data)}"
-       )
+       try:
+           with open(
+               f"{self.config['log_chunk_dir']}/docs/doc_content.json", "r"
+           ) as json_file:
+               self.document_data = json.load(json_file)
+           with open(
+               f"{self.config['log_chunk_dir']}/metadata/doc_metadata.json", "r"
+           ) as json_file:
+               self.document_metadata = json.load(json_file)
+           self.logger.info(
+               f"Loaded document content from {self.config['log_chunk_dir']}/docs/doc_content.json. Total documents: {len(self.document_data)}"
+           )
+       except FileNotFoundError:
+           self.logger.warning(
+               f"Document content not found in {self.config['log_chunk_dir']}/docs/doc_content.json"
+           )
+           self.document_data = {}
+           self.document_metadata = {}


 class DataLoader:
    def __init__(self, config, logger=None):
-       self.file_reader = FileReader(logger=logger, kind=config["llm_params"]["pdf_reader"])
+       self.file_reader = FileReader(
+           logger=logger, kind=config["llm_params"]["pdf_reader"]
+       )
        self.chunk_processor = ChunkProcessor(config, logger=logger)

    def get_chunks(self, uploaded_files, weblinks):
@@ -396,6 +417,15 @@

 if __name__ == "__main__":
    import yaml
+   import argparse
+
+   parser = argparse.ArgumentParser(description="Process some links.")
+   parser.add_argument(
+       "--links", nargs="+", required=True, help="List of links to process."
+   )
+
+   args = parser.parse_args()
+   links_to_process = args.links

    logger = logging.getLogger(__name__)
    logger.setLevel(logging.INFO)
@@ -403,19 +433,30 @@
    with open("../code/modules/config/config.yml", "r") as f:
        config = yaml.safe_load(f)

-   STORAGE_DIR = os.path.join(BASE_DIR, config['vectorstore']["data_path"])
+   with open("../code/modules/config/project_config.yml", "r") as f:
+       project_config = yaml.safe_load(f)
+
+   # Combine project config with the main config
+   config.update(project_config)
+
+   STORAGE_DIR = os.path.join(BASE_DIR, config["vectorstore"]["data_path"])
    uploaded_files = [
-       os.path.join(STORAGE_DIR, file) for file in os.listdir(STORAGE_DIR) if file != "urls.txt"
+       os.path.join(STORAGE_DIR, file)
+       for file in os.listdir(STORAGE_DIR)
+       if file != "urls.txt"
    ]

    data_loader = DataLoader(config, logger=logger)
-   document_chunks, document_names, documents, document_metadata = (
-       data_loader.get_chunks(
-           ["https://dl4ds.github.io/sp2024/static_files/lectures/05_loss_functions_v2.pdf"],
-           [],
-       )
+   # Just for testing
+   (
+       document_chunks,
+       document_names,
+       documents,
+       document_metadata,
+   ) = data_loader.get_chunks(
+       links_to_process,
+       [],
    )

    print(document_names[:5])
    print(len(document_chunks))
code/modules/dataloader/helpers.py CHANGED
@@ -2,6 +2,8 @@ import requests
 from bs4 import BeautifulSoup
 from urllib.parse import urlparse
 import tempfile
+from modules.config.constants import TIMEOUT
+

 def get_urls_from_file(file_path: str):
     """
@@ -19,18 +21,19 @@ def get_base_url(url):
     return base_url


-def get_metadata(lectures_url, schedule_url):
+### THIS FUNCTION IS NOT GENERALIZABLE.. IT IS SPECIFIC TO THE COURSE WEBSITE ###
+def get_metadata(lectures_url, schedule_url, config):
     """
     Function to get the lecture metadata from the lectures and schedule URLs.
     """
     lecture_metadata = {}

     # Get the main lectures page content
-    r_lectures = requests.get(lectures_url)
+    r_lectures = requests.get(lectures_url, timeout=TIMEOUT)
     soup_lectures = BeautifulSoup(r_lectures.text, "html.parser")

     # Get the main schedule page content
-    r_schedule = requests.get(schedule_url)
+    r_schedule = requests.get(schedule_url, timeout=TIMEOUT)
     soup_schedule = BeautifulSoup(r_schedule.text, "html.parser")

     # Find all lecture blocks
@@ -48,7 +51,9 @@ def get_metadata(lectures_url, schedule_url):
         slides_link_tag = description_div.find("a", title="Download slides")
         slides_link = slides_link_tag["href"].strip() if slides_link_tag else None
         slides_link = (
-            f"https://dl4ds.github.io{slides_link}" if slides_link else None
+            f"{config['metadata']['slide_base_link']}{slides_link}"
+            if slides_link
+            else None
         )
         if slides_link:
             date_mapping[slides_link] = date
@@ -68,7 +73,9 @@
         slides_link_tag = block.find("a", title="Download slides")
         slides_link = slides_link_tag["href"].strip() if slides_link_tag else None
         slides_link = (
-            f"https://dl4ds.github.io{slides_link}" if slides_link else None
+            f"{config['metadata']['slide_base_link']}{slides_link}"
+            if slides_link
+            else None
         )

         # Extract the link to the lecture recording
@@ -118,7 +125,7 @@ def download_pdf_from_url(pdf_url):
     Returns:
         str: The local file path of the downloaded PDF file.
     """
-    response = requests.get(pdf_url)
+    response = requests.get(pdf_url, timeout=TIMEOUT)
     if response.status_code == 200:
         with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as temp_file:
             temp_file.write(response.content)
code/modules/dataloader/pdf_readers/gpt.py CHANGED
@@ -6,6 +6,7 @@ from io import BytesIO
6
  from openai import OpenAI
7
  from pdf2image import convert_from_path
8
  from langchain.schema import Document
 
9
 
10
 
11
  class GPTParser:
@@ -19,9 +20,9 @@ class GPTParser:
19
  self.api_key = os.getenv("OPENAI_API_KEY")
20
  self.prompt = """
21
  The provided documents are images of PDFs of lecture slides of deep learning material.
22
- They contain LaTeX equations, images, and text.
23
  The goal is to extract the text, images and equations from the slides and convert everything to markdown format. Some of the equations may be complicated.
24
- The markdown should be clean and easy to read, and any math equation should be converted to LaTeX, between $$.
25
  For images, give a description and if you can, a source. Separate each page with '---'.
26
  Just respond with the markdown. Do not include page numbers or any other metadata. Do not try to provide titles. Strictly the content.
27
  """
@@ -31,36 +32,45 @@ class GPTParser:
31
 
32
  encoded_images = [self.encode_image(image) for image in images]
33
 
34
- chunks = [encoded_images[i:i + 5] for i in range(0, len(encoded_images), 5)]
35
 
36
  headers = {
37
  "Content-Type": "application/json",
38
- "Authorization": f"Bearer {self.api_key}"
39
  }
40
 
41
  output = ""
42
  for chunk_num, chunk in enumerate(chunks):
43
- content = [{"type": "image_url", "image_url": {
44
- "url": f"data:image/jpeg;base64,{image}"}} for image in chunk]
 
 
 
 
 
45
 
46
  content.insert(0, {"type": "text", "text": self.prompt})
47
 
48
  payload = {
49
  "model": "gpt-4o-mini",
50
- "messages": [
51
- {
52
- "role": "user",
53
- "content": content
54
- }
55
- ],
56
  }
57
 
58
  response = requests.post(
59
- "https://api.openai.com/v1/chat/completions", headers=headers, json=payload)
 
 
 
 
60
 
61
  resp = response.json()
62
 
63
- chunk_output = resp['choices'][0]['message']['content'].replace("```", "").replace("markdown", "").replace("````", "")
 
 
 
 
 
64
 
65
  output += chunk_output + "\n---\n"
66
 
@@ -68,14 +78,12 @@ class GPTParser:
68
  output = [doc for doc in output if doc.strip() != ""]
69
 
70
  documents = [
71
- Document(
72
- page_content=page,
73
- metadata={"source": pdf_path, "page": i}
74
- ) for i, page in enumerate(output)
75
  ]
76
  return documents
77
 
78
  def encode_image(self, image):
79
  buffered = BytesIO()
80
  image.save(buffered, format="JPEG")
81
- return base64.b64encode(buffered.getvalue()).decode('utf-8')
 
6
  from openai import OpenAI
7
  from pdf2image import convert_from_path
8
  from langchain.schema import Document
9
+ from modules.config.constants import TIMEOUT
10
 
11
 
12
  class GPTParser:
 
20
  self.api_key = os.getenv("OPENAI_API_KEY")
21
  self.prompt = """
22
  The provided documents are images of PDFs of lecture slides of deep learning material.
23
+ They contain LaTeX equations, images, and text.
24
  The goal is to extract the text, images and equations from the slides and convert everything to markdown format. Some of the equations may be complicated.
25
+ The markdown should be clean and easy to read, and any math equation should be converted to LaTeX, between $$.
26
  For images, give a description and if you can, a source. Separate each page with '---'.
27
  Just respond with the markdown. Do not include page numbers or any other metadata. Do not try to provide titles. Strictly the content.
28
  """
 
32
 
33
  encoded_images = [self.encode_image(image) for image in images]
34
 
35
+ chunks = [encoded_images[i : i + 5] for i in range(0, len(encoded_images), 5)]
36
 
37
  headers = {
38
  "Content-Type": "application/json",
39
+ "Authorization": f"Bearer {self.api_key}",
40
  }
41
 
42
  output = ""
43
  for chunk_num, chunk in enumerate(chunks):
44
+ content = [
45
+ {
46
+ "type": "image_url",
47
+ "image_url": {"url": f"data:image/jpeg;base64,{image}"},
48
+ }
49
+ for image in chunk
50
+ ]
51
 
52
  content.insert(0, {"type": "text", "text": self.prompt})
53
 
54
  payload = {
55
  "model": "gpt-4o-mini",
56
+ "messages": [{"role": "user", "content": content}],
 
 
 
 
 
57
  }
58
 
59
  response = requests.post(
60
+ "https://api.openai.com/v1/chat/completions",
61
+ headers=headers,
62
+ json=payload,
63
+ timeout=TIMEOUT,
64
+ )
65
 
66
  resp = response.json()
67
 
68
+ chunk_output = (
69
+ resp["choices"][0]["message"]["content"]
70
+ .replace("```", "")
71
+ .replace("markdown", "")
72
+ .replace("````", "")
73
+ )
74
 
75
  output += chunk_output + "\n---\n"
76
 
 
78
  output = [doc for doc in output if doc.strip() != ""]
79
 
80
  documents = [
81
+ Document(page_content=page, metadata={"source": pdf_path, "page": i})
82
+ for i, page in enumerate(output)
 
 
83
  ]
84
  return documents
85
 
86
  def encode_image(self, image):
87
  buffered = BytesIO()
88
  image.save(buffered, format="JPEG")
89
+ return base64.b64encode(buffered.getvalue()).decode("utf-8")
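
Note: the `TIMEOUT` constant imported at the top of this file lives in `modules/config/constants.py`, whose body is not part of this section. A minimal sketch of what it is assumed to look like (the value is hypothetical):

```python
# modules/config/constants.py (sketch -- the actual value is not shown here)
# Shared timeout, in seconds, for all outbound HTTP requests.
TIMEOUT = 60  # hypothetical value
```

The same constant is reused by `llama.py` and `webpage_crawler.py` below, so every `requests` call in the data loaders fails fast instead of hanging indefinitely.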
code/modules/dataloader/pdf_readers/llama.py CHANGED
@@ -2,19 +2,18 @@ import os
2
  import requests
3
  from llama_parse import LlamaParse
4
  from langchain.schema import Document
5
- from modules.config.constants import OPENAI_API_KEY, LLAMA_CLOUD_API_KEY
6
  from modules.dataloader.helpers import download_pdf_from_url
7
 
8
 
9
-
10
  class LlamaParser:
11
  def __init__(self):
12
  self.GPT_API_KEY = OPENAI_API_KEY
13
  self.LLAMA_CLOUD_API_KEY = LLAMA_CLOUD_API_KEY
14
  self.parse_url = "https://api.cloud.llamaindex.ai/api/parsing/upload"
15
  self.headers = {
16
- 'Accept': 'application/json',
17
- 'Authorization': f'Bearer {LLAMA_CLOUD_API_KEY}'
18
  }
19
  self.parser = LlamaParse(
20
  api_key=LLAMA_CLOUD_API_KEY,
@@ -23,7 +22,7 @@ class LlamaParser:
23
  language="en",
24
  gpt4o_mode=False,
25
  # gpt4o_api_key=OPENAI_API_KEY,
26
- parsing_instruction="The provided documents are PDFs of lecture slides of deep learning material. They contain LaTeX equations, images, and text. The goal is to extract the text, images and equations from the slides. The markdown should be clean and easy to read, and any math equation should be converted to LaTeX format, between $ signs. For images, if you can, give a description and a source."
27
  )
28
 
29
  def parse(self, pdf_path):
@@ -38,10 +37,8 @@ class LlamaParser:
38
  pages = [page.strip() for page in pages]
39
 
40
  documents = [
41
- Document(
42
- page_content=page,
43
- metadata={"source": pdf_path, "page": i}
44
- ) for i, page in enumerate(pages)
45
  ]
46
 
47
  return documents
@@ -53,20 +50,30 @@ class LlamaParser:
53
  }
54
 
55
  files = [
56
- ('file', ('file', requests.get(pdf_url).content, 'application/octet-stream'))
57
  ]
58
 
59
  response = requests.request(
60
- "POST", self.parse_url, headers=self.headers, data=payload, files=files)
 
61
 
62
- return response.json()['id'], response.json()['status']
63
 
64
  async def get_result(self, job_id):
65
- url = f"https://api.cloud.llamaindex.ai/api/parsing/job/{job_id}/result/markdown"
 
 
66
 
67
  response = requests.request("GET", url, headers=self.headers, data={})
68
 
69
- return response.json()['markdown']
70
 
71
  async def _parse(self, pdf_path):
72
  job_id, status = self.make_request(pdf_path)
@@ -78,15 +85,9 @@ class LlamaParser:
78
 
79
  result = await self.get_result(job_id)
80
 
81
- documents = [
82
- Document(
83
- page_content=result,
84
- metadata={"source": pdf_path}
85
- )
86
- ]
87
 
88
  return documents
89
 
90
- async def _parse(self, pdf_path):
91
- return await self._parse(pdf_path)
92
-
 
2
  import requests
3
  from llama_parse import LlamaParse
4
  from langchain.schema import Document
5
+ from modules.config.constants import OPENAI_API_KEY, LLAMA_CLOUD_API_KEY, TIMEOUT
6
  from modules.dataloader.helpers import download_pdf_from_url
7
 
8
 
 
9
  class LlamaParser:
10
  def __init__(self):
11
  self.GPT_API_KEY = OPENAI_API_KEY
12
  self.LLAMA_CLOUD_API_KEY = LLAMA_CLOUD_API_KEY
13
  self.parse_url = "https://api.cloud.llamaindex.ai/api/parsing/upload"
14
  self.headers = {
15
+ "Accept": "application/json",
16
+ "Authorization": f"Bearer {LLAMA_CLOUD_API_KEY}",
17
  }
18
  self.parser = LlamaParse(
19
  api_key=LLAMA_CLOUD_API_KEY,
 
22
  language="en",
23
  gpt4o_mode=False,
24
  # gpt4o_api_key=OPENAI_API_KEY,
25
+ parsing_instruction="The provided documents are PDFs of lecture slides of deep learning material. They contain LaTeX equations, images, and text. The goal is to extract the text, images and equations from the slides. The markdown should be clean and easy to read, and any math equation should be converted to LaTeX format, between $ signs. For images, if you can, give a description and a source.",
26
  )
27
 
28
  def parse(self, pdf_path):
 
37
  pages = [page.strip() for page in pages]
38
 
39
  documents = [
40
+ Document(page_content=page, metadata={"source": pdf_path, "page": i})
41
+ for i, page in enumerate(pages)
 
 
42
  ]
43
 
44
  return documents
 
50
  }
51
 
52
  files = [
53
+ (
54
+ "file",
55
+ (
56
+ "file",
57
+ requests.get(pdf_url, timeout=TIMEOUT).content,
58
+ "application/octet-stream",
59
+ ),
60
+ )
61
  ]
62
 
63
  response = requests.request(
64
+ "POST", self.parse_url, headers=self.headers, data=payload, files=files
65
+ )
66
 
67
+ return response.json()["id"], response.json()["status"]
68
 
69
  async def get_result(self, job_id):
70
+ url = (
71
+ f"https://api.cloud.llamaindex.ai/api/parsing/job/{job_id}/result/markdown"
72
+ )
73
 
74
  response = requests.request("GET", url, headers=self.headers, data={})
75
 
76
+ return response.json()["markdown"]
77
 
78
  async def _parse(self, pdf_path):
79
  job_id, status = self.make_request(pdf_path)
 
85
 
86
  result = await self.get_result(job_id)
87
 
88
+ documents = [Document(page_content=result, metadata={"source": pdf_path})]
89
 
90
  return documents
91
 
92
+ # async def _parse(self, pdf_path):
93
+ # return await self._parse(pdf_path)
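
For reference, a minimal usage sketch of the parser above; the file path is hypothetical, and the synchronous `parse(pdf_path)` entry point is the one shown in the hunk above:

```python
# Hypothetical usage -- the path is illustrative, not taken from the repo
parser = LlamaParser()
documents = parser.parse("storage/data/lecture01.pdf")
for doc in documents:
    print(doc.metadata["source"], len(doc.page_content))
```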
 
code/modules/dataloader/webpage_crawler.py CHANGED
@@ -3,7 +3,9 @@ from aiohttp import ClientSession
3
  import asyncio
4
  import requests
5
  from bs4 import BeautifulSoup
6
- from urllib.parse import urlparse, urljoin, urldefrag
 
 
7
 
8
  class WebpageCrawler:
9
  def __init__(self):
@@ -18,7 +20,7 @@ class WebpageCrawler:
18
 
19
  def url_exists(self, url: str) -> bool:
20
  try:
21
- response = requests.head(url)
22
  return response.status_code == 200
23
  except requests.ConnectionError:
24
  return False
@@ -88,7 +90,7 @@ class WebpageCrawler:
88
 
89
  def is_webpage(self, url: str) -> bool:
90
  try:
91
- response = requests.head(url, allow_redirects=True)
92
  content_type = response.headers.get("Content-Type", "").lower()
93
  return "text/html" in content_type
94
  except requests.RequestException:
 
3
  import asyncio
4
  import requests
5
  from bs4 import BeautifulSoup
6
+ from urllib.parse import urljoin, urldefrag
7
+ from modules.config.constants import TIMEOUT
8
+
9
 
10
  class WebpageCrawler:
11
  def __init__(self):
 
20
 
21
  def url_exists(self, url: str) -> bool:
22
  try:
23
+ response = requests.head(url, timeout=TIMEOUT)
24
  return response.status_code == 200
25
  except requests.ConnectionError:
26
  return False
 
90
 
91
  def is_webpage(self, url: str) -> bool:
92
  try:
93
+ response = requests.head(url, allow_redirects=True, timeout=TIMEOUT)
94
  content_type = response.headers.get("Content-Type", "").lower()
95
  return "text/html" in content_type
96
  except requests.RequestException:
code/modules/retriever/helpers.py CHANGED
@@ -6,7 +6,6 @@ from typing import List
6
 
7
 
8
  class VectorStoreRetrieverScore(VectorStoreRetriever):
9
-
10
  # See https://github.com/langchain-ai/langchain/blob/61dd92f8215daef3d9cf1734b0d1f8c70c1571c3/libs/langchain/langchain/vectorstores/base.py#L500
11
  def _get_relevant_documents(
12
  self, query: str, *, run_manager: CallbackManagerForRetrieverRun
 
6
 
7
 
8
  class VectorStoreRetrieverScore(VectorStoreRetriever):
 
9
  # See https://github.com/langchain-ai/langchain/blob/61dd92f8215daef3d9cf1734b0d1f8c70c1571c3/libs/langchain/langchain/vectorstores/base.py#L500
10
  def _get_relevant_documents(
11
  self, query: str, *, run_manager: CallbackManagerForRetrieverRun
code/modules/vectorstore/colbert.py CHANGED
@@ -1,9 +1,9 @@
1
  from ragatouille import RAGPretrainedModel
2
  from modules.vectorstore.base import VectorStoreBase
3
  from langchain_core.retrievers import BaseRetriever
4
- from langchain_core.callbacks.manager import CallbackManagerForRetrieverRun, Callbacks
5
  from langchain_core.documents import Document
6
- from typing import Any, List, Optional, Sequence
7
  import os
8
  import json
9
 
@@ -85,6 +85,7 @@ class ColbertVectorStore(VectorStoreBase):
85
  document_ids=document_names,
86
  document_metadatas=document_metadata,
87
  )
 
88
  self.colbert.set_document_count(len(document_names))
89
 
90
  def load_database(self):
 
1
  from ragatouille import RAGPretrainedModel
2
  from modules.vectorstore.base import VectorStoreBase
3
  from langchain_core.retrievers import BaseRetriever
4
+ from langchain_core.callbacks.manager import CallbackManagerForRetrieverRun
5
  from langchain_core.documents import Document
6
+ from typing import Any, List
7
  import os
8
  import json
9
 
 
85
  document_ids=document_names,
86
  document_metadatas=document_metadata,
87
  )
88
+ print(f"Index created at {index_path}")
89
  self.colbert.set_document_count(len(document_names))
90
 
91
  def load_database(self):
code/modules/vectorstore/embedding_model_loader.py CHANGED
@@ -1,9 +1,6 @@
1
  from langchain_community.embeddings import OpenAIEmbeddings
2
  from langchain_community.embeddings import HuggingFaceEmbeddings
3
- from langchain_community.embeddings import LlamaCppEmbeddings
4
-
5
- from modules.config.constants import *
6
- import os
7
 
8
 
9
  class EmbeddingModelLoader:
@@ -28,8 +25,5 @@ class EmbeddingModelLoader:
28
  "trust_remote_code": True,
29
  },
30
  )
31
- # embedding_model = LlamaCppEmbeddings(
32
- # model_path=os.path.abspath("storage/llama-7b.ggmlv3.q4_0.bin")
33
- # )
34
 
35
  return embedding_model
 
1
  from langchain_community.embeddings import OpenAIEmbeddings
2
  from langchain_community.embeddings import HuggingFaceEmbeddings
3
+ from modules.config.constants import OPENAI_API_KEY, HUGGINGFACE_TOKEN
4
 
5
 
6
  class EmbeddingModelLoader:
 
25
  "trust_remote_code": True,
26
  },
27
  )
28
 
29
  return embedding_model
code/modules/vectorstore/faiss.py CHANGED
@@ -14,10 +14,15 @@ class FaissVectorStore(VectorStoreBase):
14
  def __init__(self, config):
15
  self.config = config
16
  self._init_vector_db()
17
- self.local_path = os.path.join(self.config["vectorstore"]["db_path"],
18
- "db_" + self.config["vectorstore"]["db_option"]
19
- + "_" + self.config["vectorstore"]["model"]
20
- + "_" + config["splitter_options"]["chunking_mode"])
 
 
 
 
 
21
 
22
  def _init_vector_db(self):
23
  self.faiss = FAISS(
@@ -28,9 +33,7 @@ class FaissVectorStore(VectorStoreBase):
28
  self.vectorstore = self.faiss.from_documents(
29
  documents=document_chunks, embedding=embedding_model
30
  )
31
- self.vectorstore.save_local(
32
- self.local_path
33
- )
34
 
35
  def load_database(self, embedding_model):
36
  self.vectorstore = self.faiss.load_local(
 
14
  def __init__(self, config):
15
  self.config = config
16
  self._init_vector_db()
17
+ self.local_path = os.path.join(
18
+ self.config["vectorstore"]["db_path"],
19
+ "db_"
20
+ + self.config["vectorstore"]["db_option"]
21
+ + "_"
22
+ + self.config["vectorstore"]["model"]
23
+ + "_"
24
+ + config["splitter_options"]["chunking_mode"],
25
+ )
26
 
27
  def _init_vector_db(self):
28
  self.faiss = FAISS(
 
33
  self.vectorstore = self.faiss.from_documents(
34
  documents=document_chunks, embedding=embedding_model
35
  )
36
+ self.vectorstore.save_local(self.local_path)
 
 
37
 
38
  def load_database(self, embedding_model):
39
  self.vectorstore = self.faiss.load_local(
code/modules/vectorstore/raptor.py CHANGED
@@ -317,13 +317,10 @@ class RAPTORVectoreStore(VectorStoreBase):
317
  print(f"--Generated {len(all_clusters)} clusters--")
318
 
319
  # Summarization
320
- template = """Here is content from the course DS598: Deep Learning for Data Science.
321
-
322
  The content may be from a webpage about the course, or lecture content, or any other relevant information.
323
  If the content is in bullet points (from pdf lecture slides), you can summarize the bullet points.
324
-
325
  Give a detailed summary of the content below.
326
-
327
  Documentation:
328
  {context}
329
  """
 
317
  print(f"--Generated {len(all_clusters)} clusters--")
318
 
319
  # Summarization
320
+ template = """Here is content from the course DS598: Deep Learning for Data Science.
 
321
  The content may be from a webpage about the course, or lecture content, or any other relevant information.
322
  If the content is in bullet points (from pdf lecture slides), you can summarize the bullet points.
 
323
  Give a detailed summary of the content below.
 
324
  Documentation:
325
  {context}
326
  """
code/modules/vectorstore/store_manager.py CHANGED
@@ -1,9 +1,7 @@
1
  from modules.vectorstore.vectorstore import VectorStore
2
- from modules.vectorstore.helpers import *
3
  from modules.dataloader.webpage_crawler import WebpageCrawler
4
  from modules.dataloader.data_loader import DataLoader
5
- from modules.dataloader.helpers import *
6
- from modules.config.constants import RETRIEVER_HF_PATHS
7
  from modules.vectorstore.embedding_model_loader import EmbeddingModelLoader
8
  import logging
9
  import os
@@ -49,7 +47,6 @@ class VectorStoreManager:
49
  return logger
50
 
51
  def load_files(self):
52
-
53
  files = os.listdir(self.config["vectorstore"]["data_path"])
54
  files = [
55
  os.path.join(self.config["vectorstore"]["data_path"], file)
@@ -71,7 +68,6 @@ class VectorStoreManager:
71
  return files, urls
72
 
73
  def create_embedding_model(self):
74
-
75
  self.logger.info("Creating embedding function")
76
  embedding_model_loader = EmbeddingModelLoader(self.config)
77
  embedding_model = embedding_model_loader.load_embedding_model()
@@ -102,7 +98,6 @@ class VectorStoreManager:
102
  )
103
 
104
  def create_database(self):
105
-
106
  start_time = time.time() # Start time for creating database
107
  data_loader = DataLoader(self.config, self.logger)
108
  self.logger.info("Loading data")
@@ -112,12 +107,15 @@ class VectorStoreManager:
112
  self.logger.info(f"Number of webpages: {len(webpages)}")
113
  if f"{self.config['vectorstore']['url_file_path']}" in files:
114
  files.remove(f"{self.config['vectorstore']['url_file_path']}") # cleanup
115
- document_chunks, document_names, documents, document_metadata = (
116
- data_loader.get_chunks(files, webpages)
117
- )
118
  num_documents = len(document_chunks)
119
  self.logger.info(f"Number of documents in the DB: {num_documents}")
120
- metadata_keys = list(document_metadata[0].keys())
121
  self.logger.info(f"Metadata keys: {metadata_keys}")
122
  self.logger.info("Completed loading data")
123
  self.initialize_database(
@@ -130,7 +128,6 @@ class VectorStoreManager:
130
  )
131
 
132
  def load_database(self):
133
-
134
  start_time = time.time() # Start time for loading database
135
  if self.config["vectorstore"]["db_option"] in ["FAISS", "Chroma", "RAPTOR"]:
136
  self.embedding_model = self.create_embedding_model()
@@ -170,13 +167,23 @@ if __name__ == "__main__":
170
 
171
  with open("modules/config/config.yml", "r") as f:
172
  config = yaml.safe_load(f)
173
  print(config)
174
  print(f"Trying to create database with config: {config}")
175
  vector_db = VectorStoreManager(config)
176
  if config["vectorstore"]["load_from_HF"]:
177
- if config["vectorstore"]["db_option"] in RETRIEVER_HF_PATHS:
 
 
 
178
  vector_db.load_from_HF(
179
- HF_PATH=RETRIEVER_HF_PATHS[config["vectorstore"]["db_option"]]
 
 
180
  )
181
  else:
182
  # print(f"HF_PATH not available for {config['vectorstore']['db_option']}")
@@ -189,7 +196,7 @@ if __name__ == "__main__":
189
  vector_db.create_database()
190
  print("Created database")
191
 
192
- print(f"Trying to load the database")
193
  vector_db = VectorStoreManager(config)
194
  vector_db.load_database()
195
  print("Loaded database")
 
1
  from modules.vectorstore.vectorstore import VectorStore
2
+ from modules.dataloader.helpers import get_urls_from_file
3
  from modules.dataloader.webpage_crawler import WebpageCrawler
4
  from modules.dataloader.data_loader import DataLoader
 
 
5
  from modules.vectorstore.embedding_model_loader import EmbeddingModelLoader
6
  import logging
7
  import os
 
47
  return logger
48
 
49
  def load_files(self):
 
50
  files = os.listdir(self.config["vectorstore"]["data_path"])
51
  files = [
52
  os.path.join(self.config["vectorstore"]["data_path"], file)
 
68
  return files, urls
69
 
70
  def create_embedding_model(self):
 
71
  self.logger.info("Creating embedding function")
72
  embedding_model_loader = EmbeddingModelLoader(self.config)
73
  embedding_model = embedding_model_loader.load_embedding_model()
 
98
  )
99
 
100
  def create_database(self):
 
101
  start_time = time.time() # Start time for creating database
102
  data_loader = DataLoader(self.config, self.logger)
103
  self.logger.info("Loading data")
 
107
  self.logger.info(f"Number of webpages: {len(webpages)}")
108
  if f"{self.config['vectorstore']['url_file_path']}" in files:
109
  files.remove(f"{self.config['vectorstore']['url_file_path']}") # cleanup
110
+ (
111
+ document_chunks,
112
+ document_names,
113
+ documents,
114
+ document_metadata,
115
+ ) = data_loader.get_chunks(files, webpages)
116
  num_documents = len(document_chunks)
117
  self.logger.info(f"Number of documents in the DB: {num_documents}")
118
+ metadata_keys = list(document_metadata[0].keys()) if document_metadata else []
119
  self.logger.info(f"Metadata keys: {metadata_keys}")
120
  self.logger.info("Completed loading data")
121
  self.initialize_database(
 
128
  )
129
 
130
  def load_database(self):
 
131
  start_time = time.time() # Start time for loading database
132
  if self.config["vectorstore"]["db_option"] in ["FAISS", "Chroma", "RAPTOR"]:
133
  self.embedding_model = self.create_embedding_model()
 
167
 
168
  with open("modules/config/config.yml", "r") as f:
169
  config = yaml.safe_load(f)
170
+ with open("modules/config/project_config.yml", "r") as f:
171
+ project_config = yaml.safe_load(f)
172
+
173
+ # combine the two configs
174
+ config.update(project_config)
175
  print(config)
176
  print(f"Trying to create database with config: {config}")
177
  vector_db = VectorStoreManager(config)
178
  if config["vectorstore"]["load_from_HF"]:
179
+ if (
180
+ config["vectorstore"]["db_option"]
181
+ in config["retriever"]["retriever_hf_paths"]
182
+ ):
183
  vector_db.load_from_HF(
184
+ HF_PATH=config["retriever"]["retriever_hf_paths"][
185
+ config["vectorstore"]["db_option"]
186
+ ]
187
  )
188
  else:
189
  # print(f"HF_PATH not available for {config['vectorstore']['db_option']}")
 
196
  vector_db.create_database()
197
  print("Created database")
198
 
199
+ print("Trying to load the database")
200
  vector_db = VectorStoreManager(config)
201
  vector_db.load_database()
202
  print("Loaded database")
code/public/avatars/{ai-tutor.png → ai_tutor.png} RENAMED
File without changes
code/public/space.jpg ADDED

Git LFS Details

  • SHA256: 9ed3f8e7fd9790c394bae59cd0e315742af862ed833e9f42906f36f140abbb07
  • Pointer size: 132 Bytes
  • Size of remote file: 2.68 MB
code/public/test.css CHANGED
@@ -13,10 +13,6 @@ a[href*='https://github.com/Chainlit/chainlit'] {
13
  border-radius: 50%; /* Maintain circular shape */
14
  }
15
 
16
- /* Hide the default image */
17
- .MuiAvatar-root.MuiAvatar-circular.css-m2icte .MuiAvatar-img.css-1hy9t21 {
18
- display: none;
19
- }
20
 
21
  .MuiAvatar-root.MuiAvatar-circular.css-v72an7 {
22
  background-image: url('/public/avatars/ai-tutor.png'); /* Replace with your custom image URL */
@@ -26,18 +22,3 @@ a[href*='https://github.com/Chainlit/chainlit'] {
26
  height: 40px; /* Ensure the dimensions match the original */
27
  border-radius: 50%; /* Maintain circular shape */
28
  }
29
-
30
- /* Hide the default image */
31
- .MuiAvatar-root.MuiAvatar-circular.css-v72an7 .MuiAvatar-img.css-1hy9t21 {
32
- display: none;
33
- }
34
-
35
- /* Hide the new chat button
36
- #new-chat-button {
37
- display: none;
38
- } */
39
-
40
- /* Hide the open sidebar button
41
- #open-sidebar-button {
42
- display: none;
43
- } */
 
13
  border-radius: 50%; /* Maintain circular shape */
14
  }
15
 
16
 
17
  .MuiAvatar-root.MuiAvatar-circular.css-v72an7 {
18
  background-image: url('/public/avatars/ai-tutor.png'); /* Replace with your custom image URL */
 
22
  height: 40px; /* Ensure the dimensions match the original */
23
  border-radius: 50%; /* Maintain circular shape */
24
  }
code/templates/cooldown.html ADDED
@@ -0,0 +1,181 @@
1
+ <!DOCTYPE html>
2
+ <html lang="en">
3
+ <head>
4
+ <meta charset="UTF-8">
5
+ <meta name="viewport" content="width=device-width, initial-scale=1.0">
6
+ <title>Cooldown Period | Terrier Tutor</title>
7
+ <style>
8
+ @import url('https://fonts.googleapis.com/css2?family=Inter:wght@400;600&display=swap');
9
+
10
+ body, html {
11
+ margin: 0;
12
+ padding: 0;
13
+ font-family: 'Inter', sans-serif;
14
+ background-color: #f7f7f7;
15
+ background-image: url('https://www.transparenttextures.com/patterns/cubes.png');
16
+ background-repeat: repeat;
17
+ display: flex;
18
+ align-items: center;
19
+ justify-content: center;
20
+ height: 100vh;
21
+ color: #333;
22
+ }
23
+
24
+ .container {
25
+ background: rgba(255, 255, 255, 0.9);
26
+ border: 1px solid #ddd;
27
+ border-radius: 8px;
28
+ width: 100%;
29
+ max-width: 400px;
30
+ padding: 50px;
31
+ box-sizing: border-box;
32
+ text-align: center;
33
+ box-shadow: 0 4px 15px rgba(0, 0, 0, 0.1);
34
+ backdrop-filter: blur(10px);
35
+ -webkit-backdrop-filter: blur(10px);
36
+ }
37
+
38
+ .avatar {
39
+ width: 90px;
40
+ height: 90px;
41
+ border-radius: 50%;
42
+ margin-bottom: 25px;
43
+ border: 2px solid #ddd;
44
+ }
45
+
46
+ .container h1 {
47
+ margin-bottom: 15px;
48
+ font-size: 24px;
49
+ font-weight: 600;
50
+ color: #1a1a1a;
51
+ }
52
+
53
+ .container p {
54
+ font-size: 16px;
55
+ color: #4a4a4a;
56
+ margin-bottom: 30px;
57
+ line-height: 1.5;
58
+ }
59
+
60
+ .cooldown-message {
61
+ font-size: 16px;
62
+ color: #333;
63
+ margin-bottom: 30px;
64
+ }
65
+
66
+ .tokens-left {
67
+ font-size: 14px;
68
+ color: #333;
69
+ margin-bottom: 30px;
70
+ font-weight: 600;
71
+ }
72
+
73
+ .button {
74
+ padding: 12px 0;
75
+ margin: 12px 0;
76
+ font-size: 14px;
77
+ border-radius: 6px;
78
+ cursor: pointer;
79
+ width: 100%;
80
+ border: 1px solid #4285F4;
81
+ background-color: #fff;
82
+ color: #4285F4;
83
+ transition: background-color 0.3s ease, border-color 0.3s ease;
84
+ display: none;
85
+ }
86
+
87
+ .button.start-tutor {
88
+ display: none;
89
+ }
90
+
91
+ .button:hover {
92
+ background-color: #e0e0e0;
93
+ border-color: #357ae8;
94
+ }
95
+
96
+ .sign-out-button {
97
+ border: 1px solid #FF4C4C;
98
+ background-color: #fff;
99
+ color: #FF4C4C;
100
+ display: block;
101
+ }
102
+
103
+ .sign-out-button:hover {
104
+ background-color: #ffe6e6;
105
+ border-color: #e04343;
106
+ color: #e04343;
107
+ }
108
+
109
+ #countdown {
110
+ font-size: 14px;
111
+ color: #555;
112
+ margin-bottom: 20px;
113
+ }
114
+
115
+ .footer {
116
+ font-size: 12px;
117
+ color: #777;
118
+ margin-top: 20px;
119
+ }
120
+ </style>
121
+ </head>
122
+ <body>
123
+ <div class="container">
124
+ <img src="/public/avatars/ai_tutor.png" alt="AI Tutor Avatar" class="avatar">
125
+ <h1>Hello, {{ username }}</h1>
126
+ <p>It seems like you need to wait a bit before starting a new session.</p>
127
+ <p class="cooldown-message">Time remaining until the cooldown period ends:</p>
128
+ <p id="countdown"></p>
129
+ <p class="tokens-left">Tokens Left: <span id="tokensLeft">{{ tokens_left }}</span></p>
130
+ <button id="startTutorBtn" class="button start-tutor" onclick="startTutor()">Start AI Tutor</button>
131
+ <form action="/logout" method="get">
132
+ <button type="submit" class="button sign-out-button">Sign Out</button>
133
+ </form>
134
+ <div class="footer">Reload the page to update token stats</div>
135
+ </div>
136
+ <script>
137
+ function startCountdown(endTime) {
138
+ const countdownElement = document.getElementById('countdown');
139
+ const startTutorBtn = document.getElementById('startTutorBtn');
140
+ const endTimeDate = new Date(endTime);
141
+
142
+ function updateCountdown() {
143
+ const now = new Date();
144
+ const timeLeft = endTimeDate.getTime() - now.getTime();
145
+
146
+ if (timeLeft <= 0) {
147
+ countdownElement.textContent = "Cooldown period has ended.";
148
+ startTutorBtn.style.display = "block";
149
+ } else {
150
+ const hours = Math.floor(timeLeft / 1000 / 60 / 60);
151
+ const minutes = Math.floor((timeLeft / 1000 / 60) % 60);
152
+ const seconds = Math.floor((timeLeft / 1000) % 60);
153
+ countdownElement.textContent = `${hours}h ${minutes}m ${seconds}s`;
154
+ }
155
+ }
156
+
157
+ updateCountdown();
158
+ setInterval(updateCountdown, 1000);
159
+ }
160
+
161
+ function startTutor() {
162
+ window.location.href = "/start-tutor";
163
+ }
164
+
165
+ function updateTokensLeft() {
166
+ fetch('/get-tokens-left')
167
+ .then(response => response.json())
168
+ .then(data => {
169
+ document.getElementById('tokensLeft').textContent = data.tokens_left;
170
+ })
171
+ .catch(error => console.error('Error fetching tokens:', error));
172
+ }
173
+
174
+ // Start the countdown
175
+ startCountdown("{{ cooldown_end_time }}");
176
+
177
+ // Update tokens left when the page loads
178
+ updateTokensLeft();
179
+ </script>
180
+ </body>
181
+ </html>
code/templates/dashboard.html ADDED
@@ -0,0 +1,145 @@
1
+ <!DOCTYPE html>
2
+ <html lang="en">
3
+ <head>
4
+ <meta charset="UTF-8">
5
+ <meta name="viewport" content="width=device-width, initial-scale=1.0">
6
+ <title>Dashboard | Terrier Tutor</title>
7
+ <style>
8
+ @import url('https://fonts.googleapis.com/css2?family=Inter:wght@400;600&display=swap');
9
+
10
+ body, html {
11
+ margin: 0;
12
+ padding: 0;
13
+ font-family: 'Inter', sans-serif;
14
+ background-color: #f7f7f7; /* Light gray background */
15
+ background-image: url('https://www.transparenttextures.com/patterns/cubes.png'); /* Subtle geometric pattern */
16
+ background-repeat: repeat;
17
+ display: flex;
18
+ align-items: center;
19
+ justify-content: center;
20
+ height: 100vh;
21
+ color: #333;
22
+ }
23
+
24
+ .container {
25
+ background: rgba(255, 255, 255, 0.9);
26
+ border: 1px solid #ddd;
27
+ border-radius: 8px;
28
+ width: 100%;
29
+ max-width: 400px;
30
+ padding: 40px;
31
+ box-sizing: border-box;
32
+ text-align: center;
33
+ box-shadow: 0 4px 15px rgba(0, 0, 0, 0.1);
34
+ backdrop-filter: blur(10px);
35
+ -webkit-backdrop-filter: blur(10px);
36
+ }
37
+
38
+ .avatar {
39
+ width: 90px;
40
+ height: 90px;
41
+ border-radius: 50%;
42
+ margin-bottom: 20px;
43
+ border: 2px solid #ddd;
44
+ }
45
+
46
+ .container h1 {
47
+ margin-bottom: 20px;
48
+ font-size: 26px;
49
+ font-weight: 600;
50
+ color: #1a1a1a;
51
+ }
52
+
53
+ .container p {
54
+ font-size: 15px;
55
+ color: #4a4a4a;
56
+ margin-bottom: 25px;
57
+ line-height: 1.5;
58
+ }
59
+
60
+ .tokens-left {
61
+ font-size: 17px;
62
+ color: #333;
63
+ margin-bottom: 10px;
64
+ font-weight: 600;
65
+ }
66
+
67
+ .all-time-tokens {
68
+ font-size: 14px; /* Reduced font size */
69
+ color: #555;
70
+ margin-bottom: 30px;
71
+ font-weight: 500;
72
+ white-space: nowrap; /* Prevents breaking to a new line */
73
+ }
74
+
75
+ .button {
76
+ padding: 12px 0;
77
+ margin: 12px 0;
78
+ font-size: 15px;
79
+ border-radius: 6px;
80
+ cursor: pointer;
81
+ width: 100%;
82
+ border: 1px solid #4285F4; /* Button border color */
83
+ background-color: #fff; /* Button background color */
84
+ color: #4285F4; /* Button text color */
85
+ transition: background-color 0.3s ease, border-color 0.3s ease;
86
+ }
87
+
88
+ .button:hover {
89
+ background-color: #e0e0e0;
90
+ border-color: #357ae8; /* Darker blue for hover */
91
+ }
92
+
93
+ .start-button {
94
+ border: 1px solid #4285F4;
95
+ color: #4285F4;
96
+ background-color: #fff;
97
+ }
98
+
99
+ .start-button:hover {
100
+ background-color: #e0f0ff; /* Light blue on hover */
101
+ border-color: #357ae8; /* Darker blue for hover */
102
+ color: #357ae8; /* Blue text on hover */
103
+ }
104
+
105
+ .sign-out-button {
106
+ border: 1px solid #FF4C4C;
107
+ background-color: #fff;
108
+ color: #FF4C4C;
109
+ }
110
+
111
+ .sign-out-button:hover {
112
+ background-color: #ffe6e6; /* Light red on hover */
113
+ border-color: #e04343; /* Darker red for hover */
114
+ color: #e04343; /* Red text on hover */
115
+ }
116
+
117
+ .footer {
118
+ font-size: 12px;
119
+ color: #777;
120
+ margin-top: 25px;
121
+ }
122
+ </style>
123
+ </head>
124
+ <body>
125
+ <div class="container">
126
+ <img src="/public/avatars/ai_tutor.png" alt="AI Tutor Avatar" class="avatar">
127
+ <h1>Welcome, {{ username }}</h1>
128
+ <p>Ready to start your AI tutoring session?</p>
129
+ <p class="tokens-left">Tokens Left: {{ tokens_left }}</p>
130
+ <p class="all-time-tokens">All-Time Tokens Allocated: {{ all_time_tokens_allocated }} / {{ total_tokens_allocated }}</p>
131
+ <form action="/start-tutor" method="post">
132
+ <button type="submit" class="button start-button">Start AI Tutor</button>
133
+ </form>
134
+ <form action="/logout" method="get">
135
+ <button type="submit" class="button sign-out-button">Sign Out</button>
136
+ </form>
137
+ <div class="footer">Reload the page to update token stats</div>
138
+ </div>
139
+ <script>
140
+ let token = "{{ jwt_token }}";
141
+ console.log("Token: ", token);
142
+ localStorage.setItem('token', token);
143
+ </script>
144
+ </body>
145
+ </html>
code/templates/error.html ADDED
@@ -0,0 +1,95 @@
1
+ <!DOCTYPE html>
2
+ <html lang="en">
3
+ <head>
4
+ <meta charset="UTF-8">
5
+ <meta name="viewport" content="width=device-width, initial-scale=1.0">
6
+ <title>Error | Terrier Tutor</title>
7
+ <style>
8
+ @import url('https://fonts.googleapis.com/css2?family=Inter:wght@400;600&display=swap');
9
+
10
+ body, html {
11
+ margin: 0;
12
+ padding: 0;
13
+ font-family: 'Inter', sans-serif;
14
+ background-color: #f7f7f7; /* Light gray background */
15
+ background-image: url('https://www.transparenttextures.com/patterns/cubes.png'); /* Subtle geometric pattern */
16
+ background-repeat: repeat;
17
+ display: flex;
18
+ align-items: center;
19
+ justify-content: center;
20
+ height: 100vh;
21
+ color: #333;
22
+ }
23
+
24
+ .container {
25
+ background: rgba(255, 255, 255, 0.9);
26
+ border: 1px solid #ddd;
27
+ border-radius: 8px;
28
+ width: 100%;
29
+ max-width: 400px;
30
+ padding: 50px;
31
+ box-sizing: border-box;
32
+ text-align: center;
33
+ box-shadow: 0 4px 15px rgba(0, 0, 0, 0.1);
34
+ backdrop-filter: blur(10px);
35
+ -webkit-backdrop-filter: blur(10px);
36
+ }
37
+
38
+ .container h1 {
39
+ margin-bottom: 20px;
40
+ font-size: 26px;
41
+ font-weight: 600;
42
+ color: #1a1a1a;
43
+ }
44
+
45
+ .container p {
46
+ font-size: 18px;
47
+ color: #4a4a4a;
48
+ margin-bottom: 35px;
49
+ line-height: 1.5;
50
+ }
51
+
52
+ .button {
53
+ padding: 14px 0;
54
+ margin: 12px 0;
55
+ font-size: 16px;
56
+ border-radius: 6px;
57
+ cursor: pointer;
58
+ width: 100%;
59
+ border: 1px solid #ccc;
60
+ background-color: #007BFF;
61
+ color: #fff;
62
+ transition: background-color 0.3s ease, border-color 0.3s ease;
63
+ }
64
+
65
+ .button:hover {
66
+ background-color: #0056b3;
67
+ border-color: #0056b3;
68
+ }
69
+
70
+ .error-box {
71
+ background-color: #2d2d2d;
72
+ color: #fff;
73
+ padding: 10px;
74
+ margin-top: 20px;
75
+ font-family: 'Courier New', Courier, monospace;
76
+ text-align: left;
77
+ overflow-x: auto;
78
+ white-space: pre-wrap;
79
+ border-radius: 5px;
80
+ }
81
+ </style>
82
+ </head>
83
+ <body>
84
+ <div class="container">
85
+ <h1>Oops! Something went wrong...</h1>
86
+ <p>An unexpected error occurred. The details are below:</p>
87
+ <div class="error-box">
88
+ <code>{{ error }}</code>
89
+ </div>
90
+ <form action="/" method="get">
91
+ <button type="submit" class="button">Return to Home</button>
92
+ </form>
93
+ </div>
94
+ </body>
95
+ </html>
code/templates/error_404.html ADDED
@@ -0,0 +1,80 @@
1
+ <!DOCTYPE html>
2
+ <html lang="en">
3
+ <head>
4
+ <meta charset="UTF-8">
5
+ <meta name="viewport" content="width=device-width, initial-scale=1.0">
6
+ <title>404 - Not Found</title>
7
+ <style>
8
+ @import url('https://fonts.googleapis.com/css2?family=Inter:wght@400;600&display=swap');
9
+
10
+ body, html {
11
+ margin: 0;
12
+ padding: 0;
13
+ font-family: 'Inter', sans-serif;
14
+ background-color: #f7f7f7; /* Light gray background */
15
+ background-image: url('https://www.transparenttextures.com/patterns/cubes.png'); /* Subtle geometric pattern */
16
+ background-repeat: repeat;
17
+ display: flex;
18
+ align-items: center;
19
+ justify-content: center;
20
+ height: 100vh;
21
+ color: #333;
22
+ }
23
+
24
+ .container {
25
+ background: rgba(255, 255, 255, 0.9);
26
+ border: 1px solid #ddd;
27
+ border-radius: 8px;
28
+ width: 100%;
29
+ max-width: 400px;
30
+ padding: 50px;
31
+ box-sizing: border-box;
32
+ text-align: center;
33
+ box-shadow: 0 4px 15px rgba(0, 0, 0, 0.1);
34
+ backdrop-filter: blur(10px);
35
+ -webkit-backdrop-filter: blur(10px);
36
+ }
37
+
38
+ .container h1 {
39
+ margin-bottom: 20px;
40
+ font-size: 26px;
41
+ font-weight: 600;
42
+ color: #1a1a1a;
43
+ }
44
+
45
+ .container p {
46
+ font-size: 18px;
47
+ color: #4a4a4a;
48
+ margin-bottom: 35px;
49
+ line-height: 1.5;
50
+ }
51
+
52
+ .button {
53
+ padding: 14px 0;
54
+ margin: 12px 0;
55
+ font-size: 16px;
56
+ border-radius: 6px;
57
+ cursor: pointer;
58
+ width: 100%;
59
+ border: 1px solid #ccc;
60
+ background-color: #007BFF;
61
+ color: #fff;
62
+ transition: background-color 0.3s ease, border-color 0.3s ease;
63
+ }
64
+
65
+ .button:hover {
66
+ background-color: #0056b3;
67
+ border-color: #0056b3;
68
+ }
69
+ </style>
70
+ </head>
71
+ <body>
72
+ <div class="container">
73
+ <h1>You have ventured into the abyss...</h1>
74
+ <p>To get back to reality, click the button below.</p>
75
+ <form action="/" method="get">
76
+ <button type="submit" class="button">Return to Home</button>
77
+ </form>
78
+ </div>
79
+ </body>
80
+ </html>
code/templates/login.html ADDED
@@ -0,0 +1,132 @@
1
+ <!DOCTYPE html>
2
+ <html lang="en">
3
+ <head>
4
+ <meta charset="UTF-8">
5
+ <meta name="viewport" content="width=device-width, initial-scale=1.0">
6
+ <title>Login | Terrier Tutor</title>
7
+ <style>
8
+ @import url('https://fonts.googleapis.com/css2?family=Inter:wght@400;600&display=swap');
9
+
10
+ body, html {
11
+ margin: 0;
12
+ padding: 0;
13
+ font-family: 'Inter', sans-serif;
14
+ background-color: #f7f7f7; /* Light gray background */
15
+ background-image: url('https://www.transparenttextures.com/patterns/cubes.png'); /* Subtle geometric pattern */
16
+ background-repeat: repeat;
17
+ display: flex;
18
+ align-items: center;
19
+ justify-content: center;
20
+ height: 100vh;
21
+ color: #333;
22
+ }
23
+
24
+ .container {
25
+ background: rgba(255, 255, 255, 0.9);
26
+ border: 1px solid #ddd;
27
+ border-radius: 8px;
28
+ width: 100%;
29
+ max-width: 400px;
30
+ padding: 50px;
31
+ box-sizing: border-box;
32
+ text-align: center;
33
+ box-shadow: 0 4px 15px rgba(0, 0, 0, 0.1);
34
+ backdrop-filter: blur(10px);
35
+ -webkit-backdrop-filter: blur(10px);
36
+ }
37
+
38
+ .avatar {
39
+ width: 90px;
40
+ height: 90px;
41
+ border-radius: 50%;
42
+ margin-bottom: 25px;
43
+ border: 2px solid #ddd;
44
+ }
45
+
46
+ .container h1 {
47
+ margin-bottom: 15px;
48
+ font-size: 24px;
49
+ font-weight: 600;
50
+ color: #1a1a1a;
51
+ }
52
+
53
+ .container p {
54
+ font-size: 16px;
55
+ color: #4a4a4a;
56
+ margin-bottom: 30px;
57
+ line-height: 1.5;
58
+ }
59
+
60
+ .button {
61
+ padding: 12px 0;
62
+ margin: 12px 0;
63
+ font-size: 14px;
64
+ border-radius: 6px;
65
+ cursor: pointer;
66
+ width: 100%;
67
+ border: 1px solid #4285F4; /* Google button border color */
68
+ background-color: #fff; /* Guest button color */
69
+ color: #4285F4; /* Google button text color */
70
+ transition: background-color 0.3s ease, border-color 0.3s ease;
71
+ }
72
+
73
+ .button:hover {
74
+ background-color: #e0f0ff; /* Light blue on hover */
75
+ border-color: #357ae8; /* Darker blue for hover */
76
+ color: #357ae8; /* Blue text on hover */
77
+ }
78
+
79
+ .footer {
80
+ margin-top: 40px;
81
+ font-size: 15px;
82
+ color: #666;
83
+ text-align: center; /* Center the text in the footer */
84
+ }
85
+
86
+ .footer a {
87
+ color: #333;
88
+ text-decoration: none;
89
+ font-weight: 500;
90
+ display: inline-flex;
91
+ align-items: center;
92
+ justify-content: center; /* Center the content of the links */
93
+ transition: color 0.3s ease;
94
+ margin-bottom: 8px;
95
+ width: 100%; /* Make the link block level */
96
+ }
97
+
98
+ .footer a:hover {
99
+ color: #000;
100
+ }
101
+
102
+ .footer svg {
103
+ margin-right: 8px;
104
+ fill: currentColor;
105
+ }
106
+ </style>
107
+ </head>
108
+ <body>
109
+ <div class="container">
110
+ <img src="/public/avatars/ai_tutor.png" alt="AI Tutor Avatar" class="avatar">
111
+ <h1>Terrier Tutor</h1>
112
+ <p>Welcome to the DS598 AI Tutor. Please sign in to continue.</p>
113
+ <form action="/login/google" method="get">
114
+ <button type="submit" class="button">Sign in with Google</button>
115
+ </form>
116
+ <div class="footer">
117
+ <a href="{{ GITHUB_REPO }}" target="_blank">
118
+ <svg xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24">
119
+ <path d="M12 .5C5.596.5.5 5.596.5 12c0 5.098 3.292 9.414 7.852 10.94.574.105.775-.249.775-.553 0-.272-.01-1.008-.015-1.98-3.194.694-3.87-1.544-3.87-1.544-.521-1.324-1.273-1.676-1.273-1.676-1.04-.714.079-.7.079-.7 1.148.08 1.75 1.181 1.75 1.181 1.022 1.752 2.683 1.246 3.34.954.104-.74.4-1.246.73-1.533-2.551-.292-5.234-1.276-5.234-5.675 0-1.253.447-2.277 1.181-3.079-.12-.293-.51-1.47.113-3.063 0 0 .96-.307 3.15 1.174.913-.255 1.892-.383 2.867-.388.975.005 1.954.133 2.868.388 2.188-1.481 3.147-1.174 3.147-1.174.624 1.593.233 2.77.114 3.063.735.802 1.18 1.826 1.18 3.079 0 4.407-2.688 5.38-5.248 5.668.413.354.782 1.049.782 2.113 0 1.526-.014 2.757-.014 3.132 0 .307.198.662.783.553C20.21 21.411 23.5 17.096 23.5 12c0-6.404-5.096-11.5-11.5-11.5z"/>
120
+ </svg>
121
+ View on GitHub
122
+ </a>
123
+ <a href="{{ DOCS_WEBSITE }}" target="_blank">
124
+ <svg xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24">
125
+ <path d="M19 2H8c-1.103 0-2 .897-2 2v16c0 1.103.897 2 2 2h12c1.103 0 2-.897 2-2V7l-5-5zm0 2l.001 4H14V4h5zm-1 14H9V4h4v6h6v8zM7 4H6v16c0 1.654 1.346 3 3 3h9v-2H9c-.551 0-1-.449-1-1V4z"/>
126
+ </svg>
127
+ View Docs
128
+ </a>
129
+ </div>
130
+ </div>
131
+ </body>
132
+ </html>
code/templates/logout.html ADDED
@@ -0,0 +1,21 @@
1
+ <!DOCTYPE html>
2
+ <html>
3
+ <head>
4
+ <title>Logout</title>
5
+ <script>
6
+ window.onload = function() {
7
+ fetch('/chainlit_tutor/logout', {
8
+ method: 'POST',
9
+ credentials: 'include' // Ensure cookies are sent
10
+ }).then(() => {
11
+ window.location.href = '/';
12
+ }).catch(error => {
13
+ console.error('Logout failed:', error);
14
+ });
15
+ };
16
+ </script>
17
+ </head>
18
+ <body>
19
+ <p>Logging out... If you are not redirected, <a href="/">click here</a>.</p>
20
+ </body>
21
+ </html>
docs/README.md DELETED
@@ -1,51 +0,0 @@
1
- # Documentation
2
-
3
- ## File Structure:
4
- - `docs/` - Documentation files
5
- - `code/` - Code files
6
- - `storage/` - Storage files
7
- - `vectorstores/` - Vector Databases
8
- - `.env` - Environment Variables
9
- - `Dockerfile` - Dockerfile for Hugging Face
10
- - `.chainlit` - Chainlit Configuration
11
- - `chainlit.md` - Chainlit README
12
- - `README.md` - Repository README
13
- - `.gitignore` - Gitignore file
14
- - `requirements.txt` - Python Requirements
15
- - `.gitattributes` - Gitattributes file
16
-
17
- ## Code Structure
18
-
19
- - `code/main.py` - Main Chainlit App
20
- - `code/config.yaml` - Configuration File to set Embedding related, Vector Database related, and Chat Model related parameters.
21
- - `code/modules/vector_db.py` - Vector Database Creation
22
- - `code/modules/chat_model_loader.py` - Chat Model Loader (Creates the Chat Model)
23
- - `code/modules/constants.py` - Constants (Loads the Environment Variables, Prompts, Model Paths, etc.)
24
- - `code/modules/data_loader.py` - Loads and Chunks the Data
25
- - `code/modules/embedding_model.py` - Creates the Embedding Model to Embed the Data
26
- - `code/modules/llm_tutor.py` - Creates the RAG LLM Tutor
27
- - The Function `qa_bot()` loads the vector database and the chat model, and sets the prompt to pass to the chat model.
28
- - `code/modules/helpers.py` - Helper Functions
29
-
30
- ## Storage and Vectorstores
31
-
32
- - `storage/data/` - Data Storage (Put your pdf files under this directory, and urls in the urls.txt file)
33
- - `storage/models/` - Model Storage (Put your local LLMs under this directory)
34
-
35
- - `vectorstores/` - Vector Databases (Stores the Vector Databases generated from `code/modules/vector_db.py`)
36
-
37
-
38
- ## Useful Configurations
39
- set these in `code/config.yaml`:
40
- * ``["embedding_options"]["embedd_files"]`` - If set to True, embeds the files from the storage directory everytime you run the chainlit command. If set to False, uses the stored vector database.
41
- * ``["embedding_options"]["expand_urls"]`` - If set to True, gets and reads the data from all the links under the url provided. If set to False, only reads the data in the url provided.
42
- * ``["embedding_options"]["search_top_k"]`` - Number of sources that the retriever returns
43
- * ``["llm_params]["use_history"]`` - Whether to use history in the prompt or not
44
- * ``["llm_params]["memory_window"]`` - Number of interactions to keep a track of in the history
45
-
46
-
47
- ## LlamaCpp
48
- * https://python.langchain.com/docs/integrations/llms/llamacpp
49
-
50
- ## Hugging Face Models
51
- * Download the ``.gguf`` files for your Local LLM from Hugging Face (Example: https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF)
docs/contribute.md ADDED
@@ -0,0 +1,33 @@
1
+ 💡 **Please ensure formatting, linting, and security checks pass before submitting a pull request**
2
+
3
+ ## Code Formatting
4
+
5
+ The codebase is formatted using [black](https://github.com/psf/black)
6
+
7
+ To format the codebase, run the following command:
8
+
9
+ ```bash
10
+ black .
11
+ ```
12
+
13
+ Please ensure that the code is formatted before submitting a pull request.
14
+
15
+ ## Linting
16
+
17
+ The codebase is linted using [flake8](https://flake8.pycqa.org/en/latest/)
18
+
19
+ To view the linting errors, run the following command:
20
+
21
+ ```bash
22
+ flake8 .
23
+ ```
24
+
25
+ ## Security and Vulnerabilities
26
+
27
+ The codebase is scanned for security vulnerabilities using [bandit](https://github.com/PyCQA/bandit)
28
+
29
+ To scan the codebase for security vulnerabilities, run the following command:
30
+
31
+ ```bash
32
+ bandit -r .
33
+ ```
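
To run all three checks in one pass before opening a pull request (the same tools as above, chained):

```bash
black --check . && flake8 . && bandit -r .
```

`black --check` reports files that would be reformatted without modifying them.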
docs/setup.md ADDED
@@ -0,0 +1,127 @@
1
+ # Initial Setup
2
+
3
+ ⚠️ **Create the .env file inside the `code/` directory.**
4
+
5
+ ## Python Environment
6
+
7
+ Python Version: 3.11
8
+
9
+ Create a virtual environment and install the required packages:
10
+
11
+ ```bash
12
+ conda create -n ai_tutor python=3.11
13
+ conda activate ai_tutor
14
+ pip install -r requirements.txt
15
+ ```
16
+
17
+ ## Code Formatting
18
+
19
+ The codebase is formatted using [black](https://github.com/psf/black), and if making changes to the codebase, ensure that the code is formatted before submitting a pull request. More instructions can be found in `docs/contribute.md`.
20
+
21
+ ## Google OAuth 2.0 Client ID and Secret
22
+
23
+ To set up the Google OAuth 2.0 Client ID and Secret, follow these steps:
24
+
25
+ 1. Go to the [Google Cloud Console](https://console.cloud.google.com/apis/credentials).
26
+ 2. Create a new project or select an existing one.
27
+ 3. Navigate to the "Credentials" page.
28
+ 4. Click on "Create Credentials" and select "OAuth 2.0 Client ID".
29
+ 5. Configure the OAuth consent screen if you haven't already.
30
+ 6. Choose "Web application" as the application type.
31
+ 7. Configure the redirect URIs as needed.
32
+ 8. Copy the generated `Client ID` and `Client Secret`.
33
+
34
+ Set the following in the .env file (if running locally) or in secrets (if running on Hugging Face Spaces):
35
+
36
+ ```bash
37
+ OAUTH_GOOGLE_CLIENT_ID=<your_client_id>
38
+ OAUTH_GOOGLE_CLIENT_SECRET=<your_client_secret>
39
+ ```
40
+
41
+ ## Literal AI API Key
42
+
43
+ To obtain the Literal AI API key:
44
+
45
+ 1. Sign up or log in to [Literal AI](https://cloud.getliteral.ai/).
46
+ 2. Navigate to the API Keys section under your account settings.
47
+ 3. Create a new API key if necessary and copy it.
48
+
49
+ Set the following in the .env file (if running locally) or in secrets (if running on Hugging Face Spaces):
50
+
51
+ ```bash
52
+ LITERAL_API_KEY_LOGGING=<your_api_key>
53
+ LITERAL_API_URL=https://cloud.getliteral.ai
54
+ ```
55
+
56
+ ## LlamaCloud API Key
57
+
58
+ To obtain the LlamaCloud API Key:
59
+
60
+ 1. Go to [LlamaCloud](https://cloud.llamaindex.ai/).
61
+ 2. Sign up or log in to your account.
62
+ 3. Navigate to the API section and generate a new API key if necessary.
63
+
64
+ Set the following in the .env file (if running locally) or in secrets (if running on Hugging Face Spaces):
65
+
66
+ ```bash
67
+ LLAMA_CLOUD_API_KEY=<your_api_key>
68
+ ```
69
+
70
+ ## Hugging Face Access Token
71
+
72
+ To obtain your Hugging Face access token:
73
+
74
+ 1. Go to [Hugging Face settings](https://huggingface.co/settings/tokens).
75
+ 2. Log in or create an account.
76
+ 3. Generate a new token or use an existing one.
77
+
78
+ Set the following in the .env file (if running locally) or in secrets (if running on Hugging Face Spaces):
79
+
80
+ ```bash
81
+ HUGGINGFACE_TOKEN=<your-huggingface-token>
82
+ ```
83
+
84
+ ## Chainlit Authentication Secret
85
+
86
+ You must provide a JWT secret in the environment to use authentication. Run `chainlit create-secret` to generate one.
87
+
88
+ ```bash
89
+ chainlit create-secret
90
+ ```
91
+
92
+ Set the following in the .env file (if running locally) or in secrets (if running on Hugging Face Spaces):
93
+
94
+ ```bash
95
+ CHAINLIT_AUTH_SECRET=<your_jwt_secret>
96
+ CHAINLIT_URL=<your_chainlit_url> # Example: CHAINLIT_URL=http://localhost:8000
97
+ ```
98
+
99
+ ## OpenAI API Key
100
+
101
+ Set the following in the .env file (if running locally) or in secrets (if running on Hugging Face Spaces):
102
+
103
+ ```bash
104
+ OPENAI_API_KEY=<your_openai_api_key>
105
+ ```
106
+
107
+ ## In a Nutshell
108
+
109
+ Your .env file (secrets in HuggingFace) should look like this:
110
+
111
+ ```bash
112
+ CHAINLIT_AUTH_SECRET=<your_jwt_secret>
113
+ OPENAI_API_KEY=<your_openai_api_key>
114
+ HUGGINGFACE_TOKEN=<your-huggingface-token>
115
+ LITERAL_API_KEY_LOGGING=<your_api_key>
116
+ LITERAL_API_URL=<https://cloud.getliteral.ai>
117
+ OAUTH_GOOGLE_CLIENT_ID=<your_client_id>
118
+ OAUTH_GOOGLE_CLIENT_SECRET=<your_client_secret>
119
+ LLAMA_CLOUD_API_KEY=<your_api_key>
120
+ CHAINLIT_URL=<your_chainlit_url>
121
+ ```
122
+
123
+
124
+ # Configuration
125
+
126
+ The configuration file `code/modules/config/config.yml` contains the parameters that control the behaviour of your app.
127
+ The configuration file `code/modules/config/project_config.yml` contains project-specific parameters.
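
Both are plain YAML files that are merged at startup; a minimal sketch of the pattern, mirroring the `__main__` block of `code/modules/vectorstore/store_manager.py`:

```python
import yaml

# Load the app-wide config and the project-specific config, then merge them.
# dict.update() is a shallow merge, so top-level keys in project_config.yml
# override the same keys in config.yml.
with open("modules/config/config.yml", "r") as f:
    config = yaml.safe_load(f)
with open("modules/config/project_config.yml", "r") as f:
    project_config = yaml.safe_load(f)
config.update(project_config)
```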
pyproject.toml ADDED
@@ -0,0 +1,2 @@
1
+ [tool.black]
2
+ line-length = 88
requirements.txt CHANGED
@@ -22,4 +22,15 @@ umap-learn
22
  llama-cpp-python
23
  pymupdf
24
  websockets
25
- langchain-openai
22
  llama-cpp-python
23
  pymupdf
24
  websockets
25
+ langchain-openai
26
+ langchain-experimental
27
+ html2text
28
+ PyPDF2
29
+ pdf2image
30
+ black
31
+ flake8
32
+ bandit
33
+ fastapi
34
+ google-auth
35
+ google-auth-oauthlib
36
+ Jinja2