NyashaK committed
Commit 23fb72b · Parent: d649c85

adding logging

app.py CHANGED
@@ -18,8 +18,17 @@ async def set_starters():
         ),
 
         cl.Starter(
-            label="Project highlights",
-            message="List Ronald’s most impactful projects in bullet points. For each, include the project name, tools used, the problem it addressed, and the outcome. Use clear paragraphs or bullet points for readability.",
+            label="Summarize Key Projects",
+            message=f"""
+            List Ronald's top 3 most impactful projects. For each project, use the following format:\n
+            **Project Name**: [Name of the project]\n
+            **Objective**: [What was the goal?]\n
+            **My Role & Achievements**: [What did Ronald do and what was the result?]\n
+            **Technologies Used**: [List of tools and technologies]\n
+            **Source URL**: [Source URL]\n
+            **Demo URL**: [Demo URL, if available; otherwise skip]\n
+
+            """,
             icon="https://cdn-icons-png.flaticon.com/512/979/979585.png",
         ),
 
@@ -30,20 +39,29 @@ async def set_starters():
         ),
 
         cl.Starter(
-            label="Experience with data engineering",
-            message="Describe Ronald’s experience with data engineering. Include tools and platforms used, types of pipelines or systems built, and example projects. Use clear paragraphs or bullet points for readability.",
+            label="Experience with AI and Data",
+            message="Describe Ronald’s experience with data and AI. Include tools and platforms used, types of pipelines or systems built, and example projects. Use clear paragraphs or bullet points for readability.",
             icon="https://cdn-icons-png.flaticon.com/512/2674/2674696.png",
         ),
 
         cl.Starter(
             label="Explain a specific project",
-            message="Describe one of the best projects Ronald has worked on. Include the objective, approach, tools used, and the results. Format the answer in readable paragraphs or bullet points.",
+            message=f"""Describe one of the best projects Ronald has worked on. Cover the following points in your answer:\n
+            - **Objective**: What was the main goal of the project?\n
+            - **Architecture**: How was the system designed? (e.g., Kafka, Spark, DynamoDB)\n
+            - **My Achievements**: What specific parts did Ronald build or accomplish?\n
+            - **Outcome**: What was the final result or impact?""",
             icon="https://cdn-icons-png.flaticon.com/512/3756/3756550.png",
         ),
 
         cl.Starter(
             label="Certifications and education",
-            message="List Ronald’s academic background and certifications. For education, include the degree, institution, and year. For certifications, include the name, issuing organization, and year. Format the answer using bullet points or markdown.",
+            message=f"""List Ronald’s academic background and professional certifications. Use the following format:\n\n
+            ### Education\n
+            - **[Degree]**, [Institution] - Graduated [Year]\n\n
+            ### Certifications\n
+            - [Certification Name]\n
+            - [Certification Name]""",
             icon="https://cdn-icons-png.flaticon.com/512/2922/2922506.png",
         )
 
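For context, the hunk header shows these entries live inside `async def set_starters():`. The diff only shows the list items; a minimal sketch of the surrounding function, assuming the standard Chainlit `@cl.set_starters` hook (the hook and full prompt bodies are not shown in this diff):

```python
# Minimal sketch, assuming app.py registers the starters via Chainlit's
# @cl.set_starters hook. Each cl.Starter is rendered as a clickable suggestion on
# the welcome screen; its `message` is sent as the user's first prompt when clicked.
import chainlit as cl


@cl.set_starters
async def set_starters():
    return [
        cl.Starter(
            label="Summarize Key Projects",
            message="List Ronald's top 3 most impactful projects ...",  # full prompt as in the diff above
            icon="https://cdn-icons-png.flaticon.com/512/979/979585.png",
        ),
        # ... the remaining starters from the diff above ...
    ]
```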
data/README - Health.md DELETED
@@ -1,38 +0,0 @@
1
- # Health Trends in Southern Africa: A 2013-2020 Overview
2
-
3
- This project visualizes key health indicators in Southern Africa between 2013 and 2020, leveraging data from the World Bank. The primary focus is on life expectancy, infant mortality rates, maternal mortality ratios, and HIV prevalence across selected countries: Zimbabwe, Botswana, Mozambique, and South Africa.
4
-
5
- ## Data Source
6
- The dataset used in this analysis is obtained from the **World Bank API** (`API_Download_DS2_en_csv_v2_60220.csv`). For more details on the data, visit the [World Bank Data website](https://data.worldbank.org).
7
-
8
- ## Libraries Used
9
- The following R libraries were utilized for data manipulation and visualization:
10
- - `ggplot2`: For creating rich and customizable visualizations.
11
- - `tidyr` and `dplyr`: For data wrangling and transformations.
12
- - `gridExtra`: To arrange multiple plots on a grid.
13
- - `reshape2`: For reshaping data for the heatmap.
14
- - `viridis`: For consistent, perceptually uniform color schemes.
15
- - `patchwork`: For arranging multiple plots into a cohesive visualization.
16
- - `ggtext`: To customize plot annotations and text styles.
17
-
18
- ## Key Visualizations
19
- 1. **Life Expectancy Line Chart**
20
- - Plots the trend in life expectancy at birth for each country over the years 2013–2020.
21
-
22
- 2. **Infant Mortality Rate Bar Chart**
23
- - Displays the infant mortality rate (per 1,000 live births) for the selected countries.
24
-
25
- 3. **HIV Prevalence Box Plot**
26
- - Visualizes the distribution of HIV prevalence (% of the population aged 15-49) across the countries.
27
-
28
- 4. **Maternal Mortality Ratio Heatmap**
29
- - Shows the maternal mortality ratio (modeled estimate per 100,000 live births) for each country and year.
30
- ## Visualization
31
- ![Dashboard.png](Dashboard.png)
32
-
33
- ## Usage
34
- 1. **Install Required Libraries**
35
- Ensure all the libraries mentioned in the "Libraries Used" section are installed in your R environment.
36
- ```R
37
- install.packages(c("ggplot2", "tidyr", "dplyr", "gridExtra", "reshape2", "viridis", "patchwork", "ggtext"))
38
- ```
 
 
 
 
data/README - Pyomo.md DELETED
@@ -1,47 +0,0 @@
1
- # Portfolio Optimization with Streamlit and Pyomo
2
-
3
- This project optimizes stock portfolios by selecting up to 10 tickers and a custom date range. Using **Pyomo** for optimization, it minimizes risk while targeting a desired return. The app displays portfolio allocations, expected returns, risk ceilings, and stock correlation heatmaps, with data sourced from **yfinance**.
4
-
5
- ## Features
6
-
7
- - **Stock Ticker Selection**: Select up to 10 stock tickers.
8
- - **Date Range**: Choose a custom start and end date.
9
- - **Optimization**: Minimize portfolio risk while maintaining a desired return.
10
- - **Display**: View portfolio allocation, max return, risk ceiling, and a heatmap of stock correlations.
11
-
12
- ## Why Start and End Dates are Required
13
- - The start and end dates define the period for which stock data is fetched and analyzed. These dates allow the app to calculate historical returns and perform portfolio optimization based on the chosen timeframe. By selecting specific dates, users can analyze stocks under different market conditions and tailor the optimization to their desired time horizon.
14
-
15
- ## Setup Instructions
16
-
17
- ### 1. Create a Virtual Environment
18
-
19
- ```bash
20
- python -m venv venv
21
- ```
22
- Activate the virtual environment:
23
- - **Windows**: `venv\Scripts\activate`
24
- - **MacOS/Linux**: `source venv/bin/activate`
25
-
26
- ### 2. Install Dependencies
27
- `pip install -r requirements.txt`
28
-
29
- - Or manually install:
30
- `pip install streamlit pyomo yfinance seaborn matplotlib numpy pandas`
31
-
32
- ### 3. Install IPOPT Solver
33
- **Windows**: Download IPOPT from COIN-OR IPOPT (https://github.com/coin-or/Ipopt/releases) and add ipopt.exe to your PATH.
37
-
38
- **MacOS/Linux**: Install using brew (macOS) or apt-get (Ubuntu).
39
-
40
- ### 4. Run the App
41
- `streamlit run main.py`
42
-
43
-
44
- ### 5. Demo
45
-
46
- ![img.png](img.png)
47
- ![img_1.png](img_1.png)
 
 
 
 
data/README - Zimplaces.rst DELETED
@@ -1,88 +0,0 @@
1
- ============
2
- Zim-Places
3
- ============
4
- .. image:: https://img.shields.io/pypi/v/country_list.svg
5
- :target: https://pypi.org/project/zim-places
6
-
7
- .. image:: https://img.shields.io/badge/code%20style-black-000000.svg
8
- :target: https://github.com/RONALD55/ZimPlaces-Python-Library
9
-
10
- Features
11
- --------
12
-
13
- This is a Python package that allows you to search for cities, provinces, and districts in Zimbabwe. Zimbabwe is split into eight provinces and two cities that are designated as provincial capitals.
14
  The provinces are divided into 59 districts, and the districts into 1,200 wards. Visit the project homepage for further information on how to use the package.
15
-
16
-
17
-
18
- Installation
19
- ------------
20
-
21
- To install zim-places, open a shell or terminal and run::
22
-
23
- pip install zim-places
24
-
25
- Usage
26
- -----
27
-
28
- Get all wards:
29
-
30
- .. code-block:: python
31
-
32
- from zim_places import wards
33
- print(wards.get_wards())
34
-
35
- Get all districts:
36
-
37
- .. code-block:: python
38
-
39
- from zim_places import districts
40
- print(districts.get_districts())
41
-
42
- Get all cities:
43
-
44
- .. code-block:: python
45
-
46
- from zim_places import cities
47
- print(cities.get_cities())
48
-
49
- Get all provinces:
50
-
51
- .. code-block:: python
52
-
53
- from zim_places import provinces
54
- print(provinces.get_provinces())
55
-
56
- .. code-block:: python
57
-
58
- from zim_places import *
59
- import json
60
-
61
- # Get the data as json
62
- print(get_cities())
63
- print(get_wards())
64
- print(get_provinces())
65
- print(get_districts())
66
-
67
- # Get the data as a list of formatted entries; you can customize the comprehensions below to suit your needs
68
- data = json.loads(get_wards())
69
- list_of_wards = [{i['Ward'] + ' ' + i['Province_OR_District']} for i in data.values()]
70
- print(list_of_wards)
71
-
72
- data = json.loads(get_districts())
73
- list_of_districts = [{i['District'] + ' ' + i['Province']} for i in data.values()]
74
- print(list_of_districts)
75
-
76
- data = json.loads(get_provinces())
77
- list_of_provinces = [{i['Province'] + ' ' + i['Capital'] + i['Area(km2)'] + i['Population(2012 census)']} for i in data.values()]
78
- print(list_of_provinces)
79
-
80
- data = json.loads(get_cities())
81
- list_of_cities = [{i['City'] + ' ' + i['Province']} for i in data.values()]
82
- print(list_of_cities)
83
-
84
-
85
- License
86
- -------
87
-
88
- The project is licensed under the MIT license.
 
 
 
 
data/README - churn.md DELETED
@@ -1,134 +0,0 @@
1
- # Customer Churn Analysis
2
-
3
- ## Project Overview
4
-
5
- This project focuses on analyzing customer data to predict churn in a telecommunications company. The primary objective is to identify key factors that contribute to customer churn and to build a predictive model that can accurately identify customers who are likely to leave. By understanding the drivers of churn, businesses can implement targeted retention strategies to reduce customer attrition.
6
-
7
- ---
8
- ## Table of Contents
9
-
10
- - [Project Overview](#project-overview)
11
- - [Objectives](#objectives)
12
- - [Data Source](#data-source)
13
- - [Methodology](#methodology)
14
- - [1. Data Loading and Initial Exploration](#1-data-loading-and-initial-exploration)
15
- - [2. Data Cleaning and Preprocessing](#2-data-cleaning-and-preprocessing)
16
- - [3. Exploratory Data Analysis (EDA)](#3-exploratory-data-analysis-eda)
17
- - [4. Feature Engineering](#4-feature-engineering)
18
- - [5. Model Building](#5-model-building)
19
- - [6. Model Evaluation](#6-model-evaluation)
20
- - [7. Feature Importance](#7-feature-importance)
21
- - [Libraries and Tools Used](#libraries-and-tools-used)
22
- - [Key Findings and Results](#key-findings-and-results)
23
- - [Conclusion](#conclusion)
24
- - [Future Work](#future-work)
25
- - [How to Run](#how-to-run)
26
- - [Contributors](#contributors)
27
-
28
- ---
29
- ## Objectives
30
-
31
- * To understand the patterns and characteristics of customers who churn versus those who do not.
32
- * To identify the most significant features influencing customer churn.
33
- * To build and evaluate various machine learning models for churn prediction.
34
- * To provide actionable insights for customer retention strategies.
35
- * To analyze the distribution of churn across different customer segments.
36
-
37
- ---
38
- ## Data Source
39
- * **Dataset** : [Download](https://zhang-datasets.s3.us-east-2.amazonaws.com/telcoChurn.csv)
40
- * **Description**: The dataset was sourced from a telecom provider and contains 7043 customer records with 21 features. It includes customer demographics, account information, services subscribed, and churn status.
41
- * **Features**: `customerID`, `gender`, `SeniorCitizen`, `Partner`, `Dependents`, `tenure`, `PhoneService`, `MultipleLines`, `InternetService`, `OnlineSecurity`, `OnlineBackup`, `DeviceProtection`, `TechSupport`, `StreamingTV`, `StreamingMovies`, `Contract`, `PaperlessBilling`, `PaymentMethod`, `MonthlyCharges`, `TotalCharges`, `Churn`
42
- * **Target Variable**: `Churn` (Yes/No)
43
-
44
- ---
45
- ## Methodology
46
-
47
- The project follows a standard data science workflow:
48
-
49
- ### 1. Data Loading and Initial Exploration
50
- * Loaded the dataset using Pandas from a CSV file.
51
- * Performed initial checks:
52
- * **Data Shape**: (7043, 21)
53
- * **Data Types**: Mixed numeric and object types.
54
- * **Basic Statistics**: Examined numerical features using `df.describe()`.
55
- * **Missing Values Check**: Found 11 missing values in `TotalCharges`.
56
-
57
- ### 2. Data Cleaning and Preprocessing
58
- * **Missing Values**: Handled 11 missing values in `TotalCharges` by converting to numeric and dropping NA values.
59
- * **Data Type Conversion**: Converted `TotalCharges` from object to numeric type.
60
- * **Categorical Variable Encoding**: Prepared categorical features for analysis (status: not yet encoded in current notebook).
61
- * **Feature Scaling**: Implemented
62
-
63
- ### 3. Exploratory Data Analysis (EDA)
64
- * **Churn Distribution**: Visualized the target variable, revealing a distribution of 73.4% No Churn and 26.6% Churn.
65
- ![img_5.png](images/img_5.png)
66
-
67
- * **Univariate Analysis**: Examined distributions of numerical features and counts for categorical features.
68
- * **Bivariate Analysis**: Created countplots for all categorical features against churn status to observe relationships.
69
- ![img_7.png](images/img_7.png)
70
- ![img_8.png](images/img_8.png)
71
- ![img_9.png](images/img_9.png)
72
- * **Correlation Analysis**: Implemented
73
- * ![img_6.png](images/img_6.png)
74
-
75
- ### 4. Feature Engineering
76
- * **Feature Selection**: All features were initially considered for analysis.
77
- * **New Features**: Created a tenure-group feature by binning the numerical tenure variable.
78
- ![img_10.png](images/img_10.png)
79
-
80
- ### 5. Model Building
81
- * **Data Splitting**: Used an 80-20 train-test split with cross-validation.
82
- * **Model Training**: Trained Logistic Regression, Random Forest, XGBoost, AdaBoost, Gradient Boosting, and a simple Neural Network.
83
- * **Handling Class Imbalance**: Used SMOTE to oversample the minority class.
84
- * **Hyperparameter Tuning**: Used Optuna for hyperparameter tuning.
85
-
86
- ### 6. Model Evaluation
87
- * **Metrics Used** : AUC-ROC
88
- * **Model Performance Comparison**
89
- ![img_12.png](images/img_12.png)
90
-
91
- * **Best Model Performance**
92
- ![img_11.png](images/img_11.png)
93
-
94
- ### 7. Feature Importance
95
- We used SHAP and LIME for feature importance analysis
96
- ![img.png](images/img.png)
97
- ![img_1.png](images/img_1.png)
98
- ![img_2.png](images/img_2.png)
99
- ![img_3.png](images/img_3.png)
100
- ---
101
- ## Libraries and Tools Used
102
-
103
- * **Python >= 3.10**
104
- * **Pandas**: For data manipulation and analysis.
105
- * **NumPy**: For numerical operations.
106
- * **Matplotlib**: For basic visualization.
107
- * **Seaborn**: For advanced statistical visualizations.
108
- * **Scikit-learn**: Used for classification algorithms (Logistic Regression, Random Forest, Gradient Boosting, AdaBoost, Decision Trees), data preprocessing, model selection, and performance evaluation.
109
- * **XGBoost**: For optimized gradient boosting models.
110
- * **LIME**: For local interpretable model explanations.
111
- * **SHAP**: For model interpretability and feature importance.
112
- * **Optuna**: For hyperparameter tuning.
113
- * **TensorFlow Keras**: For building and training neural networks.
114
- * **Jupyter Notebook**: As the development environment.
115
-
116
- ---
117
- ## Conclusion
118
-
119
- This project identified key churn drivers and built predictive models to flag high-risk customers. **Gradient Boosting** outperformed all models based on AUC-ROC and was selected as the final model. SHAP and LIME provided valuable insights into feature importance, supporting strategic business decisions.
120
-
121
- ## Future Work
122
-
123
- - **Targeted Retention**: Offer personalized discounts or upgrades to high-risk customers.
124
- - **Proactive Support**: Use alerts to flag churn risks and assign dedicated support.
125
- - **Customer Engagement**: Send usage tips and educational content to boost satisfaction.
126
- - **Risk-Based Segmentation**: Tailor campaigns by churn risk (high, medium, low).
127
- - **Real-Time Scoring**: Deploy the model to enable live churn predictions and actions.
128
- - **Model Updates**: Retrain with new data regularly for improved accuracy.
129
-
130
- ---
131
- ## Contributors
132
-
133
- * [Ronald Nyasha Kanyepi](https://www.linkedin.com/in/ronald-nyasha-kanyepi/)
134
- * [GitHub](https://github.com/ronaldkanyep)
 
 
 
 
data/README-Log-Realtime-Analysis.md DELETED
@@ -1,152 +0,0 @@
1
- # Log Realtime Analysis
2
-
3
- ## Overview
4
-
5
- Log Realtime Analysis is a robust real-time log aggregation and visualization system designed to handle high-throughput logs using a Kafka-Spark ETL pipeline. For example, it can process application logs tracking user requests, error rates, and API response times in real-time. It integrates with DynamoDB for real-time metrics storage and visualizes key system insights using Python and Dash Plotly Library. The setup uses Docker for containerized deployment, ensuring seamless development and deployment workflows.
6
-
7
- ## Features
8
-
9
- - **Log Ingestion:** High-throughput log streaming with Kafka.
10
- - **Real-Time Aggregation:** Spark processes logs per minute for metrics like request counts, error rates, and response times.
11
- - **Metrics Storage:** Aggregated metrics stored in DynamoDB for fast querying. DynamoDB is optimized for low-latency, high-throughput queries, making it ideal for real-time dashboard applications.
12
- - **Data Storage:** Historical logs saved in HDFS as Parquet files for long-term analysis.
13
- - **Interactive Dashboard:** Dash application with real-time updates and SLA metrics visualization.
14
-
15
- ## Architecture
16
- ![kafka_flow.gif](ui%2Fassets%2Fkafka_flow.gif)
17
- 1. **Input Topic:** `logging_info` for real-time log ingestion.
18
- - **Purpose:** High-throughput, fault-tolerant log streaming.
19
-
20
- 2. **Real-Time Aggregation with Spark**
21
- - **Processing Logic:** Aggregates logs per minute for metrics like request counts, error rates, and response times.
22
- - **Output Topic:** `agg_logging_info` with structured metrics.
23
-
24
- 3. **Downstream Processing**
25
- - **DynamoDB:** Stores real-time metrics for dashboards with low-latency queries.
26
- - **HDFS:** Stores aggregated logs in Parquet format for long-term analysis.
27
-
28
- 4. **Visualization with Python Dash**
29
-
30
- - **Purpose:** Auto-refreshing dashboards show live system metrics, request rates, error types, and performance insights.
31
-
32
- ---
33
-
34
- ## Dockerized Services
35
-
36
- ### Zookeeper
37
-
38
- - **Image:** `bitnami/zookeeper:latest`
39
- - **Ports:** `2181:2181`
40
- - **Volume:** `${HOST_SHARED_DIR}/zookeeper:/bitnami/zookeeper`
41
-
42
- ### Kafka
43
-
44
- - **Image:** `bitnami/kafka:latest`
45
- - **Ports:** `9092:9092`, `29092:29092`
46
- - **Volume:** `${HOST_SHARED_DIR}/kafka:/bitnami/kafka`
47
-
48
- ### DynamoDB Local
49
-
50
- - **Image:** `amazon/dynamodb-local:latest`
51
- - **Ports:** `8000:8000`
52
- - **Volume:** `${HOST_SHARED_DIR}/dynamodb-local:/data`
53
-
54
- ### DynamoDB Admin
55
-
56
- - **Image:** `aaronshaf/dynamodb-admin`
57
- - **Ports:** `8001:8001`
58
-
59
- ### Spark Jupyter
60
- - **Image:** `jupyter/all-spark-notebook:python-3.11.6`
61
- - **Ports:** `8888:8888`, `4040:4040`
62
- - **Volume:** `${HOST_SHARED_DIR}/spark-jupyter-data:/home/jovyan/data`
63
-
64
- ---
65
-
66
- ## Dashboard
67
-
68
- The Python Dash application provides an intuitive interface for monitoring real-time metrics and logs. Key features include:
69
-
70
- - SLA gauge visualization.
71
- - Log-level distribution pie chart.
72
- - Average response time by API.
73
- - Top APIs with highest error counts.
74
- - Real-time log-level line graph.
75
-
76
- ### Dashboard Components
77
-
78
- 1. **SLA Gauge:** Visualizes the system's SLA percentage.
79
- 2. **Log Level Distribution:** Displays the proportion of different log levels.
80
- 3. **Average Response Time:** Bar chart showing average response times for APIs.
81
- 4. **Top Error-Prone APIs:** Table listing APIs with the highest error counts.
82
- 5. **Log Counts Over Time:** Line chart of log counts aggregated by log levels.
83
-
84
- ![img.png](ui/assets/dashboard-1.png)
85
-
86
- ![img.png](ui/assets/dashboard-2.png)
87
- ---
88
-
89
- ## How to Run
90
-
91
- ### Prerequisites
92
- - Docker and Docker Compose installed.
93
- - Shared directory setup for volume bindings.
94
- - Replace `${HOST_SHARED_DIR}` with your host directory.
95
- - Replace `${IP_ADDRESS}` with your host machine IP.
96
-
97
- ### Steps
98
-
99
- 1. **Start the Services:**
100
- ```bash
101
- docker-compose up -d
102
- ```
103
- 2. **Access Jupyter Notebook:**
104
- Open `http://localhost:8888`, or check the Jupyter container's logs in Docker for the full URL.
105
- 3. **Run the Dash App:**
106
- ```bash
107
- python ui/ui-prod.py
108
- ```
109
- Access the dashboard at `http://127.0.0.1:8050`.
110
- 4. **Kafka Setup:**
111
- - Create the Kafka topics (see `kafka/kafka_topic.py`), then start the log producer:
112
- ```bash
113
- python kafka/kafka_producer.py
114
- ```
115
-
116
- ---
117
-
118
- ## Data Pipeline
119
-
120
- 1. **Log Generation:** Logs are streamed to Kafka's `logging_info` topic.
121
- 2. **Spark Processing:** Spark consumes logs, aggregates them, and produces structured metrics to `agg_logging_info`.
122
- 3. **Metrics Storage:** Aggregated data is stored in DynamoDB for real-time querying.
123
- 4. **Long-Term Storage:** Historical logs are stored in HDFS in Parquet format.
124
-
125
- ---
126
-
127
- ## Files
128
-
129
- - `docker-compose.yml`: Docker configuration for services.
130
- - `ui/ui-prod.py`: Dash application for visualizing logs and metrics.
131
- - `kafka/kafka_topic.py`: Script for creating the Kafka topics: one for granular logs and one for the aggregated logs from Spark.
132
- - `kafka/kafka_producer.py`: Script for simulating logs
133
- - `spark/spark-portfolio.ipynb`: Consumes granular logs from the topic `logging_info` and aggregates the log data by minute intervals, computes statistics (count, avg, max, min response times), and streams the results in JSON format to the Kafka topic `agg_logging_info`
134
- - `spark/spark_kafka.py`: Consumes log messages from a Kafka topic, parses them, and stores aggregated log metrics into a DynamoDB table.
135
-
136
-
137
- ## Future Enhancements
138
-
139
- - Integrate machine learning for anomaly detection.
140
- - Add support for multiple regions in DynamoDB.
141
- - Implement alerting (sms and email) for SLA breaches.
142
- - Enhance dashboard for customizable user settings.
143
- ---
144
-
145
- ## License
146
- This project is licensed under the MIT License.
147
- ---
148
-
149
- ## Contributors
150
- - **Ronald Nyasha Kanyepi** - [GitHub](https://github.com/ronaldkanyepi). For any inquiries, please contact [[email protected]](mailto:[email protected]).
151
-
152
-
 
 
 
 
data/README-OCR-JSON.md DELETED
@@ -1,30 +0,0 @@
1
- ---
2
- title: Zim Docs OCR-to-JSON Extractor
3
- emoji: ⚡
4
- colorFrom: purple
5
- colorTo: blue
6
- sdk: gradio
7
- sdk_version: 5.31.0
8
- app_file: app.py
9
- pinned: false
10
- license: mit
11
- ---
12
-
13
- # Zim Docs OCR-to-JSON Extractor
14
- ## Overview
15
-
16
- Welcome to the **Zim Docs OCR-to-JSON Extractor**! This is a powerful and user-friendly web application built with Gradio, designed to help you upload scanned documents (PDFs) or images (PNG, JPG, etc.). It then uses a vision AI model to perform Optical Character Recognition (OCR) and extract structured information into a JSON format. This tool aims to streamline your process of digitizing and organizing data from various document types, such as **driver's licenses, passports, national ID cards, invoices, receipts, and more.**
17
-
18
- ## Requirements
19
-
20
- To use this application, you'll need:
21
- * Python 3.7+
22
- * Gradio
23
- * Gradio-PDF (`gradio_pdf`)
24
- * Requests
25
- * PyMuPDF (`fitz`)
26
- * An API Key from [OpenRouter.ai](https://openrouter.ai/) (or any other service compatible with the OpenAI chat completions API format).
27
- * You should set this key as an environment variable named `API_KEY`. The Python script uses `os.getenv("API_KEY")` to retrieve this key. If you're using Hugging Face Spaces, you can set this as a "Secret".
28
-
29
- ## Running the Application
30
- * **Live Demo:** You can try out a live demo of this application at: [Demo](https://huggingface.co/spaces/NyashaK/DocOCR2JSON)
 
 
 
 
data/profile.json ADDED
@@ -0,0 +1,156 @@
 
 
 
 
1
+ [
2
+ {
3
+ "type": "Personal Profile",
4
+ "name": "RONALD NYASHA KANYEPI",
5
+ "summary": "Data Scientist with over 3 years of experience transforming complex financial services and real estate data into actionable business insights. Proven ability to build robust machine learning models and real-time ETL pipelines using Python, SQL, and Spark. Experienced in deploying scalable ML solutions on AWS and GCP using MLflow, Docker, Kubernetes, and FastAPI.",
6
+ "contact": {
7
+ "email": "[email protected]",
8
+ "linkedin": "https://www.linkedin.com/in/ronald-nyasha-kanyepi/",
9
+ "portfolio": "https://ronaldkanyepi.github.io/portfolio-website/",
10
+ "github": "https://github.com/ronaldkanyepi"
11
+ },
12
+ "education": [
13
+ {
14
+ "institution": "EMORY UNIVERSITY",
15
+ "location": "Atlanta, GA",
16
+ "degree": "Master of Science in Business Analytics",
17
+ "graduation_year": 2025
18
+ },
19
+ {
20
+ "institution": "UNIVERSITY OF ZIMBABWE",
21
+ "location": "Harare, Zimbabwe",
22
+ "degree": "Bachelor of Business Studies and Computing Science",
23
+ "graduation_year": 2021,
24
+ "notes": "Graduated with a First Class Honors Degree and was awarded the UZ Book Price (Prize given to the top student)."
25
+ }
26
+ ],
27
+ "professional_experience": [
28
+ {
29
+ "role": "DATA SCIENTIST",
30
+ "company": "Pennybacker Capital - Austin, Texas",
31
+ "dates": "Dec 2024 - May 2025",
32
+ "achievements": [
33
+ "Designed and deployed machine learning models to forecast quarterly Gross Asset Value (GAV) for a $4B+ real estate portfolio achieving 1% forecasting error (MAPE) and helping prevent an estimated $2M in annual losses.",
34
+ "Engineered an integrated data pipeline, consolidating over 50 internal and external data sources on Databricks to create a comprehensive datasets for predictive modeling and analysis.",
35
+ "Initiated and analyzed Google Reviews data to create a sentiment-driven early warning system, identifying operational risks and opportunities for improvement across 11 underperforming multifamily properties.",
36
+ "Translated complex model predictions into actionable business strategy by using SHAP and LIME to interpret feature importance and predictive insights enhancing stakeholder trust and data-driven decision-making."
37
+ ]
38
+ },
39
+ {
40
+ "role": "SENIOR DATA SPECIALIST",
41
+ "company": "AFC Commercial Bank - Harare, Zimbabwe",
42
+ "dates": "Mar 2024 – Jun 2024",
43
+ "achievements": [
44
+ "Led the partnership between OK-Supermarket and AFC Bank for the OK Grand Challenge promotion, driving data-driven marketing strategies; effort generated a 200% increase in POS transactions across 70+ outlets.",
45
+ "Developed a data visualization dashboard using Python, Apache Spark and Dash Plotly to analyze 20000+ ATM and POS terminal activity, providing critical insights and facilitating in-depth analysis and swift resolution of operational issues.",
46
+ "Implemented an XGBoost model to predict point-of-sale client churn, enhancing targeted retention campaign effectiveness by 25% and reducing churn rates by 15% within two months.",
47
+ "Led customer and loan data migration from T24 to IDC Core Banking System, achieving 99.4% accuracy by automating workflows with Python and Apache Spark for faster data cleaning and validation while minimizing downtime."
48
+ ]
49
+ },
50
+ {
51
+ "role": "DATA SPECIALIST",
52
+ "company": "AFC Commercial Bank - Harare, Zimbabwe",
53
+ "dates": "Jun 2022 – Feb 2024",
54
+ "achievements": [
55
+ "Developed a Python backend with FastAPI to integrate the Reserve Bank of Zimbabwe (RBZ) API for the Credit Reference Bureau (CRB), reducing data processing time by 40% while enhancing regulatory compliance.",
56
+ "Built ETL data pipelines using Apache Kafka and Python to integrate data from the core banking system, delivering accurate KPIs across 45 AFC Commercial Bank branches.",
57
+ "Redesigned and optimized merchant reporting services with Apache Airflow and DBT, automating manual processes and increasing efficiency by 80%, while delivering insights on transaction performance to key stakeholders.",
58
+ "Modernized a monolithic reconciliation app into scalable microservices using Docker, Python, FastAPI, Kubernetes and Angular, boosting efficiency by 150%."
59
+ ]
60
+ }
61
+ ],
62
+ "technical_capabilities": {
63
+ "Programming & Machine Learning": ["Python", "R", "SQL", "scikit-learn", "Darts", "Statsmodels", "ARIMA/SARIMA", "TensorFlow", "PyTorch", "LightGBM", "XGBoost", "CatBoost", "Large Language Models (LLMs)", "Retrieval-Augmented Generation (RAG)", "LangChain", "LlamaIndex", "OpenAI APIs", "HuggingFace Transformers", "SHAP", "Optuna (Hyperparameter Tuning)"],
64
+ "Data Engineering & MLOps": ["Apache Spark", "Kafka", "Apache Airflow", "Docker", "Kubernetes", "dbt (data build tool)", "Great Expectations", "AWS S3", "Glue", "EMR", "GCP Cloud Functions", "REST APIs", "Feature Stores", "CI/CD with GitHub Actions", "Model Deployment via Chainlit, MLflow, FastAPI"],
65
+ "Visualization & Analytics": ["Dash (Plotly)", "Streamlit", "Tableau", "Power BI", "Excel", "Matplotlib", "Seaborn", "Time Series Forecasting (Multivariate, Hierarchical)", "A/B Testing", "Uplift Modeling", "Segmentation", "Deep Exploratory Data Analysis (EDA)"],
66
+ "Cloud, Databases & Storage": ["AWS (S3, SageMaker, Redshift)", "GCP (BigQuery, Vertex AI)", "Databricks", "PostgreSQL","MS SQL Server","MySQL", "DynamoDB", "MongoDB", "DuckDB", "Vector Stores (FAISS, Chroma)", "NoSQL", "ElasticSearch", "Parquet", "ORC", "JSON", "Avro"]
67
+ },
68
+ "certifications": [
69
+ "Microsoft Azure AI Fundamentals",
70
+ "Databricks: Generative AI Fundamentals",
71
+ "Oracle SE 11 Java Developer",
72
+ "Akka Reactive Architecture: Domain Driven Design - Level 2",
73
+ "Akka Reactive Architecture: Introduction to Reactive Systems - Level 2"
74
+ ]
75
+ },
76
+ {
77
+ "type": "Project",
78
+ "project_name": "Customer Churn Analysis",
79
+ "summary": "This project focuses on analyzing customer data to predict churn in a telecommunications company. The primary objective is to identify key factors that contribute to customer churn and to build a predictive model that can accurately identify customers who are likely to leave.",
80
+ "my_role_and_achievements": [
81
+ "Handled missing values in 'TotalCharges' and converted data types for analysis.",
82
+ "Performed Exploratory Data Analysis (EDA), visualizing the churn distribution which was 26.6% Churn.",
83
+ "Trained multiple models including Logistic Regression,Gradient Boost, Random Forest, XGBoost, and a simple Neural Network.",
84
+ "Used SMOTE to handle class imbalance and Optuna for hyper-parameter tuning.",
85
+ "The Gradient Boosting model was selected as the final model based on AUC-ROC performance.",
86
+ "Used SHAP and LIME for feature importance analysis."
87
+ ],
88
+ "technologies": ["Python", "Pandas", "NumPy", "Matplotlib", "Seaborn", "Scikit-learn", "XGBoost", "LIME", "SHAP", "Optuna", "TensorFlow Keras"],
89
+ "source_url": "https://github.com/ronaldkanyepi/Customer-Churn-Analysis"
90
+ },
91
+ {
92
+ "type": "Project",
93
+ "project_name": "Health Trends in Southern Africa: A 2013-2020 Overview",
94
+ "summary": "This project visualizes key health indicators in Southern Africa between 2013 and 2020, leveraging data from the World Bank. The focus is on life expectancy, infant mortality rates, maternal mortality ratios, and HIV prevalence across Zimbabwe, Botswana, Mozambique, and South Africa.",
95
+ "my_role_and_achievements": [
96
+ "Utilized the World Bank API to source data.",
97
+ "Created multiple visualizations including a line chart for life expectancy, a bar chart for infant mortality, a box plot for HIV prevalence, and a heatmap for maternal mortality ratios.",
98
+ "Arranged multiple plots into a cohesive dashboard visualization."
99
+ ],
100
+ "technologies": ["R", "ggplot2", "tidyr", "dplyr", "gridExtra", "reshape2", "viridis", "patchwork", "ggtext"],
101
+ "source_url": "https://github.com/ronaldkanyepi/Southern-Africa-Health-Indicators-Analysis/tree/main"
102
+ },
103
+ {
104
+ "type": "Project",
105
+ "project_name": "Portfolio Optimization with Streamlit and Pyomo",
106
+ "summary": "This project optimizes stock portfolios by selecting up to 10 tickers and a custom date range. It uses Pyomo for optimization to minimize risk while targeting a desired return. The app displays allocations, returns, risk, and correlation heatmaps.",
107
+ "my_role_and_achievements": [
108
+ "Developed a Streamlit application for user interaction.",
109
+ "Used yfinance to fetch stock data for custom date ranges.",
110
+ "Implemented portfolio optimization logic using the Pyomo library.",
111
+ "Installed the IPOPT solver to handle the optimization calculations.",
112
+ "Visualized results, including a heatmap of stock correlations."
113
+ ],
114
+ "technologies": ["Python", "Streamlit", "Pyomo", "yfinance", "seaborn", "matplotlib", "numpy", "pandas"],
115
+ "source_url": "https://github.com/ronaldkanyepi/Portfolio-Optimization-Pyomo"
116
+ },
117
+ {
118
+ "type": "Project",
119
+ "project_name": "Zim-Places Python Package",
120
+ "summary": "A Python package that allows you to search for cities, provinces, and districts in Zimbabwe. Zimbabwe is split into eight provinces and two cities, with 59 districts and 1,200 wards.",
121
+ "my_role_and_achievements": [
122
+ "Developed and published the 'zim-places' package to PyPI.",
123
+ "Provided functions to get all wards, districts, cities, and provinces.",
124
+ "Showed examples of how to get data as JSON and convert it into customized lists of dictionaries."
125
+ ],
126
+ "technologies": ["Python", "PyPI", "JSON"],
127
+ "source_url": "https://pypi.org/project/zim-places"
128
+ },
129
+ {
130
+ "type": "Project",
131
+ "project_name": "Log Real-Time Analysis",
132
+ "summary": "A robust real-time log aggregation and visualization system designed to handle high-throughput logs (e.g., 60,000 events/sec) using a Kafka-Spark ETL pipeline. It integrates with DynamoDB for metrics storage and visualizes insights using a Dash Plotly dashboard.",
133
+ "my_role_and_achievements": [
134
+ "Designed a scalable architecture for real-time log processing and visualization.",
135
+ "Handled log ingestion with Kafka and real-time aggregation with Spark, which processed logs per minute.",
136
+ "Stored aggregated metrics in DynamoDB for fast querying and historical logs in HDFS as Parquet files.",
137
+ "Developed an interactive dashboard in Dash with real-time updates for SLA metrics, error rates, and response times.",
138
+ "Containerized the entire architecture using Docker-compose, including Zookeeper, Kafka, DynamoDB, and a Spark-Jupyter environment."
139
+ ],
140
+ "technologies": ["Python", "Apache Kafka", "Apache Spark", "DynamoDB", "HDFS", "Parquet", "Docker", "Dash", "Plotly"],
141
+ "source_url": "https://github.com/ronaldkanyepi/Log-Realtime-Analysis"
142
+ },
143
+ {
144
+ "type": "Project",
145
+ "project_name": "Zim Docs OCR-to-JSON Extractor",
146
+ "summary": "A web application built with Gradio that allows users to upload scanned documents (PDFs) or images. It uses a vision AI model to perform OCR and extract structured information into a JSON format for various document types like licenses, passports, and invoices.",
147
+ "my_role_and_achievements": [
148
+ "Built a user-friendly web application using Gradio.",
149
+ "Integrated a vision AI model to perform OCR and structured data extraction.",
150
+ "Handled both PDF and image file uploads using Gradio-PDF and PyMuPDF.",
151
+ "Managed API key integration via environment variables for use with services like OpenRouter.ai, making it compatible with Hugging Face Spaces secrets."
152
+ ],
153
+ "technologies": ["Python", "Gradio", "Gradio-PDF", "PyMuPDF (fitz)", "OpenAI-compatible APIs"],
154
+ "source_url": {"demo":"https://huggingface.co/spaces/NyashaK/DocOCR2JSON","github": "https://github.com/ronaldkanyepi/docs-ocr-2-json"}
155
+ }
156
+ ]
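Given the llama-index packages pinned in requirements.txt, data/profile.json presumably serves as the chatbot's knowledge base. A hypothetical loading sketch (the embedding model name, variable names, and query are illustrative assumptions, not taken from app.py):

```python
# Hypothetical sketch: turn data/profile.json into LlamaIndex Documents and build a
# small vector index the assistant can retrieve from. Model name is an assumption.
import json

from llama_index.core import Document, Settings, VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

with open("data/profile.json", encoding="utf-8") as f:
    records = json.load(f)

# One Document per record (personal profile or project); keep the record type as
# metadata so retrieved chunks can be attributed in answers.
documents = [
    Document(text=json.dumps(record, indent=2), metadata={"type": record["type"]})
    for record in records
]

index = VectorStoreIndex.from_documents(documents)
retriever = index.as_retriever(similarity_top_k=3)
for hit in retriever.retrieve("Which projects used Kafka and Spark?"):
    print(hit.node.metadata.get("type"), hit.score)
```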
data/profile.txt DELETED
@@ -1,69 +0,0 @@
1
- Name : RONALD NYASHA KANYEPI
2
- Mobile : (678) 939–0239
3
- Email : [email protected]
4
- Linked In : https://www.linkedin.com/in/ronald-nyasha-kanyepi/
5
- Portfolio : https://ronaldkanyepi.github.io/portfolio-website/
6
- Github : https://github.com/ronaldkanyepi
7
-
8
-
9
- PROFESSIONAL SUMMARY
10
- Data Scientist with 3+ years of experience transforming complex financial services and real estate data into actionable business insights. Proven ability to build robust machine learning models and real-time ETL pipelines using Python, SQL, and Spark. Experienced in deploying scalable ML solutions on AWS and GCP using MLflow, Docker, Kubernetes, and FastAPI.
11
-
12
-
13
- EDUCATION
14
- EMORY UNIVERSITY Atlanta, GA
15
- Master of Science in Business Analytics graduated in May 2025
16
-
17
-
18
- UNIVERSITY OF ZIMBABWE Harare, Zimbabwe
19
- Bachelor of Business Studies and Computing Science graduated in Dec 2021
20
- I was the top student in the class, graduated with a First Class Honors Degree, and was awarded the UZ Book Prize (given to the top student).
21
-
22
-
23
- PROFESSIONAL EXPERIENCE
24
- DATA SCIENTIST | Pennybacker Capital - Austin, Texas Dec 2024 -May 2025
25
- Designed and deployed machine learning models to forecast quarterly Gross Asset Value (GAV) for a $4B+ real estate portfolio achieving 1% forecasting error (MAPE) and helping prevent an estimated $2M in annual losses.
26
- Engineered an integrated data pipeline, consolidating over 50 internal and external data sources on Databricks to create a comprehensive dataset for predictive modeling and analysis.
27
- Initiated and analyzed Google Reviews data to create a sentiment-driven early warning system, identifying operational risks and opportunities for improvement across 11 underperforming multifamily properties.
28
- Translated complex model predictions into actionable business strategy by using SHAP and LIME to interpret feature importance and predictive insights enhancing stakeholder trust and data-driven decision-making.
29
-
30
-
31
- SENIOR DATA SPECIALIST | AFC Commercial Bank - Harare, Zimbabwe Mar 2024 – Jun 2024
32
- Led the partnership between OK-Supermarket and AFC Bank for the OK Grand Challenge promotion, driving data-driven marketing strategies; effort generated a 200% increase in POS transactions across 70+ outlets.
33
- Developed a data visualization dashboard using Python, Apache Spark and Dash Plotly to analyze 20000+ ATM and POS terminal activity, providing critical insights and facilitating in-depth analysis and swift resolution of operational issues.
34
- Implemented an XGBoost model to predict point-of-sale client churn, enhancing targeted retention campaign effectiveness by 25% and reducing churn rates by 15% within two months.
35
- Led customer and loan data migration from T24 to IDC Core Banking System, achieving 99.4% accuracy by automating workflows with Python and Apache Spark for faster data cleaning and validation while minimizing downtime.
36
-
37
-
38
- DATA SPECIALIST | AFC Commercial Bank - Harare, Zimbabwe Jun 2022 – Feb 2024
39
- Developed a Python backend with FastAPI to integrate the Reserve Bank of Zimbabwe (RBZ) web service for the Credit Reference Bureau (CRB), reducing data processing time by 40% while enhancing regulatory compliance.
40
- Built ETL data pipelines using Apache Kafka and Python to integrate data from the core banking system, delivering accurate account data metrics across 45 AFC Commercial Bank branches.
41
- Redesigned and optimized merchant reporting services with Apache Airflow and DBT, automating manual processes and increasing efficiency by 80%, while delivering insights on transaction performance to key stakeholders.
42
- Modernized a monolithic reconciliation app into scalable microservices using Docker, Python, FastAPI, Kubernetes and Angular, boosting efficiency by 150%.
43
-
44
- SELECTED PROJECTS
45
- Log-Realtime-Analysis Project Dec 2024
46
- Designed a scalable architecture for real-time log processing and visualization, handling 60,000 log events per second using a Kafka-Spark ETL pipeline, DynamoDB for real-time metric storage, and Python Dash for interactive dashboards.
47
- Sports Ticket Sales Forecasting Feb 2025
48
- Achieved 3.3% forecast error in predicting Atlanta Braves ticket sales, the best in the competition. Developed XGBoost and LSTM models, integrating attendance, promotions and weather data to enhance forecasting precision.
49
- At AFC Bank I created a FlexiXpress remittance application. I was also part of the team that created the backend for USSD and mobile banking application.
50
-
51
- TECHNICAL CAPABILITIES
52
- Programming & Machine Learning
53
- Python, R, SQL • scikit-learn, Darts, Statsmodels, ARIMA/SARIMA • TensorFlow, PyTorch • LightGBM, XGBoost, CatBoost • Large Language Models (LLMs) • Retrieval-Augmented Generation (RAG) • LangChain, LlamaIndex, OpenAI APIs • HuggingFace Transformers • SHAP, Optuna (Hyperparameter Tuning)
54
-
55
- Data Engineering & MLOps
56
- Apache Spark, Kafka, Apache Airflow, Docker, Kubernetes • dbt (data build tool), Great Expectations • AWS S3, Glue, EMR • GCP Cloud Functions • REST APIs • Feature Stores • CI/CD with GitHub Actions • Model Deployment via Chainlit, MLflow, FastAPI
57
-
58
- Visualization & Analytics
59
- Dash (Plotly), Streamlit, Tableau, Power BI, Excel • Matplotlib, Seaborn • Time Series Forecasting (Multivariate, Hierarchical) • A/B Testing, Uplift Modeling, Segmentation • Deep Exploratory Data Analysis (EDA)
60
-
61
- Cloud, Databases & Storage
62
- AWS (S3, SageMaker, Redshift), GCP (BigQuery, Vertex AI), Databricks • PostgreSQL, MySQL, DynamoDB, MongoDB, DuckDB • Vector Stores (FAISS, Chroma) • NoSQL, ElasticSearch • Parquet, ORC, JSON, Avro
63
-
64
- Certifications
65
- Microsoft Azure AI Fundamentals
66
- Databricks: Generative AI Fundamentals
67
- Oracle SE 11 Java Developer
68
- Akka Reactive Architecture: Domain Driven Design - Level 2
69
- Akka Reactive Architecture: Introduction to Reactive Systems - Level 2
 
 
 
 
requirements.txt CHANGED
@@ -3,4 +3,5 @@ llama-index-core~=0.12.41
 llama-index-embeddings-huggingface~=0.5.4
 llama-index-llms-openrouter~=0.3.2
 loguru
-websockets
+websockets
+
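The commit message says "adding logging", and loguru is already pinned above, but the logging code itself is not shown in this diff. A hedged sketch of one common pattern for wiring loguru into a Chainlit app (the file path, rotation settings, and hook body are assumptions):

```python
# Illustrative only - not taken from app.py. Log each incoming chat message with
# loguru, rotating the file so a long-running Space does not fill its disk.
import chainlit as cl
from loguru import logger

logger.add("logs/app.log", rotation="10 MB", retention="7 days", level="INFO")


@cl.on_message
async def on_message(message: cl.Message):
    logger.info("Received message ({} chars): {!r}", len(message.content), message.content[:200])
    # ... generate the assistant's reply here and send it with cl.Message(...).send() ...
```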