Akanksh Gatla committed on
Commit
18b5fcf
·
1 Parent(s): 005ffea

Add application file

Files changed (3)
  1. README.md +93 -8
  2. app.py +1289 -0
  3. requirements.txt +16 -0
README.md CHANGED
@@ -1,13 +1,98 @@
  ---
- title: Healthcare PHM
- emoji: 🐠
- colorFrom: yellow
- colorTo: indigo
+ license: mit
+ title: Healthcare Data Analysis Project
  sdk: streamlit
+ emoji: 📊
+ colorFrom: indigo
+ colorTo: red
+ short_description: Comprehensive Analysis of Healthcare
  sdk_version: 1.38.0
- app_file: app.py
- pinned: false
- license: mit
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ ## Overview
+
+ This project focuses on the comprehensive analysis of healthcare data using Exploratory Data Analysis (EDA), Machine Learning, and integration with a Google Gen AI-powered chatbot. The chatbot is integrated with Pandas AI, enabling interactive data exploration through natural language queries. The goal is to extract meaningful insights from complex healthcare datasets, improve patient care through predictive modeling, and enhance data accessibility using AI-powered conversational tools.
+
+ ## Features
+
+ - **Exploratory Data Analysis (EDA):**
+   - In-depth examination of healthcare data, including patient encounters, medical measurements, lab results, and diagnoses.
+   - Identification of patterns, trends, and anomalies in the data.
+   - Visualization of key metrics to provide clear insights into the data.
+
+ - **Machine Learning:**
+   - Implementation of clustering algorithms to categorize patient data based on medical measurements, conditions, and severity indicators.
+   - Application of Principal Component Analysis (PCA) for dimensionality reduction and visualization.
+   - Development of predictive models to forecast patient outcomes and risk factors.
+
+ - **Google Gen AI Chatbot Integration:**
+   - Integration of a Google Gen AI-powered chatbot using Pandas AI for interactive data analysis.
+   - Natural language processing capabilities to allow users to ask questions and receive data-driven responses.
+   - The chatbot can generate plots, provide statistical summaries, and assist with data exploration.
+
+ ## Project Structure
+
+ - **`data/`**: Contains the healthcare datasets used for analysis.
+ - **`notebooks/`**: Jupyter notebooks detailing the EDA, Machine Learning models, and chatbot integration.
+ - **`scripts/`**: Python scripts for data preprocessing, model training, and chatbot functionality.
+ - **`models/`**: Saved machine learning models for predictions and analysis.
+ - **`chatbot/`**: Implementation of the Google Gen AI chatbot integrated with Pandas AI.
+ - **`dash.py`**: The Streamlit dashboard for visualizing data and interacting with the chatbot.
+
+ ## Data
+
+ The dataset used in this project includes:
+
+ - **Patient Encounter Data**: Age, SystolicBP, DiastolicBP, Temperature, Pulse, Weight, Height, BMI, Respiration, SPO2, and PHQ_9 Score.
+ - **Categorical Data**: LegalSex, BPLocation, BPPosition, PregnancyStatus, LactationStatus, TemperatureSource, and various health conditions.
+ - **Lab Test Components**: Twenty lab test components related to specific diseases.
+
+ ## Exploratory Data Analysis (EDA)
+
+ The EDA phase involves:
+
+ - **Data Cleaning**: Handling missing values, outliers, and inconsistent data entries.
+ - **Data Transformation**: Encoding categorical variables, scaling numerical data, and feature engineering.
+ - **Visualization**: Creating informative charts and graphs to explore data distributions, correlations, and trends.
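
Two of the cleaning and transformation steps listed above can be sketched with the standard library alone. This is only an illustration, assuming missing readings arrive as `None`: it fills gaps with the median and then applies z-score scaling. The helper names `impute_median` and `standardize` and the sample pulse values are hypothetical. `app.py` imports scikit-learn's `IterativeImputer` and `StandardScaler`, which this sketch does not reproduce.

```python
from statistics import mean, median, pstdev


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def standardize(values):
    """Scale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column carries no spread; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


pulse = [72, None, 88, 64, None, 80]  # hypothetical pulse readings
filled = impute_median(pulse)
scaled = standardize(filled)
```

Median imputation is used here only because it is simple to follow; an iterative imputer models each feature from the others and may behave quite differently on real data.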
+
+ ## Machine Learning
+
+ The machine learning phase includes:
+
+ - **Clustering Analysis**: Implementing K-Prototypes to group data into clusters based on numerical and categorical features.
+ - **PCA**: Reducing dimensionality for visualization and understanding key factors influencing clusters.
+ - **Predictive Modeling**: Training models to predict patient outcomes and identify high-risk groups.
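
K-Prototypes handles mixed data by adding a numeric distance to a weighted count of categorical mismatches. The sketch below shows that dissimilarity and a nearest-prototype assignment in plain Python. It is an illustration of the general idea, not the `kmodes` implementation: the function names, the prototype values, and the `gamma` weight are all assumptions chosen for the example.

```python
def mixed_dissimilarity(a_num, a_cat, b_num, b_cat, gamma=1.0):
    """Squared Euclidean distance on numeric fields plus
    gamma times the number of categorical mismatches."""
    numeric_part = sum((x - y) ** 2 for x, y in zip(a_num, b_num))
    categorical_part = sum(1 for x, y in zip(a_cat, b_cat) if x != y)
    return numeric_part + gamma * categorical_part


def assign_cluster(num, cat, prototypes, gamma=1.0):
    """Return the index of the prototype with the lowest dissimilarity."""
    costs = [
        mixed_dissimilarity(num, cat, p_num, p_cat, gamma)
        for p_num, p_cat in prototypes
    ]
    return costs.index(min(costs))


# Hypothetical prototypes: (scaled Age, scaled BMI), (LegalSex, BPPosition)
prototypes = [
    ((-1.0, -0.5), ("Female", "Sitting")),
    ((1.0, 0.8), ("Male", "Standing")),
]
patient = ((0.9, 0.6), ("Male", "Sitting"))
cluster = assign_cluster(patient[0], patient[1], prototypes, gamma=0.5)
```

The `gamma` weight controls how much a categorical mismatch counts relative to numeric distance, which is why numeric features are normally scaled before clustering.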
+
+ ## Google Gen AI Chatbot Integration
+
+ - **Pandas AI Integration**: The chatbot leverages Pandas AI to process data queries, perform EDA tasks, and generate visualizations.
+ - **Natural Language Interaction**: Users can chat with the AI to explore the data, ask questions, and receive detailed answers.
+ - **Interactive Dashboard**: A Streamlit-based dashboard that allows users to interact with the chatbot and visualize data insights.
+
+ ## Usage
+
+ 1. **Run EDA and ML Models**:
+    - Execute the Jupyter notebooks or Python scripts in the `notebooks/` and `scripts/` directories.
+
+ 2. **Interact with the Chatbot**:
+    - Launch the Streamlit app using the `dash.py` file.
+    - Use the chatbot to ask questions about the data, generate plots, and explore the dataset interactively.
+
+ 3. **View Results**:
+    - Access the clustered data, PCA plots, and predictions through the interactive dashboard.
+
+ ## Requirements
+
+ - Python 3.7+
+ - Pandas
+ - NumPy
+ - Scikit-learn
+ - Plotly
+ - Streamlit
+ - Pandas AI
+ - Google Gen AI API
+
+ ## Installation
+
+ ```bash
+ pip install -r requirements.txt
app.py ADDED
@@ -0,0 +1,1289 @@
+ # pip install streamlit wordcloud kmodes google-generativeai
+ import os
+ import warnings
+
+ warnings.filterwarnings("ignore")
+
+ import matplotlib.pyplot as plt
+ import numpy as np
+ import pandas as pd
+ import plotly.express as px
+ import plotly.figure_factory as ff
+ import plotly.graph_objects as go
+ import streamlit as st
+ from huggingface_hub import hf_hub_download
+ from kmodes.kprototypes import KPrototypes
+ from pandasai import SmartDataframe
+ from pandasai.llm.google_gemini import GoogleGemini
+ from pandasai.responses.response_parser import ResponseParser
+ from sklearn.decomposition import PCA
+ from sklearn.experimental import enable_iterative_imputer  # required before importing IterativeImputer
+ from sklearn.impute import IterativeImputer
+ from sklearn.preprocessing import LabelEncoder, StandardScaler
+ from wordcloud import WordCloud
+
+ # Dataset hosted on the Hugging Face Hub
+ repo_id = "Akankshg/ML_DATA"
+ filename = "EDA_DATA.parquet"
+
+ # Access the token
+ token = os.environ["HUGGING_FACE_HUB_TOKEN"]
+
+ # Download the file
+ local_file = hf_hub_download(repo_id=repo_id, filename=filename, repo_type="dataset", token=token)
+
+
+ class StreamlitResponse(ResponseParser):
+     """Render PandasAI results directly in the Streamlit page."""
+
+     def __init__(self, context) -> None:
+         super().__init__(context)
+
+     def format_dataframe(self, result):
+         st.dataframe(result["value"])
+         return
+
+     def format_plot(self, result):
+         st.image(result["value"])
+         return
+
+
+ st.set_page_config(page_title="Healthcare Data Analysis", page_icon=":bar_chart:", layout="wide")
+ st.title(':bar_chart: Healthcare Data Analysis Dashboard')
+ st.markdown('<style>div.block-container{padding-top:1rem;}</style>', unsafe_allow_html=True)
+
+ # Sidebar 1
+ st.sidebar.title('Dashboard Options')
+ analysis_option = st.sidebar.selectbox('Select Analysis', ['Data', 'EDA', 'Machine Learning', 'Health Care Chat Bot AI'])
+
+ ## Loading data
+ @st.cache_data()
+ def fetch_data():
+     data = pd.read_parquet(local_file)
+     return data
+
+ data = fetch_data()
+
+ def funnel_chart(df):
+     """Funnel chart of patient visits per weekday, split by legal sex."""
+     patient_visit = df[['PatientID', 'EncounterDate', 'LegalSex']].copy()
+     patient_visit['WeekDay'] = patient_visit['EncounterDate'].dt.day_name().astype('string')
+     output_df = patient_visit.groupby(['WeekDay', 'LegalSex']).size().unstack(fill_value=0)
+     desired_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
+     output_df = output_df.reindex(desired_order).reset_index()
+     stages = output_df['WeekDay']
+     # Build one frame per sex that is actually present in the data
+     frames = []
+     for gender in ('Male', 'Female'):
+         if gender in output_df.columns:
+             frame = pd.DataFrame(dict(number=output_df[gender], stage=stages))
+             frame['Gender'] = gender
+             frames.append(frame)
+     df_graph = pd.concat(frames, axis=0)
+     colors = {'Male': '#2986cc', 'Female': '#c90076'}
+     fig2 = px.funnel(df_graph, x='number', y='stage', color='Gender', color_discrete_map=colors, title='Patient Visits by Gender and Weekday')
+     fig2.update_layout(
+         template="plotly_dark",
+         xaxis_title='Number of Patients',
+         yaxis_title='Weekday',
+         height=500, width=250
+     )
+     return fig2
+
+ def scatter_man(data):
+     """Bubble chart of diagnosis counts by age, limited to groups with more than 1000 records."""
+     patient_analysis = data[['PatientID', 'GroupedICD', 'Description', 'Age']].copy()
+     # Keep only rows that have both a diagnosis code and an age
+     patients = patient_analysis[patient_analysis['GroupedICD'].notna() & patient_analysis['Age'].notna()]
+
+     patients_icd_counts = patients.groupby(['Age', 'GroupedICD', 'Description']).size().reset_index(name='Count')
+     patients_icd_counts = patients_icd_counts[patients_icd_counts['Count'] > 1000]
+     # Scatter plot
+     fig5 = px.scatter(patients_icd_counts, y='Age', x='Description', size='Count',
+                       hover_name='Age', color='Count', title='Age - ICD Relationship', color_continuous_scale='ylorrd')
+     fig5.update_layout(template="plotly_dark", xaxis_title='ICD Code', yaxis_title='Age', coloraxis_colorbar=dict(title='Count'),
+                        height=950, width=1400)
+     return fig5
+
+
+ def barplot_lab(df):
+     """Bar chart of the 20 most frequent lab component and diagnosis pairs among visits within 7 days of the previous one."""
+     df = df[['PatientID', 'EncounterDate', 'ComponentName', 'GroupedICD', 'Description']].copy()
+     df.sort_values(by=['EncounterDate'], ascending=True, inplace=True)
+     # Days elapsed since the same patient's previous encounter (NaN for a first visit)
+     df['DaysSinceLastVisit'] = df.groupby('PatientID')['EncounterDate'].diff().dt.days
+     df = df[df['DaysSinceLastVisit'] <= 7]
+     lab = df[df['ComponentName'].notna() & df['GroupedICD'].notna()].copy()
+     component = lab.groupby(['ComponentName', 'Description']).size().reset_index(name='Count')
+     top_components = component.sort_values(by='Count', ascending=False)[:20].copy()
+     fig3 = px.bar(top_components, x='ComponentName', y='Count',
+                   hover_data=['ComponentName', 'Count'], color='ComponentName', height=450, title='Lab Test')
+     fig3.update_xaxes(tickangle=45)
+     return fig3
+
+ def scatterplot(df):
+     """Bubble chart of lab component against diagnosis for pairs with more than 2000 records."""
+     df = df[['PatientID', 'EncounterDate', 'ComponentName', 'GroupedICD', 'Description']].copy()
+     df.sort_values(by=['EncounterDate'], ascending=True, inplace=True)
+     # Days elapsed since the same patient's previous encounter (NaN for a first visit)
+     df['DaysSinceLastVisit'] = df.groupby('PatientID')['EncounterDate'].diff().dt.days
+     df = df[df['DaysSinceLastVisit'] <= 7]
+     lab = df[df['ComponentName'].notna() & df['GroupedICD'].notna()].copy()
+     component = lab.groupby(['ComponentName', 'Description']).size().reset_index(name='Count')
+     component = component[component['Count'] > 2000]
+     fig = px.scatter(component, y='ComponentName', x='Description', size='Count',
+                      hover_name='ComponentName', color='Count', title='Lab Component-ICD Relationship')
+     fig.update_layout(template="plotly_dark", xaxis_title='ICD Code', yaxis_title='Component Name', coloraxis_colorbar=dict(title='Count'),
+                       height=550, width=500)
+     return fig
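
Both lab-test charts rely on the same filter: sort encounters by date, compute each patient's gap since their previous visit, and keep rows where that gap is at most 7 days. The plain-Python sketch below mirrors that logic on hypothetical records so the effect of the `groupby(...).diff()` step is easier to see. The function name and sample data are illustrative assumptions, not part of `app.py`.

```python
from datetime import date


def days_since_last_visit(encounters):
    """Return (patient_id, visit_date, gap_in_days) in date order.

    The gap is None for a patient's first recorded visit.
    """
    last_seen = {}
    result = []
    for patient_id, visit_date in sorted(encounters, key=lambda rec: rec[1]):
        previous = last_seen.get(patient_id)
        gap = (visit_date - previous).days if previous is not None else None
        result.append((patient_id, visit_date, gap))
        last_seen[patient_id] = visit_date
    return result


encounters = [
    ("P1", date(2024, 1, 1)),
    ("P2", date(2024, 1, 2)),
    ("P1", date(2024, 1, 5)),
    ("P1", date(2024, 2, 20)),
    ("P2", date(2024, 1, 4)),
]
gaps = days_since_last_visit(encounters)
# First visits have no gap, so they drop out of the filter,
# just as NaN fails the `<= 7` comparison in pandas.
follow_ups = [rec for rec in gaps if rec[2] is not None and rec[2] <= 7]
```

One consequence worth noting is that every patient's first encounter is excluded, so the charts describe follow-up visits only.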
+
+ ####################################### EDA ##################################################################
+ def histplot_6(data):
+     """Age histogram coloured by legal sex."""
+     disease_data = data[['Age', 'LegalSex']].copy()
+     disease_data = disease_data[disease_data['Age'].notna() & disease_data['LegalSex'].notna()].copy()
+     fig = px.histogram(disease_data,
+                        x='Age',
+                        color='LegalSex',
+                        nbins=10,
+                        opacity=0.5,
+                        title='Age Distribution by Legal Sex',
+                        color_discrete_sequence=px.colors.qualitative.Pastel)
+
+     # Update layout to match your desired style
+     fig.update_layout(
+         title_font=dict(size=20, color='white'),
+         xaxis_title_font=dict(size=16, color='white'),
+         yaxis_title_font=dict(size=16, color='white'),
+         xaxis=dict(tickfont=dict(size=14, color='white')),
+         yaxis=dict(tickfont=dict(size=14, color='white'))
+     )
+
+     return fig
+
+
+ def histplot_7(data):
+     """Line chart of patient counts per age for each BP severity level other than normal or unknown."""
+     graph3_data = data[['Age', 'BP Severity']].copy()
+     graph3_data = graph3_data[graph3_data['BP Severity'].notna()]
+     graph3_data = graph3_data[graph3_data['BP Severity'] != 'Unknown']
+     graph3_data = graph3_data[graph3_data['BP Severity'] != 'BP NORMAL']
+
+     severities = graph3_data['BP Severity'].unique()
+     lines = []
+
+     for severity in severities:
+         severity_data = graph3_data[graph3_data['BP Severity'] == severity]
+         age_counts = severity_data['Age'].value_counts().sort_index()
+         lines.append(go.Scatter(x=age_counts.index, y=age_counts.values, mode='lines+markers', name=severity))
+
+     fig = go.Figure(data=lines)
+
+     fig.update_layout(
+         title='Age Distribution by BP Severity',
+         xaxis_title='Age',
+         yaxis_title='Count',
+         title_font=dict(size=20, color='white')
+     )
+
+     return fig
+
+
+ def pie_chart_7(data):
+     """Donut chart of depression severity, excluding the 'None-minimal' and 'Unknown' categories."""
+     # Prepare data
+     graph_4 = data[['Depression Severity']].copy()
+     graph_4 = graph_4[graph_4['Depression Severity'] != 'None-minimal']
+     graph_4 = graph_4[graph_4['Depression Severity'] != 'Unknown']
+     severity_counts = graph_4['Depression Severity'].value_counts()
+
+     # Define colors
+     colors_inner = ['#FF5733', '#FFC300', '#36A2EB', '#C71585']
+
+     # Create plotly figure
+     fig = go.Figure()
+
+     # Add donut chart
+     fig.add_trace(go.Pie(
+         labels=severity_counts.index,
+         values=severity_counts,
+         hole=0.6,  # Hole size for donut chart
+         marker=dict(colors=colors_inner),
+         textinfo='label+percent',
+         textfont=dict(size=10),
+         insidetextorientation='radial'
+     ))
+
+     # Update layout for title and appearance
+     fig.update_layout(
+         title_text="Distribution of Patients by Depression",
+         title_font_size=20,
+         title_font_color='white',
+         # paper_bgcolor='black',
+         # plot_bgcolor='black',
+         autosize=False,
+         # width=500,
+         # height=450,
+     )
+
+     return fig
+
+ def chart_8(data):
+     """Grouped box plot of BMI by BP severity, one box per legal sex."""
+     graph_5 = data[['BP Severity', 'BMI', 'LegalSex']].copy()
+     graph_5 = graph_5.dropna(subset=['BP Severity', 'BMI', 'LegalSex'])
+     graph_5 = graph_5[graph_5['BP Severity'] != 'Unknown']
+     graph_5 = graph_5[graph_5['BP Severity'] != 'BP NORMAL']
+
+     # Create box plot
+     fig = go.Figure()
+
+     # Add box plot traces for each gender
+     for gender in graph_5['LegalSex'].unique():
+         filtered_data = graph_5[graph_5['LegalSex'] == gender]
+         fig.add_trace(go.Box(
+             y=filtered_data['BMI'],
+             x=filtered_data['BP Severity'],
+             name=gender,
+             boxmean='sd',  # Show mean and standard deviation
+             marker_color='#1f77b4' if gender == 'Male' else '#ff7f0e',  # Different colors for genders
+             text=filtered_data['BP Severity'],  # Adding text for tooltips
+             hoverinfo='y+name+text'
+         ))
+
+     # Update layout with titles, axis labels, and other properties
+     fig.update_layout(
+         title='BMI by BP Severity and Legal Sex',
+         title_font=dict(size=20, color='white'),
+         xaxis_title='BP Severity',
+         yaxis_title='BMI',
+         xaxis=dict(tickfont=dict(size=14, color='white')),
+         yaxis=dict(tickfont=dict(size=14, color='white')),
+         boxmode='group',  # Group box plots by BP Severity
+         height=600,  # Set the height of the figure
+         width=800,  # Set the width of the figure
+         # paper_bgcolor='#FAF5E6',
+         # plot_bgcolor='#FAF5E6'
+     )
+
+     return fig
+
+
+ def chart_9(data):
+     """Annotated correlation heatmap of the numeric features."""
+     disease_data = data.select_dtypes(include=['int64', 'float64']).copy()
+     # PatientID is an identifier, not a measurement; errors='ignore' covers the case where it is not numeric
+     disease_data.drop(columns=['PatientID'], inplace=True, errors='ignore')
+
+     # Calculate the correlation matrix
+     corrmat = disease_data.corr()
+     corrmat.fillna(0, inplace=True)
+
+     # Create a heatmap using Plotly
+     fig = go.Figure(data=go.Heatmap(
+         z=corrmat.values,
+         x=corrmat.columns,
+         y=corrmat.columns,
+         colorscale='RdYlGn',
+         # colorbar=dict(title='Correlation', tickvals=[-1, 0, 1], ticktext=['-1', '0', '1']),
+         text=corrmat.round(2).values,  # Add annotations
+         texttemplate="%{text:.2f}",  # Format annotations
+         textfont=dict(size=12, color='black')  # Set annotation font size and color
+     ))
+
+     # Update layout
+     fig.update_layout(
+         title='Which Feature is Mainly Involved',
+         title_font=dict(size=20, color='white'),
+         xaxis_title='Features',
+         yaxis_title='Features',
+         xaxis=dict(tickfont=dict(size=14, color='white')),
+         yaxis=dict(tickfont=dict(size=14, color='white')),
+         height=600,  # Set the height of the figure
+         width=800  # Set the width of the figure
+     )
+
+     return fig
+
+ def chart_10(data):
+     """Split violin plot of age by depression severity and legal sex."""
+     graph_7 = data.copy()
+     graph_7 = graph_7[graph_7['Depression Severity'] != 'None-minimal']
+     graph_7 = graph_7[graph_7['Depression Severity'] != 'Unknown']
+     graph_7['Age'] = pd.to_numeric(graph_7['Age'], errors='coerce')
+     graph_7 = graph_7.dropna(subset=['Age', 'Depression Severity', 'LegalSex'])
+
+     # Create the violin plot
+     fig = go.Figure()
+
+     for sex in graph_7['LegalSex'].unique():
+         fig.add_trace(go.Violin(
+             x=graph_7['Depression Severity'][graph_7['LegalSex'] == sex],
+             y=graph_7['Age'][graph_7['LegalSex'] == sex],
+             legendgroup=sex, scalegroup=sex, name=sex, side='negative' if sex == 'Female' else 'positive',
+             line_color='blue' if sex == 'Female' else 'orange'
+         ))
+
+     # Update the layout
+     fig.update_layout(
+         title="Age by Depression Severity and Legal Sex",
+         xaxis_title="Depression Severity",
+         yaxis_title="Age",
+         xaxis=dict(tickmode='array', tickvals=graph_7['Depression Severity'].unique(), tickangle=20),
+         yaxis=dict(range=[0, 80]),
+         violingap=0.2,  # gap between violins
+         violingroupgap=0.3,  # gap between groups
+         violinmode='overlay',  # plot violins over each other
+         font=dict(color='white', size=14),
+         title_font=dict(size=20, color='white'),
+         xaxis_tickfont=dict(size=14, color='white'),
+         yaxis_tickfont=dict(size=14, color='white'),
+         paper_bgcolor='rgba(0,0,0,0)',
+         plot_bgcolor='rgba(0,0,0,0)',
+         showlegend=True
+     )
+
+     return fig
+
+
+ def feature_analytics(disease_data):
+     """Split the correlation-selected numeric features into vital-sign columns and all remaining columns."""
+     corrmat = disease_data.corr(numeric_only=True)
+     corr_threshold = 0.7
+     selected_features = []
+     # Note: a column's self-correlation (1.0 whenever it is defined) also exceeds the threshold,
+     # so the diagonal alone is enough for a column to be selected.
+     for column in corrmat.columns:
+         correlated_features = corrmat.index[corrmat[column] > corr_threshold].tolist()
+         if correlated_features:
+             selected_features.extend(correlated_features)
+     selected_features = list(set(selected_features))
+     values_to_pop = ['Weight', 'DiastolicBP', 'SystolicBP', 'ComponentValue', 'Height', 'Age', 'BMI']
+     for value in values_to_pop:
+         if value in selected_features:
+             selected_features.remove(value)
+     values_to_find = ['PeakFlow', 'Temperature', 'Respiration', 'Pulse', 'SPO2']
+     found_values = []
+     not_found_values = []
+     for value in selected_features:
+         if value in values_to_find:
+             found_values.append(value)
+         else:
+             not_found_values.append(value)
+     return found_values, not_found_values
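
Thresholding a correlation matrix column by column also matches each feature against itself, because the diagonal is 1.0. A hedged alternative is to compare distinct pairs only, as in the standard-library sketch below. The helper names `pearson` and `correlated_pairs` and the sample measurements are hypothetical, and this is not how `app.py` selects features.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlated_pairs(columns, threshold=0.7):
    """Return (name_a, name_b) pairs whose correlation exceeds the threshold.

    combinations() never pairs a column with itself, so the diagonal is skipped.
    """
    return [
        (a, b)
        for a, b in combinations(columns, 2)
        if pearson(columns[a], columns[b]) > threshold
    ]


# Hypothetical measurements for five patients
columns = {
    "Weight": [60, 72, 85, 90, 110],
    "BMI": [21, 24, 28, 30, 36],
    "Pulse": [80, 64, 90, 70, 75],
}
pairs = correlated_pairs(columns)
```

With this data only Weight and BMI pass the 0.7 threshold, while Pulse is essentially uncorrelated with both.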
+
+
+
+ def chart_11(disease_data):
+     """Scatter plot of the first selected vital-sign feature against age, with a line at the feature mean."""
+     vital_features, _ = feature_analytics(disease_data)
+     top_feature = vital_features[0]
+     graph_8 = disease_data.copy()
+     graph_8 = graph_8.dropna(subset=[top_feature, 'Age', 'LegalSex'])
+
+     # Create the scatter plot with Plotly
+     fig = px.scatter(
+         graph_8,
+         x=top_feature,
+         y="Age",
+         color="LegalSex",
+         color_discrete_sequence=px.colors.qualitative.Set2,
+         title=f'Age group: {top_feature}',
+         labels={top_feature: top_feature, 'Age': 'Age'},
+         size_max=200
+     )
+
+     # Add vertical line at the mean
+     mean_value = graph_8[top_feature].mean()
+     fig.add_vline(x=mean_value, line=dict(color='red', dash='dash'))
+
+     # Customize the layout
+     fig.update_layout(
+         title_font=dict(size=20, color='white'),
+         xaxis_title_font=dict(size=16, color='white'),
+         yaxis_title_font=dict(size=16, color='white'),
+         xaxis=dict(tickangle=20, tickfont=dict(size=14, color='white')),
+         yaxis=dict(tickfont=dict(size=14, color='white'), range=[0, 80]),
+         plot_bgcolor='black',
+         paper_bgcolor='black'
+     )
+
+     return fig
+
+
+
+
+ def chart_12(filtered_data):
+     """Word cloud built from the distinct immunization names."""
+     graph_10 = filtered_data.copy()
+     no_nan = graph_10.dropna(subset=['ImmunizationName'])
+     immu = list(no_nan['ImmunizationName'])
+     names = [item for item in immu if item and not pd.isna(item)]
+     unique_values = set(names)
+     text = ' '.join(unique_values)
+     text = text.strip(', ').replace(',', '')
+     title = "Immunization Word Cloud"
+     cloud = WordCloud(scale=3,
+                       max_words=150,
+                       colormap='RdYlGn',
+                       mask=None,
+                       background_color='white',
+                       stopwords=None,
+                       collocations=True,
+                       contour_color='black',
+                       contour_width=1).generate(text)
+     # Draw the cloud on a Matplotlib figure and return it, since plt.show() does not render into a Streamlit page
+     fig, ax = plt.subplots()
+     ax.imshow(cloud, interpolation='bilinear')
+     ax.axis('off')
+     ax.set_title(title)
+     return fig
+
+
+
+ def mean_of_values(cell_value):
+     """Average a comma-separated string of numbers; missing cells stay NaN."""
+     if pd.isna(cell_value):  # Check if cell value is NaN
+         return np.nan
+     values = [float(val) for val in cell_value.split(',')]
+     return sum(values) / len(values)
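
A quick way to check the comma-separated averaging is to run it on a few sample cells. The sketch below is a dependency-free variant for illustration only: it swaps `pd.isna` and `np.nan` for `math` equivalents so it runs without pandas, and the sample readings are hypothetical.

```python
import math


def mean_of_values(cell_value):
    """Average a comma-separated string of numbers; missing cells stay NaN."""
    # Stand-in for pd.isna: treat None and float NaN as missing.
    if cell_value is None or (isinstance(cell_value, float) and math.isnan(cell_value)):
        return math.nan
    values = [float(val) for val in cell_value.split(',')]
    return sum(values) / len(values)


readings = ["120,130", "98.6", None]  # hypothetical cell contents
averages = [mean_of_values(cell) for cell in readings]
```

A cell holding a single number comes back unchanged as a float, and a cell with non-numeric text would raise `ValueError`, which the original function does not guard against either.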
+
+ def plots(original_data):
+     a = original_data.copy()
+     st.subheader("Clustering Analysis")
+     col1, col2 = st.columns(2)
+     ## 1
+     cluster_counts = a['cluster'].value_counts().reset_index()
+     cluster_counts.columns = ['cluster', 'count']  # Rename columns
+     fig_1 = px.bar(cluster_counts, y='cluster', x='count',
+                    labels={'cluster': 'Cluster', 'count': 'Count'},
+                    text_auto=True,  # text_auto=True displays the count on top of the bars
+                    color='cluster',  # Assign different colors to each bar
+                    color_continuous_scale='plasma',  # Use the plasma color scale
+                    category_orders={'cluster': [0, 1, 2, 3, 4]},  # Set the order of clusters
+                    )
+
+     custom_labels = {0: 'Cluster 0', 1: 'Cluster 1', 2: 'Cluster 2', 3: 'Cluster 3', 4: 'Cluster 4'}
+     fig_1.update_yaxes(tickvals=[0, 1, 2, 3, 4], ticktext=list(custom_labels.values()))
+
+     fig_1.update_layout(
+         title={'text': "Count of Data Points per Cluster", 'y': 0.95, 'x': 0.5, 'xanchor': 'center', 'yanchor': 'top'},
+         yaxis_title='Cluster', xaxis_title='Count',
+         xaxis=dict(showline=False, showgrid=False, zeroline=False, tickfont=dict(size=14, color='white')),
+         yaxis=dict(showline=False, showgrid=False, zeroline=False, tickfont=dict(size=14, color='white')),
+         title_font=dict(color='white', size=18),
+         # plot_bgcolor='black',  # Background color
+         # paper_bgcolor='black',  # Paper background color
+         title_x=0.5,  # Center the title
+         legend=dict(
+             font=dict(size=16, color='white'),
+             bgcolor='rgba(0,0,0,0)'
+         ))
+     col1.plotly_chart(fig_1, use_container_width=True)
+
+     ## 2
+     fig_2 = px.scatter(a, x='Age', y='BMI',
+                        color='cluster',
+                        title="Cluster's Profile Based On Age And BMI",
+                        color_continuous_scale='plasma')  # Use the plasma color palette
+
+     fig_2.update_layout(
+         title={'text': "Cluster's Profile Based On Age And BMI", 'y': 0.95, 'x': 0.5, 'xanchor': 'center', 'yanchor': 'top'},
+         xaxis=dict(showgrid=False, showticklabels=False, zeroline=False),
+         yaxis=dict(showgrid=False, showticklabels=False, zeroline=False),
+         # plot_bgcolor='black',  # Background color
+         # paper_bgcolor='black',  # Paper background color
+         title_font=dict(color='white', size=18),  # Title font color and size
+         margin=dict(l=20, r=20, t=40, b=20),  # Set margins to make the plot more compact
+         legend=dict(
+             font=dict(size=16, color='white'),
+             bgcolor='rgba(0,0,0,0)'
+         )
+     )
+     fig_2.update_traces(marker=dict(size=12, line=dict(width=2, color='DarkSlateGrey')))
+
+     col2.plotly_chart(fig_2, use_container_width=True)
+
+     col3, col4 = st.columns(2)
+     ## 3
+     palette = ['#636EFA', '#EF553B']  # Adjust the colors as needed
+     fig_3 = go.Figure()
+     for i, sex in enumerate(a['LegalSex'].unique()):
+         fig_3.add_trace(go.Box(
+             y=a[a['LegalSex'] == sex]['cluster'],
+             name=f'Legal Sex: {sex}',
+             marker_color=palette[i % len(palette)],  # Cycle through the palette so extra categories cannot exhaust it
+             boxmean=True
+         ))
+     fig_3.update_layout(
+         title={'text': "Clusters Distribution by Legal Sex", 'y': 0.95, 'x': 0.5, 'xanchor': 'center', 'yanchor': 'top'},
+         title_font=dict(color='white', size=18),
+         # plot_bgcolor='black',
+         # paper_bgcolor='black',
+         xaxis=dict(showline=False, showgrid=False, zeroline=False, tickfont=dict(size=14, color='white')),
+         yaxis=dict(showline=False, showgrid=False, zeroline=False, tickfont=dict(size=14, color='white')),
+         # plot_bgcolor='rgba(0,0,0,0)',
+         # paper_bgcolor='rgba(0,0,0,0)',
+         title_font_color='white',
+         showlegend=True,
+         legend=dict(
+             font=dict(size=16, color='white'),
+             bgcolor='rgba(0,0,0,0)'
+         )
+     )
+
+     col3.plotly_chart(fig_3, use_container_width=True)
+
+     ## 4
+     # palette = ['#636EFA', '#EF553B', '#00CC96', '#AB63FA', '#FFA15A']  # Example palette
+     fig_4 = px.violin(
+         a,
+         x="BP Severity",
+         y="cluster",
+         color="BP Severity",
+         color_discrete_sequence=px.colors.qualitative.Vivid,
+         box=True,  # Adds a box plot inside the violin plot for more detail
+         points="all",  # Shows all data points
+         title="Clusters Distribution by BP Severity"
+     )
+     fig_4.update_layout(
+         title={'text': "Clusters Distribution by BP Severity", 'y': 0.95, 'x': 0.5, 'xanchor': 'center', 'yanchor': 'top'},
+         title_font=dict(color='white', size=18),
+         xaxis_title="BP Severity",
+         yaxis_title="Cluster",
+         # plot_bgcolor='black',
+         # paper_bgcolor='black',
+         xaxis_title_font=dict(size=16, color='white'),
+         yaxis_title_font=dict(size=16, color='white'),
+         xaxis=dict(showline=False, showgrid=False, zeroline=False, tickfont=dict(size=14, color='white')),
+         yaxis=dict(showline=False, showgrid=False, zeroline=False, tickfont=dict(size=14, color='white')),
+         title_font_color='white',
+         legend=dict(
+             font=dict(size=16, color='white'),
+             bgcolor='rgba(0,0,0,0)'
+         )
+     )
+
+     fig_4.update_xaxes(tickangle=45)  # Rotate the x-axis labels for better readability
+
+     col4.plotly_chart(fig_4, use_container_width=True)
610
+
611
+ col5, col6 = st.columns(2)
612
+ ## 5
613
+ fig_5 = px.histogram(a, x="Depression Severity", color="cluster",
614
+ color_discrete_sequence=px.colors.diverging.RdYlBu,
615
+ title='Clusters Distribution by Depression Severity')
616
+
617
+ # Update layout to make it more attractive
618
+ fig_5.update_layout(
619
+ title={'text':"Clusters Distribution by Depression Severity", 'y': 0.95, 'x': 0.5, 'xanchor': 'center', 'yanchor': 'top'},
620
+ title_font=dict(color='white', size=18),
621
+ # plot_bgcolor='black',
622
+ # paper_bgcolor='black',
623
+ title_font_color='white',
624
+ xaxis_title='Depression Severity',
625
+ yaxis_title='Count',
626
+ xaxis_title_font_color='white',
627
+ yaxis_title_font_color='white',
628
+ legend=dict(
629
+ font=dict(size=16, color='white'),
630
+ bgcolor='rgba(0,0,0,0)'
631
+ ),
632
+ xaxis=dict(
633
+ tickfont=dict(color='white', size=14),
634
+ title_font=dict(color='white', size=16),
635
+ showline=False,
636
+ showgrid=False,
637
+ ticks=''
638
+ ),
639
+ yaxis=dict(
640
+ tickfont=dict(color='white', size=14),
641
+ title_font=dict(color='white', size=16),
642
+ showline=False,
643
+ showgrid=False,
644
+ ticks=''
645
+ ),
646
+ coloraxis_colorbar=dict(
647
+ tickfont=dict(color='white')
648
+ )
649
+ )
650
+
651
+ # Show the plot
652
+ col5.plotly_chart(fig_5,use_container_width=True)
653
+
654
+ ## 6
655
+ fig_6 = px.violin(a, y="cluster", x="Temp_condition", box=True, points="all",
656
+ color="Temp_condition", color_discrete_sequence=px.colors.diverging.RdYlBu,
657
+ title='Clusters Distribution by Temp_condition')
658
+
659
+ # Update layout to make it more attractive
660
+ fig_6.update_layout(
661
+ title={'text':"Clusters Distribution by Temp_condition", 'y': 0.95, 'x': 0.5, 'xanchor': 'center', 'yanchor': 'top'},
662
+ title_font=dict(color='white', size=18),
663
+ # plot_bgcolor='black',
664
+ # paper_bgcolor='black',
665
+ title_font_color='white',
666
+ xaxis_title='Temp_condition',
667
+ yaxis_title='Clusters',
668
+ xaxis_title_font_color='white',
669
+ yaxis_title_font_color='white',
670
+ legend=dict(
671
+ font=dict(size=16, color='white'),
672
+ bgcolor='rgba(0,0,0,0)'
673
+ ),
674
+ xaxis=dict(
675
+ tickfont=dict(color='white', size=14),
676
+ title_font=dict(color='white', size=16),
677
+ showline=False,
678
+ showgrid=False,
679
+ ticks=''
680
+ ),
681
+ yaxis=dict(
682
+ tickfont=dict(color='white', size=14),
683
+ title_font=dict(color='white', size=16),
684
+ showline=False,
685
+ showgrid=False,
686
+ ticks=''
687
+ ),
688
+ coloraxis_colorbar=dict(
689
+ tickfont=dict(color='white')
690
+ )
691
+ )
692
+
693
+ # Show the plot
694
+ col6.plotly_chart(fig_6,use_container_width=True)
695
+
696
+ col7, col8 = st.columns(2)
697
+
698
+ ##7
699
+ # Create the stacked bar chart
700
+ ad = a.groupby(['weight_condition', 'cluster']).size().reset_index(name='count')
701
+
702
+ fig_7 = px.bar(ad,
703
+ x='weight_condition',
704
+ y='count',
705
+ color='cluster',
706
+ title='Clusters Distribution by Weight Condition',
707
+ text='count',
708
+ barmode='stack',
709
+ color_discrete_sequence=px.colors.diverging.RdYlBu) # Use a color scale or palette of your choice
710
+
711
+ # Update layout to make it more attractive and remove axes elements
712
+ fig_7.update_layout(
713
+ title={'text': 'Clusters Distribution by Weight Condition', 'y': 0.95, 'x': 0.5, 'xanchor': 'center', 'yanchor': 'top'},
714
+ title_font=dict(color='white', size=18),
715
+ xaxis=dict(
716
+ title='', # Remove x-axis title
717
+ showline=False,
718
+ showgrid=False,
719
+ zeroline=False,
720
+ tickfont=dict(size=14, color='white'),
721
+ tickangle=45 # Rotate x-axis labels for better readability
722
+ ),
723
+ yaxis=dict(
724
+ title='', # Remove y-axis title
725
+ showline=False,
726
+ showgrid=False,
727
+ zeroline=False,
728
+ tickfont=dict(size=14, color='white')
729
+ ),
730
+ # plot_bgcolor='black', # Background color
731
+ # paper_bgcolor='black', # Paper background color
732
+ margin=dict(l=20, r=20, t=40, b=20), # Set margins to make the plot more compact
733
+ legend=dict(
734
+ font=dict(size=16, color='white'),
735
+ bgcolor='rgba(0,0,0,0)'
736
+ )
737
+ )
738
+
739
+ # Update bar text style
740
+ fig_7.update_traces(texttemplate='%{text:.2s}', textfont_size=14, textposition='inside', marker=dict(line=dict(width=1, color='DarkSlateGrey')))
741
+
742
+ # Show the plot
743
+ col7.plotly_chart(fig_7,use_container_width=True)
744
+
745
+
746
+ ## 8
747
+ fig_8 = px.box(a,
748
+ x='SPO2_condition',
749
+ y='Age',
750
+ points='all', # Show all points
751
+ title="Clusters Distribution by SPO2_condition",
752
+ color='cluster',
753
+ color_discrete_sequence=px.colors.sequential.Plasma_r)
754
+
755
+ # Update layout to remove axes titles, labels, and gridlines, and style the chart
756
+ fig_8.update_layout(
757
+ title={'text': "Clusters Distribution by SPO2_condition", 'y': 0.95, 'x': 0.5, 'xanchor': 'center', 'yanchor': 'top'},
758
+ title_font=dict(color='white', size=18),
759
+ xaxis=dict(showline=False, showgrid=False, zeroline=False, tickfont=dict(size=14, color='white')),
760
+ yaxis=dict(showline=False, showgrid=False, zeroline=False, tickfont=dict(size=14, color='white')),
761
+ # plot_bgcolor='black', # Background color
762
+ # paper_bgcolor='black', # Paper background color
763
+ margin=dict(l=20, r=20, t=40, b=20), # Set margins to make the plot more compact
764
+ legend=dict(
765
+ font=dict(size=16, color='white'),
766
+ bgcolor='rgba(0,0,0,0)'
767
+ )
768
+ )
769
+
770
+ # Customize the box plot appearance
771
+ fig_8.update_traces(
772
+ boxmean=True, # Add mean line
773
+ jitter=0.3, # Spread points along x-axis
774
+ marker=dict(size=10, line=dict(width=2, color='DarkSlateGrey'))
775
+ )
776
+
777
+ # Show the plot
778
+ col8.plotly_chart(fig_8,use_container_width=True)
779
+
780
+ col_11 = st.columns(1)[0]
781
+ fig_11 = px.scatter_matrix(
782
+ a[['Age', 'SystolicBP', 'Pulse', 'Weight', 'BMI', 'cluster']],
783
+ dimensions=['Age', 'SystolicBP', 'Pulse', 'Weight', 'BMI'],
784
+ color='cluster',
785
+ title="Scatter Matrix of Selected Features by Cluster",
786
+ labels={col: col for col in ['Age', 'SystolicBP', 'Pulse', 'Weight', 'BMI']},
787
+ color_continuous_scale= px.colors.diverging.Spectral
788
+ )
789
+
790
+ # Update layout for better visualization
791
+ fig_11.update_traces(diagonal_visible=True)
792
+ fig_11.update_layout(height=700, width=700, showlegend=True)
793
+
794
+ # Show the plot
795
+ col_11.plotly_chart(fig_11,use_container_width=True)
796
+ #
797
+
798
+ ##### Cluster summary
799
+ st.subheader("Summary")
800
+ meanvalue_columns = [col for col in list(a.columns) if 'meanvalue' in col]
801
+ # Group data by clusters
802
+ grouped_data = a.groupby('cluster')
803
+
804
+ # Calculate mean for numerical columns
805
+ numerical_columns = a.select_dtypes(include=['number']).columns
806
+ numerical_summary = grouped_data[numerical_columns].mean()
807
+
808
+ # Calculate mode for categorical columns
809
+ categorical_columns = a.select_dtypes(include=['object', 'category','string']).columns
810
+ categorical_summary = grouped_data[categorical_columns].agg(lambda x: x.value_counts().index[0])
811
+
812
+ for i in range(len(a['cluster'].value_counts())):
813
+ # Collect the headline traits for cluster i
814
+ cluster_traits = {
815
+ "Age": numerical_summary.loc[i, 'Age'],
816
+ "Age_Category": categorical_summary.loc[i,"Age_Category"],
817
+ "SystolicBP": numerical_summary.loc[i, 'SystolicBP'],
818
+ "Depression Severity": categorical_summary.loc[i, 'Depression Severity'],
819
+ "Weight Condition" : categorical_summary.loc[i, 'weight_condition'],
820
+ "BP Severity" : categorical_summary.loc[i, 'BP Severity'],
821
+ "Pulse_condition" : categorical_summary.loc[i, 'Pulse_condition'],
822
+ "Respiration_condition" : categorical_summary.loc[i, 'Respiration_condition'],
823
+ "SPO2_condition" : categorical_summary.loc[i, 'SPO2_condition'],
824
+
825
+ }
826
+
827
+ # if numerical_summary.loc[i, 'GLUCOSE_meanvalue'] > 100:
828
+ # glucose_condition = "High frequency of patients with slightly elevated glucose levels."
829
+ # else:
830
+ # glucose_condition = "Normal glucose levels."
831
+
832
+
833
+
834
+ # Writing the summary
835
+ summary = f"""
836
+ Cluster - {i} Traits
837
+ 1. Age: Average age is {round(cluster_traits['Age'])} years.
838
+ 2. SystolicBP: Average systolic blood pressure is {round(cluster_traits['SystolicBP'], 1)} mmHg.
839
+ 3. Depression Severity: Predominantly '{cluster_traits['Depression Severity']}'.
840
+ 4. Weight Condition: {cluster_traits['Weight Condition']}.
841
+ 5. Respiration Condition: {cluster_traits['Respiration_condition']}.
842
+ 6. Pulse Condition: {cluster_traits['Pulse_condition']}.
843
+ 7. SPO2 Condition: {cluster_traits['SPO2_condition']}.
844
+
845
+ Trait Summary: Cluster {i} mainly consists of {cluster_traits['Age_Category']} individuals with '{cluster_traits['Depression Severity']}' depression severity and a BP severity of '{cluster_traits['BP Severity'].lower()}'.
846
+ """
847
+
848
+ st.write(summary)
849
+ st.write(round(numerical_summary[meanvalue_columns],2))
850
+
851
+ st.subheader("Density Contour Plot")
852
+ with st.container():
853
+ # Loop through the columns and create plots
854
+ for i in meanvalue_columns:
855
+ fig = px.density_contour(
856
+ a, # DataFrame that carries the cluster labels
857
+ y="Age",
858
+ x=i,
859
+ color="cluster",
860
+ marginal_x="histogram",
861
+ marginal_y="histogram",
862
+ template="simple_white",
863
+ color_discrete_sequence=px.colors.qualitative.Set1
864
+ )
865
+
866
+ # Put every trace in one shared bin group (named "fill") so their bins line up; this does not fill the contours
867
+ fig.update_traces(bingroup="fill")
868
+
869
+ # Update layout for better aesthetics
870
+ fig.update_layout(
871
+ title=f"Joint Density Contour of {i} vs Age by Clusters",
872
+ yaxis_title="Age",
873
+ xaxis_title=i,
874
+ xaxis=dict(
875
+ title=i,
876
+ showline=False,
877
+ showgrid=False,
878
+ zeroline=False,
879
+ tickfont=dict(size=14, color='white'),
880
+ tickangle=45, # Rotate x-axis labels for better readability
881
+ title_font=dict(size=16, color='white') # Set x-axis title to white
882
+ ),
883
+ yaxis=dict(
884
+ title='Age',
885
+ showline=False,
886
+ showgrid=False,
887
+ zeroline=False,
888
+ tickfont=dict(size=14, color='white'),
889
+ title_font=dict(size=16, color='white') # Set y-axis title to white
890
+ ),
891
+ plot_bgcolor='black',
892
+ paper_bgcolor='black',
893
+ title_font_color='white',
894
+ legend_title="Clusters",
895
+ width=1500, # Adjust width as needed
896
+ height=800 # Increase height to make the plot taller
897
+ )
898
+
899
+ # Display the plot using st.plotly_chart within a column
900
+ st.plotly_chart(fig, use_container_width=True)
901
+
902
+
903
+ def ML(filtered_data, scaler, unscaled_data):
904
+ man = filtered_data.copy()
905
+ man=man.dropna()
906
+
907
+ man.drop(columns=['PatientID','VisitID'],inplace=True)
908
+ numerical_columns = list(man.select_dtypes(include=['int', 'float']).columns)
909
+ categorial_columns = list(man.select_dtypes(exclude=['int', 'float','datetime']).columns)
910
+ categorical_indexes = []
911
+
912
+ for c in categorial_columns:
913
+ categorical_indexes.append(man.columns.get_loc(c))
914
+
915
+ t = man.shape
916
+ # st.write(t)
917
+ if 5 <= t[0] < 10:
918
+ ki = 3
919
+ elif t[0] <= 4 :
920
+ ki = 1
921
+ else:
922
+ ki = 4
923
+ kproto = KPrototypes(n_clusters= ki, init='Huang', n_init = 25, random_state=42)
924
+ kproto.fit_predict(man, categorical= categorical_indexes)
925
+ cluster_labels = kproto.labels_
926
+
927
+ original_numeric_data = scaler.inverse_transform(man[numerical_columns])
928
+
929
+ # Convert back to DataFrame and add cluster labels
930
+ original_data = pd.DataFrame(original_numeric_data, columns=numerical_columns)
931
+ original_data["cluster"] = cluster_labels
932
+ original_data["cluster"] = original_data["cluster"].astype('category')
933
+
934
+ ## PCA Graph
935
+ pca = PCA(n_components=4)
936
+ pca_df = pca.fit_transform(original_data[numerical_columns])
937
+ d = list(original_data[numerical_columns].columns)
938
+ pca_df = pd.DataFrame(pca_df, columns=['PC1', 'PC2', 'PC3', 'PC4']) # principal components, not original features
939
+
940
+ import plotly.graph_objects as go
941
+
942
+ st.subheader("PCA")
943
+ fig_9 = go.Figure(
944
+ go.Scatter3d(mode='markers',
945
+ x = pca_df.iloc[:, 0],
946
+ y = pca_df.iloc[:, 1],
947
+ z = pca_df.iloc[:, 2],
948
+ marker=dict(size = 4, color = original_data['cluster'], colorscale = 'spectral')
949
+ )
950
+ )
951
+
952
+ fig_9.update_layout(
953
+ scene=dict(
954
+ xaxis_title='PC1',
954
+ yaxis_title='PC2',
955
+ zaxis_title='PC3',
957
+ # bgcolor='black', # Background color inside the 3D plot
958
+ xaxis=dict(color='white'), # Axis label color
959
+ yaxis=dict(color='white'),
960
+ zaxis=dict(color='white')
961
+ ),
962
+ # plot_bgcolor='black', # Background color outside the 3D plot
963
+ # paper_bgcolor='black' # Paper (entire plot area) background color
964
+ )
965
+ col9 = st.columns(1)[0]
966
+ col9.plotly_chart(fig_9, use_container_width=True)
967
+
968
+
969
+
970
+
971
+ mann = man[categorial_columns].copy()
972
+ orig = original_data.reset_index(drop=True)
973
+ mann = mann.reset_index(drop=True)
974
+ original_data = pd.concat([orig, mann], axis=1)
975
+
976
+ return plots(original_data)
977
+
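The row-count rule that `ML` uses to pick `n_clusters` could be pulled out into a small helper so the thresholds are easy to test. The sketch below is an assumption about the intent, not the app's code: `choose_n_clusters` is a hypothetical name, treating exactly five rows as a small sample is a guess, and the cap at the number of rows is an added safeguard.

```python
def choose_n_clusters(n_rows: int) -> int:
    """Pick a cluster count from the number of rows (hypothetical helper)."""
    if n_rows < 1:
        raise ValueError("need at least one row to cluster")
    if n_rows <= 4:
        wanted = 1  # too few rows to split meaningfully
    elif n_rows < 10:
        wanted = 3  # small sample: a few coarse groups
    else:
        wanted = 4  # default for larger samples
    # Never ask for more clusters than there are rows.
    return min(wanted, n_rows)
```

Raising on an empty frame makes the failure explicit before any clustering call is attempted.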
978
+
979
+
980
+ def imputer(filtered_data):
981
+ numeric_columns = filtered_data.select_dtypes(include=['int', 'float'])
982
+ numeric_columns = numeric_columns.iloc[:,2:].copy()
983
+
984
+ # Setting the random_state argument for reproducibility
985
+ iterative_imputer = IterativeImputer(random_state=42)
986
+ imputed = iterative_imputer.fit_transform(numeric_columns)
987
+ Imputed_data = pd.DataFrame(imputed, columns=numeric_columns.columns)
988
+ Imputed_data = round(Imputed_data, 2)
989
+ columns_drop = Imputed_data.columns
990
+ filtered_data = filtered_data.drop(columns=columns_drop)
991
+ Ml_data = pd.concat([filtered_data, Imputed_data], axis=1)
992
+ unscaled_data = Ml_data.copy()
993
+
994
+ ##Scaling
995
+ scaled_data = Ml_data.select_dtypes(include=['int', 'float'])
996
+ scaled_data = scaled_data.iloc[:,2:].copy()
997
+ scaler = StandardScaler()
998
+ scaler.fit(scaled_data)
999
+ scaled_data = pd.DataFrame(scaler.transform(scaled_data),columns= scaled_data.columns)
1000
+ columns_drop = scaled_data.columns
1001
+ Ml_data = Ml_data.drop(columns=columns_drop)
1002
+ Ml_data = pd.concat([Ml_data, scaled_data], axis=1)
1003
+ Ml_data = Ml_data.convert_dtypes() # change this to outlier_removed if you want outliers to be removed
1004
+ return ML(Ml_data, scaler, unscaled_data)
1005
+
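`imputer` standardizes the numeric columns, and `ML` later calls `scaler.inverse_transform` to get readable units back. As a rough illustration of that round trip, here is a pure-Python sketch for a single column. `standardize` and `unstandardize` are hypothetical names; the sketch uses the population standard deviation, which is what scikit-learn's `StandardScaler` uses.

```python
from statistics import fmean, pstdev

def standardize(values):
    """Return (z_scores, mean, scale) for one numeric column."""
    mean = fmean(values)
    scale = pstdev(values) or 1.0  # constant column: avoid dividing by zero
    return [(v - mean) / scale for v in values], mean, scale

def unstandardize(z_scores, mean, scale):
    """Map z-scores back to the original units."""
    return [z * scale + mean for z in z_scores]

ages = [30.0, 40.0, 50.0]
z, mean, scale = standardize(ages)
restored = unstandardize(z, mean, scale)
```

Keeping `mean` and `scale` next to the transformed values mirrors why the fitted `scaler` object is passed from `imputer` into `ML`.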
1006
+
1015
+
1016
+ filename_1 = "ML_DATA.parquet"
1017
+
1018
+ # Access the token
1019
+ token = os.environ["HUGGING_FACE_HUB_TOKEN"]
1020
+
1021
+ # Download the file
1022
+ local_file_1 = hf_hub_download(repo_id=repo_id, filename=filename_1,repo_type="dataset", token=token)
1023
+
1024
+ @st.cache_data()
1025
+ def fetch_data_1():
1026
+ data = pd.read_parquet(local_file_1)
1027
+ return data
1028
+
1033
+ if analysis_option == 'Machine Learning':
1034
+ data = fetch_data_1()
1035
+ problem = list(data['Description'].unique())
1036
+ st.subheader("_Select Disease_:sunglasses:")
1037
+ health_option = st.selectbox("_Select Disease_:sunglasses:",['', *problem], label_visibility="collapsed")
1038
+ filtered_data = data[data['Description'] == health_option].copy()
1039
+ if filtered_data['key_lab2'].notna().any():
1040
+ column_list = ['PatientID', 'VisitID', 'GroupedICD'] + list(filtered_data['key_lab2'].iloc[0])
1041
+ pivot_data = pd.pivot_table(filtered_data, values='ComponentValue', index=['PatientID', 'VisitID', 'GroupedICD'], columns='ComponentName', aggfunc=lambda x: ', '.join(map(str, x)))
1042
+ pivot_data = pivot_data.reset_index(drop=False)
1043
+ pivot_data = pivot_data[column_list].copy()
1044
+ filtered_data = pd.merge(filtered_data, pivot_data, on=['PatientID', 'VisitID','GroupedICD'], how='left')
1045
+
1046
+ filtered_data.iloc[:, -20:] = filtered_data.iloc[:, -20:].convert_dtypes()
1047
+ hmm = pd.DataFrame()
1048
+ # num_columns = 20
1049
+ num_columns = len(list(filtered_data['key_lab2'].iloc[0]))
1050
+ for i in range(1, num_columns+1):
1051
+ existing_column = filtered_data.columns[-i]
1052
+ new_column_name = f'{existing_column}_meanvalue'
1053
+ hmm[new_column_name] = filtered_data[existing_column].apply(mean_of_values)
1054
+ filtered_data = pd.concat([filtered_data, hmm], axis=1)
1055
+ column_list = [
1056
+ ## Necessary columns
1057
+ 'PatientID', 'VisitID', 'GroupedICD',
1058
+
1059
+ ## Numerical values
1060
+ 'Age', 'SystolicBP',
1061
+ 'DiastolicBP','Temperature',
1062
+ 'Pulse', 'Weight', 'Height', 'BMI', 'Respiration',
1063
+ 'SPO2', 'PHQ_9Score',
1064
+ # 'PeakFlow'
1065
+
1066
+ ## Categorical values
1067
+ 'LegalSex','BPLocation', 'BPPosition', 'PregnancyStatus', 'LactationStatus', 'TemperatureSource',
1068
+ 'Age_Category','BP Severity','Depression Severity','weight_condition', 'Temp_condition', 'Pulse_condition',
1069
+ 'Respiration_condition', 'SPO2_condition', 'PeakF_condition']
1070
+ # last = list(filtered_data.columns[-20:])
1071
+ last = list(hmm.columns)
1072
+ required_columns = column_list + last
1073
+ filtered_data = filtered_data[required_columns].copy()
1074
+ filtered_data = filtered_data.drop_duplicates().reset_index(drop=True)
1075
+ filtered_data = filtered_data.dropna(axis=1, how='all')
1076
+ imputer(filtered_data)
1077
+
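The pivot in the Machine Learning branch joins repeated lab results into one comma-separated string per visit, and `mean_of_values` (defined elsewhere in `app.py`, not shown in this part of the diff) turns each string back into a number. Below is a minimal sketch of what such a helper might look like. It is an assumption, not the actual implementation, and the name `mean_of_joined_values` is hypothetical.

```python
import math

def mean_of_joined_values(cell):
    """Average the numeric entries of a string such as '5.1, 6.3'."""
    if cell is None or (isinstance(cell, float) and math.isnan(cell)):
        return None
    numbers = []
    for part in str(cell).split(","):
        try:
            numbers.append(float(part.strip()))
        except ValueError:
            continue  # skip non-numeric entries such as 'pending'
    numbers = [n for n in numbers if not math.isnan(n)]
    return sum(numbers) / len(numbers) if numbers else None
```

Returning `None` for cells without usable numbers leaves the gap visible, so a later imputation step can fill it.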
1140
+ if analysis_option == 'Data':
1141
+ age_min = int(data['Age'].min())
1142
+ age_max = int(data['Age'].max())
1143
+ age_range = st.sidebar.slider('Select Age Range', age_min, age_max, (age_min, age_max))
1144
+ data = data[(data['Age'] >= age_range[0]) & (data['Age'] <= age_range[1])].copy()
1145
+
1146
+ Sex = data.groupby('LegalSex')['PatientID'].nunique().reset_index(name='count')
1147
+ st.subheader("Distribution of Patients by Sex", divider='rainbow')
1148
+ col1, col2,col3 = st.columns(3)
1149
+ col1.metric(label="Male", value=int(Sex.loc[Sex['LegalSex'] == 'Male', 'count'].sum()))
1150
+ col2.metric(label="Female", value=int(Sex.loc[Sex['LegalSex'] == 'Female', 'count'].sum()))
1151
+ col4, col5 = st.columns(2)
1152
+ fig2 = funnel_chart(data)
1153
+ col4.plotly_chart(fig2, use_container_width=True)
1154
+ fig = scatterplot(data)
1155
+ col5.plotly_chart(fig, use_container_width=True)
1156
+ col6 = st.columns(1)[0]
1157
+ fig_man = scatter_man(data)
1158
+ col6.plotly_chart(fig_man, use_container_width=True)
1159
+
1160
+ st.dataframe(data.head(20).style.format({'PatientID': "{:.0f}"}))
1161
+
1162
+ if analysis_option == 'EDA':
1163
+ age_min = int(data['Age'].min())
1164
+ age_max = int(data['Age'].max())
1165
+ age_range = st.sidebar.slider('Select Age Range', age_min, age_max, (age_min, age_max))
1166
+ data = data[(data['Age'] >= age_range[0]) & (data['Age'] <= age_range[1])].copy()
1167
+
1168
+ problem = list(data['Description'].unique())
1169
+ st.subheader("_Select Disease_:sunglasses:")
1170
+ health_option = st.selectbox("_Select Disease_:sunglasses:",['', *problem], label_visibility="collapsed")
1171
+ if health_option in problem:
1172
+ health_data = data[data['Description'] == health_option].copy()
1173
+ Sex = health_data.groupby('LegalSex')['PatientID'].nunique().reset_index(name='count')
1174
+ st.subheader(f"Patients for '{health_option}' by Sex", divider='rainbow')
1175
+ col1, col2, col3 = st.columns(3)
1176
+ if 'Male' in Sex['LegalSex'].values:
1177
+ col1.metric(label="Male", value=Sex[Sex['LegalSex'] == 'Male']['count'].iloc[0])
1178
+ else:
1179
+ col1.metric(label="Male", value=0)
1180
+ if 'Female' in Sex['LegalSex'].values:
1181
+ col2.metric(label="Female", value=Sex[Sex['LegalSex'] == 'Female']['count'].iloc[0])
1182
+ else:
1183
+ col2.metric(label="Female", value=0)
1184
+ col4, col5 = st.columns(2)
1185
+ fig2 = funnel_chart(health_data)
1186
+ col4.plotly_chart(fig2, use_container_width=True)
1187
+
1188
+ fig3 = barplot_lab(health_data)
1189
+ col5.plotly_chart(fig3, use_container_width=True)
1190
+
1191
+ col6, col7 = st.columns(2)
1192
+ fig4 = histplot_6(health_data)
1193
+ col6.plotly_chart(fig4, use_container_width=True)
1194
+
1195
+ fig5 = histplot_7(health_data)
1196
+ col7.plotly_chart(fig5, use_container_width=True)
1197
+
1198
+ col8, col9 = st.columns(2)
1199
+ fig6 = pie_chart_7(health_data)
1200
+ col8.plotly_chart(fig6, use_container_width=True)
1201
+
1202
+ fig7 = chart_8(health_data)
1203
+ col9.plotly_chart(fig7, use_container_width=True)
1204
+
1205
+
1206
+ col10, col11 = st.columns(2)
1207
+ fig8 = chart_9(health_data)
1208
+ col10.plotly_chart(fig8, use_container_width=True)
1209
+
1210
+ fig9 = chart_10(health_data)
1211
+ col11.plotly_chart(fig9, use_container_width=True)
1212
+
1213
+ col12, col13 = st.columns(2)
1214
+ fig10 = chart_11(health_data)
1215
+ col12.plotly_chart(fig10, use_container_width=True)
1216
+
1217
+ st.dataframe(health_data.head(20).style.format({'PatientID': "{:.0f}"}))
1218
+
1219
+
1220
+ if analysis_option == 'Health Care Chat Bot AI':
1221
+ ## TODO: start here; just add patient + vital information.
1222
+ # data = pd.read_parquet('Health-Data-3.parquet')
1223
+ google_key = st.secrets["api_keys"]["google_key"]
1224
+ llm = GoogleGemini(api_key=google_key)
1225
+ pandas_ai = SmartDataframe(data, config={"llm": llm, "response_parser": StreamlitResponse,"verbose": True})
1226
+ pandas_ai_2 = SmartDataframe(data, config={"llm": llm,"verbose": True}) ## string
1227
+ # Streamlit app title and description
1228
+ st.title("AI-Powered Data Analysis App")
1229
+ st.write("This application allows you to interact with your dataset using natural language prompts. Just ask a question, and the AI will provide insights based on your data.")
1230
+
1231
+ # Display the dataset
1232
+ st.subheader("Dataset Preview")
1233
+ st.dataframe(data.head())
1234
+
1235
+ # User input for natural language prompt
1236
+ prompt = st.text_input("Enter your prompt:", placeholder="e.g., What are the top diagnoses?")
1237
+
1238
+ # Process the input and display the result
1239
+ if st.button("Submit"):
1240
+ if any(word in prompt.lower() for word in ('plot', 'graph')):
1241
+ try:
1242
+ result = pandas_ai.chat(prompt)
1243
+ st.subheader("Result")
1244
+ except KeyError as e:
1245
+ st.error(f"Error: {e}. Unable to retrieve result.")
1246
+ elif prompt:
1247
+ try:
1248
+ result = pandas_ai_2.chat(prompt)
1249
+ st.subheader("Result")
1250
+ st.write(result)
1251
+ except KeyError as e:
1252
+ st.error(f"Error: {e}. Unable to retrieve result.")
1253
+ else:
1254
+ st.warning("Please enter a prompt.")
1255
+
1256
+ # Add a footer
1257
+ st.write("Powered by PandasAI and Google Gemini.")
1258
+
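The chatbot branch chooses between the plotting `SmartDataframe` and the text-only one by looking for keywords in the prompt. A case-insensitive check that matches at word starts could be sketched as follows. `wants_plot` and `PLOT_KEYWORDS` are hypothetical names, and the keyword list is only an example.

```python
import re

PLOT_KEYWORDS = ("plot", "graph", "chart")

def wants_plot(prompt: str) -> bool:
    """Return True when the prompt appears to ask for a visualization."""
    words = re.findall(r"[a-z]+", prompt.lower())
    # str.startswith accepts a tuple, so 'plots' and 'graphs' also match.
    return any(word.startswith(PLOT_KEYWORDS) for word in words)
```

Matching at word starts avoids false positives such as "paragraph", at the cost of missing compounds like "scatterplot"; a plain substring test makes the opposite trade-off.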
requirements.txt ADDED
@@ -0,0 +1,16 @@
1
+ pip
2
+ kmodes
3
+ matplotlib==3.8.4
4
+ numpy==1.26.4
5
+ pandas
6
+ pandasai==2.2.14
7
+ plotly==5.22.0
8
+ scikit_learn==1.4.2
9
+ streamlit
10
+ wordcloud==1.9.3
11
+ google-generativeai
12
+
13
+
14
+
15
+
16
+