ArturG9 committed on
Commit
3f6a8fa
verified
1 Parent(s): 937a0be

Upload 13 files

Stroke_prediction_data_preprocess&Modeling/Stroke_Prediction/README.md ADDED
@@ -0,0 +1,11 @@
+ # argryge-ML.2
+ In this project the following was done:
+ Data analysis, modeling, model evaluation, and saving the model into an app where a user can try it out.
+ An API endpoint for getting predictions was also created.
+
+ Contents of the project:
+ Jupyter notebook with the stroke data analysis and modeling: Stroke_prediction.ipynb
+ Streamlit app for using the model: app.py
+ API created with FastAPI: mlapi.py
+ Model saved as a pickle file
+ Joblib model artifacts
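The last two items are the serialized objects that app.py and mlapi.py load at start-up. A minimal loading sketch, assuming the artifacts sit in the working directory (file and object names are taken from those scripts):

```python
# Sketch: loading the committed artifacts the same way app.py / mlapi.py do.
import pickle
import joblib

with open("lgb1_model.pkl", "rb") as f:
    model = pickle.load(f)                      # trained LightGBM classifier

encoder = joblib.load("encoder.joblib")         # fitted categorical encoder
categorical_features = joblib.load("categorical_features.joblib")
features = joblib.load("features.joblib")       # feature list (assumed from the file name)
```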
Stroke_prediction_data_preprocess&Modeling/Stroke_Prediction/Stroke_Prediction..ipynb ADDED
The diff for this file is too large to render. See raw diff
 
Stroke_prediction_data_preprocess&Modeling/Stroke_Prediction/analysis_functions.py ADDED
@@ -0,0 +1,1003 @@
1
+ from collections import Counter
2
+ from functools import reduce
3
+ from typing import List, Tuple, Union
4
+ import pandas as pd
5
+ import matplotlib.pyplot as plt
6
+ import numpy as np
7
+ import seaborn as sns
8
+ from scipy.stats import t, wilcoxon
9
+ from scipy.stats import ttest_1samp
10
+ import scipy.stats as stats
11
+ from xml.etree.ElementTree import fromstring, ElementTree
12
+ from sklearn.linear_model import LogisticRegression
13
+ from sklearn.metrics import accuracy_score,precision_score
14
+
15
+
16
+
17
+ def read_clean_csv_data(fileName : str ,index_col : Union[int, str] =0) -> pd.DataFrame :
18
+ """Function to load data set from a .csv file.
19
+
20
+ Args:
21
+ fileName (str, optional): The name of the .csv file.
22
+ index_col (Union[int, str], optional): The index column of the resulting dataframe. Defaults to 0.
23
+
24
+ Returns:
25
+ pd.DataFrame: The cleaned and preprocessed dataframe.
26
+ """
27
+
28
+ ## Read the .csv file with the pandas read_csv method
29
+ df = pd.read_csv( fileName ,index_col= index_col)
30
+
31
+ ## Remove rows with missing values, accounting for missing values coded as '?'
32
+ cols= df.columns
33
+ for column in cols:
34
+ df.loc[df[column] == '?', column] = np.nan
35
+ df.dropna(axis = 0, inplace = True)
36
+
37
+
38
+
39
+ return df
40
+
41
+ def convert_k_m_to_numeric(value):
42
+ if 'k' in value:
43
+ return float(value.replace('k', '')) * 1000
44
+ elif 'm' in value:
45
+ return float(value.replace('m', '')) * 1000000
46
+ else:
47
+ return float(value)
48
+
49
+ def quartiles_counts(df_parts: list, column_name: str, value_name: str) -> pd.Series:
50
+ """
51
+ Computes the count of rows in each quartile and with a specified value in a specified column.
52
+
53
+ Args:
54
+ df_parts (list of pandas.DataFrame): A list of dataframes to compute counts for.
55
+ column_name (str): The name of the column to check for the specified value.
56
+ value_name (str): The value to check for in the specified column.
57
+
58
+ Returns:
59
+ pandas.Series: A series of counts for each quartile, with quartile names as indices.
60
+ """
61
+ counts_series = pd.Series({}, dtype=int)
62
+ for i, df_part in enumerate(df_parts):
63
+ x = (df_part[column_name] == value_name).sum()
64
+ counts_series[f"Quartile {i+1}"] = x
65
+ return counts_series
66
+ def six_parts_counts(df_parts: list, column_name: str, value_name: str) -> pd.Series:
67
+ """
68
+ If the list contains more than two dataframes, computes the count of rows in each quartile (labelling the fifth part "Top20" and the sixth "Bottom21"); if it contains fewer than three, counts rows above and below the mean,
69
+ with a specified value in a specified column.
70
+
71
+ Args:
72
+ df_parts (list of pandas.DataFrame): A list of dataframes to compute counts for.
73
+ column_name (str): The name of the column to check for the specified value.
74
+ value_name (str): The value to check for in the specified column.
75
+
76
+ Returns:
77
+ pandas.Series: A series of counts for each quartile and top bottom dataframes, with parts names as indices.
78
+ """
79
+ counts_series = pd.Series({}, dtype=int)
80
+ if len(df_parts)>2:
81
+ for i, df_part in enumerate(df_parts):
82
+ x = (df_part[column_name] == value_name).sum()
83
+ if x>0 :
84
+ if i == 4:
85
+ counts_series["Top20"] = x
86
+ elif i == 5:
87
+ counts_series["Bottom21"] = x
88
+ else:
89
+ counts_series[f"Quartile {i+1}"] = x
90
+
91
+ if len(df_parts) < 3:
92
+ for i, df_part in enumerate(df_parts):
93
+ x = (df_part[column_name] == value_name).sum()
94
+ if x>0 :
95
+ if i == 0:
96
+ counts_series["Above mean"] = x
97
+ elif i == 1:
98
+ counts_series["Below mean"] = x
99
+
100
+ return counts_series
101
+
102
+
103
+
104
+
105
+
106
+ def labeledBarChart(counts: pd.Series, xlabel: str = 'Name', ylabel: str = 'Count',
107
+ title: str = "Title", figsize: Tuple[float, float] = (10,10), rotation: int = 0) -> None:
108
+ """Creates a labeled bar chart from a pandas Series.
109
+
110
+ Args:
111
+ counts (pd.Series): The pandas series with the data to be plotted.
112
+ xlabel (str, optional): The x-axis label. Defaults to 'Name'.
113
+ ylabel (str, optional): The y-axis label. Defaults to 'Count'.
114
+ title (str, optional): The title of the plot. Defaults to "Title".
115
+ figsize (Tuple[float, float], optional): The size of the figure. Defaults to (10,10).
116
+
117
+ Returns:
118
+ None: Displays the labeled bar chart.
119
+ """
120
+
121
+ fig = plt.figure(figsize = figsize)
122
+ ax = fig.gca()
123
+ plt.xticks(rotation = rotation)
124
+ counts_bars = ax.bar(counts.index, counts.values)
125
+ # Add count labels to the bars
126
+ for i, count in enumerate(counts.values):
127
+ ax.text(i, count+2, str(count), ha='center', va='bottom')
128
+ # Add x-axis and y-axis labels
129
+ ax.set_xlabel(xlabel)
130
+ ax.set_ylabel(ylabel)
131
+ ax.set_title(title)
132
+
133
+ # Show the plot
134
+ plt.show()
135
+
136
+
137
+ def t_test_confidence_intervals(data: pd.DataFrame, column: str, con_lvl: float = 0.95):
138
+ data = data[column].values
139
+ # Calculate the sample mean and standard deviation
140
+ sample_mean = np.mean(data)
141
+ sample_std = np.std(data)
142
+
143
+ # Calculate the sample size
144
+ sample_size = len(data)
145
+ # Calculate the critical value (two-tailed t-test)
146
+ critical_value = stats.t.ppf((1 + con_lvl) / 2, df=sample_size - 1)
147
+
148
+ # Calculate the standard error
149
+ standard_error = sample_std / np.sqrt(sample_size)
150
+
151
+ # Calculate the margin of error
152
+ margin_of_error = critical_value * standard_error
153
+
154
+ # Calculate the confidence interval
155
+ lower_bound = sample_mean - margin_of_error
156
+ upper_bound = sample_mean + margin_of_error
157
+
158
+ # Print the results
159
+ print("Confidence Interval ({}%): [{:.3f}, {:.3f}]".format(con_lvl * 100, lower_bound, upper_bound))
160
+
161
+ def test_of_pop_proportion_bigger(data: pd.DataFrame, column: str, variable: float):
162
+ # Count the total number of rows
163
+ data_length = len(data[column])
164
+
165
+ # Count the number of rows where the column value exceeds the threshold
166
+ higher_ratings_reviews = len(data[data[column] > variable])
167
+
168
+ # Calculate the point estimate for the population proportion
169
+ point_estimate = higher_ratings_reviews / data_length
170
+
171
+ # Calculate the standard error
172
+ standard_error = np.sqrt((point_estimate * (1 - point_estimate)) / data_lenght)
173
+
174
+ # Calculate the margin of error for a 95% confidence level (Z-score of 1.96)
175
+ margin_of_error = 1.96 * standard_error
176
+
177
+ # Calculate the confidence interval
178
+ lower_bound = point_estimate - margin_of_error
179
+ upper_bound = point_estimate + margin_of_error
180
+
181
+ # Print the results
182
+ print("margin of error:", margin_of_error)
183
+ print("Point Estimate:", point_estimate)
184
+ print("95% Confidence Interval: [{:.4f}, {:.4f}]".format(lower_bound, upper_bound))
185
+ def pieChart(count: pd.Series, title: str = 'Title' , figsize: Tuple[float, float] = (8,8)) -> None:
186
+ """Creates a pie chart from a pandas Series.
187
+
188
+ Args:
189
+ count (pd.Series): The pandas series with the data to be plotted.
190
+ title (str, optional): The title of the plot. Defaults to 'Title'.
191
+ figsize (Tuple[float, float], optional): The size of the figure.
192
+
193
+ Returns:
194
+ None: Displays the pie chart.
195
+ """
196
+ fig = plt.figure(figsize = figsize)
197
+ ax = fig.gca()
198
+ ax.pie(count.values, labels = count.index, autopct='%1.1f%%')
199
+
200
+ # Add title
201
+ ax.set_title(title)
202
+
203
+ # Show the plot
204
+ plt.show()
205
+
206
+
207
+ def confidence_intervals(data: pd.Series, conf_lvl: float = 0.95):
208
+ data = data.values
209
+ mean = data.mean()
210
+ std = data.std()
211
+ n = len(data)
212
+ conf_int = t.interval(
213
+ conf_lvl, df=n - 1, loc=mean, scale=std / np.sqrt(n))
214
+ return conf_int
215
+
216
+ def conf_int_pop_mean(data: np.ndarray):
217
+ # Calculate sample statistics
218
+ sample_mean = np.mean(data)
219
+ sample_std = np.std(data, ddof=1) # ddof=1 for sample standard deviation
220
+ sample_size = len(data)
221
+
222
+ # Set the desired confidence level
223
+ confidence_level = 0.95
224
+
225
+ # Calculate the critical value (t-distribution)
226
+ critical_value = stats.t.ppf((1 + confidence_level) / 2, df=sample_size - 1)
227
+
228
+ # Calculate the standard error
229
+ standard_error = sample_std / np.sqrt(sample_size)
230
+
231
+ # Calculate the margin of error
232
+ margin_of_error = critical_value * standard_error
233
+
234
+ # Calculate the confidence interval
235
+ confidence_interval = (sample_mean - margin_of_error, sample_mean + margin_of_error)
236
+
237
+ # Print the results
238
+ print("Sample Mean:", sample_mean)
239
+ print("Sample Standard Deviation:", sample_std)
240
+ print("Sample Size:", sample_size)
241
+ print("Confidence Interval:", confidence_interval)
242
+
243
+
244
+
245
+ def samp1_ttest(data, null=0.0, alpha=0.05):
246
+ """
247
+ Perform a one-sample t-test on the data.
248
+
249
+ Parameters:
250
+ - data (array-like): The data array on which to perform the t-test.
251
+ - null (float): The null hypothesis value to test against (default: 0.0).
252
+ - alpha (float): The significance level for calculating the critical value (default: 0.05).
253
+
254
+ Returns:
255
+ - t_statistic (float): The calculated t-statistic.
256
+ - p_value (float): The calculated p-value.
257
+ """
258
+ t_statistic, p_value = stats.ttest_1samp(data, null)
259
+
260
+ # Calculate the critical value at the given significance level
261
+ critical_value = stats.t.ppf(1 - alpha, len(data) - 1)
262
+
263
+ # Compare the t-statistic with the critical value
264
+ if t_statistic > critical_value:
265
+ hypothesis_result = "Reject the null hypothesis"
266
+ else:
267
+ hypothesis_result = "Fail to reject the null hypothesis"
268
+
269
+ # Print the results
270
+ print("t-statistic:", t_statistic)
271
+ print("p-value:", p_value)
272
+
273
+ print(hypothesis_result)
274
+ print(f"One-sample t-test - Statistical Significance (p-value): {p_value:.4f}")
275
+
276
+ def wilcoxon_significance_and_intervals(data: np.ndarray) -> tuple:
277
+ """
278
+ Perform Wilcoxon signed-rank test and calculate confidence intervals.
279
+
280
+ Args:
281
+ data (np.ndarray): Array of paired/matched samples.
282
+
283
+ Returns:
284
+ tuple: Statistical significance (p-value) and confidence intervals.
285
+
286
+ """
287
+
288
+ # Perform the Wilcoxon signed-rank test
289
+ statistic, p_value = wilcoxon(data)
290
+
291
+ # Set the desired confidence level
292
+ confidence_level = 0.95
293
+
294
+ # Calculate the confidence intervals
295
+ n = len(data)
296
+ z_critical = 1.96 # For a 95% confidence level (two-tailed test)
297
+
298
+ mean = np.mean(data)
299
+ std_dev = np.std(data)
300
+
301
+ margin_of_error = z_critical * (std_dev / np.sqrt(n))
302
+
303
+ lower_bound = mean - margin_of_error
304
+ upper_bound = mean + margin_of_error
305
+ print(f"Statistical Significance (p-value): {p_value:.4f}")
306
+ print(f"Confidence Interval: [{lower_bound:.2f}, {upper_bound:.2f}]")
307
+ # Return the statistical significance and confidence intervals
308
+ return p_value, (lower_bound, upper_bound)
312
+ def population_mean(data: np.ndarray) :
313
+ data = np.array(data)
314
+ # Calculate the population mean
315
+ population_mean = data.mean()
316
+
317
+ # Calculate the confidence interval
318
+ confidence_level = 0.95
319
+ alpha = 1 - confidence_level
320
+
321
+ z_critical = stats.norm.ppf(1 - alpha / 2) # Z-value for 95% confidence interval
322
+
323
+ standard_error = data.std() / np.sqrt(len(data))
324
+ margin_of_error = z_critical * standard_error
325
+
326
+ lower_bound = population_mean - margin_of_error
327
+ upper_bound = population_mean + margin_of_error
328
+
329
+ # Print the results
330
+ print("Population Mean:", population_mean)
331
+ print("Confidence Interval:", (lower_bound, upper_bound))
332
+
333
+
334
+ def varible_mean_Zhypothesis(data: np.ndarray,alpha = 0.05,null_mean = 4.62):
335
+ data = np.array(data)
336
+
337
+ # Calculate sample statistics
338
+ sample_mean = np.mean(data)
339
+ sample_std = np.std(data, ddof=1) # ddof=1 for sample standard deviation
340
+ sample_size = len(data)
341
+
342
+ # Calculate the test statistic (z-score)
343
+ z_score = (sample_mean - null_mean) / (sample_std / np.sqrt(sample_size))
344
+
345
+ # Calculate the critical value (z-value) for two-tailed test
346
+ critical_value = stats.norm.ppf(1 - alpha / 2)
347
+
348
+ # Calculate the p-value
349
+ p_value = 2 * (1 - stats.norm.cdf(abs(z_score)))
350
+
351
+ # Compare the test statistic with critical value and p-value with alpha
352
+ if abs(z_score) > critical_value:
353
+ print("Reject the null hypothesis")
354
+ else:
355
+ print("Fail to reject the null hypothesis")
356
+
357
+ print("Sample Mean:", sample_mean)
358
+ print("Sample Standard Deviation:", sample_std)
359
+ print("Sample Size:", sample_size)
360
+ print("Test Statistic (z-score):", z_score)
361
+ print("Critical Value (z-value):", critical_value)
362
+ print("P-value:", p_value)
363
+
364
+
365
+
366
+ def bootstrap_confidence_interval(data: np.ndarray, num_bootstrap_samples: int=1000, confidence_level: float=0.95) -> tuple:
367
+ """
368
+ Calculate the confidence interval using bootstrapping.
369
+
370
+ Args:
371
+ data (array-like): The original data.
372
+ num_bootstrap_samples (int): The number of bootstrap samples to generate.
373
+ confidence_level (float): The desired confidence level (between 0 and 1).
374
+
375
+ Returns:
376
+ tuple: Lower and upper bounds of the confidence interval.
377
+
378
+ """
379
+ # Convert the data to a NumPy array
380
+ data = np.array(data)
381
+
382
+ # Create an array to store the bootstrap sample statistics
383
+ bootstrap_samples = np.zeros(num_bootstrap_samples)
384
+
385
+ # Perform bootstrapping
386
+ for i in range(num_bootstrap_samples):
387
+ # Generate a bootstrap sample by randomly sampling with replacement from the original data
388
+ bootstrap_sample = np.random.choice(data, size=len(data), replace=True)
389
+
390
+ # Calculate the statistic of interest on the bootstrap sample
391
+ bootstrap_statistic = np.mean(bootstrap_sample)
392
+
393
+ # Store the bootstrap statistic
394
+ bootstrap_samples[i] = bootstrap_statistic
395
+
396
+ # Calculate the lower and upper percentiles of the bootstrap samples
397
+ lower_percentile = (1 - confidence_level) / 2
398
+ upper_percentile = 1 - lower_percentile
399
+
400
+ lower_bound = np.percentile(bootstrap_samples, lower_percentile * 100)
401
+ upper_bound = np.percentile(bootstrap_samples, upper_percentile * 100)
402
+ print(f"Confidence interval: [{lower_bound:.2f}, {upper_bound:.2f}]")
403
+ return lower_bound, upper_bound
404
+
405
+ # Print the confidence interval
406
+ def violinplot(data: pd.DataFrame, category_column: str, numeric_column: str, xlabel: str = 'Category',
407
+ ylabel: str = "Numeric values", title: str = "Title" ,figsize: Tuple[float, float] = (8,8)) -> None:
408
+ """
409
+ Creates a violin plot for the given dataframe using the specified category column and numeric column.
410
+
411
+ Args:
412
+ data (pd.DataFrame): The input dataframe to plot.
413
+ category_column (str): The name of the column to use as the categorical variable.
414
+ numeric_column (str): The name of the column to use as the numeric variable.
415
+ xlabel (str, optional): The label for the x-axis. Defaults to 'Category'.
416
+ ylabel (str, optional): The label for the y-axis. Defaults to 'Numeric values'.
417
+ title (str, optional): The title for the plot. Defaults to 'Title'.
418
+ figsize (Tuple[float, float], optional): The size of the figure.
419
+
420
+
421
+ Returns:
422
+ Violin plot
423
+ """
424
+ fig = plt.figure(figsize = figsize)
425
+ ax = fig.gca()
426
+ sns.set_style('whitegrid')
427
+ # violin plot of the numeric column for each category (default seaborn palette)
428
+ sns.violinplot(x = category_column, y = numeric_column, data = data, ax = ax)
429
+
430
+
431
+ ax.set_title(title)
432
+ ax.set_ylabel(ylabel)
433
+ ax.set_xlabel(xlabel)
434
+
435
+
436
+
437
+
438
+ def kdePlot(datacolumn: pd.Series, xlabel: str = 'Size', ylabel: str = 'Density',
439
+ title: str = "Kernel density plot", figsize: Tuple[int, int] = (10,10) ) -> None:
440
+ """
441
+ Creates a kernel density plot for the given pandas series.
442
+
443
+ Args:
444
+ datacolumn (pd.Series): The input data to plot.
445
+ xlabel (str, optional): The label for the x-axis. Defaults to 'Size'.
446
+ ylabel (str, optional): The label for the y-axis. Defaults to 'Density'.
447
+ title (str, optional): The title for the plot. Defaults to 'Kernel density plot'.
448
+ figsize (Tuple[int, int], optional): The size of the figure. Defaults to (10,10).
449
+
450
+ Returns:
451
+ Kernel density plot
452
+ """
453
+ fig = plt.figure(figsize = figsize)
454
+ ax = fig.gca()
455
+ sns.set_style('whitegrid')
456
+ # kernel density estimate of the given data column
457
+ sns.kdeplot(datacolumn,ax=ax)
458
+ ax.set_title(title)
459
+ ax.set_ylabel(ylabel)
460
+ ax.set_xlabel(xlabel)
461
+
462
+
463
+
465
+ def OneD_Bar_Sublots (counts_list: List[pd.Series], subtitle: str = "Bar chart subplots",
466
+ xlabel: str = 'Category', ylabel: str = 'Values', figsize = (10,10)) -> None:
467
+ """
468
+ Creates multiple bar chart subplots for the given list of pandas series.
469
+
470
+ Args:
471
+ counts_list (List[pd.Series]): The list of data to plot.
472
+ subtitle (str, optional): The subtitle for the plot. Defaults to 'Bar chart subplots'.
473
+ xlabel (str, optional): The label for the x-axis. Defaults to 'Category'.
474
+ ylabel (str, optional): The label for the y-axis. Defaults to 'Values'.
475
+
476
+ Returns:
477
+ One dimension barchart subplots
478
+ """
479
+ fig, axs = plt.subplots(1, len(counts_list), figsize=figsize)
480
+ for idx, i in enumerate(counts_list):
481
+ axs[idx].bar(i.index, i.values)
482
+ axs[idx].set_title(input(f'Set title for {i.name} subplot (they are in same order as in your list): '))
483
+
484
+
485
+
486
+ for ax in axs.flat:
487
+ for p in ax.patches:
488
+ ax.text(
489
+ p.get_x() + p.get_width() / 2,
490
+ p.get_height(),
491
+ p.get_height(),
492
+ ha="center",
493
+ va="bottom",
494
+
495
+ )
496
+
497
+ fig.suptitle(subtitle)
498
+ plt.xlabel(xlabel )
499
+ plt.ylabel(ylabel)
500
+
501
+ # Show the plot
502
+ plt.show()
503
+
504
+ def histogram(dataframe_column: pd.Series, title: str = "Title", xlabel: str = "Sizes",
505
+ ylabel: str = "Amount", figsize: tuple = (10, 10), rotation: int = 0) -> None:
506
+ """
507
+ This function creates a histogram plot of a given pandas Series.
508
+
509
+ Args:
510
+ dataframe_column (pd.Series): A pandas Series object to be plotted.
511
+ title (str, optional): The title of the histogram. Defaults to "Title".
512
+ xlabel (str, optional): The label of the x-axis. Defaults to "Sizes".
513
+ ylabel (str, optional): The label of the y-axis. Defaults to "Amount".
514
+ figsize (tuple, optional): The size of the figure. Defaults to (10, 10).
515
+ rotation (int, optional): The rotation of the x-tick labels. Defaults to 0.
516
+
517
+ Returns:
518
+ None
519
+ """
520
+ values = dataframe_column.values
521
+ fig = plt.figure(figsize=figsize)
523
+ ax = fig.gca()
524
+ dataframe_column.plot.hist(ax=ax)
525
+ ax.set_title(title)
526
+ ax.set_xlabel(xlabel)
527
+ ax.set_ylabel(ylabel)
+ ax.tick_params(axis="x", rotation=rotation)  # rotate x-tick labels as documented
528
+
529
+ def oneD_piechart_subplots(data_list: list[pd.Series], subtitle: str = "Pie chart subplots", figsize: tuple[int, int] = (15, 10)) -> None:
530
+ """
531
+ Plot one-dimensional pie chart subplots.
532
+
533
+ Parameters:
534
+ data_list (list[pd.Series]): A list of Pandas Series objects containing the data to plot.
535
+ subtitle (str): The title of the plot. Default is 'Pie chart subplots'.
536
+ figsize (tuple[int, int]): The size of the figure. Default is (15, 10).
537
+
538
+ Returns:
539
+ None
540
+
541
+ Raises:
542
+ ValueError: If `data_list` is empty.
543
+ """
544
+ fig, axs = plt.subplots(1, len(data_list), figsize=figsize)
545
+ for idx, i in enumerate(data_list):
546
+ axs[idx].pie(
547
+ i.values.astype(float),
548
+ labels=i.index,
549
+ autopct="%1.1f%%",
550
+ )
551
+ axs[idx].set_title(
552
+ input(
553
+ f"Set title for {i.name} subplot (they are in same order as in your list): "
554
+ )
555
+ )
556
+ # Plot a pie chart on each of the subplots
557
+
558
+ # Add a title to the figure
559
+ fig.suptitle(subtitle)
560
+
561
+ # Show the plot
562
+ plt.show()
563
+
564
+ def calculate_mean_last_five_results_for_all_matches(dataframe: pd.DataFrame) -> pd.DataFrame:
565
+ """
566
+ Calculate the mean of the last five results for both home and away teams for all matches.
567
+
568
+ Args:
569
+ dataframe (pd.DataFrame): The DataFrame containing match data.
570
+
571
+ Returns:
572
+ pd.DataFrame: The original DataFrame with additional columns for mean results.
573
+ """
574
+ # Create new columns to store the mean results for home and away teams
575
+ dataframe["mean_result_home"] = np.nan
576
+ dataframe["mean_result_away"] = np.nan
577
+
578
+ for index, row in dataframe.iterrows():
579
+ home_team_api_id = row["home_team_api_id"]
580
+ away_team_api_id = row["away_team_api_id"]
581
+
582
+ # Calculate the mean result for the home team
583
+ mean_result_home, _ = calculate_mean_last_five_results(dataframe, home_team_api_id, "result_home", index)
584
+
585
+ # Calculate the mean result for the away team
586
+ mean_result_away, _ = calculate_mean_last_five_results(dataframe, away_team_api_id, "result_away", index)
587
+
588
+ # Update the DataFrame with the mean results
589
+ dataframe.at[index, "mean_result_home"] = mean_result_home
590
+ dataframe.at[index, "mean_result_away"] = mean_result_away
591
+
592
+ return dataframe
593
+
594
+ def calculate_mean_last_five_results(dataframe: pd.DataFrame, team_api_id: int, result_column: str, current_match_index: int) -> Tuple[float, list]:
595
+ """
596
+ Calculate the mean of the last five results for a specific team.
597
+
598
+ Args:
599
+ dataframe (pd.DataFrame): The DataFrame containing match data.
600
+ team_api_id (int): The API ID of the team for which to calculate the mean.
601
+ result_column (str): The name of the column containing match results.
602
+ current_match_index (int): The index of the current match being processed.
603
+
604
+ Returns:
605
+ Tuple[float, list]: A tuple containing the mean result (float) and a list of meanings for the last five results.
606
+ """
607
+ # Filter the DataFrame for matches involving the specified team
608
+ team_matches = dataframe[(dataframe["home_team_api_id"] == team_api_id) | (dataframe["away_team_api_id"] == team_api_id)]
609
+
610
+ # Exclude the current match from the calculations
611
+ team_matches = team_matches[team_matches.index != current_match_index]
612
+
613
+ # Sort the matches by index (assuming the DataFrame is sorted by date)
614
+ team_matches = team_matches.sort_index(ascending=False)
615
+
616
+ # Extract the last five results for the team (or all available if fewer than five)
617
+ last_five_results = team_matches[result_column].head(5).values
618
+
619
+ # Map the result codes to their corresponding meanings (1: Loss, 2: Draw, 3: Win)
620
+ result_meanings = {1: "Loss", 2: "Draw", 3: "Win"}
621
+ last_five_results_meaning = [result_meanings[result_code] for result_code in last_five_results]
622
+
623
+ # Calculate the mean of the results (1: Loss, 2: Draw, 3: Win)
624
+ mean_result = np.mean(last_five_results)
625
+
626
+ return mean_result, last_five_results_meaning
627
+
628
+ def calculate_mean_last_five_home_results(dataframe: pd.DataFrame, team_api_id: int, result_column: str, current_match_index: int) -> Tuple[float, list]:
629
+ """
630
+ Calculate the mean of the last five home results for a specific team.
631
+
632
+ Args:
633
+ dataframe (pd.DataFrame): The DataFrame containing match data.
634
+ team_api_id (int): The API ID of the team for which to calculate the mean.
635
+ result_column (str): The name of the column containing match results.
636
+ current_match_index (int): The index of the current match being processed.
637
+
638
+ Returns:
639
+ Tuple[float, list]: A tuple containing the mean result (float) and a list of meanings for the last five home results.
640
+ """
641
+ # Filter the DataFrame for matches where the team is the home team
642
+ home_matches = dataframe[dataframe["home_team_api_id"] == team_api_id]
643
+
644
+ # Exclude the current match from the calculations
645
+ home_matches = home_matches[home_matches.index != current_match_index]
646
+
647
+ # Sort the matches by index (assuming the DataFrame is sorted by date)
648
+ home_matches = home_matches.sort_index(ascending=False)
649
+
650
+ # Extract the last five home results for the team (or all available if fewer than five)
651
+ last_five_home_results = home_matches[result_column].head(5).values
652
+
653
+ # Map the result codes to their corresponding meanings (1: Loss, 2: Draw, 3: Win)
654
+ result_meanings = {1: "Loss", 2: "Draw", 3: "Win"}
655
+ last_five_home_results_meaning = [result_meanings[result_code] for result_code in last_five_home_results]
656
+
657
+ # Calculate the mean of the home results (1: Loss, 2: Draw, 3: Win)
658
+ mean_home_result = np.mean(last_five_home_results)
659
+
660
+ return mean_home_result, last_five_home_results_meaning
661
+
662
+ def calculate_mean_last_five_away_results(dataframe: pd.DataFrame, team_api_id: int, result_column: str, current_match_index: int) -> Tuple[float, list]:
663
+ """
664
+ Calculate the mean of the last five away results for a specific team.
665
+
666
+ Args:
667
+ dataframe (pd.DataFrame): The DataFrame containing match data.
668
+ team_api_id (int): The API ID of the team for which to calculate the mean.
669
+ result_column (str): The name of the column containing match results.
670
+ current_match_index (int): The index of the current match being processed.
671
+
672
+ Returns:
673
+ Tuple[float, list]: A tuple containing the mean result (float) and a list of meanings for the last five away results.
674
+ """
675
+ # Filter the DataFrame for matches where the team is the away team
676
+ away_matches = dataframe[dataframe["away_team_api_id"] == team_api_id]
677
+
678
+ # Exclude the current match from the calculations
679
+ away_matches = away_matches[away_matches.index != current_match_index]
680
+
681
+ # Sort the matches by index (assuming the DataFrame is sorted by date)
682
+ away_matches = away_matches.sort_index(ascending=False)
683
+
684
+ # Extract the last five away results for the team (or all available if fewer than five)
685
+ last_five_away_results = away_matches[result_column].head(5).values
686
+
687
+ # Map the result codes to their corresponding meanings (1: Loss, 2: Draw, 3: Win)
688
+ result_meanings = {1: "Loss", 2: "Draw", 3: "Win"}
689
+ last_five_away_results_meaning = [result_meanings[result_code] for result_code in last_five_away_results]
690
+
691
+ # Calculate the mean of the away results (1: Loss, 2: Draw, 3: Win)
692
+ mean_away_result = np.mean(last_five_away_results)
693
+
694
+ return mean_away_result, last_five_away_results_meaning
695
+
696
+ def getTeamResult(row):
697
+ if row["winning_team"] == "1":
698
+ home_team_result = "Win"
699
+ away_team_result = "Loss"
700
+ elif row["winning_team"] == "3":
701
+ home_team_result = "Loss"
702
+ away_team_result = "Win"
703
+ else:
704
+ home_team_result = "Draw"
705
+ away_team_result = "Draw"
706
+
707
+ return [home_team_result, away_team_result]
708
+
709
+ def calculate_rolling_means(df, feature, window_size=5):
710
+ """
711
+ Calculate rolling means for a specified feature for both home and away games.
712
+
713
+ Args:
714
+ df (pd.DataFrame): The DataFrame containing match data.
715
+ feature (str): The name of the feature for which to calculate rolling means.
716
+ window_size (int): The size of the rolling window (default is 5).
717
+
718
+ Returns:
719
+ pd.DataFrame: The DataFrame with additional columns for rolling means.
720
+ """
721
+ # Calculate the rolling mean for home games
722
+ df[feature + "_home_game"] = df.groupby("home_team_api_id")[feature].transform(
723
+ lambda x: x.rolling(window=window_size, min_periods=1).mean()
724
+ )
725
+
726
+ # Calculate the rolling mean for away games
727
+ df[feature + "_away_game"] = df.groupby("away_team_api_id")[feature].transform(
728
+ lambda x: x.rolling(window=window_size, min_periods=1).mean()
729
+ )
730
+
731
+ # Calculate the rolling mean difference between home and away teams
732
+ df[feature + "_RMean_Diff"] = (
733
+ df[feature + "_home_game"] - df[feature + "_away_game"]
734
+ )
735
+
736
+ return df
737
+
738
+ def delete_player_columns(df, start_player=1, end_player=11):
739
+ for i in range(start_player, end_player + 1):
740
+ columns_to_delete = [col for col in df.columns if col.endswith(f"player_{i}")]
741
+ df.drop(columns=columns_to_delete, inplace=True)
742
+ return df
743
+
744
+
745
+ def calculate_team_mean_cat_data(df, team, feature_name, start_player=1, end_player=11):
746
+ # Generate the list of player columns based on the specified range
747
+ player_columns = [
748
+ f"{feature_name}_{team}_player_{i}" for i in range(start_player, end_player + 1)
749
+ ]
750
+
751
+ # Convert the player columns to numeric (replace non-numeric values with NaN)
752
+ df[player_columns] = df[player_columns].apply(pd.to_numeric, errors="coerce")
753
+
754
+ # Calculate the mean for the specified player columns
755
+ df[f"{team}_team_mean_{feature_name}"] = df[player_columns].mean(axis=1)
756
+
757
+ # Drop the player columns that were used to calculate the mean
758
+ df.drop(columns=player_columns, inplace=True)
759
+
760
+ # Convert the "defensive_work_rate" columns to numeric values based on player number
761
+ encoding_map = {"low": 1, "medium": 2, "high": 3}
762
+ for i in range(start_player, end_player + 1):
763
+ defensive_work_rate_column = f"{feature_name}_{team}_player_{i}"
764
+ # Check if the column exists before replacing values
765
+ if defensive_work_rate_column in df.columns:
766
+ df[defensive_work_rate_column] = df[defensive_work_rate_column].replace(
767
+ encoding_map
768
+ )
769
+
770
+ return df
771
+ def calculate_team_mean(df, team, feature_name, start_player=1, end_player=11):
772
+ player_columns = [f"{team}_player_{i}" for i in range(start_player, end_player + 1)]
773
+ player_feature_columns = [
774
+ f"{feature_name}_{team}_player_{i}" for i in range(start_player, end_player + 1)
775
+ ]
776
+
777
+ # Check if all player feature columns exist
778
+ if all(col in df.columns for col in player_feature_columns):
779
+ df[f"{team}_team_mean_{feature_name}"] = df[player_feature_columns].mean(axis=1)
780
+ df.drop(columns=player_feature_columns, inplace=True)
781
+ else:
782
+ print("Player feature columns do not exist in the DataFrame.")
783
+
784
+ return df
785
+
786
+
787
+ def pie_count_subplot_single(data_column, title):
788
+ f, ax = plt.subplots(1, 2, figsize=(18, 8))
789
+
790
+ # Plot pie chart
791
+ data_column.value_counts().plot.pie(explode=[0, 0.1], autopct='%1.1f%%', ax=ax[0], shadow=True)
792
+ ax[0].set_title(title)
793
+ ax[0].set_ylabel('')
794
+
795
+ # Plot countplot
796
+ sns.countplot(x=data_column.name, data=data_column.to_frame(), ax=ax[1])
797
+ ax[1].set_title(title)
798
+
799
+ plt.show()
800
+
801
+ def bar_hued_barchart(data: pd.DataFrame, column: str, hue_column: str, title1: str = 'title1', title2: str = 'title2', title: str = 'title', xlabel1: str = 'xlabel1', xlabel2: str = 'xlabel2'):
802
+ """
803
+ Plot a bar chart and a hued bar chart for the specified columns with customizable titles and x-axis labels.
804
+
805
+ Parameters:
806
+ data (pd.DataFrame): The DataFrame containing the data.
807
+ column (str): The column to be plotted on the x-axis.
808
+ hue_column (str): The column to be used for hue in the countplot.
809
+ title1 (str): The title for the first subplot (bar chart). Default is 'title1'.
810
+ title2 (str): The title for the second subplot (countplot). Default is 'title2'.
811
+ title (str): The general title for the plots. Default is 'title'.
812
+ xlabel1 (str): The x-axis label for the first subplot. Default is 'xlabel1'.
813
+ xlabel2 (str): The x-axis label for the second subplot. Default is 'xlabel2'.
814
+ """
815
+ f, ax = plt.subplots(1, 2, figsize=(18, 8))
816
+
817
+ # Plot bar chart
818
+ data[[column, hue_column]].groupby([column]).mean().plot.bar(ax=ax[0])
819
+ ax[0].set_title(title1)
820
+ ax[0].set_xlabel(xlabel1)
821
+
822
+ # Plot countplot
823
+ sns.countplot(x=column, hue=hue_column, data=data, ax=ax[1])
824
+ ax[1].set_title(title2)
825
+ ax[1].set_xlabel(xlabel2)
826
+
827
+ plt.suptitle(title)
828
+ plt.show()
829
+
830
+ def plot_count_and_hue(data: pd.DataFrame, x_column: str, hue_column: str):
831
+ """
832
+ Plot two countplots: one without hue and one with hue.
833
+
834
+ Parameters:
835
+ data (pd.DataFrame): The DataFrame containing the data.
836
+ x_column (str): The categorical variable to be plotted on the x-axis.
837
+ hue_column (str): The categorical variable to be used for hue.
838
+ """
839
+ # Create subplots with 1 row and 2 columns
840
+ fig, axes = plt.subplots(1, 2, figsize=(12, 5))
841
+
842
+ # Plot countplot without hue
843
+ sns.countplot(x=x_column, data=data, ax=axes[0])
844
+ axes[0].set_title(f'Countplot of {x_column}')
845
+
846
+ # Plot countplot with hue
847
+ sns.countplot(x=x_column, hue=hue_column, data=data, ax=axes[1])
848
+ axes[1].set_title(f'Countplot of {x_column} with Hue {hue_column}')
849
+
850
+ # Adjust spacing between subplots
851
+ plt.tight_layout()
852
+
853
+ # Show the plots
854
+ plt.show()
855
+
856
+ def plot_proportion_stacked_bar(data: pd.DataFrame, x_column: str, hue_column: str,title:str):
857
+ """
858
+ Plot a stacked bar chart to visualize the proportion of hued counts.
859
+
860
+ Parameters:
861
+ data (pd.DataFrame): The DataFrame containing the data.
862
+ x_column (str): The categorical variable to be plotted on the x-axis.
863
+ hue_column (str): The categorical variable to be used for hue.
864
+ """
865
+ # Create a contingency table
866
+ contingency_table = pd.crosstab(data[x_column], data[hue_column], normalize='index')
867
+
868
+ ax = contingency_table.plot(kind='bar', stacked=True, rot=0)
869
+ ax.legend(title=hue_column, bbox_to_anchor=(1, 1.02), loc='upper left')
870
+
871
+ # add annotations if desired
872
+ for c in ax.containers:
873
+
874
+ # set the bar label
875
+ ax.bar_label(c, label_type='center')
876
+ ax.set_title(title)
877
+
878
+ def plot_bar_and_stacked_bar(data: pd.DataFrame, x_column: str, hue_column: str, title: str=None, subplot_title: str = None, main_title: str = None,figsize:tuple = (12,8)):
879
+ """
880
+ Plot a bar chart of data[x_column] followed by a stacked bar chart to visualize the proportion of hued counts.
881
+
882
+ Parameters:
883
+ data (pd.DataFrame): The DataFrame containing the data.
884
+ x_column (str): The categorical variable to be plotted on the x-axis.
885
+ hue_column (str): The categorical variable to be used for hue.
886
+ title (str): The title for the plot.
887
+ subplot_title (str, optional): The title for the subplot.
888
+ main_title (str, optional): The main title for the plot.
889
+
890
+ Returns:
891
+ tuple[str, str, str]: The titles of the bar chart, stacked bar chart, and subplot.
892
+ """
893
+ # Check if there are multiple categories in x_column
894
+ if len(data[x_column].unique()) > 1:
895
+ # Create a figure with subplots for both charts
896
+ fig, axes = plt.subplots(1, 2, figsize = figsize)
897
+
898
+ # Create a bar chart for data[x_column]
899
+ data[x_column].value_counts().plot(kind='bar', ax=axes[0], rot=0)
900
+ axes[0].set_title(f'Bar Chart of {x_column}')
901
+
902
+ # Create a contingency table for the stacked bar chart
903
+ contingency_table = pd.crosstab(data[x_column], data[hue_column], normalize='index')
904
+
905
+ ax = contingency_table.plot(kind='bar', stacked=True, rot=0, ax=axes[1])
906
+ ax.legend(title='Stacked Barchart of ' + hue_column, bbox_to_anchor=(1, 1.02), loc='upper left')
907
+ else:
908
+ # Only one category in x_column, so create only the stacked bar chart
909
+ plt.figure(figsize=figsize)
910
+ contingency_table = pd.crosstab(data[x_column], data[hue_column], normalize='index')
911
+ ax = contingency_table.plot(kind='bar', stacked=True, rot=0)
912
+ ax.legend(title='Stacked Barchart of ' + hue_column, bbox_to_anchor=(1, 1.02), loc='upper left')
913
+ # add annotations if desired
914
+ for c in ax.containers:
915
+
916
+ # set the bar label
917
+ ax.bar_label(c, label_type='center')
918
+
919
+ # Set the title of the plot
920
+ if main_title is not None:
921
+ plt.suptitle(main_title, fontsize=16)
922
+
923
+ # Adjust layout for better spacing between subplots
924
+ plt.tight_layout()
925
+ plt.title('Stacked barchart in comparison with ' + hue_column)
926
+ plt.show()
927
+
928
+ def lasso_classifier(X_train,y_train,X_test,y_test,X):
929
+
930
+ lasso_classifier = LogisticRegression(penalty="l1", solver="liblinear", random_state=42)
931
+ lasso_classifier.fit(X_train, y_train)
932
+
933
+
934
+ y_pred = lasso_classifier.predict(X_test)
935
+
936
+
937
+ accuracy = accuracy_score(y_test, y_pred)
938
+ print("Accuracy:", accuracy)
939
+
940
+ precision = precision_score(y_test, y_pred)
941
+ print("Precision:", precision)
942
+
943
+ lasso_coefficients = lasso_classifier.coef_[0]
944
+
945
+
946
+ lasso_abs_coefficients = np.abs(lasso_coefficients)
947
+
948
+ top_20_lasso_indices = np.argsort(lasso_abs_coefficients)[-20:]
949
+
950
+
951
+ top_20_lasso_feature_names = X.columns[top_20_lasso_indices]
952
+
953
+
954
+ top_20_lasso_coefficients = lasso_coefficients[top_20_lasso_indices]
955
+
956
+ # Create a bar plot to visualize the top 20 most important features for Lasso
957
+ plt.figure(figsize=(12, 8))
958
+ plt.barh(top_20_lasso_feature_names, top_20_lasso_coefficients)
959
+ plt.xlabel("Coefficient Value (Lasso)")
960
+ plt.title("Most Important Features - Lasso")
961
+ plt.gca().invert_yaxis() # Invert y-axis to display the most important feature at the top
962
+ plt.show()
963
+
964
+ def ridge_classifier(X_train,y_train,X_test,y_test,X):
965
+ ridge_classifier = LogisticRegression(penalty="l2", solver="liblinear", random_state=42)
966
+ ridge_classifier.fit(X_train, y_train)
967
+
968
+
969
+ y_pred = ridge_classifier.predict(X_test)
970
+
971
+
972
+ accuracy = accuracy_score(y_test, y_pred)
973
+ print("Accuracy:", accuracy)
974
+
975
+ precision = precision_score(y_test, y_pred)
976
+ print("Precision:", precision)
977
+
978
+
979
+ ridge_coefficients = ridge_classifier.coef_[0]
980
+
981
+
982
+ ridge_abs_coefficients = np.abs(ridge_coefficients)
983
+
984
+
985
+ top_20_ridge_indices = np.argsort(ridge_abs_coefficients)[-20:]
986
+
987
+
988
+ top_20_ridge_feature_names = X.columns[top_20_ridge_indices]
989
+
990
+
991
+ top_20_ridge_coefficients = ridge_coefficients[top_20_ridge_indices]
992
+
993
+
994
+ plt.figure(figsize=(12, 8))
995
+ plt.barh(top_20_ridge_feature_names, top_20_ridge_coefficients)
996
+ plt.xlabel("Coefficient Value (Ridge)")
997
+ plt.title("Top 20 Most Important Features - Ridge")
998
+ plt.gca().invert_yaxis() # Invert y-axis to display the most important feature at the top
999
+ plt.show()
1000
+
1001
+ def is_binary(series):
1002
+ unique_values = series.unique()
1003
+ return len(unique_values) == 2 and set(unique_values) == {0, 1}
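analysis_functions.py is a collection of EDA and statistics helpers used from the notebook. A short usage sketch, not part of the commit — the CSV name and the `bmi`, `age` and `stroke` column names below are hypothetical placeholders:

```python
# Hypothetical usage of a few of the helpers above.
import numpy as np
import analysis_functions as af

df = af.read_clean_csv_data("stroke_data.csv", index_col=0)   # drops NaN and '?' rows

# 95% bootstrap confidence interval for the mean BMI
lower, upper = af.bootstrap_confidence_interval(df["bmi"].values,
                                                num_bootstrap_samples=1000,
                                                confidence_level=0.95)

# Stroke counts per age quartile, shown as a labeled bar chart
parts = np.array_split(df.sort_values("age"), 4)
counts = af.quartiles_counts(parts, column_name="stroke", value_name=1)
af.labeledBarChart(counts, xlabel="Age quartile", ylabel="Stroke count",
                   title="Strokes per age quartile")
```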
Stroke_prediction_data_preprocess&Modeling/Stroke_Prediction/app.py ADDED
@@ -0,0 +1,110 @@
1
+ import io
2
+ import pickle
3
+ import streamlit as st
4
+ import joblib
5
+ import shap
6
+ import pandas as pd
7
+ import matplotlib.pyplot as plt
8
+
9
+
10
+ # Load the LightGBM model and other necessary objects
11
+ with open('lgb1_model.pkl', 'rb') as f:
12
+ lgb1 = pickle.load(f)
13
+
14
+ categorical_features = joblib.load("categorical_features.joblib")
15
+ encoder = joblib.load("encoder.joblib")
16
+
17
+ # Sidebar option to select the dashboard
18
+ option = st.sidebar.selectbox("Which dashboard?", ("Model information", "Stroke prediction"))
19
+ st.title(option)
20
+
21
+ def get_pred():
22
+ """
23
+ Function to display the stroke probability calculator and Shap force plot.
24
+ """
25
+ st.header("Stroke probability calculator ")
26
+
27
+ # User input for prediction
28
+ gender = st.selectbox("Select gender: ", ["Male", "Female", 'Other'])
29
+ work_type = st.selectbox("Work type: ", ["Private", "Self_employed", 'children', 'Govt_job', 'Never_worked'])
30
+ residence_status = st.selectbox("Residence status: ", ["Urban", "Rural"])
31
+ smoking_status = st.selectbox("Smoking status: ", ["Unknown", "formerly smoked", 'never smoked', 'smokes'])
32
+ age = st.slider("Input age: ", 0, 120)
33
+ hypertension = st.select_slider("Do you have hypertension: ", [0, 1])
34
+ heart_disease = st.select_slider("Do you have heart disease: ", [0, 1])
35
+ ever_married = st.select_slider("Have you ever married? ", [0, 1])
36
+ avg_glucosis_lvl = st.slider("Average glucosis level: ", 50, 280)
37
+ bmi = st.slider("Input Bmi: ", 10, 100)
38
+
39
+ # User input data
40
+ data = {
41
+ "gender": gender,
42
+ "work_type": work_type,
43
+ "Residence_type": residence_status,
44
+ "smoking_status": smoking_status,
45
+ "age": age,
46
+ "hypertension": hypertension,
47
+ "heart_disease": heart_disease,
48
+ "ever_married": ever_married,
49
+ "avg_glucose_level": avg_glucosis_lvl,
50
+ "bmi": bmi
51
+ }
52
+
53
+ # Prediction button
54
+ if st.button("Predict"):
55
+ # Convert input data to a DataFrame
56
+ X = pd.DataFrame([data])
57
+
58
+ # Encode categorical features
59
+ encoded_features = encoder.transform(X[categorical_features])
60
+
61
+ # Get the feature names from the encoder
62
+ feature_names = encoder.get_feature_names_out(input_features=categorical_features)
63
+
64
+ # Create a DataFrame with the encoded features and feature names
65
+ encoded_df = pd.DataFrame(encoded_features, columns=feature_names)
66
+ X_encoded = pd.concat([X.drop(columns=categorical_features), encoded_df], axis=1)
67
+
68
+ # Make predictions
69
+ prediction_proba = lgb1.predict_proba(X_encoded)
70
+
71
+ # Get SHAP values
72
+ explainer = shap.TreeExplainer(lgb1)
73
+ shap_values = explainer.shap_values(X_encoded)
74
+
75
+ # Extract prediction probability and display it to the user
76
+ probability = prediction_proba[0, 1] # Assuming binary classification
77
+ st.subheader(f"The predicted probability of stroke is {probability}.")
78
+ st.subheader("IF you see result , higher than 0.3, we advice you to see a doctor")
79
+ st.header("Shap forceplot")
80
+ st.subheader("Features values impact on model made prediction")
81
+
82
+ # Display SHAP force plot using Matplotlib
83
+ shap.force_plot(explainer.expected_value[1], shap_values[1], features=X_encoded.iloc[0, :], matplotlib=True)
84
+
85
+ # Save the figure to a BytesIO buffer
86
+ buf = io.BytesIO()
87
+ plt.savefig(buf, format="png", dpi=800)
88
+ buf.seek(0)
89
+
90
+ # Display the image in Streamlit
91
+ st.image(buf, width=1100)
92
+
93
+ # Display summary plot of feature importance
94
+ shap.summary_plot(shap_values[1], X_encoded)
95
+
96
+ # Display interaction summary plot
97
+ shap_interaction_values = explainer.shap_interaction_values(X_encoded)
98
+ shap.summary_plot(shap_interaction_values, X_encoded)
99
+
100
+ # Execute get_pred() only if the option is "Stroke prediction"
101
+ if option == "Stroke prediction":
102
+ get_pred()
103
+
104
+ if option == "Model information":
105
+ st.header("Light gradient boosting model")
106
+ st.subheader("First tree of light gradient boosting model and how it makes decisions")
107
+ st.image(r'lgbm_tree.png')
108
+
109
+ st.subheader("Shap values visualization of how features contribute to model prediction")
110
+ st.image(r'lgbm_model_shap_evaluation.png')
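app.py is meant to be launched with Streamlit (for example `streamlit run app.py`). The same encode-then-predict path can be exercised outside the UI; a minimal sketch, assuming the artifacts are in the working directory, that the encoder returns a dense array (as app.py assumes), and using made-up input values:

```python
# Offline sanity check of the encode-then-predict path used in app.py.
import pickle
import joblib
import pandas as pd

with open("lgb1_model.pkl", "rb") as f:
    lgb1 = pickle.load(f)
encoder = joblib.load("encoder.joblib")
categorical_features = joblib.load("categorical_features.joblib")

# One made-up row with the same fields the app collects from the user
X = pd.DataFrame([{
    "gender": "Male", "work_type": "Private", "Residence_type": "Urban",
    "smoking_status": "never smoked", "age": 67, "hypertension": 0,
    "heart_disease": 1, "ever_married": 1, "avg_glucose_level": 228.7, "bmi": 36.6,
}])

encoded = pd.DataFrame(
    encoder.transform(X[categorical_features]),
    columns=encoder.get_feature_names_out(input_features=categorical_features),
)
X_encoded = pd.concat([X.drop(columns=categorical_features), encoded], axis=1)
print("P(stroke) =", lgb1.predict_proba(X_encoded)[0, 1])
```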
Stroke_prediction_data_preprocess&Modeling/Stroke_Prediction/categorical_features.joblib ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:f062d6d462ef03c00858d62fec2371dc86171d13ba4b46ba71f5442ec4d6a1b8
3
+ size 71
Stroke_prediction_data_preprocess&Modeling/Stroke_Prediction/encoder.joblib ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:e139e75012e09d62f164d3772c6ce662ad789301dc4b77eaa0011c05133727ed
3
+ size 2062
Stroke_prediction_data_preprocess&Modeling/Stroke_Prediction/features.joblib ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:4a5cbd183100036afed0ce12e44bc66eaa9249a3f62eab9736ac6ef88a24e454
3
+ size 158
Stroke_prediction_data_preprocess&Modeling/Stroke_Prediction/lgb1_model.joblib ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:7f8316f333a916e1295da52b74da9251ece8a6b695fb9b796f1ee859843d7862
3
+ size 162444
Stroke_prediction_data_preprocess&Modeling/Stroke_Prediction/lgb1_model.pkl ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:3ae238aec1374c8eb1f1fad09d5324929ff544dedca84274f0009bc551c76795
3
+ size 162203
Stroke_prediction_data_preprocess&Modeling/Stroke_Prediction/lgbm_model_shap_evaluation.png ADDED
Stroke_prediction_data_preprocess&Modeling/Stroke_Prediction/lgbm_tree.png ADDED
Stroke_prediction_data_preprocess&Modeling/Stroke_Prediction/mlapi.py ADDED
@@ -0,0 +1,65 @@
1
+ import pickle
2
+ from fastapi import FastAPI, HTTPException
3
+ from pydantic import BaseModel
4
+ import pandas as pd
5
+ import joblib
6
+ import shap
7
+
8
+ # Create a FastAPI instance
9
+ app = FastAPI()
10
+
11
+ # Load necessary objects
12
+ categorical_features = joblib.load("categorical_features.joblib")
13
+ features = joblib.load("features.joblib")
14
+ encoder = joblib.load("encoder.joblib")
15
+
16
+ # Define a Pydantic model for the input data
17
+ class ScoringItem(BaseModel):
18
+ gender: str
19
+ work_type: str
20
+ Residence_type: str
21
+ smoking_status: str
22
+ age: float
23
+ hypertension: int
24
+ heart_disease: int
25
+ ever_married: int
26
+ avg_glucose_level: float
27
+ bmi: float
28
+
29
+ # Load the LightGBM model
30
+ with open('lgb1_model.pkl', 'rb') as f:
31
+ model = pickle.load(f)
32
+
33
+ # Define the scoring endpoint
34
+ @app.post('/')
35
+ async def scoring_endpoint(item: ScoringItem):
36
+ try:
37
+ # Convert the Pydantic model to a Pandas DataFrame
38
+ df = pd.DataFrame([item.dict().values()], columns=item.dict().keys())
39
+
40
+ # Encode categorical features
41
+ encoded_features = encoder.transform(df[categorical_features])
42
+
43
+ # Get the feature names from the encoder
44
+ feature_names = encoder.get_feature_names_out(input_features=categorical_features)
45
+
46
+ # Create a DataFrame with the encoded features and feature names
47
+ encoded_df = pd.DataFrame(encoded_features, columns=feature_names)
48
+ df_encoded = pd.concat([df.drop(columns=categorical_features), encoded_df], axis=1)
49
+
50
+ # Make probability predictions using the LightGBM model
51
+ pred_proba = model.predict_proba(df_encoded)
52
+
53
+ # Assuming a binary classification problem, use probabilities for the positive class
54
+ positive_class_probability = pred_proba[:, 1]
55
+
56
+ # Prepare the response with SHAP values
57
+ response = {
58
+ "Probability of getting stroke is: ": positive_class_probability[0],
59
+ }
60
+
61
+ return response
62
+
63
+ except Exception as e:
64
+ # Handle exceptions and return an HTTP 500 error
65
+ raise HTTPException(status_code=500, detail=str(e))
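With the API started locally (for example `uvicorn mlapi:app --reload`; the FastAPI instance inside mlapi.py is named `app`), the endpoint can be exercised with a small client script. The payload fields mirror ScoringItem; the values below are made up:

```python
# Hypothetical client call against the scoring endpoint defined above.
import requests

payload = {
    "gender": "Female", "work_type": "Private", "Residence_type": "Rural",
    "smoking_status": "formerly smoked", "age": 54, "hypertension": 1,
    "heart_disease": 0, "ever_married": 1, "avg_glucose_level": 110.5, "bmi": 27.3,
}

resp = requests.post("http://127.0.0.1:8000/", json=payload)
resp.raise_for_status()
print(resp.json())   # e.g. {"Probability of getting stroke is: ": 0.12}
```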
Stroke_prediction_data_preprocess&Modeling/Stroke_Prediction/requirements.txt ADDED
Binary file (3.49 kB). View file