Andrea Maldonado committed commit 597d704 · 2 parent(s): 164cf37 7593020

Merge branch 'main' into demo-icpm24

* main: (72 commits)
Test huggingface CI
Update README.md
Update README.md
Update README.md
Triggers CI
Update README.md
Adds preprint citation
Update .conda.yml
Updates conda
Updates conda
Updates conda yml
Erases windows package
Update .conda.yml
Update setup.py
Updates paper figs notebooks
Update README.md
Update README.md
Update README.md
Update README.md
Removes unused libs
...

.github/workflows/test_gedi.yml CHANGED
@@ -36,7 +36,6 @@ jobs:
       - name: Compare output
         run: diff data/validation/test_feat.csv data/test_feat.csv

-
   test_generation:
     runs-on: ubuntu-latest

@@ -72,7 +71,7 @@ jobs:
           diff data/validation/genELexperiment2_07_04.json output/features/grid_feat/2_enself_rt20v/genELexperiment2_07_04.json

       - name: Compare output 3
-        run:
+        run:
           diff data/validation/genELexperiment3_04_nan.json output/features/grid_feat/2_enself_rt20v/genELexperiment3_04_nan.json

       - name: Compare output 4
@@ -109,7 +108,6 @@ jobs:
       - name: Compare output
         run: diff data/validation/test_benchmark.csv output/benchmark/test_benchmark.csv

-
   test_augmentation:
     runs-on: ubuntu-latest

@@ -156,7 +154,6 @@ jobs:

       - name: Run test
         run:
-
           python main.py -a config_files/pipeline_steps/evaluation_plotter.json

   test_integration:
@@ -244,5 +241,5 @@ jobs:
           python main.py -a config_files/test/test_abbrv_generation.json

       - name: Compare output
-        run:
-          diff data/validation/2_ense_rmcv_feat.csv output/test/igedi_table_1/2_ense_rmcv_feat.csv
+        run:
+          diff data/validation/2_ense_rmcv_feat.csv output/test/igedi_table_1/2_ense_rmcv_feat.csv
README.md CHANGED
@@ -12,10 +12,10 @@ license: mit

 <p>
   <img src="gedi/utils/logo.png" alt="Logo" width="100" align="left" />
-  <h1 style="display: inline;">iGEDI</h1>
+  <h1 style="display: inline;">(i)GEDI</h1>
 </p>

-**i**nteractive **G**enerating **E**vent **D**ata with **I**ntentional Features for Benchmarking Process Mining<br />
+(**i**nteractive) **G**enerating **E**vent **D**ata with **I**ntentional Features for Benchmarking Process Mining<br />
 This repository contains the codebase for the interactive web application tool (iGEDI) as well as for the [GEDI paper](https://mcml.ai/publications/gedi.pdf) accepted at the BPM'24 conference.

 ## Table of Contents
@@ -87,7 +87,6 @@ The JSON file consists of the following key-value pairs:
 - font_size: label font size of the output plot
 - boxplot_width: width of the violinplot/boxplot

-
 ### Generation
 ---
 After having extracted meta features from the files, the next step is to generate event log data accordingly. Generally, there are two settings on how the targets are defined: i) meta feature targets are defined by the meta features from the real event log data; ii) a configuration space is defined which resembles the feasible meta features space.
@@ -389,7 +388,7 @@ python main.py -a config_files/experiment_real_targets.json
 To execute the experiments with grid targets, a single [configuration](config_files/grid_2obj) can be selected or all [grid objectives](data/grid_2obj) can be run with one command using the following script. This script will output the [generated event logs (GenED)](data/event_logs/GenED), alongside their respectively measured [feature values](data/GenED_feat.csv) and [benchmark metrics values](data/GenED_bench.csv).
 ```
 conda activate gedi
-python execute_grid_experiments.py config_files/grid_2obj
+python gedi/utils/execute_grid_experiments.py config_files/test
 ```
 We employ the [experiment_grid_2obj_configfiles_fabric.ipynb](notebooks/experiment_grid_2obj_configfiles_fabric.ipynb) to create all necessary [configuration](config_files/grid_2obj) and [objective](data/grid_2obj) files for this experiment.
 For more details about these config_files, please refer to [Feature Extraction](#feature-extraction), [Generation](#generation), and [Benchmark](#benchmark).
@@ -401,6 +400,7 @@ streamlit run utils/config_fabric.py # To tunnel to local machine add: --server.
 ssh -N -f -L 9000:localhost:8501 <user@remote_machine.com>
 open "http://localhost:9000/"
 ```
+
 ### Visualizations
 To run the visualizations, we employ [jupyter notebooks](https://jupyter.org/install) and [add the installed environment to the jupyter notebook](https://medium.com/@nrk25693/how-to-add-your-conda-environment-to-your-jupyter-notebook-in-just-4-steps-abeab8b8d084). We then start all visualizations by running e.g.: `jupyter notebook`. In the following, we describe the `.ipynb`-files in the folder `\notebooks` to reproduce the figures from our paper.

@@ -418,17 +418,43 @@ This notebook is used to answer the question if there is a statistically signifi
 Likewise to the evaluation on the statistical tests in notebook `gedi_figs7and8_benchmarking_statisticalTests.ipynb`, this notebook is used to compute the differences between two correlation matrices $\Delta C = C_1 - C_2$. This logic is employed to evaluate and visualize the distance of two correlation matrices. Furthermore, we show how significant scores are retained from the correlations being evaluated on real-world datasets compared to synthesized event log datasets with real-world targets. In Fig. 9 and 10 in the paper, the results of the notebook are shown.

 ## Citation
-The `GEDI` framework is taken directly from the original paper by [Maldonado](mailto:[email protected]), Frey, Tavares, Rehwald and Seidl and is *to appear on BPM'24*.
-
-```bibtex
-@article{maldonado2024gedi,
-  author = {Maldonado, Andrea and Frey, {Christian M. M.} and Tavares, {Gabriel M.} and Rehwald, Nikolina and Seidl, Thomas},
-  title = {{GEDI:} Generating Event Data with Intentional Features for Benchmarking Process Mining},
-  journal = {To be published in BPM 2024. Krakow, Poland, Sep 01-06},
-  volume = {},
+The `GEDI` framework is taken directly from the original paper by [Maldonado](mailto:[email protected]), Frey, Tavares, Rehwald and Seidl on BPM'24.
+
+```
+@InProceedings{maldonado2024gedi,
+author="Maldonado, Andrea
+and Frey, Christian M. M.
+and Tavares, Gabriel Marques
+and Rehwald, Nikolina
+and Seidl, Thomas",
+editor="Marrella, Andrea
+and Resinas, Manuel
+and Jans, Mieke
+and Rosemann, Michael",
+title="GEDI: Generating Event Data with Intentional Features for Benchmarking Process Mining",
+booktitle="Business Process Management",
+year="2024",
+publisher="Springer Nature Switzerland",
+address="Cham",
+pages="221--237",
+abstract="Process mining solutions include enhancing performance, conserving resources, and alleviating bottlenecks in organizational contexts. However, as in other data mining fields, success hinges on data quality and availability. Existing analyses for process mining solutions lack diverse and ample data for rigorous testing, hindering insights' generalization. To address this, we propose Generating Event Data with Intentional features, a framework producing event data sets satisfying specific meta-features. Considering the meta-feature space that defines feasible event logs, we observe that existing real-world datasets describe only local areas within the overall space. Hence, our framework aims at providing the capability to generate an event data benchmark, which covers unexplored regions. Therefore, our approach leverages a discretization of the meta-feature space to steer generated data towards regions, where a combination of meta-features is not met yet by existing benchmark datasets. Providing a comprehensive data pool enriches process mining analyses, enables methods to capture a wider range of real-world scenarios, and improves evaluation quality. Moreover, it empowers analysts to uncover correlations between meta-features and evaluation metrics, enhancing explainability and solution effectiveness. Experiments demonstrate GEDI's ability to produce a benchmark of intentional event data sets and robust analyses for process mining tasks.",
+isbn="978-3-031-70396-6"
+}
+```
+
+Furthermore, the `iGEDI` web application is taken directly from the original paper by [Maldonado](mailto:[email protected]), Aryasomayajula, Frey, and Seidl and is *to appear on Demos@ICPM'24*.
+```
+@inproceedings{maldonado2024igedi,
+  author    = {Andrea Maldonado and
+               Sai Anirudh Aryasomayajula and
+               Christian M. M. Frey and
+               Thomas Seidl},
+  editor    = {Jochen De Weerdt, Giovanni Meroni, Han van der Aa, and Karolin Winter},
+  title     = {iGEDI: interactive Generating Event Data with Intentional Features},
+  booktitle = {ICPM 2024 Tool Demonstration Track, October 14-18, 2024, Kongens Lyngby, Denmark},
+  series    = {{CEUR} Workshop Proceedings},
+  publisher = {CEUR-WS.org},
   year = {2024},
-  url = {https://mcml.ai/publications/gedi.pdf},
-  doi = {},
-  eprinttype = {website},
+  bibsource = {dblp computer science bibliography, https://dblp.org}
 }
 ```
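The correlation-matrix comparison $\Delta C = C_1 - C_2$ described in the visualization notes above can be sketched in a few lines. This is a minimal illustration with synthetic stand-in data, not the paper's actual feature tables; the variable names are hypothetical.

```python
import numpy as np

# Illustrative stand-ins: 100 samples of 3 meta-features each.
rng = np.random.default_rng(0)
real = rng.normal(size=(100, 3))                      # "real event log" features
synth = real + rng.normal(scale=0.1, size=(100, 3))   # "synthesized" features

# Correlation matrix of each dataset (columns are features).
c1 = np.corrcoef(real, rowvar=False)
c2 = np.corrcoef(synth, rowvar=False)

# Element-wise difference of the two correlation matrices.
delta_c = c1 - c2

print(delta_c.shape)  # (3, 3)
```

The diagonal of `delta_c` is always zero, since both correlation matrices have ones on the diagonal; the off-diagonal magnitudes are what the notebook visualizes as distance.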
config_files/pipeline_steps/benchmark.json CHANGED
@@ -4,6 +4,10 @@
     "benchmark_test": "discovery",
     "input_path":"data/test",
     "output_path":"output",
+<<<<<<<< HEAD:config_files/pipeline_steps/benchmark.json
     "miners" : ["ind", "heu", "imf", "ilp"]
+========
+    "miners" : ["inductive", "heu", "imf", "ilp"]
+>>>>>>>> main:config_files/algorithm/pipeline_steps/benchmark.json
   }
 ]
data/test/grid_experiments/rt10v.csv DELETED
@@ -1,12 +0,0 @@
-task,ratio_top_10_variants
-task_1,0.0
-task_2,0.1
-task_3,0.2
-task_4,0.3
-task_5,0.4
-task_6,0.5
-task_7,0.6
-task_8,0.7
-task_9,0.8
-task_10,0.9
-task_11,1.0
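The deleted grid file above enumerates a single objective, `ratio_top_10_variants`, from 0.0 to 1.0 in steps of 0.1. As a hedged sketch (mirroring the `np.arange`/`np.around` pattern used in the fabric notebook later in this diff, with hypothetical variable names), such a grid can be regenerated like this:

```python
import numpy as np
import pandas as pd

# 0.0, 0.1, ..., 1.0 — rounded to avoid float noise like 0.30000000000000004.
values = np.around(np.arange(0.0, 1.1, 0.1), 2)

grid = pd.DataFrame({
    "task": [f"task_{i + 1}" for i in range(len(values))],
    "ratio_top_10_variants": values,
})
print(len(grid))  # 11
```

Writing `grid.to_csv(path, index=False)` reproduces the row layout shown in the removed file.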
data/validation/genELexperiment1_04_02.json CHANGED
@@ -1 +1 @@
-{"ratio_top_20_variants": 0.20017714791851196, "epa_normalized_sequence_entropy_linear_forgetting": 0.052097205658647734, "log": "genELexperiment1_04_02", "target_similarity": 0.7418932364693804}
+{"ratio_top_20_variants": 0.20017714791851196, "epa_normalized_sequence_entropy_linear_forgetting": 0.052097205658647734, "log": "genELexperiment1_04_02", "target_similarity": 0.7418932364693804}
data/validation/genELexperiment2_07_04.json CHANGED
@@ -1 +1 @@
-{"ratio_top_20_variants": 0.38863337713534823, "epa_normalized_sequence_entropy_linear_forgetting": 0.052097205658647734, "log": "genELexperiment2_07_04", "target_similarity": 0.6067951985524301}
+{"ratio_top_20_variants": 0.38863337713534823, "epa_normalized_sequence_entropy_linear_forgetting": 0.052097205658647734, "log": "genELexperiment2_07_04", "target_similarity": 0.6067951985524301}
data/validation/test_benchmark.csv CHANGED
@@ -0,0 +1,3 @@
+log,fitness_inductive,precision_inductive,fscore_inductive,size_inductive,pnsize_inductive,cfc_inductive,fitness_heu,precision_heu,fscore_heu,size_heu,pnsize_heu,cfc_heu,fitness_imf,precision_imf,fscore_imf,size_imf,pnsize_imf,cfc_imf,fitness_ilp,precision_ilp,fscore_ilp,size_ilp,pnsize_ilp,cfc_ilp
+gen_el_169,0.9998052420892378,0.6662312989788649,0.7996241723917423,34,24,22,0.9383563249832565,0.5979149389882715,0.7304143193451293,22,14,13,0.9358843752091403,0.6513022517490741,0.7680805654451066,28,18,16,0.9999637006454563,0.432690150325331,0.6040181215566763,27,7,9
+gen_el_168,0.9997678338833808,0.6033523537803138,0.7525477883058467,61,34,20,0.48155419290534085,0.9449078138718174,0.6379760800037585,60,35,32,0.9479094601490539,0.5169524053224155,0.669037930473001,67,38,24,0.9999513902099882,0.4283471743974073,0.5997714527549697,93,30,28
gedi/generator.py CHANGED
@@ -152,6 +152,7 @@ class GenerateEventLogs():

         self.params = params.get(GENERATOR_PARAMS)
         experiment = self.params.get(EXPERIMENT)
+
         if experiment is not None:
             tasks, output_path = get_tasks(experiment, self.output_path)
             columns_to_rename = {col: column_mappings()[col] for col in tasks.columns if col in column_mappings()}
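The hunk above builds a rename map only for columns that appear in a known mapping. Stripped of the surrounding class, the dict-comprehension pattern can be sketched as follows; `column_mappings` here is a hypothetical stand-in for the project's helper, with made-up entries.

```python
# Hypothetical stand-in for the project's column_mappings() helper.
def column_mappings():
    return {
        "rt10v": "ratio_top_10_variants",
        "rmcv": "ratio_most_common_variant",
    }

# Columns of a tasks table; "task" has no mapping and is left untouched.
columns = ["task", "rt10v", "rmcv"]

# Same shape as the generator.py line: map only the columns we know about.
columns_to_rename = {col: column_mappings()[col] for col in columns if col in column_mappings()}
print(columns_to_rename)
```

Calling `column_mappings()` twice per column works but is wasteful; binding the dict once before the comprehension is the cheaper variant.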
notebooks/experiment_grid_2obj_configfiles_fabric.ipynb ADDED
@@ -0,0 +1,1184 @@
+ {
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "08ee6ee0",
+ "metadata": {},
+ "source": [
+ "## Grid Objectives\n",
+ "Iterating between min and max for each column\n",
+ "\n",
+ "### Glossary\n",
+ "- **task**: Refers to the set of values (row) and corresponding keys to be aimed at sequentially.\n",
+ "- **objective**: Refers to one key (column) and respective value to be aimed at simultaneously during a task.\n",
+ "- **experiment**: Refers to one file containing a multiple of objectives and tasks for a fixed number of each, respectively. "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "id": "e5aa7223",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import itertools\n",
+ "import json\n",
+ "import numpy as np\n",
+ "import os\n",
+ "import pandas as pd"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "id": "472fd031",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "#Features between 0 and 1: \n",
+ "normalized_feature_names = ['ratio_variants_per_number_of_traces', 'trace_len_hist1', 'trace_len_hist2',\n",
+ " 'trace_len_hist3', 'trace_len_hist4', 'trace_len_hist5', 'trace_len_hist7',\n",
+ " 'trace_len_hist8', 'trace_len_hist9', 'ratio_most_common_variant', \n",
+ " 'ratio_top_1_variants', 'ratio_top_5_variants', 'ratio_top_10_variants', \n",
+ " 'ratio_top_20_variants', 'ratio_top_50_variants', 'ratio_top_75_variants', \n",
+ " 'epa_normalized_variant_entropy', 'epa_normalized_sequence_entropy', \n",
+ " 'epa_normalized_sequence_entropy_linear_forgetting', 'epa_normalized_sequence_entropy_exponential_forgetting']\n",
+ "\n",
+ "normalized_feature_names = ['ratio_variants_per_number_of_traces', 'ratio_most_common_variant', \n",
+ " 'ratio_top_10_variants', 'epa_normalized_variant_entropy', 'epa_normalized_sequence_entropy', \n",
+ " 'epa_normalized_sequence_entropy_linear_forgetting', 'epa_normalized_sequence_entropy_exponential_forgetting']\n",
+ "\n",
+ "def abbrev_obj_keys(obj_keys):\n",
+ " abbreviated_keys = []\n",
+ " for obj_key in obj_keys:\n",
+ " key_slices = obj_key.split(\"_\")\n",
+ " chars = []\n",
+ " for key_slice in key_slices:\n",
+ " for idx, single_char in enumerate(key_slice):\n",
+ " if idx == 0 or single_char.isdigit():\n",
+ " chars.append(single_char)\n",
+ " abbreviated_key = ''.join(chars)\n",
+ " abbreviated_keys.append(abbreviated_key)\n",
+ " return '_'.join(abbreviated_keys) "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "id": "2be119c8",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "21 [('epa_normalized_sequence_entropy_linear_forgetting', 'ratio_top_10_variants'), ('epa_normalized_sequence_entropy_exponential_forgetting', 'epa_normalized_variant_entropy'), ('epa_normalized_variant_entropy', 'ratio_variants_per_number_of_traces'), ('epa_normalized_sequence_entropy_linear_forgetting', 'ratio_most_common_variant'), ('epa_normalized_sequence_entropy', 'ratio_variants_per_number_of_traces'), ('epa_normalized_sequence_entropy_exponential_forgetting', 'ratio_top_10_variants'), ('epa_normalized_sequence_entropy_exponential_forgetting', 'epa_normalized_sequence_entropy_linear_forgetting'), ('epa_normalized_sequence_entropy', 'epa_normalized_variant_entropy'), ('epa_normalized_sequence_entropy_exponential_forgetting', 'ratio_most_common_variant'), ('ratio_top_10_variants', 'ratio_variants_per_number_of_traces'), ('epa_normalized_sequence_entropy', 'ratio_top_10_variants'), ('epa_normalized_variant_entropy', 'ratio_top_10_variants'), ('epa_normalized_sequence_entropy', 'epa_normalized_sequence_entropy_linear_forgetting'), ('ratio_most_common_variant', 'ratio_variants_per_number_of_traces'), ('epa_normalized_variant_entropy', 'ratio_most_common_variant'), ('epa_normalized_sequence_entropy', 'ratio_most_common_variant'), ('epa_normalized_sequence_entropy_linear_forgetting', 'ratio_variants_per_number_of_traces'), ('epa_normalized_sequence_entropy', 'epa_normalized_sequence_entropy_exponential_forgetting'), ('epa_normalized_sequence_entropy_linear_forgetting', 'epa_normalized_variant_entropy'), ('epa_normalized_sequence_entropy_exponential_forgetting', 'ratio_variants_per_number_of_traces'), ('ratio_most_common_variant', 'ratio_top_10_variants')]\n",
+ "121\n",
+ "Saved experiment in ../data/grid_2obj/grid_2objectives_enself_rt10v.csv\n",
+ "Saved experiment config in ../config_files/algorithm/grid_2obj/generator_grid_2objectives_enself_rt10v.json\n",
+ "Saved experiment in ../data/grid_2obj/grid_2objectives_enseef_enve.csv\n",
+ "Saved experiment config in ../config_files/algorithm/grid_2obj/generator_grid_2objectives_enseef_enve.json\n",
+ "Saved experiment in ../data/grid_2obj/grid_2objectives_enve_rvpnot.csv\n",
+ "Saved experiment config in ../config_files/algorithm/grid_2obj/generator_grid_2objectives_enve_rvpnot.json\n",
+ "Saved experiment in ../data/grid_2obj/grid_2objectives_enself_rmcv.csv\n",
+ "Saved experiment config in ../config_files/algorithm/grid_2obj/generator_grid_2objectives_enself_rmcv.json\n",
+ "Saved experiment in ../data/grid_2obj/grid_2objectives_ense_rvpnot.csv\n",
+ "Saved experiment config in ../config_files/algorithm/grid_2obj/generator_grid_2objectives_ense_rvpnot.json\n",
+ "Saved experiment in ../data/grid_2obj/grid_2objectives_enseef_rt10v.csv\n",
+ "Saved experiment config in ../config_files/algorithm/grid_2obj/generator_grid_2objectives_enseef_rt10v.json\n",
+ "Saved experiment in ../data/grid_2obj/grid_2objectives_enseef_enself.csv\n",
+ "Saved experiment config in ../config_files/algorithm/grid_2obj/generator_grid_2objectives_enseef_enself.json\n",
+ "Saved experiment in ../data/grid_2obj/grid_2objectives_ense_enve.csv\n",
+ "Saved experiment config in ../config_files/algorithm/grid_2obj/generator_grid_2objectives_ense_enve.json\n",
+ "Saved experiment in ../data/grid_2obj/grid_2objectives_enseef_rmcv.csv\n",
+ "Saved experiment config in ../config_files/algorithm/grid_2obj/generator_grid_2objectives_enseef_rmcv.json\n",
+ "Saved experiment in ../data/grid_2obj/grid_2objectives_rt10v_rvpnot.csv\n",
+ "Saved experiment config in ../config_files/algorithm/grid_2obj/generator_grid_2objectives_rt10v_rvpnot.json\n",
+ "Saved experiment in ../data/grid_2obj/grid_2objectives_ense_rt10v.csv\n",
+ "Saved experiment config in ../config_files/algorithm/grid_2obj/generator_grid_2objectives_ense_rt10v.json\n",
+ "Saved experiment in ../data/grid_2obj/grid_2objectives_enve_rt10v.csv\n",
+ "Saved experiment config in ../config_files/algorithm/grid_2obj/generator_grid_2objectives_enve_rt10v.json\n",
+ "Saved experiment in ../data/grid_2obj/grid_2objectives_ense_enself.csv\n",
+ "Saved experiment config in ../config_files/algorithm/grid_2obj/generator_grid_2objectives_ense_enself.json\n",
+ "Saved experiment in ../data/grid_2obj/grid_2objectives_rmcv_rvpnot.csv\n",
+ "Saved experiment config in ../config_files/algorithm/grid_2obj/generator_grid_2objectives_rmcv_rvpnot.json\n",
+ "Saved experiment in ../data/grid_2obj/grid_2objectives_enve_rmcv.csv\n",
+ "Saved experiment config in ../config_files/algorithm/grid_2obj/generator_grid_2objectives_enve_rmcv.json\n",
+ "Saved experiment in ../data/grid_2obj/grid_2objectives_ense_rmcv.csv\n",
+ "Saved experiment config in ../config_files/algorithm/grid_2obj/generator_grid_2objectives_ense_rmcv.json\n",
+ "Saved experiment in ../data/grid_2obj/grid_2objectives_enself_rvpnot.csv\n",
+ "Saved experiment config in ../config_files/algorithm/grid_2obj/generator_grid_2objectives_enself_rvpnot.json\n",
+ "Saved experiment in ../data/grid_2obj/grid_2objectives_ense_enseef.csv\n",
+ "Saved experiment config in ../config_files/algorithm/grid_2obj/generator_grid_2objectives_ense_enseef.json\n",
+ "Saved experiment in ../data/grid_2obj/grid_2objectives_enself_enve.csv\n",
+ "Saved experiment config in ../config_files/algorithm/grid_2obj/generator_grid_2objectives_enself_enve.json\n",
+ "Saved experiment in ../data/grid_2obj/grid_2objectives_enseef_rvpnot.csv\n",
+ "Saved experiment config in ../config_files/algorithm/grid_2obj/generator_grid_2objectives_enseef_rvpnot.json\n",
+ "Saved experiment in ../data/grid_2obj/grid_2objectives_rmcv_rt10v.csv\n",
+ "Saved experiment config in ../config_files/algorithm/grid_2obj/generator_grid_2objectives_rmcv_rt10v.json\n",
+ "None\n"
+ ]
+ }
+ ],
+ "source": [
+ "def write_generator_experiment(experiment_path, objectives=[\"ratio_top_20_variants\", \"epa_normalized_sequence_entropy_linear_forgetting\"]):\n",
+ " first_dir = os.path.split(experiment_path[3:])[-1].replace(\".csv\",\"\")\n",
+ " second_dir = first_dir.replace(\"grid_\",\"\").replace(\"objectives\",\"\")\n",
+ "\n",
+ " experiment = [\n",
+ " {\n",
+ " 'pipeline_step': 'event_logs_generation',\n",
+ " 'output_path':'output/generated/grid_2obj',\n",
+ " 'generator_params': {\n",
+ " \"experiment\": {\"input_path\": experiment_path[3:],\n",
+ " \"objectives\": objectives},\n",
+ " 'config_space': {\n",
+ " 'mode': [5, 20],\n",
+ " 'sequence': [0.01, 1],\n",
+ " 'choice': [0.01, 1],\n",
+ " 'parallel': [0.01, 1],\n",
+ " 'loop': [0.01, 1],\n",
+ " 'silent': [0.01, 1],\n",
+ " 'lt_dependency': [0.01, 1],\n",
+ " 'num_traces': [10, 10001],\n",
+ " 'duplicate': [0],\n",
+ " 'or': [0]\n",
+ " },\n",
+ " 'n_trials': 200\n",
+ " }\n",
+ " },\n",
+ " {\n",
+ " 'pipeline_step': 'feature_extraction',\n",
+ " 'input_path': os.path.join('output','features', 'generated', 'grid_2obj', first_dir, second_dir),\n",
+ " \"feature_params\": {\"feature_set\":[\"ratio_variants_per_number_of_traces\",\"ratio_most_common_variant\",\"ratio_top_10_variants\",\"epa_normalized_variant_entropy\",\"epa_normalized_sequence_entropy\",\"epa_normalized_sequence_entropy_linear_forgetting\",\"epa_normalized_sequence_entropy_exponential_forgetting\"]},\n",
+ " 'output_path': 'output/plots',\n",
+ " 'real_eventlog_path': 'data/BaselineED_feat.csv',\n",
+ " 'plot_type': 'boxplot'\n",
+ " },\n",
+ " {\n",
+ " \"pipeline_step\": \"benchmark_test\",\n",
+ " \"benchmark_test\": \"discovery\",\n",
+ " \"input_path\": os.path.join('output', 'generated', 'grid_2obj', first_dir, second_dir),\n",
+ " \"output_path\":\"output\",\n",
+ " \"miners\" : [\"heu\", \"imf\", \"ilp\"]\n",
+ " }\n",
+ " ]\n",
+ "\n",
+ " #print(\"EXPERIMENT:\", experiment[1]['input_path'])\n",
+ " output_path = os.path.join('..', 'config_files','algorithm','grid_2obj')\n",
+ " os.makedirs(output_path, exist_ok=True)\n",
+ " output_path = os.path.join(output_path, f'generator_{os.path.split(experiment_path)[-1].split(\".\")[0]}.json') \n",
+ " with open(output_path, 'w') as f:\n",
+ " json.dump(experiment, f, ensure_ascii=False)\n",
+ " print(f\"Saved experiment config in {output_path}\")\n",
+ " \n",
+ " return experiment\n",
+ "\n",
+ "def create_objectives_grid(objectives, n_para_obj=2):\n",
+ " parameters_o = \"objectives, \"\n",
+ " if n_para_obj==1:\n",
+ " experiments = [[exp] for exp in objectives]\n",
+ " else:\n",
+ " experiments = eval(f\"[exp for exp in list(itertools.product({(parameters_o*n_para_obj)[:-2]})) if exp[0]!=exp[1]]\")\n",
+ " experiments = list(set([tuple(sorted(exp)) for exp in experiments]))\n",
+ " print(len(experiments), experiments)\n",
+ " \n",
+ " parameters = \"np.around(np.arange(0, 1.1,0.1),2), \"\n",
+ " tasks = eval(f\"list(itertools.product({(parameters*n_para_obj)[:-2]}))\")\n",
+ " tasks = [(f'task_{i+1}',)+task for i, task in enumerate(tasks)]\n",
+ " print(len(tasks))\n",
+ " for exp in experiments:\n",
+ " df = pd.DataFrame(data=tasks, columns=[\"task\", *exp])\n",
+ " experiment_path = os.path.join('..','data', 'grid_2obj')\n",
+ " os.makedirs(experiment_path, exist_ok=True)\n",
+ " experiment_path = os.path.join(experiment_path, f\"grid_{len(df.columns)-1}objectives_{abbrev_obj_keys(exp)}.csv\") \n",
+ " df.to_csv(experiment_path, index=False)\n",
+ " print(f\"Saved experiment in {experiment_path}\")\n",
+ " write_generator_experiment(experiment_path, objectives=exp)\n",
+ " #df.to_csv(f\"../data/grid_{}objectives_{abbrev_obj_keys(objectives.tolist())}.csv\" ,index=False)\n",
+ " \n",
+ "exp_test = create_objectives_grid(normalized_feature_names, n_para_obj=2) \n",
+ "print(exp_test)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "56ab613b",
+ "metadata": {},
+ "source": [
+ "### Helper prototypes"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "id": "dfd1a302",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "df = pd.DataFrame(columns=[\"log\",\"ratio_top_20_variants\", \"epa_normalized_sequence_entropy_linear_forgetting\"]) "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "id": "218946b7",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "k=0\n",
+ "for i in np.arange(0, 1.1,0.2):\n",
+ " for j in np.arange(0,0.55,0.1):\n",
+ " k+=1\n",
+ " new_entry = pd.Series({'log':f\"objective_{k}\", \"ratio_top_20_variants\":round(i,1),\n",
+ " \"epa_normalized_sequence_entropy_linear_forgetting\":round(j,1)})\n",
+ " df = pd.concat([\n",
+ " df, \n",
+ " pd.DataFrame([new_entry], columns=new_entry.index)]\n",
+ " ).reset_index(drop=True)\n",
+ " "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "id": "b1e3bb5a",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "df.to_csv(\"../data/grid_objectives.csv\" ,index=False)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "c12bc19d",
+ "metadata": {},
+ "source": [
+ "## Objectives from real logs\n",
+ "(Feature selection)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "id": "39ac74bb",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "(26, 8)\n",
+ "26 Event-Logs: ['BPIC12' 'BPIC13cp' 'BPIC13inc' 'BPIC13op' 'BPIC14dc_p' 'BPIC14di_p'\n",
+ " 'BPIC14dia_p' 'BPIC15f1' 'BPIC15f2' 'BPIC15f3' 'BPIC15f4' 'BPIC15f5'\n",
274
+ " 'BPIC16c_p' 'BPIC16wm_p' 'BPIC17' 'BPIC17ol' 'BPIC19' 'BPIC20a' 'BPIC20b'\n",
275
+ " 'BPIC20c' 'BPIC20d' 'BPIC20e' 'HD' 'RTFMP' 'RWABOCSL' 'SEPSIS']\n"
276
+ ]
277
+ },
278
+ {
279
+ "data": {
280
+ "text/html": [
281
+ "<div>\n",
282
+ "<style scoped>\n",
283
+ " .dataframe tbody tr th:only-of-type {\n",
284
+ " vertical-align: middle;\n",
285
+ " }\n",
286
+ "\n",
287
+ " .dataframe tbody tr th {\n",
288
+ " vertical-align: top;\n",
289
+ " }\n",
290
+ "\n",
291
+ " .dataframe thead th {\n",
292
+ " text-align: right;\n",
293
+ " }\n",
294
+ "</style>\n",
295
+ "<table border=\"1\" class=\"dataframe\">\n",
296
+ " <thead>\n",
297
+ " <tr style=\"text-align: right;\">\n",
298
+ " <th></th>\n",
299
+ " <th>log</th>\n",
300
+ " <th>ratio_variants_per_number_of_traces</th>\n",
301
+ " <th>ratio_most_common_variant</th>\n",
302
+ " <th>ratio_top_10_variants</th>\n",
303
+ " <th>epa_normalized_variant_entropy</th>\n",
304
+ " <th>epa_normalized_sequence_entropy</th>\n",
305
+ " <th>epa_normalized_sequence_entropy_linear_forgetting</th>\n",
306
+ " <th>epa_normalized_sequence_entropy_exponential_forgetting</th>\n",
307
+ " </tr>\n",
308
+ " </thead>\n",
309
+ " <tbody>\n",
310
+ " <tr>\n",
311
+ " <th>0</th>\n",
312
+ " <td>BPIC16wm_p</td>\n",
313
+ " <td>0.002882</td>\n",
314
+ " <td>0.295803</td>\n",
315
+ " <td>0.714106</td>\n",
316
+ " <td>0.000000</td>\n",
317
+ " <td>0.000000</td>\n",
318
+ " <td>0.000000</td>\n",
319
+ " <td>0.000000</td>\n",
320
+ " </tr>\n",
321
+ " <tr>\n",
322
+ " <th>1</th>\n",
323
+ " <td>BPIC15f5</td>\n",
324
+ " <td>0.997405</td>\n",
325
+ " <td>0.001730</td>\n",
326
+ " <td>0.102076</td>\n",
327
+ " <td>0.648702</td>\n",
328
+ " <td>0.603260</td>\n",
329
+ " <td>0.342410</td>\n",
330
+ " <td>0.404580</td>\n",
331
+ " </tr>\n",
332
+ " <tr>\n",
333
+ " <th>2</th>\n",
334
+ " <td>BPIC15f1</td>\n",
335
+ " <td>0.975813</td>\n",
336
+ " <td>0.006672</td>\n",
337
+ " <td>0.121768</td>\n",
338
+ " <td>0.652855</td>\n",
339
+ " <td>0.610294</td>\n",
340
+ " <td>0.270241</td>\n",
341
+ " <td>0.363928</td>\n",
342
+ " </tr>\n",
343
+ " <tr>\n",
344
+ " <th>3</th>\n",
345
+ " <td>BPIC19</td>\n",
346
+ " <td>0.047562</td>\n",
347
+ " <td>0.199758</td>\n",
348
+ " <td>0.946368</td>\n",
349
+ " <td>0.645530</td>\n",
350
+ " <td>0.328029</td>\n",
351
+ " <td>0.320185</td>\n",
352
+ " <td>0.320282</td>\n",
353
+ " </tr>\n",
354
+ " <tr>\n",
355
+ " <th>4</th>\n",
356
+ " <td>BPIC14dia_p</td>\n",
357
+ " <td>0.496847</td>\n",
358
+ " <td>0.037455</td>\n",
359
+ " <td>0.552836</td>\n",
360
+ " <td>0.774743</td>\n",
361
+ " <td>0.608350</td>\n",
362
+ " <td>0.305614</td>\n",
363
+ " <td>0.377416</td>\n",
364
+ " </tr>\n",
365
+ " </tbody>\n",
366
+ "</table>\n",
367
+ "</div>"
368
+ ],
369
+ "text/plain": [
370
+ " log ratio_variants_per_number_of_traces \n",
371
+ "0 BPIC16wm_p 0.002882 \\\n",
372
+ "1 BPIC15f5 0.997405 \n",
373
+ "2 BPIC15f1 0.975813 \n",
374
+ "3 BPIC19 0.047562 \n",
375
+ "4 BPIC14dia_p 0.496847 \n",
376
+ "\n",
377
+ " ratio_most_common_variant ratio_top_10_variants \n",
378
+ "0 0.295803 0.714106 \\\n",
379
+ "1 0.001730 0.102076 \n",
380
+ "2 0.006672 0.121768 \n",
381
+ "3 0.199758 0.946368 \n",
382
+ "4 0.037455 0.552836 \n",
383
+ "\n",
384
+ " epa_normalized_variant_entropy epa_normalized_sequence_entropy \n",
385
+ "0 0.000000 0.000000 \\\n",
386
+ "1 0.648702 0.603260 \n",
387
+ "2 0.652855 0.610294 \n",
388
+ "3 0.645530 0.328029 \n",
389
+ "4 0.774743 0.608350 \n",
390
+ "\n",
391
+ " epa_normalized_sequence_entropy_linear_forgetting \n",
392
+ "0 0.000000 \\\n",
393
+ "1 0.342410 \n",
394
+ "2 0.270241 \n",
395
+ "3 0.320185 \n",
396
+ "4 0.305614 \n",
397
+ "\n",
398
+ " epa_normalized_sequence_entropy_exponential_forgetting \n",
399
+ "0 0.000000 \n",
400
+ "1 0.404580 \n",
401
+ "2 0.363928 \n",
402
+ "3 0.320282 \n",
403
+ "4 0.377416 "
404
+ ]
405
+ },
406
+ "execution_count": 7,
407
+ "metadata": {},
408
+ "output_type": "execute_result"
409
+ }
410
+ ],
411
+ "source": [
412
+ "bpic_features = pd.read_csv(\"../data/BaselineED_feat.csv\", index_col=None)\n",
413
+ "#bpic_features = pd.read_csv(\"../gedi/output/features/real_event_logs.csv\", index_col=None)\n",
414
+ "\n",
415
+ "#bpic_features = bpic_features.drop(['Unnamed: 0'], axis=1)\n",
416
+ "print(bpic_features.shape)\n",
417
+ "print(len(bpic_features), \" Event-Logs: \", bpic_features.sort_values('log')['log'].unique())\n",
418
+ "\n",
419
+ "#bpic_features.rename(columns={\"variant_entropy\":\"epa_variant_entropy\", \"normalized_variant_entropy\":\"epa_normalized_variant_entropy\", \"sequence_entropy\":\"epa_sequence_entropy\", \"normalized_sequence_entropy\":\"epa_normalized_sequence_entropy\", \"sequence_entropy_linear_forgetting\":\"epa_sequence_entropy_linear_forgetting\", \"normalized_sequence_entropy_linear_forgetting\":\"epa_normalized_sequence_entropy_linear_forgetting\", \"sequence_entropy_exponential_forgetting\":\"epa_sequence_entropy_exponential_forgetting\", \"normalized_sequence_entropy_exponential_forgetting\":\"epa_normalized_sequence_entropy_exponential_forgetting\"},\n",
420
+ "# errors=\"raise\", inplace=True)\n",
421
+ "\n",
422
+ "bpic_features.head()\n",
423
+ "#bpic_features.to_csv(\"../data/BaselineED_feat.csv\", index=False)"
424
+ ]
425
+ },
426
+ {
427
+ "cell_type": "code",
428
+ "execution_count": 8,
429
+ "id": "ef0df0b9",
430
+ "metadata": {},
431
+ "outputs": [
432
+ {
433
+ "name": "stdout",
434
+ "output_type": "stream",
435
+ "text": [
436
+ "['ratio_variants_per_number_of_traces', 'ratio_most_common_variant', 'ratio_top_10_variants', 'epa_normalized_variant_entropy', 'epa_normalized_sequence_entropy', 'epa_normalized_sequence_entropy_linear_forgetting', 'epa_normalized_sequence_entropy_exponential_forgetting']\n"
437
+ ]
438
+ },
439
+ {
440
+ "data": {
441
+ "text/html": [
442
+ "<div>\n",
443
+ "<style scoped>\n",
444
+ " .dataframe tbody tr th:only-of-type {\n",
445
+ " vertical-align: middle;\n",
446
+ " }\n",
447
+ "\n",
448
+ " .dataframe tbody tr th {\n",
449
+ " vertical-align: top;\n",
450
+ " }\n",
451
+ "\n",
452
+ " .dataframe thead th {\n",
453
+ " text-align: right;\n",
454
+ " }\n",
455
+ "</style>\n",
456
+ "<table border=\"1\" class=\"dataframe\">\n",
457
+ " <thead>\n",
458
+ " <tr style=\"text-align: right;\">\n",
459
+ " <th></th>\n",
460
+ " <th>log</th>\n",
461
+ " <th>ratio_variants_per_number_of_traces</th>\n",
462
+ " <th>ratio_most_common_variant</th>\n",
463
+ " <th>ratio_top_10_variants</th>\n",
464
+ " <th>epa_normalized_variant_entropy</th>\n",
465
+ " <th>epa_normalized_sequence_entropy</th>\n",
466
+ " <th>epa_normalized_sequence_entropy_linear_forgetting</th>\n",
467
+ " <th>epa_normalized_sequence_entropy_exponential_forgetting</th>\n",
468
+ " </tr>\n",
469
+ " </thead>\n",
470
+ " <tbody>\n",
471
+ " <tr>\n",
472
+ " <th>0</th>\n",
473
+ " <td>BPIC16wm_p</td>\n",
474
+ " <td>0.002882</td>\n",
475
+ " <td>0.295803</td>\n",
476
+ " <td>0.714106</td>\n",
477
+ " <td>0.000000</td>\n",
478
+ " <td>0.000000</td>\n",
479
+ " <td>0.000000</td>\n",
480
+ " <td>0.000000</td>\n",
481
+ " </tr>\n",
482
+ " <tr>\n",
483
+ " <th>1</th>\n",
484
+ " <td>BPIC15f5</td>\n",
485
+ " <td>0.997405</td>\n",
486
+ " <td>0.001730</td>\n",
487
+ " <td>0.102076</td>\n",
488
+ " <td>0.648702</td>\n",
489
+ " <td>0.603260</td>\n",
490
+ " <td>0.342410</td>\n",
491
+ " <td>0.404580</td>\n",
492
+ " </tr>\n",
493
+ " <tr>\n",
494
+ " <th>2</th>\n",
495
+ " <td>BPIC15f1</td>\n",
496
+ " <td>0.975813</td>\n",
497
+ " <td>0.006672</td>\n",
498
+ " <td>0.121768</td>\n",
499
+ " <td>0.652855</td>\n",
500
+ " <td>0.610294</td>\n",
501
+ " <td>0.270241</td>\n",
502
+ " <td>0.363928</td>\n",
503
+ " </tr>\n",
504
+ " <tr>\n",
505
+ " <th>3</th>\n",
506
+ " <td>BPIC19</td>\n",
507
+ " <td>0.047562</td>\n",
508
+ " <td>0.199758</td>\n",
509
+ " <td>0.946368</td>\n",
510
+ " <td>0.645530</td>\n",
511
+ " <td>0.328029</td>\n",
512
+ " <td>0.320185</td>\n",
513
+ " <td>0.320282</td>\n",
514
+ " </tr>\n",
515
+ " <tr>\n",
516
+ " <th>4</th>\n",
517
+ " <td>BPIC14dia_p</td>\n",
518
+ " <td>0.496847</td>\n",
519
+ " <td>0.037455</td>\n",
520
+ " <td>0.552836</td>\n",
521
+ " <td>0.774743</td>\n",
522
+ " <td>0.608350</td>\n",
523
+ " <td>0.305614</td>\n",
524
+ " <td>0.377416</td>\n",
525
+ " </tr>\n",
526
+ " <tr>\n",
527
+ " <th>5</th>\n",
528
+ " <td>BPIC15f2</td>\n",
529
+ " <td>0.995192</td>\n",
530
+ " <td>0.002404</td>\n",
531
+ " <td>0.103365</td>\n",
532
+ " <td>0.627973</td>\n",
533
+ " <td>0.602371</td>\n",
534
+ " <td>0.317217</td>\n",
535
+ " <td>0.390473</td>\n",
536
+ " </tr>\n",
537
+ " <tr>\n",
538
+ " <th>6</th>\n",
539
+ " <td>BPIC15f3</td>\n",
540
+ " <td>0.957417</td>\n",
541
+ " <td>0.010646</td>\n",
542
+ " <td>0.137686</td>\n",
543
+ " <td>0.661781</td>\n",
544
+ " <td>0.605676</td>\n",
545
+ " <td>0.341521</td>\n",
546
+ " <td>0.404934</td>\n",
547
+ " </tr>\n",
548
+ " <tr>\n",
549
+ " <th>7</th>\n",
550
+ " <td>BPIC13cp</td>\n",
551
+ " <td>0.123067</td>\n",
552
+ " <td>0.331540</td>\n",
553
+ " <td>0.840619</td>\n",
554
+ " <td>0.705383</td>\n",
555
+ " <td>0.310940</td>\n",
556
+ " <td>0.286515</td>\n",
557
+ " <td>0.288383</td>\n",
558
+ " </tr>\n",
559
+ " <tr>\n",
560
+ " <th>8</th>\n",
561
+ " <td>BPIC14dc_p</td>\n",
562
+ " <td>0.048444</td>\n",
563
+ " <td>0.074944</td>\n",
564
+ " <td>0.765056</td>\n",
565
+ " <td>0.470758</td>\n",
566
+ " <td>0.419266</td>\n",
567
+ " <td>0.312599</td>\n",
568
+ " <td>0.326719</td>\n",
569
+ " </tr>\n",
570
+ " <tr>\n",
571
+ " <th>9</th>\n",
572
+ " <td>BPIC20a</td>\n",
573
+ " <td>0.009429</td>\n",
574
+ " <td>0.439810</td>\n",
575
+ " <td>0.950095</td>\n",
576
+ " <td>0.696474</td>\n",
577
+ " <td>0.164758</td>\n",
578
+ " <td>0.085439</td>\n",
579
+ " <td>0.104389</td>\n",
580
+ " </tr>\n",
581
+ " <tr>\n",
582
+ " <th>10</th>\n",
583
+ " <td>BPIC14di_p</td>\n",
584
+ " <td>0.000041</td>\n",
585
+ " <td>0.787081</td>\n",
586
+ " <td>0.000000</td>\n",
587
+ " <td>1.000000</td>\n",
588
+ " <td>0.044018</td>\n",
589
+ " <td>0.033322</td>\n",
590
+ " <td>0.034685</td>\n",
591
+ " </tr>\n",
592
+ " <tr>\n",
593
+ " <th>11</th>\n",
594
+ " <td>BPIC17ol</td>\n",
595
+ " <td>0.000372</td>\n",
596
+ " <td>0.380626</td>\n",
597
+ " <td>0.380626</td>\n",
598
+ " <td>0.813479</td>\n",
599
+ " <td>0.105130</td>\n",
600
+ " <td>0.052672</td>\n",
601
+ " <td>0.066000</td>\n",
602
+ " </tr>\n",
603
+ " <tr>\n",
604
+ " <th>12</th>\n",
605
+ " <td>BPIC13op</td>\n",
606
+ " <td>0.131868</td>\n",
607
+ " <td>0.217338</td>\n",
608
+ " <td>0.769231</td>\n",
609
+ " <td>0.702960</td>\n",
610
+ " <td>0.276771</td>\n",
611
+ " <td>0.262094</td>\n",
612
+ " <td>0.263029</td>\n",
613
+ " </tr>\n",
614
+ " <tr>\n",
615
+ " <th>13</th>\n",
616
+ " <td>RTFMP</td>\n",
617
+ " <td>0.001536</td>\n",
618
+ " <td>0.375620</td>\n",
619
+ " <td>0.993104</td>\n",
620
+ " <td>0.769353</td>\n",
621
+ " <td>0.111932</td>\n",
622
+ " <td>0.052586</td>\n",
623
+ " <td>0.068442</td>\n",
624
+ " </tr>\n",
625
+ " <tr>\n",
626
+ " <th>14</th>\n",
627
+ " <td>BPIC20d</td>\n",
628
+ " <td>0.096236</td>\n",
629
+ " <td>0.271081</td>\n",
630
+ " <td>0.822773</td>\n",
631
+ " <td>0.723785</td>\n",
632
+ " <td>0.317044</td>\n",
633
+ " <td>0.184879</td>\n",
634
+ " <td>0.214387</td>\n",
635
+ " </tr>\n",
636
+ " <tr>\n",
637
+ " <th>15</th>\n",
638
+ " <td>BPIC12</td>\n",
639
+ " <td>0.333614</td>\n",
640
+ " <td>0.262016</td>\n",
641
+ " <td>0.686254</td>\n",
642
+ " <td>0.708280</td>\n",
643
+ " <td>0.423074</td>\n",
644
+ " <td>0.226133</td>\n",
645
+ " <td>0.275551</td>\n",
646
+ " </tr>\n",
647
+ " <tr>\n",
648
+ " <th>16</th>\n",
649
+ " <td>RWABOCSL</td>\n",
650
+ " <td>0.080893</td>\n",
651
+ " <td>0.497211</td>\n",
652
+ " <td>0.887029</td>\n",
653
+ " <td>0.689363</td>\n",
654
+ " <td>0.235532</td>\n",
655
+ " <td>0.100603</td>\n",
656
+ " <td>0.138113</td>\n",
657
+ " </tr>\n",
658
+ " <tr>\n",
659
+ " <th>17</th>\n",
660
+ " <td>BPIC20e</td>\n",
661
+ " <td>0.012925</td>\n",
662
+ " <td>0.437264</td>\n",
663
+ " <td>0.933488</td>\n",
664
+ " <td>0.703735</td>\n",
665
+ " <td>0.189048</td>\n",
666
+ " <td>0.097572</td>\n",
667
+ " <td>0.118744</td>\n",
668
+ " </tr>\n",
669
+ " <tr>\n",
670
+ " <th>18</th>\n",
671
+ " <td>BPIC16c_p</td>\n",
672
+ " <td>0.438053</td>\n",
673
+ " <td>0.101770</td>\n",
674
+ " <td>0.424779</td>\n",
675
+ " <td>0.899497</td>\n",
676
+ " <td>0.683796</td>\n",
677
+ " <td>0.404685</td>\n",
678
+ " <td>0.470116</td>\n",
679
+ " </tr>\n",
680
+ " <tr>\n",
681
+ " <th>19</th>\n",
682
+ " <td>BPIC13inc</td>\n",
683
+ " <td>0.200026</td>\n",
684
+ " <td>0.232195</td>\n",
685
+ " <td>0.794414</td>\n",
686
+ " <td>0.717846</td>\n",
687
+ " <td>0.404651</td>\n",
688
+ " <td>0.391097</td>\n",
689
+ " <td>0.391625</td>\n",
690
+ " </tr>\n",
691
+ " <tr>\n",
692
+ " <th>20</th>\n",
693
+ " <td>BPIC15f4</td>\n",
694
+ " <td>0.996201</td>\n",
695
+ " <td>0.002849</td>\n",
696
+ " <td>0.102564</td>\n",
697
+ " <td>0.652985</td>\n",
698
+ " <td>0.603866</td>\n",
699
+ " <td>0.355927</td>\n",
700
+ " <td>0.412835</td>\n",
701
+ " </tr>\n",
702
+ " <tr>\n",
703
+ " <th>21</th>\n",
704
+ " <td>BPIC17</td>\n",
705
+ " <td>0.505570</td>\n",
706
+ " <td>0.033514</td>\n",
707
+ " <td>0.531340</td>\n",
708
+ " <td>0.741706</td>\n",
709
+ " <td>0.461565</td>\n",
710
+ " <td>0.231922</td>\n",
711
+ " <td>0.290464</td>\n",
712
+ " </tr>\n",
713
+ " <tr>\n",
714
+ " <th>22</th>\n",
715
+ " <td>BPIC20c</td>\n",
716
+ " <td>0.209200</td>\n",
717
+ " <td>0.135315</td>\n",
718
+ " <td>0.757537</td>\n",
719
+ " <td>0.733653</td>\n",
720
+ " <td>0.420150</td>\n",
721
+ " <td>0.137287</td>\n",
722
+ " <td>0.215490</td>\n",
723
+ " </tr>\n",
724
+ " <tr>\n",
725
+ " <th>23</th>\n",
726
+ " <td>BPIC20b</td>\n",
727
+ " <td>0.116762</td>\n",
728
+ " <td>0.212281</td>\n",
729
+ " <td>0.811289</td>\n",
730
+ " <td>0.758268</td>\n",
731
+ " <td>0.339380</td>\n",
732
+ " <td>0.145611</td>\n",
733
+ " <td>0.193753</td>\n",
734
+ " </tr>\n",
735
+ " <tr>\n",
736
+ " <th>24</th>\n",
737
+ " <td>HD</td>\n",
738
+ " <td>0.049345</td>\n",
739
+ " <td>0.516594</td>\n",
740
+ " <td>0.906332</td>\n",
741
+ " <td>0.799120</td>\n",
742
+ " <td>0.254066</td>\n",
743
+ " <td>0.118478</td>\n",
744
+ " <td>0.154576</td>\n",
745
+ " </tr>\n",
746
+ " <tr>\n",
747
+ " <th>25</th>\n",
748
+ " <td>SEPSIS</td>\n",
749
+ " <td>0.805714</td>\n",
750
+ " <td>0.033333</td>\n",
751
+ " <td>0.274286</td>\n",
752
+ " <td>0.695759</td>\n",
753
+ " <td>0.522343</td>\n",
754
+ " <td>0.219365</td>\n",
755
+ " <td>0.299505</td>\n",
756
+ " </tr>\n",
757
+ " </tbody>\n",
758
+ "</table>\n",
759
+ "</div>"
760
+ ],
761
+ "text/plain": [
762
+ " log ratio_variants_per_number_of_traces \n",
763
+ "0 BPIC16wm_p 0.002882 \\\n",
764
+ "1 BPIC15f5 0.997405 \n",
765
+ "2 BPIC15f1 0.975813 \n",
766
+ "3 BPIC19 0.047562 \n",
767
+ "4 BPIC14dia_p 0.496847 \n",
768
+ "5 BPIC15f2 0.995192 \n",
769
+ "6 BPIC15f3 0.957417 \n",
770
+ "7 BPIC13cp 0.123067 \n",
771
+ "8 BPIC14dc_p 0.048444 \n",
772
+ "9 BPIC20a 0.009429 \n",
773
+ "10 BPIC14di_p 0.000041 \n",
774
+ "11 BPIC17ol 0.000372 \n",
775
+ "12 BPIC13op 0.131868 \n",
776
+ "13 RTFMP 0.001536 \n",
777
+ "14 BPIC20d 0.096236 \n",
778
+ "15 BPIC12 0.333614 \n",
779
+ "16 RWABOCSL 0.080893 \n",
780
+ "17 BPIC20e 0.012925 \n",
781
+ "18 BPIC16c_p 0.438053 \n",
782
+ "19 BPIC13inc 0.200026 \n",
783
+ "20 BPIC15f4 0.996201 \n",
784
+ "21 BPIC17 0.505570 \n",
785
+ "22 BPIC20c 0.209200 \n",
786
+ "23 BPIC20b 0.116762 \n",
787
+ "24 HD 0.049345 \n",
788
+ "25 SEPSIS 0.805714 \n",
789
+ "\n",
790
+ " ratio_most_common_variant ratio_top_10_variants \n",
791
+ "0 0.295803 0.714106 \\\n",
792
+ "1 0.001730 0.102076 \n",
793
+ "2 0.006672 0.121768 \n",
794
+ "3 0.199758 0.946368 \n",
795
+ "4 0.037455 0.552836 \n",
796
+ "5 0.002404 0.103365 \n",
797
+ "6 0.010646 0.137686 \n",
798
+ "7 0.331540 0.840619 \n",
799
+ "8 0.074944 0.765056 \n",
800
+ "9 0.439810 0.950095 \n",
801
+ "10 0.787081 0.000000 \n",
802
+ "11 0.380626 0.380626 \n",
803
+ "12 0.217338 0.769231 \n",
804
+ "13 0.375620 0.993104 \n",
805
+ "14 0.271081 0.822773 \n",
806
+ "15 0.262016 0.686254 \n",
807
+ "16 0.497211 0.887029 \n",
808
+ "17 0.437264 0.933488 \n",
809
+ "18 0.101770 0.424779 \n",
810
+ "19 0.232195 0.794414 \n",
811
+ "20 0.002849 0.102564 \n",
812
+ "21 0.033514 0.531340 \n",
813
+ "22 0.135315 0.757537 \n",
814
+ "23 0.212281 0.811289 \n",
815
+ "24 0.516594 0.906332 \n",
816
+ "25 0.033333 0.274286 \n",
817
+ "\n",
818
+ " epa_normalized_variant_entropy epa_normalized_sequence_entropy \n",
819
+ "0 0.000000 0.000000 \\\n",
820
+ "1 0.648702 0.603260 \n",
821
+ "2 0.652855 0.610294 \n",
822
+ "3 0.645530 0.328029 \n",
823
+ "4 0.774743 0.608350 \n",
824
+ "5 0.627973 0.602371 \n",
825
+ "6 0.661781 0.605676 \n",
826
+ "7 0.705383 0.310940 \n",
827
+ "8 0.470758 0.419266 \n",
828
+ "9 0.696474 0.164758 \n",
829
+ "10 1.000000 0.044018 \n",
830
+ "11 0.813479 0.105130 \n",
831
+ "12 0.702960 0.276771 \n",
832
+ "13 0.769353 0.111932 \n",
833
+ "14 0.723785 0.317044 \n",
834
+ "15 0.708280 0.423074 \n",
835
+ "16 0.689363 0.235532 \n",
836
+ "17 0.703735 0.189048 \n",
837
+ "18 0.899497 0.683796 \n",
838
+ "19 0.717846 0.404651 \n",
839
+ "20 0.652985 0.603866 \n",
840
+ "21 0.741706 0.461565 \n",
841
+ "22 0.733653 0.420150 \n",
842
+ "23 0.758268 0.339380 \n",
843
+ "24 0.799120 0.254066 \n",
844
+ "25 0.695759 0.522343 \n",
845
+ "\n",
846
+ " epa_normalized_sequence_entropy_linear_forgetting \n",
847
+ "0 0.000000 \\\n",
848
+ "1 0.342410 \n",
849
+ "2 0.270241 \n",
850
+ "3 0.320185 \n",
851
+ "4 0.305614 \n",
852
+ "5 0.317217 \n",
853
+ "6 0.341521 \n",
854
+ "7 0.286515 \n",
855
+ "8 0.312599 \n",
856
+ "9 0.085439 \n",
857
+ "10 0.033322 \n",
858
+ "11 0.052672 \n",
859
+ "12 0.262094 \n",
860
+ "13 0.052586 \n",
861
+ "14 0.184879 \n",
862
+ "15 0.226133 \n",
863
+ "16 0.100603 \n",
864
+ "17 0.097572 \n",
865
+ "18 0.404685 \n",
866
+ "19 0.391097 \n",
867
+ "20 0.355927 \n",
868
+ "21 0.231922 \n",
869
+ "22 0.137287 \n",
870
+ "23 0.145611 \n",
871
+ "24 0.118478 \n",
872
+ "25 0.219365 \n",
873
+ "\n",
874
+ " epa_normalized_sequence_entropy_exponential_forgetting \n",
875
+ "0 0.000000 \n",
876
+ "1 0.404580 \n",
877
+ "2 0.363928 \n",
878
+ "3 0.320282 \n",
879
+ "4 0.377416 \n",
880
+ "5 0.390473 \n",
881
+ "6 0.404934 \n",
882
+ "7 0.288383 \n",
883
+ "8 0.326719 \n",
884
+ "9 0.104389 \n",
885
+ "10 0.034685 \n",
886
+ "11 0.066000 \n",
887
+ "12 0.263029 \n",
888
+ "13 0.068442 \n",
889
+ "14 0.214387 \n",
890
+ "15 0.275551 \n",
891
+ "16 0.138113 \n",
892
+ "17 0.118744 \n",
893
+ "18 0.470116 \n",
894
+ "19 0.391625 \n",
895
+ "20 0.412835 \n",
896
+ "21 0.290464 \n",
897
+ "22 0.215490 \n",
898
+ "23 0.193753 \n",
899
+ "24 0.154576 \n",
900
+ "25 0.299505 "
901
+ ]
902
+ },
903
+ "execution_count": 8,
904
+ "metadata": {},
905
+ "output_type": "execute_result"
906
+ }
907
+ ],
908
+ "source": [
909
+ "bpic_stats = bpic_features.describe().transpose()\n",
910
+ "normalized_feature_names = bpic_stats[(bpic_stats['min']>=0)&(bpic_stats['max']<=1)].index.to_list() \n",
911
+ "normalized_feature_names = ['ratio_variants_per_number_of_traces', 'ratio_most_common_variant', \n",
912
+ " 'ratio_top_10_variants', 'epa_normalized_variant_entropy', 'epa_normalized_sequence_entropy', \n",
913
+ " 'epa_normalized_sequence_entropy_linear_forgetting', 'epa_normalized_sequence_entropy_exponential_forgetting']\n",
914
+ "print(normalized_feature_names)\n",
915
+ "bpic_features[['log']+normalized_feature_names]"
916
+ ]
917
+ },
918
+ {
919
+ "cell_type": "code",
920
+ "execution_count": 9,
921
+ "id": "44909860",
922
+ "metadata": {},
923
+ "outputs": [
924
+ {
925
+ "name": "stdout",
926
+ "output_type": "stream",
927
+ "text": [
928
+ "21\n",
929
+ "Saved experiment config in ../config_files/algorithm/BaselineED_feat/generator_2_enself_rt10v.json\n",
930
+ "Saved experiment config in ../config_files/algorithm/BaselineED_feat/generator_2_enseef_enve.json\n",
931
+ "Saved experiment config in ../config_files/algorithm/BaselineED_feat/generator_2_enve_rvpnot.json\n",
932
+ "Saved experiment config in ../config_files/algorithm/BaselineED_feat/generator_2_enself_rmcv.json\n",
933
+ "Saved experiment config in ../config_files/algorithm/BaselineED_feat/generator_2_ense_rvpnot.json\n",
934
+ "Saved experiment config in ../config_files/algorithm/BaselineED_feat/generator_2_enseef_rt10v.json\n",
935
+ "Saved experiment config in ../config_files/algorithm/BaselineED_feat/generator_2_enseef_enself.json\n",
936
+ "Saved experiment config in ../config_files/algorithm/BaselineED_feat/generator_2_ense_enve.json\n",
937
+ "Saved experiment config in ../config_files/algorithm/BaselineED_feat/generator_2_enseef_rmcv.json\n",
938
+ "Saved experiment config in ../config_files/algorithm/BaselineED_feat/generator_2_rt10v_rvpnot.json\n",
939
+ "Saved experiment config in ../config_files/algorithm/BaselineED_feat/generator_2_ense_rt10v.json\n",
940
+ "Saved experiment config in ../config_files/algorithm/BaselineED_feat/generator_2_enve_rt10v.json\n",
941
+ "Saved experiment config in ../config_files/algorithm/BaselineED_feat/generator_2_ense_enself.json\n",
942
+ "Saved experiment config in ../config_files/algorithm/BaselineED_feat/generator_2_rmcv_rvpnot.json\n",
943
+ "Saved experiment config in ../config_files/algorithm/BaselineED_feat/generator_2_enve_rmcv.json\n",
944
+ "Saved experiment config in ../config_files/algorithm/BaselineED_feat/generator_2_ense_rmcv.json\n",
945
+ "Saved experiment config in ../config_files/algorithm/BaselineED_feat/generator_2_enself_rvpnot.json\n",
946
+ "Saved experiment config in ../config_files/algorithm/BaselineED_feat/generator_2_ense_enseef.json\n",
947
+ "Saved experiment config in ../config_files/algorithm/BaselineED_feat/generator_2_enself_enve.json\n",
948
+ "Saved experiment config in ../config_files/algorithm/BaselineED_feat/generator_2_enseef_rvpnot.json\n",
949
+ "Saved experiment config in ../config_files/algorithm/BaselineED_feat/generator_2_rmcv_rt10v.json\n",
950
+ "None\n"
951
+ ]
952
+ }
953
+ ],
954
+ "source": [
955
+ "#Features between 0 and 1: \n",
956
+ "def write_generator_bpic_experiment(objectives, n_para_obj=2):\n",
957
+ " parameters_o = \"objectives, \"\n",
958
+ " experiments = eval(f\"[exp for exp in list(itertools.product({(parameters_o*n_para_obj)[:-2]})) if exp[0]!=exp[1]]\")\n",
959
+ " experiments = list(set([tuple(sorted(exp)) for exp in experiments]))\n",
960
+ " for exp in experiments:\n",
961
+ " experiment_path = os.path.join('..','data', 'BaselineED_feat')\n",
962
+ " os.makedirs(experiment_path, exist_ok=True)\n",
963
+ " experiment_path = os.path.join(experiment_path, f\"{len(exp)}_{abbrev_obj_keys(exp)}.csv\") \n",
964
+ "\n",
965
+ "\n",
966
+ " first_dir = os.path.split(experiment_path[3:])[-1].replace(\".csv\",\"\")\n",
967
+ " second_dir = first_dir.replace(\"grid_\",\"\").replace(\"objectives\",\"\")\n",
968
+ "\n",
969
+ " experiment = [\n",
970
+ " {\n",
971
+ " 'pipeline_step': 'event_logs_generation',\n",
972
+ " 'output_path':'output/generated',\n",
973
+ " 'generator_params': {\n",
974
+ " \"experiment\": {\"input_path\": \"data/BaselineED_feat.csv\",\n",
975
+ " \"objectives\": exp},\n",
976
+ " 'config_space': {\n",
977
+ " 'mode': [5, 20],\n",
978
+ " 'sequence': [0.01, 1],\n",
979
+ " 'choice': [0.01, 1],\n",
980
+ " 'parallel': [0.01, 1],\n",
981
+ " 'loop': [0.01, 1],\n",
982
+ " 'silent': [0.01, 1],\n",
983
+ " 'lt_dependency': [0.01, 1],\n",
984
+ " 'num_traces': [10, 10001],\n",
985
+ " 'duplicate': [0],\n",
986
+ " 'or': [0]\n",
987
+ " },\n",
988
+ " 'n_trials': 200\n",
989
+ " }\n",
990
+ " },\n",
991
+ " {\n",
992
+ " 'pipeline_step': 'feature_extraction',\n",
994
+ " 'input_path': os.path.join('output', 'generated', 'BaselineED_feat', first_dir),\n",
996
+ " 'feature_params': {\"feature_set\":[\"ratio_variants_per_number_of_traces\",\"ratio_most_common_variant\",\"ratio_top_10_variants\",\"epa_normalized_variant_entropy\",\"epa_normalized_sequence_entropy\",\"epa_normalized_sequence_entropy_linear_forgetting\",\"epa_normalized_sequence_entropy_exponential_forgetting\"]},\n",
997
+ " 'output_path': 'output/plots',\n",
998
+ " 'real_eventlog_path': 'data/BaselineED_feat.csv',\n",
999
+ " 'plot_type': 'boxplot'\n",
1000
+ " },\n",
1001
+ " {\n",
1002
+ " \"pipeline_step\": \"benchmark_test\",\n",
1003
+ " \"benchmark_test\": \"discovery\",\n",
1004
+ " \"input_path\": os.path.join('output', 'generated', 'BaselineED_feat', first_dir),\n",
1005
+ " \"output_path\":\"output\",\n",
1006
+ " \"miners\" : [\"heu\", \"imf\", \"ilp\"]\n",
1007
+ " }\n",
1008
+ " ]\n",
1009
+ "\n",
1010
+ " output_path = os.path.join('..', 'config_files','algorithm','BaselineED_feat')\n",
1011
+ " os.makedirs(output_path, exist_ok=True)\n",
1012
+ " output_path = os.path.join(output_path, f'generator_{os.path.split(experiment_path)[-1].split(\".\")[0]}.json') \n",
1013
+ "\n",
1014
+ " with open(output_path, 'w') as f:\n",
1015
+ " json.dump(experiment, f, ensure_ascii=False)\n",
1016
+ " print(f\"Saved experiment config in {output_path}\")\n",
1017
+ " return experiment\n",
1018
+ "\n",
1019
+ "\n",
1020
+ "def create_objectives_grid(objectives, n_para_obj=2):\n",
1021
+ " parameters_o = \"objectives, \"\n",
1022
+ " experiments = eval(f\"[exp for exp in list(itertools.product({(parameters_o*n_para_obj)[:-2]})) if exp[0]!=exp[1]]\")\n",
1023
+ " experiments = list(set([tuple(sorted(exp)) for exp in experiments]))\n",
1024
+ " print(len(experiments))\n",
1025
+ " \n",
1026
+ " for exp in experiments:\n",
1027
+ " write_generator_bpic_experiment(objectives=exp)\n",
1028
+ " \n",
1029
+ "exp_test = create_objectives_grid(normalized_feature_names, n_para_obj=2) \n",
1030
+ "print(exp_test)"
1031
+ ]
1032
+ },
1033
+ {
1034
+ "cell_type": "markdown",
1035
+ "id": "b07e9753",
1036
+ "metadata": {},
1037
+ "source": [
1038
+ "## Single objective from real logs\n",
1039
+ "(Feature selection)"
1040
+ ]
1041
+ },
1042
+ {
1043
+ "cell_type": "code",
1044
+ "execution_count": 10,
1045
+ "id": "d759a677",
1046
+ "metadata": {},
1047
+ "outputs": [
1048
+ {
1049
+ "name": "stdout",
1050
+ "output_type": "stream",
1051
+ "text": [
1052
+ "7 experiments: [('epa_normalized_sequence_entropy_exponential_forgetting',), ('ratio_variants_per_number_of_traces',), ('ratio_most_common_variant',), ('epa_normalized_sequence_entropy',), ('ratio_top_10_variants',), ('epa_normalized_sequence_entropy_linear_forgetting',), ('epa_normalized_variant_entropy',)]\n",
1053
+ "11\n",
1054
+ "Saved experiment in ../data/grid_experiments/grid_1objectives_enseef.csv\n",
1055
+ "Saved experiment config in ../config_files/algorithm/grid_experiments/generator_grid_1objectives_enseef.json\n",
1056
+ "Saved experiment in ../data/grid_experiments/grid_1objectives_rvpnot.csv\n",
1057
+ "Saved experiment config in ../config_files/algorithm/grid_experiments/generator_grid_1objectives_rvpnot.json\n",
1058
+ "Saved experiment in ../data/grid_experiments/grid_1objectives_rmcv.csv\n",
1059
+ "Saved experiment config in ../config_files/algorithm/grid_experiments/generator_grid_1objectives_rmcv.json\n",
1060
+ "Saved experiment in ../data/grid_experiments/grid_1objectives_ense.csv\n",
1061
+ "Saved experiment config in ../config_files/algorithm/grid_experiments/generator_grid_1objectives_ense.json\n",
1062
+ "Saved experiment in ../data/grid_experiments/grid_1objectives_rt10v.csv\n",
1063
+ "Saved experiment config in ../config_files/algorithm/grid_experiments/generator_grid_1objectives_rt10v.json\n",
1064
+ "Saved experiment in ../data/grid_experiments/grid_1objectives_enself.csv\n",
1065
+ "Saved experiment config in ../config_files/algorithm/grid_experiments/generator_grid_1objectives_enself.json\n",
1066
+ "Saved experiment in ../data/grid_experiments/grid_1objectives_enve.csv\n",
1067
+ "Saved experiment config in ../config_files/algorithm/grid_experiments/generator_grid_1objectives_enve.json\n",
1068
+ "None\n"
1069
+ ]
1070
+ }
1071
+ ],
1072
+ "source": [
1073
+ "def write_single_objective_experiment(experiment_path, objectives=[\"ratio_top_20_variants\", \"epa_normalized_sequence_entropy_linear_forgetting\"]):\n",
1074
+ " first_dir = os.path.split(experiment_path[3:])[-1].replace(\".csv\",\"\")\n",
1075
+ " second_dir = first_dir.replace(\"grid_\",\"\").replace(\"objectives\",\"\")\n",
1076
+ "\n",
1077
+ " experiment = [\n",
1078
+ " {\n",
1079
+ " 'pipeline_step': 'event_logs_generation',\n",
1080
+ " 'output_path':os.path.join('output','generated', 'grid_1obj'),\n",
1081
+ " 'generator_params': {\n",
1082
+ " \"experiment\": {\"input_path\": experiment_path[3:],\n",
1083
+ " \"objectives\": objectives},\n",
1084
+ " 'config_space': {\n",
1085
+ " 'mode': [5, 20],\n",
1086
+ " 'sequence': [0.01, 1],\n",
1087
+ " 'choice': [0.01, 1],\n",
1088
+ " 'parallel': [0.01, 1],\n",
1089
+ " 'loop': [0.01, 1],\n",
1090
+ " 'silent': [0.01, 1],\n",
1091
+ " 'lt_dependency': [0.01, 1],\n",
1092
+ " 'num_traces': [10, 10001],\n",
1093
+ " 'duplicate': [0],\n",
1094
+ " 'or': [0]\n",
1095
+ " },\n",
1096
+ " 'n_trials': 200\n",
1097
+ " }\n",
1098
+ " },\n",
1099
+ " {\n",
1100
+ " 'pipeline_step': 'feature_extraction',\n",
1101
+ " 'input_path': os.path.join('output','features', 'generated', 'grid_1obj', first_dir, second_dir),\n",
1103
+ " 'feature_params': {\"feature_set\":[\"ratio_variants_per_number_of_traces\",\"ratio_most_common_variant\",\"ratio_top_10_variants\",\"epa_normalized_variant_entropy\",\"epa_normalized_sequence_entropy\",\"epa_normalized_sequence_entropy_linear_forgetting\",\"epa_normalized_sequence_entropy_exponential_forgetting\"]},\n",
+   "            'output_path': 'output/plots',\n",
+   "            'real_eventlog_path': 'data/BaselineED_feat.csv',\n",
+   "            'plot_type': 'boxplot'\n",
+   "        },\n",
+   "        {\n",
+   "            \"pipeline_step\": \"benchmark_test\",\n",
+   "            \"benchmark_test\": \"discovery\",\n",
+   "            \"input_path\": os.path.join('output', 'generated', 'grid_1obj', first_dir, second_dir),\n",
+   "            \"output_path\":\"output\",\n",
+   "            \"miners\" : [\"heu\", \"imf\", \"ilp\"]\n",
+   "        }\n",
+   "    ]\n",
+   "\n",
+   "    #print(\"EXPERIMENT:\", experiment)\n",
+   "    output_path = os.path.join('..', 'config_files','algorithm','grid_experiments')\n",
+   "    os.makedirs(output_path, exist_ok=True)\n",
+   "    output_path = os.path.join(output_path, f'generator_{os.path.split(experiment_path)[-1].split(\".\")[0]}.json') \n",
+   "    with open(output_path, 'w') as f:\n",
+   "        json.dump(experiment, f, ensure_ascii=False)\n",
+   "    print(f\"Saved experiment config in {output_path}\")\n",
+   "    \n",
+   "    return experiment\n",
+   "\n",
+   "def create_objectives_grid(objectives, n_para_obj=2):\n",
+   "    parameters_o = \"objectives, \"\n",
+   "    if n_para_obj==1:\n",
+   "        experiments = [[exp] for exp in objectives]\n",
+   "    else:\n",
+   "        experiments = eval(f\"[exp for exp in list(itertools.product({(parameters_o*n_para_obj)[:-2]})) if exp[0]!=exp[1]]\")\n",
+   "        experiments = list(set([tuple(sorted(exp)) for exp in experiments]))\n",
+   "    print(len(experiments), \"experiments: \", experiments)\n",
+   "    \n",
+   "    parameters = \"np.around(np.arange(0, 1.1,0.1),2), \"\n",
+   "    tasks = eval(f\"list(itertools.product({(parameters*n_para_obj)[:-2]}))\")\n",
+   "    tasks = [(f'task_{i+1}',)+task for i, task in enumerate(tasks)]\n",
+   "    print(len(tasks))\n",
+   "    for exp in experiments:\n",
+   "        df = pd.DataFrame(data=tasks, columns=[\"task\", *exp])\n",
+   "        experiment_path = os.path.join('..','data', 'grid_experiments')\n",
+   "        os.makedirs(experiment_path, exist_ok=True)\n",
+   "        experiment_path = os.path.join(experiment_path, f\"grid_{len(df.columns)-1}objectives_{abbrev_obj_keys(exp)}.csv\") \n",
+   "        df.to_csv(experiment_path, index=False)\n",
+   "        print(f\"Saved experiment in {experiment_path}\")\n",
+   "        write_single_objective_experiment(experiment_path, objectives=exp)\n",
+   "        #df.to_csv(f\"../data/grid_{}objectives_{abbrev_obj_keys(objectives.tolist())}.csv\" ,index=False)\n",
+   "    \n",
+   "exp_test = create_objectives_grid(normalized_feature_names, n_para_obj=1) \n",
+   "print(exp_test)"
+  ]
+ },
+ {
+  "cell_type": "code",
+  "execution_count": null,
+  "id": "f9886f44",
+  "metadata": {},
+  "outputs": [],
+  "source": []
+ }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.9.12"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
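The grid-construction cell above builds objective combinations and task grids by `eval`-ing string-multiplied argument lists, and derives output directory names by slicing and rewriting the experiment path. A minimal sketch of the same logic without `eval`, using `itertools.product(..., repeat=...)`; the function names `objective_combinations` and `value_grid` are illustrative only and not part of the repository:

```python
import itertools
import os

import numpy as np

def objective_combinations(objectives, n_para_obj=2):
    """Unordered combinations of distinct objectives, mirroring the eval-based
    itertools.product expression in create_objectives_grid."""
    if n_para_obj == 1:
        return [[obj] for obj in objectives]
    combos = itertools.product(objectives, repeat=n_para_obj)
    # Drop same-objective pairs, deduplicate order-insensitively.
    return sorted(set(tuple(sorted(c)) for c in combos if c[0] != c[1]))

def value_grid(n_para_obj=2):
    """Cartesian grid of target values 0.0, 0.1, ..., 1.0 per objective."""
    axis = np.around(np.arange(0, 1.1, 0.1), 2)
    return list(itertools.product(axis, repeat=n_para_obj))

objs = ["ratio_top_20_variants",
        "epa_normalized_sequence_entropy_linear_forgetting"]
print(len(objective_combinations(objs)))  # 1 unordered pair
print(len(value_grid(2)))                 # 11 x 11 = 121 grid points

# Directory-name derivation as in write_single_objective_experiment:
# experiment_path[3:] strips the leading "../", then the prefix/suffix rewrites
# turn "grid_1objectives_enve" into "1_enve".
experiment_path = "../data/grid_experiments/grid_1objectives_enve.csv"
first_dir = os.path.split(experiment_path[3:])[-1].replace(".csv", "")
second_dir = first_dir.replace("grid_", "").replace("objectives", "")
print(first_dir, second_dir)  # grid_1objectives_enve 1_enve
```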
notebooks/gedi_fig6_benchmark_boxplots.ipynb CHANGED
The diff for this file is too large to render. See raw diff
 
notebooks/gedi_figs4and5_representativeness.ipynb CHANGED
The diff for this file is too large to render. See raw diff
 
notebooks/gedi_figs7and8_benchmarking_statisticalTests.ipynb CHANGED
@@ -1,5 +1,21 @@
 {
  "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "32241302-7f73-4756-b8a5-27f752de0dea",
+   "metadata": {},
+   "source": [
+    "# Plot - Statistical Tests"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "51cee5d6-2d4c-4bdd-bdbf-4b3a3b76e6d6",
+   "metadata": {},
+   "source": [
+    "#### Load Data"
+   ]
+  },
  {
   "cell_type": "code",
   "execution_count": 8,
@@ -64,6 +80,14 @@
    "    return data"
   ]
  },
+  {
+   "cell_type": "markdown",
+   "id": "f0d6e731-5f46-4747-82f8-a2f308d150ee",
+   "metadata": {},
+   "source": [
+    "#### Data Preprocessing"
+   ]
+  },
  {
   "cell_type": "code",
   "execution_count": 11,
@@ -110,7 +134,7 @@
   "id": "07370d54",
   "metadata": {},
   "source": [
-   "## Statistical test: Is there a statistical significant relation between feature similarity and performance metrics?"
+   "#### Statistical test: Is there a statistical significant relation between feature similarity and performance metrics?"
   ]
  },
  {
@@ -192,6 +216,14 @@
   "#df_tmp = statistical_test(DATA_SOURCE+\"_feat\", \"Gen\"+DATA_SOURCE+\"_bench\", TEST, IMPUTE)"
  ]
 },
+ {
+  "cell_type": "markdown",
+  "id": "5e6ecc81-c14d-4859-ab04-49bbf458f7eb",
+  "metadata": {},
+  "source": [
+   "#### Plot - statistical Test of features vs metrics"
+  ]
+ },
 {
  "cell_type": "code",
  "execution_count": 62,
@@ -466,37 +498,13 @@
   "    plot_stat_test(masked_results, data_source+\"_feat\", data_source+\"_bench\", test, IMPUTE, cbar=cbar, ylabels=ylabels)\n",
   "    plt.clf()"
  ]
- },
- {
-  "cell_type": "code",
-  "execution_count": null,
-  "id": "52c58c64",
-  "metadata": {},
-  "outputs": [],
-  "source": []
- },
- {
-  "cell_type": "code",
-  "execution_count": null,
-  "id": "3717a694",
-  "metadata": {},
-  "outputs": [],
-  "source": []
- },
- {
-  "cell_type": "code",
-  "execution_count": null,
-  "id": "c6afe4d9",
-  "metadata": {},
-  "outputs": [],
-  "source": []
 }
 ],
 "metadata": {
  "kernelspec": {
-  "display_name": "tag",
+  "display_name": "Python 3 (ipykernel)",
   "language": "python",
-  "name": "tag"
+  "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
@@ -508,7 +516,7 @@
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
-  "version": "3.9.16"
+  "version": "3.9.19"
  }
 },
 "nbformat": 4,
setup.py CHANGED
@@ -88,4 +88,3 @@ setup(
         'Programming Language :: Python :: 3.9',
     ],
 )
-