Spaces: Running

Andrea Maldonado committed · 353129b
Parent(s): 9013f63
Release cr

Browse files:
- README.md +257 -63
- config_files/test/test_abbrv_generation.json +16 -0
- data/test/igedi_table_1.csv +4 -0
- data/validation/2_ense_rmcv_feat.csv +4 -0
- gedi/__init__.py +6 -2
- gedi/features.py +3 -1
- gedi/generator.py +3 -0
- gedi/run.py +0 -53
- execute_grid_experiments.py → gedi/utils/execute_grid_experiments.py +1 -1
- main.py +44 -2
- setup.py +5 -50
- utils/column_mappings.py +16 -0
- utils/config_fabric.py +5 -0
README.md
CHANGED
@@ -17,18 +17,12 @@ license: mit

 **i**nteractive **G**enerating **E**vent **D**ata with **I**ntentional Features for Benchmarking Process Mining<br />
 This repository contains the codebase for the interactive web application tool (iGEDI) as well as for the [GEDI paper](https://mcml.ai/publications/gedi.pdf) accepted at the BPM'24 conference.
-Our documentation also includes both frameworks. From [General Usage](#general-usage) and beyond, documentation refers especially to reproducibility of the [GEDI paper](https://mcml.ai/publications/gedi.pdf).
-
-A video tutorial on how to use this tool can be found [here](https://youtu.be/9iQhaYwyQ9E).
-

 ## Table of Contents

 - [Interactive Web Application (iGEDI)](#interactive-web-application)
 - [Installation](#installation)
-  - [as PyPi Package](#install-as-pypi-package)
-  - [of iGEDI](#install-igedi)
-  - [as local repository](#install-as-local-repository)
 - [General Usage](#general-usage)
 - [Experiments](#experiments)
 - [Citation](#citation)
@@ -37,8 +31,7 @@ A video tutorial on how to use this tool can be found [here](https://youtu.be/9i

 Our [interactive web application](https://huggingface.co/spaces/andreamalhera/gedi) (iGEDI) guides you through the specification process and runs GEDI for you. You can directly download the resulting generated logs or the configuration file to run GEDI locally.
 

-##
-### Requirements
 - [Miniconda](https://docs.conda.io/en/latest/miniconda.html)
 - Graphviz on your OS e.g.
 For MacOS:
@@ -50,30 +43,13 @@ brew install swig

 ```console
 conda install pyrfr swig
 ```
-
-
-```shell
-pip install gedi
-```
-and run:
-```shell
-python -c "from gedi import gedi; gedi('config_files/pipeline_steps/generation.json')"
-```
-### Install iGEDI
-Our [interactive GEDI (iGEDI)](https://huggingface.co/spaces/andreamalhera/gedi) can be employed to create all necessary [configuration files](config_files) to reproduce our experiments.
-Users can directly use our [web application service](https://huggingface.co/spaces/andreamalhera/gedi) or locally start the following dashboard:
-```
-streamlit run utils/config_fabric.py # To tunnel to local machine add: --server.port 8501 --server.headless true
-
-# In local machine (only in case you are tunneling):
-ssh -N -f -L 9000:localhost:8501 <user@remote_machine.com>
-open "http://localhost:9000/"
-```

-###
 ```console
-conda
-
 ```
 The last step should take only a few minutes to run.

@@ -85,8 +61,9 @@ Our pipeline offers several pipeline steps, which can be run sequentially or par

 - [Evaluation Plotter](https://github.com/lmu-dbs/gedi/blob/16-documentation-update-readme/README.md#evaluation-plotting)

 To run different steps of the GEDI pipeline, please adapt the `.json` accordingly.
-```
-
 ```
 For reference of possible keys and values for each step, please see `config_files/test/experiment_test.json`.
 To run the whole pipeline please create a new `.json` file, specifying all steps you want to run and specify desired keys and values for each step.
@@ -95,8 +72,9 @@ To reproduce results from our paper, please refer to [Experiments](#experiments)

 ### Feature Extraction
 ---
 To extract the features on the event-log level and use them for hyperparameter optimization, we employ the following script:
-```
-
 ```
 The JSON file consists of the following key-value pairs:

@@ -116,8 +94,9 @@ After having extracted meta features from the files, the next step is to generat

 The command to execute the generation step is given by an exemplary generation.json file:

-```
-
 ```

 In the `generation.json`, we have the following key-value pairs:
@@ -144,11 +123,228 @@ In the `generation.json`, we have the following key-value pairs:

 - plot_reference_feature: defines the feature which is used on the x-axis of the output plots, i.e., each feature defined in the 'objectives' of the 'experiment' is plotted against the reference feature defined in this value
 ### Benchmark
 The benchmarking defines the downstream task which is used for evaluating the goodness of the synthesized event log datasets with the metrics of real-world datasets. The command to execute a benchmarking is shown in the following script:

-```
-
 ```

 In the `benchmark.json`, we have the following key-value pairs:
@@ -164,8 +360,9 @@

 The evaluation plotting step is used just for visualization. Some examples of how the plotter can be used are shown in the following script:

-```
-
 ```

 Generally, in the `evaluation_plotter.json`, we have the following key-value pairs:
@@ -183,8 +380,9 @@ We present two settings for generating intentional event logs, using [real targe

 ### Generating data with real targets
 To execute the experiments with real targets, we employ the [experiment_real_targets.json](config_files/experiment_real_targets.json). The script's pipeline will output the [generated event logs (GenBaselineED)](data/event_logs/GenBaselineED), which optimize their feature values towards [real-world event data features](data/BaselineED_feat.csv), alongside their respective measured [feature values](data/GenBaselineED_feat.csv) and [benchmark metrics values](data/GenBaselineED_bench.csv).

-```
-
 ```

 ### Generating data with grid targets
@@ -195,10 +393,15 @@ python execute_grid_experiments.py config_files/grid_2obj

 ```
 We employ the [experiment_grid_2obj_configfiles_fabric.ipynb](notebooks/experiment_grid_2obj_configfiles_fabric.ipynb) to create all necessary [configuration](config_files/grid_2obj) and [objective](data/grid_2obj) files for this experiment.
 For more details about these config_files, please refer to [Feature Extraction](#feature-extraction), [Generation](#generation), and [Benchmark](#benchmark).
-To create configuration files for grid objectives interactively, you can use

 ### Visualizations
-Visualizations correspond to the [GEDI paper](https://mcml.ai/publications/gedi.pdf).
 To run the visualizations, we employ [jupyter notebooks](https://jupyter.org/install) and [add the installed environment to the jupyter notebook](https://medium.com/@nrk25693/how-to-add-your-conda-environment-to-your-jupyter-notebook-in-just-4-steps-abeab8b8d084). We then start all visualizations by running e.g.: `jupyter notebook`. In the following, we describe the `.ipynb`-files in the folder `\notebooks` to reproduce the figures from our paper.

 #### [Fig. 4 and fig. 5 Representativeness](notebooks/gedi_figs4and5_representativeness.ipynb)
@@ -218,23 +421,14 @@ Likewise to the evaluation on the statistical tests in notebook `gedi_figs7and8_

 The `GEDI` framework is taken directly from the original paper by [Maldonado](mailto:[email protected]), Frey, Tavares, Rehwald and Seidl and is *to appear at BPM'24*.

 ```bibtex
-@
-author=
-and Rosemann, Michael",
-title="GEDI: Generating Event Data with Intentional Features for Benchmarking Process Mining",
-booktitle="Business Process Management",
-year="2024",
-publisher="Springer Nature Switzerland",
-address="Cham",
-pages="221--237",
-abstract="Process mining solutions include enhancing performance, conserving resources, and alleviating bottlenecks in organizational contexts. However, as in other data mining fields, success hinges on data quality and availability. Existing analyses for process mining solutions lack diverse and ample data for rigorous testing, hindering insights' generalization. To address this, we propose Generating Event Data with Intentional features, a framework producing event data sets satisfying specific meta-features. Considering the meta-feature space that defines feasible event logs, we observe that existing real-world datasets describe only local areas within the overall space. Hence, our framework aims at providing the capability to generate an event data benchmark, which covers unexplored regions. Therefore, our approach leverages a discretization of the meta-feature space to steer generated data towards regions, where a combination of meta-features is not met yet by existing benchmark datasets. Providing a comprehensive data pool enriches process mining analyses, enables methods to capture a wider range of real-world scenarios, and improves evaluation quality. Moreover, it empowers analysts to uncover correlations between meta-features and evaluation metrics, enhancing explainability and solution effectiveness. Experiments demonstrate GEDI's ability to produce a benchmark of intentional event data sets and robust analyses for process mining tasks.",
-isbn="978-3-031-70396-6"
 }
 ```
 **i**nteractive **G**enerating **E**vent **D**ata with **I**ntentional Features for Benchmarking Process Mining<br />
 This repository contains the codebase for the interactive web application tool (iGEDI) as well as for the [GEDI paper](https://mcml.ai/publications/gedi.pdf) accepted at the BPM'24 conference.

 ## Table of Contents

 - [Interactive Web Application (iGEDI)](#interactive-web-application)
+- [Requirements](#requirements)
 - [Installation](#installation)
 - [General Usage](#general-usage)
 - [Experiments](#experiments)
 - [Citation](#citation)

 Our [interactive web application](https://huggingface.co/spaces/andreamalhera/gedi) (iGEDI) guides you through the specification process and runs GEDI for you. You can directly download the resulting generated logs or the configuration file to run GEDI locally.
 

+## Requirements
 - [Miniconda](https://docs.conda.io/en/latest/miniconda.html)
 - Graphviz on your OS e.g.
 For MacOS:

 ```console
 conda install pyrfr swig
 ```
+## Installation
+- `conda env create -f .conda.yml`

+### Startup
 ```console
+conda activate gedi
+python main.py -a config_files/test/experiment_test.json
 ```
 The last step should take only a few minutes to run.

 - [Evaluation Plotter](https://github.com/lmu-dbs/gedi/blob/16-documentation-update-readme/README.md#evaluation-plotting)

 To run different steps of the GEDI pipeline, please adapt the `.json` accordingly.
+```console
+conda activate gedi
+python main.py -a config_files/pipeline_steps/<pipeline-step>.json
 ```
 For reference of possible keys and values for each step, please see `config_files/test/experiment_test.json`.
 To run the whole pipeline please create a new `.json` file, specifying all steps you want to run and specify desired keys and values for each step.

 ### Feature Extraction
 ---
 To extract the features on the event-log level and use them for hyperparameter optimization, we employ the following script:
+```console
+conda activate gedi
+python main.py -a config_files/pipeline_steps/feature_extraction.json
 ```
 The JSON file consists of the following key-value pairs:

 The command to execute the generation step is given by an exemplary generation.json file:

+```console
+conda activate gedi
+python main.py -a config_files/pipeline_steps/generation.json
 ```

 In the `generation.json`, we have the following key-value pairs:

 - plot_reference_feature: defines the feature which is used on the x-axis of the output plots, i.e., each feature defined in the 'objectives' of the 'experiment' is plotted against the reference feature defined in this value

+In case of manually defining the targets for the features in config space, the following table shows the range of the features in the real-world event log data (BPIC's) for reference:
| Feature | Range [ min, max ] |
|---|---|
| n_traces | [ 226.0, 251734.0 ] |
| n_unique_traces | [ 6.0, 28457.0 ] |
| ratio_variants_per_number_of_traces | [ 0.0, 1.0 ] |
| trace_len_min | [ 1.0, 24.0 ] |
| trace_len_max | [ 1.0, 2973.0 ] |
| trace_len_mean | [ 1.0, 131.49 ] |
| trace_len_median | [ 1.0, 55.0 ] |
| trace_len_mode | [ 1.0, 61.0 ] |
| trace_len_std | [ 0.0, 202.53 ] |
| trace_len_variance | [ 0.0, 41017.89 ] |
| trace_len_q1 | [ 1.0, 44.0 ] |
| trace_len_q3 | [ 1.0, 169.0 ] |
| trace_len_iqr | [ 0.0, 161.0 ] |
| trace_len_geometric_mean | [ 1.0, 53.78 ] |
| trace_len_geometric_std | [ 1.0, 5.65 ] |
| trace_len_harmonic_mean | [ 1.0, 51.65 ] |
| trace_len_skewness | [ -0.58, 111.97 ] |
| trace_len_kurtosis | [ -0.97, 14006.75 ] |
| trace_len_coefficient_variation | [ 0.0, 4.74 ] |
| trace_len_entropy | [ 5.33, 12.04 ] |
| trace_len_hist1 | [ 0.0, 1.99 ] |
| trace_len_hist2 | [ 0.0, 0.42 ] |
| trace_len_hist3 | [ 0.0, 0.4 ] |
| trace_len_hist4 | [ 0.0, 0.19 ] |
| trace_len_hist5 | [ 0.0, 0.14 ] |
| trace_len_hist6 | [ 0.0, 10.0 ] |
| trace_len_hist7 | [ 0.0, 0.02 ] |
| trace_len_hist8 | [ 0.0, 0.04 ] |
| trace_len_hist9 | [ 0.0, 0.0 ] |
| trace_len_hist10 | [ 0.0, 2.7 ] |
| trace_len_skewness_hist | [ -0.58, 111.97 ] |
| trace_len_kurtosis_hist | [ -0.97, 14006.75 ] |
| ratio_most_common_variant | [ 0.0, 0.79 ] |
| ratio_top_1_variants | [ 0.0, 0.87 ] |
| ratio_top_5_variants | [ 0.0, 0.98 ] |
| ratio_top_10_variants | [ 0.0, 0.99 ] |
| ratio_top_20_variants | [ 0.2, 1.0 ] |
| ratio_top_50_variants | [ 0.5, 1.0 ] |
| ratio_top_75_variants | [ 0.75, 1.0 ] |
| mean_variant_occurrence | [ 1.0, 24500.67 ] |
| std_variant_occurrence | [ 0.04, 42344.04 ] |
| skewness_variant_occurrence | [ 1.54, 64.77 ] |
| kurtosis_variant_occurrence | [ 0.66, 5083.46 ] |
| n_unique_activities | [ 1.0, 1152.0 ] |
| activities_min | [ 1.0, 66058.0 ] |
| activities_max | [ 34.0, 466141.0 ] |
| activities_mean | [ 4.13, 66058.0 ] |
| activities_median | [ 2.0, 66058.0 ] |
| activities_std | [ 0.0, 120522.25 ] |
| activities_variance | [ 0.0, 14525612122.34 ] |
| activities_q1 | [ 1.0, 66058.0 ] |
| activities_q3 | [ 4.0, 79860.0 ] |
| activities_iqr | [ 0.0, 77290.0 ] |
| activities_skewness | [ -0.06, 15.21 ] |
| activities_kurtosis | [ -1.5, 315.84 ] |
| n_unique_start_activities | [ 1.0, 809.0 ] |
| start_activities_min | [ 1.0, 150370.0 ] |
| start_activities_max | [ 27.0, 199867.0 ] |
| start_activities_mean | [ 3.7, 150370.0 ] |
| start_activities_median | [ 1.0, 150370.0 ] |
| start_activities_std | [ 0.0, 65387.49 ] |
| start_activities_variance | [ 0.0, 4275524278.19 ] |
| start_activities_q1 | [ 1.0, 150370.0 ] |
| start_activities_q3 | [ 4.0, 150370.0 ] |
| start_activities_iqr | [ 0.0, 23387.25 ] |
| start_activities_skewness | [ 0.0, 9.3 ] |
| start_activities_kurtosis | [ -2.0, 101.82 ] |
| n_unique_end_activities | [ 1.0, 757.0 ] |
| end_activities_min | [ 1.0, 16653.0 ] |
| end_activities_max | [ 28.0, 181328.0 ] |
| end_activities_mean | [ 3.53, 24500.67 ] |
| end_activities_median | [ 1.0, 16653.0 ] |
| end_activities_std | [ 0.0, 42344.04 ] |
| end_activities_variance | [ 0.0, 1793017566.89 ] |
| end_activities_q1 | [ 1.0, 16653.0 ] |
| end_activities_q3 | [ 3.0, 39876.0 ] |
| end_activities_iqr | [ 0.0, 39766.0 ] |
| end_activities_skewness | [ -0.7, 13.82 ] |
| end_activities_kurtosis | [ -2.0, 255.39 ] |
| eventropy_trace | [ 0.0, 13.36 ] |
| eventropy_prefix | [ 0.0, 16.77 ] |
| eventropy_global_block | [ 0.0, 24.71 ] |
| eventropy_lempel_ziv | [ 0.0, 685.0 ] |
| eventropy_k_block_diff_1 | [ -328.0, 962.0 ] |
| eventropy_k_block_diff_3 | [ 0.0, 871.0 ] |
| eventropy_k_block_diff_5 | [ 0.0, 881.0 ] |
| eventropy_k_block_ratio_1 | [ 0.0, 935.0 ] |
| eventropy_k_block_ratio_3 | [ 0.0, 7.11 ] |
| eventropy_k_block_ratio_5 | [ 0.0, 7.11 ] |
| eventropy_knn_3 | [ 0.0, 8.93 ] |
| eventropy_knn_5 | [ 0.0, 648.0 ] |
| eventropy_knn_7 | [ 0.0, 618.0 ] |
| epa_variant_entropy | [ 0.0, 11563842.15 ] |
| epa_normalized_variant_entropy | [ 0.0, 0.9 ] |
| epa_sequence_entropy | [ 0.0, 21146257.12 ] |
| epa_normalized_sequence_entropy | [ 0.0, 0.76 ] |
| epa_sequence_entropy_linear_forgetting | [ 0.0, 14140225.9 ] |
| epa_normalized_sequence_entropy_linear_forgetting | [ 0.0, 0.42 ] |
| epa_sequence_entropy_exponential_forgetting | [ 0.0, 15576076.83 ] |
| epa_normalized_sequence_entropy_exponential_forgetting | [ 0.0, 0.51 ] |
 ### Benchmark
 The benchmarking defines the downstream task which is used for evaluating the goodness of the synthesized event log datasets with the metrics of real-world datasets. The command to execute a benchmarking is shown in the following script:

+```console
+conda activate gedi
+python main.py -a config_files/pipeline_steps/benchmark.json
 ```

 In the `benchmark.json`, we have the following key-value pairs:

 The evaluation plotting step is used just for visualization. Some examples of how the plotter can be used are shown in the following script:

+```console
+conda activate gedi
+python main.py -a config_files/pipeline_steps/evaluation_plotter.json
 ```

 Generally, in the `evaluation_plotter.json`, we have the following key-value pairs:

 ### Generating data with real targets
 To execute the experiments with real targets, we employ the [experiment_real_targets.json](config_files/experiment_real_targets.json). The script's pipeline will output the [generated event logs (GenBaselineED)](data/event_logs/GenBaselineED), which optimize their feature values towards [real-world event data features](data/BaselineED_feat.csv), alongside their respective measured [feature values](data/GenBaselineED_feat.csv) and [benchmark metrics values](data/GenBaselineED_bench.csv).

+```console
+conda activate gedi
+python main.py -a config_files/experiment_real_targets.json
 ```

 ### Generating data with grid targets

 ```
 We employ the [experiment_grid_2obj_configfiles_fabric.ipynb](notebooks/experiment_grid_2obj_configfiles_fabric.ipynb) to create all necessary [configuration](config_files/grid_2obj) and [objective](data/grid_2obj) files for this experiment.
 For more details about these config_files, please refer to [Feature Extraction](#feature-extraction), [Generation](#generation), and [Benchmark](#benchmark).
+To create configuration files for grid objectives interactively, you can start the following dashboard:
+```
+streamlit run utils/config_fabric.py # To tunnel to local machine add: --server.port 8501 --server.headless true

+# In local machine (only in case you are tunneling):
+ssh -N -f -L 9000:localhost:8501 <user@remote_machine.com>
+open "http://localhost:9000/"
+```
 ### Visualizations
 To run the visualizations, we employ [jupyter notebooks](https://jupyter.org/install) and [add the installed environment to the jupyter notebook](https://medium.com/@nrk25693/how-to-add-your-conda-environment-to-your-jupyter-notebook-in-just-4-steps-abeab8b8d084). We then start all visualizations by running e.g.: `jupyter notebook`. In the following, we describe the `.ipynb`-files in the folder `\notebooks` to reproduce the figures from our paper.

 #### [Fig. 4 and fig. 5 Representativeness](notebooks/gedi_figs4and5_representativeness.ipynb)

 The `GEDI` framework is taken directly from the original paper by [Maldonado](mailto:[email protected]), Frey, Tavares, Rehwald and Seidl and is *to appear at BPM'24*.

 ```bibtex
+@article{maldonado2024gedi,
+  author = {Maldonado, Andrea and Frey, {Christian M. M.} and Tavares, {Gabriel M.} and Rehwald, Nikolina and Seidl, Thomas},
+  title = {{GEDI:} Generating Event Data with Intentional Features for Benchmarking Process Mining},
+  journal = {To be published in BPM 2024. Krakow, Poland, Sep 01-06},
+  volume = {},
+  year = {2024},
+  url = {https://mcml.ai/publications/gedi.pdf},
+  doi = {},
+  eprinttype = {website},
 }
 ```
config_files/test/test_abbrv_generation.json
ADDED
@@ -0,0 +1,16 @@
+[{"pipeline_step": "event_logs_generation",
+  "output_path": "output/test",
+  "generator_params": {"experiment":
+      {"input_path": "data/test/igedi_table_1.csv",
+       "objectives": ["rmcv", "ense"]},
+    "config_space": {"mode": [5, 20], "sequence": [0.01, 1],
+      "choice": [0.01, 1], "parallel": [0.01, 1], "loop": [0.01, 1],
+      "silent": [0.01, 1], "lt_dependency": [0.01, 1],
+      "num_traces": [10, 10001], "duplicate": [0],
+      "or": [0]}, "n_trials": 2}},
+ {"pipeline_step": "feature_extraction",
+  "input_path": "output/test/igedi_table_1/2_ense_rmcv",
+  "feature_params": {"feature_set": ["simple_stats", "trace_length", "trace_variant",
+      "activities", "start_activities", "end_activities", "eventropies", "epa_based"]},
+  "output_path": "output/plots", "real_eventlog_path": "data/test/2_bpic_features.csv",
+  "plot_type": "boxplot"}]
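A minimal sketch of how this new test configuration could be inspected before handing it to the pipeline (assumes the repository root as working directory; the expected output is inferred from the JSON above):

```python
import json

# Load the added test config and list the pipeline steps it defines.
with open("config_files/test/test_abbrv_generation.json") as f:
    steps = json.load(f)

print([step["pipeline_step"] for step in steps])
# ['event_logs_generation', 'feature_extraction']
```

Per the updated README, such a config is passed to the pipeline as `python main.py -a config_files/test/test_abbrv_generation.json`.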
data/test/igedi_table_1.csv
ADDED
@@ -0,0 +1,4 @@
+log,rmcv,ense
+BPIC15f4,0.003,0.604
+RTFMP,0.376,0.112
+HD,0.517,0.254
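As a small sketch (assuming pandas is installed), this table of target feature values, referenced as the experiment's `input_path` in the test config above, can be loaded directly:

```python
import pandas as pd

# Read the target objectives per log; column names use the abbreviated
# feature names ("rmcv", "ense") introduced by utils/column_mappings.py.
targets = pd.read_csv("data/test/igedi_table_1.csv")
print(targets)
#         log   rmcv   ense
# 0  BPIC15f4  0.003  0.604
# 1     RTFMP  0.376  0.112
# 2        HD  0.517  0.254
```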
data/validation/2_ense_rmcv_feat.csv
ADDED
@@ -0,0 +1,4 @@
log,n_traces,n_unique_traces,trace_len_coefficient_variation,trace_len_entropy,trace_len_geometric_mean,trace_len_geometric_std,trace_len_harmonic_mean,trace_len_hist1,trace_len_hist10,trace_len_hist2,trace_len_hist3,trace_len_hist4,trace_len_hist5,trace_len_hist6,trace_len_hist7,trace_len_hist8,trace_len_hist9,trace_len_iqr,trace_len_kurtosis,trace_len_kurtosis_hist,trace_len_max,trace_len_mean,trace_len_median,trace_len_min,trace_len_mode,trace_len_q1,trace_len_q3,trace_len_skewness,trace_len_skewness_hist,trace_len_std,trace_len_variance,kurtosis_variant_occurrence,mean_variant_occurrence,ratio_most_common_variant,ratio_top_10_variants,ratio_top_1_variants,ratio_top_20_variants,ratio_top_50_variants,ratio_top_5_variants,ratio_top_75_variants,skewness_variant_occurrence,std_variant_occurrence,activities_iqr,activities_kurtosis,activities_max,activities_mean,activities_median,activities_min,activities_q1,activities_q3,activities_skewness,activities_std,activities_variance,n_unique_activities,n_unique_start_activities,start_activities_iqr,start_activities_kurtosis,start_activities_max,start_activities_mean,start_activities_median,start_activities_min,start_activities_q1,start_activities_q3,start_activities_skewness,start_activities_std,start_activities_variance,end_activities_iqr,end_activities_kurtosis,end_activities_max,end_activities_mean,end_activities_median,end_activities_min,end_activities_q1,end_activities_q3,end_activities_skewness,end_activities_std,end_activities_variance,n_unique_end_activities,eventropy_global_block,eventropy_global_block_flattened,eventropy_k_block_diff_1,eventropy_k_block_diff_3,eventropy_k_block_diff_5,eventropy_k_block_ratio_1,eventropy_k_block_ratio_3,eventropy_k_block_ratio_5,eventropy_knn_3,eventropy_knn_5,eventropy_knn_7,eventropy_lempel_ziv,eventropy_lempel_ziv_flattened,eventropy_prefix,eventropy_prefix_flattened,eventropy_trace,epa_variant_entropy,epa_normalized_variant_entropy,epa_sequence_entropy,epa_normalized_sequence_entropy,epa_sequence_entropy_linear_forgetting,epa_normalized_sequence_entropy_linear_forgetting,epa_sequence_entropy_exponential_forgetting,epa_normalized_sequence_entropy_exponential_forgetting,ratio_variants_per_number_of_traces
genELBPIC15f4_0604_0003,8616,4031,1.0086445672512825,8.700230419287818,8.516920996327995,2.1832133718212567,6.58111248846037,0.05713165933282198,1.682074468800883e-05,0.009932649738269211,0.0033136867035377378,0.0012279143622246447,0.0005214430853282738,0.00017661781922409254,0.0001093348404720574,3.364148937601766e-05,0.0,9.0,11.77613857723645,4.64306597180025,141,11.964136490250697,7.0,3,3,5.0,14.0,2.836323931248485,2.5294876299887217,12.067561272744191,145.6260350714354,1651.5545366193303,2.137434879682461,0.09099350046425256,0.5789229340761374,0.40401578458681525,0.6256963788300836,0.766016713091922,0.5258820798514392,0.883008356545961,36.276105773051086,15.574023282690577,2184.5,1.9085746306932307,34121,12885.375,8627.0,8584,8616.0,10800.5,1.8663249384138656,8507.416043333898,72376127.734375,8,2,2111.0,-2.0,6419,4308.0,4308.0,2197,3252.5,5363.5,0.0,2111.0,4456321.0,768.0,0.0021026107788850723,4895,1723.2,832.0,495,813.0,1581.0,1.331337855426617,1625.5283940922102,2642342.56,5,15.897,16.276,2.756,1.525,1.375,2.756,2.016,1.775,6.564,6.07,5.761,1.405,1.786,12.139,13.493,9.703,365917.06171394786,0.7166786736830569,651595.1462643282,0.5475971681938718,62016.045914910814,0.05211796208164211,266396.7627350506,0.22387845232743814,0.46785051067780875
genELHD_0254_0517,6822,565,1.1300022933733087,8.390788875278787,1.9006921917027269,2.263915758458681,1.4763543408149593,0.28822871537617945,0.00010858116985352402,0.04077222927999826,0.02383356678284851,0.006080545511797346,0.005591930247456488,0.002823110416191621,0.0017915893025831464,0.0006514870191211442,0.0004886152643408582,2.0,9.718268017319556,4.770965470001153,28,2.8346525945470535,1.0,1,1,1.0,3.0,2.765986310146101,2.5637920433464965,3.2031639327547703,10.260259180101007,226.4931382842208,12.07433628318584,0.24860744649662855,0.9079448841981823,0.6807387862796834,0.9321313397830548,0.9585165640574611,0.8717384931105248,0.9791849897390794,14.639488482439702,105.6342402074512,1283.0,8.118508585327676,6848,1137.5294117647059,472.0,208,413.0,1696.0,2.9234849385484285,1541.823981624173,2377221.1903114184,17,10,294.25,2.299363631971671,3383,682.2,217.0,101,121.75,416.0,1.9301655015244086,1008.2924972447232,1016653.7600000001,334.5,2.8813625853874614,3383,620.1818181818181,157.0,79,104.5,439.0,2.0614116860983223,981.5564465945092,963453.0578512397,11,9.069,10.932,3.265,0.908,0.67,3.265,1.808,1.456,4.81,4.359,4.05,0.696,2.01,6.995,10.12,4.469,16958.33766640406,0.7450438396474315,70379.87102533762,0.36874603139171797,9719.481922433943,0.050923940806750986,30545.050254490514,0.16003675334882345,0.08282028730577544
genELRTFMP_0112_0376,6822,565,1.1300022933733087,8.390788875278787,1.9006921917027269,2.263915758458681,1.4763543408149593,0.28822871537617945,0.00010858116985352402,0.04077222927999826,0.02383356678284851,0.006080545511797346,0.005591930247456488,0.002823110416191621,0.0017915893025831464,0.0006514870191211442,0.0004886152643408582,2.0,9.718268017319556,4.770965470001153,28,2.8346525945470535,1.0,1,1,1.0,3.0,2.765986310146101,2.5637920433464965,3.2031639327547703,10.260259180101007,226.4931382842208,12.07433628318584,0.24860744649662855,0.9079448841981823,0.6807387862796834,0.9321313397830548,0.9585165640574611,0.8717384931105248,0.9791849897390794,14.639488482439702,105.6342402074512,1283.0,8.118508585327676,6848,1137.5294117647059,472.0,208,413.0,1696.0,2.9234849385484285,1541.823981624173,2377221.1903114184,17,10,294.25,2.299363631971671,3383,682.2,217.0,101,121.75,416.0,1.9301655015244086,1008.2924972447232,1016653.7600000001,334.5,2.8813625853874614,3383,620.1818181818181,157.0,79,104.5,439.0,2.0614116860983223,981.5564465945092,963453.0578512397,11,9.069,10.932,3.265,0.908,0.67,3.265,1.808,1.456,4.81,4.359,4.05,0.696,2.01,6.995,10.12,4.469,16958.33766640406,0.7450438396474315,70379.87102533762,0.36874603139171797,9719.481922433943,0.050923940806750986,30545.050254490514,0.16003675334882345,0.08282028730577544
gedi/__init__.py
CHANGED
@@ -1,3 +1,7 @@
-from .

-__all__=['

+from .generator import GenerateEventLogs
+from .features import EventLogFeatures
+from .augmentation import InstanceAugmentator
+from .benchmark import BenchmarkTest
+from .plotter import BenchmarkPlotter, FeaturesPlotter, AugmentationPlotter, GenerationPlotter

+__all__=[ 'GenerateEventLogs', 'EventLogFeatures', 'FeatureAnalyser', 'InstanceAugmentator', 'BenchmarkTest', 'BenchmarkPlotter', 'FeaturesPlotter', 'AugmentationPlotter', 'GenerationPlotter']
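With the rewritten `gedi/__init__.py`, the pipeline classes are re-exported at package level; a minimal import sketch, assuming the `gedi` package from this release is installed:

```python
# The names below come straight from __all__ in gedi/__init__.py;
# downstream code no longer needs to know the internal module layout.
from gedi import GenerateEventLogs, EventLogFeatures, BenchmarkTest

print(GenerateEventLogs.__module__)  # gedi.generator
print(EventLogFeatures.__module__)   # gedi.features
print(BenchmarkTest.__module__)      # gedi.benchmark
```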
gedi/features.py
CHANGED
@@ -10,7 +10,7 @@ from pathlib import Path
 from utils.param_keys import INPUT_PATH
 from utils.param_keys.features import FEATURE_PARAMS, FEATURE_SET
 from gedi.utils.io_helpers import dump_features_json
-
 def get_sortby_parameter(elem):
     number = int(elem.rsplit(".")[0].rsplit("_", 1)[1])
     return number
@@ -63,6 +63,8 @@ class EventLogFeatures(EventLogFile):

         if str(self.filename).endswith('csv'): # Returns dataframe from loaded metafeatures file
             self.feat = pd.read_csv(self.filepath)
             print(f"SUCCESS: EventLogFeatures loaded features from {self.filepath}")
         elif isinstance(self.filename, list): # Computes metafeatures for list of .xes files
             combined_features=pd.DataFrame()

 from utils.param_keys import INPUT_PATH
 from utils.param_keys.features import FEATURE_PARAMS, FEATURE_SET
 from gedi.utils.io_helpers import dump_features_json
+from utils.column_mappings import column_mappings
 def get_sortby_parameter(elem):
     number = int(elem.rsplit(".")[0].rsplit("_", 1)[1])
     return number

         if str(self.filename).endswith('csv'): # Returns dataframe from loaded metafeatures file
             self.feat = pd.read_csv(self.filepath)
+            columns_to_rename = {col: column_mappings()[col] for col in self.feat.columns if col in column_mappings()}
+            self.feat.rename(columns=columns_to_rename, inplace=True)
             print(f"SUCCESS: EventLogFeatures loaded features from {self.filepath}")
         elif isinstance(self.filename, list): # Computes metafeatures for list of .xes files
             combined_features=pd.DataFrame()
gedi/generator.py
CHANGED
@@ -21,6 +21,7 @@ from utils.param_keys import OUTPUT_PATH, INPUT_PATH
 from utils.param_keys.generator import GENERATOR_PARAMS, EXPERIMENT, CONFIG_SPACE, N_TRIALS
 from gedi.utils.io_helpers import get_output_key_value_location, dump_features_json, compute_similarity
 from gedi.utils.io_helpers import read_csvs
 import xml.etree.ElementTree as ET
 import re
 from xml.dom import minidom
@@ -153,6 +154,8 @@ class GenerateEventLogs():
         experiment = self.params.get(EXPERIMENT)
         if experiment is not None:
             tasks, output_path = get_tasks(experiment, self.output_path)
             self.output_path = output_path

             if 'ratio_variants_per_number_of_traces' in tasks.columns: #HOTFIX

 from utils.param_keys.generator import GENERATOR_PARAMS, EXPERIMENT, CONFIG_SPACE, N_TRIALS
 from gedi.utils.io_helpers import get_output_key_value_location, dump_features_json, compute_similarity
 from gedi.utils.io_helpers import read_csvs
+from utils.column_mappings import column_mappings
 import xml.etree.ElementTree as ET
 import re
 from xml.dom import minidom

         experiment = self.params.get(EXPERIMENT)
         if experiment is not None:
             tasks, output_path = get_tasks(experiment, self.output_path)
+            columns_to_rename = {col: column_mappings()[col] for col in tasks.columns if col in column_mappings()}
+            tasks = tasks.rename(columns=columns_to_rename)
             self.output_path = output_path

             if 'ratio_variants_per_number_of_traces' in tasks.columns: #HOTFIX
gedi/run.py
DELETED
@@ -1,53 +0,0 @@
-import config
-import pandas as pd
-from datetime import datetime as dt
-from gedi.generator import GenerateEventLogs
-from gedi.features import EventLogFeatures
-from gedi.augmentation import InstanceAugmentator
-from gedi.benchmark import BenchmarkTest
-from gedi.plotter import BenchmarkPlotter, FeaturesPlotter, AugmentationPlotter, GenerationPlotter
-from utils.default_argparse import ArgParser
-from utils.param_keys import *
-
-def run(kwargs:dict, model_params_list: list, filename_list:list):
-    """
-    This function chooses the running option for the program.
-    @param kwargs: dict
-        contains the running parameters and the event-log file information
-    @param model_params_list: list
-        contains a list of model parameters, which are used to analyse this different models.
-    @param filename_list: list
-        contains the list of the filenames to load multiple event-logs
-    @return:
-    """
-    params = kwargs[PARAMS]
-    ft = EventLogFeatures(None)
-    augmented_ft = InstanceAugmentator()
-    gen = pd.DataFrame(columns=['log'])
-
-    for model_params in model_params_list:
-        if model_params.get(PIPELINE_STEP) == 'instance_augmentation':
-            augmented_ft = InstanceAugmentator(aug_params=model_params, samples=ft.feat)
-            AugmentationPlotter(augmented_ft, model_params)
-        elif model_params.get(PIPELINE_STEP) == 'event_logs_generation':
-            gen = pd.DataFrame(GenerateEventLogs(model_params).log_config)
-            #gen = pd.read_csv("output/features/generated/grid_2objectives_enseef_enve/2_enseef_enve_feat.csv")
-            #GenerationPlotter(gen, model_params, output_path="output/plots")
-        elif model_params.get(PIPELINE_STEP) == 'benchmark_test':
-            benchmark = BenchmarkTest(model_params, event_logs=gen['log'])
-            # BenchmarkPlotter(benchmark.features, output_path="output/plots")
-        elif model_params.get(PIPELINE_STEP) == 'feature_extraction':
-            ft = EventLogFeatures(**kwargs, logs=gen['log'], ft_params=model_params)
-            FeaturesPlotter(ft.feat, model_params)
-        elif model_params.get(PIPELINE_STEP) == "evaluation_plotter":
-            GenerationPlotter(gen, model_params, output_path=model_params['output_path'], input_path=model_params['input_path'])
-
-def gedi(config_path):
-    """
-    This function runs the GEDI pipeline.
-    @param config_path: str
-        contains the path to the config file
-    @return:
-    """
-    model_params_list = config.get_model_params_list(config_path)
-    run({'params':""}, model_params_list, [])
execute_grid_experiments.py → gedi/utils/execute_grid_experiments.py
RENAMED
@@ -3,7 +3,7 @@ import os
 import sys

 from datetime import datetime as dt
-from
 from tqdm import tqdm

 #TODO: Pass i properly

 import sys

 from datetime import datetime as dt
+from io_helpers import sort_files
 from tqdm import tqdm

 #TODO: Pass i properly
main.py
CHANGED
@@ -1,12 +1,54 @@
 import config
 from datetime import datetime as dt
-from gedi.
 from utils.default_argparse import ArgParser
 from utils.param_keys import *

 if __name__=='__main__':
     start_gedi = dt.now()
     print(f'INFO: GEDI starting {start_gedi}')
     args = ArgParser().parse('GEDI main')
-
     print(f'SUCCESS: GEDI took {dt.now()-start_gedi} sec.')

 import config
+import pandas as pd
 from datetime import datetime as dt
+from gedi.generator import GenerateEventLogs
+from gedi.features import EventLogFeatures
+from gedi.augmentation import InstanceAugmentator
+from gedi.benchmark import BenchmarkTest
+from gedi.plotter import BenchmarkPlotter, FeaturesPlotter, AugmentationPlotter, GenerationPlotter
 from utils.default_argparse import ArgParser
 from utils.param_keys import *

+def run(kwargs:dict, model_params_list: list, filename_list:list):
+    """
+    This function chooses the running option for the program.
+    @param kwargs: dict
+        contains the running parameters and the event-log file information
+    @param model_params_list: list
+        contains a list of model parameters, which are used to analyse this different models.
+    @param filename_list: list
+        contains the list of the filenames to load multiple event-logs
+    @return:
+    """
+    params = kwargs[PARAMS]
+    ft = EventLogFeatures(None)
+    augmented_ft = InstanceAugmentator()
+    gen = pd.DataFrame(columns=['log'])
+
+    for model_params in model_params_list:
+        if model_params.get(PIPELINE_STEP) == 'instance_augmentation':
+            augmented_ft = InstanceAugmentator(aug_params=model_params, samples=ft.feat)
+            AugmentationPlotter(augmented_ft, model_params)
+        elif model_params.get(PIPELINE_STEP) == 'event_logs_generation':
+            gen = pd.DataFrame(GenerateEventLogs(model_params).log_config)
+            #gen = pd.read_csv("output/features/generated/grid_2objectives_enseef_enve/2_enseef_enve_feat.csv")
+            #GenerationPlotter(gen, model_params, output_path="output/plots")
+        elif model_params.get(PIPELINE_STEP) == 'benchmark_test':
+            benchmark = BenchmarkTest(model_params, event_logs=gen['log'])
+            # BenchmarkPlotter(benchmark.features, output_path="output/plots")
+        elif model_params.get(PIPELINE_STEP) == 'feature_extraction':
+            ft = EventLogFeatures(**kwargs, logs=gen['log'], ft_params=model_params)
+            FeaturesPlotter(ft.feat, model_params)
+        elif model_params.get(PIPELINE_STEP) == "evaluation_plotter":
+            GenerationPlotter(gen, model_params, output_path=model_params['output_path'], input_path=model_params['input_path'])
+
+
 if __name__=='__main__':
     start_gedi = dt.now()
     print(f'INFO: GEDI starting {start_gedi}')
+
     args = ArgParser().parse('GEDI main')
+    model_params_list = config.get_model_params_list(args.alg_params_json)
+    run({'params':""}, model_params_list, [])
+
     print(f'SUCCESS: GEDI took {dt.now()-start_gedi} sec.')
setup.py
CHANGED
@@ -4,7 +4,7 @@ import os
 with open("README.md", "r") as fh:
     long_description = fh.read()

-version_string = os.environ.get("VERSION_PLACEHOLDER", "0.0
 print(version_string)
 version = version_string

@@ -25,59 +25,14 @@ setup(
         'Levenshtein==0.23.0',
         'matplotlib==3.8.4',
         'numpy==1.26.4',
         'pm4py==2.7.2',
         'scikit-learn==1.2.2',
-        'scipy==1.
         'seaborn==0.13.2',
         'smac==2.0.2',
         'tqdm==4.65.0',
-        'streamlit-toggle-switch>=1.0.2'
-        'click==8.1.7',
-        'cloudpickle==3.0.0',
-        'configspace==0.7.1',
-        'cvxopt==1.3.2',
-        'dask==2024.2.1',
-        'dask-jobqueue==0.8.5',
-        'deprecation==2.1.0',
-        'distributed==2024.2.1',
-        'emcee==3.1.4',
-        'feeed == 1.2.0',
-        'fsspec==2024.2.0',
-        'imbalanced-learn==0.12.0',
-        'imblearn==0.0',
-        'importlib-metadata==7.0.1',
-        'intervaltree==3.1.0',
-        'jinja2==3.1.3',
-        'levenshtein==0.23.0',
-        'locket==1.0.0',
-        'lxml==5.1.0',
-        'markupsafe==2.1.5',
-        'more-itertools==10.2.0',
-        'msgpack==1.0.8',
-        'networkx==3.2.1',
-        'numpy==1.26.4',
-        'pandas>=2.0.0',
-        'partd==1.4.1',
-        'pm4py==2.7.2',
-        'psutil==5.9.8',
-        'pydotplus==2.0.2',
-        'pynisher==1.0.10',
-        'pyrfr==0.9.0',
-        'pyyaml==6.0.1',
-        'rapidfuzz==3.6.1',
-        'regex==2023.12.25',
-        'scikit-learn==1.2.2',
-        'scipy==1.10.1',
-        'seaborn==0.13.2',
-        'smac==2.0.2',
-        'sortedcontainers==2.4.0',
-        'stringdist==1.0.9',
-        'tblib==3.0.0',
-        'toolz==0.12.1',
-        'tqdm==4.65.0',
-        'typing-extensions==4.10.0',
-        'urllib3==2.2.1',
-        'zict==3.0.0'
     ],
     packages = ['gedi'],
     classifiers=[
@@ -87,4 +42,4 @@ setup(
         'License :: OSI Approved :: MIT License', # Again, pick a license
         'Programming Language :: Python :: 3.9',
     ],
-)

 with open("README.md", "r") as fh:
     long_description = fh.read()

+version_string = os.environ.get("VERSION_PLACEHOLDER", "1.0.0")
 print(version_string)
 version = version_string

         'Levenshtein==0.23.0',
         'matplotlib==3.8.4',
         'numpy==1.26.4',
+        'pandas==2.2.2',
         'pm4py==2.7.2',
         'scikit-learn==1.2.2',
+        'scipy==1.13.0',
         'seaborn==0.13.2',
         'smac==2.0.2',
         'tqdm==4.65.0',
+        'streamlit-toggle-switch>=1.0.2'
     ],
     packages = ['gedi'],
     classifiers=[

         'License :: OSI Approved :: MIT License', # Again, pick a license
         'Programming Language :: Python :: 3.9',
     ],
+)
utils/column_mappings.py
ADDED
@@ -0,0 +1,16 @@
+def column_mappings():
+
+    column_names_short = {
+        'rutpt': 'ratio_unique_traces_per_trace',
+        'rmcv': 'ratio_most_common_variant',
+        'tlcv': 'trace_len_coefficient_variation',
+        'mvo': 'mean_variant_occurrence',
+        'enve': 'epa_normalized_variant_entropy',
+        'ense': 'epa_normalized_sequence_entropy',
+        'eself': 'epa_sequence_entropy_linear_forgetting',
+        'enself': 'epa_normalized_sequence_entropy_linear_forgetting',
+        'eseef': 'epa_sequence_entropy_exponential_forgetting',
+        'enseef': 'epa_normalized_sequence_entropy_exponential_forgetting'
+    }
+
+    return column_names_short
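This mapping is what `gedi/features.py`, `gedi/generator.py`, and `utils/config_fabric.py` use above to translate abbreviated objective names into full feature names. A standalone sketch of that renaming pattern (sample values taken from `data/test/igedi_table_1.csv`):

```python
import pandas as pd
from utils.column_mappings import column_mappings

# Targets with abbreviated objective names, as in data/test/igedi_table_1.csv.
df = pd.DataFrame({"log": ["BPIC15f4"], "rmcv": [0.003], "ense": [0.604]})

# Rename only the columns that appear in the mapping, mirroring the changed modules.
mapping = column_mappings()
df = df.rename(columns={col: mapping[col] for col in df.columns if col in mapping})

print(list(df.columns))
# ['log', 'ratio_most_common_variant', 'epa_normalized_sequence_entropy']
```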
utils/config_fabric.py
CHANGED
@@ -13,6 +13,7 @@ import time
 import shutil
 import zipfile
 import io

 st.set_page_config(layout='wide')
 INPUT_XES="output/inputlog_temp.xes"
@@ -174,6 +175,10 @@ def set_generator_experiments(generator_params):
         df = pd.read_csv(uploaded_file)
         if len(df.columns) <= 1:
             raise pd.errors.ParserError("Please select a file with at least two columns (e.g. log, feature) and use ',' as a delimiter.")
         sel_features = st.multiselect("Selected features", list(df.columns), list(df.columns)[-1])
         if sel_features:
             df = df[sel_features]

 import shutil
 import zipfile
 import io
+from column_mappings import column_mappings

 st.set_page_config(layout='wide')
 INPUT_XES="output/inputlog_temp.xes"

         df = pd.read_csv(uploaded_file)
         if len(df.columns) <= 1:
             raise pd.errors.ParserError("Please select a file with at least two columns (e.g. log, feature) and use ',' as a delimiter.")
+        columns_to_rename = {col: column_mappings()[col] for col in df.columns if col in column_mappings()}
+
+        # Rename the matching columns
+        df.rename(columns=columns_to_rename, inplace=True)
         sel_features = st.multiselect("Selected features", list(df.columns), list(df.columns)[-1])
         if sel_features:
             df = df[sel_features]