Spaces: Running

Andrea Maldonado committed · 353129b
Parent(s): 9013f63
Release cr

Browse files:
- README.md +257 -63
- config_files/test/test_abbrv_generation.json +16 -0
- data/test/igedi_table_1.csv +4 -0
- data/validation/2_ense_rmcv_feat.csv +4 -0
- gedi/__init__.py +6 -2
- gedi/features.py +3 -1
- gedi/generator.py +3 -0
- gedi/run.py +0 -53
- execute_grid_experiments.py → gedi/utils/execute_grid_experiments.py +1 -1
- main.py +44 -2
- setup.py +5 -50
- utils/column_mappings.py +16 -0
- utils/config_fabric.py +5 -0
README.md
CHANGED
@@ -17,18 +17,12 @@ license: mit

 **i**nteractive **G**enerating **E**vent **D**ata with **I**ntentional Features for Benchmarking Process Mining<br />
 This repository contains the codebase for the interactive web application tool (iGEDI) as well as for the [GEDI paper](https://mcml.ai/publications/gedi.pdf) accepted at the BPM'24 conference.
-Our documentation also includes both frameworks. From [General Usage](#general-usage) and beyond, documentation refers especially to reproducibility of the [GEDI paper](https://mcml.ai/publications/gedi.pdf).
-
-A video tutorial on how to use this tool can be found [here](https://youtu.be/9iQhaYwyQ9E).
-

 ## Table of Contents

 - [Interactive Web Application (iGEDI)](#interactive-web-application)
 - [Installation](#installation)
-  - [as PyPi Package](#install-as-pypi-package)
-  - [of iGEDI](#install-igedi)
-  - [as local repository](#install-as-local-repository)
 - [General Usage](#general-usage)
 - [Experiments](#experiments)
 - [Citation](#citation)
@@ -37,8 +31,7 @@ A video tutorial on how to use this tool can be found [here](https://youtu.be/9i

 Our [interactive web application](https://huggingface.co/spaces/andreamalhera/gedi) (iGEDI) guides you through the specification process and runs GEDI for you. You can directly download the resulting generated logs or the configuration file to run GEDI locally.
 

-##
-### Requirements
 - [Miniconda](https://docs.conda.io/en/latest/miniconda.html)
 - Graphviz on your OS e.g.
 For MacOS:
@@ -50,30 +43,13 @@ brew install swig

 ```console
 conda install pyrfr swig
 ```
-
-
-```shell
-pip install gedi
-```
-and run:
-```shell
-python -c "from gedi import gedi; gedi('config_files/pipeline_steps/generation.json')"
-```
-### Install iGEDI
-Our [interactive GEDI (iGEDI)](https://huggingface.co/spaces/andreamalhera/gedi) can be employed to create all necessary [configuration files](config_files) to reproduce our experiments.
-Users can directly use our [web application service](https://huggingface.co/spaces/andreamalhera/gedi) or locally start the following dashboard:
-```
-streamlit run utils/config_fabric.py # To tunnel to local machine add: --server.port 8501 --server.headless true
-
-# In local machine (only in case you are tunneling):
-ssh -N -f -L 9000:localhost:8501 <user@remote_machine.com>
-open "http://localhost:9000/"
-```

-###
 ```console
-conda
-
 ```
 The last step should take only a few minutes to run.

@@ -85,8 +61,9 @@ Our pipeline offers several pipeline steps, which can be run sequentially or par

 - [Evaluation Plotter](https://github.com/lmu-dbs/gedi/blob/16-documentation-update-readme/README.md#evaluation-plotting)

 To run different steps of the GEDI pipeline, please adapt the `.json` accordingly.
-```
-
 ```
 For reference of possible keys and values for each step, please see `config_files/test/experiment_test.json`.
 To run the whole pipeline please create a new `.json` file, specifying all steps you want to run and specify desired keys and values for each step.
@@ -95,8 +72,9 @@ To reproduce results from our paper, please refer to [Experiments](#experiments)

 ### Feature Extraction
 ---
 To extract the features on the event-log level and use them for hyperparameter optimization, we employ the following script:
-```
-
 ```
 The JSON file consists of the following key-value pairs:

@@ -116,8 +94,9 @@ After having extracted meta features from the files, the next step is to generat

 The command to execute the generation step is given by an exemplary generation.json file:

-```
-
 ```

 In the `generation.json`, we have the following key-value pairs:
@@ -144,11 +123,228 @@ In the `generation.json`, we have the following key-value pairs:

 - plot_reference_feature: defines the feature which is used on the x-axis of the output plots, i.e., each feature defined in the 'objectives' of the 'experiment' is plotted against the reference feature defined in this value
 ### Benchmark
 The benchmarking defines the downstream task which is used for evaluating the goodness of the synthesized event log datasets with the metrics of real-world datasets. The command to execute a benchmarking is shown in the following script:

-```
-
 ```

 In the `benchmark.json`, we have the following key-value pairs:
@@ -164,8 +360,9 @@

 The evaluation plotting step is used just for visualization. Some examples of how the plotter can be used are shown in the following script:

-```
-
 ```

 Generally, in the `evaluation_plotter.json`, we have the following key-value pairs:
@@ -183,8 +380,9 @@ We present two settings for generating intentional event logs, using [real targe

 ### Generating data with real targets
 To execute the experiments with real targets, we employ the [experiment_real_targets.json](config_files/experiment_real_targets.json). The script's pipeline will output the [generated event logs (GenBaselineED)](data/event_logs/GenBaselineED), which optimize their feature values towards [real-world event data features](data/BaselineED_feat.csv), alongside their respective measured [feature values](data/GenBaselineED_feat.csv) and [benchmark metrics values](data/GenBaselineED_bench.csv).

-```
-
 ```

 ### Generating data with grid targets
@@ -195,10 +393,15 @@ python execute_grid_experiments.py config_files/grid_2obj

 ```
 We employ the [experiment_grid_2obj_configfiles_fabric.ipynb](notebooks/experiment_grid_2obj_configfiles_fabric.ipynb) to create all necessary [configuration](config_files/grid_2obj) and [objective](data/grid_2obj) files for this experiment.
 For more details about these config_files, please refer to [Feature Extraction](#feature-extraction), [Generation](#generation), and [Benchmark](#benchmark).
-To create configuration files for grid objectives interactively, you can use

 ### Visualizations
-Visualizations correspond to the [GEDI paper](https://mcml.ai/publications/gedi.pdf).
 To run the visualizations, we employ [jupyter notebooks](https://jupyter.org/install) and [add the installed environment to the jupyter notebook](https://medium.com/@nrk25693/how-to-add-your-conda-environment-to-your-jupyter-notebook-in-just-4-steps-abeab8b8d084). We then start all visualizations by running e.g.: `jupyter notebook`. In the following, we describe the `.ipynb`-files in the folder `\notebooks` to reproduce the figures from our paper.

 #### [Fig. 4 and fig. 5 Representativeness](notebooks/gedi_figs4and5_representativeness.ipynb)
@@ -218,23 +421,14 @@ Likewise to the evaluation on the statistical tests in notebook `gedi_figs7and8_

 The `GEDI` framework is taken directly from the original paper by [Maldonado](mailto:[email protected]), Frey, Tavares, Rehwald and Seidl and is *to appear at BPM'24*.

 ```bibtex
-@
-author=
-and Rosemann, Michael",
-title="GEDI: Generating Event Data with Intentional Features for Benchmarking Process Mining",
-booktitle="Business Process Management",
-year="2024",
-publisher="Springer Nature Switzerland",
-address="Cham",
-pages="221--237",
-abstract="Process mining solutions include enhancing performance, conserving resources, and alleviating bottlenecks in organizational contexts. However, as in other data mining fields, success hinges on data quality and availability. Existing analyses for process mining solutions lack diverse and ample data for rigorous testing, hindering insights' generalization. To address this, we propose Generating Event Data with Intentional features, a framework producing event data sets satisfying specific meta-features. Considering the meta-feature space that defines feasible event logs, we observe that existing real-world datasets describe only local areas within the overall space. Hence, our framework aims at providing the capability to generate an event data benchmark, which covers unexplored regions. Therefore, our approach leverages a discretization of the meta-feature space to steer generated data towards regions, where a combination of meta-features is not met yet by existing benchmark datasets. Providing a comprehensive data pool enriches process mining analyses, enables methods to capture a wider range of real-world scenarios, and improves evaluation quality. Moreover, it empowers analysts to uncover correlations between meta-features and evaluation metrics, enhancing explainability and solution effectiveness. Experiments demonstrate GEDI's ability to produce a benchmark of intentional event data sets and robust analyses for process mining tasks.",
-isbn="978-3-031-70396-6"
 }
 ```
 **i**nteractive **G**enerating **E**vent **D**ata with **I**ntentional Features for Benchmarking Process Mining<br />
 This repository contains the codebase for the interactive web application tool (iGEDI) as well as for the [GEDI paper](https://mcml.ai/publications/gedi.pdf) accepted at the BPM'24 conference.

 ## Table of Contents

 - [Interactive Web Application (iGEDI)](#interactive-web-application)
+- [Requirements](#requirements)
 - [Installation](#installation)
 - [General Usage](#general-usage)
 - [Experiments](#experiments)
 - [Citation](#citation)

 Our [interactive web application](https://huggingface.co/spaces/andreamalhera/gedi) (iGEDI) guides you through the specification process and runs GEDI for you. You can directly download the resulting generated logs or the configuration file to run GEDI locally.
 

+## Requirements
 - [Miniconda](https://docs.conda.io/en/latest/miniconda.html)
 - Graphviz on your OS e.g.
 For MacOS:

 ```console
 conda install pyrfr swig
 ```
+## Installation
+- `conda env create -f .conda.yml`

+### Startup
 ```console
+conda activate gedi
+python main.py -a config_files/test/experiment_test.json
 ```
 The last step should take only a few minutes to run.

 - [Evaluation Plotter](https://github.com/lmu-dbs/gedi/blob/16-documentation-update-readme/README.md#evaluation-plotting)

 To run different steps of the GEDI pipeline, please adapt the `.json` accordingly.
+```console
+conda activate gedi
+python main.py -a config_files/pipeline_steps/<pipeline-step>.json
 ```
 For reference of possible keys and values for each step, please see `config_files/test/experiment_test.json`.
 To run the whole pipeline please create a new `.json` file, specifying all steps you want to run and specify desired keys and values for each step.

 ### Feature Extraction
 ---
 To extract the features on the event-log level and use them for hyperparameter optimization, we employ the following script:
+```console
+conda activate gedi
+python main.py -a config_files/pipeline_steps/feature_extraction.json
 ```
 The JSON file consists of the following key-value pairs:

 The command to execute the generation step is given by an exemplary generation.json file:

+```console
+conda activate gedi
+python main.py -a config_files/pipeline_steps/generation.json
 ```

 In the `generation.json`, we have the following key-value pairs:

 - plot_reference_feature: defines the feature which is used on the x-axis of the output plots, i.e., each feature defined in the 'objectives' of the 'experiment' is plotted against the reference feature defined in this value

+In case of manually defining the targets for the features in config space, the following table shows the range of the features in the real-world event log data (BPIC's) for reference:
| Feature | Range [ min, max ] |
|---|---|
| n_traces | [ 226.0, 251734.0 ] |
| n_unique_traces | [ 6.0, 28457.0 ] |
| ratio_variants_per_number_of_traces | [ 0.0, 1.0 ] |
| trace_len_min | [ 1.0, 24.0 ] |
| trace_len_max | [ 1.0, 2973.0 ] |
| trace_len_mean | [ 1.0, 131.49 ] |
| trace_len_median | [ 1.0, 55.0 ] |
| trace_len_mode | [ 1.0, 61.0 ] |
| trace_len_std | [ 0.0, 202.53 ] |
| trace_len_variance | [ 0.0, 41017.89 ] |
| trace_len_q1 | [ 1.0, 44.0 ] |
| trace_len_q3 | [ 1.0, 169.0 ] |
| trace_len_iqr | [ 0.0, 161.0 ] |
| trace_len_geometric_mean | [ 1.0, 53.78 ] |
| trace_len_geometric_std | [ 1.0, 5.65 ] |
| trace_len_harmonic_mean | [ 1.0, 51.65 ] |
| trace_len_skewness | [ -0.58, 111.97 ] |
| trace_len_kurtosis | [ -0.97, 14006.75 ] |
| trace_len_coefficient_variation | [ 0.0, 4.74 ] |
| trace_len_entropy | [ 5.33, 12.04 ] |
| trace_len_hist1 | [ 0.0, 1.99 ] |
| trace_len_hist2 | [ 0.0, 0.42 ] |
| trace_len_hist3 | [ 0.0, 0.4 ] |
| trace_len_hist4 | [ 0.0, 0.19 ] |
| trace_len_hist5 | [ 0.0, 0.14 ] |
| trace_len_hist6 | [ 0.0, 10.0 ] |
| trace_len_hist7 | [ 0.0, 0.02 ] |
| trace_len_hist8 | [ 0.0, 0.04 ] |
| trace_len_hist9 | [ 0.0, 0.0 ] |
| trace_len_hist10 | [ 0.0, 2.7 ] |
| trace_len_skewness_hist | [ -0.58, 111.97 ] |
| trace_len_kurtosis_hist | [ -0.97, 14006.75 ] |
| ratio_most_common_variant | [ 0.0, 0.79 ] |
| ratio_top_1_variants | [ 0.0, 0.87 ] |
| ratio_top_5_variants | [ 0.0, 0.98 ] |
| ratio_top_10_variants | [ 0.0, 0.99 ] |
| ratio_top_20_variants | [ 0.2, 1.0 ] |
| ratio_top_50_variants | [ 0.5, 1.0 ] |
| ratio_top_75_variants | [ 0.75, 1.0 ] |
| mean_variant_occurrence | [ 1.0, 24500.67 ] |
| std_variant_occurrence | [ 0.04, 42344.04 ] |
| skewness_variant_occurrence | [ 1.54, 64.77 ] |
| kurtosis_variant_occurrence | [ 0.66, 5083.46 ] |
| n_unique_activities | [ 1.0, 1152.0 ] |
| activities_min | [ 1.0, 66058.0 ] |
| activities_max | [ 34.0, 466141.0 ] |
| activities_mean | [ 4.13, 66058.0 ] |
| activities_median | [ 2.0, 66058.0 ] |
| activities_std | [ 0.0, 120522.25 ] |
| activities_variance | [ 0.0, 14525612122.34 ] |
| activities_q1 | [ 1.0, 66058.0 ] |
| activities_q3 | [ 4.0, 79860.0 ] |
| activities_iqr | [ 0.0, 77290.0 ] |
| activities_skewness | [ -0.06, 15.21 ] |
| activities_kurtosis | [ -1.5, 315.84 ] |
| n_unique_start_activities | [ 1.0, 809.0 ] |
| start_activities_min | [ 1.0, 150370.0 ] |
| start_activities_max | [ 27.0, 199867.0 ] |
| start_activities_mean | [ 3.7, 150370.0 ] |
| start_activities_median | [ 1.0, 150370.0 ] |
| start_activities_std | [ 0.0, 65387.49 ] |
| start_activities_variance | [ 0.0, 4275524278.19 ] |
| start_activities_q1 | [ 1.0, 150370.0 ] |
| start_activities_q3 | [ 4.0, 150370.0 ] |
| start_activities_iqr | [ 0.0, 23387.25 ] |
| start_activities_skewness | [ 0.0, 9.3 ] |
| start_activities_kurtosis | [ -2.0, 101.82 ] |
| n_unique_end_activities | [ 1.0, 757.0 ] |
| end_activities_min | [ 1.0, 16653.0 ] |
| end_activities_max | [ 28.0, 181328.0 ] |
| end_activities_mean | [ 3.53, 24500.67 ] |
| end_activities_median | [ 1.0, 16653.0 ] |
| end_activities_std | [ 0.0, 42344.04 ] |
| end_activities_variance | [ 0.0, 1793017566.89 ] |
| end_activities_q1 | [ 1.0, 16653.0 ] |
| end_activities_q3 | [ 3.0, 39876.0 ] |
| end_activities_iqr | [ 0.0, 39766.0 ] |
| end_activities_skewness | [ -0.7, 13.82 ] |
| end_activities_kurtosis | [ -2.0, 255.39 ] |
| eventropy_trace | [ 0.0, 13.36 ] |
| eventropy_prefix | [ 0.0, 16.77 ] |
| eventropy_global_block | [ 0.0, 24.71 ] |
| eventropy_lempel_ziv | [ 0.0, 685.0 ] |
| eventropy_k_block_diff_1 | [ -328.0, 962.0 ] |
| eventropy_k_block_diff_3 | [ 0.0, 871.0 ] |
| eventropy_k_block_diff_5 | [ 0.0, 881.0 ] |
| eventropy_k_block_ratio_1 | [ 0.0, 935.0 ] |
| eventropy_k_block_ratio_3 | [ 0.0, 7.11 ] |
| eventropy_k_block_ratio_5 | [ 0.0, 7.11 ] |
| eventropy_knn_3 | [ 0.0, 8.93 ] |
| eventropy_knn_5 | [ 0.0, 648.0 ] |
| eventropy_knn_7 | [ 0.0, 618.0 ] |
| epa_variant_entropy | [ 0.0, 11563842.15 ] |
| epa_normalized_variant_entropy | [ 0.0, 0.9 ] |
| epa_sequence_entropy | [ 0.0, 21146257.12 ] |
| epa_normalized_sequence_entropy | [ 0.0, 0.76 ] |
| epa_sequence_entropy_linear_forgetting | [ 0.0, 14140225.9 ] |
| epa_normalized_sequence_entropy_linear_forgetting | [ 0.0, 0.42 ] |
| epa_sequence_entropy_exponential_forgetting | [ 0.0, 15576076.83 ] |
| epa_normalized_sequence_entropy_exponential_forgetting | [ 0.0, 0.51 ] |
 ### Benchmark
 The benchmarking defines the downstream task which is used for evaluating the goodness of the synthesized event log datasets with the metrics of real-world datasets. The command to execute a benchmarking is shown in the following script:

+```console
+conda activate gedi
+python main.py -a config_files/pipeline_steps/benchmark.json
 ```

 In the `benchmark.json`, we have the following key-value pairs:

 The evaluation plotting step is used just for visualization. Some examples of how the plotter can be used are shown in the following script:

+```console
+conda activate gedi
+python main.py -a config_files/pipeline_steps/evaluation_plotter.json
 ```

 Generally, in the `evaluation_plotter.json`, we have the following key-value pairs:

 ### Generating data with real targets
 To execute the experiments with real targets, we employ the [experiment_real_targets.json](config_files/experiment_real_targets.json). The script's pipeline will output the [generated event logs (GenBaselineED)](data/event_logs/GenBaselineED), which optimize their feature values towards [real-world event data features](data/BaselineED_feat.csv), alongside their respective measured [feature values](data/GenBaselineED_feat.csv) and [benchmark metrics values](data/GenBaselineED_bench.csv).

+```console
+conda activate gedi
+python main.py -a config_files/experiment_real_targets.json
 ```

 ### Generating data with grid targets

 ```
 We employ the [experiment_grid_2obj_configfiles_fabric.ipynb](notebooks/experiment_grid_2obj_configfiles_fabric.ipynb) to create all necessary [configuration](config_files/grid_2obj) and [objective](data/grid_2obj) files for this experiment.
 For more details about these config_files, please refer to [Feature Extraction](#feature-extraction), [Generation](#generation), and [Benchmark](#benchmark).
+To create configuration files for grid objectives interactively, you can start the following dashboard:
+```
+streamlit run utils/config_fabric.py # To tunnel to local machine add: --server.port 8501 --server.headless true

+# In local machine (only in case you are tunneling):
+ssh -N -f -L 9000:localhost:8501 <user@remote_machine.com>
+open "http://localhost:9000/"
+```
 ### Visualizations
 To run the visualizations, we employ [jupyter notebooks](https://jupyter.org/install) and [add the installed environment to the jupyter notebook](https://medium.com/@nrk25693/how-to-add-your-conda-environment-to-your-jupyter-notebook-in-just-4-steps-abeab8b8d084). We then start all visualizations by running e.g.: `jupyter notebook`. In the following, we describe the `.ipynb`-files in the folder `\notebooks` to reproduce the figures from our paper.

 #### [Fig. 4 and fig. 5 Representativeness](notebooks/gedi_figs4and5_representativeness.ipynb)

 The `GEDI` framework is taken directly from the original paper by [Maldonado](mailto:[email protected]), Frey, Tavares, Rehwald and Seidl and is *to appear at BPM'24*.

 ```bibtex
+@article{maldonado2024gedi,
+  author = {Maldonado, Andrea and Frey, {Christian M. M.} and Tavares, {Gabriel M.} and Rehwald, Nikolina and Seidl, Thomas},
+  title = {{GEDI:} Generating Event Data with Intentional Features for Benchmarking Process Mining},
+  journal = {To be published in BPM 2024. Krakow, Poland, Sep 01-06},
+  volume = {},
+  year = {2024},
+  url = {https://mcml.ai/publications/gedi.pdf},
+  doi = {},
+  eprinttype = {website},
 }
 ```
config_files/test/test_abbrv_generation.json
ADDED
@@ -0,0 +1,16 @@
+[{"pipeline_step": "event_logs_generation",
+  "output_path": "output/test",
+  "generator_params": {"experiment":
+      {"input_path": "data/test/igedi_table_1.csv",
+       "objectives": ["rmcv", "ense"]},
+    "config_space": {"mode": [5, 20], "sequence": [0.01, 1],
+      "choice": [0.01, 1], "parallel": [0.01, 1], "loop": [0.01, 1],
+      "silent": [0.01, 1], "lt_dependency": [0.01, 1],
+      "num_traces": [10, 10001], "duplicate": [0],
+      "or": [0]}, "n_trials": 2}},
+ {"pipeline_step": "feature_extraction",
+  "input_path": "output/test/igedi_table_1/2_ense_rmcv",
+  "feature_params": {"feature_set": ["simple_stats", "trace_length", "trace_variant",
+      "activities", "start_activities", "end_activities", "eventropies", "epa_based"]},
+  "output_path": "output/plots", "real_eventlog_path": "data/test/2_bpic_features.csv",
+  "plot_type": "boxplot"}]
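A minimal sketch of how this new test configuration could be inspected before handing it to the pipeline (assumes the repository root as working directory; the expected output is inferred from the JSON above):

```python
import json

# Load the added test config and list the pipeline steps it defines.
with open("config_files/test/test_abbrv_generation.json") as f:
    steps = json.load(f)

print([step["pipeline_step"] for step in steps])
# ['event_logs_generation', 'feature_extraction']
```

Per the updated README, such a config is passed to the pipeline as `python main.py -a config_files/test/test_abbrv_generation.json`.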
data/test/igedi_table_1.csv
ADDED
@@ -0,0 +1,4 @@
+log,rmcv,ense
+BPIC15f4,0.003,0.604
+RTFMP,0.376,0.112
+HD,0.517,0.254
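As a small sketch (assuming pandas is installed), this table of target feature values, referenced as the experiment's `input_path` in the test config above, can be loaded directly:

```python
import pandas as pd

# Read the target objectives per log; column names use the abbreviated
# feature names ("rmcv", "ense") introduced by utils/column_mappings.py.
targets = pd.read_csv("data/test/igedi_table_1.csv")
print(targets)
#         log   rmcv   ense
# 0  BPIC15f4  0.003  0.604
# 1     RTFMP  0.376  0.112
# 2        HD  0.517  0.254
```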
data/validation/2_ense_rmcv_feat.csv
ADDED
@@ -0,0 +1,4 @@
log,n_traces,n_unique_traces,trace_len_coefficient_variation,trace_len_entropy,trace_len_geometric_mean,trace_len_geometric_std,trace_len_harmonic_mean,trace_len_hist1,trace_len_hist10,trace_len_hist2,trace_len_hist3,trace_len_hist4,trace_len_hist5,trace_len_hist6,trace_len_hist7,trace_len_hist8,trace_len_hist9,trace_len_iqr,trace_len_kurtosis,trace_len_kurtosis_hist,trace_len_max,trace_len_mean,trace_len_median,trace_len_min,trace_len_mode,trace_len_q1,trace_len_q3,trace_len_skewness,trace_len_skewness_hist,trace_len_std,trace_len_variance,kurtosis_variant_occurrence,mean_variant_occurrence,ratio_most_common_variant,ratio_top_10_variants,ratio_top_1_variants,ratio_top_20_variants,ratio_top_50_variants,ratio_top_5_variants,ratio_top_75_variants,skewness_variant_occurrence,std_variant_occurrence,activities_iqr,activities_kurtosis,activities_max,activities_mean,activities_median,activities_min,activities_q1,activities_q3,activities_skewness,activities_std,activities_variance,n_unique_activities,n_unique_start_activities,start_activities_iqr,start_activities_kurtosis,start_activities_max,start_activities_mean,start_activities_median,start_activities_min,start_activities_q1,start_activities_q3,start_activities_skewness,start_activities_std,start_activities_variance,end_activities_iqr,end_activities_kurtosis,end_activities_max,end_activities_mean,end_activities_median,end_activities_min,end_activities_q1,end_activities_q3,end_activities_skewness,end_activities_std,end_activities_variance,n_unique_end_activities,eventropy_global_block,eventropy_global_block_flattened,eventropy_k_block_diff_1,eventropy_k_block_diff_3,eventropy_k_block_diff_5,eventropy_k_block_ratio_1,eventropy_k_block_ratio_3,eventropy_k_block_ratio_5,eventropy_knn_3,eventropy_knn_5,eventropy_knn_7,eventropy_lempel_ziv,eventropy_lempel_ziv_flattened,eventropy_prefix,eventropy_prefix_flattened,eventropy_trace,epa_variant_entropy,epa_normalized_variant_entropy,epa_sequence_entropy,epa_normalized_sequence_entropy,epa_sequence_entropy_linear_forgetting,epa_normalized_sequence_entropy_linear_forgetting,epa_sequence_entropy_exponential_forgetting,epa_normalized_sequence_entropy_exponential_forgetting,ratio_variants_per_number_of_traces
genELBPIC15f4_0604_0003,8616,4031,1.0086445672512825,8.700230419287818,8.516920996327995,2.1832133718212567,6.58111248846037,0.05713165933282198,1.682074468800883e-05,0.009932649738269211,0.0033136867035377378,0.0012279143622246447,0.0005214430853282738,0.00017661781922409254,0.0001093348404720574,3.364148937601766e-05,0.0,9.0,11.77613857723645,4.64306597180025,141,11.964136490250697,7.0,3,3,5.0,14.0,2.836323931248485,2.5294876299887217,12.067561272744191,145.6260350714354,1651.5545366193303,2.137434879682461,0.09099350046425256,0.5789229340761374,0.40401578458681525,0.6256963788300836,0.766016713091922,0.5258820798514392,0.883008356545961,36.276105773051086,15.574023282690577,2184.5,1.9085746306932307,34121,12885.375,8627.0,8584,8616.0,10800.5,1.8663249384138656,8507.416043333898,72376127.734375,8,2,2111.0,-2.0,6419,4308.0,4308.0,2197,3252.5,5363.5,0.0,2111.0,4456321.0,768.0,0.0021026107788850723,4895,1723.2,832.0,495,813.0,1581.0,1.331337855426617,1625.5283940922102,2642342.56,5,15.897,16.276,2.756,1.525,1.375,2.756,2.016,1.775,6.564,6.07,5.761,1.405,1.786,12.139,13.493,9.703,365917.06171394786,0.7166786736830569,651595.1462643282,0.5475971681938718,62016.045914910814,0.05211796208164211,266396.7627350506,0.22387845232743814,0.46785051067780875
genELHD_0254_0517,6822,565,1.1300022933733087,8.390788875278787,1.9006921917027269,2.263915758458681,1.4763543408149593,0.28822871537617945,0.00010858116985352402,0.04077222927999826,0.02383356678284851,0.006080545511797346,0.005591930247456488,0.002823110416191621,0.0017915893025831464,0.0006514870191211442,0.0004886152643408582,2.0,9.718268017319556,4.770965470001153,28,2.8346525945470535,1.0,1,1,1.0,3.0,2.765986310146101,2.5637920433464965,3.2031639327547703,10.260259180101007,226.4931382842208,12.07433628318584,0.24860744649662855,0.9079448841981823,0.6807387862796834,0.9321313397830548,0.9585165640574611,0.8717384931105248,0.9791849897390794,14.639488482439702,105.6342402074512,1283.0,8.118508585327676,6848,1137.5294117647059,472.0,208,413.0,1696.0,2.9234849385484285,1541.823981624173,2377221.1903114184,17,10,294.25,2.299363631971671,3383,682.2,217.0,101,121.75,416.0,1.9301655015244086,1008.2924972447232,1016653.7600000001,334.5,2.8813625853874614,3383,620.1818181818181,157.0,79,104.5,439.0,2.0614116860983223,981.5564465945092,963453.0578512397,11,9.069,10.932,3.265,0.908,0.67,3.265,1.808,1.456,4.81,4.359,4.05,0.696,2.01,6.995,10.12,4.469,16958.33766640406,0.7450438396474315,70379.87102533762,0.36874603139171797,9719.481922433943,0.050923940806750986,30545.050254490514,0.16003675334882345,0.08282028730577544
genELRTFMP_0112_0376,6822,565,1.1300022933733087,8.390788875278787,1.9006921917027269,2.263915758458681,1.4763543408149593,0.28822871537617945,0.00010858116985352402,0.04077222927999826,0.02383356678284851,0.006080545511797346,0.005591930247456488,0.002823110416191621,0.0017915893025831464,0.0006514870191211442,0.0004886152643408582,2.0,9.718268017319556,4.770965470001153,28,2.8346525945470535,1.0,1,1,1.0,3.0,2.765986310146101,2.5637920433464965,3.2031639327547703,10.260259180101007,226.4931382842208,12.07433628318584,0.24860744649662855,0.9079448841981823,0.6807387862796834,0.9321313397830548,0.9585165640574611,0.8717384931105248,0.9791849897390794,14.639488482439702,105.6342402074512,1283.0,8.118508585327676,6848,1137.5294117647059,472.0,208,413.0,1696.0,2.9234849385484285,1541.823981624173,2377221.1903114184,17,10,294.25,2.299363631971671,3383,682.2,217.0,101,121.75,416.0,1.9301655015244086,1008.2924972447232,1016653.7600000001,334.5,2.8813625853874614,3383,620.1818181818181,157.0,79,104.5,439.0,2.0614116860983223,981.5564465945092,963453.0578512397,11,9.069,10.932,3.265,0.908,0.67,3.265,1.808,1.456,4.81,4.359,4.05,0.696,2.01,6.995,10.12,4.469,16958.33766640406,0.7450438396474315,70379.87102533762,0.36874603139171797,9719.481922433943,0.050923940806750986,30545.050254490514,0.16003675334882345,0.08282028730577544
gedi/__init__.py
CHANGED
@@ -1,3 +1,7 @@
-from .

-__all__=['

+from .generator import GenerateEventLogs
+from .features import EventLogFeatures
+from .augmentation import InstanceAugmentator
+from .benchmark import BenchmarkTest
+from .plotter import BenchmarkPlotter, FeaturesPlotter, AugmentationPlotter, GenerationPlotter

+__all__=[ 'GenerateEventLogs', 'EventLogFeatures', 'FeatureAnalyser', 'InstanceAugmentator', 'BenchmarkTest', 'BenchmarkPlotter', 'FeaturesPlotter', 'AugmentationPlotter', 'GenerationPlotter']
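With the rewritten `gedi/__init__.py`, the pipeline classes are re-exported at package level; a minimal import sketch, assuming the `gedi` package from this release is installed:

```python
# The names below come straight from __all__ in gedi/__init__.py;
# downstream code no longer needs to know the internal module layout.
from gedi import GenerateEventLogs, EventLogFeatures, BenchmarkTest

print(GenerateEventLogs.__module__)  # gedi.generator
print(EventLogFeatures.__module__)   # gedi.features
print(BenchmarkTest.__module__)      # gedi.benchmark
```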
gedi/features.py
CHANGED
@@ -10,7 +10,7 @@ from pathlib import Path
 from utils.param_keys import INPUT_PATH
 from utils.param_keys.features import FEATURE_PARAMS, FEATURE_SET
 from gedi.utils.io_helpers import dump_features_json
-
 def get_sortby_parameter(elem):
     number = int(elem.rsplit(".")[0].rsplit("_", 1)[1])
     return number
@@ -63,6 +63,8 @@ class EventLogFeatures(EventLogFile):

         if str(self.filename).endswith('csv'): # Returns dataframe from loaded metafeatures file
             self.feat = pd.read_csv(self.filepath)
             print(f"SUCCESS: EventLogFeatures loaded features from {self.filepath}")
         elif isinstance(self.filename, list): # Computes metafeatures for list of .xes files
             combined_features=pd.DataFrame()

 from utils.param_keys import INPUT_PATH
 from utils.param_keys.features import FEATURE_PARAMS, FEATURE_SET
 from gedi.utils.io_helpers import dump_features_json
+from utils.column_mappings import column_mappings
 def get_sortby_parameter(elem):
     number = int(elem.rsplit(".")[0].rsplit("_", 1)[1])
     return number

         if str(self.filename).endswith('csv'): # Returns dataframe from loaded metafeatures file
             self.feat = pd.read_csv(self.filepath)
+            columns_to_rename = {col: column_mappings()[col] for col in self.feat.columns if col in column_mappings()}
+            self.feat.rename(columns=columns_to_rename, inplace=True)
             print(f"SUCCESS: EventLogFeatures loaded features from {self.filepath}")
         elif isinstance(self.filename, list): # Computes metafeatures for list of .xes files
             combined_features=pd.DataFrame()
gedi/generator.py
CHANGED
@@ -21,6 +21,7 @@ from utils.param_keys import OUTPUT_PATH, INPUT_PATH
 from utils.param_keys.generator import GENERATOR_PARAMS, EXPERIMENT, CONFIG_SPACE, N_TRIALS
 from gedi.utils.io_helpers import get_output_key_value_location, dump_features_json, compute_similarity
 from gedi.utils.io_helpers import read_csvs
 import xml.etree.ElementTree as ET
 import re
 from xml.dom import minidom
@@ -153,6 +154,8 @@ class GenerateEventLogs():
         experiment = self.params.get(EXPERIMENT)
         if experiment is not None:
             tasks, output_path = get_tasks(experiment, self.output_path)
             self.output_path = output_path

             if 'ratio_variants_per_number_of_traces' in tasks.columns: #HOTFIX

 from utils.param_keys.generator import GENERATOR_PARAMS, EXPERIMENT, CONFIG_SPACE, N_TRIALS
 from gedi.utils.io_helpers import get_output_key_value_location, dump_features_json, compute_similarity
 from gedi.utils.io_helpers import read_csvs
+from utils.column_mappings import column_mappings
 import xml.etree.ElementTree as ET
 import re
 from xml.dom import minidom

         experiment = self.params.get(EXPERIMENT)
         if experiment is not None:
             tasks, output_path = get_tasks(experiment, self.output_path)
+            columns_to_rename = {col: column_mappings()[col] for col in tasks.columns if col in column_mappings()}
+            tasks = tasks.rename(columns=columns_to_rename)
             self.output_path = output_path

             if 'ratio_variants_per_number_of_traces' in tasks.columns: #HOTFIX
gedi/run.py
DELETED
@@ -1,53 +0,0 @@
-import config
-import pandas as pd
-from datetime import datetime as dt
-from gedi.generator import GenerateEventLogs
-from gedi.features import EventLogFeatures
-from gedi.augmentation import InstanceAugmentator
-from gedi.benchmark import BenchmarkTest
-from gedi.plotter import BenchmarkPlotter, FeaturesPlotter, AugmentationPlotter, GenerationPlotter
-from utils.default_argparse import ArgParser
-from utils.param_keys import *
-
-def run(kwargs:dict, model_params_list: list, filename_list:list):
-    """
-    This function chooses the running option for the program.
-    @param kwargs: dict
-        contains the running parameters and the event-log file information
-    @param model_params_list: list
-        contains a list of model parameters, which are used to analyse this different models.
-    @param filename_list: list
-        contains the list of the filenames to load multiple event-logs
-    @return:
-    """
-    params = kwargs[PARAMS]
-    ft = EventLogFeatures(None)
-    augmented_ft = InstanceAugmentator()
-    gen = pd.DataFrame(columns=['log'])
-
-    for model_params in model_params_list:
-        if model_params.get(PIPELINE_STEP) == 'instance_augmentation':
-            augmented_ft = InstanceAugmentator(aug_params=model_params, samples=ft.feat)
-            AugmentationPlotter(augmented_ft, model_params)
-        elif model_params.get(PIPELINE_STEP) == 'event_logs_generation':
-            gen = pd.DataFrame(GenerateEventLogs(model_params).log_config)
-            #gen = pd.read_csv("output/features/generated/grid_2objectives_enseef_enve/2_enseef_enve_feat.csv")
-            #GenerationPlotter(gen, model_params, output_path="output/plots")
-        elif model_params.get(PIPELINE_STEP) == 'benchmark_test':
-            benchmark = BenchmarkTest(model_params, event_logs=gen['log'])
-            # BenchmarkPlotter(benchmark.features, output_path="output/plots")
-        elif model_params.get(PIPELINE_STEP) == 'feature_extraction':
-            ft = EventLogFeatures(**kwargs, logs=gen['log'], ft_params=model_params)
-            FeaturesPlotter(ft.feat, model_params)
-        elif model_params.get(PIPELINE_STEP) == "evaluation_plotter":
-            GenerationPlotter(gen, model_params, output_path=model_params['output_path'], input_path=model_params['input_path'])
-
-def gedi(config_path):
-    """
-    This function runs the GEDI pipeline.
-    @param config_path: str
-        contains the path to the config file
-    @return:
-    """
-    model_params_list = config.get_model_params_list(config_path)
-    run({'params':""}, model_params_list, [])
execute_grid_experiments.py → gedi/utils/execute_grid_experiments.py
RENAMED
@@ -3,7 +3,7 @@ import os
 import sys

 from datetime import datetime as dt
-from
 from tqdm import tqdm

 #TODO: Pass i properly

 import sys

 from datetime import datetime as dt
+from io_helpers import sort_files
 from tqdm import tqdm

 #TODO: Pass i properly
main.py
CHANGED
@@ -1,12 +1,54 @@
 import config
 from datetime import datetime as dt
-from gedi.
 from utils.default_argparse import ArgParser
 from utils.param_keys import *

 if __name__=='__main__':
     start_gedi = dt.now()
     print(f'INFO: GEDI starting {start_gedi}')
     args = ArgParser().parse('GEDI main')
-
     print(f'SUCCESS: GEDI took {dt.now()-start_gedi} sec.')

 import config
+import pandas as pd
 from datetime import datetime as dt
+from gedi.generator import GenerateEventLogs
+from gedi.features import EventLogFeatures
+from gedi.augmentation import InstanceAugmentator
+from gedi.benchmark import BenchmarkTest
+from gedi.plotter import BenchmarkPlotter, FeaturesPlotter, AugmentationPlotter, GenerationPlotter
 from utils.default_argparse import ArgParser
 from utils.param_keys import *

+def run(kwargs:dict, model_params_list: list, filename_list:list):
+    """
+    This function chooses the running option for the program.
+    @param kwargs: dict
+        contains the running parameters and the event-log file information
+    @param model_params_list: list
+        contains a list of model parameters, which are used to analyse this different models.
+    @param filename_list: list
+        contains the list of the filenames to load multiple event-logs
+    @return:
+    """
+    params = kwargs[PARAMS]
+    ft = EventLogFeatures(None)
+    augmented_ft = InstanceAugmentator()
+    gen = pd.DataFrame(columns=['log'])
+
+    for model_params in model_params_list:
+        if model_params.get(PIPELINE_STEP) == 'instance_augmentation':
+            augmented_ft = InstanceAugmentator(aug_params=model_params, samples=ft.feat)
+            AugmentationPlotter(augmented_ft, model_params)
+        elif model_params.get(PIPELINE_STEP) == 'event_logs_generation':
+            gen = pd.DataFrame(GenerateEventLogs(model_params).log_config)
+            #gen = pd.read_csv("output/features/generated/grid_2objectives_enseef_enve/2_enseef_enve_feat.csv")
+            #GenerationPlotter(gen, model_params, output_path="output/plots")
+        elif model_params.get(PIPELINE_STEP) == 'benchmark_test':
+            benchmark = BenchmarkTest(model_params, event_logs=gen['log'])
+            # BenchmarkPlotter(benchmark.features, output_path="output/plots")
+        elif model_params.get(PIPELINE_STEP) == 'feature_extraction':
+            ft = EventLogFeatures(**kwargs, logs=gen['log'], ft_params=model_params)
+            FeaturesPlotter(ft.feat, model_params)
+        elif model_params.get(PIPELINE_STEP) == "evaluation_plotter":
+            GenerationPlotter(gen, model_params, output_path=model_params['output_path'], input_path=model_params['input_path'])
+
+
 if __name__=='__main__':
     start_gedi = dt.now()
     print(f'INFO: GEDI starting {start_gedi}')
+
     args = ArgParser().parse('GEDI main')
+    model_params_list = config.get_model_params_list(args.alg_params_json)
+    run({'params':""}, model_params_list, [])
+
     print(f'SUCCESS: GEDI took {dt.now()-start_gedi} sec.')
setup.py
CHANGED
@@ -4,7 +4,7 @@ import os
 with open("README.md", "r") as fh:
     long_description = fh.read()

-version_string = os.environ.get("VERSION_PLACEHOLDER", "0.0
 print(version_string)
 version = version_string

@@ -25,59 +25,14 @@ setup(
         'Levenshtein==0.23.0',
         'matplotlib==3.8.4',
         'numpy==1.26.4',
         'pm4py==2.7.2',
         'scikit-learn==1.2.2',
-        'scipy==1.
         'seaborn==0.13.2',
         'smac==2.0.2',
         'tqdm==4.65.0',
-        'streamlit-toggle-switch>=1.0.2'
-        'click==8.1.7',
-        'cloudpickle==3.0.0',
-        'configspace==0.7.1',
-        'cvxopt==1.3.2',
-        'dask==2024.2.1',
-        'dask-jobqueue==0.8.5',
-        'deprecation==2.1.0',
-        'distributed==2024.2.1',
-        'emcee==3.1.4',
-        'feeed == 1.2.0',
-        'fsspec==2024.2.0',
-        'imbalanced-learn==0.12.0',
-        'imblearn==0.0',
-        'importlib-metadata==7.0.1',
-        'intervaltree==3.1.0',
-        'jinja2==3.1.3',
-        'levenshtein==0.23.0',
-        'locket==1.0.0',
-        'lxml==5.1.0',
-        'markupsafe==2.1.5',
-        'more-itertools==10.2.0',
-        'msgpack==1.0.8',
-        'networkx==3.2.1',
-        'numpy==1.26.4',
-        'pandas>=2.0.0',
-        'partd==1.4.1',
-        'pm4py==2.7.2',
-        'psutil==5.9.8',
-        'pydotplus==2.0.2',
-        'pynisher==1.0.10',
-        'pyrfr==0.9.0',
-        'pyyaml==6.0.1',
-        'rapidfuzz==3.6.1',
-        'regex==2023.12.25',
-        'scikit-learn==1.2.2',
-        'scipy==1.10.1',
-        'seaborn==0.13.2',
-        'smac==2.0.2',
-        'sortedcontainers==2.4.0',
-        'stringdist==1.0.9',
-        'tblib==3.0.0',
-        'toolz==0.12.1',
-        'tqdm==4.65.0',
-        'typing-extensions==4.10.0',
-        'urllib3==2.2.1',
-        'zict==3.0.0'
     ],
     packages = ['gedi'],
     classifiers=[
@@ -87,4 +42,4 @@ setup(
         'License :: OSI Approved :: MIT License', # Again, pick a license
         'Programming Language :: Python :: 3.9',
     ],
-)

 with open("README.md", "r") as fh:
     long_description = fh.read()

+version_string = os.environ.get("VERSION_PLACEHOLDER", "1.0.0")
 print(version_string)
 version = version_string

         'Levenshtein==0.23.0',
         'matplotlib==3.8.4',
         'numpy==1.26.4',
+        'pandas==2.2.2',
         'pm4py==2.7.2',
         'scikit-learn==1.2.2',
+        'scipy==1.13.0',
         'seaborn==0.13.2',
         'smac==2.0.2',
         'tqdm==4.65.0',
+        'streamlit-toggle-switch>=1.0.2'
     ],
     packages = ['gedi'],
     classifiers=[

         'License :: OSI Approved :: MIT License', # Again, pick a license
         'Programming Language :: Python :: 3.9',
     ],
+)
utils/column_mappings.py
ADDED
@@ -0,0 +1,16 @@
+def column_mappings():
+
+    column_names_short = {
+        'rutpt': 'ratio_unique_traces_per_trace',
+        'rmcv': 'ratio_most_common_variant',
+        'tlcv': 'trace_len_coefficient_variation',
+        'mvo': 'mean_variant_occurrence',
+        'enve': 'epa_normalized_variant_entropy',
+        'ense': 'epa_normalized_sequence_entropy',
+        'eself': 'epa_sequence_entropy_linear_forgetting',
+        'enself': 'epa_normalized_sequence_entropy_linear_forgetting',
+        'eseef': 'epa_sequence_entropy_exponential_forgetting',
+        'enseef': 'epa_normalized_sequence_entropy_exponential_forgetting'
+    }
+
+    return column_names_short
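This mapping is what `gedi/features.py`, `gedi/generator.py`, and `utils/config_fabric.py` use above to translate abbreviated objective names into full feature names. A standalone sketch of that renaming pattern (sample values taken from `data/test/igedi_table_1.csv`):

```python
import pandas as pd
from utils.column_mappings import column_mappings

# Targets with abbreviated objective names, as in data/test/igedi_table_1.csv.
df = pd.DataFrame({"log": ["BPIC15f4"], "rmcv": [0.003], "ense": [0.604]})

# Rename only the columns that appear in the mapping, mirroring the changed modules.
mapping = column_mappings()
df = df.rename(columns={col: mapping[col] for col in df.columns if col in mapping})

print(list(df.columns))
# ['log', 'ratio_most_common_variant', 'epa_normalized_sequence_entropy']
```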
utils/config_fabric.py
CHANGED
@@ -13,6 +13,7 @@ import time
 import shutil
 import zipfile
 import io

 st.set_page_config(layout='wide')
 INPUT_XES="output/inputlog_temp.xes"
@@ -174,6 +175,10 @@ def set_generator_experiments(generator_params):
         df = pd.read_csv(uploaded_file)
         if len(df.columns) <= 1:
             raise pd.errors.ParserError("Please select a file with at least two columns (e.g. log, feature) and use ',' as a delimiter.")
         sel_features = st.multiselect("Selected features", list(df.columns), list(df.columns)[-1])
         if sel_features:
             df = df[sel_features]

 import shutil
 import zipfile
 import io
+from column_mappings import column_mappings

 st.set_page_config(layout='wide')
 INPUT_XES="output/inputlog_temp.xes"

         df = pd.read_csv(uploaded_file)
         if len(df.columns) <= 1:
             raise pd.errors.ParserError("Please select a file with at least two columns (e.g. log, feature) and use ',' as a delimiter.")
+        columns_to_rename = {col: column_mappings()[col] for col in df.columns if col in column_mappings()}
+
+        # Rename the matching columns
+        df.rename(columns=columns_to_rename, inplace=True)
         sel_features = st.multiselect("Selected features", list(df.columns), list(df.columns)[-1])
         if sel_features:
             df = df[sel_features]