This readme.txt file for HydraGNN_Predictive_GFM_2024 was generated on 2024-10-28 by Massimiliano Lupo Pasini

GENERAL INFORMATION

  1. Title of Dataset: HydraGNN_Predictive_GFM_2024

  2. Author Information
     A. Principal Investigator Contact Information
        Name: Massimiliano Lupo Pasini
        Institution: Oak Ridge National Laboratory
        Address: 1 Bethel Valley Road, Bldg. 5700, Rm F119, Mail Stop 6085, P.O. Box 2008, Oak Ridge, TN, 37831
        Email: [email protected]
     B. Alternate Contact Information
        Name: Prasanna Balaprakash
        Institution: Oak Ridge National Laboratory
        Address: 1 Bethel Valley Road, P.O. Box 2008, Oak Ridge, TN, 37831
        Email: [email protected]

  3. Date of data collection: 2023-10-01 through 2024-10-01

  4. Geographic location of data collection: Oak Ridge, TN, USA; Berkeley, CA, USA

  5. Information about funding sources that supported the collection of the data: This work was supported in part by the Office of Science of the Department of Energy and by the Laboratory Directed Research and Development (LDRD) Program of Oak Ridge National Laboratory. This research is sponsored by the Artificial Intelligence Initiative as part of the Laboratory Directed Research and Development (LDRD) Program of Oak Ridge National Laboratory, managed by UT-Battelle, LLC, for the US Department of Energy under contract DE-AC05-00OR22725. This work used resources of the Oak Ridge Leadership Computing Facility (OLCF), which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725, under Directorate Discretionary award LRN026 and INCITE award CPH161. This work also used resources of the National Energy Research Scientific Computing (NERSC) Center, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231, under award ERCAP0027259.

SHARING/ACCESS INFORMATION

  1. Licenses/restrictions placed on the data: BSD-3 Clause license

  2. Links to publications that cite or use the data: TBD

  3. Links to other publicly accessible locations of the data: None

  4. Links/relationships to ancillary data sets: None

  5. Was data derived from another source? No

  6. Recommended citation for this dataset: M. Lupo Pasini, J. Y. Choi, K. Mehta, P. Zhang, D. Rogers, J. Bae, K. Ibrahim, A. Aji, K. W. Schulz, J. Polo, and P. Balaprakash, HydraGNN_Predictive_GFM_2024 - Ensemble of predictive graph foundation models for ground state atomistic materials modeling, DOI 10.13139/OLCF/2474799

DATA & FILE OVERVIEW

The "ADIOS_files" directory contains 6 sub-directories named as follows:

  • ANI1x-v3.bp
  • MPTrj-v3.bp
  • OC2020-20M-v3.bp
  • OC2020-v3.bp
  • OC2022-v3.bp
  • qm7x-v3.bp

Each sub-directory contains one of the pre-processed datasets, converted into Adaptable I/O System (ADIOS) format (https://www.exascaleproject.org/research-project/adios/), that were used for the development, training, and performance testing of the ensemble of predictive graph foundation models.
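
For readers unfamiliar with ADIOS, the sketch below lists the variables stored in one of the .bp sub-directories using the adios2 Python bindings. The file path is taken from the list above; the API calls assume adios2 >= 2.10 (older releases expose a different high-level interface), and no assumption is made about the variable names stored inside.

    # Minimal sketch: inspect one pre-processed .bp dataset with the
    # adios2 Python bindings (assumes adios2 >= 2.10).
    from adios2 import FileReader

    with FileReader("ADIOS_files/qm7x-v3.bp") as reader:
        for name, info in reader.available_variables().items():
            # Print each stored variable with its shape and data type.
            print(name, info.get("Shape"), info.get("Type"))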

The "Ensemble_of_models" directory contains 15 sub-directories named as follows:

  • gfm_0.229
  • gfm_0.156
  • gfm_0.147
  • gfm_0.260
  • gfm_0.165
  • gfm_0.78
  • gfm_0.137
  • gfm_0.1
  • gfm_0.175
  • gfm_0.171
  • gfm_0.181
  • gfm_0.67
  • gfm_0.179
  • gfm_0.167
  • gfm_0.351

Each one of these sub-directories refers to one of the fifteen hyperparameter optimization (HPO) trials that were selected to continue the pre-training for at most 30 epochs. Within each sub-directory associated with a specific HPO trial, the following files can be found:

  • config.json: file with the input arguments used to build and train a HydraGNN architecture
  • gfm_0.ID_epoch_N.pk: file with the model parameters for HPO trial ID after N epochs of training (see the loading sketch below)
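
As a hedged illustration only (the supported loading utilities are part of the HydraGNN repository linked below), the following sketch inspects one trial's configuration and checkpoint. It assumes the .pk files are standard PyTorch pickles; the trial ID 147 and epoch index 29 in the paths are hypothetical examples.

    # Sketch: inspect one HPO trial's files. Assumes the .pk checkpoints
    # are PyTorch pickles; trial ID and epoch number are hypothetical.
    import json
    import torch

    with open("Ensemble_of_models/gfm_0.147/config.json") as f:
        config = json.load(f)  # arguments used to build/train the model
    print(sorted(config.keys()))

    state = torch.load(
        "Ensemble_of_models/gfm_0.147/gfm_0.147_epoch_29.pk",
        map_location="cpu",  # load on CPU regardless of training device
    )
    print(type(state))  # e.g., a dict of model parameters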

The code used to develop, pre-train, and load the pre-trained models for post-processing analysis is available in the ORNL GitHub organization at the following link: https://github.com/ORNL/HydraGNN/tree/Predictive_GFM_2024

The scripts used to pre-process the data, generate the ADIOS files, run HPO, and continue the pre-training are available in the directory examples/multidataset_hpo (https://github.com/ORNL/HydraGNN/tree/Predictive_GFM_2024/examples/multidataset_hpo). The scripts that use the pre-trained ensemble of GFMs for ensemble averaging and epistemic UQ are available in the directory examples/ensemble_learning (https://github.com/ORNL/HydraGNN/tree/Predictive_GFM_2024/examples/ensemble_learning).

The ADIOS files and the parameters for the ensemble of models are available at the following entry of the OLCF Data Constellation: https://doi.ccs.ornl.gov/dataset/3a49c8df-83f7-5d32-84be-f81d289e7cdd

  1. Relationship between files, if important: None

  2. Additional related data collected that was not included in the current data package: None

  3. Are there multiple versions of the dataset? No

METHODOLOGICAL INFORMATION

  1. Description of methods used for collection/generation of data: We provide the ensemble of fifteen pre-trained graph foundation models (GFMs) for atomistic materials modeling applications.

Each one of the fifteen GFMs has been trained on five open-source datasets that (once aggregated) amount to over 154 million atomistic structures, which cover over two-thirds of the natural elements of the periodic table and comprise a broad set of organic and inorganic compounds. This vast set of atomistic structures comprises ground-state configurations that are dynamically stable (i.e., equilibrated structures with atomic forces close to zero) as well as dynamically unstable structures (i.e., non-equilibrium structures with non-negligible, non-zero atomic forces). The aggregated ensemble of datasets does NOT include excited states.

The datasets have been curated to remove atomistic structures whose force tensor has a spectral norm above 100 eV/angstrom. Moreover, a linear term of the energy was computed for each dataset using a linear regression model with the chemical concentration of each natural element as regressor. The linear term predicted by this model was subtracted from each original energy value to re-align the energy values across the different electronic structure approximation theories used to generate the diverse multi-source, multi-fidelity datasets.
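
A minimal, self-contained sketch of these two pre-processing steps on synthetic data is shown below: the spectral-norm filter over each structure's force tensor and the linear-regression energy re-alignment. The array shapes and the zero-intercept fit are assumptions made for illustration, not the exact curation scripts.

    # Illustrative sketch of the curation and energy re-alignment steps
    # described above, on synthetic data (shapes and fit_intercept=False
    # are assumptions).
    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)
    n_structures, n_elements, n_atoms = 1000, 5, 32

    # Spectral-norm filter: drop structures whose (n_atoms, 3) force
    # tensor has a largest singular value above 100 eV/angstrom.
    forces = rng.normal(size=(n_structures, n_atoms, 3))
    spectral_norms = np.linalg.norm(forces, ord=2, axis=(1, 2))
    keep = spectral_norms <= 100.0

    # Energy re-alignment: fit E ~ sum_e w_e * x_e over per-element
    # chemical concentrations x_e, then subtract the fitted linear term.
    X = rng.dirichlet(np.ones(n_elements), size=n_structures)
    E = X @ rng.normal(size=n_elements) + 0.1 * rng.normal(size=n_structures)
    reg = LinearRegression(fit_intercept=False).fit(X[keep], E[keep])
    E_aligned = E[keep] - reg.predict(X[keep])  # residuals used as labels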

The folder "ADIOS_files" contains the set of pre-processed datasets in Adaptable I/O System (ADIOS) format (https://www.exascaleproject.org/research-project/adios/) that have been used for the development and training of GFMs in this work.

Each GFM was developed using HydraGNN (https://github.com/ORNL/HydraGNN) as the underlying graph neural network (GNN) architecture. The multi-task learning (MTL) capability of HydraGNN was used to simultaneously train the GFMs on labeled values for direct predictions of energy (a total system property of an atomistic structure that measures chemical stability) and atomic forces (an atomic-level property of an atomistic structure that measures dynamical stability).
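
To make the multi-task setup concrete, the following schematic shows a weighted two-head training objective. This is a sketch, not HydraGNN's exact implementation: the mean-squared-error loss and unit task weights are illustrative assumptions.

    # Schematic multi-task loss over the two prediction heads (energy and
    # atomic forces). MSE and unit weights are illustrative choices.
    import torch
    import torch.nn.functional as F

    def mtl_loss(pred_energy, true_energy, pred_forces, true_forces,
                 w_energy=1.0, w_forces=1.0):
        loss_energy = F.mse_loss(pred_energy, true_energy)  # graph-level head
        loss_forces = F.mse_loss(pred_forces, true_forces)  # node-level head
        return w_energy * loss_energy + w_forces * loss_forces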

The hyperparameters of the GFMs have been tuned using scalable hyperparameter optimization (HPO) algorithms implemented in the software DeepHyper (https://github.com/deephyper/deephyper). The pre-training of each HPO trial was performed using distributed data parallelism (DDP) to scale the training across 128 compute nodes of the exascale OLCF supercomputer Frontier. Each HPO trial was trained for only 10 epochs, with early stopping, to avoid wasting significant computational resources on GNN architectures that were clearly underperforming. For each HPO trial, the 'omnistat' tool developed by AMD Research (Advanced Micro Devices) was used to measure the total energy consumption in kWh.
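
For orientation, below is a hedged sketch of how a DeepHyper hyperparameter search space is defined. The hyperparameter names and ranges are assumptions made for illustration (the actual HPO scripts live in examples/multidataset_hpo), and the module path varies across DeepHyper releases.

    # Hedged sketch of a DeepHyper search-space definition. Names and
    # ranges are assumptions; older DeepHyper releases import HpProblem
    # from deephyper.problem instead.
    from deephyper.hpo import HpProblem

    problem = HpProblem()
    problem.add_hyperparameter((2, 8), "num_conv_layers")    # GNN depth
    problem.add_hyperparameter((64, 512), "hidden_dim")      # layer width
    problem.add_hyperparameter((1e-5, 1e-2, "log-uniform"), "learning_rate")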

The ensemble of GFMs was obtained by selecting the fifteen best-performing HPO trials. Four models were selected for their clear advantage in accuracy: the GFMs with IDs 229, 156, 147, and 260. An additional eleven models were selected based on a judicious balance between accuracy and the energy consumption needed for training: the GFMs with IDs 165, 78, 137, 1, 175, 171, 181, 67, 179, 167, and 351. The training of each selected GFM of the ensemble was continued to accumulate a total of at most 30 epochs. In some cases, the total number of epochs actually performed was less than 30 due to two combined factors: (1) the size of the GFM (i.e., the number of model parameters to train) and (2) the total wall-clock time for which the computational resources could be allocated on OLCF-Frontier.

The ensemble of fifteen GFM architectures was used for (1) ensemble averaging, to stabilize the predictions of energy and atomic forces after pre-training for post-processing analysis, and (2) ensemble-based epistemic uncertainty quantification (UQ).
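
Both uses reduce to simple statistics across the fifteen models' predictions, as in the sketch below; the stacked array preds is a synthetic placeholder for per-model energy predictions, not data from this package.

    # Sketch of ensemble averaging and epistemic UQ over the fifteen GFMs.
    # preds stands in for per-model energy predictions on a test set.
    import numpy as np

    preds = np.random.default_rng(1).normal(size=(15, 100))  # (models, structures)
    mean_energy = preds.mean(axis=0)    # (1) ensemble-averaged prediction
    epistemic_std = preds.std(axis=0)   # (2) model disagreement as epistemic UQ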

This research is sponsored by the Artificial Intelligence Initiative as part of the Laboratory Directed Research and Development (LDRD) Program of Oak Ridge National Laboratory, managed by UT-Battelle, LLC, for the US Department of Energy under contract DE-AC05-00OR22725. This work used resources of the Oak Ridge Leadership Computing Facility, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725, under Directorate Discretionary award LRN026 and INCITE award CPH161. This work also used resources of the National Energy Research Scientific Computing Center, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231, under award ERCAP0025216.

  2. Methods for processing the data: The data submitted are the raw data as produced by the density functional theory code Vienna Ab initio Simulation Package (VASP).

  3. Instrument- or software-specific information needed to interpret the data: HydraGNN (https://github.com/ORNL/HydraGNN)

  4. Standards and calibration information, if appropriate: None

  5. Environmental/experimental conditions: None

  6. Describe any quality-assurance procedures performed on the data: None

  7. People involved with sample collection, processing, analysis and/or submission: Massimiliano Lupo Pasini, Jong Youl Choi, Kshitij Mehta, Pei Zhang, David Rogers, Jonghyun Bae, Khaled Ibrahim, Ashwin Aji, Karl W. Schulz, Jordan Polo, Prasanna Balaprakash

DATA-SPECIFIC INFORMATION FOR: Ensemble_of_models

  1. Number of variables: the "Ensemble_of_models" folder contains sub-folders named "gfm_0.ID", which refer to different HPO trials with ID as a unique identification number. Each "gfm_0.ID" sub-folder contains a "config.json" file that describes the input arguments used to build and train a HydraGNN architecture, and a series of files named "gfm_0.ID_epoch_N.pk" that contain the model parameters of HPO trial ID at the Nth epoch of pre-training.

  2. Number of cases/rows: 3,100 atomic structures

  3. Variable List: None

  4. Missing data codes: None

  5. Specialized formats or other abbreviations used: ASCII format
