Commit e8e3b99 by mlupopa (verified; parent: f1aabfb)

Update README.md
---
language:
- en
metrics:
- accuracy
---

HydraGNN_Predictive_GFM_2024 readme.txt file was generated on 2024-10-28 by Massimiliano Lupo Pasini


GENERAL INFORMATION

1. Title of Dataset: HydraGNN_Predictive_GFM_2024

2. Author Information
   A. Principal Investigator Contact Information
      Name: Massimiliano Lupo Pasini
      Institution: Oak Ridge National Laboratory
      Address: 1 Bethel Valley Road, Bldg. 5700, Rm F119, Mail Stop 6085, P.O. Box 2008, Oak Ridge, TN, 37831


   B. Alternate Contact Information
      Name: Prasanna Balaprakash
      Institution: Oak Ridge National Laboratory
      Address: 1 Bethel Valley Road, P.O. Box 2008, Oak Ridge, TN, 37831

3. Date of data collection (single date, range, approximate date): 2023-10-01 through 2024-10-01

4. Geographic location of data collection: Oak Ridge, TN, USA - Berkeley, CA, USA

5. Information about funding sources that supported the collection of the data:
   This work was supported in part by the Office of Science of the Department of Energy and by the Laboratory Directed Research and Development (LDRD) Program of Oak Ridge National Laboratory.
   This research is sponsored by the Artificial Intelligence Initiative as part of the Laboratory Directed Research and Development (LDRD) Program of Oak Ridge National Laboratory, managed by UT-Battelle, LLC, for the US Department of Energy under contract DE-AC05-00OR22725.
   This work used resources of the Oak Ridge Leadership Computing Facility (OLCF), which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725, under Directorate Discretionary award LRN026 and INCITE award CPH161. This work also used resources of the National Energy Research Scientific Computing (NERSC) Center, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231, under award ERCAP0027259.


SHARING/ACCESS INFORMATION

1. Licenses/restrictions placed on the data: BSD 3-Clause license

2. Links to publications that cite or use the data: TBD

3. Links to other publicly accessible locations of the data: None

4. Links/relationships to ancillary data sets: None

5. Was data derived from another source? No
   A. If yes, list source(s):

6. Recommended citation for this dataset:
   M. Lupo Pasini, J. Y. Choi, K. Mehta, P. Zhang, D. Rogers, J. Bae, K. Ibrahim, A. Aji, K. W. Schulz, J. Polo, and P. Balaprakash, HydraGNN_Predictive_GFM_2024 - Ensemble of predictive graph foundation models for ground state atomistic materials modeling, DOI 10.13139/OLCF/2474799


DATA & FILE OVERVIEW

The "ADIOS_files" directory contains 6 sub-directories named as follows:
- ANI1x-v3.bp
- MPTrj-v3.bp
- OC2020-20M-v3.bp
- OC2020-v3.bp
- OC2022-v3.bp
- qm7x-v3.bp
Each sub-directory contains the pre-processed datasets converted into the Adaptable I/O System (ADIOS) format (https://www.exascaleproject.org/research-project/adios/) that have been used for the development, training, and performance testing of the ensemble of predictive graph foundation models.

The "Ensemble_of_models" directory contains 15 sub-directories named as follows:
- gfm_0.229
- gfm_0.156
- gfm_0.147
- gfm_0.260
- gfm_0.165
- gfm_0.78
- gfm_0.137
- gfm_0.1
- gfm_0.175
- gfm_0.171
- gfm_0.181
- gfm_0.67
- gfm_0.179
- gfm_0.167
- gfm_0.351

Each of these sub-directories refers to one of the fifteen hyperparameter optimization (HPO) trials that were selected to continue the pre-training for at most 30 epochs.
Within each sub-directory associated with a specific HPO trial, the following files can be found:
- config.json: file with the arguments parsed to build and train a HydraGNN architecture
- gfm_0.ID_epoch_N.pk: file with the model parameters of HPO trial ID after N epochs of training
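The checkpoint naming convention above can be parsed programmatically, for example to pick the most recent checkpoint of a trial. The helper below is a hypothetical sketch (the function names are not part of the dataset or of HydraGNN):

```python
import re

# Checkpoint files follow the pattern "gfm_0.ID_epoch_N.pk", where ID is the
# HPO trial identifier and N is the number of pre-training epochs completed.
CHECKPOINT_RE = re.compile(r"^gfm_0\.(?P<trial>\d+)_epoch_(?P<epoch>\d+)\.pk$")

def parse_checkpoint_name(filename: str) -> tuple[int, int]:
    """Return (trial_id, epoch) parsed from a checkpoint filename."""
    match = CHECKPOINT_RE.match(filename)
    if match is None:
        raise ValueError(f"not a GFM checkpoint filename: {filename!r}")
    return int(match.group("trial")), int(match.group("epoch"))

def latest_checkpoint(filenames: list[str]) -> str:
    """Pick the checkpoint with the highest epoch count."""
    return max(filenames, key=lambda name: parse_checkpoint_name(name)[1])
```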


The code used to develop, pre-train, and load the pre-trained models for post-processing analysis is available on the ORNL GitHub at the following link:
https://github.com/ORNL/HydraGNN/tree/Predictive_GFM_2024

The ADIOS files and the parameters for the ensemble of models are available at the following entry of the OLCF Data Constellation:
https://doi.ccs.ornl.gov/dataset/3a49c8df-83f7-5d32-84be-f81d289e7cdd



2. Relationship between files, if important: None

3. Additional related data collected that was not included in the current data package: None

4. Are there multiple versions of the dataset? No
   A. If yes, name of file(s) that was updated:
      i. Why was the file updated?
      ii. When was the file updated?


METHODOLOGICAL INFORMATION

1. Description of methods used for collection/generation of data:
We provide the ensemble of fifteen pre-trained graph foundation models (GFMs) for atomistic materials modeling applications.

Each one of the fifteen GFMs has been trained on five open-source datasets that (once aggregated) amount to over 154 million atomistic structures, which cover over two-thirds of the natural elements of the periodic table and comprise a broad set of organic and inorganic compounds. This vast set of atomistic structures comprises ground state configurations that are dynamically stable (i.e., equilibrated structures with atomic forces approximately equal to zero) as well as dynamically unstable structures (i.e., non-equilibrium structures with non-negligible, non-zero atomic forces). The ensemble of datasets aggregated does NOT include excited states.

The datasets have been curated to remove atomistic structures with a spectral norm of the force tensor above 100 eV/angstrom. Moreover, a linear term of the energy was computed for each dataset using a linear regression model that uses the chemical concentration of each natural element as a regressor. The linear term predicted by the linear regression model has been subtracted from each original energy value to re-align the energy values across the different electronic structure approximation theories used to generate the diverse multi-source, multi-fidelity datasets.
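The two curation steps above can be sketched in a few lines of NumPy. This is an illustrative reconstruction under stated assumptions (function names, cutoff handling, and the least-squares fit are ours), not the exact pipeline used for the dataset:

```python
import numpy as np

# Hypothetical sketch of the curation described above:
# (1) drop structures whose force tensor has spectral norm above 100 eV/angstrom,
# (2) fit a per-dataset linear energy term on elemental concentrations and
#     subtract it to re-align energies across fidelity levels.
FORCE_NORM_CUTOFF = 100.0  # eV/angstrom

def keep_structure(forces: np.ndarray) -> bool:
    """forces: (num_atoms, 3) array; keep if the spectral norm <= cutoff."""
    return np.linalg.norm(forces, ord=2) <= FORCE_NORM_CUTOFF

def realign_energies(concentrations: np.ndarray, energies: np.ndarray) -> np.ndarray:
    """concentrations: (num_structures, num_elements) per-element concentrations;
    energies: (num_structures,) total energies. Fit a linear model and return
    the residual energies after subtracting the predicted linear term."""
    coeffs, *_ = np.linalg.lstsq(concentrations, energies, rcond=None)
    return energies - concentrations @ coeffs
```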

The folder "ADIOS_files" contains the set of pre-processed datasets in the Adaptable I/O System (ADIOS) format (https://www.exascaleproject.org/research-project/adios/) that have been used for the development and training of the GFMs in this work.

Each GFM was developed using HydraGNN (https://github.com/ORNL/HydraGNN) as the underlying graph neural network (GNN) architecture.
The multi-task learning (MTL) capability of HydraGNN was used to simultaneously train the GFMs on labeled values for direct predictions of energy (a total system property of an atomistic structure that measures the chemical stability) and atomic forces (an atomic-level property of an atomistic structure that measures the dynamical stability).

The hyperparameters of the GFMs have been tuned using scalable hyperparameter optimization (HPO) algorithms implemented in the software DeepHyper (https://github.com/deephyper/deephyper). The pre-training of each HPO trial was performed using distributed data parallelism (DDP) to scale the training across 128 compute nodes of the exascale OLCF supercomputer Frontier. Each HPO trial was trained for only 10 epochs, and early stopping was used to avoid wasting significant computational resources on GNN architectures that were clearly underperforming. For each HPO trial, the 'omnistat' tool developed by AMD Research (Advanced Micro Devices) was used to measure the total energy consumption in kWh.
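The budgeted early-stopping policy described above can be sketched as follows. This is a generic patience-based illustration under our own assumptions (parameter names and patience value are hypothetical); it is not DeepHyper's API nor the exact criterion used for these trials:

```python
# Hypothetical sketch: stop an HPO trial before its 10-epoch budget if the
# validation loss has not improved for `patience` consecutive epochs.
def run_trial(epoch_losses, max_epochs=10, patience=3):
    """Return the number of epochs actually trained before stopping."""
    best = float("inf")
    stale = 0
    for epoch, loss in enumerate(epoch_losses[:max_epochs], start=1):
        if loss < best:
            best = loss
            stale = 0
        else:
            stale += 1
        if stale >= patience:
            return epoch  # clearly underperforming: stop early
    return min(max_epochs, len(epoch_losses))
```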

The ensemble of GFMs was obtained by selecting the fifteen best performing HPO trials. Four models were selected for their clear advantage in accuracy: the GFMs with IDs 229, 156, 147, and 260. An additional eleven models were selected based on a judicious balance between accuracy and the energy consumption needed for training: the GFMs with IDs 165, 78, 137, 1, 175, 171, 181, 67, 179, 167, and 351. The training of each selected GFM of the ensemble was continued to accumulate a total of at most 30 epochs. In some cases, the total number of epochs actually performed was less than 30 due to two combined factors: (1) the size of the GFM (i.e., the number of model parameters to train) and (2) the total wall-clock time for which the computational resources could be allocated on OLCF-Frontier.

The ensemble of fifteen GFM architectures was used for (1) ensemble averaging to stabilize the predictions of energy and atomic forces after pre-training for post-processing analysis and (2) ensemble uncertainty quantification (UQ).
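The two ensemble uses above can be sketched with NumPy: the mean over the fifteen members stabilizes the prediction, and the spread across members provides a simple uncertainty estimate. The function below is a minimal illustration (its name and the choice of standard deviation as the UQ metric are our assumptions):

```python
import numpy as np

# Hypothetical sketch of ensemble averaging and ensemble UQ over model members.
def ensemble_predict(member_predictions: np.ndarray):
    """member_predictions: (num_models, num_samples) array of per-model
    predictions (e.g., energies). Returns (mean, uncertainty) per sample."""
    mean = member_predictions.mean(axis=0)        # ensemble average
    uncertainty = member_predictions.std(axis=0)  # ensemble spread (UQ)
    return mean, uncertainty
```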


This research is sponsored by the Artificial Intelligence Initiative as part of the Laboratory Directed Research and Development (LDRD) Program of Oak Ridge National Laboratory, managed by UT-Battelle, LLC, for the US Department of Energy under contract DE-AC05-00OR22725.
This work used resources of the Oak Ridge Leadership Computing Facility, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725, under Directorate Discretionary award LRN026 and INCITE award CPH161. This work also used resources of the National Energy Research Scientific Computing Center, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231, under award ERCAP0025216.


2. Methods for processing the data: The data submitted is the raw data as produced by the density functional theory code Vienna Ab-initio Simulation Package (VASP).

3. Instrument- or software-specific information needed to interpret the data:
   HydraGNN (https://github.com/ORNL/HydraGNN)

4. Standards and calibration information, if appropriate: None

5. Environmental/experimental conditions: None

6. Describe any quality-assurance procedures performed on the data: None

7. People involved with sample collection, processing, analysis and/or submission: Massimiliano Lupo Pasini, Jong Youl Choi, Kshitij Mehta, Pei Zhang, David Rogers, Jonghyun Bae, Khaled Ibrahim, Ashwin Aji, Karl W. Schulz, Jordan Polo, Prasanna Balaprakash


DATA-SPECIFIC INFORMATION FOR: [FILENAME]
<repeat this section for each dataset, folder or file, as appropriate>

1. Number of variables: the "Ensemble_of_models" folder contains sub-folders named "gfm_0.ID", which refer to different HPO trials with ID as the unique identification number. Each "gfm_0.ID" sub-folder contains a "config.json" file that describes the input arguments parsed to build and train a HydraGNN architecture, and a series of files named "gfm_0.ID_epoch_N.pk" which contain the model parameters of HPO trial ID at the Nth epoch of pre-training.

2. Number of cases/rows: 3,100 atomic structures

3. Variable List: None

4. Missing data codes: None

5. Specialized formats or other abbreviations used: ASCII format