Malformed Training Dataset

#1
by JBilgewater - opened

I wanted to fool around with some of the XRD data and found that the ILtrainV1 part 1 database works and contains 597845 diffraction patterns. Part 2 produces a malformed SQLite database, though. There's no md5sum or similar checksum published, so I re-downloaded all files to make sure it wasn't a network problem. Perhaps someone could re-make these files?
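Since no reference checksums are published, one way to rule out a network problem is to hash each downloaded file and compare the digests across two independent downloads. This is only a minimal sketch: the helper name `file_sha256` is hypothetical, and matching digests show that the two downloads are identical, not that the uploaded file itself is sound.

```python
import hashlib

def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() with a sentinel keeps reading until read() returns b"" at end of file.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Running it on the same part file from two separate downloads and comparing the strings is enough to tell a transfer error from a problem in the hosted file.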

I also notice that part 1 of the training database contains 119569 x 5 patterns, while the referenced article says there should be 119569 x 30 powder patterns. Is the full database stored elsewhere?

SimXRD org

Dear JBilgewater,

The Hugging Face repository currently contains 119,569 × 10 data points. Due to file size limitations, the data has been split into multiple parts and uploaded separately. Please follow the tutorial below to merge the files and check again:

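import gzip
import os
import shutil

def combine_files(input_prefix, num_parts, output_file):
    # Decompress each part in order and recompress it into one gzip file.
    with gzip.open(output_file, 'wb') as f_out:
        for i in range(1, num_parts + 1):
            part_file = f"{input_prefix}_part{i}.gz"
            with gzip.open(part_file, 'rb') as f_in:
                shutil.copyfileobj(f_in, f_out)

            # Note: the part file is deleted once it has been copied.
            os.remove(part_file)

# Merge ILtrain1_part1.gz ... ILtrain1_part5.gz into ILtrain1.db.gz.
combine_files('ILtrain1', 5, 'ILtrain1.db.gz')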
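Because the merge script deletes each part after copying it, it may be worth confirming that every part decompresses cleanly before merging. The following is a hedged sketch using only the standard library; the helper name `gzip_is_intact` is hypothetical and not part of the SimXRD tooling.

```python
import gzip
import zlib

def gzip_is_intact(path, chunk_size=1 << 20):
    """Return True if the whole gzip stream decompresses and passes its CRC check."""
    try:
        with gzip.open(path, "rb") as fh:
            # Read to the end; the CRC and length trailer are verified at end of stream.
            while fh.read(chunk_size):
                pass
    except (OSError, EOFError, zlib.error):
        # OSError covers gzip.BadGzipFile; EOFError signals a truncated stream.
        return False
    return True
```

A part that returns False here is damaged or truncated on its own, independent of the merge step.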

Previously, the dataset was shared via OneDrive and later transferred to Hugging Face as per the reviewers' requirements. We will reopen the OneDrive link soon, after verification, and update the homepage accordingly.

Additionally, we have open-sourced all crystal data and simulation tools. You can find them here:
🔗 CrystDB on Hugging Face

The project is managed via GitHub, available at:
🔗 SimXRD on GitHub

Best regards,
Cao Bin
HKUST(GZ)

Thanks for getting back to me.

That is the procedure I tried the first time, but I repeated it all from scratch and got the same issue (PRAGMA integrity_check with sqlite3 yields a whole pile of "btreeInitPage() returns error code 11" errors).
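For anyone who wants to reproduce the check, here is a minimal sketch that runs the same PRAGMA from Python on the decompressed database file. The function name `integrity_report` and the `max_errors` cap are illustrative choices, not part of the SimXRD tooling.

```python
import sqlite3
from pathlib import Path

def integrity_report(db_path, max_errors=20):
    """Return the rows reported by PRAGMA integrity_check as a list of strings."""
    # Open read-only so the check can never modify a damaged file.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        rows = conn.execute(
            f"PRAGMA integrity_check({int(max_errors)})"
        ).fetchall()
    except sqlite3.DatabaseError as exc:
        # Raised, for example, when the file is not recognised as a database.
        return [f"error: {exc}"]
    finally:
        conn.close()
    return [row[0] for row in rows]
```

A healthy database yields the single row "ok"; a damaged one yields one row per problem found, up to the cap.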

If the large file sizes are causing hosting problems, it might simplify your life to switch to a NumPy binary array file.

I suppose having CrystDB, SimXRD and just the test dataset all available is enough to standardize community work on powder XRD patterns, though it's kind of nice to have a global training set as well.

SimXRD org

Thank you for your suggestion. Feel free to download CrystDB and SimXRD to generate patterns for your research. The original database is backed up on OneDrive here: [https://1drv.ms/f/c/5d8626238470b49e/Er6Ow4w2x5NPhTdiAnZaAb0BV6cPf4ODs4qHGkD8NV8_8w?e=lnGvOq]. Please don't hesitate to download it.

If you have any further questions, feel free to reach out to us again.

Wishing you success in your research!
