Issues with training JAT

#2
by ksridhar - opened

Hi,
First of all, I am really thankful for this great work collecting many of the Gato datasets and providing a great codebase to start from. However, I've run into some basic errors and am hoping to get some clarification or help. I have also posted this on the GitHub issues page at https://github.com/huggingface/jat/issues/170 .

1. There is no function called mix_iterable_datasets in jat.utils, but it is still imported in scripts/train_jat.py. This prevents me from running scripts/train_jat.py.
2. The command to train JAT in the README has a few problems:
(a) It uses scripts/train_jat_tokenized.py, even though the tokenized dataset was originally said to be unavailable; it now seems to be available at this link. It would be great to get a clarification on what the tokenized dataset is and whether it is recommended for use. For now, it seems that the tokenized dataset is a superset of the raw dataset and was last updated in Dec 2023 (much earlier than the raw dataset).

(b) When I try to run scripts/train_jat_tokenized.py anyway with the command in the README, it looks for jat-project/jat-small on Hugging Face, which seems to be private. Is there a clean command to train the model from scratch?
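
In case it helps to reproduce the missing-import problem with mix_iterable_datasets, here is a minimal sketch of a check that can be run without launching the training script. It assumes the repository is installed so that `jat.utils` is importable; `module_exports` is a hypothetical helper name I made up for this post, not something from the codebase.

```python
import importlib


def module_exports(module_name: str, attr: str) -> bool:
    """Return True if `module_name` can be imported and exposes `attr`."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # The module itself could not be imported (e.g. package not installed).
        return False
    # The module imported fine; check whether the name is defined on it.
    return hasattr(module, attr)
```

Calling `module_exports("jat.utils", "mix_iterable_datasets")` should return False in an environment that shows the error described above, and True once the function is available.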

ksridhar changed discussion status to closed
