Which fields in tokenizer_config.json are important?

#30
by juan-cytora - opened

Here are the contents of tokenizer_config.json

{
  "add_prefix_space": false,
  "bos_token": "<|endoftext|>",
  "eos_token": "<|endoftext|>",
  "model_max_length": 1000000000000000019884624838656,
  "name_or_path": "EleutherAI/pythia-12b",
  "special_tokens_map_file": "/admin/home-hailey/.cache/huggingface/hub/models--EleutherAI--gpt-neox-20b/snapshots/4e49eadb5d14bd22f314ec3f45b69a87b88c7691/special_tokens_map.json",
  "tokenizer_class": "GPTNeoXTokenizer",
  "unk_token": "<|endoftext|>"
}
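As an aside (my observation, not from the thread): the strange `model_max_length` value isn't model-specific. As far as I can tell, it's the "no length limit" sentinel transformers writes when a model doesn't set an explicit maximum, defined as `int(1e30)`; the odd trailing digits are just float64 rounding of 1e30.

```python
# The huge model_max_length is int(1e30): 1e30 is a float, and the nearest
# representable float64 to 10**30 has these trailing digits when cast to int.
sentinel = int(1e30)
print(sentinel)
```

So that field carries no real information here; it just means "effectively unlimited".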

I note that special_tokens_map_file points to a file that doesn't exist in any of my testing environments, which made me question its usefulness (and now that of all the other fields).

Databricks org

I don't think it's used. It's a holdover from Pythia, and I'm not sure it has any significance there either.

Thanks Sean, what about the other fields?

E.g. I fine-tuned each of 3b/7b/12b, but for all of them just copied the pipelines from 3b, which means all of them have:

"name_or_path": "EleutherAI/pythia-2.8b",

Wondering if this breaks things.

(I suppose where you say "I don't think it's used." by "it's" you mean the entire file, not just that one field...)
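If it turns out the field should be corrected, one option is to patch it in the local copy before pushing. A minimal sketch (the dict below is an abbreviated stand-in for the real tokenizer_config.json, and the temp directory is just for illustration):

```python
import json
import os
import tempfile

# Abbreviated stand-in for the real tokenizer_config.json shown above.
cfg = {
    "name_or_path": "EleutherAI/pythia-2.8b",  # copied from the 3b pipeline
    "tokenizer_class": "GPTNeoXTokenizer",
}

# Correct the provenance field for the 12b variant.
cfg["name_or_path"] = "EleutherAI/pythia-12b"

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "tokenizer_config.json")
    with open(path, "w") as f:
        json.dump(cfg, f, indent=2)
    with open(path) as f:
        print(json.load(f)["name_or_path"])
```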

Databricks org

That one is important: it says which tokenizer to use. It's Pythia's tokenizer, not modified here.

Databricks org

I mean just that one entry

OK, thanks Sean!

juan-cytora changed discussion status to closed

Quick test: only name_or_path seems to differ:

mkdir 3b 7b 12b
for scale in 3b 7b 12b; do for file in instruct_pipeline.py special_tokens_map.json tokenizer_config.json tokenizer.json; do wget -P $scale https://huggingface.co/databricks/dolly-v2-$scale/raw/main/$file; done; done

What differs?

% for f in `ls -1 3b`; do diff -q 3b/$f 7b/$f; done
Files 3b/instruct_pipeline.py and 7b/instruct_pipeline.py differ
Files 3b/tokenizer_config.json and 7b/tokenizer_config.json differ
% for f in `ls -1 3b`; do diff -q 3b/$f 12b/$f; done
Files 3b/instruct_pipeline.py and 12b/instruct_pipeline.py differ
Files 3b/tokenizer_config.json and 12b/tokenizer_config.json differ
% for f in `ls -1 3b`; do diff -q 7b/$f 12b/$f; done
Files 7b/tokenizer_config.json and 12b/tokenizer_config.json differ

instruct_pipeline.py doesn't really differ; the only difference is a trailing newline at end of file.

Just one field differs in tokenizer_config.json

Which is the important one, as Sean noted.

% diff {3b,7b}/tokenizer_config.json
6c6
<   "name_or_path": "EleutherAI/pythia-2.8b",
---
>   "name_or_path": "EleutherAI/pythia-6.9b",
% diff {3b,12b}/tokenizer_config.json
6c6
<   "name_or_path": "EleutherAI/pythia-2.8b",
---
>   "name_or_path": "EleutherAI/pythia-12b",
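The same check can be done field-by-field instead of textually. A small sketch with inline stand-ins mirroring the diff output above (the real files have more fields, all identical across scales):

```python
import json  # the real files would be loaded with json.load


def config_diff(a: dict, b: dict) -> dict:
    """Return {key: (a_value, b_value)} for every key whose values differ."""
    keys = set(a) | set(b)
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


# Abbreviated stand-ins mirroring the diff output above.
cfg_3b = {
    "name_or_path": "EleutherAI/pythia-2.8b",
    "tokenizer_class": "GPTNeoXTokenizer",
}
cfg_12b = {
    "name_or_path": "EleutherAI/pythia-12b",
    "tokenizer_class": "GPTNeoXTokenizer",
}

print(config_diff(cfg_3b, cfg_12b))
```

This reports only name_or_path, matching the textual diff.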
