Regarding the vocabulary used in the paper
#11 opened by jiaxin-wen
Thanks for your great work!
I have a question regarding the vocabulary. Specifically, the paper mentions that "We use GPT-Neo tokenizer but only keep the top 10K most common tokens". However, the currently uploaded vocabulary contains 50K tokens. Could you please upload the vocabulary that can be used to reproduce your experiments? :)
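For reference, a minimal sketch of how the uploaded tokenizer's vocabulary size can be checked with the `transformers` library (the repo id below is a placeholder, not the actual model repo):

```python
from transformers import AutoTokenizer

# Placeholder repo id; substitute the actual model/tokenizer repo.
tokenizer = AutoTokenizer.from_pretrained("your-repo-id")

# Reports the size of the uploaded vocabulary; the paper describes
# a 10K vocabulary, whereas the uploaded one appears to have 50K tokens.
print(len(tokenizer))
```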