k-l-lambda committed
Commit e3f17eb · 1 Parent(s): 784faa6

converter indices tensors added.

Files changed (3):
  1. README.md +36 -0
  2. inv_token_indices.pt +3 -0
  3. token_indices.pt +3 -0
README.md CHANGED
@@ -21,6 +21,7 @@ This is the code example:
import torch
from transformers import pipeline

+
pipe = pipeline(
    "text-generation",
    model='k-l-lambda/Llama-3.2-1B-vocab32k',
@@ -51,3 +52,38 @@ input_ids = tokenizer.encode("Hello, ", return_tensors="pt")
output = model.generate(input_ids)
print(tokenizer.decode(output[0]))
```
+
+
+ ## Token converter
+
+ You can map an ID value in the 32k vocab to the corresponding ID value in the original 128k vocab using the tensor in `token_indices.pt`.
+
+ ```python
+ import torch
+ from huggingface_hub import hf_hub_download
+ from transformers import AutoTokenizer
+
+
+ tokenizer128k = AutoTokenizer.from_pretrained('meta-llama/Llama-3.2-1B-Instruct')
+ tokenizer32k = AutoTokenizer.from_pretrained('k-l-lambda/Llama-3.2-1B-vocab32k')
+
+ indices_path = hf_hub_download(repo_id='k-l-lambda/Llama-3.2-1B-vocab32k', filename='token_indices.pt')
+ inv_indices_path = hf_hub_download(repo_id='k-l-lambda/Llama-3.2-1B-vocab32k', filename='inv_token_indices.pt')
+ token_indices = torch.load(indices_path)
+ inv_token_indices = torch.load(inv_indices_path)
+
+ ids_32k = tokenizer32k.encode('This is an example sentence.')
+ ids_128k = [token_indices[i].item() for i in ids_32k]
+ print(f'{ids_32k=}')
+ print(f'{ids_128k=}')
+
+ print(tokenizer128k.decode(ids_128k))
+
+
+ ids_128k = tokenizer128k.encode('This is another example sentence.')
+ ids_32k = [inv_token_indices[i].item() for i in ids_128k]
+ print(f'{ids_128k=}')
+ print(f'{ids_32k=}')  # tokens that do not exist in the 32k vocab map to -1
+
+ print(tokenizer32k.decode(ids_32k))
+ ```
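
As the comment in the added example notes, 128k IDs with no counterpart in the 32k vocab come back as -1, and `tokenizer32k.decode` cannot consume that sentinel directly. Below is a minimal sketch of two ways to handle it, assuming the `inv_token_indices` tensor and tokenizers loaded in the example above; the `unk_token_id` fallback is illustrative, and Llama tokenizers may not define one:

```python
# Minimal sketch: drop or substitute the -1 sentinel before decoding.
# Assumes inv_token_indices, tokenizer128k and tokenizer32k are loaded
# as in the README example above.
ids_128k = tokenizer128k.encode('A sentence that may contain unmapped tokens.')
ids_32k = [inv_token_indices[i].item() for i in ids_128k]

# Option 1: skip unmapped tokens entirely.
ids_32k_kept = [i for i in ids_32k if i != -1]
print(tokenizer32k.decode(ids_32k_kept))

# Option 2: substitute the 32k tokenizer's unk token, if it defines one
# (illustrative choice, not part of this repo).
unk = tokenizer32k.unk_token_id
if unk is not None:
    ids_32k_unk = [i if i != -1 else unk for i in ids_32k]
    print(tokenizer32k.decode(ids_32k_unk))
```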
inv_token_indices.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:a7b97585a6bdd88d65084d9b9dd6f608cd1491e4b15a02faab8eb9659448da45
+ size 1027278
token_indices.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f455f59558b410ee6eb8ef8676975157a3b6ba8e9e0cb8a05d59809e331e9c6f
+ size 259258
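
For reference, a quick sanity check of the two tensors added in this commit. This is a sketch under the assumption that both files hold 1-D integer index tensors (consistent with the LFS sizes above: roughly 32k and 128k int64 entries); the round-trip property is expected behavior for an index/inverse-index pair, not something the repo guarantees:

```python
import torch
from huggingface_hub import hf_hub_download

repo = 'k-l-lambda/Llama-3.2-1B-vocab32k'
token_indices = torch.load(hf_hub_download(repo_id=repo, filename='token_indices.pt'))
inv_token_indices = torch.load(hf_hub_download(repo_id=repo, filename='inv_token_indices.pt'))

# Expected shapes: ~32k and ~128k entries respectively.
print(token_indices.shape, inv_token_indices.shape)

# Round trip 32k -> 128k -> 32k should be the identity if the two
# tensors are exact inverses of each other.
round_trip = inv_token_indices[token_indices]
identity = torch.arange(len(token_indices), dtype=token_indices.dtype)
print('round trip ok:', torch.equal(round_trip, identity))

# Count 128k tokens with no 32k counterpart (the -1 sentinel).
print('unmapped 128k tokens:', (inv_token_indices == -1).sum().item())
```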