dleemiller committed
Commit dd4d659
1 Parent(s): 9f7d514

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +11 -6
README.md CHANGED
@@ -1,10 +1,10 @@

# WordLlama

- The power of 15 trillion tokens of training, extracted, flogged and minimized into a cute little package for word embedding.
+ **WordLlama** is a fast, lightweight NLP toolkit that handles tasks like fuzzy deduplication, similarity, and ranking with minimal inference-time dependencies, and is optimized for CPU hardware.

<p align="center">
- <img src="wordllama.png" alt="Word Llama" width="60%">
+ <img src="wordllama.png" alt="Word Llama" width="50%">
</p>

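For context on the tasks named in the new description, here is a minimal sketch of how they are typically invoked, assuming the `similarity`, `rank`, and `deduplicate` helpers exposed by the `wordllama` package; the example sentences and threshold are illustrative only.

```python
from wordllama import WordLlama

# Illustrative sketch; method names are assumed from the wordllama package API.
wl = WordLlama.load()
docs = ["I went to the car", "I went to the vehicle", "The stock market fell today"]

print(wl.similarity(docs[0], docs[1]))       # similarity score between two texts
print(wl.rank("automobile trip", docs))      # candidates ordered by similarity to the query
print(wl.deduplicate(docs, threshold=0.8))   # fuzzy deduplication: drops near-duplicate texts
```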
 
@@ -96,6 +96,11 @@ Because of its fast and portable size, it makes a good "Swiss-Army Knife" utility
The [l2_supercat](https://huggingface.co/dleemiller/word-llama-l2-supercat) is a Llama2-vocabulary model. To train this model, I concatenated codebooks from several models, including Llama2 70B and phi3 medium (after removing additional special tokens).
Because several models have used the Llama2 tokenizer, their codebooks can be concatenated and trained together. Performance of the resulting model is comparable to training the Llama3 70B codebook, while being 4x smaller (32k vs 128k vocabulary).

+ ### Other Models
+ [Results](wordllama/RESULTS.md)
+
+ Llama3-based: [l3_supercat](https://huggingface.co/dleemiller/wordllama-l3-supercat)
+
## Embed Text

Here’s how you can load pre-trained embeddings and use them to embed text:
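The embedding example itself sits outside the changed hunks; below is a minimal sketch of the call pattern implied by the surrounding diff (the `WordLlama.load()` / `wl.embed()` calls and the `(2, 64)` shape in the next hunk header), with illustrative input texts.

```python
from wordllama import WordLlama

# Load the default model truncated to 64 dimensions, then embed a small batch of texts.
wl = WordLlama.load(trunc_dim=64)
embeddings = wl.embed(["I went to the car", "The stock market fell today"])
print(embeddings.shape)  # (2, 64): two texts, 64 dimensions each
```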
@@ -115,10 +120,10 @@ print(embeddings.shape) # (2, 64)
Binary embedding models can be used like this:

```python
- # Binary embeddings are packed into uint32
- # 64-dims => array of 2x uint32
+ # Binary embeddings are packed into uint64
+ # 64-dims => array of 1x uint64
wl = WordLlama.load(trunc_dim=64, binary=True)  # this will download the binary model from huggingface
- wl.embed("I went to the car")  # Output: array([[3029168104, 2427562626]], dtype=uint32)
+ wl.embed("I went to the car")  # Output: array([[3029168427562626]], dtype=uint64)

# load binary model trained with a straight-through estimator
wl = WordLlama.load(dim=1024, binary=True)
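The comment change above is bit-packing arithmetic: 64 binary dimensions occupy two uint32 values (64 / 32) but a single uint64 (64 / 64). A small NumPy sketch of that size relationship follows; it illustrates the packing math only, not WordLlama's internal packing routine.

```python
import numpy as np

# 64 sign bits from a 64-dim embedding (random stand-in values here).
bits = (np.random.randn(64) > 0).astype(np.uint8)

packed = np.packbits(bits)            # 8 bytes total
print(packed.view(np.uint32).shape)   # (2,) -> "2x uint32"
print(packed.view(np.uint64).shape)   # (1,) -> "1x uint64"
```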
@@ -181,7 +186,7 @@ If you use WordLlama in your research or project, please consider citing it as follows:
  title = {WordLlama: Recycled Token Embeddings from Large Language Models},
  year = {2024},
  url = {https://github.com/dleemiller/wordllama},
-  version = {0.2.3}
+  version = {0.2.5}
}
```

 