dleemiller committed
Commit: dd4d659 · Parent(s): 9f7d514
Upload README.md with huggingface_hub

README.md CHANGED
````diff
@@ -1,10 +1,10 @@
 
 # WordLlama
 
-
+**WordLlama** is a fast, lightweight NLP toolkit that handles tasks like fuzzy-deduplication, similarity and ranking with minimal inference-time dependencies and optimized for CPU hardware.
 
 <p align="center">
-<img src="wordllama.png" alt="Word Llama" width="
+<img src="wordllama.png" alt="Word Llama" width="50%">
 </p>
 
 
@@ -96,6 +96,11 @@ Because of its fast and portable size, it makes a good "Swiss-Army Knife" utilit
 The [l2_supercat](https://huggingface.co/dleemiller/word-llama-l2-supercat) is a Llama2-vocabulary model. To train this model, I concatenated codebooks from several models, including Llama2 70B and phi3 medium (after removing additional special tokens).
 Because several models have used the Llama2 tokenizer, their codebooks can be concatenated and trained together. Performance of the resulting model is comparable to training the Llama3 70B codebook, while being 4x smaller (32k vs 128k vocabulary).
 
+### Other Models
+[Results](wordllama/RESULTS.md)
+
+Llama3-based: [l3_supercat](https://huggingface.co/dleemiller/wordllama-l3-supercat)
+
 ## Embed Text
 
 Here’s how you can load pre-trained embeddings and use them to embed text:
@@ -115,10 +120,10 @@ print(embeddings.shape) # (2, 64)
 Binary embedding models can be used like this:
 
 ```python
-# Binary embeddings are packed into
-# 64-dims => array of
+# Binary embeddings are packed into uint64
+# 64-dims => array of 1x uint64
 wl = WordLlama.load(trunc_dim=64, binary=True) # this will download the binary model from huggingface
-wl.embed("I went to the car") # Output: array([[
+wl.embed("I went to the car") # Output: array([[3029168427562626]], dtype=uint64)
 
 # load binary trained model trained with straight through estimator
 wl = WordLlama.load(dim=1024, binary=True)
@@ -181,7 +186,7 @@ If you use WordLlama in your research or project, please consider citing it as f
 title = {WordLlama: Recycled Token Embeddings from Large Language Models},
 year = {2024},
 url = {https://github.com/dleemiller/wordllama},
-version = {0.2.
+version = {0.2.5}
 }
 ```
 
````
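The codebook concatenation described in the l2_supercat paragraph above can be pictured roughly as follows. This is only an illustrative sketch, not the project's training code: the hidden sizes and the plain-numpy representation are assumptions, and the only point it shows is that embedding tables indexed by the same Llama2 vocabulary can be stacked feature-wise before a projection is trained on top.

```python
# Illustrative sketch only (not WordLlama's actual training code):
# models that share the Llama2 tokenizer index their token-embedding
# tables ("codebooks") by the same 32k token ids, so the tables can be
# concatenated along the feature axis into one larger codebook.
import numpy as np

vocab_size = 32_000  # shared Llama2 vocabulary (special tokens removed)

# Stand-ins for input embedding tables pulled from two source models;
# the hidden sizes are example values, not the real checkpoints.
codebook_a = np.random.randn(vocab_size, 8192).astype(np.float32)   # a Llama2-70B-sized table
codebook_b = np.random.randn(vocab_size, 5120).astype(np.float32)   # a phi3-medium-sized table

# Row i means the same token in every table, so feature-wise
# concatenation is well defined.
supercat_codebook = np.concatenate([codebook_a, codebook_b], axis=1)
print(supercat_codebook.shape)  # (32000, 13312)

# WordLlama then trains a small projection on top of a codebook like this;
# that training step is not shown here.
```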
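The "Embed Text" example introduced in the second hunk is not visible in this diff. A minimal sketch of that kind of usage, consistent with the `print(embeddings.shape) # (2, 64)` context line in the third hunk header, might look like the following; the import path and the example sentences are assumptions, not the README's exact code.

```python
# Minimal usage sketch; the exact example in the README is not shown in this diff.
from wordllama import WordLlama  # assumed import path

# Load pre-trained embeddings, truncated to 64 dimensions.
wl = WordLlama.load(trunc_dim=64)

# Embed a batch of texts into an (n_texts, 64) array.
embeddings = wl.embed(["I went to the car", "I went to the store"])
print(embeddings.shape)  # (2, 64)
```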
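The comments added in the binary-embedding hunk ("packed into uint64", "64-dims => array of 1x uint64") describe bit-packing binarized embeddings. A rough sketch of that packing idea is below, written in plain numpy rather than WordLlama's internal code; the sign threshold and bit order are assumptions used only to illustrate why 64 binary dimensions fit in a single uint64.

```python
# Sketch of the packing idea behind the binary-model comments above;
# this is plain numpy, not WordLlama's implementation, and the
# binarization threshold and bit order are assumptions.
import numpy as np

dense = np.random.randn(64).astype(np.float32)  # stand-in for a 64-dim embedding
bits = (dense > 0).astype(np.uint8)             # binarize: one bit per dimension
packed = np.packbits(bits).view(np.uint64)      # 64 bits -> 8 bytes -> one uint64
print(packed.shape, packed.dtype)               # (1,) uint64

# Packed embeddings are compared with XOR + popcount (Hamming distance),
# which is what makes binary embeddings cheap to compare.
def hamming(a: np.ndarray, b: np.ndarray) -> int:
    xor_bytes = np.bitwise_xor(a, b).view(np.uint8)
    return int(np.unpackbits(xor_bytes).sum())

other = np.packbits((np.random.randn(64) > 0).astype(np.uint8)).view(np.uint64)
print(hamming(packed, other))  # number of differing bits, 0..64
```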