dleemiller committed
Commit ebe0c07 • 1 Parent(s): 7651aec
Update README.md

README.md CHANGED

# wordllama

## Installation

Use the github repo or install via pip: https://github.com/dleemiller/WordLlama

```
pip install wordllama
```

## Intended Use

This model is intended for use in natural language processing applications that require text embeddings, such as text classification, sentiment analysis, and document clustering.
It's a token embedding model that is comparable to word embedding models, but substantially smaller in size (16MB for the default 256-dim model).

```python
from wordllama import load

wl = load()
similarity_score = wl.similarity("i went to the car", "i went to the pawn shop")
print(similarity_score)  # Output: 0.06641249096796882
```

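For use cases like the document clustering mentioned above, you generally work with the embedding vectors directly rather than pairwise similarity scores. The sketch below is illustrative only: it assumes the library also exposes an `embed()` method that returns one vector per input text (see the GitHub repo for the exact API), and uses scikit-learn's KMeans purely as an example downstream step.

```python
import numpy as np
from sklearn.cluster import KMeans
from wordllama import load

wl = load()

docs = [
    "the stock market fell sharply today",
    "investors sold off shares amid rate fears",
    "the recipe calls for two cups of flour",
    "bake the cake at 350 degrees for 30 minutes",
]

# Assumed API: wl.embed(texts) returning an array of shape (n_docs, 256)
# for the default model. Check the repository docs for the exact call.
embeddings = np.asarray(wl.embed(docs))

# Cluster the documents on their embeddings (illustrative downstream step).
labels = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(embeddings)
print(labels)  # e.g. [0 0 1 1]
```
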
## Model Architecture

WordLlama is based on token embedding codebooks extracted from large language models.
It is trained like a general-purpose embedding model, with MultipleNegativesRankingLoss using the sentence-transformers library,
and with Matryoshka Representation Learning so that embeddings can be truncated to 64, 128, 256, 512 or 1024 dimensions.

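To make the Matryoshka behavior concrete, the sketch below truncates a full-length embedding to its leading dimensions and re-normalizes before computing cosine similarity. It uses random NumPy vectors as stand-ins for real embeddings, so it illustrates the truncation idea only and does not depend on the WordLlama API.

```python
import numpy as np

def truncate_and_normalize(emb: np.ndarray, dim: int) -> np.ndarray:
    """Keep the leading `dim` dimensions (Matryoshka-style) and L2-normalize."""
    truncated = emb[..., :dim]
    return truncated / np.linalg.norm(truncated, axis=-1, keepdims=True)

# Random stand-ins for two 1024-dim embeddings.
rng = np.random.default_rng(0)
a, b = rng.normal(size=(2, 1024))

for dim in (64, 128, 256, 512, 1024):
    a_t, b_t = truncate_and_normalize(a, dim), truncate_and_normalize(b, dim)
    print(dim, float(a_t @ b_t))  # cosine similarity at each truncation level
```
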
To create WordLlama L2 "supercat", we extract and concatenate the token embedding codebooks from several large language models that
use the llama2 tokenizer vocabulary (32k vocab size). This includes models like Llama2 70B and Phi-3 Medium.
Then we add a trainable token weight parameter and initialize stopwords to a smaller value (0.1). Finally, we
train a projection from the large, concatenated codebook down to a smaller dimension and average pool.

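Putting the steps above together, the per-text forward pass is roughly: look up each token's row in the concatenated codebook, scale it by the learned token weight, project it down to the output dimension, and average-pool over the tokens. The NumPy sketch below uses small toy shapes and random matrices as stand-ins for the trained parameters; it illustrates the data flow rather than reproducing the actual model.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, concat_dim, out_dim = 1_000, 512, 64  # toy sizes for illustration

# Stand-ins for the learned parameters described above.
codebook = rng.normal(size=(vocab_size, concat_dim))  # concatenated codebooks
token_weight = np.ones(vocab_size)                    # trainable per-token weight
token_weight[:50] = 0.1                               # e.g. stopword ids start small
projection = rng.normal(size=(concat_dim, out_dim))   # learned down-projection

def embed_token_ids(token_ids: list[int]) -> np.ndarray:
    """Weight each token's codebook row, project it, then average-pool."""
    rows = codebook[token_ids] * token_weight[token_ids][:, None]  # (n_tokens, concat_dim)
    projected = rows @ projection                                  # (n_tokens, out_dim)
    return projected.mean(axis=0)                                  # (out_dim,)

print(embed_token_ids([3, 17, 42]).shape)  # (64,)
```
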
We use popular embedding datasets from sentence transformers, and Matryoshka Representation Learning (MRL) so that
dimensions can be truncated. For "binary" models, we train using a straight-through estimator, so that the embeddings
can be binarized, e.g. (x > 0).sign(), and packed into integers for Hamming distance computation.

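The binary path described above can be sketched as: threshold the embedding at zero, pack the bits into integer words, and compare packed codes with Hamming distance. This NumPy example covers only that inference-side path (the straight-through estimator is a training-time trick and is not shown), using random vectors as stand-ins for real embeddings.

```python
import numpy as np

def binarize_and_pack(emb: np.ndarray) -> np.ndarray:
    """Threshold at zero and pack the resulting bits into uint8 words."""
    bits = (emb > 0).astype(np.uint8)
    return np.packbits(bits, axis=-1)

def hamming_distance(packed_a: np.ndarray, packed_b: np.ndarray) -> int:
    """Count the differing bits between two packed binary embeddings."""
    xor = np.bitwise_xor(packed_a, packed_b)
    return int(np.unpackbits(xor).sum())

rng = np.random.default_rng(0)
a, b = rng.normal(size=(2, 256))  # stand-ins for two 256-dim embeddings
print(hamming_distance(binarize_and_pack(a), binarize_and_pack(b)))
```
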
After training, we save a new, small token embedding codebook, which is analogous to the vectors of a word embedding.

## MTEB Results (l2_supercat)

---
license: mit
datasets:
- sentence-transformers/all-nli
- sentence-transformers/gooaq
language:
- en
---