dleemiller committed
Commit ebe0c07 • 1 Parent(s): 7651aec
Update README.md

README.md CHANGED

# wordllama

## Installation

Use the github repo or install via pip: https://github.com/dleemiller/WordLlama

```
pip install wordllama
```

## Intended Use

This model is intended for use in natural language processing applications that require text embeddings, such as text classification, sentiment analysis, and document clustering.
It's a token embedding model that is comparable to word embedding models, but substantially smaller in size (16MB for the default 256-dim model).

```python
from wordllama import load

wl = load()
similarity_score = wl.similarity("i went to the car", "i went to the pawn shop")
print(similarity_score)  # Output: 0.06641249096796882
```

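For use cases like the document clustering mentioned above, you generally work with the embedding vectors directly rather than pairwise similarity scores. The sketch below is illustrative only: it assumes the library also exposes an `embed()` method that returns one vector per input text (see the GitHub repo for the exact API), and uses scikit-learn's KMeans purely as an example downstream step.

```python
import numpy as np
from sklearn.cluster import KMeans
from wordllama import load

wl = load()

docs = [
    "the stock market fell sharply today",
    "investors sold off shares amid rate fears",
    "the recipe calls for two cups of flour",
    "bake the cake at 350 degrees for 30 minutes",
]

# Assumed API: wl.embed(texts) returning an array of shape (n_docs, 256)
# for the default model. Check the repository docs for the exact call.
embeddings = np.asarray(wl.embed(docs))

# Cluster the documents on their embeddings (illustrative downstream step).
labels = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(embeddings)
print(labels)  # e.g. [0 0 1 1]
```
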
## Model Architecture

WordLlama is based on token embedding codebooks extracted from large language models.
It is trained like a general-purpose embedding model, with MultipleNegativesRankingLoss using the sentence-transformers library,
and with Matryoshka Representation Learning so that embeddings can be truncated to 64, 128, 256, 512 or 1024 dimensions.

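To make the Matryoshka behavior concrete, the sketch below truncates a full-length embedding to its leading dimensions and re-normalizes before computing cosine similarity. It uses random NumPy vectors as stand-ins for real embeddings, so it illustrates the truncation idea only and does not depend on the WordLlama API.

```python
import numpy as np

def truncate_and_normalize(emb: np.ndarray, dim: int) -> np.ndarray:
    """Keep the leading `dim` dimensions (Matryoshka-style) and L2-normalize."""
    truncated = emb[..., :dim]
    return truncated / np.linalg.norm(truncated, axis=-1, keepdims=True)

# Random stand-ins for two 1024-dim embeddings.
rng = np.random.default_rng(0)
a, b = rng.normal(size=(2, 1024))

for dim in (64, 128, 256, 512, 1024):
    a_t, b_t = truncate_and_normalize(a, dim), truncate_and_normalize(b, dim)
    print(dim, float(a_t @ b_t))  # cosine similarity at each truncation level
```
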
To create WordLlama L2 "supercat", we extract and concatenate the token embedding codebooks from several large language models that
use the llama2 tokenizer vocabulary (32k vocab size). This includes models like Llama2 70B and Phi-3 Medium.
Then we add a trainable token weight parameter and initialize stopwords to a smaller value (0.1). Finally, we
train a projection from the large, concatenated codebook down to a smaller dimension and average pool.

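Putting the steps above together, the per-text forward pass is roughly: look up each token's row in the concatenated codebook, scale it by the learned token weight, project it down to the output dimension, and average-pool over the tokens. The NumPy sketch below uses small toy shapes and random matrices as stand-ins for the trained parameters; it illustrates the data flow rather than reproducing the actual model.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, concat_dim, out_dim = 1_000, 512, 64  # toy sizes for illustration

# Stand-ins for the learned parameters described above.
codebook = rng.normal(size=(vocab_size, concat_dim))  # concatenated codebooks
token_weight = np.ones(vocab_size)                    # trainable per-token weight
token_weight[:50] = 0.1                               # e.g. stopword ids start small
projection = rng.normal(size=(concat_dim, out_dim))   # learned down-projection

def embed_token_ids(token_ids: list[int]) -> np.ndarray:
    """Weight each token's codebook row, project it, then average-pool."""
    rows = codebook[token_ids] * token_weight[token_ids][:, None]  # (n_tokens, concat_dim)
    projected = rows @ projection                                  # (n_tokens, out_dim)
    return projected.mean(axis=0)                                  # (out_dim,)

print(embed_token_ids([3, 17, 42]).shape)  # (64,)
```
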
We use popular embedding datasets from sentence transformers, and Matryoshka Representation Learning (MRL) so that
dimensions can be truncated. For "binary" models, we train using a straight-through estimator, so that the embeddings
can be binarized, e.g. (x > 0).sign(), and packed into integers for Hamming distance computation.

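The binary path described above can be sketched as: threshold the embedding at zero, pack the bits into integer words, and compare packed codes with Hamming distance. This NumPy example covers only that inference-side path (the straight-through estimator is a training-time trick and is not shown), using random vectors as stand-ins for real embeddings.

```python
import numpy as np

def binarize_and_pack(emb: np.ndarray) -> np.ndarray:
    """Threshold at zero and pack the resulting bits into uint8 words."""
    bits = (emb > 0).astype(np.uint8)
    return np.packbits(bits, axis=-1)

def hamming_distance(packed_a: np.ndarray, packed_b: np.ndarray) -> int:
    """Count the differing bits between two packed binary embeddings."""
    xor = np.bitwise_xor(packed_a, packed_b)
    return int(np.unpackbits(xor).sum())

rng = np.random.default_rng(0)
a, b = rng.normal(size=(2, 256))  # stand-ins for two 256-dim embeddings
print(hamming_distance(binarize_and_pack(a), binarize_and_pack(b)))
```
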
After training, we save a new, small token embedding codebook, which is analogous to the vectors of a word embedding.

## MTEB Results (l2_supercat)

---
license: mit
datasets:
- sentence-transformers/all-nli
- sentence-transformers/gooaq
language:
- en
---