kevinkrahn committed • Commit ee1ab63 • 1 Parent(s): 07cb18b

Add new SentenceTransformer model.

Files changed:
- README.md (+29 -66)
- config.json (+1 -1)
- model.safetensors (+1 -1)
- modeling_hlm.py (+19 -1)
- tokenization_hlm.py (+10 -16)
README.md
CHANGED
@@ -6,30 +6,24 @@ tags:
 - feature-extraction
 - sentence-similarity
 - transformers
-- semantic-search
 
 ---
 
-# shlm-grc-en
+# kevinkrahn/shlm-grc-en
 
-
+This is a [sentence-transformers](https://www.SBERT.net) model: It maps sentences & paragraphs to a 768 dimensional dense vector space and can be used for tasks like clustering or semantic search.
 
-
-
-The base model uses a modified version of the HLM architecture described in [Heidelberg-Boston @ SIGTYP 2024 Shared Task: Enhancing Low-Resource Language Analysis With Character-Aware Hierarchical Transformers](https://aclanthology.org/2024.sigtyp-1.16/)
-
-This model is trained to produce sentence embeddings using the multilingual knowledge distillation method and datasets described in [Sentence Embedding Models for Ancient Greek Using Multilingual Knowledge Distillation](https://aclanthology.org/2023.alp-1.2/).
-
-This model was distilled from `BAAI/bge-base-en-v1.5` for embedding English and Ancient Greek text.
+<!--- Describe your model here -->
 
 ## Usage (Sentence-Transformers)
 
-
+Using this model becomes easy when you have [sentence-transformers](https://www.SBERT.net) installed:
 
-
-
+```
+pip install -U sentence-transformers
+```
 
-
+Then you can use the model like this:
 
 ```python
 from sentence_transformers import SentenceTransformer
@@ -50,7 +44,7 @@ from transformers import AutoTokenizer, AutoModel
 import torch
 
 
-def cls_pooling(model_output):
+def cls_pooling(model_output, attention_mask):
     return model_output[0][:,0]
 
 
@@ -58,8 +52,8 @@ def cls_pooling(model_output):
 sentences = ['This is an example sentence', 'Each sentence is converted']
 
 # Load model from HuggingFace Hub
-
-
+tokenizer = AutoTokenizer.from_pretrained('kevinkrahn/shlm-grc-en')
+model = AutoModel.from_pretrained('kevinkrahn/shlm-grc-en')
 
 # Tokenize sentences
 encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
@@ -69,61 +63,30 @@ with torch.no_grad():
     model_output = model(**encoded_input)
 
 # Perform pooling. In this case, cls pooling.
-sentence_embeddings = cls_pooling(model_output)
+sentence_embeddings = cls_pooling(model_output, encoded_input['attention_mask'])
 
 print("Sentence embeddings:")
 print(sentence_embeddings)
-
 ```
 
-## Citing & Authors
 
+
+## Evaluation Results
+
+<!--- Describe how your model was evaluated -->
+
+For an automated evaluation of this model, see the *Sentence Embeddings Benchmark*: [https://seb.sbert.net](https://seb.sbert.net?model_name=kevinkrahn/shlm-grc-en)
+
+
+
+## Full Model Architecture
 ```
-
-
-
-
-editor = "Hahn, Michael and
-    Sorokin, Alexey and
-    Kumar, Ritesh and
-    Shcherbakov, Andreas and
-    Otmakhova, Yulia and
-    Yang, Jinrui and
-    Serikov, Oleg and
-    Rani, Priya and
-    Ponti, Edoardo M. and
-    Murado{\u{g}}lu, Saliha and
-    Gao, Rena and
-    Cotterell, Ryan and
-    Vylomova, Ekaterina",
-booktitle = "Proceedings of the 6th Workshop on Research in Computational Linguistic Typology and Multilingual NLP",
-month = mar,
-year = "2024",
-address = "St. Julian's, Malta",
-publisher = "Association for Computational Linguistics",
-url = "https://aclanthology.org/2024.sigtyp-1.16",
-pages = "131--141",
-}
+SentenceTransformer(
+  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: HLMModel
+  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
+)
 ```
 
-
-
-
-author = "Krahn, Kevin and
-    Tate, Derrick and
-    Lamicela, Andrew C.",
-editor = "Anderson, Adam and
-    Gordin, Shai and
-    Li, Bin and
-    Liu, Yudong and
-    Passarotti, Marco C.",
-booktitle = "Proceedings of the Ancient Language Processing Workshop",
-month = sep,
-year = "2023",
-address = "Varna, Bulgaria",
-publisher = "INCOMA Ltd., Shoumen, Bulgaria",
-url = "https://aclanthology.org/2023.alp-1.2",
-pages = "13--22",
-}
-
-```
+## Citing & Authors
+
+<!--- Describe where people can find more information -->
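The raw-transformers example above keeps cls pooling, which matches the `pooling_mode_cls_token: True` setting in the new Full Model Architecture block. As a complement, here is a minimal sketch of the sentence-transformers path the new card describes. The `trust_remote_code=True` argument is an assumption on my part, since HLMModel and HLMTokenizer ship as custom code in this repo (modeling_hlm.py, tokenization_hlm.py) and current library versions require an explicit opt-in to run it:

```python
# Sketch only, not part of the commit: assumes sentence-transformers >= 2.3
# and that loading this repo's custom HLM code requires trust_remote_code=True.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('kevinkrahn/shlm-grc-en', trust_remote_code=True)

sentences = ['This is an example sentence', 'Each sentence is converted']
embeddings = model.encode(sentences)

print(embeddings.shape)  # expect (2, 768): the card advertises 768-dimensional vectors
```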
config.json
CHANGED
@@ -26,7 +26,7 @@
   "pad_token_id": 0,
   "residual_word_embedding": false,
   "torch_dtype": "float32",
-  "transformers_version": "4.
+  "transformers_version": "4.46.1",
   "type_vocab_size": 2,
   "vocab_size": 512
 }
model.safetensors
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
+oid sha256:c673386ded6f25a1ec74e3cf31b244f46099d196877d3b0c949bd2c7f1e482ef
 size 379310632
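The oid in a Git LFS pointer is the SHA-256 of the stored payload (the 379 MB weights file itself is not in the diff), so the new hash can be verified locally after download. A hypothetical helper, not part of this repo:

```python
# Verifies a downloaded model.safetensors against the oid in the LFS pointer.
import hashlib

def sha256_of(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # read in 1 MiB chunks
            h.update(chunk)
    return h.hexdigest()

# sha256_of("model.safetensors") should print
# "c673386ded6f25a1ec74e3cf31b244f46099d196877d3b0c949bd2c7f1e482ef"
```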
modeling_hlm.py
CHANGED
@@ -27,6 +27,8 @@ class HLMBaseModelOutput(ModelOutput):
 
 
 class HLMEncoder(nn.Module):
+    _dynamic_tied_weights_keys = []
+
     def __init__(self, config) -> None:
         super().__init__()
 
@@ -38,6 +40,17 @@ class HLMEncoder(nn.Module):
                 TransformerBlock(config, bias=i in sandwich_indices) for i in range(config.num_hidden_layers)])
             for i in range(config.sandwich_size):
                 self.layers[sandwich_start_index + i*2+1].make_sandwich(self.layers[sandwich_start_index + i*2])
+                tied_weights_keys = [
+                    'q.weight',
+                    'k.weight',
+                    'v.weight',
+                    'att_proj_linear.weight',
+                    'ff_linear_1.weight',
+                    'ff_linear_2.weight',
+                    'ff_linear_3.weight',
+                ]
+                for key in tied_weights_keys:
+                    self._dynamic_tied_weights_keys.append(f'layers.{sandwich_start_index + i*2}.{key}')
         else:
             self.layers = nn.ModuleList([TransformerBlock(config) for _ in range(config.num_hidden_layers)])
 
@@ -62,8 +75,9 @@ class HLMEncoder(nn.Module):
     def forward(self, hidden_states, attention_mask, freqs_cos, freqs_sin, return_dict=True, output_hidden_states=False):
         all_hidden_states = []
         attn_mask = self._get_attention_mask(attention_mask, hidden_states.dtype)
-        for layer in self.layers:
+        for i, layer in enumerate(self.layers):
             hidden_states = layer(hidden_states, attn_mask, freqs_cos, freqs_sin)
+            #print(f'layer: {i}, bias: {layer.has_bias}, {hidden_states[0][0][0:2]}')
             all_hidden_states.append(hidden_states)
 
         if return_dict:
@@ -86,6 +100,7 @@ class HLMPreTrainedModel(PreTrainedModel):
     base_model_prefix = "hlm"
     _keys_to_ignore_on_load_unexpected = []
     supports_gradient_checkpointing = True
+    _supports_param_buffer_assignment = False
 
     def _init_weights(self, module):
         """Initialize the weights."""
@@ -293,6 +308,9 @@ class TransformerBlock(nn.Module):
     def make_sandwich(self, other):
         assert self.has_bias
         assert not other.has_bias
+
+        # TODO: change this to support buffers, because it breaks if _supports_param_buffer_assignment == True
+        # introduced in transformers 4.43 PR: https://github.com/huggingface/transformers/pull/31771
         self.q.weight = other.q.weight
         self.k.weight = other.k.weight
         self.v.weight = other.v.weight
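For context on these changes: `make_sandwich` ties each sandwich layer pair's projection weights by assigning the same `nn.Parameter` object to both modules, and the TODO in the last hunk notes that the fast loading path introduced in transformers 4.43 (PR #31771) can assign checkpoint tensors as buffers, which silently breaks identity-based ties; the new `_dynamic_tied_weights_keys` bookkeeping and `_supports_param_buffer_assignment = False` opt the model out of that path. A standalone sketch of the tying mechanism, not this repo's code:

```python
# Illustration of identity-based weight tying between two modules, and why a
# loader that assigns fresh tensors per key would break it.
import torch
import torch.nn as nn

biased = nn.Linear(8, 8, bias=True)     # stands in for the sandwich layer with bias
unbiased = nn.Linear(8, 8, bias=False)  # stands in for its bias-free partner

biased.weight = unbiased.weight         # tie: both modules now share one nn.Parameter
assert biased.weight is unbiased.weight

with torch.no_grad():
    unbiased.weight.fill_(0.25)
print(torch.equal(biased.weight, unbiased.weight))  # True: one update, seen by both

# A checkpoint loader that writes a separate tensor into each state-dict key
# (instead of reusing the shared Parameter) leaves the two weights as distinct
# objects again — the failure mode the declared tied-weights keys guard against.
```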
tokenization_hlm.py
CHANGED
@@ -59,8 +59,6 @@ class HLMTokenizer(PreTrainedTokenizer):
 
     vocab_files_names = VOCAB_FILES_NAMES
     model_input_names: List[str] = ["input_ids", "char_input_mask", "word_input_mask", "word_type_ids"]
-    padding_side: str = "right"
-    truncation_side: str = "right"
 
     def __init__(
         self,
@@ -116,7 +114,7 @@ class HLMTokenizer(PreTrainedTokenizer):
             **kwargs,
         )
         self.unk_id = self.vocab["[UNK]"]
-        self.word_cls_token = word_cls_token
+        self.word_cls_token = word_cls_token
         self.word_cls_token_id = self._convert_token_to_id(word_cls_token)
         self.label_pad_token_id = -100
         self.special_ids = [self._convert_token_to_id(token) for token in vocab_data["special_tokens"]]
@@ -374,7 +372,7 @@ class HLMTokenizer(PreTrainedTokenizer):
         encoded_inputs["word_type_ids"] = self.create_token_type_ids_from_sequences(ids, pair_ids, add_special_tokens)
         assert len(encoded_inputs["word_type_ids"]) == len(encoded_inputs["word_input_mask"])
 
-        # Always pad words
+        # Always pad words
         for word in encoded_inputs["input_ids"]:
             if len(word) < self.max_word_length:
                 word.extend([self.pad_token_id] * (self.max_word_length - len(word)))
@@ -394,7 +392,7 @@ class HLMTokenizer(PreTrainedTokenizer):
         )
 
         return batch_outputs
-
+
     def _encode_plus(
         self,
         text: Union[TextInput, PreTokenizedInput, EncodedInput],
@@ -552,7 +550,7 @@ class HLMTokenizer(PreTrainedTokenizer):
 
         return BatchEncoding(batch_outputs)
 
-    def tokenize(self, text: str, pair: Optional[str] = None, add_special_tokens: bool = False, split_long_words: bool = True) -> List[List[str]]:
+    def tokenize(self, text: str, pair: Optional[str] = None, add_special_tokens: bool = False, split_long_words: bool = True, **kwargs) -> List[List[str]]:
         text = unicodedata.normalize('NFKC', text)
         if split_long_words:
             tokenized_text = []
@@ -580,6 +578,7 @@ class HLMTokenizer(PreTrainedTokenizer):
         return_tensors: Optional[Union[str, TensorType]] = None,
         #label_pad_token_id=-100,
         verbose: bool = True,
+        **kwargs
     ) -> BatchEncoding:
         # If we have a list of dicts, let's convert it in a dict of lists
         # We do this to allow using this method as a collate_fn function in PyTorch Dataloader
@@ -630,7 +629,7 @@ class HLMTokenizer(PreTrainedTokenizer):
 
         batch_outputs["word_input_mask"] = \
             [f + [0]*(longest_in_batch - len(f)) for f in encoded_inputs['word_input_mask']]
-
+
         if "word_type_ids" in encoded_inputs:
             batch_outputs["word_type_ids"] = [f + [0]*(longest_in_batch - len(f)) for f in encoded_inputs["word_type_ids"]]
 
@@ -652,13 +651,8 @@ class HLMTokenizer(PreTrainedTokenizer):
                 continue
             labels = encoded_inputs[label_name]
             label_pad_word = [[self.label_pad_token_id]*self.max_word_length]
-
-
-
-
-            else:
-                batch_outputs[label_name] = [
-                    label_pad_word * (longest_in_batch - len(label)) + to_list(label) for label in labels
-                ]
-
+            batch_outputs[label_name] = [
+                to_list(label) + label_pad_word * (longest_in_batch - len(label)) for label in labels
+            ]
+
         return BatchEncoding(batch_outputs, tensor_type=return_tensors)
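The final hunk is the substantive fix here: the surviving removed lines show an else-branch that prepended `label_pad_word` entries (left padding), while `input_ids`, `word_input_mask`, and `word_type_ids` are all padded on the right earlier in the method; the replacement always appends the pad words instead, keeping labels aligned with the inputs. A toy illustration of the new arithmetic, with invented shapes and values:

```python
# Toy numbers (not from the repo) showing label padding after this commit:
# short label sequences gain [-100, ...] pad-words on the right, which the
# loss function ignores, mirroring the right-padding of input_ids.
label_pad_token_id = -100
max_word_length = 4
label_pad_word = [[label_pad_token_id] * max_word_length]

labels = [[7, 2, 3, 0], [5, 9, 0, 0]]  # one sequence of 2 words, chars padded to 4
longest_in_batch = 4                   # longest sequence in the batch, in words

padded = labels + label_pad_word * (longest_in_batch - len(labels))
print(padded)
# [[7, 2, 3, 0], [5, 9, 0, 0], [-100, -100, -100, -100], [-100, -100, -100, -100]]
```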