kevinkrahn committed fbc88e2 (parent: ee1ab63)

Update README.md

Files changed (1): README.md (+66, -29)

README.md (updated):

tags:
- feature-extraction
- sentence-similarity
- transformers
- semantic-search
---

# shlm-grc-en

## Sentence embeddings for English and Ancient Greek

This model creates sentence embeddings in a shared vector space for Ancient Greek and English text.

The base model uses a modified version of the HLM architecture described in [Heidelberg-Boston @ SIGTYP 2024 Shared Task: Enhancing Low-Resource Language Analysis With Character-Aware Hierarchical Transformers](https://aclanthology.org/2024.sigtyp-1.16/).

It is trained to produce sentence embeddings with the multilingual knowledge distillation method and datasets described in [Sentence Embedding Models for Ancient Greek Using Multilingual Knowledge Distillation](https://aclanthology.org/2023.alp-1.2/), and was distilled from `BAAI/bge-base-en-v1.5` as the teacher model.

## Usage (Sentence-Transformers)

**This model is currently incompatible with the latest version of the sentence-transformers library.**

For now, either use HuggingFace Transformers directly (see below) or the following fork of sentence-transformers:
https://github.com/kevinkrahn/sentence-transformers
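The fork can likely be installed straight from GitHub with pip; the exact command below is an assumption about the fork's packaging, not something the model card states:

```
pip install git+https://github.com/kevinkrahn/sentence-transformers
```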
 
You can use the model with sentence-transformers like this:

```python
from sentence_transformers import SentenceTransformer

sentences = ['This is an example sentence', 'Each sentence is converted']

# Load the model from the HuggingFace Hub
model = SentenceTransformer('kevinkrahn/shlm-grc-en')

# Encode the sentences into 768-dimensional embeddings
embeddings = model.encode(sentences)
print(embeddings)
```
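Because English and Ancient Greek sentences are embedded into the same vector space, cross-lingual pairs can be scored directly. Below is a minimal sketch (again requiring the fork above); the sentence pair and the `util.cos_sim` scoring are illustrative, not taken from the model card:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('kevinkrahn/shlm-grc-en')

# An English sentence and its Ancient Greek counterpart (John 1:1)
sentences = ['In the beginning was the Word', 'Ἐν ἀρχῇ ἦν ὁ λόγος']

embeddings = model.encode(sentences)

# Cosine similarity between the English and Greek embeddings
print(util.cos_sim(embeddings[0], embeddings[1]))
```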
 
## Usage (HuggingFace Transformers)

Without sentence-transformers, you can use the model directly with HuggingFace Transformers: pass your inputs through the transformer model, then apply CLS pooling on top of the contextualized token embeddings.

```python
from transformers import AutoTokenizer, AutoModel
import torch


def cls_pooling(model_output):
    # The sentence embedding is the hidden state of the first ([CLS]) token
    return model_output[0][:, 0]


# Sentences we want sentence embeddings for
sentences = ['This is an example sentence', 'Each sentence is converted']

# Load model from HuggingFace Hub
# (trust_remote_code=True is required because the HLM architecture ships as custom code)
model = AutoModel.from_pretrained('kevinkrahn/shlm-grc-en', trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained('kevinkrahn/shlm-grc-en', trust_remote_code=True)

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, cls pooling.
sentence_embeddings = cls_pooling(model_output)

print("Sentence embeddings:")
print(sentence_embeddings)
```
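To compare the raw CLS embeddings, you can L2-normalize them and take dot products. This is a common pattern for bge-derived embedding models, though the normalization step is an assumption rather than something this card prescribes:

```python
import torch.nn.functional as F

# L2-normalize so that cosine similarity reduces to a dot product
normalized = F.normalize(sentence_embeddings, p=2, dim=1)

# Pairwise cosine similarity matrix for all sentences
similarities = normalized @ normalized.T
print(similarities)
```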

## Citing & Authors

```
@inproceedings{riemenschneider-krahn-2024-heidelberg,
    title = "Heidelberg-Boston @ {SIGTYP} 2024 Shared Task: Enhancing Low-Resource Language Analysis With Character-Aware Hierarchical Transformers",
    author = "Riemenschneider, Frederick and
      Krahn, Kevin",
    editor = "Hahn, Michael and
      Sorokin, Alexey and
      Kumar, Ritesh and
      Shcherbakov, Andreas and
      Otmakhova, Yulia and
      Yang, Jinrui and
      Serikov, Oleg and
      Rani, Priya and
      Ponti, Edoardo M. and
      Murado{\u{g}}lu, Saliha and
      Gao, Rena and
      Cotterell, Ryan and
      Vylomova, Ekaterina",
    booktitle = "Proceedings of the 6th Workshop on Research in Computational Linguistic Typology and Multilingual NLP",
    month = mar,
    year = "2024",
    address = "St. Julian's, Malta",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.sigtyp-1.16",
    pages = "131--141",
}
```

```
@inproceedings{krahn-etal-2023-sentence,
    title = "Sentence Embedding Models for {A}ncient {G}reek Using Multilingual Knowledge Distillation",
    author = "Krahn, Kevin and
      Tate, Derrick and
      Lamicela, Andrew C.",
    editor = "Anderson, Adam and
      Gordin, Shai and
      Li, Bin and
      Liu, Yudong and
      Passarotti, Marco C.",
    booktitle = "Proceedings of the Ancient Language Processing Workshop",
    month = sep,
    year = "2023",
    address = "Varna, Bulgaria",
    publisher = "INCOMA Ltd., Shoumen, Bulgaria",
    url = "https://aclanthology.org/2023.alp-1.2",
    pages = "13--22",
}
```