metadata
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- generated_from_trainer
- dataset_size:40906
- loss:MatryoshkaLoss
- loss:MegaBatchMarginLoss
widget:
- source_sentence: >-
One of three laminate structures that form the spindle pole body; the
inner plaque is in the nucleus.
sentences:
- >-
maturation of SSU-rRNA from tetracistronic rRNA transcript (SSU-rRNA,
5.8S rRNA, 2S rRNA, LSU-rRNA)
- leukotriene receptor activity
- inner plaque of spindle pole body
- source_sentence: >-
The covalent attachment of a myristoyl group to the N-terminal amino acid
residue of a protein.
sentences:
- MHC class I protein complex assembly
- N-terminal protein myristoylation
- neurotrophin receptor activity
- source_sentence: >-
The inner, i.e. lumen-facing, lipid bilayer of the plastid envelope; also
faces the plastid stroma.
sentences:
- plastid inner membrane
- neuron migration involved in retrograde extension
- stomatal complex morphogenesis
- source_sentence: >-
Initiation of a region of tissue in a plant that is composed of one or
more undifferentiated cells capable of undergoing mitosis and
differentiation, thereby effecting growth and development of a plant by
giving rise to more meristem or specialized tissue.
sentences:
- meristem initiation
- polytene chromosome
- cardiac ventricle development
- source_sentence: >-
The sex chromosome present in both sexes of species in which the male is
the heterogametic sex. Two copies of the X chromosome are present in each
somatic cell of females and one copy is present in males.
sentences:
- establishment of cell polarity involved in gastrulation cell migration
- X chromosome
- somatic diversification of immune receptors by N region addition
pipeline_tag: sentence-similarity
library_name: sentence-transformers
metrics:
- src2trg_accuracy
- trg2src_accuracy
- mean_accuracy
model-index:
- name: SentenceTransformer
results:
- task:
type: translation
name: Translation
dataset:
name: Unknown
type: unknown
metrics:
- type: src2trg_accuracy
value: 0.7840546697038724
name: Src2Trg Accuracy
- type: trg2src_accuracy
value: 0.7757023538344723
name: Trg2Src Accuracy
- type: mean_accuracy
value: 0.7798785117691723
name: Mean Accuracy
license: mit
language:
- en
base_model:
- Snowflake/snowflake-arctic-embed-m-v1.5
datasets:
- NothingMuch/GO-Terms
SentenceTransformer
This is a sentence-transformers model fine-tuned from Snowflake/snowflake-arctic-embed-m-v1.5 on a parquet dataset of Gene Ontology (GO) terms (NothingMuch/GO-Terms). It maps sentences and paragraphs to a 128-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
Model Details
Model Description
- Model Type: Sentence Transformer
- Maximum Sequence Length: 512 tokens
- Output Dimensionality: 128 dimensions
- Similarity Function: Cosine Similarity
- Training Dataset: parquet
Model Sources
- Documentation: Sentence Transformers Documentation
- Repository: Sentence Transformers on GitHub
- Hugging Face: Sentence Transformers on Hugging Face
Full Model Architecture
SentenceTransformer(
(0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
(1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
(2): Normalize()
)
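The listing above reads as three stages: a BERT encoder, CLS-token pooling, and L2 normalization. The pure-Python sketch below illustrates only the last two stages on toy data. The function name cls_pool_and_normalize and the 4-dimensional toy vectors (instead of 768) are illustrative assumptions, not part of the library.

```python
import math


def cls_pool_and_normalize(token_embeddings):
    """Keep the first ([CLS]) token vector of each sequence and scale it to unit length."""
    pooled = []
    for sequence in token_embeddings:
        cls_vector = sequence[0]  # CLS pooling: only the first token is used
        norm = math.sqrt(sum(x * x for x in cls_vector))
        # Guard against a zero vector so the division is always defined.
        pooled.append([x / norm if norm > 0 else 0.0 for x in cls_vector])
    return pooled


# Toy input: 2 sequences, 3 tokens each, 4 dimensions per token.
toy_token_embeddings = [
    [[3.0, 4.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0], [0.0, 2.0, 0.0, 0.0]],
    [[0.0, 0.0, 5.0, 12.0], [9.0, 9.0, 9.0, 9.0], [1.0, 0.0, 0.0, 0.0]],
]
sentence_vectors = cls_pool_and_normalize(toy_token_embeddings)
print(sentence_vectors)
```

Because the vectors leave the model with unit length, a dot product between two embeddings equals their cosine similarity.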
Usage
Direct Usage (Sentence Transformers)
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference.
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("NothingMuch/GO-Term-Embeddings")
# Run inference
sentences = [
'The sex chromosome present in both sexes of species in which the male is the heterogametic sex. Two copies of the X chromosome are present in each somatic cell of females and one copy is present in males.',
'X chromosome',
'somatic diversification of immune receptors by N region addition',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 128]
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
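Because training used MatryoshkaLoss at 64 and 32 dimensions (see Training Details), shorter prefixes of the 128-dimensional embedding may also be usable. The following is a minimal sketch, assuming truncation means keeping the first k values and re-normalizing. truncate_and_normalize is a hypothetical helper, and the input is a synthetic stand-in for one row of model.encode output. This card does not report retrieval quality at reduced dimensions.

```python
import math

MATRYOSHKA_DIMS = (64, 32)  # taken from the loss parameters in Training Details


def truncate_and_normalize(embedding, dim):
    """Keep the first `dim` values and rescale them to unit length."""
    prefix = list(embedding[:dim])
    norm = math.sqrt(sum(x * x for x in prefix))
    if norm == 0.0:
        return prefix  # a zero vector cannot be normalized
    return [x / norm for x in prefix]


# Synthetic 128-dimensional vector standing in for one row of model.encode(...).
full_embedding = [math.sin(i + 1) for i in range(128)]

truncated = {dim: truncate_and_normalize(full_embedding, dim) for dim in MATRYOSHKA_DIMS}
for dim, vector in truncated.items():
    print(dim, len(vector))
```

Sentence Transformers also accepts a truncate_dim argument when constructing SentenceTransformer, which may be more convenient than truncating by hand.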
Evaluation
Metrics
Translation
- Evaluated with TranslationEvaluator
Metric | Value |
---|---|
src2trg_accuracy | 0.7841 |
trg2src_accuracy | 0.7757 |
mean_accuracy | 0.7799 |
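As a rough illustration of what these numbers measure: src2trg_accuracy is the fraction of source texts whose most similar target is the one paired with them, trg2src_accuracy is the same in the opposite direction, and mean_accuracy is their average. The sketch below assumes cosine similarity and uses made-up 2-dimensional vectors. retrieval_accuracy is a hypothetical helper, not the evaluator's actual code.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieval_accuracy(queries, candidates):
    """Fraction of queries whose best-scoring candidate sits at the same index."""
    correct = 0
    for i, query in enumerate(queries):
        scores = [cosine(query, candidate) for candidate in candidates]
        best = max(range(len(scores)), key=scores.__getitem__)
        correct += int(best == i)
    return correct / len(queries)


# Made-up embeddings: row i of `definitions` is paired with row i of `term_names`.
definitions = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
term_names = [[0.9, 0.1], [0.6, 0.8], [0.1, 0.9]]

src2trg = retrieval_accuracy(definitions, term_names)
trg2src = retrieval_accuracy(term_names, definitions)
toy_mean_accuracy = (src2trg + trg2src) / 2
print(src2trg, trg2src, toy_mean_accuracy)

# The reported mean_accuracy is the average of the two reported directions.
reported_mean = (0.7840546697038724 + 0.7757023538344723) / 2
print(reported_mean)
```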
Training Details
Training Dataset
parquet
- Dataset: parquet
- Size: 40,906 training samples
- Columns: anchor and positive
- Approximate statistics based on the first 1000 samples:
Column | Type | Min tokens | Mean tokens | Max tokens |
---|---|---|---|---|
anchor | string | 8 | 42.05 | 192 |
positive | string | 3 | 10.48 | 40 |
- Samples:
anchor | positive |
---|---|
Catalysis of the transfer of a mannose residue to an oligosaccharide, forming an alpha-(1->6) linkage. | 1,6-alpha-mannosyltransferase activity |
Catalysis of the hydrolysis of ester linkages within a single-stranded deoxyribonucleic acid molecule by creating internal breaks. | single-stranded DNA specific endodeoxyribonuclease activity |
Catalysis of the hydrolysis of ester linkages within a single-stranded deoxyribonucleic acid molecule by creating internal breaks. | ssDNA-specific endodeoxyribonuclease activity |
- Loss: MatryoshkaLoss with these parameters: {"loss": "MegaBatchMarginLoss", "matryoshka_dims": [64, 32], "matryoshka_weights": [1, 1], "n_dims_per_step": -1}
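The parameters above describe a weighted sum of a base loss evaluated on the first 64 and the first 32 embedding dimensions. The sketch below shows that structure with a simplified margin loss standing in for MegaBatchMarginLoss: a hinge on each true pair plus a hinge on the hardest in-batch negative. The margins 0.8 and 0.3 are assumed values that this card does not state, and all function names are illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def margin_loss(anchors, positives, positive_margin=0.8, negative_margin=0.3):
    """Hinge on the true pair plus hinge on the hardest in-batch negative (assumed margins)."""
    n = len(anchors)  # needs at least 2 pairs so every anchor has a negative
    total = 0.0
    for i in range(n):
        positive_score = cosine(anchors[i], positives[i])
        hardest_negative = max(
            cosine(anchors[i], positives[j]) for j in range(n) if j != i
        )
        total += max(0.0, positive_margin - positive_score)
        total += max(0.0, hardest_negative - negative_margin)
    return total / n


def matryoshka_loss(anchors, positives, dims=(64, 32), weights=(1, 1)):
    """Weighted sum of the base loss evaluated on each embedding prefix."""
    return sum(
        w * margin_loss([a[:d] for a in anchors], [p[:d] for p in positives])
        for d, w in zip(dims, weights)
    )


def one_hot(index, size=128):
    """Toy embedding with a single non-zero entry."""
    return [1.0 if i == index else 0.0 for i in range(size)]


anchors = [one_hot(0), one_hot(1)]
aligned_loss = matryoshka_loss(anchors, [one_hot(0), one_hot(1)])
swapped_loss = matryoshka_loss(anchors, [one_hot(1), one_hot(0)])
print(aligned_loss, swapped_loss)
```

Correctly paired toy embeddings give a loss of zero, while swapping the positives is penalized at both prefix lengths. Cosine similarity is scale-invariant, so the prefixes are not re-normalized here.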
Evaluation Dataset
parquet
- Dataset: parquet
- Size: 6,585 evaluation samples
- Columns: anchor and positive
- Approximate statistics based on the first 1000 samples:
Column | Type | Min tokens | Mean tokens | Max tokens |
---|---|---|---|---|
anchor | string | 8 | 41.29 | 253 |
positive | string | 3 | 9.23 | 44 |
- Samples:
anchor | positive |
---|---|
The maintenance of the structure and integrity of the mitochondrial genome; includes replication and segregation of the mitochondrial chromosome. | mitochondrial genome maintenance |
The repair of single strand breaks in DNA. Repair of such breaks is mediated by the same enzyme systems as are used in base excision repair. | single strand break repair |
Any process that modulates the frequency, rate or extent of DNA recombination, a DNA metabolic process in which a new genotype is formed by reassortment of genes resulting in gene combinations different from those that were present in the parents. | regulation of DNA recombination |
- Loss: MatryoshkaLoss with these parameters: {"loss": "MegaBatchMarginLoss", "matryoshka_dims": [64, 32], "matryoshka_weights": [1, 1], "n_dims_per_step": -1}
Training Hyperparameters
Non-Default Hyperparameters
- eval_strategy: epoch
- per_device_train_batch_size: 32
- per_device_eval_batch_size: 16
- torch_empty_cache_steps: 250
- learning_rate: 0.00025
- lr_scheduler_type: cosine_with_restarts
- warmup_steps: 25
- seed: 25
- load_best_model_at_end: True
- batch_sampler: no_duplicates
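To make the learning-rate settings concrete, here is a sketch of a cosine-with-restarts schedule with linear warmup, using learning_rate 0.00025 and warmup_steps 25 from the list above. The total of 1920 steps is taken from the last row of the training logs, and num_cycles=1 is an assumption (lr_scheduler_kwargs is empty in this card). The function cosine_with_restarts_lr is illustrative and may differ in detail from the scheduler the trainer actually used.

```python
import math


def cosine_with_restarts_lr(step, base_lr=2.5e-4, warmup_steps=25,
                            total_steps=1920, num_cycles=1):
    """Linear warmup to base_lr, then cosine decay that restarts num_cycles times."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    if progress >= 1.0:
        return 0.0
    # Within each cycle the factor falls from 1 to 0 along half a cosine wave.
    factor = 0.5 * (1.0 + math.cos(math.pi * ((num_cycles * progress) % 1.0)))
    return base_lr * max(0.0, factor)


for step in (0, 12, 25, 641, 1282, 1920):
    print(step, cosine_with_restarts_lr(step))
```

With a single cycle there is no restart, so the curve is a plain cosine decay after warmup. Passing a larger num_cycles would show the restarts.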
All Hyperparameters
- overwrite_output_dir: False
- do_predict: False
- eval_strategy: epoch
- prediction_loss_only: True
- per_device_train_batch_size: 32
- per_device_eval_batch_size: 16
- per_gpu_train_batch_size: None
- per_gpu_eval_batch_size: None
- gradient_accumulation_steps: 1
- eval_accumulation_steps: None
- torch_empty_cache_steps: 250
- learning_rate: 0.00025
- weight_decay: 0.0
- adam_beta1: 0.9
- adam_beta2: 0.999
- adam_epsilon: 1e-08
- max_grad_norm: 1.0
- num_train_epochs: 3
- max_steps: -1
- lr_scheduler_type: cosine_with_restarts
- lr_scheduler_kwargs: {}
- warmup_ratio: 0.0
- warmup_steps: 25
- log_level: passive
- log_level_replica: warning
- log_on_each_node: True
- logging_nan_inf_filter: True
- save_safetensors: True
- save_on_each_node: False
- save_only_model: False
- restore_callback_states_from_checkpoint: False
- no_cuda: False
- use_cpu: False
- use_mps_device: False
- seed: 25
- data_seed: None
- jit_mode_eval: False
- use_ipex: False
- bf16: False
- fp16: False
- fp16_opt_level: O1
- half_precision_backend: auto
- bf16_full_eval: False
- fp16_full_eval: False
- tf32: None
- local_rank: 0
- ddp_backend: None
- tpu_num_cores: None
- tpu_metrics_debug: False
- debug: []
- dataloader_drop_last: False
- dataloader_num_workers: 0
- dataloader_prefetch_factor: None
- past_index: -1
- disable_tqdm: False
- remove_unused_columns: True
- label_names: None
- load_best_model_at_end: True
- ignore_data_skip: False
- fsdp: []
- fsdp_min_num_params: 0
- fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- fsdp_transformer_layer_cls_to_wrap: None
- accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- deepspeed: None
- label_smoothing_factor: 0.0
- optim: adamw_torch
- optim_args: None
- adafactor: False
- group_by_length: False
- length_column_name: length
- ddp_find_unused_parameters: None
- ddp_bucket_cap_mb: None
- ddp_broadcast_buffers: False
- dataloader_pin_memory: True
- dataloader_persistent_workers: False
- skip_memory_metrics: True
- use_legacy_prediction_loop: False
- push_to_hub: False
- resume_from_checkpoint: None
- hub_model_id: None
- hub_strategy: every_save
- hub_private_repo: None
- hub_always_push: False
- gradient_checkpointing: False
- gradient_checkpointing_kwargs: None
- include_inputs_for_metrics: False
- include_for_metrics: []
- eval_do_concat_batches: True
- fp16_backend: auto
- push_to_hub_model_id: None
- push_to_hub_organization: None
- mp_parameters:
- auto_find_batch_size: False
- full_determinism: False
- torchdynamo: None
- ray_scope: last
- ddp_timeout: 1800
- torch_compile: False
- torch_compile_backend: None
- torch_compile_mode: None
- dispatch_batches: None
- split_batches: None
- include_tokens_per_second: False
- include_num_input_tokens_seen: False
- neftune_noise_alpha: None
- optim_target_modules: None
- batch_eval_metrics: False
- eval_on_start: False
- use_liger_kernel: False
- eval_use_gather_object: False
- average_tokens_across_devices: False
- prompts: None
- batch_sampler: no_duplicates
- multi_dataset_batch_sampler: proportional
Training Logs
Epoch | Step | Training Loss | Validation Loss | mean_accuracy |
---|---|---|---|---|
1.0016 | 641 | 0.2501 | 0.6276 | 0.7343 |
2.0016 | 1282 | 0.3146 | 0.5520 | 0.7651 |
**2.9969** | **1920** | **0.1976** | **0.5097** | **0.7799** |
- The bold row denotes the saved checkpoint.
Framework Versions
- Python: 3.10.14
- Sentence Transformers: 3.3.1
- Transformers: 4.47.0
- PyTorch: 2.4.0
- Accelerate: 1.2.0
- Datasets: 3.2.0
- Tokenizers: 0.21.0
Citation
BibTeX
Sentence Transformers
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}
MatryoshkaLoss
@misc{kusupati2024matryoshka,
title={Matryoshka Representation Learning},
author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
year={2024},
eprint={2205.13147},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
MegaBatchMarginLoss
@inproceedings{wieting-gimpel-2018-paranmt,
title = "{P}ara{NMT}-50{M}: Pushing the Limits of Paraphrastic Sentence Embeddings with Millions of Machine Translations",
author = "Wieting, John and Gimpel, Kevin",
editor = "Gurevych, Iryna and Miyao, Yusuke",
booktitle = "Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = jul,
year = "2018",
address = "Melbourne, Australia",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/P18-1042",
doi = "10.18653/v1/P18-1042",
pages = "451--462",
}