GO-Term-Embeddings / README.md
metadata
tags:
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
  - generated_from_trainer
  - dataset_size:40906
  - loss:MatryoshkaLoss
  - loss:MegaBatchMarginLoss
widget:
  - source_sentence: >-
      One of three laminate structures that form the spindle pole body; the
      inner plaque is in the nucleus.
    sentences:
      - >-
        maturation of SSU-rRNA from tetracistronic rRNA transcript (SSU-rRNA,
        5.8S rRNA, 2S rRNA, LSU-rRNA)
      - leukotriene receptor activity
      - inner plaque of spindle pole body
  - source_sentence: >-
      The covalent attachment of a myristoyl group to the N-terminal amino acid
      residue of a protein.
    sentences:
      - MHC class I protein complex assembly
      - N-terminal protein myristoylation
      - neurotrophin receptor activity
  - source_sentence: >-
      The inner, i.e. lumen-facing, lipid bilayer of the plastid envelope; also
      faces the plastid stroma.
    sentences:
      - plastid inner membrane
      - neuron migration involved in retrograde extension
      - stomatal complex morphogenesis
  - source_sentence: >-
      Initiation of a region of tissue in a plant that is composed of one or
      more undifferentiated cells capable of undergoing mitosis and
      differentiation, thereby effecting growth and development of a plant by
      giving rise to more meristem or specialized tissue.
    sentences:
      - meristem initiation
      - polytene chromosome
      - cardiac ventricle development
  - source_sentence: >-
      The sex chromosome present in both sexes of species in which the male is
      the heterogametic sex. Two copies of the X chromosome are present in each
      somatic cell of females and one copy is present in males.
    sentences:
      - establishment of cell polarity involved in gastrulation cell migration
      - X chromosome
      - somatic diversification of immune receptors by N region addition
pipeline_tag: sentence-similarity
library_name: sentence-transformers
metrics:
  - src2trg_accuracy
  - trg2src_accuracy
  - mean_accuracy
model-index:
  - name: SentenceTransformer
    results:
      - task:
          type: translation
          name: Translation
        dataset:
          name: Unknown
          type: unknown
        metrics:
          - type: src2trg_accuracy
            value: 0.7840546697038724
            name: Src2Trg Accuracy
          - type: trg2src_accuracy
            value: 0.7757023538344723
            name: Trg2Src Accuracy
          - type: mean_accuracy
            value: 0.7798785117691723
            name: Mean Accuracy
license: mit
language:
  - en
base_model:
  - Snowflake/snowflake-arctic-embed-m-v1.5
datasets:
  - NothingMuch/GO-Terms

SentenceTransformer

This is a sentence-transformers model trained on the NothingMuch/GO-Terms dataset (loaded from parquet), with Snowflake/snowflake-arctic-embed-m-v1.5 listed as its base model. It maps sentences and paragraphs to a 128-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Maximum Sequence Length: 512 tokens
  • Output Dimensionality: 128 dimensions
  • Similarity Function: Cosine Similarity
  • Training Dataset:
    • parquet

Model Sources

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
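
Read top to bottom, the three modules run the BERT encoder, keep only the hidden state of the first ([CLS]) token, and scale that vector to unit length. The sketch below illustrates the last two steps in plain Python, with toy 2-dimensional vectors standing in for the 768-dimensional token states. The function name `cls_pool_and_normalize` is made up for illustration and is not part of the library.

```python
import math

def cls_pool_and_normalize(token_states):
    """Mimic Pooling(pooling_mode_cls_token=True) followed by Normalize().

    token_states: per-token vectors for ONE sentence, where
    token_states[0] is the hidden state of the [CLS] token.
    """
    cls_vector = token_states[0]                      # CLS pooling: keep the first token only
    norm = math.sqrt(sum(x * x for x in cls_vector))  # L2 norm of that vector
    if norm == 0.0:
        return list(cls_vector)                       # nothing to scale, avoid dividing by zero
    return [x / norm for x in cls_vector]

# Toy input: three "tokens" with 2-dimensional hidden states (illustrative values).
sentence_embedding = cls_pool_and_normalize([[3.0, 4.0], [1.0, 1.0], [0.0, 2.0]])
```

Because every output has unit length, the cosine similarity of two embeddings equals their dot product.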

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("NothingMuch/GO-Term-Embeddings")
# Run inference
sentences = [
    'The sex chromosome present in both sexes of species in which the male is the heterogametic sex. Two copies of the X chromosome are present in each somatic cell of females and one copy is present in males.',
    'X chromosome',
    'somatic diversification of immune receptors by N region addition',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 128)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])
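
A typical use of these embeddings is to match a GO definition to the closest term name. The sketch below shows one way to do that. `best_match` is plain Python and works on any vectors, while `match_definition` is a hypothetical helper that loads the model lazily, so the first two functions can be tried without downloading anything. All three function names are illustrative assumptions, not part of Sentence Transformers.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def best_match(query_vec, candidate_vecs):
    """Return (index, score) of the candidate most similar to the query."""
    scores = [cosine(query_vec, c) for c in candidate_vecs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]

def match_definition(definition, term_names, model_name="NothingMuch/GO-Term-Embeddings"):
    """Hypothetical helper: embed a definition and candidate names, return the closest name."""
    from sentence_transformers import SentenceTransformer  # imported lazily on purpose
    model = SentenceTransformer(model_name)
    vectors = model.encode([definition] + list(term_names))  # one row per input text
    index, score = best_match(vectors[0].tolist(), [v.tolist() for v in vectors[1:]])
    return term_names[index], score
```

With the model available, `match_definition("The covalent attachment of a myristoyl group to the N-terminal amino acid residue of a protein.", ["MHC class I protein complex assembly", "N-terminal protein myristoylation"])` returns the best-scoring name and its similarity.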

Evaluation

Metrics

Translation

| Metric | Value |
|:--|:--|
| src2trg_accuracy | 0.7841 |
| trg2src_accuracy | 0.7757 |
| mean_accuracy | 0.7799 |
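
The metric names suggest these numbers come from the Sentence Transformers `TranslationEvaluator`. Assuming so, src2trg_accuracy is the fraction of source texts whose most similar target is the aligned one, trg2src_accuracy is the same in the reverse direction, and mean_accuracy is their average. The sketch below computes all three from a similarity matrix. The function name and the toy matrix are illustrative, and which column plays "source" here (presumably the definitions) is not stated in this card.

```python
def translation_accuracy(similarity):
    """similarity[i][j] is the similarity between source i and target j.

    Source i is assumed to be aligned with target i (the diagonal).
    Returns (src2trg, trg2src, mean).
    """
    n = len(similarity)
    # For each source row, is the best-scoring target the aligned one?
    src2trg_hits = sum(
        1 for i in range(n)
        if max(range(n), key=lambda j: similarity[i][j]) == i
    )
    # For each target column, is the best-scoring source the aligned one?
    trg2src_hits = sum(
        1 for j in range(n)
        if max(range(n), key=lambda i: similarity[i][j]) == j
    )
    src2trg = src2trg_hits / n
    trg2src = trg2src_hits / n
    return src2trg, trg2src, (src2trg + trg2src) / 2

# Toy 3x3 matrix (illustrative values only).
toy_similarity = [
    [0.90, 0.10, 0.20],
    [0.30, 0.80, 0.40],
    [0.95, 0.20, 0.70],
]
src2trg, trg2src, mean_acc = translation_accuracy(toy_similarity)
```

The reported values are consistent with this reading: (0.7840546697038724 + 0.7757023538344723) / 2 matches the reported mean_accuracy of 0.7798785117691723.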

Training Details

Training Dataset

parquet

  • Dataset: parquet
  • Size: 40,906 training samples
  • Columns: anchor and positive
  • Approximate statistics based on the first 1000 samples:
    • anchor (string): min 8 tokens, mean 42.05 tokens, max 192 tokens
    • positive (string): min 3 tokens, mean 10.48 tokens, max 40 tokens
  • Samples:
    • anchor: Catalysis of the transfer of a mannose residue to an oligosaccharide, forming an alpha-(1->6) linkage.
      positive: 1,6-alpha-mannosyltransferase activity
    • anchor: Catalysis of the hydrolysis of ester linkages within a single-stranded deoxyribonucleic acid molecule by creating internal breaks.
      positive: single-stranded DNA specific endodeoxyribonuclease activity
    • anchor: Catalysis of the hydrolysis of ester linkages within a single-stranded deoxyribonucleic acid molecule by creating internal breaks.
      positive: ssDNA-specific endodeoxyribonuclease activity
  • Loss: MatryoshkaLoss with these parameters:
    {
        "loss": "MegaBatchMarginLoss",
        "matryoshka_dims": [
            64,
            32
        ],
        "matryoshka_weights": [
            1,
            1
        ],
        "n_dims_per_step": -1
    }
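
MatryoshkaLoss applies the inner loss to truncated prefixes of each embedding, here the first 64 and the first 32 dimensions, so that those prefixes remain usable on their own. A minimal sketch of what truncation looks like at inference time, using a toy 4-dimensional vector in place of a real embedding; the function name is illustrative:

```python
import math

def truncate_embedding(embedding, dim):
    """Keep the first `dim` components and rescale the result to unit length."""
    head = list(embedding[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # nothing to rescale
    return [x / norm for x in head]

# Toy unit-length vector standing in for a model output (illustrative values).
full_embedding = [0.5, 0.5, 0.5, 0.5]
short_embedding = truncate_embedding(full_embedding, 2)
```

Re-normalizing after truncation keeps cosine similarity and dot product interchangeable. Recent Sentence Transformers releases also accept a `truncate_dim` argument when constructing `SentenceTransformer`, which may be a more convenient route if the installed version supports it.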
    

Evaluation Dataset

parquet

  • Dataset: parquet
  • Size: 6,585 evaluation samples
  • Columns: anchor and positive
  • Approximate statistics based on the first 1000 samples:
    • anchor (string): min 8 tokens, mean 41.29 tokens, max 253 tokens
    • positive (string): min 3 tokens, mean 9.23 tokens, max 44 tokens
  • Samples:
    • anchor: The maintenance of the structure and integrity of the mitochondrial genome; includes replication and segregation of the mitochondrial chromosome.
      positive: mitochondrial genome maintenance
    • anchor: The repair of single strand breaks in DNA. Repair of such breaks is mediated by the same enzyme systems as are used in base excision repair.
      positive: single strand break repair
    • anchor: Any process that modulates the frequency, rate or extent of DNA recombination, a DNA metabolic process in which a new genotype is formed by reassortment of genes resulting in gene combinations different from those that were present in the parents.
      positive: regulation of DNA recombination
  • Loss: MatryoshkaLoss with these parameters:
    {
        "loss": "MegaBatchMarginLoss",
        "matryoshka_dims": [
            64,
            32
        ],
        "matryoshka_weights": [
            1,
            1
        ],
        "n_dims_per_step": -1
    }
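
The inner loss, MegaBatchMarginLoss, pairs each anchor with its hardest in-batch negative: the non-matching positive it is most similar to. The sketch below is a simplified plain-Python reading of that idea, not the library implementation. The margin values 0.8 and 0.3 are assumptions chosen for illustration; this card does not state which margins were used.

```python
def mega_batch_margin_loss(similarity, positive_margin=0.8, negative_margin=0.3):
    """Simplified margin loss over one batch (needs at least two pairs).

    similarity[i][j] is the cosine similarity of anchor i and positive j.
    The diagonal holds the true pairs; every other entry is an in-batch negative.
    Margins are assumed values, not taken from this model card.
    """
    n = len(similarity)
    total = 0.0
    for i in range(n):
        positive_score = similarity[i][i]
        # Hardest negative: the wrong positive this anchor is most similar to.
        hardest_negative = max(similarity[i][j] for j in range(n) if j != i)
        # Penalize true pairs scoring below the positive margin ...
        total += max(0.0, positive_margin - positive_score)
        # ... and hardest negatives scoring above the negative margin.
        total += max(0.0, hardest_negative - negative_margin)
    return total / n

# Toy batch of two pairs (illustrative values only).
toy_loss = mega_batch_margin_loss([[0.9, 0.5], [0.2, 0.6]])
```

In the toy batch, the first anchor is penalized for a negative that is too similar (0.5 against a 0.3 margin), and the second for a true pair that is not similar enough (0.6 against a 0.8 margin).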
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: epoch
  • per_device_train_batch_size: 32
  • per_device_eval_batch_size: 16
  • torch_empty_cache_steps: 250
  • learning_rate: 0.00025
  • lr_scheduler_type: cosine_with_restarts
  • warmup_steps: 25
  • seed: 25
  • load_best_model_at_end: True
  • batch_sampler: no_duplicates
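
The learning-rate settings above combine a 25-step linear warmup with a cosine-with-restarts decay. The sketch below reproduces the shape of that schedule in plain Python. It assumes a single cycle (`lr_scheduler_kwargs` is empty, so no cycle count is given) and a total of 1920 steps, the last step shown in the training logs; the function name is illustrative.

```python
import math

def lr_at_step(step, peak_lr=0.00025, warmup_steps=25, total_steps=1920, num_cycles=1):
    """Learning rate at a given optimizer step.

    Linear warmup to peak_lr, then cosine decay that restarts num_cycles times.
    num_cycles=1 and total_steps=1920 are assumptions, see the text above.
    """
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)          # linear warmup
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    if progress >= 1.0:
        return 0.0                                            # schedule finished
    # Position inside the current cycle, in [0, 1).
    cycle_position = (num_cycles * progress) % 1.0
    return peak_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * cycle_position)))

# Sample the schedule at a few points: start, end of warmup, midpoint, end.
schedule_samples = [lr_at_step(s) for s in (0, 25, 972, 1920)]
```

With a single cycle there is no actual restart, so the curve is a plain cosine decay from the peak rate to zero after warmup.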

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: epoch
  • prediction_loss_only: True
  • per_device_train_batch_size: 32
  • per_device_eval_batch_size: 16
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: 250
  • learning_rate: 0.00025
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 3
  • max_steps: -1
  • lr_scheduler_type: cosine_with_restarts
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.0
  • warmup_steps: 25
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 25
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: False
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: True
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: no_duplicates
  • multi_dataset_batch_sampler: proportional

Training Logs

| Epoch | Step | Training Loss | Validation Loss | mean_accuracy |
|:--|:--|:--|:--|:--|
| 1.0016 | 641 | 0.2501 | 0.6276 | 0.7343 |
| 2.0016 | 1282 | 0.3146 | 0.5520 | 0.7651 |
| **2.9969** | **1920** | **0.1976** | **0.5097** | **0.7799** |

  • The bold row denotes the saved checkpoint.

Framework Versions

  • Python: 3.10.14
  • Sentence Transformers: 3.3.1
  • Transformers: 4.47.0
  • PyTorch: 2.4.0
  • Accelerate: 1.2.0
  • Datasets: 3.2.0
  • Tokenizers: 0.21.0

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MatryoshkaLoss

@misc{kusupati2024matryoshka,
    title={Matryoshka Representation Learning},
    author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year={2024},
    eprint={2205.13147},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}

MegaBatchMarginLoss

@inproceedings{wieting-gimpel-2018-paranmt,
    title = "{P}ara{NMT}-50{M}: Pushing the Limits of Paraphrastic Sentence Embeddings with Millions of Machine Translations",
    author = "Wieting, John and Gimpel, Kevin",
    editor = "Gurevych, Iryna and Miyao, Yusuke",
    booktitle = "Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2018",
    address = "Melbourne, Australia",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/P18-1042",
    doi = "10.18653/v1/P18-1042",
    pages = "451--462",
}