metadata
tags:
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
  - generated_from_trainer
  - dataset_size:3503
  - loss:MultipleNegativesRankingLoss
base_model: jinaai/jina-embeddings-v3
widget:
  - source_sentence: >-
      ###Question###:Factorising into a Double Bracket-Factorise a quadratic
      expression in the form x² + bx - c-If

      \(

      m^{2}+5 m-14 \equiv(m+a)(m+b)

      \)

      then \( a \times b= \)

      ###Correct Answer###:\( -14 \)

      ###Misconcepted Incorrect answer###:\( 5 \)
    sentences:
      - Does not know that units of volume are usually cubed
      - >-
        Believes the coefficent of x in an expanded quadratic comes from
        multiplying the two numbers in the brackets
      - Does not copy a given method accurately
  - source_sentence: >-
      ###Question###:Rounding to the Nearest Whole (10, 100, etc)-Round
      non-integers to the nearest 10-What is \( \mathbf{8 6 9 8 . 9} \) rounded
      to the nearest ten?

      ###Correct Answer###:\( 8700 \)

      ###Misconcepted Incorrect answer###:\( 8699 \)
    sentences:
      - Rounds to the wrong degree of accuracy (rounds too much)
      - 'Believes division is commutative '
      - Believes that a number divided by itself equals 0
  - source_sentence: >-
      ###Question###:Simultaneous Equations-Solve linear simultaneous equations
      requiring a scaling of both expressions-If five cups of tea and two cups
      of coffee cost \( £ 3.70 \), and two cups of tea and five cups of coffee
      cost \( £ 4.00 \), what is the cost of a cup of tea and a cup of coffee?

      ###Correct Answer###:Tea \( =50 \mathrm{p} \) coffee \( =60 p \)

      ###Misconcepted Incorrect answer###:\( \begin{array}{l}\text { Tea }=0.5
      \\ \text { coffee }=0.6\end{array} \)
    sentences:
      - Misinterprets the meaning of angles on a straight line angle fact
      - Does not include units in answer.
      - Believes midpoint calculation is just half of the difference
  - source_sentence: >-
      ###Question###:Quadratic Sequences-Find the nth term rule for ascending
      quadratic sequences in the form ax² + bx + c-\(

      6,14,28,48,74, \ldots

      \)


      When calculating the nth-term rule of this sequence, what should replace
      the triangle?


      nth-term rule: \( 3 n^{2} \)\( \color{red}\triangle \)  \(n\) \(
      \color{purple}\square \)


      ###Correct Answer###:\( -1 \)

      (or just a - sign)

      ###Misconcepted Incorrect answer###:\[

      +1

      \]

      (or just a + sign)
    sentences:
      - >-
        When finding the differences between terms in a sequence, believes they
        can do so from right to left 
      - >-
        When solving an equation forgets to eliminate the coefficient in front
        of the variable in the last step
      - >-
        Believes parallelogram is the term used to describe two lines at right
        angles
  - source_sentence: >-
      ###Question###:Written Multiplication-Multiply 2 digit integers by 2 digit
      integers using long multiplication-Which working out is correct for $72
      \times 36$?

      ###Correct Answer###:![ Long multiplication for 72 multiplied by 36 with
      correct working and correct final answer. First row of working is correct:
      4 3 2. Second row of working is correct: 2 1 6 0. Final answer is correct:
      2 5 9 2.]()

      ###Misconcepted Incorrect answer###:![ Long multiplication for 72
      multiplied by 36 with incorrect working and incorrect final answer. First
      row of working is incorrect: 4 2 2. Second row of working is incorrect: 2
      7. Final answer is incorrect: 4 4 9.]()
    sentences:
      - >-
        When solving an equation forgets to eliminate the coefficient in front
        of the variable in the last step
      - >-
        Thinks a variable next to a number means addition rather than
        multiplication
      - >-
        When two digits multiply to 10 or more during a multiplication problem,
        does not add carried value to the preceding digit
pipeline_tag: sentence-similarity
library_name: sentence-transformers

SentenceTransformer based on jinaai/jina-embeddings-v3

This is a sentence-transformers model finetuned from jinaai/jina-embeddings-v3. It maps sentences & paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: jinaai/jina-embeddings-v3
  • Maximum Sequence Length: 8194 tokens
  • Output Dimensionality: 1024 dimensions
  • Similarity Function: Cosine Similarity

Model Sources

Full Model Architecture

SentenceTransformer(
  (transformer): Transformer(
    (auto_model): XLMRobertaLoRA(
      (roberta): XLMRobertaModel(
        (embeddings): XLMRobertaEmbeddings(
          (word_embeddings): ParametrizedEmbedding(
            250002, 1024, padding_idx=1
            (parametrizations): ModuleDict(
              (weight): ParametrizationList(
                (0): LoRAParametrization()
              )
            )
          )
          (token_type_embeddings): ParametrizedEmbedding(
            1, 1024
            (parametrizations): ModuleDict(
              (weight): ParametrizationList(
                (0): LoRAParametrization()
              )
            )
          )
        )
        (emb_drop): Dropout(p=0.1, inplace=False)
        (emb_ln): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (encoder): XLMRobertaEncoder(
          (layers): ModuleList(
            (0-23): 24 x Block(
              (mixer): MHA(
                (rotary_emb): RotaryEmbedding()
                (Wqkv): ParametrizedLinearResidual(
                  in_features=1024, out_features=3072, bias=True
                  (parametrizations): ModuleDict(
                    (weight): ParametrizationList(
                      (0): LoRAParametrization()
                    )
                  )
                )
                (inner_attn): FlashSelfAttention(
                  (drop): Dropout(p=0.1, inplace=False)
                )
                (inner_cross_attn): FlashCrossAttention(
                  (drop): Dropout(p=0.1, inplace=False)
                )
                (out_proj): ParametrizedLinear(
                  in_features=1024, out_features=1024, bias=True
                  (parametrizations): ModuleDict(
                    (weight): ParametrizationList(
                      (0): LoRAParametrization()
                    )
                  )
                )
              )
              (dropout1): Dropout(p=0.1, inplace=False)
              (drop_path1): StochasticDepth(p=0.0, mode=row)
              (norm1): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
              (mlp): Mlp(
                (fc1): ParametrizedLinear(
                  in_features=1024, out_features=4096, bias=True
                  (parametrizations): ModuleDict(
                    (weight): ParametrizationList(
                      (0): LoRAParametrization()
                    )
                  )
                )
                (fc2): ParametrizedLinear(
                  in_features=4096, out_features=1024, bias=True
                  (parametrizations): ModuleDict(
                    (weight): ParametrizationList(
                      (0): LoRAParametrization()
                    )
                  )
                )
              )
              (dropout2): Dropout(p=0.1, inplace=False)
              (drop_path2): StochasticDepth(p=0.0, mode=row)
              (norm2): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
            )
          )
        )
        (pooler): XLMRobertaPooler(
          (dense): ParametrizedLinear(
            in_features=1024, out_features=1024, bias=True
            (parametrizations): ModuleDict(
              (weight): ParametrizationList(
                (0): LoRAParametrization()
              )
            )
          )
          (activation): Tanh()
        )
      )
    )
  )
  (pooler): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (normalizer): Normalize()
)
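
The last two modules in the listing, Pooling with pooling_mode_mean_tokens set to True and Normalize, turn per-token vectors into one unit-length sentence vector. A minimal NumPy sketch of that step, assuming a 0/1 attention mask marks the real tokens; the function name and the toy inputs are illustrative and not part of the library:

```python
import numpy as np

def mean_pool_and_normalize(token_embeddings, attention_mask):
    """Average the unmasked token vectors, then scale each result to unit length.

    token_embeddings: array of shape (batch, seq_len, dim)
    attention_mask:   array of shape (batch, seq_len) with 1 for real tokens, 0 for padding
    """
    mask = attention_mask[..., None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # guard against all-padding rows
    pooled = summed / counts
    norms = np.linalg.norm(pooled, axis=1, keepdims=True)
    return pooled / np.clip(norms, 1e-12, None)

# Toy input: one sentence, three token slots (the last one is padding), 2 dimensions.
tokens = np.array([[[1.0, 0.0], [3.0, 4.0], [9.0, 9.0]]])
mask = np.array([[1, 1, 0]])
sentence_vec = mean_pool_and_normalize(tokens, mask)
print(sentence_vec)
```

Because the output is unit length, a plain dot product between two such vectors equals their cosine similarity.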

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

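# Download from the 🤗 Hub (replace the placeholder with this model's Hub ID).
# The jina-embeddings-v3 base model ships custom modeling code, so
# trust_remote_code=True is needed to load it.
model = SentenceTransformer("sentence_transformers_model_id", trust_remote_code=True)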
# Run inference
sentences = [
    '###Question###:Written Multiplication-Multiply 2 digit integers by 2 digit integers using long multiplication-Which working out is correct for $72 \\times 36$?\n###Correct Answer###:![ Long multiplication for 72 multiplied by 36 with correct working and correct final answer. First row of working is correct: 4 3 2. Second row of working is correct: 2 1 6 0. Final answer is correct: 2 5 9 2.]()\n###Misconcepted Incorrect answer###:![ Long multiplication for 72 multiplied by 36 with incorrect working and incorrect final answer. First row of working is incorrect: 4 2 2. Second row of working is incorrect: 2 7. Final answer is incorrect: 4 4 9.]()',
    'When two digits multiply to 10 or more during a multiplication problem, does not add carried value to the preceding digit',
    'Thinks a variable next to a number means addition rather than multiplication',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 1024]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
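
For retrieval-style use, the question text can be assembled in the same layout as the training anchors and the candidate misconceptions ranked by similarity. The sketch below is illustrative only: build_query and rank_candidates are hypothetical helpers, the field names (subject, construct, and so on) are guesses based on the samples in this card, and toy 2-dimensional vectors stand in for model.encode output so that the example runs without downloading the model.

```python
import numpy as np

def build_query(subject, construct, question, correct, incorrect):
    """Hypothetical helper: joins the fields in the layout seen in this card's samples."""
    return (
        f"###Question###:{subject}-{construct}-{question}\n"
        f"###Correct Answer###:{correct}\n"
        f"###Misconcepted Incorrect answer###:{incorrect}"
    )

def rank_candidates(query_vec, candidate_vecs, candidates):
    """Sort candidates by dot product with the query, highest first.

    For unit-length embeddings the dot product equals cosine similarity.
    """
    scores = np.asarray(candidate_vecs) @ np.asarray(query_vec)
    order = np.argsort(-scores)
    return [(candidates[i], float(scores[i])) for i in order]

query = build_query(
    "Rounding to the Nearest Whole (10, 100, etc)",
    "Round non-integers to the nearest 10",
    r"What is \( \mathbf{8 6 9 8 . 9} \) rounded to the nearest ten?",
    r"\( 8700 \)",
    r"\( 8699 \)",
)
candidates = [
    "Believes division is commutative",
    "Rounds to the wrong degree of accuracy (rounds too much)",
    "Believes that a number divided by itself equals 0",
]

# Toy unit vectors, NOT model output; with the real model these would be
# model.encode([query])[0] and model.encode(candidates).
query_vec = np.array([1.0, 0.0])
candidate_vecs = np.array([[0.0, 1.0], [1.0, 0.0], [0.6, 0.8]])

ranking = rank_candidates(query_vec, candidate_vecs, candidates)
for text, score in ranking:
    print(f"{score:.2f}  {text}")
```

The ordering printed here reflects only the toy vectors, not a prediction from the model.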

Training Details

Training Dataset

Unnamed Dataset

  • Size: 3,503 training samples
  • Columns: anchor and positive
  • Approximate statistics based on the first 1000 samples:
    • anchor (string): min 59 tokens, mean 131.26 tokens, max 449 tokens
    • positive (string): min 6 tokens, mean 17.43 tokens, max 46 tokens
  • Samples:
    Sample 1
    anchor:
    ###Question###:Area of Simple Shapes-Calculate the area of a parallelogram where the dimensions are given in the same units-What is the area of this shape? A parallelogram drawn on a square grid in purple with an area of 9 square units. The base is length 3 squares and the perpendicular height is also length 3 squares.
    ###Correct Answer###:\( 9 \)
    ###Misconcepted Incorrect answer###:\( 12 \)
    positive:
    Counts half-squares as full squares when calculating area on a square grid

    Sample 2
    anchor:
    ###Question###:Substitution into Formula-Substitute into simple formulae given in words-A theme park charges \( £ 8 \) entry fee and then \( £ 3 \) for every ride you go on.
    Heena goes on \( 5 \) rides.
    How much does she pay in total?
    ###Correct Answer###:\( £ 23 \)
    ###Misconcepted Incorrect answer###:\( £ 55 \)
    positive:
    Combines variables with constants when writing a formula from a given situation

    Sample 3
    anchor:
    ###Question###:Trial and Improvement and Iterative Methods-Use area to write algebraic expressions-The area of the rectangle on the right is \( 8 \mathrm{~cm}^{2} \).

    Which of the following equations can we write from the information given? A rectangle with the short side labelled \(x\) and the opposite side labelled \(x^2 + 9\).
    ###Correct Answer###:\( x^{3}+9 x=8 \)
    ###Misconcepted Incorrect answer###:\( x^{3}+9=8 \)
    positive:
    Only multiplies the first term in the expansion of a bracket
  • Loss: MultipleNegativesRankingLoss with these parameters:
    {
        "scale": 20.0,
        "similarity_fct": "cos_sim"
    }
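
To make the loss parameters concrete: MultipleNegativesRankingLoss treats each anchor's own positive as the correct choice and every other positive in the batch as a negative, then applies cross-entropy to the scaled cosine similarities. The following NumPy sketch illustrates the idea with scale 20.0 and cosine similarity as listed above; it is a simplified re-implementation for explanation, not the library's code, and the function name and toy batches are made up.

```python
import numpy as np

def mnr_loss(anchors, positives, scale=20.0):
    """In-batch ranking loss: row i of `positives` is the target for row i of `anchors`."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = scale * (a @ p.T)                           # scaled cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))           # targets sit on the diagonal

# Toy batches: anchors aligned with their positives vs. paired with the wrong ones.
anchors = np.array([[1.0, 0.0], [0.0, 1.0]])
loss_aligned = mnr_loss(anchors, np.array([[1.0, 0.0], [0.0, 1.0]]))
loss_swapped = mnr_loss(anchors, np.array([[0.0, 1.0], [1.0, 0.0]]))
print(loss_aligned, loss_swapped)
```

The aligned batch yields a loss near zero, while the swapped batch yields a loss near the scale value, which shows how the scale controls how sharply wrong pairings are penalised.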
    

Training Hyperparameters

Non-Default Hyperparameters

  • num_train_epochs: 10
  • push_to_hub: True
  • batch_sampler: no_duplicates

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: no
  • prediction_loss_only: True
  • per_device_train_batch_size: 8
  • per_device_eval_batch_size: 8
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 5e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 10
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.0
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: False
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: True
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: False
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • eval_use_gather_object: False
  • batch_sampler: no_duplicates
  • multi_dataset_batch_sampler: proportional
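
The combination of learning_rate 5e-05, lr_scheduler_type linear, and warmup_steps 0 implies a learning rate that decays in a straight line from 5e-05 to zero over the run. A small sketch of that schedule follows; the total of 4380 steps is an assumption (3,503 samples, batch size 8, 10 epochs, a single device), and linear_lr is an illustrative function rather than the trainer's own code.

```python
def linear_lr(step, total_steps, base_lr=5e-05, warmup_steps=0):
    """Linear warmup (if any) followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

total_steps = 4380  # assumption: 438 steps per epoch * 10 epochs on one device

for step in (0, 1000, 2190, 4380):
    print(step, linear_lr(step, total_steps))
```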

Training Logs

Epoch Step Training Loss
1.1416 500 0.3244
2.2831 1000 0.1048
3.4247 1500 0.0394
4.5662 2000 0.0211
5.7078 2500 0.0145
6.8493 3000 0.0114
7.9909 3500 0.0106
9.1324 4000 0.0092
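
The fractional epoch values in the log follow from the dataset size and batch size. As a rough check, assuming a single training device (with gradient_accumulation_steps 1 and dataloader_drop_last False, as listed under All Hyperparameters), the arithmetic can be reproduced as follows:

```python
import math

train_samples = 3503
batch_size = 8   # per_device_train_batch_size; one device is assumed
epochs = 10

steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * epochs

logged_steps = [500, 1000, 1500, 2000, 2500, 3000, 3500, 4000]
epoch_at_step = [round(step / steps_per_epoch, 4) for step in logged_steps]

print(steps_per_epoch, total_steps)
print(epoch_at_step)
```

Under these assumptions the computed epoch values match the Epoch column of the log.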

Framework Versions

  • Python: 3.10.12
  • Sentence Transformers: 3.1.1
  • Transformers: 4.45.2
  • PyTorch: 2.5.1+cu121
  • Accelerate: 1.1.1
  • Datasets: 3.1.0
  • Tokenizers: 0.20.3

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}