SentenceTransformer based on distilbert/distilbert-base-uncased-finetuned-sst-2-english
This is a sentence-transformers model finetuned from distilbert/distilbert-base-uncased-finetuned-sst-2-english. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
Model Details
Model Description
- Model Type: Sentence Transformer
- Base model: distilbert/distilbert-base-uncased-finetuned-sst-2-english
- Maximum Sequence Length: 512 tokens
- Output Dimensionality: 768 dimensions
- Similarity Function: Cosine Similarity
Model Sources
- Documentation: Sentence Transformers Documentation
- Repository: Sentence Transformers on GitHub
- Hugging Face: Sentence Transformers on Hugging Face
Full Model Architecture
SentenceTransformer(
(0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: DistilBertModel
(1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
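In the architecture above only pooling_mode_mean_tokens is enabled, so each sentence embedding is the average of the token embeddings, with padding positions excluded via the attention mask. The following is a minimal pure-Python sketch of that idea; the function name mean_pool and the toy 2-dimensional vectors are illustrative assumptions, not the library's implementation.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, skipping positions where the mask is 0 (padding)."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                total[i] += value
    # Guard against an all-padding input to avoid dividing by zero.
    count = max(count, 1)
    return [value / count for value in total]


# Toy example (hypothetical values): three real tokens and one padding token.
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]
pooled = mean_pool(tokens, mask)
print(pooled)  # [3.0, 4.0]
```

The padding vector [9.0, 9.0] does not affect the result because its mask entry is 0.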
Usage
Direct Usage (Sentence Transformers)
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference.
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("wasabibish/similarity-code-ai-generated")
# Run inference
sentences = [
'def move_zeroes(nums):\n count = 0\n for i in range(len(nums)):\n if nums[i] != 0:\n nums[count], nums[i]= nums[i], nums[count]\n count += 1\n for i in range(count, len(nums)):\n nums[i] =0\n\ninput = [int(x) for x in input("Enter integers separated by spaces: ").split()]\nmove_zeroes(input)\n\nprint(input)',
'def move_zeros_to_end(lst):\n zero_count = 0\n for i in range(len(lst)):\n if lst[i] != 0:\n lst[i], lst[zero_count] = lst[zero_count], lst[i]\n zero_count += 1\n\n# Test cases\nlst1 = [0, 1, 0, 3, 12]\nmove_zeros_to_end(lst1)\nprint(lst1) # Output: [1, 3, 12, 0, 0]\n\nlst2 = [0, 0, 1]\nmove_zeros_to_end(lst2)\nprint(lst2) # Output: [1, 0, 0]\n',
'using System;\nusing System.Collections.Generic;\n\nclass BracketChecker\n{\n private readonly Dictionary<char, char> bracketPairs = new Dictionary<char, char>\n {\n { \'(\', \')\' },\n { \'[\', \']\' },\n { \'{\', \'}\' }\n };\n\n public bool CheckBalancedBrackets(string input)\n {\n if (string.IsNullOrEmpty(input))\n {\n return true;\n }\n\n Stack<char> stack = new Stack<char>();\n\n foreach (char c in input)\n {\n if (bracketPairs.ContainsValue(c))\n {\n if (stack.Count == 0 || bracketPairs[stack.Peek()] != c)\n {\n return false;\n }\n stack.Pop();\n }\n else if (bracketPairs.ContainsKey(c))\n {\n stack.Push(c);\n }\n }\n\n return stack.Count == 0;\n }\n}\n\nclass Program\n{\n static void Main()\n {\n BracketChecker bracketChecker = new BracketChecker();\n\n string input1 = "(a+[b*c]-{d/e})";\n Console.WriteLine("Input: \\"{0}\\"", input1);\n Console.WriteLine("Output: {0}\\n", bracketChecker.CheckBalancedBrackets(input1));\n\n string input2 = "(a+[b*c)-{d/e}]";\n Console.WriteLine("Input: \\"{0}\\"", input2);\n Console.WriteLine("Output: {0}", bracketChecker.CheckBalancedBrackets(input2));\n }\n}\n',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
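Since the card lists cosine similarity as the similarity function, each entry of the 3 x 3 matrix above should correspond to the cosine of the angle between two embeddings. The sketch below reproduces that computation in pure Python on toy vectors; cosine_similarity and similarity_matrix are hypothetical helper names used only for illustration, and the vectors are made up.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # Convention chosen here for zero vectors.
    return dot / (norm_u * norm_v)


def similarity_matrix(vectors):
    """Pairwise cosine similarities, one row per input vector."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]


# Toy embeddings (hypothetical): the first two point the same way,
# the third is orthogonal to both.
toy = [[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]]
matrix = similarity_matrix(toy)
for row in matrix:
    print(row)
```

Vectors that differ only in length score 1.0, which is why cosine similarity compares direction rather than magnitude.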
Evaluation
Metrics
Semantic Similarity
- Evaluated with EmbeddingSimilarityEvaluator
Metric | Value |
---|---|
pearson_cosine | 0.9 |
spearman_cosine | 0.9014 |
pearson_manhattan | 0.862 |
spearman_manhattan | 0.802 |
pearson_euclidean | 0.8685 |
spearman_euclidean | 0.8234 |
pearson_dot | 0.8495 |
spearman_dot | 0.8948 |
pearson_max | 0.9 |
spearman_max | 0.9014 |
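The pearson_* and spearman_* rows report correlations between the gold scores and the model's similarity scores under each distance or similarity measure. As a rough illustration of the difference between the two, the sketch below computes both in pure Python; the helper names and the four toy score pairs are assumptions for illustration and are not taken from the evaluation data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation: linear agreement between two score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Toy scores (hypothetical): predictions keep the gold ordering
# but are not a perfectly linear function of the gold scores.
gold = [0.0, 0.3, 0.6, 0.9]
predicted = [0.1, 0.2, 0.5, 0.99]
print(pearson(gold, predicted))
print(spearman(gold, predicted))
```

Spearman only rewards getting the ordering right, so it reaches 1.0 here while Pearson stays below 1.0.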
Training Details
Training Dataset
Unnamed Dataset
- Size: 302 training samples
- Columns: sentence1, sentence2, and score
- Approximate statistics based on the first 302 samples:
Property | sentence1 | sentence2 | score |
---|---|---|---|
type | string | string | float |
details | min: 3 tokens, mean: 206.43 tokens, max: 512 tokens | min: 27 tokens, mean: 244.9 tokens, max: 512 tokens | min: 0.0, mean: 0.29, max: 0.9 |
- Samples:
sentence1 | sentence2 | score |

from django.views.generic import ListView
class PersonListView(ListView):
    model = Person
    template_name = 'person_list.html'
    def get_queryset(self):
        return Person.objects.filter(birthdate__year__lte=2005)

from myapp.models import Customer # Import the Customer model from your Django app
def get_customers_with_zip_code_starting_with_123():
    customers = Customer.objects.filter(zip_code__startswith='123').values() # Query to filter customers with zip_code starting with '123'
    return list(customers) # Return a list of dictionaries for matching records

0.4

Welcome to our website!
function createSentence(words, maxChars) {
if (words.length === 0

AAAAAA
#include
#include
class KMP {
public:
std::vector findPatternIndices(const CString& text, const CString& pattern) {
std::vector indices;
if (pattern.IsEmpty()
- Loss: CosineSimilarityLoss with these parameters: { "loss_fct": "torch.nn.modules.loss.MSELoss" }
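With loss_fct set to MSELoss, CosineSimilarityLoss compares the cosine similarity of the two sentence embeddings against the gold score and penalizes the squared difference. The sketch below shows that objective in pure Python on toy vectors; cosine_mse_loss is a hypothetical name, the embeddings are made up, and averaging over the batch is assumed to match the default MSELoss reduction.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_mse_loss(pairs, scores):
    """Mean squared error between cosine(u, v) and the gold score over a batch."""
    errors = [
        (cosine_similarity(u, v) - score) ** 2
        for (u, v), score in zip(pairs, scores)
    ]
    return sum(errors) / len(errors)


# Toy batch (hypothetical embeddings): one identical pair labelled 0.9,
# one orthogonal pair labelled 0.0.
batch = [([1.0, 0.0], [1.0, 0.0]), ([1.0, 0.0], [0.0, 1.0])]
scores = [0.9, 0.0]
loss = cosine_mse_loss(batch, scores)
print(loss)  # approximately 0.005
```

The first pair contributes (1.0 - 0.9)^2 and the second contributes 0, so the loss pushes the cosine of each pair toward its labelled score.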
Evaluation Dataset
Unnamed Dataset
- Size: 76 evaluation samples
- Columns: sentence1, sentence2, and score
- Approximate statistics based on the first 76 samples:
Property | sentence1 | sentence2 | score |
---|---|---|---|
type | string | string | float |
details | min: 5 tokens, mean: 216.92 tokens, max: 512 tokens | min: 54 tokens, mean: 254.78 tokens, max: 512 tokens | min: 0.0, mean: 0.33, max: 0.9 |
- Samples:
sentence1 | sentence2 | score |

function stripHtmlTags(str) {
  return str.replace(/<[^>]*>/g, '');
}
const input = 'Hello World!';
const output = stripHtmlTags(input);
console.log(output);

function stripHtmlTags(input) {
  if (!input) return '';
  const tagRegex = /<[^>]*>/g;
  return input.replace(tagRegex, '');
}

0.6

function getTopThreeWords($text) {
    // Remove punctuation and convert to lowercase
    $words = str_word_count(strtolower(preg_replace('/[^\p{L}\p{N}\s]/u', ' ', $text)), 1);
    // Count the frequency of each word
    $wordFrequency = array_count_values($words);
    // Sort the words by frequency in descending order
    arsort($wordFrequency);
    // Get the top three words
    $topThreeWords = array_slice($wordFrequency, 0, 3, true);
    // Format the output
    $output = [];
    foreach ($topThreeWords as $word => $count) {
        $output[] = "('$word', $count)";
    }
    return '[' . implode(', ', $output) . ']';
}
// Example usage:
$inputText = "The quick brown fox jumps over the lazy dog. The dog was lazy!";
echo getTopThreeWords($inputText);
?>

function countTopWords($inputString) {
    // Convert the input string to lowercase and remove punctuation
    $cleanString = preg_replace("/[\W_]+/", " ", strtolower($inputString));
    // Split the string into an array of words
    $words = explode(" ", $cleanString);
    // Count the frequency of each word
    $wordCount = array_count_values($words);
    // Sort the words by frequency in descending order
    arsort($wordCount);
    // Get the top three most common words
    $topWords = array_slice($wordCount, 0, 3);
    // Format the output as an array of tuples
    $output = [];
    foreach ($topWords as $word => $count) {
        $output[] = [$word, $count];
    }
    return $output;
}
// Test the function with the example input
$inputString = "The quick brown fox jumps over the lazy dog. The dog was lazy!";
$output = countTopWords($inputString);
print_r($output);
?>

0.3

AAAAAA
#include
#include
class KMP {
public:
std::vector findPatternIndices(const CString& text, const CString& pattern) {
std::vector indices;
if (pattern.IsEmpty()
- Loss: CosineSimilarityLoss with these parameters: { "loss_fct": "torch.nn.modules.loss.MSELoss" }
Training Hyperparameters
Non-Default Hyperparameters
- eval_strategy: steps
- weight_decay: 0.2
- max_steps: 100
- warmup_steps: 150
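Note that warmup_steps (150) is larger than max_steps (100). Under a standard linear warmup schedule, this would mean training stops while the learning rate is still ramping up, before it reaches the configured 5e-05. The sketch below illustrates the arithmetic, assuming the schedule follows the usual linear warmup then linear decay formula; linear_lr is a hypothetical helper, not a library function.

```python
def linear_lr(step, base_lr=5e-05, warmup_steps=150, total_steps=100):
    """Learning rate at a given step under linear warmup then linear decay."""
    if step < warmup_steps:
        # Warmup phase: scale linearly from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Decay phase: scale linearly from base_lr down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 20, 40, 60, 80, 100):
    print(step, linear_lr(step))
```

With these values the learning rate at step 100 is about two thirds of 5e-05, so the decay phase is never entered.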
All Hyperparameters
- overwrite_output_dir: False
- do_predict: False
- eval_strategy: steps
- prediction_loss_only: True
- per_device_train_batch_size: 8
- per_device_eval_batch_size: 8
- per_gpu_train_batch_size: None
- per_gpu_eval_batch_size: None
- gradient_accumulation_steps: 1
- eval_accumulation_steps: None
- torch_empty_cache_steps: None
- learning_rate: 5e-05
- weight_decay: 0.2
- adam_beta1: 0.9
- adam_beta2: 0.999
- adam_epsilon: 1e-08
- max_grad_norm: 1.0
- num_train_epochs: 3.0
- max_steps: 100
- lr_scheduler_type: linear
- lr_scheduler_kwargs: {}
- warmup_ratio: 0.0
- warmup_steps: 150
- log_level: passive
- log_level_replica: warning
- log_on_each_node: True
- logging_nan_inf_filter: True
- save_safetensors: True
- save_on_each_node: False
- save_only_model: False
- restore_callback_states_from_checkpoint: False
- no_cuda: False
- use_cpu: False
- use_mps_device: False
- seed: 42
- data_seed: None
- jit_mode_eval: False
- use_ipex: False
- bf16: False
- fp16: False
- fp16_opt_level: O1
- half_precision_backend: auto
- bf16_full_eval: False
- fp16_full_eval: False
- tf32: None
- local_rank: 0
- ddp_backend: None
- tpu_num_cores: None
- tpu_metrics_debug: False
- debug: []
- dataloader_drop_last: False
- dataloader_num_workers: 0
- dataloader_prefetch_factor: None
- past_index: -1
- disable_tqdm: False
- remove_unused_columns: True
- label_names: None
- load_best_model_at_end: False
- ignore_data_skip: False
- fsdp: []
- fsdp_min_num_params: 0
- fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- fsdp_transformer_layer_cls_to_wrap: None
- accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- deepspeed: None
- label_smoothing_factor: 0.0
- optim: adamw_torch
- optim_args: None
- adafactor: False
- group_by_length: False
- length_column_name: length
- ddp_find_unused_parameters: None
- ddp_bucket_cap_mb: None
- ddp_broadcast_buffers: False
- dataloader_pin_memory: True
- dataloader_persistent_workers: False
- skip_memory_metrics: True
- use_legacy_prediction_loop: False
- push_to_hub: False
- resume_from_checkpoint: None
- hub_model_id: None
- hub_strategy: every_save
- hub_private_repo: False
- hub_always_push: False
- gradient_checkpointing: False
- gradient_checkpointing_kwargs: None
- include_inputs_for_metrics: False
- eval_do_concat_batches: True
- fp16_backend: auto
- push_to_hub_model_id: None
- push_to_hub_organization: None
- mp_parameters:
- auto_find_batch_size: False
- full_determinism: False
- torchdynamo: None
- ray_scope: last
- ddp_timeout: 1800
- torch_compile: False
- torch_compile_backend: None
- torch_compile_mode: None
- dispatch_batches: None
- split_batches: None
- include_tokens_per_second: False
- include_num_input_tokens_seen: False
- neftune_noise_alpha: None
- optim_target_modules: None
- batch_eval_metrics: False
- eval_on_start: False
- eval_use_gather_object: False
- batch_sampler: batch_sampler
- multi_dataset_batch_sampler: proportional
Training Logs
Epoch | Step | loss | spearman_max |
---|---|---|---|
0.5263 | 20 | 0.3765 | 0.5421 |
1.0526 | 40 | 0.1518 | 0.5774 |
1.5789 | 60 | 0.0501 | 0.8533 |
2.1053 | 80 | 0.0217 | 0.8900 |
2.6316 | 100 | 0.0168 | 0.9014 |
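The fractional epoch values in the log are consistent with 302 training samples and a per-device batch size of 8, which gives 38 optimizer steps per epoch. The short sketch below reproduces them, assuming a single device and gradient_accumulation_steps of 1; epoch_at_step is a hypothetical helper name.

```python
import math


def epoch_at_step(step, num_samples=302, batch_size=8):
    """Fractional epoch reached after a given number of optimizer steps."""
    steps_per_epoch = math.ceil(num_samples / batch_size)  # 38 for this run
    return step / steps_per_epoch


for step in (20, 40, 60, 80, 100):
    print(step, round(epoch_at_step(step), 4))
# 20 0.5263
# 40 1.0526
# 60 1.5789
# 80 2.1053
# 100 2.6316
```

Because max_steps is 100, training ends partway through the third epoch, short of the configured num_train_epochs of 3.0.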
Framework Versions
- Python: 3.9.10
- Sentence Transformers: 3.1.0
- Transformers: 4.44.2
- PyTorch: 2.4.1+cpu
- Accelerate: 0.34.2
- Datasets: 3.0.0
- Tokenizers: 0.19.1
Citation
BibTeX
Sentence Transformers
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}