---
license: mit
datasets:
- mteb/sts12-sts
metrics:
- accuracy
base_model:
- sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
library_name: transformers
---

# Model Description

This model is a fine-tuned version of sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 for sentence similarity tasks. It was trained on the mteb/stsbenchmark-sts dataset to predict the similarity between pairs of sentences.

- **Model Type:** Sequence Classification (Regression)
- **Pre-trained Model:** sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
- **Fine-Tuning Dataset:** mteb/stsbenchmark-sts
- **Task:** Sentence similarity (regression)

# Training Details

- **Training Objective:** Predict the similarity score between pairs of sentences.
- **Training Data:** mteb/stsbenchmark-sts, which contains sentence pairs annotated with similarity scores.
- **Number of Labels:** 1 (regression)
- **Epochs:** 2
- **Batch Size:** 8
- **Learning Rate:** 2e-5
- **Weight Decay:** 0.01

# Evaluation

The model was evaluated with Pearson correlation on the validation split of the mteb/stsbenchmark-sts dataset, which measures how closely the predicted similarity scores track the human-annotated scores for sentence pairs.

# Usage

To use this model for sentence similarity, follow these steps:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the fine-tuned model and tokenizer
model = AutoModelForSequenceClassification.from_pretrained("./paraphraser_model")
tokenizer = AutoTokenizer.from_pretrained("./paraphraser_model")

sentences = [
    "The quick brown fox jumps over the lazy dog.",
    "A fast dark-colored fox leaps over a sleeping dog.",
]

# Encode the sentence pair as a single input
encoded_input = tokenizer(
    sentences[0],
    sentences[1],
    return_tensors="pt",
    truncation=True,
    padding="max_length",
    max_length=128,
)

# Perform inference and map the logit to a similarity score
with torch.no_grad():
    model_output = model(**encoded_input)

logits = model_output.logits
similarity_score = torch.sigmoid(logits).item()

print(f"Similarity score between the two sentences: {similarity_score}")
```

If you instead use the model to generate sentence embeddings, you can apply the following mean pooling function to the token embeddings (a complete embedding example is sketched in the appendix at the end of this card):

```python
def mean_pooling(model_output, attention_mask):
    # The first element of model_output contains the token embeddings
    token_embeddings = model_output[0]
    input_mask_expanded = attention_mask.unsqueeze(-1).float()
    sum_embeddings = torch.sum(token_embeddings * input_mask_expanded, dim=1)
    sum_mask = torch.clamp(input_mask_expanded.sum(dim=1), min=1e-9)
    return sum_embeddings / sum_mask
```

# Limitations

- **Domain Specificity:** The model is fine-tuned on the mteb/stsbenchmark-sts dataset and may perform differently on other types of text or datasets.
- **Biases:** As with any model trained on human language data, it may inherit and reflect biases present in the training data.

# Future Work

Potential improvements include fine-tuning on additional datasets, experimenting with different architectures or hyperparameters, and incorporating additional training techniques to improve performance and robustness.

# Citation

If you use this model in your research, please cite it as follows:

```bibtex
@inproceedings{your_paper,
  title={Fine-Tuned Paraphrase-Multilingual-MiniLM-L12-v2 for Sentence Similarity},
  author={Your Name},
  year={2024},
  publisher={Your Institution}
}
```

# License

This model is licensed under the MIT License. See the LICENSE file for more information.
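
# Appendix: Sentence Embedding Example

The mean pooling function shown in the Usage section is only useful together with the encoder's token embeddings. The sketch below shows one way to put it to work; it is illustrative rather than part of the released card, and it assumes the fine-tuned checkpoint in `./paraphraser_model` can be loaded with `AutoModel` (which loads the encoder and ignores the regression head). It compares the two sentences with cosine similarity over pooled embeddings instead of the regression head used above.

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel


def mean_pooling(model_output, attention_mask):
    # Average the token embeddings, ignoring padding positions
    token_embeddings = model_output[0]
    input_mask_expanded = attention_mask.unsqueeze(-1).float()
    sum_embeddings = torch.sum(token_embeddings * input_mask_expanded, dim=1)
    sum_mask = torch.clamp(input_mask_expanded.sum(dim=1), min=1e-9)
    return sum_embeddings / sum_mask


# Assumption: loading the fine-tuned checkpoint with AutoModel exposes the encoder only
tokenizer = AutoTokenizer.from_pretrained("./paraphraser_model")
encoder = AutoModel.from_pretrained("./paraphraser_model")

sentences = [
    "The quick brown fox jumps over the lazy dog.",
    "A fast dark-colored fox leaps over a sleeping dog.",
]

# Encode each sentence separately (not as a pair) to get one embedding per sentence
encoded = tokenizer(sentences, return_tensors="pt", truncation=True, padding=True, max_length=128)

with torch.no_grad():
    output = encoder(**encoded)

embeddings = mean_pooling(output, encoded["attention_mask"])
embeddings = F.normalize(embeddings, p=2, dim=1)

# Cosine similarity of L2-normalized embeddings is just their dot product
cosine_similarity = (embeddings[0] @ embeddings[1]).item()
print(f"Cosine similarity between the two sentences: {cosine_similarity}")
```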