---
language: en
license: mit
library_name: transformers
tags:
- generated_from_trainer
- text-classification
- fill-mask
- embeddings
metrics:
- accuracy
model-index:
- name: snowflake-arctic-embed-xs-zyda-2
  results:
  - task:
      type: text-classification
      name: Text Classification
    dataset:
      name: Zyphra/Zyda-2 (subset)
      type: Zyphra/Zyda-2
    metrics:
    - type: accuracy
      value: 0.4676
      name: Accuracy
base_model: Snowflake/snowflake-arctic-embed-xs
---
# snowflake-arctic-embed-xs-zyda-2
## Model Description
This model is a fine-tuned version of [Snowflake/snowflake-arctic-embed-xs](https://huggingface.co/Snowflake/snowflake-arctic-embed-xs) on a subset of the [Zyphra/Zyda-2](https://huggingface.co/datasets/Zyphra/Zyda-2) dataset. It was trained with the masked language modeling (MLM) objective to strengthen its general-purpose English representations.
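The training script is not included with this card, but MLM fine-tuning in `transformers` is typically done by pairing the model with `DataCollatorForLanguageModeling`, which corrupts a random fraction of input tokens and scores the model only on those positions. Below is a minimal sketch of that setup; the 15% masking rate is the library default, not a confirmed training detail:
```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("agentlans/snowflake-arctic-embed-xs-zyda-2")

# Replaces ~15% of tokens with [MASK] (or a random/unchanged token) and
# sets labels to -100 everywhere except the corrupted positions.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

batch = collator([tokenizer("Paris is the capital of France.")])
print(batch["input_ids"])  # some tokens replaced by [MASK]
print(batch["labels"])     # -100 except at the masked positions
```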
## Performance
The model achieves the following results on the evaluation set:
- Loss: 3.0689
- Accuracy: 0.4676
## Intended Uses & Limitations
This model is designed to be used directly, or further fine-tuned, for the following tasks:
- Text embedding
- Text classification
- Fill-in-the-blank tasks
**Limitations:**
- English language only
- May be inaccurate for specialized jargon, dialects, slang, code, and LaTeX
## Training Data
The model was trained on the first 300,000 rows of the [Zyphra/Zyda-2](https://huggingface.co/datasets/Zyphra/Zyda-2) dataset, with 5% of that data held out for validation.
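A split along these lines can be reproduced with the `datasets` library. This is a sketch of the described split, not the exact preprocessing: the card does not say which Zyda-2 config was used, and `load_dataset` may require a config name.
```python
from datasets import load_dataset

# First 300,000 rows; the exact Zyda-2 config used for training is not
# specified on the card, and a config name may be required here.
dataset = load_dataset("Zyphra/Zyda-2", split="train[:300000]")

# Hold out 5% for validation (this seed is illustrative, not documented).
splits = dataset.train_test_split(test_size=0.05, seed=42)
train_ds, eval_ds = splits["train"], splits["test"]
```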
## Training Procedure
### Hyperparameters
The following hyperparameters were used during training (a `TrainingArguments` sketch follows the list):
- Learning rate: 5e-05
- Train batch size: 8
- Eval batch size: 8
- Seed: 42
- Optimizer: Adam with betas=(0.9, 0.999) and epsilon=1e-08
- Learning rate scheduler: Linear
- Number of epochs: 1.0
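Mapped onto `transformers.TrainingArguments`, these settings look roughly as follows. This is a reconstruction, not the original training script; the Adam betas and epsilon listed above are the `Trainer` defaults, so they need no explicit arguments:
```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="snowflake-arctic-embed-xs-zyda-2",
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    seed=42,
    lr_scheduler_type="linear",
    num_train_epochs=1.0,
    # Adam beta1=0.9, beta2=0.999, epsilon=1e-8 are the defaults
)
```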
### Framework Versions
- Transformers: 4.44.2
- PyTorch: 2.5.1+cu124
- Datasets: 3.1.0
- Tokenizers: 0.19.1
## Usage Examples
### Masked Language Modeling
```python
from transformers import pipeline

# Fill-mask pipeline built on this model's MLM head
unmasker = pipeline('fill-mask', model='agentlans/snowflake-arctic-embed-xs-zyda-2')

# Predict the most likely tokens for the [MASK] position
result = unmasker("[MASK] is the capital of France.")
print(result)
```
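The pipeline returns a ranked list of candidate fills, each a dictionary containing the predicted `token_str`, its probability `score`, and the completed `sequence`.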
### Text Embedding
```python
from transformers import AutoTokenizer, AutoModel
import torch

model_name = "agentlans/snowflake-arctic-embed-xs-zyda-2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

text = "Example sentence for embedding."
inputs = tokenizer(text, return_tensors='pt')

# Run the encoder without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool the token embeddings into one sentence vector
embeddings = outputs.last_hidden_state.mean(dim=1)
print(embeddings)
```
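Mean pooling is shown above for simplicity; depending on your application you may prefer CLS-token pooling (which the Arctic-embed base models use for retrieval) or L2-normalizing the vectors before computing cosine similarity.
### Text Classification
The model ships without a classification head, so using it for text classification means attaching and fine-tuning a new head. A minimal sketch follows; `num_labels=2` is a placeholder for your own task:
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "agentlans/snowflake-arctic-embed-xs-zyda-2"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# num_labels is a placeholder; the new classification head is randomly
# initialized and must be fine-tuned on labeled data before use.
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
```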
## Ethical Considerations and Bias
As this model is trained on a subset of the Zyda-2 dataset, it may inherit biases present in that data. Users should be aware of potential biases and evaluate the model's output critically, especially for sensitive applications.
## Additional Information
For more details about the base model, please refer to [Snowflake/snowflake-arctic-embed-xs](https://huggingface.co/Snowflake/snowflake-arctic-embed-xs).