---
language: en
license: mit
library_name: transformers
tags:
- generated_from_trainer
- text-classification
- fill-mask
- embeddings
metrics:
- accuracy
model-index:
- name: snowflake-arctic-embed-xs-zyda-2
  results:
  - task:
      type: text-classification
      name: Text Classification
    dataset:
      name: Zyphra/Zyda-2 (subset)
      type: Zyphra/Zyda-2
    metrics:
    - type: accuracy
      value: 0.4676
      name: Accuracy
base_model: Snowflake/snowflake-arctic-embed-xs
---

# snowflake-arctic-embed-xs-zyda-2

## Model Description

This model is a fine-tuned version of [Snowflake/snowflake-arctic-embed-xs](https://huggingface.co/Snowflake/snowflake-arctic-embed-xs) on a subset of the [Zyphra/Zyda-2](https://huggingface.co/datasets/Zyphra/Zyda-2) dataset. It was trained using the Masked Language Modeling (MLM) objective to enhance its understanding of the English language.

## Performance

The model achieves the following results on the evaluation set:
- Loss: 3.0689
- Accuracy: 0.4676

## Intended Uses & Limitations

This model is designed to be used and finetuned for the following tasks:
- Text embedding
- Text classification
- Fill-in-the-blank tasks

**Limitations:**
- English language only
- May be inaccurate for specialized jargon, dialects, slang, code, and LaTeX

## Training Data

The model was trained on the first 300,000 rows of the [Zyphra/Zyda-2](https://huggingface.co/datasets/Zyphra/Zyda-2) dataset, with 5% of that data held out for validation.
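
A minimal sketch of how such a split could be assembled with the `datasets` library (the streaming setup, configuration name, and split seed are assumptions; the card does not specify them):

```python
from datasets import Dataset, load_dataset

# Stream Zyda-2 and keep only the first 300,000 rows (streaming avoids
# downloading the full corpus; a specific config/subset may need to be named).
stream = load_dataset("Zyphra/Zyda-2", split="train", streaming=True)
rows = list(stream.take(300_000))
dataset = Dataset.from_list(rows)

# Hold out 5% of the rows for validation.
split = dataset.train_test_split(test_size=0.05, seed=42)
train_data, eval_data = split["train"], split["test"]
```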

## Training Procedure

### Hyperparameters

The following hyperparameters were used during training (a sketch of the corresponding `Trainer` setup follows the list):
- Learning rate: 5e-05
- Train batch size: 8
- Eval batch size: 8
- Seed: 42
- Optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- Learning rate scheduler: Linear
- Number of epochs: 1.0
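
A minimal sketch of how these settings map onto Hugging Face `TrainingArguments` and `Trainer` for an MLM run (the masking probability, maximum sequence length, text column name, and output directory are assumptions, not stated in this card):

```python
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base = "Snowflake/snowflake-arctic-embed-xs"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForMaskedLM.from_pretrained(base)

# Tokenize the splits from the Training Data section above
# (the "text" column name and 512-token limit are assumptions).
def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

train_data = train_data.map(tokenize, batched=True, remove_columns=train_data.column_names)
eval_data = eval_data.map(tokenize, batched=True, remove_columns=eval_data.column_names)

# Dynamic token masking for the MLM objective (15% is the library default;
# the probability actually used for this model is not stated in the card).
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="snowflake-arctic-embed-xs-zyda-2",
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    seed=42,
    lr_scheduler_type="linear",
    num_train_epochs=1.0,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_data,
    eval_dataset=eval_data,
    data_collator=collator,
)
trainer.train()
```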

### Framework Versions

- Transformers: 4.44.2
- PyTorch: 2.5.1+cu124
- Datasets: 3.1.0
- Tokenizers: 0.19.1

## Usage Examples

### Masked Language Modeling

```python
from transformers import pipeline

unmasker = pipeline('fill-mask', model='agentlans/snowflake-arctic-embed-xs-zyda-2')
# Returns the top candidate fills for the [MASK] token, with scores
result = unmasker("[MASK] is the capital of France.")
print(result)
```

### Text Embedding

```python
from transformers import AutoTokenizer, AutoModel
import torch

model_name = "agentlans/snowflake-arctic-embed-xs-zyda-2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

text = "Example sentence for embedding."
inputs = tokenizer(text, return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool the token embeddings into a single sentence vector
# (ignores the attention mask, which is fine for a single unpadded input)
embeddings = outputs.last_hidden_state.mean(dim=1)
print(embeddings)
```
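
### Text Classification

The encoder can also serve as a backbone for sequence classification. A minimal sketch using a freshly initialized classification head (the two-label setup and example text are placeholders; the head must be fine-tuned on labeled data before its predictions are meaningful):

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "agentlans/snowflake-arctic-embed-xs-zyda-2"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Adds a new, randomly initialized classification head on top of the encoder.
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

inputs = tokenizer("Example sentence to classify.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

print(logits.softmax(dim=-1))
```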

## Ethical Considerations and Bias

As this model is trained on a subset of the Zyda-2 dataset, it may inherit biases present in that data. Users should be aware of potential biases and evaluate the model's output critically, especially for sensitive applications.

## Additional Information

For more details about the base model, please refer to [Snowflake/snowflake-arctic-embed-xs](https://huggingface.co/Snowflake/snowflake-arctic-embed-xs).