# CoreML Conversion of the mxbai-embed-large-v1 sentence embedding model

After extensive testing (and a lot of debugging with ChatGPT), I was able to convert the mxbai-embed-large-v1 model to CoreML and run it mostly on the GPU.

```Python
import torch
import numpy as np
from transformers import AutoModel, AutoTokenizer
import coremltools as ct

# Define a wrapper class for the AutoModel to return only the last_hidden_state
class ModelWrapper(torch.nn.Module):
    def __init__(self, model):
        super(ModelWrapper, self).__init__()
        self.model = model

    def forward(self, input_ids, attention_mask):
        # Extract the 'last_hidden_state' from the model output
        output = self.model(input_ids=input_ids, attention_mask=attention_mask)
        return output.last_hidden_state  # or use 'pooler_output' if needed

# Load your SentenceTransformer model and tokenizer
model_name = "mixedbread-ai/mxbai-embed-large-v1"  # Replace with your model
model = AutoModel.from_pretrained(model_name)
model.eval()
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Wrap the model to return only the tensor output
wrapped_model = ModelWrapper(model)
wrapped_model.eval()

# Sample input to export the model
dummy_input = tokenizer("This is a sample input", return_tensors="pt")

# Trace the model using tensor inputs (input_ids, attention_mask)
traced_model = torch.jit.trace(wrapped_model, (dummy_input['input_ids'], dummy_input['attention_mask']))

# Convert the traced PyTorch model to CoreML using the ML Program format
model_from_torch = ct.convert(
    traced_model,
    inputs=[
        ct.TensorType(name="input_ids", shape=(1, ct.RangeDim(1, 512)), dtype=np.float32),
        ct.TensorType(name="attention_mask", shape=(1, ct.RangeDim(1, 512)), dtype=np.float32)
    ],
    minimum_deployment_target=ct.target.iOS17,
    convert_to="mlprogram",
    compute_precision=ct.precision.FLOAT16
)

# Save the CoreML model as an mlpackage
model_from_torch.save("mxbai-embed-large-v1.mlpackage")
```
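When loading the saved package, you can steer where it executes by passing explicit compute units; this is a minimal sketch (`ct.ComputeUnit.CPU_AND_GPU` keeps execution off the Neural Engine, while the default `ALL` lets CoreML choose):

```Python
import coremltools as ct

# Load the package with explicit compute units. CPU_AND_GPU restricts
# execution to the CPU and GPU (no Neural Engine); the default ALL
# lets CoreML pick whatever combination it considers fastest.
model = ct.models.MLModel(
    "mxbai-embed-large-v1.mlpackage",
    compute_units=ct.ComputeUnit.CPU_AND_GPU,
)
```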

It can be run like this:
```Python
import coremltools as ct
from transformers import AutoTokenizer
import numpy as np

# Load the CoreML model
model = ct.models.MLModel("mxbai-embed-large-v1.mlpackage")

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("mixedbread-ai/mxbai-embed-large-v1")

# Prepare some input text
input_text = "This is a test sentence for the CoreML model"
inputs = tokenizer(input_text, return_tensors="np", padding=True, truncation=True, max_length=512)

# Extract input tensors
input_ids = inputs['input_ids'].astype(np.float32)  # the converted model declares float32 inputs
attention_mask = inputs['attention_mask'].astype(np.float32)

# Prepare inputs for the CoreML model
coreml_input = {"input_ids": input_ids, "attention_mask": attention_mask}

predictions = model.predict(coreml_input)

hidden_states = predictions['hidden_states']
cls_embedding = hidden_states[0, 0, :]
np.set_printoptions(threshold=np.inf)

# Print the CLS token embedding, which is a 1024-dimensional vector
print("CLS Token Embedding:", cls_embedding, len(cls_embedding))
```
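The output key (`hidden_states` here) is auto-generated during conversion and may differ across coremltools versions. If the lookup fails, you can list the names from the model spec:

```Python
# Print the auto-generated input/output names of the package; handy
# when the 'hidden_states' key above does not match your conversion.
spec = model.get_spec()
print("inputs: ", [inp.name for inp in spec.description.input])
print("outputs:", [out.name for out in spec.description.output])
```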

I verified the output with ollama:

```
curl http://localhost:11434/api/embeddings -d '{
    "model": "mxbai-embed-large",
        "prompt": "This is a test sentence for the CoreML model"
    }'
```
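To compare the two outputs programmatically, here is a small sketch (assuming the `cls_embedding` from the run script above and the `requests` library): it fetches ollama's embedding for the same sentence and computes the cosine similarity, which should be very close to, but not exactly, 1.0 because of the float16 compute precision:

```Python
import numpy as np
import requests

# Fetch the same sentence's embedding from the local ollama server
# (same endpoint and payload as the curl call above).
resp = requests.post(
    "http://localhost:11434/api/embeddings",
    json={
        "model": "mxbai-embed-large",
        "prompt": "This is a test sentence for the CoreML model",
    },
)
ollama_embedding = np.array(resp.json()["embedding"])

# Cosine similarity between the CoreML CLS embedding and ollama's vector.
similarity = np.dot(cls_embedding, ollama_embedding) / (
    np.linalg.norm(cls_embedding) * np.linalg.norm(ollama_embedding)
)
print("Cosine similarity:", similarity)
```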

Environment:
- Python 3.11
- coremltools 8.0
- sentence-transformers 3.1.0
- transformers 4.44.2