customise input_tokens for text models?
Hi! I tried using the GitHub repo, but failed miserably. Is it possible to change the default 128 input tokens for text models in the no-code Space?
Not yet, but that's an important option and we should expose it soon.
What problems did you have with exporters? What model were you trying to convert?
I tried converting this model with a 512-token sequence length (its config file specifies a maximum input of 512 tokens) and always got 8+ errors. I followed the GitHub repo instructions exactly, but the errors persisted.
atharvamundada99/bert-large-question-answering-finetuned-legal
@pcuenq what is the easiest way to change the number of input tokens from the default 128 to 512? Maybe there is a place in the code I can change directly, since I know which model and which feature I plan to use?
This is the code I run:
from collections import OrderedDict
from transformers import BertTokenizer, BertForQuestionAnswering
from exporters.coreml import export
from exporters.coreml.models import BertCoreMLConfig
from exporters.coreml.config import InputDescription

# Load the fine-tuned BERT model and tokenizer from the Hugging Face Hub
model_name = "atharvamundada99/bert-large-question-answering-finetuned-legal"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForQuestionAnswering.from_pretrained(model_name)

# Create a custom Core ML configuration for BERT
class MyBertCoreMLConfig(BertCoreMLConfig):
    @property
    def inputs(self) -> OrderedDict[str, InputDescription]:
        input_descs = super().inputs
        # Set the desired sequence length here (e.g., 512 tokens)
        input_descs["input_ids"].sequence_length = 512
        return input_descs

# Export the model using the custom configuration
coreml_config = MyBertCoreMLConfig(model.config, task="question-answering")
mlmodel = export(tokenizer, model, coreml_config)

# Save the exported Core ML model to a file
mlmodel.save("BERT_QA_LEGAL.mlpackage")
So, if I just change this to return 512 (which is what the BERT model is supposed to accept), it should work, right?
# file: config.py
@property
def maxSequenceLength(self) -> int:
    if self.inferSequenceLengthFromConfig:
        # Alternatives such as n_positions are automatically
        # mapped to max_position_embeddings
        if hasattr(self._config, "max_position_embeddings"):
            return self._config.max_position_embeddings
    return 128
That should work. Perhaps a better way to do it would be to override the configuration object of the model you are interested in and implement the inferSequenceLengthFromConfig property so that it returns True. This is how configurations work: https://github.com/huggingface/exporters#overriding-default-choices-in-the-configuration-object
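Something like this untested sketch should do it, assuming the inferSequenceLengthFromConfig property quoted above and reusing the model, tokenizer, and export call from your snippet (double-check the names against the current exporters source):

from exporters.coreml import export
from exporters.coreml.models import BertCoreMLConfig

class MyBertCoreMLConfig(BertCoreMLConfig):
    @property
    def inferSequenceLengthFromConfig(self) -> bool:
        # Ask the exporter to read the sequence length from the model config
        # (e.g. max_position_embeddings) instead of using the 128-token default.
        return True

coreml_config = MyBertCoreMLConfig(model.config, task="question-answering")
mlmodel = export(tokenizer, model, coreml_config)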