For fine-tuning DePlot, what form of text should be given as the input data table?
I am trying to fine-tune google/deplot following the link and notebook below.
link: https://huggingface.co/docs/transformers/main/en/model_doc/deplot#finetuning
Notebook: https://github.com/huggingface/notebooks/blob/main/examples/image_captioning_pix2struct.ipynb
What form of text should be given as the input data table for fine-tuning DePlot?
From Figure 1 of the paper, I think the text format is as follows:
text = """
Header: models | augmented-set | human-set
Row 1: VisionTapas | 67.2 | 22.2
Row 2: Pix2Struct | 82.9 | 30.4
"""
Is this correct?
In the fine-tuning notebook, I think the above data would be placed in texts in the following code (in the collator function):
text_inputs = processor(text=texts, padding="max_length", return_tensors="pt", add_special_tokens=True, max_length=20)
Hi, thanks for the questions.
We used the following format for the ground truth:
models | augmented-set | human-set
VisionTapas | 67.2 | 22.2
Pix2Struct | 82.9 | 30.4
The Header: and Row x: labels are added by a post-processing function (see here). Or you can see this as an example ground-truth table.
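For concreteness, a minimal sketch of that post-processing step (a hypothetical helper for illustration, not the exact function linked above) could look like:

def add_row_labels(table: str) -> str:
    # Hypothetical sketch: prefix "Header:" / "Row N:" onto a flattened
    # table whose rows are separated by newlines (or <0x0A> tokens).
    rows = table.replace("<0x0A>", "\n").split("\n")
    labeled = [f"Header: {rows[0].strip()}"]
    labeled += [f"Row {i}: {row.strip()}" for i, row in enumerate(rows[1:], start=1)]
    return "\n".join(labeled)

print(add_row_labels("models | augmented-set | human-set\nVisionTapas | 67.2 | 22.2\nPix2Struct | 82.9 | 30.4"))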
And yes, the ground truths should be put into texts in the collator function.
Hope this helps!
@fl399 Hi! Regarding fine-tuning and text preprocessing: are \n newline characters converted into <0x0A> by default, or do we need to do this ourselves before passing the text into the tokenizer?
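One quick way to check this yourself (a small sketch that just inspects what the tokenizer emits for a string containing \n):

from transformers import Pix2StructProcessor

processor = Pix2StructProcessor.from_pretrained("google/deplot")
ids = processor.tokenizer("a | b\nc | d", add_special_tokens=False).input_ids
print(processor.tokenizer.convert_ids_to_tokens(ids))
# If <0x0A> shows up in the output, the byte-fallback tokenizer already
# handles raw newlines and no manual conversion is needed.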
@fl399 or @sinchir0
Could you please share the changes needed for image_captioning_pix2struct.ipynb? I've spent the better part of an afternoon and evening working in Colab trying to get it to fine-tune. I've shared the paper, the notebook, and this discussion with Anthropic's Claude and tried Google's Colab LLM, but I can't get it working. Could you share what we need to change in the cells of the notebook, specifically the ImageCaptioningDataset class, the collator, and anything else? I, and many others, would be so grateful 🙏🙏
The challenge is that we're using a different processor, initialized with:
processor = Pix2StructProcessor.from_pretrained("google/deplot")
If we reference the model card's fine-tuning section, we see the example for using the processor:
inputs = processor(images=images, text="Generate underlying data table of the figure below:", return_tensors="pt")
And we are pointed to the image_captioning_pix2struct notebook.
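For reference, putting the model-card pieces together, plain inference looks roughly like this (a sketch; the image path is a placeholder):

from PIL import Image
from transformers import Pix2StructProcessor, Pix2StructForConditionalGeneration

processor = Pix2StructProcessor.from_pretrained("google/deplot")
model = Pix2StructForConditionalGeneration.from_pretrained("google/deplot")

image = Image.open("chart.png")  # placeholder: any chart image

# The prompt is handled by the processor (for VQA-style checkpoints
# like deplot it is rendered onto the image as a header)
inputs = processor(images=image, text="Generate underlying data table of the figure below:", return_tensors="pt")
predictions = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(predictions[0], skip_special_tokens=True))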
The code below is from the original notebook, where "text" is essentially the text-based label/answer for the "image". Can we get the remaining updates so that at least the example works? I have fiddled with it until it trains, but I'm left with a lot of uncertainty about whether it's actually making progress. I'd like to know the code is implemented correctly before dedicating an A100 to the job for a few hours.
import torch
from torch.utils.data import Dataset, DataLoader

MAX_PATCHES = 1024

class ImageCaptioningDataset(Dataset):
    def __init__(self, dataset, processor):
        self.dataset = dataset
        self.processor = processor

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        item = self.dataset[idx]
        # Image-only preprocessing; the target text is attached for the collator
        encoding = self.processor(images=item["image"], return_tensors="pt", add_special_tokens=True, max_patches=MAX_PATCHES)
        encoding = {k: v.squeeze() for k, v in encoding.items()}
        encoding["text"] = item["text"]
        return encoding

def collator(batch):
    new_batch = {"flattened_patches": [], "attention_mask": []}
    texts = [item["text"] for item in batch]

    # Tokenize the target tables as labels
    text_inputs = processor(text=texts, padding="max_length", return_tensors="pt", add_special_tokens=True, max_length=20)

    new_batch["labels"] = text_inputs.input_ids

    for item in batch:
        new_batch["flattened_patches"].append(item["flattened_patches"])
        new_batch["attention_mask"].append(item["attention_mask"])

    new_batch["flattened_patches"] = torch.stack(new_batch["flattened_patches"])
    new_batch["attention_mask"] = torch.stack(new_batch["attention_mask"])

    return new_batch
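For completeness, the training loop from the same notebook is roughly as follows (from memory; model, train_dataset, and the optimizer settings are assumed from earlier cells):

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=2, collate_fn=collator)

model.train()
for epoch in range(10):
    for batch in train_dataloader:
        labels = batch.pop("labels").to(device)
        flattened_patches = batch.pop("flattened_patches").to(device)
        attention_mask = batch.pop("attention_mask").to(device)

        outputs = model(flattened_patches=flattened_patches,
                        attention_mask=attention_mask,
                        labels=labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()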
Anyone?
You need to change the line below:
--------------------------- before ------------------------------------
text_inputs = processor(text=texts, padding="max_length", return_tensors="pt", add_special_tokens=True, max_length=20)
--------------------------- after ------------------------------------
text_inputs = processor.tokenizer(text=texts, padding="max_length", return_tensors="pt", add_special_tokens=True, max_length=20)
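As far as I can tell, this is needed because google/deplot uses a VQA-style Pix2StructProcessor: calling processor(text=...) without images tries to treat the text as a prompt to render, whereas processor.tokenizer tokenizes the target table directly. One more caveat (my observation, not from the notebook): max_length=20 will truncate almost any real data table, so something larger is probably needed, e.g.:
text_inputs = processor.tokenizer(text=texts, padding="max_length", return_tensors="pt", add_special_tokens=True, max_length=512, truncation=True)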
thank you @SungBeom 🙏