Fine-tuning for a different language
What would be the best way to fine-tune it on Korean? Is there any open source code specifically for the Qwen2-VL backbone?
The Tevatron repo has code for Phi-3 but not Qwen...
I have prepared a dataset of ~500k image-query pairs and want to fine-tune this model, but I'm not sure what the right way to do it would be.
Thanks for your interest. I can update the code for Qwen tomorrow and let you know.
Thank you very much!
Hi @jdchawla, I have updated the basic Qwen code (see the dse/qwen example folder in Tevatron). However, I currently do not have much compute to test it thoroughly, so feel free to open an issue or a pull request.
For your specific use case, I suggest the following training procedure:
- Encode your images with MrLight/dse-qwen2-2b-mrl-v1 by following the example code for document encoding (see the encoding sketch after this list).
- Prepare your corpus dataset like Tevatron/wiki-ss-corpus.
- Encode all your queries with MrLight/dse-qwen2-2b-mrl-v1 by following the example code for query encoding.
- Prepare your query dataset like Tevatron/wiki-ss-nq.
- (No need to deal with positive docs and negative docs for now.)
- Do a search with the above query and passage representations by following the search code.
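Here is a rough sketch of what the encoding step could look like. It assumes the recipe described on the MrLight/dse-qwen2-2b-mrl-v1 model card (last-token pooling, optional MRL truncation, L2 normalization); the prompt formatting and helper names are illustrative, so follow the Tevatron dse/qwen example for the authoritative version:

```python
# Rough sketch of DSE-style encoding with MrLight/dse-qwen2-2b-mrl-v1.
# Assumption: last-token pooling + optional MRL truncation + L2 normalization;
# queries may need the placeholder image described on the model card.
import torch
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

model_name = "MrLight/dse-qwen2-2b-mrl-v1"
processor = AutoProcessor.from_pretrained(model_name)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_name, torch_dtype=torch.bfloat16
).to("cuda").eval()

def pool_last_token(last_hidden_state, attention_mask, dim=None):
    # Take the hidden state of the last non-padding token of each sequence,
    # optionally truncate to a smaller MRL dimension, then L2-normalize.
    last_positions = attention_mask.sum(dim=1) - 1
    reps = last_hidden_state[torch.arange(last_hidden_state.size(0)), last_positions]
    if dim is not None:
        reps = reps[:, :dim]
    return torch.nn.functional.normalize(reps, p=2, dim=-1)

@torch.no_grad()
def encode(texts, images=None, dim=None):
    # `texts` should already be chat-formatted prompts (document prompt with an
    # image placeholder, or the query prompt) as expected by the model.
    inputs = processor(text=texts, images=images, padding=True,
                       return_tensors="pt").to(model.device)
    out = model(**inputs, output_hidden_states=True, return_dict=True)
    return pool_last_token(out.hidden_states[-1], inputs["attention_mask"], dim)
```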
The above steps will give you retrieval results from MrLight/dse-qwen2-2b-mrl-v1 for your queries over your document images. The main purpose here is hard negative mining (I assume our Qwen model can already do reasonably good zero-shot retrieval on the Korean task).
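Since the embeddings are L2-normalized, the search step can be as simple as a brute-force dot-product (cosine) search; the Tevatron search code covers this step, so the snippet below is only a minimal sketch to make it concrete:

```python
import torch

def brute_force_search(query_embs, doc_embs, doc_ids, top_k=100):
    """query_embs: (Q, D) and doc_embs: (N, D) L2-normalized torch tensors;
    doc_ids: list of N document ids aligned with the rows of doc_embs.
    Returns, for each query, its top_k (docid, score) pairs."""
    scores = query_embs @ doc_embs.T  # cosine similarity, since vectors are normalized
    top_scores, top_indices = scores.topk(top_k, dim=1)
    results = []
    for idx_row, score_row in zip(top_indices.tolist(), top_scores.tolist()):
        results.append([(doc_ids[i], s) for i, s in zip(idx_row, score_row)])
    return results
```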
Now, based on the retrieval results, you can create your training data (query, positive documents, negative documents) in the format of Tevatron/wiki-ss-nq, by putting the paired document id into the positive passages and the retrieved non-paired document ids into the negative passages (leave the text field empty; we only need the docid here). Then train your model by following the training example.
I will work on making the Tevatron toolkit more friendly for multimodal retrieval in the following weeks; in the meantime, feel free to ping me if there are further questions.
BTW, when I initially fine-tuned Qwen, I turned off backpropagation through the visual encoder to save VRAM. If Qwen itself already performs well enough on Korean document OCR etc., I think it's OK to keep backpropagation through the visual encoder turned off.
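If you take that route, freezing the vision tower amounts to disabling gradients on its parameters before training; the `visual` attribute name below matches the Qwen2-VL class in transformers, but double-check it on your loaded model:

```python
# Freeze the vision tower so gradients (and optimizer state) are only kept
# for the language-model side, reducing VRAM during fine-tuning.
for param in model.visual.parameters():
    param.requires_grad = False
```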
Thank you so much for the code and the insights!