---
license: apache-2.0
language:
- en
base_model:
- Qwen/Qwen2-VL-7B-Base
pipeline_tag: image-text-to-text
tags:
- multimodal
library_name: transformers
---

# MM-Coder-7B (from Qwen2-VL-7B)

## Introduction

MM-Coder-7B is a multimodal model that processes both text and images and excels at generating code from UML diagrams and flowcharts. It is based on Qwen2-VL-7B and has been fine-tuned on the MMc-Instruct-Stage1 dataset (coming soon) and the [MMc-Instruct-Stage2](https://huggingface.co/datasets/Multilingual-Multimodal-NLP/MMc-Instruct-Stage2) dataset.

## Requirements

Verified on:
- vllm==0.9.1
- transformers==4.49.0
- qwen-vl-utils==0.0.11
- accelerate==1.9.0

(Note: newer versions of transformers may cause errors (see https://github.com/vllm-project/vllm/issues/15614); please use the versions pinned above.)
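
For a fresh environment, a minimal install matching these pins might look like the following (assuming pip and a CUDA-capable machine):

```bash
pip install vllm==0.9.1 transformers==4.49.0 qwen-vl-utils==0.0.11 accelerate==1.9.0
```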

## Quickstart

Below, we provide a simple example showing inference with MM-Coder-7B using transformers. Our model is fully compatible with Qwen2-VL-7B-Instruct usage; for more usage details, refer to [Qwen2-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct).

```python
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the model with automatic dtype selection and device placement
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Multilingual-Multimodal-NLP/MM-Coder-7B", torch_dtype="auto", device_map="auto"
)

# Default processor
processor = AutoProcessor.from_pretrained("Multilingual-Multimodal-NLP/MM-Coder-7B")

# The default range for the number of visual tokens per image is 4-16384.
# You can set min_pixels and max_pixels according to your needs, such as a
# token count range of 256-1280, to balance speed and memory usage.
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained("Multilingual-Multimodal-NLP/MM-Coder-7B", min_pixels=min_pixels, max_pixels=max_pixels)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "[IMAGE_PATH]",
            },
            {"type": "text", "text": "Use Python to complete the task as described in the diagram:\nDesign a Crop class in a virtual farm management system."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)

inputs = inputs.to("cuda")

# Inference: generate the output and strip the prompt tokens
generated_ids = model.generate(**inputs, max_new_tokens=1024)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

# [OUTPUT]
# Here is a comprehensive solution for the Crop class based on the provided diagram:

# ```python
# class Crop:
#     def __init__(self, name, plant_date):
#         self.name = name
#         self.plant_date = plant_date
#         self.status = "Planted"

#     def grow(self):
#         if self.status == "Planted":
#             self.status = "Growing"
#         elif self.status == "Growing":
#             self.status = "Harvested"

#     def get_crop_infos(self):
#         return f"Crop(name={self.name}, status={self.status})"

# ...
# ```
```

## Citation

If you find our work helpful, feel free to cite us.

```
@misc{mmcoder,
      title={Multilingual Multimodal Software Developer for Code Generation},
      author={Linzheng Chai and Jian Yang and Shukai Liu and Wei Zhang and Liran Wang and Ke Jin and Tao Sun and Congnan Liu and Chenchen Zhang and Hualei Zhu and Jiaheng Liu and Xianjie Wu and Ge Zhang and Tianyu Liu and Zhoujun Li},
      year={2025},
      eprint={2507.08719},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2507.08719},
}
```