---
library_name: transformers
tags: []
---

# Model Card for DeTi*k*Zifyv2 (8b)

DeTi*k*Zifyv2 (8b) is a language model that automatically converts sketches and existing scientific figures into editable, semantics-preserving Ti*k*Z graphics programs. It is based on [LLaMA3.1 (8b)](https://huggingface.co/meta-llama/Llama-3.1-8B) and the SigLIP vision encoder of [PaliGemmaMix-448 (3b)](https://huggingface.co/google/paligemma-3b-mix-448). Check out the [DeTi*k*Zify](https://github.com/potamides/DeTikZify) project for more information and tips on how to best run the model.

> [!WARNING]
> This release is considered a preview and may be updated in the near future.

## Usage

```python
from operator import itemgetter

from detikzify.model import load
from detikzify.infer import DetikzifyPipeline

image = "https://w.wiki/A7Cc"
pipeline = DetikzifyPipeline(*load(
    model_name_or_path="nllg/detikzify-v2-8b",
    device_map="auto",
    torch_dtype="bfloat16",
))

# generate a single TikZ program
fig = pipeline.sample(image=image)

# if it compiles, rasterize it and show it
if fig.is_rasterizable:
    fig.rasterize().show()

# run MCTS for 10 minutes and generate multiple TikZ programs
figs = set()
for score, fig in pipeline.simulate(image=image, timeout=600):
    figs.add((score, fig))

# save the best TikZ program
best = sorted(figs, key=itemgetter(0))[-1][1]
best.save("fig.tex")
```

## Changes from DeTi*k*Zifyv1

### Architecture

Like DeTi*k*Zifyv1, DeTi*k*Zifyv2 uses a SigLIP vision encoder. However, inspired by the continued ViT pretraining of [InternVL](https://arxiv.org/abs/2404.16821), we initialize its weights with the fine-tuned vision encoder of [PaliGemmaMix-448 (3b)](https://arxiv.org/abs/2407.07726) and increase DeTi*k*Zify's input resolution to 420x420 pixels. Further, the vision encoder is no longer kept frozen but fully fine-tuned along with the rest of the model.

### Training Data

For pretraining, we switch from MetaFig to the much larger [ArXivCap](https://huggingface.co/datasets/MMInstruction/ArxivCap) dataset and extract 1 million (figure, caption, OCR) tuples for pretraining the modality connector. For fine-tuning, we create the new DaTi*k*Zv3 dataset (to be released soon) with over 450k Ti*k*Z drawings. We also train a new model called [UltraSketch](https://huggingface.co/nllg/ultrasketch) to generate synthetic sketches during training. It is based on [UltraEdit](https://arxiv.org/abs/2407.05282) and achieves a congruence coefficient (CC) of 0.74. Additionally, we generate synthetic sketches using image transformations. While these sketches are less diverse, they are better at preserving text rendering, achieving a similar CC of 0.75. When we average the sketch representations produced by both methods, the resulting CC increases to 0.82, indicating that the two methods are orthogonal and complement each other effectively.

### Training & Inference

We observe improved performance by extending training to 5 epochs and increasing the learning rate to 5e-5. Fully fine-tuning the vision encoder means that we can no longer compute SelfSim as the cosine similarity between pooled outputs during inference, since the pooling head is not fine-tuned alongside it. However, computing the Earth Mover's Distance on the fine-tuned patch embeddings instead actually enhances the correlation with human judgments (0.456 segment-level and 0.911 system-level correlation). This means that DeTi*k*Zifyv2 also works well with our MCTS-based inference algorithm.
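To make the idea concrete, below is a minimal, hypothetical sketch of such a SelfSim-style score: it embeds two images with a generic SigLIP vision encoder and scores them with the negated Earth Mover's Distance between their patch embeddings. This is not the DeTi*k*Zify implementation; the checkpoint name, the [POT](https://pythonot.github.io/) library, and the cosine cost are assumptions made purely for illustration.

```python
# Minimal, hypothetical sketch of a SelfSim-style score: the negated Earth
# Mover's Distance (EMD) between the patch embeddings of two images.
# NOT the DeTikZify implementation; the checkpoint, preprocessing, and the
# cosine cost below are illustrative assumptions.
import numpy as np
import ot  # Python Optimal Transport (pip install pot)
import torch
from PIL import Image
from transformers import SiglipImageProcessor, SiglipVisionModel

CHECKPOINT = "google/siglip-so400m-patch14-384"  # stand-in for the fine-tuned encoder
encoder = SiglipVisionModel.from_pretrained(CHECKPOINT).eval()
processor = SiglipImageProcessor.from_pretrained(CHECKPOINT)

@torch.no_grad()
def patch_embeddings(image: Image.Image) -> np.ndarray:
    """Return the (num_patches, hidden_size) patch embeddings of an image."""
    inputs = processor(images=image, return_tensors="pt")
    return encoder(**inputs).last_hidden_state.squeeze(0).double().numpy()

def selfsim(image_a: Image.Image, image_b: Image.Image) -> float:
    """Higher is better: negated exact EMD between the two sets of patch embeddings."""
    x, y = patch_embeddings(image_a), patch_embeddings(image_b)
    cost = ot.dist(x, y, metric="cosine")  # pairwise cosine distances between patches
    a = np.full(len(x), 1 / len(x))        # uniform mass over patches of image_a
    b = np.full(len(y), 1 / len(y))        # uniform mass over patches of image_b
    return -ot.emd2(a, b, cost)
```

During MCTS-based inference, a score of this kind could then be used to rank rasterized candidate programs against the input sketch or figure.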
## Evaluation

Here is how DeTi*k*Zifyv2 (8b) compares to [DeTi*k*Zifyv1 (DS-7b)](https://huggingface.co/nllg/detikzify-ds-7b), previously the best-performing DeTi*k*Zify model, as evaluated on the test split of DaTi*k*Zv3.
<table>
  <tr>
    <th rowspan="2">Model</th>
    <th colspan="5">Reference Figures</th>
    <th colspan="5">Synthetic Sketches</th>
  </tr>
  <tr>
    <th>MTE</th><th>cBLEU</th><th>TED</th><th>DSim</th><th>KID</th>
    <th>MTE</th><th>cBLEU</th><th>TED</th><th>DSim</th><th>KID</th>
  </tr>
  <tr>
    <td>DeTi<i>k</i>Zifyv1 (DS-7b)</td>
    <td>84.019</td><td>2.953</td><td>56.851</td><td>73.589</td><td>8.423</td>
    <td>84.401</td><td>1.541</td><td>59.589</td><td>65.446</td><td>7.66</td>
  </tr>
  <tr>
    <td>DeTi<i>k</i>Zifyv2 (8b)</td>
    <td>93.326</td><td>6.105</td><td>54.946</td><td>78.943</td><td>6.256</td>
    <td>93.858</td><td>3.356</td><td>58.32</td><td>72.969</td><td>7.507</td>
  </tr>
</table>