library_name: transformers
tags: []
Model Card for DeTikZifyv2 (8b)
DeTikZifyv2 (8b) is a language model that automatically converts sketches and existing scientific figures into editable, semantics-preserving TikZ graphics programs. It is based on LLaMA3.1 (8b) and the SigLIP vision encoder of PaliGemmaMix-448 (3b). Check out the DeTikZify project for more information and tips on how to best run the model.
This release is considered a preview and may be updated in the near future.
Usage
from operator import itemgetter

from detikzify.model import load
from detikzify.infer import DetikzifyPipeline

image = "https://w.wiki/A7Cc"
pipeline = DetikzifyPipeline(*load(
    model_name_or_path="nllg/detikzify-v2-8b",
    device_map="auto",
    torch_dtype="bfloat16",
))

# generate a single TikZ program
fig = pipeline.sample(image=image)

# if it compiles, rasterize it and show it
if fig.is_rasterizable:
    fig.rasterize().show()

# run MCTS for 10 minutes and generate multiple TikZ programs
figs = set()
for score, fig in pipeline.simulate(image=image, timeout=600):
    figs.add((score, fig))

# save the best TikZ program
best = sorted(figs, key=itemgetter(0))[-1][1]
best.save("fig.tex")
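If only the best program is needed, the loop above does not have to store every candidate. The following is a minimal sketch that picks the highest-scoring pair directly; `fake_simulate` is a hypothetical stand-in for `pipeline.simulate` that yields `(score, fig)` pairs, and plain strings stand in for the figure objects so that the snippet runs without the model.

```python
from operator import itemgetter
from typing import Iterable, Iterator, Optional, Tuple


def fake_simulate() -> Iterator[Tuple[float, str]]:
    """Hypothetical stand-in for pipeline.simulate(...): yields (score, fig) pairs."""
    yield from [(0.41, "fig_a"), (0.87, "fig_b"), (0.63, "fig_c")]


def best_candidate(
    candidates: Iterable[Tuple[float, str]],
) -> Optional[Tuple[float, str]]:
    """Return the highest-scoring (score, fig) pair, or None if nothing was yielded."""
    # Only the score is compared, so the figure objects never need to be orderable.
    return max(candidates, key=itemgetter(0), default=None)


best = best_candidate(fake_simulate())
```

With the real pipeline, the same helper could presumably be fed `pipeline.simulate(image=image, timeout=600)` directly, assuming higher scores are better, as the sorting in the usage example suggests.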
Changes from DeTikZifyv1
Architecture
Similar to DeTikZifyv1, DeTikZifyv2 uses a SigLIP vision encoder. However, inspired by the continued ViT pretraining of InternVL, we initialize the weights with the fine-tuned vision encoder of PaliGemmaMix-448 (3b) and increase DeTikZify's resolution to 420x420 pixels. Further, the vision encoder is no longer kept frozen but fully fine-tuned with the rest of the model.
Training Data
To pretrain the modality connector, we switch from MetaFig to the much larger ArXivCap dataset, from which we extract 1 million (figure, caption, OCR) tuples. For fine-tuning, we create a new DaTikZv3 dataset (to be released soon) with over 450k TikZ drawings.
We also train a new model called UltraSketch to generate synthetic sketches during training. It is based on UltraEdit and achieves a congruence coefficient (CC) of 0.74. Additionally, we generate synthetic sketches using image transformation. While these sketches are less diverse, they are better at preserving text rendering, achieving a similar CC of 0.75. When we average the sketch representations produced by both methods, the resulting CC increases to 0.82, indicating that the methods are orthogonal and complement each other effectively.
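The text does not spell out how the CC is computed. The sketch below assumes Tucker's congruence coefficient, i.e. the uncentered cosine between two vectors, and uses made-up two-dimensional vectors in place of real sketch representations. It only illustrates why averaging two complementary representations can score higher against a reference than either one alone; the numbers are unrelated to the values reported above.

```python
import math
from typing import List, Sequence


def congruence_coefficient(x: Sequence[float], y: Sequence[float]) -> float:
    """Tucker's congruence coefficient: sum(x*y) / sqrt(sum(x^2) * sum(y^2))."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    if norm == 0.0:
        raise ValueError("congruence is undefined for all-zero vectors")
    return dot / norm


def average(x: Sequence[float], y: Sequence[float]) -> List[float]:
    """Element-wise mean of two equally long vectors."""
    return [(a + b) / 2 for a, b in zip(x, y)]


# Toy vectors, made up for illustration (not real model outputs).
reference = [1.0, 1.0]   # hypothetical reference representation
sketch_a = [1.0, 0.0]    # captures only the first direction
sketch_b = [0.0, 1.0]    # captures only the second direction

cc_a = congruence_coefficient(sketch_a, reference)
cc_b = congruence_coefficient(sketch_b, reference)
cc_avg = congruence_coefficient(average(sketch_a, sketch_b), reference)
```

In this toy setup each individual vector reaches a CC of about 0.71, while their average aligns perfectly with the reference, which mirrors the complementary behavior described above.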
Training & Inference
We observe improved performance by extending the training to 5 epochs and increasing the learning rate to 5e-5. Because the vision encoder is now fully fine-tuned while the pooling head is not, we can no longer compute SelfSim as the cosine similarity between pooled outputs during inference. Instead, we compute the Earth Mover's Distance on the fine-tuned patch embeddings, which actually improves the correlation with human judgments (0.456 segment-level and 0.911 system-level correlation). This means that DeTikZifyv2 also works well with our MCTS-based inference algorithm.
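To make the idea of an Earth Mover's Distance over patch embeddings concrete, here is a toy sketch. It is not the DeTikZify implementation: the uniform patch weights, the cosine ground distance, and the function names are all assumptions chosen for illustration. With equally sized, uniformly weighted sets, an optimal transport plan is a one-to-one matching, so the sketch simply brute-forces all matchings, which is exact but only feasible for a handful of patches.

```python
import math
from itertools import permutations
from typing import List, Sequence

Vector = Sequence[float]


def cosine_distance(u: Vector, v: Vector) -> float:
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def emd_uniform(patches_a: List[Vector], patches_b: List[Vector]) -> float:
    """EMD between two equally sized, uniformly weighted sets of embeddings.

    Tries every one-to-one matching and returns the smallest mean cost.
    Runtime grows factorially, so use this only on tiny inputs.
    """
    n = len(patches_a)
    if n == 0 or n != len(patches_b):
        raise ValueError("both sets must be non-empty and of equal size")
    cost = [[cosine_distance(a, b) for b in patches_b] for a in patches_a]
    return min(
        sum(cost[i][perm[i]] for i in range(n)) / n
        for perm in permutations(range(n))
    )


# Toy "patch embeddings", made up for illustration.
image_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
image_a_shuffled = [[1.0, 1.0], [1.0, 0.0], [0.0, 1.0]]  # same patches, reordered
image_b = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]           # different content

same = emd_uniform(image_a, image_a_shuffled)
different = emd_uniform(image_a, image_b)
```

The distance is insensitive to the order of the patches (`same` is zero up to rounding) and grows when the patch contents differ. A practical implementation would rely on a dedicated optimal transport solver rather than enumeration.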
Evaluation
Here is how DeTikZifyv2 (8b) compares to DeTikZifyv1 (DS-7b), previously the best performing DeTikZify model, as evaluated on the test split of DaTikZv3.
| | Reference Figures | | | | | Synthetic Sketches | | | | |
|---|---|---|---|---|---|---|---|---|---|---|
| Model | MTE↑ | cBLEU↑ | TED↓ | DSim↑ | KID↓ | MTE↑ | cBLEU↑ | TED↓ | DSim↑ | KID↓ |
| DeTikZifyv1 (DS-7b) | 84.019 | 2.953 | 56.851 | 73.589 | 8.423 | 84.401 | 1.541 | 59.589 | 65.446 | 7.66 |
| DeTikZifyv2 (8b) | 93.326 | 6.105 | 54.946 | 78.943 | 6.256 | 93.858 | 3.356 | 58.32 | 72.969 | 7.507 |
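The comparison in the table can be checked with a few lines of arithmetic. The sketch below copies the scores from the table and computes the per-metric difference between the two models. It assumes that TED and KID are lower-is-better and that the remaining metrics are higher-is-better; the variable and function names are illustrative only.

```python
from typing import Dict, List, Tuple

METRICS = ["MTE", "cBLEU", "TED", "DSim", "KID"]
# Assumption: TED and KID are distances (lower is better); the rest are
# higher-is-better scores.
LOWER_IS_BETTER = {"TED", "KID"}

# Scores copied from the table (test split of DaTikZv3).
SCORES: Dict[str, Dict[str, List[float]]] = {
    "Reference Figures": {
        "DeTikZifyv1 (DS-7b)": [84.019, 2.953, 56.851, 73.589, 8.423],
        "DeTikZifyv2 (8b)": [93.326, 6.105, 54.946, 78.943, 6.256],
    },
    "Synthetic Sketches": {
        "DeTikZifyv1 (DS-7b)": [84.401, 1.541, 59.589, 65.446, 7.66],
        "DeTikZifyv2 (8b)": [93.858, 3.356, 58.32, 72.969, 7.507],
    },
}


def compare(v1: List[float], v2: List[float]) -> Dict[str, Tuple[float, bool]]:
    """Return {metric: (delta, improved)} with delta = v2 - v1."""
    result = {}
    for name, old, new in zip(METRICS, v1, v2):
        delta = round(new - old, 3)
        improved = delta < 0 if name in LOWER_IS_BETTER else delta > 0
        result[name] = (delta, improved)
    return result


report = {
    setting: compare(rows["DeTikZifyv1 (DS-7b)"], rows["DeTikZifyv2 (8b)"])
    for setting, rows in SCORES.items()
}

for setting, metrics in report.items():
    for name, (delta, improved) in metrics.items():
        print(f"{setting:18} {name:6} {delta:+.3f} {'better' if improved else 'worse'}")
```

Under these assumptions, DeTikZifyv2 (8b) improves on every metric in both settings, with the largest absolute gains in MTE.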