Model Card for DeTikZifyv2 (8b)

DeTikZifyv2 (8b) is a language model that automatically converts sketches and existing scientific figures into editable, semantics-preserving TikZ graphics programs. It is based on LLaMA3.1 (8b) and the SigLIP vision encoder of PaliGemmaMix-448 (3b). Check out the DeTikZify project for more information and tips on how to best run the model.

This release is considered a preview and may be updated in the near future.

Usage

from operator import itemgetter

from detikzify.model import load
from detikzify.infer import DetikzifyPipeline

image = "https://w.wiki/A7Cc"
pipeline = DetikzifyPipeline(*load(
    model_name_or_path="nllg/detikzify-v2-8b",
    device_map="auto",
    torch_dtype="bfloat16",
))

# generate a single TikZ program
fig = pipeline.sample(image=image)

# if it compiles, rasterize it and show it
if fig.is_rasterizable:
    fig.rasterize().show()

# run MCTS for 10 minutes and generate multiple TikZ programs
figs = set()
for score, fig in pipeline.simulate(image=image, timeout=600):
    figs.add((score, fig))

# save the best TikZ program
best = sorted(figs, key=itemgetter(0))[-1][1]
best.save("fig.tex")
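The selection step at the end of the snippet can be tried without loading the model. The following is a minimal sketch that assumes higher scores are better (as the sort above implies) and that some candidates may not compile; `StubFig` and `pick_best` are hypothetical names used only for illustration and are not part of the detikzify package.

```python
from dataclasses import dataclass
from operator import itemgetter


@dataclass(frozen=True)
class StubFig:
    """Hypothetical stand-in for a generated figure (not a detikzify class)."""
    code: str
    is_rasterizable: bool


def pick_best(scored_figs):
    """Return the highest-scoring rasterizable figure, or None if there is none."""
    compilable = [(score, fig) for score, fig in scored_figs if fig.is_rasterizable]
    if not compilable:
        return None
    # max() avoids sorting the whole collection just to read its last element
    return max(compilable, key=itemgetter(0))[1]


# Made-up (score, figure) pairs, shaped like the ones collected in the loop above
candidates = {
    (0.41, StubFig("\\draw (0,0) -- (1,1);", True)),
    (0.87, StubFig("\\draw (0,0) circle (1);", True)),
    (0.93, StubFig("\\draw (0,0", False)),  # top score, but does not compile
}

best = pick_best(candidates)
print(best.code)
```

Filtering on `is_rasterizable` before ranking is a design choice of this sketch, not something the model card prescribes.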

Changes from DeTikZifyv1

Architecture

Similar to DeTikZifyv1, DeTikZifyv2 uses a SigLIP vision encoder. However, inspired by the continued ViT pretraining of InternVL, we initialize the weights with the fine-tuned vision encoder of PaliGemmaMix-448 (3b) and increase DeTikZify's resolution to 420x420 pixels. Further, the vision encoder is no longer kept frozen but fully fine-tuned with the rest of the model.

Training Data

To pretrain the modality connector, we switch from MetaFig to the much larger ArXivCap dataset and extract 1 million (figure, caption, OCR) tuples. For fine-tuning, we create a new DaTikZv3 dataset (to be released soon) with over 450k TikZ drawings.

We also train a new model called UltraSketch to generate synthetic sketches during training. It is based on UltraEdit and achieves a congruence coefficient (CC) of 0.74. Additionally, we generate synthetic sketches using image transformation. While these sketches are less diverse, they are better at preserving text rendering, achieving a similar CC of 0.75. When we average the sketch representations produced by both methods, the resulting CC increases to 0.82, indicating that the methods are orthogonal and complement each other effectively.
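To illustrate why averaging can raise the CC, here is a small sketch. It assumes the CC is Tucker's congruence coefficient, i.e. the uncentered cosine between two vectors; the card does not spell out the formula or how the sketch representations are obtained, so the toy vectors below are purely hypothetical.

```python
import math


def congruence_coefficient(x, y):
    """Tucker's congruence coefficient: sum(x*y) / sqrt(sum(x^2) * sum(y^2))."""
    numerator = sum(a * b for a, b in zip(x, y))
    denominator = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return numerator / denominator


def average(u, v):
    """Element-wise mean of two equally long vectors."""
    return [(a + b) / 2 for a, b in zip(u, v)]


# Hypothetical representations: each sketch captures a different half of the
# reference and adds an error that the other sketch cancels out.
reference = [1.0, 0.0, 1.0, 0.0]
sketch_a = [1.0, 1.0, 0.0, 0.0]
sketch_b = [0.0, -1.0, 1.0, 0.0]

cc_a = congruence_coefficient(reference, sketch_a)
cc_b = congruence_coefficient(reference, sketch_b)
cc_avg = congruence_coefficient(reference, average(sketch_a, sketch_b))
print(cc_a, cc_b, cc_avg)
```

In this constructed case the two sketches err in opposite directions, so their average matches the reference more closely than either one alone. Real sketch representations will show a smaller gain.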

Training & Inference

We observe improved performance by extending the training to 5 epochs and increasing the learning rate to 5e-5. Because the vision encoder is now fully fine-tuned while the pooling head is not, we can no longer compute SelfSim as the cosine similarity between pooled outputs during inference. Instead, we compute the Earth Mover's Distance on the fine-tuned patch embeddings, which actually improves the correlation with human judgments (0.456 segment-level and 0.911 system-level correlation). This means that DeTikZifyv2 also works well with our MCTS-based inference algorithm.
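The idea of comparing patch embeddings with Earth Mover's Distance can be sketched on toy data. The code below is not the DeTikZify implementation: it assumes both images have the same number of patches, that every patch carries equal weight, and that cosine distance is the ground cost. Under those assumptions the EMD equals the cost of the cheapest one-to-one matching, which is brute-forced here and is therefore only practical for a handful of patches. `toy_emd` and the vectors are hypothetical.

```python
import math
from itertools import permutations


def cosine_distance(u, v):
    """1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def toy_emd(patches_a, patches_b):
    """EMD between two equally sized, uniformly weighted sets of patch embeddings.

    With uniform weights and equal set sizes, an optimal transport plan is a
    one-to-one matching, so we search all permutations (tiny inputs only).
    """
    n = len(patches_a)
    if n == 0 or n != len(patches_b):
        raise ValueError("expected two non-empty patch sets of equal size")
    cost = [[cosine_distance(a, b) for b in patches_b] for a in patches_a]
    cheapest = min(
        sum(cost[i][perm[i]] for i in range(n))
        for perm in permutations(range(n))
    )
    return cheapest / n  # each patch carries mass 1/n


# Hypothetical 2-dimensional "patch embeddings"
image = [[1.0, 0.0], [0.0, 1.0]]
same_patches_reordered = [[0.0, 1.0], [1.0, 0.0]]
different = [[1.0, 0.0], [1.0, 0.0]]

d_same = toy_emd(image, same_patches_reordered)
d_diff = toy_emd(image, different)
print(d_same, d_diff)
```

A lower distance means the two sets of patches are more alike. How the distance is turned into the SelfSim score used during MCTS is not described in this card, so the sketch stops at the distance itself.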

Evaluation

Here is how DeTikZifyv2 (8b) compares to DeTikZifyv1 (DS-7b), previously the best performing DeTikZify model, as evaluated on the test split of DaTikZv3.

                     | Reference Figures                       | Reference Figures
Model                | MTE↑    cBLEU↑  TED↓    DSim↑   KID↓    | MTE↑    cBLEU↑  TED↓    DSim↑   KID↓
DeTikZifyv1 (DS-7b)  | 84.019  2.953   56.851  73.589  8.423   | 84.401  1.541   59.589  65.446  7.66
DeTikZifyv2 (8b)     | 93.326  6.105   54.946  78.943  6.256   | 93.858  3.356   58.32   72.969  7.507