nllg/detikzify-v2-8b

potamides committed on Dec 4, 2024

Commit 03edfb2 (verified) · 1 Parent(s): 1f01a38

Update README.md

Files changed (1)

README.md +134 -192

@@ -3,197 +3,139 @@ library_name: transformers
 tags: []
 ---
-# Model Card for Model ID
-<!-- Provide a quick summary of what the model is/does. -->
-## Model Details
-### Model Description
-<!-- Provide a longer summary of what this model is. -->
-This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
-- **Developed by:** [More Information Needed]
-- **Funded by [optional]:** [More Information Needed]
-- **Shared by [optional]:** [More Information Needed]
-- **Model type:** [More Information Needed]
-- **Language(s) (NLP):** [More Information Needed]
-- **License:** [More Information Needed]
-- **Finetuned from model [optional]:** [More Information Needed]
-### Model Sources [optional]
-<!-- Provide the basic links for the model. -->
-- **Repository:** [More Information Needed]
-- **Paper [optional]:** [More Information Needed]
-- **Demo [optional]:** [More Information Needed]
-## Uses
-<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
-### Direct Use
-<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
-[More Information Needed]
-### Downstream Use [optional]
-<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
-[More Information Needed]
-### Out-of-Scope Use
-<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
-[More Information Needed]
-## Bias, Risks, and Limitations
-<!-- This section is meant to convey both technical and sociotechnical limitations. -->
-[More Information Needed]
-### Recommendations
-<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
-Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
-## How to Get Started with the Model
-Use the code below to get started with the model.
-[More Information Needed]
-## Training Details
 ### Training Data
-<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
-[More Information Needed]
-### Training Procedure
-<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
-#### Preprocessing [optional]
-[More Information Needed]
-#### Training Hyperparameters
-- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
-#### Speeds, Sizes, Times [optional]
-<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
-[More Information Needed]
-## Evaluation
-<!-- This section describes the evaluation protocols and provides the results. -->
-### Testing Data, Factors & Metrics
-#### Testing Data
-<!-- This should link to a Dataset Card if possible. -->
-[More Information Needed]
-#### Factors
-<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
-[More Information Needed]
-#### Metrics
-<!-- These are the evaluation metrics being used, ideally with a description of why. -->
-[More Information Needed]
-### Results
-[More Information Needed]
-#### Summary
-## Model Examination [optional]
-<!-- Relevant interpretability work for the model goes here -->
-[More Information Needed]
-## Environmental Impact
-<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
-Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
-- **Hardware Type:** [More Information Needed]
-- **Hours used:** [More Information Needed]
-- **Cloud Provider:** [More Information Needed]
-- **Compute Region:** [More Information Needed]
-- **Carbon Emitted:** [More Information Needed]
-## Technical Specifications [optional]
-### Model Architecture and Objective
-[More Information Needed]
-### Compute Infrastructure
-[More Information Needed]
-#### Hardware
-[More Information Needed]
-#### Software
-[More Information Needed]
-## Citation [optional]
-<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
-**BibTeX:**
-[More Information Needed]
-**APA:**
-[More Information Needed]
-## Glossary [optional]
-<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
-[More Information Needed]
-## More Information [optional]
-[More Information Needed]
-## Model Card Authors [optional]
-[More Information Needed]
-## Model Card Contact
-[More Information Needed]

 tags: []
 ---
+# Model Card for DeTi*k*Zify<sub>v2</sub> (8b)
+DeTi*k*Zify<sub>v2</sub> (8b) is a language model that automatically converts
+sketches and existing scientific figures into editable, semantics-preserving
+Ti*k*Z graphics programs. It is based on [LLaMA<sub>3.1</sub>
+(8b)](https://huggingface.co/meta-llama/Llama-3.1-8B) and the SigLIP vision
+encoder of [PaliGemma<sub>Mix-448</sub>
+(3b)](https://huggingface.co/google/paligemma-3b-mix-448). Check out the
+[DeTi*k*Zify](https://github.com/potamides/DeTikZify) project for more
+information and tips on how to best run the model.
+> [!WARNING]
+> This release is considered a preview and may be updated in the near future.
+## Usage
+```python
+from operator import itemgetter
+from detikzify.model import load
+from detikzify.infer import DetikzifyPipeline
+image = "https://w.wiki/A7Cc"
+pipeline = DetikzifyPipeline(*load(
+    model_name_or_path="nllg/detikzify-v2-8b",
+    device_map="auto",
+    torch_dtype="bfloat16",
+))
+# generate a single TikZ program
+fig = pipeline.sample(image=image)
+# if it compiles, rasterize it and show it
+if fig.is_rasterizable:
+    fig.rasterize().show()
+# run MCTS for 10 minutes and generate multiple TikZ programs
+figs = set()
+for score, fig in pipeline.simulate(image=image, timeout=600):
+    figs.add((score, fig))
+# save the best TikZ program
+best = sorted(figs, key=itemgetter(0))[-1][1]
+best.save("fig.tex")
+```
+## Changes from DeTi*k*Zify<sub>v1</sub>
+### Architecture
+Similar to DeTi*k*Zify<sub>v1</sub>, DeTi*k*Zify<sub>v2</sub> uses a SigLIP
+vision encoder. However, inspired by the continued ViT pretraining of
+[InternVL](https://arxiv.org/abs/2404.16821), we initialize the weights with
+the fine-tuned vision encoder of [PaliGemma<sub>Mix-448</sub>
+(3b)](https://arxiv.org/abs/2407.07726) and increase DeTi*k*Zify's
+resolution to 420x420 pixels. Further, the vision encoder is no longer kept
+frozen but fully fine-tuned with the rest of the model.
 ### Training Data
+We switch the pretraining data from MetaFig to the much larger
+[ArXivCap](https://huggingface.co/datasets/MMInstruction/ArxivCap) dataset and
+extract 1 million (figure, caption, OCR) tuples to pretrain the modality
+connector. For fine-tuning, we create a new DaTi*k*Z<sub>v3</sub> dataset (to
+be released soon) with over 450k Ti*k*Z drawings.
+We also train a new model called
+[UltraSketch](https://huggingface.co/nllg/ultrasketch) to generate synthetic
+sketches during training. It is based on
+[UltraEdit](https://arxiv.org/abs/2407.05282) and achieves a congruence
+coefficient (CC) of 0.74. Additionally, we generate synthetic sketches using
+image transformations. While these sketches are less diverse, they are better at
+preserving text rendering, achieving a similar CC of 0.75. When we average the
+sketch representations produced by both methods, the resulting CC increases to
+0.82, indicating that the methods are orthogonal and complement each other
+effectively.
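As a rough illustration of why averaging two complementary sketch representations can raise the CC, the sketch below computes a congruence coefficient on toy vectors. It assumes that CC refers to Tucker's congruence coefficient; the model card does not spell out the exact computation. The function names `congruence_coefficient` and `average_representations` and all vectors are made up for illustration and are not taken from the DeTi*k*Zify code base.

```python
import math


def congruence_coefficient(x, y):
    """Tucker's congruence coefficient between two equal-length vectors.

    Assumption: this is the CC meant in the text above.
    """
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    numerator = sum(a * b for a, b in zip(x, y))
    denominator = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    if denominator == 0:
        raise ValueError("vectors must be non-zero")
    return numerator / denominator


def average_representations(u, v):
    """Element-wise mean of two equal-length representations."""
    return [(a + b) / 2 for a, b in zip(u, v)]


# Toy data (hypothetical): a reference vector and two synthetic
# representations that each capture a different half of it.
reference = [1.0, 1.0]
method_a = [1.0, 0.0]
method_b = [0.0, 1.0]

print(congruence_coefficient(reference, method_a))  # about 0.707
print(congruence_coefficient(reference, method_b))  # about 0.707
averaged = average_representations(method_a, method_b)
print(congruence_coefficient(reference, averaged))  # 1.0
```

In this toy case each method alone misses one component of the reference, so the average agrees with the reference better than either input does. The reported gain from 0.74 and 0.75 to 0.82 is consistent with this kind of complementarity, although the real representations are of course far higher-dimensional.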
+### Training & Inference
+We observe improved performance by extending the training to 5 epochs and
+increasing the learning rate to 5e-5. Fully fine-tuning the vision encoder
+means that we can no longer compute SelfSim as the cosine similarity between
+pooled outputs during inference, as the pooling head is not fine-tuned.
+However, computing the Earth Mover's Distance on the fine-tuned patch
+embeddings instead actually improves the correlation with human judgments (0.456
+segment-level and 0.911 system-level correlation). This means that
+DeTi*k*Zify<sub>v2</sub> also works well with our MCTS-based inference algorithm.
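The following is a minimal sketch of an Earth Mover's Distance over patch embeddings, not the project's implementation. It assumes that both images yield the same number of patches and that every patch carries equal weight, in which case the EMD reduces to an optimal one-to-one assignment. It also assumes cosine distance as the ground cost. `patch_emd` is a hypothetical helper name, and the arrays are random stand-ins for vision-encoder outputs.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def patch_emd(a, b):
    """EMD between two equally sized sets of patch embeddings.

    a, b: arrays of shape (num_patches, dim). With uniform weights and
    equal patch counts, the optimal transport plan is a permutation, so
    solving the assignment problem gives the exact EMD.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("this sketch assumes identically shaped inputs")
    # Normalize rows so that dot products are cosine similarities.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    cost = 1.0 - a @ b.T  # pairwise cosine distances
    rows, cols = linear_sum_assignment(cost)
    return float(cost[rows, cols].mean())


# Random stand-ins for the patch embeddings of two images.
rng = np.random.default_rng(0)
image_patches = rng.normal(size=(16, 8))
other_patches = rng.normal(size=(16, 8))

# The same patches in a different order have (numerically) zero distance.
print(patch_emd(image_patches, image_patches[::-1]))
# Unrelated patches have a clearly positive distance.
print(patch_emd(image_patches, other_patches))
```

A lower distance means the two images are more similar, so a similarity score for ranking candidates could be obtained by negating it. How DeTi*k*Zify turns the distance into a SelfSim reward is not stated here and is left open.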
+## Evaluation
+Here is how DeTi*k*Zify<sub>v2</sub> (8b) compares to
+[DeTi<i>k</i>Zify<sub>v1</sub>
+(DS-7b)](https://huggingface.co/nllg/detikzify-ds-7b), previously the best
+performing DeTi*k*Zify model, as evaluated on the test split of
+DaTi*k*Z<sub>v3</sub>.
+<table>
+  <tr>
+    <th></th>
+    <th colspan="5">Reference Figures</th>
+    <th colspan="5">Synthetic Sketches</th>
+  </tr>
+  <tr>
+    <th>Model</th>
+    <th>MTE<sub>&uarr;</sub></th>
+    <th>cBLEU<sub>&uarr;</sub></th>
+    <th>TED<sub>&darr;</sub></th>
+    <th>DSim<sub>&uarr;</sub></th>
+    <th>KID<sub>&darr;</sub></th>
+    <th>MTE<sub>&uarr;</sub></th>
+    <th>cBLEU<sub>&uarr;</sub></th>
+    <th>TED<sub>&darr;</sub></th>
+    <th>DSim<sub>&uarr;</sub></th>
+    <th>KID<sub>&darr;</sub></th>
+  </tr>
+  <tr>
+    <td>DeTi<i>k</i>Zify<sub>v1</sub> (DS-7b)</td>
+    <td>84.019</td>
+    <td> 2.953</td>
+    <td>56.851</td>
+    <td>73.589</td>
+    <td> 8.423</td>
+    <td>84.401</td>
+    <td> 1.541</td>
+    <td>59.589</td>
+    <td>65.446</td>
+    <td> 7.66 </td>
+  </tr>
+  <tr>
+    <td>DeTi<i>k</i>Zify<sub>v2</sub> (8b)</td>
+    <td><b>93.326</b></td>
+    <td><b> 6.105</b></td>
+    <td><b>54.946</b></td>
+    <td><b>78.943</b></td>
+    <td><b> 6.256</b></td>
+    <td><b>93.858</b></td>
+    <td><b> 3.356</b></td>
+    <td><b>58.32 </b></td>
+    <td><b>72.969</b></td>
+    <td><b> 7.507</b></td>
+  </tr>
+</table>