felixb85 committed
Commit 3c11e49
1 Parent(s): 0bbe640

Update README.md

Files changed (1)
  1. README.md +47 -2
README.md CHANGED
@@ -20,11 +20,44 @@ It achieves the following results on the evaluation set:

 ## Model description

- More information needed
+ This model uses the T5 tokenizer only for the input and a [custom one](https://huggingface.co/InfAI/sparql-tokenizer) for the SPARQL queries. This
+ has led to a dramatic improvement in performance, although the model is not quite usable yet.
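+
+ One way to see the effect of the custom tokenizer is to compare how each tokenizer splits the same query. The following is only a rough sketch (the query string is taken from the example further down, and the exact tokens depend on each tokenizer's vocabulary):
+
+ ```python
+ from transformers import AutoTokenizer
+
+ # Gold query from the example below
+ query = "SELECT ?obj WHERE { wd:Q42168 p:P1082 ?s . ?s ps:P1082 ?obj . ?s pq:P585 ?x filter(contains(YEAR(?x),'2013')) }"
+
+ # Standard FLAN-T5 subword tokenizer vs. the custom SPARQL tokenizer
+ t5_tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
+ sparql_tokenizer = AutoTokenizer.from_pretrained("InfAI/sparql-tokenizer")
+
+ print(len(t5_tokenizer.tokenize(query)), t5_tokenizer.tokenize(query)[:10])
+ print(len(sparql_tokenizer.tokenize(query)), sparql_tokenizer.tokenize(query)[:10])
+ ```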

 ## Intended uses & limitations

+ Because we used two different tokenizers, you cannot simply use this model in a pipeline. Use the following Python code as a starting point:
+
+ ```python
+ from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
+
+ model_checkpoint = "InfAI/flan-t5-text2sparql-custom-tokenizer"
+ question = "What was the population of Clermont-Ferrand on 1-1-2013?"
+ gold_answer = "SELECT ?obj WHERE { wd:Q42168 p:P1082 ?s . ?s ps:P1082 ?obj . ?s pq:P585 ?x filter(contains(YEAR(?x),'2013')) }"
+
+ model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)
+
+ # The input is tokenized with the original FLAN-T5 tokenizer,
+ # the generated query is decoded with the custom SPARQL tokenizer.
+ tokenizer_in = AutoTokenizer.from_pretrained("google/flan-t5-base")
+ tokenizer_out = AutoTokenizer.from_pretrained("InfAI/sparql-tokenizer")
+
+ sample = f"Create SPARQL Query: {question}"
+
+ inputs = tokenizer_in([sample], return_tensors="pt")
+ outputs = model.generate(**inputs)
+
+ print(f"Gold answer: {gold_answer}")
+ print(" " + tokenizer_out.decode(outputs[0]))
+ ```
+
+ ```
+ Gold answer: SELECT ?obj WHERE { wd:Q42168 p:P1082 ?s . ?s ps:P1082 ?obj . ?s pq:P585 ?x filter(contains(YEAR(?x),'2013')) }
+ <pad> SELECT?obj WHERE { wd:Q4754 p:P1082?s.?s ps:P1082?obj.?s pq:P585?x filter(contains(YEAR(?x),'2013')) }
+ ```
+
+ Common errors include (the first two can be cleaned up in post-processing, as sketched below):
+
+ - Adding a `<pad>` token at the beginning
+ - A stray closing curly brace at the end
+ - One of subject / predicate / object is wrong, while the other two are correct
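+
+ A minimal sketch of that cleanup, continuing from the snippet above (`clean_query` is just an illustrative helper and only addresses the first two errors):
+
+ ```python
+ def clean_query(decoded: str) -> str:
+     # Strip leftover special tokens such as the leading <pad>.
+     query = decoded.replace("<pad>", "").strip()
+     # Drop a stray closing brace if the braces are unbalanced.
+     if query.endswith("}") and query.count("}") > query.count("{"):
+         query = query[:-1].rstrip()
+     return query
+
+ print(clean_query(tokenizer_out.decode(outputs[0])))
+ ```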

 ## Training and evaluation data

@@ -32,6 +65,18 @@ More information needed

 ## Training procedure

+ We trained the model for 50 epochs, which turned out to be far more than necessary. The loss stagnates after about 25 epochs, and manually inspecting
+ some examples from the validation set showed us that the queries do not improve beyond this point with these hyperparameters.
+ We were aware that the number of epochs was probably too high, but our goal was to find out how many epochs were actually beneficial
+ to the performance.
+
+ There are two avenues we will explore to get rid of these errors:
+
+ - Continue training with different hyperparameters
+ - Apply more preprocessing to the dataset
+
+ The results will be uploaded to this repo.
+
 ### Training hyperparameters

 The following hyperparameters were used during training: