Update README.md
README.md
CHANGED
@@ -20,11 +20,44 @@ It achieves the following results on the evaluation set:
## Model description

This model uses the T5 tokenizer only for the input and a [custom tokenizer](https://huggingface.co/InfAI/sparql-tokenizer) for the SPARQL queries. This has led to a dramatic improvement in performance, although the model is not quite usable yet.
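
For training, the two tokenizers split the work: the question goes through the FLAN-T5 tokenizer and the target query through the SPARQL tokenizer. The snippet below is a minimal sketch of how a (question, query) pair could be encoded under that setup; it is an illustration rather than our exact preprocessing, and the length limits are placeholder assumptions.

```python
from transformers import AutoTokenizer

# Input side: FLAN-T5's own tokenizer; output side: the custom SPARQL tokenizer.
tokenizer_in = AutoTokenizer.from_pretrained("google/flan-t5-base")
tokenizer_out = AutoTokenizer.from_pretrained("InfAI/sparql-tokenizer")

def encode_pair(question: str, query: str, max_in: int = 128, max_out: int = 256):
    # Encode the natural-language question with the T5 tokenizer ...
    model_inputs = tokenizer_in(
        f"Create SPARQL Query: {question}", max_length=max_in, truncation=True
    )
    # ... and the target SPARQL query with the custom tokenizer, used as labels.
    model_inputs["labels"] = tokenizer_out(
        query, max_length=max_out, truncation=True
    )["input_ids"]
    return model_inputs
```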

## Intended uses & limitations

Because we used two different tokenizers, you cannot simply use this model in a `pipeline`. Use the following Python code as a starting point:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_checkpoint = "InfAI/flan-t5-text2sparql-custom-tokenizer"
question = "What was the population of Clermont-Ferrand on 1-1-2013?"
gold_answer = "SELECT ?obj WHERE { wd:Q42168 p:P1082 ?s . ?s ps:P1082 ?obj . ?s pq:P585 ?x filter(contains(YEAR(?x),'2013')) }"

model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

# The question is encoded with the FLAN-T5 tokenizer, the generated query is decoded with the custom SPARQL tokenizer.
tokenizer_in = AutoTokenizer.from_pretrained("google/flan-t5-base")
tokenizer_out = AutoTokenizer.from_pretrained("InfAI/sparql-tokenizer")

sample = f"Create SPARQL Query: {question}"

inputs = tokenizer_in([sample], return_tensors="pt")
outputs = model.generate(**inputs)

print(f"Gold answer: {gold_answer}")
print("             " + tokenizer_out.decode(outputs[0]))  # aligned under the gold answer for comparison
```

```
Gold answer: SELECT ?obj WHERE { wd:Q42168 p:P1082 ?s . ?s ps:P1082 ?obj . ?s pq:P585 ?x filter(contains(YEAR(?x),'2013')) }
             <pad> SELECT?obj WHERE { wd:Q4754 p:P1082?s.?s ps:P1082?obj.?s pq:P585?x filter(contains(YEAR(?x),'2013')) }
```

Common errors include (the first two can be stripped mechanically, see the sketch after this list):

- Adding a `<pad>` token at the beginning
- A stray closing curly brace at the end
- One of subject / predicate / object being wrong while the other two are correct
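
The following is a small, optional post-processing sketch for the first two errors. It is not part of the model; the function name and the exact rules are our own and may need adjusting.

```python
import re

def clean_query(raw: str) -> str:
    """Best-effort cleanup of a generated SPARQL query."""
    query = raw.strip()
    # Drop special tokens such as <pad> or </s> that the decoder may emit.
    query = re.sub(r"</?\w+>", "", query).strip()
    # Remove stray closing braces at the end while the braces are unbalanced.
    while query.endswith("}") and query.count("}") > query.count("{"):
        query = query[:-1].rstrip()
    return query

print(clean_query("<pad> SELECT?obj WHERE { wd:Q4754 p:P1082?s.?s ps:P1082?obj } }"))
```

Alternatively, decoding with `tokenizer_out.decode(outputs[0], skip_special_tokens=True)` should already suppress the `<pad>` token.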

## Training and evaluation data

@@ -32,6 +65,18 @@ More information needed

## Training procedure

We trained the model for 50 epochs, which was far more than necessary. The loss stagnates after about 25 epochs, and manually inspecting examples from the validation set showed that the queries do not improve beyond this point with these hyperparameters. We were aware that the number of epochs was probably too high, but our goal was to find out how many epochs actually benefit performance.

There are two avenues we will explore to get rid of these errors:

- Continue training with different hyperparameters (e.g. fewer epochs with early stopping, sketched below)
- Apply more preprocessing to the dataset
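
For the first avenue, the most likely change is simply training for fewer epochs with early stopping on the validation loss, since the loss stagnates after about 25 epochs. The snippet below is only an illustrative sketch of such a setup; the values are placeholders, not the hyperparameters listed in the next section.

```python
from transformers import Seq2SeqTrainingArguments, EarlyStoppingCallback

# Illustrative only: stop well before 50 epochs and keep the best
# checkpoint by validation loss. These values are placeholders.
training_args = Seq2SeqTrainingArguments(
    output_dir="flan-t5-text2sparql-continued",
    num_train_epochs=25,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    predict_with_generate=True,
)

# Passed to Seq2SeqTrainer via callbacks=[early_stopping].
early_stopping = EarlyStoppingCallback(early_stopping_patience=3)
```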

The results will be uploaded to this repo.

### Training hyperparameters
The following hyperparameters were used during training: