---
license: mit
datasets:
- lc_quad
---
This repo contains a custom tokenizer for SPARQL queries. It is a SentencePieceBPE tokenizer trained on the lc_quad dataset. The example below compares its output with that of a general-purpose tokenizer.
Original query:
```
SELECT ?answer WHERE { wd:Q825946 wdt:P371 ?X . ?X wdt:P2048 ?answer}
```
Result from the default T5 tokenizer (shown only for comparison):
```
['▁', 'SEL', 'ECT', '▁', '?', 'ans', 'wer', '▁W', 'HER', 'E', '▁', '{', '▁', 'w', 'd', ':', 'Q', '82', '59', '46', '▁',
'w', 'd', 't', ':', 'P', '37', '1', '▁', '?', 'X', '▁', '.', '▁', '?', 'X', '▁', 'w', 'd', 't', ':', 'P', '20', '48',
'▁', '?', 'ans', 'wer', '}']
```
Result from this tokenizer:
```
['▁SELECT', '▁?answer', '▁WHERE', '▁{', '▁wd:Q8', '259', '46', '▁wdt:P371', '▁?X', '▁.', '▁?X', '▁wdt:P2048', '▁?answer', '}']
```
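The difference between the two outputs can be quantified without downloading anything. The sketch below works only on the token lists printed above: it counts the tokens and checks that both lists still spell out the original query. The helper name `detokenize` is hypothetical; it relies on the SentencePiece convention that `▁` marks a preceding space.

```python
# Token lists copied verbatim from the two outputs above.
query = "SELECT ?answer WHERE { wd:Q825946 wdt:P371 ?X . ?X wdt:P2048 ?answer}"

t5_tokens = [
    '▁', 'SEL', 'ECT', '▁', '?', 'ans', 'wer', '▁W', 'HER', 'E', '▁', '{',
    '▁', 'w', 'd', ':', 'Q', '82', '59', '46', '▁',
    'w', 'd', 't', ':', 'P', '37', '1', '▁', '?', 'X', '▁', '.', '▁', '?',
    'X', '▁', 'w', 'd', 't', ':', 'P', '20', '48',
    '▁', '?', 'ans', 'wer', '}',
]

sparql_tokens = [
    '▁SELECT', '▁?answer', '▁WHERE', '▁{', '▁wd:Q8', '259', '46',
    '▁wdt:P371', '▁?X', '▁.', '▁?X', '▁wdt:P2048', '▁?answer', '}',
]


def detokenize(tokens):
    """Hypothetical helper: join SentencePiece pieces back into text."""
    return "".join(tokens).replace("▁", " ").strip()


# Both tokenizations should reconstruct the same query text.
assert detokenize(t5_tokens) == query
assert detokenize(sparql_tokens) == query

# Fewer tokens for the same text means shorter model inputs.
ratio = len(t5_tokens) / len(sparql_tokens)
print(f"T5: {len(t5_tokens)} tokens, this tokenizer: {len(sparql_tokens)} tokens")
print(f"T5 needs {ratio:.1f}x as many tokens for this query")
```

This is a single-query illustration, not a benchmark; the ratio on other queries may differ.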
# How to use
```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("InfAI/sparql-tokenizer")
tokenizer.tokenize("SELECT ?answer WHERE { wd:Q825946 wdt:P371 ?X . ?X wdt:P2048 ?answer}")
```
```
['▁SELECT', '▁?answer', '▁WHERE', '▁{', '▁wd:Q8', '259', '46', '▁wdt:P371', '▁?X', '▁.', '▁?X', '▁wdt:P2048', '▁?answer', '}']
```
```python
tokenizer("SELECT ?answer WHERE { wd:Q825946 wdt:P371 ?X . ?X wdt:P2048 ?answer}")
```
```
{'input_ids': [441, 444, 431, 422, 606, 1388, 720, 1791, 456, 418, 456, 3657, 444, 185], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
```
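The `input_ids` above line up position by position with the tokens returned by `tokenizer.tokenize`. As a minimal sketch, the snippet below pairs the two lists exactly as printed in this README, so it runs offline; the name `vocab_slice` is hypothetical. With the tokenizer loaded, `tokenizer.convert_ids_to_tokens(...)` should give the same pairing, though that is not verified here.

```python
# Values copied verbatim from the outputs above.
tokens = [
    '▁SELECT', '▁?answer', '▁WHERE', '▁{', '▁wd:Q8', '259', '46',
    '▁wdt:P371', '▁?X', '▁.', '▁?X', '▁wdt:P2048', '▁?answer', '}',
]
input_ids = [441, 444, 431, 422, 606, 1388, 720, 1791, 456, 418, 456, 3657, 444, 185]

# One id per token.
assert len(tokens) == len(input_ids)

# Build a small token -> id table from this one example.
vocab_slice = {}
for token, token_id in zip(tokens, input_ids):
    # setdefault returns the id already stored for a repeated token,
    # so a repeated token must map to the same id both times.
    previous = vocab_slice.setdefault(token, token_id)
    assert previous == token_id, f"{token!r} mapped to two different ids"

for token, token_id in vocab_slice.items():
    print(f"{token_id:>5}  {token}")
```

Repeated tokens such as `▁?answer` and `▁?X` map to the same id each time they occur, which is why the table has fewer entries than the token list.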