Code, explanations, and links to the model's checkpoints and datasets are available on GitHub: mRAT-SQL.

From the Hugging Face collection you can download the model's checkpoints and datasets, but for a full understanding it is better to visit the GitHub mRAT-SQL repository.
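For example, the checkpoints and datasets can be fetched programmatically with the huggingface_hub library. This is a minimal sketch, assuming the repo id shown on this model card; the exact file layout is documented in the GitHub repository:

```python
# Minimal download sketch using huggingface_hub (pip install huggingface_hub).
# The repo id matches this model card; file layout is described on GitHub.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="Marchanjo/mRAT-SQL")
print(f"Files downloaded to: {local_dir}")
```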

mRAT-SQL-FIT

A Multilingual Translator to SQL with Database Schema Pruning to Improve Self-Attention

Marcelo Archanjo José, Fabio Gagliardi Cozman

Long sequences of text are challenging for transformers due to the quadratic memory growth of the self-attention mechanism. As this issue directly affects translation from natural language to SQL queries (techniques usually take as input a concatenated text with the question and the database schema), we present techniques that allow long text sequences to be handled by transformers with up to 512 input tokens. We propose a training process with database schema pruning (removal of table and column names that are useless for the query of interest). In addition, we used a multilingual approach with the mT5-large model fine-tuned with a data-augmented Spider dataset in four languages simultaneously: English, Portuguese, Spanish, and French. With the Spider dataset, our proposed technique increased the exact set match accuracy from 0.718 to 0.736 on the validation dataset (Dev). Source code, evaluations, and checkpoints are available at: mRAT-SQL.
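To illustrate the schema-pruning idea, here is a hedged sketch (not the authors' code; the real pipeline is in the GitHub mRAT-SQL repository). During training the gold SQL query is known, so tables and columns it never references can be dropped before the question and schema are concatenated into the (at most 512-token) transformer input:

```python
# Illustrative sketch of training-time schema pruning: drop table and
# column names the gold SQL query never references, then concatenate
# the question with the pruned schema as the model input.

def prune_schema(schema, gold_sql):
    """Keep only tables/columns mentioned in the gold SQL query.

    schema: dict mapping table name -> list of column names.
    gold_sql: the reference SQL string, available at training time.
    """
    sql_lower = gold_sql.lower()
    pruned = {}
    for table, columns in schema.items():
        if table.lower() not in sql_lower:
            continue  # drop tables the query never touches
        kept = [c for c in columns if c.lower() in sql_lower]
        pruned[table] = kept or columns  # hypothetical fallback: keep all columns
    return pruned

def build_model_input(question, schema):
    """Concatenate question and (pruned) schema, as the abstract describes."""
    schema_text = " | ".join(
        f"{table}: {', '.join(cols)}" for table, cols in schema.items()
    )
    return f"{question} | {schema_text}"

# Toy example with a Spider-style schema:
schema = {"singer": ["singer_id", "name", "age"], "concert": ["concert_id", "year"]}
sql = "SELECT name FROM singer WHERE age > 30"
print(build_model_input("List singer names older than 30", prune_schema(schema, sql)))
```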

Paper published in Springer Nature's International Journal of Information Technology (SharedIt link); the pre-print is available on arXiv.

mRAT-SQL+GAP

mRAT-SQL+GAP: A Portuguese Text-to-SQL Transformer

Marcelo Archanjo José, Fabio Gagliardi Cozman

The translation of natural language questions to SQL queries has attracted growing attention, in particular in connection with transformers and similar language models. A large number of techniques are geared towards the English language; in this work, we thus investigated translation to SQL when input questions are given in the Portuguese language. To do so, we adapted state-of-the-art tools and resources: we changed the RAT-SQL+GAP system to rely on a multilingual BART model (we report tests with other language models), and we produced a translated version of the Spider dataset. Our experiments expose interesting phenomena that arise when non-English languages are targeted; in particular, it is better to train with the original and translated training datasets together, even if a single target language is desired. This multilingual BART model, fine-tuned with a double-size training dataset (English and Portuguese), achieved 83% of the baseline when making inferences on the Portuguese test dataset. This investigation can help other researchers produce Machine Learning results in languages other than English. Our multilingual-ready version of RAT-SQL+GAP and the data are open-sourced as mRAT-SQL+GAP at: mRAT-SQL.
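As an illustration of the double-size training set idea, fine-tuning can simply pool the English Spider examples with their Portuguese translations so the multilingual model sees both languages during training. The file names below are hypothetical; the released data lives in the mRAT-SQL GitHub repository:

```python
# Hedged sketch: build one combined training pool from the English Spider
# data and its Portuguese translation. Paths are hypothetical placeholders.
import json

def load_examples(path):
    with open(path, encoding="utf-8") as f:
        return json.load(f)  # list of {"question": ..., "query": ...} records

english = load_examples("spider_train_english.json")        # hypothetical path
portuguese = load_examples("spider_train_portuguese.json")  # hypothetical path

# One pool: the multilingual model trains on both languages together,
# even when only Portuguese inference is intended.
train_set = english + portuguese
print(f"Training on {len(train_set)} examples "
      f"({len(english)} EN + {len(portuguese)} PT)")
```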

BRACIS 2021: paper published in Springer Lecture Notes in Computer Science; the pre-print is available on arXiv.

Based on RAT-SQL+GAP (GitHub); see the AAAI 2021 paper.
