We will attempt to understand the RESDSQL paper and its implementation on GitHub.
Due to the structural property of SQL queries, the seq2seq model takes responsibility for parsing both the schema items (i.e., tables and columns) and the skeleton (i.e., SQL keywords). Such coupled targets increase the difficulty of parsing correct SQL queries, especially when they involve many schema items and logic operators.
For a seq2seq encoder-decoder model, the encoder is injected with the most relevant schema items instead of the whole unordered set, which could alleviate the schema-linking effort during SQL parsing, and the decoder first generates the skeleton and then the actual SQL query, which could implicitly constrain the SQL parsing.
So the two main features of this paper are:

- Ranking-enhanced Encoding: Instead of using the whole schema, the encoder is injected with the most relevant schema items. An additional cross-encoder is trained to classify the tables and columns simultaneously based on the input question, and the schema items are then ranked and filtered according to the classification probabilities to form a ranked schema sequence.
- Skeleton-first Decoding: The decoder first generates the skeleton (SQL keywords) and then the actual SQL query. Since skeleton parsing is much easier than SQL parsing, the first generated skeleton can implicitly guide the subsequent SQL parsing via the masked self-attention mechanism in the decoder.
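To make these two ideas concrete, here is a minimal sketch of what the encoder input and the skeleton-first decoder target roughly look like. The delimiters, column ordering, and serialization below are illustrative assumptions, not the repository's exact format:

```python
# Illustrative sketch only; the exact serialization in the repo may differ.
question = "What are the names of the heads who are born outside the California state?"

# Ranking-enhanced encoding: only the top-ranked tables/columns (ordered by
# the cross-encoder's classification probabilities) are serialized, instead
# of the whole unordered schema.
ranked_schema = "head : name , born_state , head_id | department : department_id , name"
encoder_input = question + " | " + ranked_schema

# Skeleton-first decoding: the target is the SQL skeleton (keywords with
# placeholder slots) followed by the full SQL query, so the decoder emits
# the easier part first and conditions on it via masked self-attention.
skeleton = "select _ from _ where _"
sql = "select name from head where born_state != 'California'"
decoder_target = skeleton + " | " + sql

print(encoder_input)
print(decoder_target)
```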
Implementation
We will identify the steps performed by the scripts in the GitHub repository.
1. Run the contents of preprocess.sh
Minor changes (remove the --db_path parameter to use the default):
```sh
python preprocessing.py \
    --mode "train" \
    --table_path "./data/spider/tables.json" \
    --input_dataset_path "./data/spider/train_spider.json" \
    --output_dataset_path "./data/preprocessed_data/preprocessed_train_spider.json" \
    --target_type "sql"

python preprocessing.py \
    --mode "eval" \
    --table_path "./data/spider/tables.json" \
    --input_dataset_path "./data/spider/dev.json" \
    --output_dataset_path "./data/preprocessed_data/preprocessed_dev.json" \
    --target_type "sql"
```
The preprocessing adds the following fields to each example and saves the result in a new JSON file:

- norm_sql: normalized SQL query
- sql_skeleton: SQL skeleton
- nat_sql: NatSQL query if it exists (for train and dev, not test)
- norm_nat_sql: normalized NatSQL query if it exists
- natsql_skeleton: NatSQL skeleton
- pk: primary keys
- fk: foreign keys
- db_schema: database schema with original and semantic table and column names
- db_contents: column value matches

See the query 'What are the names of the heads who are born outside the California state' in the generated file preprocessed_train_spider.json.
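To check the result, one can load the generated file and print the added fields for that example. This is a minimal sketch assuming the preprocessed file is a JSON list whose examples keep the original question field alongside the new fields:

```python
import json

# Load the preprocessed training set produced by preprocessing.py.
with open("./data/preprocessed_data/preprocessed_train_spider.json") as f:
    dataset = json.load(f)

# Find the example mentioned above and print the fields added by preprocessing.
target = "What are the names of the heads who are born outside the California state"
for example in dataset:
    if target in example.get("question", ""):
        for key in ("norm_sql", "sql_skeleton", "nat_sql", "norm_nat_sql",
                    "natsql_skeleton", "pk", "fk", "db_schema", "db_contents"):
            print(key, "=>", example.get(key))
        break
```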