Structure-Grounded Pretraining for Text-to-SQL

Authors
Deng, Xiang [1, 2]
Awadallah, Ahmed Hassan [2]
Meek, Christopher [2]
Polozov, Oleksandr [2]
Sun, Huan [1]
Richardson, Matthew [2]
Affiliations
[1] Ohio State Univ, Columbus, OH 43210 USA
[2] Microsoft Res, Redmond, WA USA
Keywords
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Learning to capture text-table alignment is essential for tasks like text-to-SQL. A model needs to correctly recognize natural language references to columns and values and to ground them in the given database schema. In this paper, we present a novel weakly supervised Structure-Grounded pretraining framework (STRUG) for text-to-SQL that can effectively learn to capture text-table alignment based on a parallel text-table corpus. We identify a set of novel pretraining tasks: column grounding, value grounding, and column-value mapping, and leverage them to pretrain a text-table encoder. Additionally, to evaluate different methods under more realistic text-table alignment settings, we create a new evaluation set, Spider-Realistic, based on the Spider dev set with explicit mentions of column names removed, and adopt eight existing text-to-SQL datasets for cross-database evaluation. STRUG brings significant improvement over BERT-LARGE in all settings. Compared with existing pretraining methods such as GRAPPA, STRUG achieves similar performance on Spider and outperforms all baselines on more realistic sets. All the code and data used in this work are publicly available at https://aka.ms/strug.
Pages: 1337-1350
Number of pages: 14