Enhancing Transformer with Horizontal and Vertical Guiding Mechanisms for Neural Language Modeling

Cited by: 0
Authors
Qu, Anlin [1 ,2 ]
Niu, Jianwei [1 ,2 ,3 ,4 ]
Mo, Shasha [1 ,2 ,5 ]
Affiliations
[1] Beihang Univ, Sch Comp Sci & Engn, State Key Lab Virtual Real Technol & Syst, Beijing 100191, Peoples R China
[2] Beihang Univ, Sch Comp Sci & Engn, Beijing Adv Innovat Ctr Big Data & Brain Comp, Beijing 100191, Peoples R China
[3] Beihang Univ, Hangzhou Innovat Res Inst, Hangzhou 310051, Peoples R China
[4] Zhengzhou Univ, Res Inst Ind Technol, Zhengzhou 450001, Peoples R China
[5] Beihang Univ, Sch Cyber Sci & Technol, Beijing 100191, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
neural language modeling; transformer; attention mechanism; information guiding;
DOI
10.1109/ICC42927.2021.9500450
Chinese Library Classification (CLC)
TN [electronic technology, communication technology];
Discipline classification code
0809;
Abstract
Language modeling is an important problem in Natural Language Processing (NLP), and the multi-layer Transformer network is currently the most advanced and effective model for this task. However, its multi-head self-attention structure has two inherent defects: (1) attention information loss: lower-level attention weights cannot be explicitly passed to upper layers, so the network may lose pivotal attention information captured by lower layers; (2) multi-head bottleneck: each head in the vanilla Transformer has a relatively small dimension and is processed independently of the others, which introduces an expressive bottleneck and fundamentally limits subspace learning. To overcome these two weaknesses, this paper proposes a novel neural architecture named Guide-Transformer, which uses horizontal and vertical attention information to guide the original multi-head self-attention sublayer without introducing excessive complexity. Experimental results on three authoritative language modeling benchmarks demonstrate the effectiveness of Guide-Transformer: on the popular perplexity (ppl) and bits-per-character (bpc) metrics, it achieves moderate improvements over a strong baseline model.
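The abstract describes the two guiding directions only at a high level, so the following PyTorch sketch is merely one plausible reading of them, not the paper's actual formulation (see the DOI above for that). Here "vertical" guiding is rendered as a learned gate that blends the previous layer's attention weights into the current layer's, and "horizontal" guiding as a learned mixing of attention maps across heads. All names (GuidedSelfAttention, head_mix, alpha, prev_attn) are hypothetical, introduced purely for illustration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GuidedSelfAttention(nn.Module):
    # Hypothetical sketch: a self-attention sublayer with (a) "vertical"
    # guiding, blending the previous layer's attention weights into this
    # layer's, and (b) "horizontal" guiding, mixing attention maps across
    # heads so the heads are no longer fully independent.
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        # Vertical gate (assumed form): sigmoid(alpha) weighs current vs.
        # lower-layer attention; initialized to an even 0.5/0.5 blend.
        self.alpha = nn.Parameter(torch.tensor(0.0))
        # Horizontal mixing matrix (assumed form): row-softmaxed so each
        # head receives a convex combination of all heads' attention maps.
        self.head_mix = nn.Parameter(torch.eye(n_heads))

    def forward(self, x, prev_attn=None):
        # x: (batch, seq, d_model); prev_attn: (batch, heads, seq, seq) or
        # None. Causal masking, needed for language modeling, is omitted.
        B, T, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        k = k.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        v = v.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        attn = F.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        # Horizontal guiding: couple the heads through the mixing matrix.
        attn = torch.einsum('hg,bgij->bhij', self.head_mix.softmax(dim=-1), attn)
        # Vertical guiding: gate in the attention passed up from below.
        if prev_attn is not None:
            g = torch.sigmoid(self.alpha)
            attn = g * attn + (1 - g) * prev_attn
        y = (attn @ v).transpose(1, 2).reshape(B, T, -1)
        return self.out(y), attn  # return attn so the next layer can be guided

Stacking such layers and feeding each layer's returned attn into the next layer's prev_attn gives lower-level attention an explicit path upward, addressing defect (1), while head_mix couples the otherwise independent heads, addressing defect (2); dropout and the rest of the Transformer block are omitted for brevity.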
Pages: 6
Related Papers
50 records in total
  • [1] LANGUAGE MODELING WITH TRANSFORMER
    Zhang, Jian Guo
    Li, Jian Ping
    Li, Huang
    2019 16TH INTERNATIONAL COMPUTER CONFERENCE ON WAVELET ACTIVE MEDIA TECHNOLOGY AND INFORMATION PROCESSING (ICWAMTIP), 2019, : 249 - 253
  • [2] Vertical and horizontal transmission in language evolution
    Wang, WSY
    Minett, JW
    TRANSACTIONS OF THE PHILOLOGICAL SOCIETY, 2005, 103 (02) : 121 - 146
  • [3] A Tensorized Transformer for Language Modeling
    Ma, Xindian
    Zhang, Peng
    Zhang, Shuai
    Duan, Nan
    Hou, Yuexian
    Song, Dawei
    Zhou, Ming
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 32 (NIPS 2019), 2019, 32
  • [4] Enhancing vertical efficiency through horizontal licensing
    Arya, A
    Mittendorf, B
    JOURNAL OF REGULATORY ECONOMICS, 2006, 29 (03) : 333 - 342
  • [5] HORIZONTAL AND VERTICAL PATHWAYS IN NEURAL INDUCTION
    GUTHRIE, S
    TRENDS IN NEUROSCIENCES, 1991, 14 (04) : 123 - 126
  • [6] BERTAC: Enhancing Transformer-based Language Models with Adversarially Pretrained Convolutional Neural Networks
    Oh, Jong-Hoon
    Iida, Ryu
    Kloetzer, Julien
    Torisawa, Kentaro
    59TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS AND THE 11TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING, VOL 1 (ACL-IJCNLP 2021), 2021, : 2103 - 2115
  • [7] Horizontal and vertical prism adaptation are different mechanisms
    Brautaset, RL
    Jennings, JAM
    OPHTHALMIC AND PHYSIOLOGICAL OPTICS, 2005, 25 (03) : 215 - 218
  • [8] Horizontal Power, Vertical Weakness: Enhancing the "Circuit of Culture"
    Champ, Joseph G.
    POPULAR COMMUNICATION, 2008, 6 (02) : 85 - 102
  • [9] Horizontal and Vertical Determination of Mental and Neural States
    Harbecke, Jens
    Atmanspacher, Harald
    JOURNAL OF THEORETICAL AND PHILOSOPHICAL PSYCHOLOGY, 2012, 32 (03): : 161 - 179