Enhancing Transformer with Horizontal and Vertical Guiding Mechanisms for Neural Language Modeling

Cited by: 0
Authors
Qu, Anlin [1 ,2 ]
Niu, Jianwei [1 ,2 ,3 ,4 ]
Mo, Shasha [1 ,2 ,5 ]
Affiliations
[1] Beihang Univ, Sch Comp Sci & Engn, State Key Lab Virtual Real Technol & Syst, Beijing 100191, Peoples R China
[2] Beihang Univ, Sch Comp Sci & Engn, Beijing Adv Innovat Ctr Big Data & Brain Comp, Beijing 100191, Peoples R China
[3] Beihang Univ, Hangzhou Innovat Res Inst, Hangzhou 310051, Peoples R China
[4] Zhengzhou Univ, Res Inst Ind Technol, Zhengzhou 450001, Peoples R China
[5] Beihang Univ, Sch Cyber Sci & Technol, Beijing 100191, Peoples R China
Source
IEEE INTERNATIONAL CONFERENCE ON COMMUNICATIONS (ICC 2021) | 2021
Funding
National Natural Science Foundation of China;
Keywords
neural language modeling; transformer; attention mechanism; information guiding;
DOI
10.1109/ICC42927.2021.9500450
CLC Classification Number
TN [Electronic Technology, Communication Technology];
Discipline Classification Code
0809;
Abstract
Language modeling is an important problem in Natural Language Processing (NLP), and the multi-layer Transformer network is currently the most advanced and effective model for this task. However, its multi-head self-attention structure has two inherent defects: (1) attention information loss: lower-level attention weights cannot be explicitly passed to upper layers, which may cause the network to lose pivotal attention information captured by the lower layers; (2) multi-head bottleneck: the dimension of each head in the vanilla Transformer is relatively small and each head is computed independently, which introduces an expressive bottleneck and fundamentally limits subspace learning. To overcome these two weaknesses, a novel neural architecture named Guide-Transformer is proposed in this paper. Guide-Transformer uses horizontal and vertical attention information to guide the original multi-head self-attention sublayer without introducing excessive complexity. Experimental results on three authoritative language modeling benchmarks demonstrate the effectiveness of Guide-Transformer. On the standard perplexity (ppl) and bits-per-character (bpc) evaluation metrics, Guide-Transformer achieves moderate improvements over a strong baseline model.
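The abstract does not give the exact fusion equations, so the two guiding directions can only be illustrated with a minimal, assumption-laden PyTorch-style sketch rather than the authors' implementation. In the sketch below, each head's attention map is blended with the mean attention over all heads of the same layer (a stand-in for horizontal guiding) and with the attention map returned by the layer below (a stand-in for vertical guiding). The module name GuidedSelfAttention, the coefficients alpha_h and alpha_v, and the linear-interpolation fusion rule are all hypothetical choices for illustration.

```python
# Minimal sketch of attention "guiding" across heads (horizontal) and
# across layers (vertical). Not the paper's implementation; the fusion
# rule and coefficients are assumptions.
import torch
import torch.nn as nn


class GuidedSelfAttention(nn.Module):
    def __init__(self, d_model=512, n_heads=8, alpha_h=0.3, alpha_v=0.3):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        self.alpha_h = alpha_h  # weight of the cross-head (horizontal) signal
        self.alpha_v = alpha_v  # weight of the lower layer's attention map

    def forward(self, x, prev_attn=None, causal_mask=None):
        # x: (batch, seq, d_model); prev_attn: (batch, heads, seq, seq) or None
        b, t, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        shape = (b, t, self.n_heads, self.d_head)
        # reshape each projection to (batch, heads, seq, d_head)
        q, k, v = (z.view(shape).transpose(1, 2) for z in (q, k, v))

        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        if causal_mask is not None:
            scores = scores.masked_fill(causal_mask, float("-inf"))
        attn = scores.softmax(dim=-1)  # (batch, heads, seq, seq)

        # Horizontal guiding (assumed form): let each head see the mean
        # attention map of all heads in the same layer.
        head_mean = attn.mean(dim=1, keepdim=True)
        attn = (1 - self.alpha_h) * attn + self.alpha_h * head_mean

        # Vertical guiding (assumed form): blend in the attention map from
        # the layer below, so lower-level attention is passed up explicitly.
        if prev_attn is not None:
            attn = (1 - self.alpha_v) * attn + self.alpha_v * prev_attn

        out = (attn @ v).transpose(1, 2).reshape(b, t, -1)
        return self.out(out), attn  # return attn so the next layer can be guided


if __name__ == "__main__":
    layer1 = GuidedSelfAttention()
    layer2 = GuidedSelfAttention()
    x = torch.randn(2, 16, 512)
    y1, a1 = layer1(x)                 # first layer has no lower-level guidance
    y2, a2 = layer2(y1, prev_attn=a1)  # upper layer guided by lower-layer attention
    print(y2.shape, a2.shape)          # (2, 16, 512) and (2, 8, 16, 16)
```

Because each layer returns its (guided) attention map for the layer above to consume, lower-level attention information is carried upward explicitly, which is the deficiency the abstract describes as "attention information loss".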
Pages: 6