Enhancing Transformer with Horizontal and Vertical Guiding Mechanisms for Neural Language Modeling

Cited by: 0
Authors
Qu, Anlin [1 ,2 ]
Niu, Jianwei [1 ,2 ,3 ,4 ]
Mo, Shasha [1 ,2 ,5 ]
Affiliations
[1] Beihang Univ, Sch Comp Sci & Engn, State Key Lab Virtual Real Technol & Syst, Beijing 100191, Peoples R China
[2] Beihang Univ, Sch Comp Sci & Engn, Beijing Adv Innovat Ctr Big Data & Brain Comp, Beijing 100191, Peoples R China
[3] Beihang Univ, Hangzhou Innovat Res Inst, Hangzhou 310051, Peoples R China
[4] Zhengzhou Univ, Res Inst Ind Technol, Zhengzhou 450001, Peoples R China
[5] Beihang Univ, Sch Cyber Sci & Technol, Beijing 100191, Peoples R China
Source
IEEE INTERNATIONAL CONFERENCE ON COMMUNICATIONS (ICC 2021) | 2021
Funding
National Natural Science Foundation of China;
Keywords
neural language modeling; transformer; attention mechanism; information guiding;
DOI
10.1109/ICC42927.2021.9500450
CLC Classification Number
TN [Electronic Technology, Communication Technology];
Discipline Classification Code
0809;
Abstract
Language modeling is an important problem in Natural Language Processing (NLP), and the multi-layer Transformer network is currently the most advanced and effective model for this task. However, its multi-head self-attention structure has two inherent defects: (1) attention information loss: lower-level attention weights cannot be explicitly passed to upper layers, which may cause the network to lose pivotal attention information captured by the lower layers; (2) multi-head bottleneck: the dimension of each head in the vanilla Transformer is relatively small and each head is computed independently, which introduces an expressive bottleneck and fundamentally limits subspace learning. To overcome these two weaknesses, a novel neural architecture named Guide-Transformer is proposed in this paper. Guide-Transformer uses horizontal and vertical attention information to guide the original multi-head self-attention sublayer without introducing excessive complexity. Experimental results on three authoritative language modeling benchmarks demonstrate the effectiveness of Guide-Transformer. On the standard perplexity (ppl) and bits-per-character (bpc) evaluation metrics, Guide-Transformer achieves moderate improvements over a strong baseline model.
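The abstract does not give the exact fusion equations, so the two guiding directions can only be illustrated with a minimal, assumption-laden PyTorch-style sketch rather than the authors' implementation. In the sketch below, each head's attention map is blended with the mean attention over all heads of the same layer (a stand-in for horizontal guiding) and with the attention map returned by the layer below (a stand-in for vertical guiding). The module name GuidedSelfAttention, the coefficients alpha_h and alpha_v, and the linear-interpolation fusion rule are all hypothetical choices for illustration.

```python
# Minimal sketch of attention "guiding" across heads (horizontal) and
# across layers (vertical). Not the paper's implementation; the fusion
# rule and coefficients are assumptions.
import torch
import torch.nn as nn


class GuidedSelfAttention(nn.Module):
    def __init__(self, d_model=512, n_heads=8, alpha_h=0.3, alpha_v=0.3):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        self.alpha_h = alpha_h  # weight of the cross-head (horizontal) signal
        self.alpha_v = alpha_v  # weight of the lower layer's attention map

    def forward(self, x, prev_attn=None, causal_mask=None):
        # x: (batch, seq, d_model); prev_attn: (batch, heads, seq, seq) or None
        b, t, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        shape = (b, t, self.n_heads, self.d_head)
        # reshape each projection to (batch, heads, seq, d_head)
        q, k, v = (z.view(shape).transpose(1, 2) for z in (q, k, v))

        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        if causal_mask is not None:
            scores = scores.masked_fill(causal_mask, float("-inf"))
        attn = scores.softmax(dim=-1)  # (batch, heads, seq, seq)

        # Horizontal guiding (assumed form): let each head see the mean
        # attention map of all heads in the same layer.
        head_mean = attn.mean(dim=1, keepdim=True)
        attn = (1 - self.alpha_h) * attn + self.alpha_h * head_mean

        # Vertical guiding (assumed form): blend in the attention map from
        # the layer below, so lower-level attention is passed up explicitly.
        if prev_attn is not None:
            attn = (1 - self.alpha_v) * attn + self.alpha_v * prev_attn

        out = (attn @ v).transpose(1, 2).reshape(b, t, -1)
        return self.out(out), attn  # return attn so the next layer can be guided


if __name__ == "__main__":
    layer1 = GuidedSelfAttention()
    layer2 = GuidedSelfAttention()
    x = torch.randn(2, 16, 512)
    y1, a1 = layer1(x)                 # first layer has no lower-level guidance
    y2, a2 = layer2(y1, prev_attn=a1)  # upper layer guided by lower-layer attention
    print(y2.shape, a2.shape)          # (2, 16, 512) and (2, 8, 16, 16)
```

Because each layer returns its (guided) attention map for the layer above to consume, lower-level attention information is carried upward explicitly, which is the deficiency the abstract describes as "attention information loss".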
Pages: 6