Mask Attention Networks: Rethinking and Strengthen Transformer

Cited by: 0
Authors
Fan, Zhihao [1 ]
Gong, Yeyun [2 ]
Liu, Dayiheng [3 ]
Wei, Zhongyu [1 ,6 ]
Wang, Siyuan [1 ]
Jiao, Jian [4 ]
Duan, Nan [2 ]
Zhang, Ruofei [4 ]
Huang, Xuanjing [5 ]
Affiliations
[1] Fudan Univ, Sch Data Sci, Shanghai, Peoples R China
[2] Microsoft Res Asia, Beijing, Peoples R China
[3] DAMO Acad, Hangzhou, Peoples R China
[4] Microsoft, Beijing, Peoples R China
[5] Fudan Univ, Sch Comp Sci, Shanghai, Peoples R China
[6] Fudan Univ, Res Inst Intelligent & Complex Syst, Shanghai, Peoples R China
Keywords
DOI
None
CLC Number
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Transformer is an attention-based neural network whose layers consist of two sublayers, the Self-Attention Network (SAN) and the Feed-Forward Network (FFN). Existing research has explored enhancing the two sublayers separately to improve the capability of Transformer for text representation. In this paper, we present a novel understanding of SAN and FFN as Mask Attention Networks (MANs) and show that they are two special cases of MANs with static mask matrices. However, their static mask matrices limit the capability for localness modeling in text representation learning. We therefore introduce a new layer named Dynamic Mask Attention Network (DMAN) with a learnable mask matrix, which is able to model localness adaptively. To combine the advantages of DMAN, SAN, and FFN, we propose a sequential layered structure that stacks the three types of layers. Extensive experiments on various tasks, including neural machine translation and text summarization, demonstrate that our model outperforms the original Transformer.
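The unifying view in the abstract can be illustrated with a minimal sketch: masked attention where a mask matrix M modulates the (exponentiated) attention scores elementwise before normalization. An all-ones mask recovers ordinary self-attention (SAN), while an identity mask restricts each token to itself, the FFN-like special case; a learnable M between these extremes would correspond to DMAN. The function name and the exact placement of the mask are illustrative assumptions, not the paper's precise formulation.

```python
import numpy as np

def mask_attention(Q, K, V, M):
    # Scaled dot-product scores; the mask M is applied elementwise to the
    # exponentiated scores before row-normalization (a sketch of the MAN
    # view; the paper's exact formulation may differ in detail).
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    weights = M * np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V

n, d = 4, 8
rng = np.random.default_rng(0)
Q = rng.standard_normal((n, d))
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))

san_out = mask_attention(Q, K, V, np.ones((n, n)))  # SAN: static all-ones mask
ffn_out = mask_attention(Q, K, V, np.eye(n))        # FFN-like: static identity mask
# With M = I each token attends only to itself, so the output equals V.
assert np.allclose(ffn_out, V)
```

A DMAN layer would replace the static matrix M with one predicted from the input, letting the model choose adaptively how local each token's attention should be.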
Pages: 1692 - 1701
Page count: 10
Related Papers
50 records
  • [21] Rethinking vision and attention
    Cole, GG
    APPLIED COGNITIVE PSYCHOLOGY, 2004, 18 (06) : 781 - 783
  • [22] STAR++: Rethinking spatio-temporal cross attention transformer for video action recognition
    Ahn, Dasom
    Kim, Sangwon
    Ko, Byoung Chul
    APPLIED INTELLIGENCE, 2023, 53 (23) : 28446 - 28459
  • [23] Neighborhood Attention Transformer
    Hassani, Ali
    Walton, Steven
    Li, Jiachen
    Li, Shen
    Shi, Humphrey
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR, 2023, : 6185 - 6194
  • [24] Attention-Aware Social Graph Transformer Networks for Stochastic Trajectory Prediction
    Liu, Yao
    Li, Binghao
    Wang, Xianzhi
    Sammut, Claude
    Yao, Lina
    IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2024, 36 (11) : 5633 - 5646
  • [25] TOWARDS ROBUST VISUAL TRANSFORMER NETWORKS VIA K-SPARSE ATTENTION
    Amini, Sajjad
    Ghaemmaghami, Shahrokh
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 4053 - 4057
  • [26] Transformer-Based Attention Networks for Continuous Pixel-Wise Prediction
    Yang, Guanglei
    Tang, Hao
    Ding, Mingli
    Sebe, Nicu
    Ricci, Elisa
    2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, : 16249 - 16259
  • [27] Conversational Question Answering over Knowledge Graphs with Transformer and Graph Attention Networks
    Kacupaj, Endri
    Plepi, Joan
    Singh, Kuldeep
    Thakkar, Harsh
    Lehmann, Jens
    Maleshkova, Maria
    16TH CONFERENCE OF THE EUROPEAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (EACL 2021), 2021, : 850 - 862
  • [28] Exploring Transformer for Face Mask Detection
    Mao, Yonghua
    Lv, Yuhang
    Zhang, Guangxin
    Gui, Xiaolin
    IEEE ACCESS, 2024, 12 : 118377 - 118388
  • [29] ADVERSARIAL MASK TRANSFORMER FOR SEQUENTIAL LEARNING
    Lio, Hou
    Li, Shang-En
    Chien, Jen-Tzung
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 4178 - 4182
  • [30] k-means Mask Transformer
    Yu, Qihang
    Wang, Huiyu
    Qiao, Siyuan
    Collins, Maxwell
    Zhu, Yukun
    Adam, Hartwig
    Yuille, Alan
    Chen, Liang-Chieh
    COMPUTER VISION, ECCV 2022, PT XXIX, 2022, 13689 : 288 - 307