Byte Pair Encoding for Symbolic Music

Authors
Fradet, Nathan [1, 2]
Gutowski, Nicolas [3]
Chhel, Fabien [3, 4]
Briot, Jean-Pierre [1]
Affiliations
[1] Sorbonne Univ, CNRS, LIP6, F-75005 Paris, France
[2] Aubay, Boulogne, France
[3] Univ Angers, LERIA, F-49000 Angers, France
[4] ESEO, ERIS, F-49100 Angers, France
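Source
2023 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING, EMNLP 2023 | 2023
Keywords
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification code
081104; 0812; 0835; 1405;
Abstract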
When used with deep learning, the symbolic music modality is often coupled with language model architectures. To do so, the music needs to be tokenized, i.e. converted into a sequence of discrete tokens. This can be achieved by different approaches, as music can be composed of simultaneous tracks and of simultaneous notes with several attributes. Until now, the proposed tokenizations have relied on small vocabularies of tokens describing the note attributes and time events, resulting in fairly long token sequences and a sub-optimal use of the embedding space of language models. Recent research has focused on reducing the overall sequence length by merging embeddings or combining tokens. In this paper, we show that Byte Pair Encoding (BPE), a compression technique widely used for natural language, significantly decreases the sequence length while increasing the vocabulary size. By doing so, we leverage the embedding capabilities of such models with more expressive tokens, resulting in both better results and faster inference in generation and classification tasks. The source code is shared on GitHub, along with a companion website. Finally, BPE is directly implemented in MidiTok, allowing the reader to easily benefit from this method.
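The trade-off described in the abstract (shorter sequences in exchange for a larger vocabulary) can be illustrated with a minimal plain-Python sketch of BPE over note-attribute tokens. This is not the authors' code and does not use MidiTok; the function names `learn_bpe` and `apply_merge`, the `|` separator, and the token names such as `Pitch_60` are assumptions made only for illustration.

```python
from collections import Counter


def apply_merge(seq, pair):
    """Replace each left-to-right occurrence of `pair` in `seq` by one merged token."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            # The merged token is named by joining its two parts (illustrative choice).
            out.append(pair[0] + "|" + pair[1])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def learn_bpe(sequences, num_merges):
    """Learn up to `num_merges` merges; return the merge list and the encoded sequences."""
    seqs = [list(s) for s in sequences]
    merges = []
    for _ in range(num_merges):
        # Count every adjacent token pair across the corpus.
        pair_counts = Counter()
        for seq in seqs:
            pair_counts.update(zip(seq, seq[1:]))
        if not pair_counts:
            break
        best, count = pair_counts.most_common(1)[0]
        if count < 2:
            break  # no pair recurs, so merging would not shorten anything further
        merges.append(best)
        seqs = [apply_merge(seq, best) for seq in seqs]
    return merges, seqs


# Hypothetical token stream: three notes, each as pitch, velocity, duration.
notes = [
    "Pitch_60", "Vel_100", "Dur_1",
    "Pitch_64", "Vel_100", "Dur_1",
    "Pitch_60", "Vel_100", "Dur_1",
]
merges, encoded = learn_bpe([notes], num_merges=10)
base_vocab = set(notes)
print(len(notes), "->", len(encoded[0]), "tokens")
print(len(base_vocab), "->", len(base_vocab) + len(merges), "vocabulary entries")
```

In this toy input the sequence shrinks while the vocabulary grows by one entry per learned merge, which is the effect the abstract attributes to BPE. A real setup would learn merges over a large corpus and a much larger target vocabulary.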
Pages: 2001-2020
Page count: 20