Industry Specific Word Embedding and its Application in Log Classification

Cited by: 10
Authors
Khabiri, Elham [1 ]
Gifford, Wesley M. [1 ]
Vinzamuri, Bhanukiran [1 ]
Patel, Dhaval [1 ]
Mazzoleni, Pietro [2 ]
Affiliations
[1] IBM Res, Yorktown Hts, NY 10598 USA
[2] IBM Corp, Armonk, NY USA
Keywords
natural language processing; word embeddings; text classification
DOI
10.1145/3357384.3357827
CLC number
TP301 [Theory and Methods]
Discipline code
081202
Abstract
Word, sentence, and document embeddings have become the cornerstone of most natural language processing solutions. Training an effective embedding depends on a large corpus of relevant documents. However, such a corpus is not always available, especially for specialized heavy industries such as oil, mining, or steel. To address this problem, this paper proposes a semi-supervised learning framework that creates a document corpus and embedding starting from an industry taxonomy, along with a very limited set of relevant positive and negative documents. Our solution organizes candidate documents into a graph and adopts different explore-and-exploit strategies to iteratively create the corpus and its embedding. At each iteration, two metrics, called Coverage and Context Similarity, are used as proxies to measure the quality of the results. Our experiments demonstrate that an embedding created by our solution is more effective than one created by processing thousands of industry-specific document pages. We also explore using our embedding in downstream tasks, such as building an industry-specific classification model given labeled training data, as well as classifying unlabeled documents according to industry taxonomy terms.
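The abstract describes an iterative loop that grows a corpus from a few seed documents, exploiting similarity to the current corpus and tracking a Coverage metric against the taxonomy. The following is only a minimal illustrative sketch of that loop, not the authors' implementation: the bag-of-words cosine "embedding", the centroid-based exploit step, and all function names (`build_corpus`, `coverage`, etc.) are assumptions introduced here for clarity.

```python
# Hypothetical sketch of an explore/exploit corpus-building loop in the
# spirit of the abstract. A toy bag-of-words vector stands in for a real
# document embedding.
from collections import Counter
import math

def vectorize(doc):
    """Toy stand-in for a document embedding: a bag-of-words Counter."""
    return Counter(doc.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse bag-of-words vectors."""
    dot = sum(u[t] * v[t] for t in u)
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def coverage(corpus_vecs, taxonomy_terms):
    """Fraction of taxonomy terms that appear somewhere in the corpus."""
    seen = set()
    for v in corpus_vecs:
        seen.update(t for t in taxonomy_terms if t in v)
    return len(seen) / len(taxonomy_terms)

def build_corpus(seed_docs, candidates, taxonomy_terms, rounds=3, exploit_k=1):
    """Grow the corpus iteratively: each round, 'exploit' by adding the
    candidate most similar to the centroid of the corpus so far."""
    corpus = list(seed_docs)
    candidates = list(candidates)
    for _ in range(rounds):
        if not candidates:
            break
        centroid = Counter()
        for d in corpus:
            centroid.update(vectorize(d))
        # exploit: pick the candidate(s) closest to the current corpus
        candidates.sort(key=lambda d: cosine(vectorize(d), centroid),
                        reverse=True)
        corpus.extend(candidates[:exploit_k])
        candidates = candidates[exploit_k:]
    vecs = [vectorize(d) for d in corpus]
    return corpus, coverage(vecs, taxonomy_terms)
```

For example, seeding with one steel-industry sentence and two candidates pulls the on-topic candidate into the corpus first, and Coverage reports how much of the taxonomy the resulting corpus spans. A real explore step (occasionally adding dissimilar documents to widen coverage) would sit alongside the exploit step; it is omitted here for brevity.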
Pages: 2713-2721 (9 pages)
Related Papers (50 total)
  • [1] Learning Sentiment-Specific Word Embedding for Twitter Sentiment Classification
    Tang, Duyu
    Wei, Furu
    Yang, Nan
    Zhou, Ming
    Liu, Ting
    Qin, Bing
    PROCEEDINGS OF THE 52ND ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, VOL 1, 2014, : 1555 - 1565
  • [2] Dual-Clustering Maximum Entropy with Application to Classification and Word Embedding
    Wang, Xiaolong
    Wang, Jingjing
    Zhai, Chengxiang
    THIRTY-FIRST AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2017, : 3323 - 3329
  • [3] Malware Classification with Word Embedding Features
    Kale, Aparna Sunil
    Di Troia, Fabio
    Stamp, Mark
    ICISSP: PROCEEDINGS OF THE 7TH INTERNATIONAL CONFERENCE ON INFORMATION SYSTEMS SECURITY AND PRIVACY, 2021, : 733 - 742
  • [4] Classification of Taxonomical Relationship by Word Embedding
    Omine, Kazuki
    Paik, Incheon
    2018 IEEE INTERNATIONAL CONFERENCE ON COGNITIVE COMPUTING (ICCC), 2018, : 122 - 125
  • [5] Improving Text Classification with Word Embedding
    Ge, Lihao
    Moh, Teng-Sheng
    2017 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA), 2017, : 1796 - 1805
  • [6] Study on the Chinese Word Semantic Relation Classification with Word Embedding
    Shijia, E.
    Jia, Shengbin
    Xiang, Yang
    NATURAL LANGUAGE PROCESSING AND CHINESE COMPUTING, NLPCC 2017, 2018, 10619 : 849 - 855
  • [7] Document Sentiment Classification based on the Word Embedding
    Yin, Yanping
    Jin, Zhong
    PROCEEDINGS OF THE 4TH INTERNATIONAL CONFERENCE ON MECHATRONICS, MATERIALS, CHEMISTRY AND COMPUTER ENGINEERING 2015 (ICMMCCE 2015), 2015, 39 : 456 - 461
  • [8] Automated Patent Classification Using Word Embedding
    Grawe, Mattyws F.
    Martins, Claudia A.
    Bonfante, Andreia G.
    2017 16TH IEEE INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND APPLICATIONS (ICMLA), 2017, : 408 - 411
  • [9] Citation Intent Classification Using Word Embedding
    Roman, Muhammad
    Shahid, Abdul
    Khan, Shafiullah
    Koubaa, Anis
    Yu, Lisu
    IEEE ACCESS, 2021, 9 : 9982 - 9995
  • [10] Topic Classification Based on Improved Word Embedding
    Sheng, Liangliang
    Xu, Lizhen
    2017 14TH WEB INFORMATION SYSTEMS AND APPLICATIONS CONFERENCE (WISA 2017), 2017, : 117 - 121