SUDMAD: Sequential and Unsupervised Decomposition of a Multi-Author Document Based on a Hidden Markov Model

Authors
Aldebei, Khaled [1, 2]
He, Xiangjian [1, 3]
Jia, Wenjing [1 ]
Yeh, Weichang [4 ]
Affiliations
[1] Univ Technol Sydney, Global Big Data Technol Ctr, Sydney, NSW, Australia
[2] Minjiang Univ, Fujian Prov Key Lab Informat Proc & Intelligent C, Fuzhou 350121, Fujian, Peoples R China
[3] Northwestern Polytech Univ, Sch Software & Microelect, Xian, Shaanxi, Peoples R China
[4] Natl Tsing Hua Univ, Dept Ind Engn & Engn Management, Hsinchu, Taiwan
DOI
10.1002/asi.23956
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Decomposing a document written by more than one author into sentences according to authorship is of great significance, given the increasing demand for plagiarism detection, forensic analysis, civil law (e.g., disputed copyright issues), and intelligence work involving disputed anonymous documents. Among existing studies on document decomposition, some are limited to specific languages, some depend on topics, and some are restricted to documents by two authors, and their accuracy leaves considerable room for improvement. In this paper, we consider the contextual correlation hidden among sentences and propose an algorithm for Sequential and Unsupervised Decomposition of a Multi-Author Document (SUDMAD) that applies to documents written in any language, regardless of topic, through the construction of a Hidden Markov Model (HMM) reflecting the authors' writing styles. To build and learn such a model, an unsupervised statistical approach is first proposed to estimate the initial values of the HMM parameters of a preliminary model; it requires no information about the authors or the document's context other than the number of authors who contributed to the document. To further improve performance, a boosted HMM learning procedure is then proposed, in which the initial classification results are used to create labeled training data for learning a more accurate HMM. Moreover, the contextual relationship among sentences is further used to refine the classification results. The proposed approach is empirically evaluated on three benchmark datasets that are widely used for authorship analysis of documents. Comparisons with recent state-of-the-art approaches are also presented to demonstrate the significance of the new ideas and the superior performance of the approach.
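To make the role of the HMM concrete, the following is a minimal sketch of how contextual correlation among sentences can change an authorship assignment. It is not the SUDMAD procedure itself: the abstract does not specify the decoding step, so standard Viterbi decoding is used here as a stand-in, with authors as hidden states and sentences as observations. The function name `viterbi`, the per-sentence likelihoods, and the transition probabilities are all hypothetical values chosen for illustration.

```python
import math


def viterbi(log_emis, log_trans, log_init):
    """Return the most likely author sequence under a simple HMM.

    log_emis[t][s]  : log-likelihood of sentence t under author s
    log_trans[r][s] : log-probability of switching from author r to author s
    log_init[s]     : log-probability that the first sentence is by author s
    """
    n_states = len(log_init)
    score = [log_init[s] + log_emis[0][s] for s in range(n_states)]
    back = []  # back[t-1][s] = best predecessor of state s at sentence t
    for t in range(1, len(log_emis)):
        prev, score, pointers = score, [], []
        for s in range(n_states):
            best = max(range(n_states), key=lambda r: prev[r] + log_trans[r][s])
            pointers.append(best)
            score.append(prev[best] + log_trans[best][s] + log_emis[t][s])
        back.append(pointers)
    # Trace the best path backwards from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


def logs(rows):
    """Convert nested lists of probabilities to log-probabilities."""
    return [[math.log(p) for p in row] for row in rows]


# Hypothetical toy data: six sentences, two authors (0 and 1).
emis = [[0.8, 0.2], [0.8, 0.2], [0.45, 0.55], [0.9, 0.1], [0.1, 0.9], [0.2, 0.8]]
trans = [[0.9, 0.1], [0.1, 0.9]]  # assumption: authors rarely switch mid-passage
init = [0.5, 0.5]

# Baseline that ignores context: pick the likelier author per sentence.
independent = [max(range(2), key=lambda s: row[s]) for row in emis]
# Sequential decoding that takes neighbouring sentences into account.
decoded = viterbi(logs(emis), logs(trans), [math.log(p) for p in init])

print(independent)
print(decoded)
```

In this toy input the third sentence is slightly more likely under author 1 when judged in isolation, but the sequential decoding assigns it to author 0 because its neighbours are clearly by author 0. This is the kind of effect the abstract refers to when it speaks of exploiting the contextual relationship among sentences.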
Pages: 201-214
Number of pages: 14