On the Automatic Construction of an Arabic Thesaurus

Times cited: 0
Authors
Mohsen, Ghassan [1 ]
Al-Ayyoub, Mahmoud [1 ]
Hmeidi, Ismail [1 ]
Al-Aiad, Ahmad [1 ]
Affiliations
[1] Jordan Univ Sci & Technol, Irbid, Jordan
Keywords
Automatic Thesaurus Construction; Modern Standard Arabic; Term Frequency-Inverse Document Frequency (TF-IDF); Pointwise Mutual Information (PMI); Latent Semantic Analysis (LSA); Cosine Similarity; Jaccard Similarity; Dice Similarity
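DOI
Not available
Chinese Library Classification (CLC) number
TM [Electrical Engineering]; TN [Electronic Technology, Communication Technology];
Discipline classification code
0808 ; 0809 ;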
Abstract
Despite its accuracy, the traditional approach of manually constructing a thesaurus is a complex task with many challenges. Constructing the thesaurus automatically, on the other hand, has been found useful in avoiding a number of drawbacks of the manual approach. Automating thesaurus construction can save time, effort and cost, and it makes the constructed thesaurus easier to maintain and expand. Several approaches have been proposed to build thesauri in many languages (mainly English). To the best of our knowledge, there have been very limited efforts towards automatically building a high-quality, large-scale thesaurus for the Arabic language. To fill this gap, the paper aims to automatically build an Arabic thesaurus and to compare various methods for this task. To this end, a dataset of 14,148 Arabic documents is collected on different topics such as Arts and Politics. The dataset is analyzed to assign a weight to each term using three different weighting approaches: Term Frequency-Inverse Document Frequency (TF-IDF), Pointwise Mutual Information (PMI) and Latent Semantic Analysis (LSA). Then, three different similarity measures (Cosine, Jaccard and Dice) are used to compute term-term similarity. We test the constructed thesauri on 20 queries to evaluate their accuracy and determine which combination performs best. Recall and precision are the main accuracy measures used to evaluate the retrieval process. The experimental results demonstrate the superiority of the TF-IDF approach over the PMI and LSA approaches.
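The pipeline the abstract describes (weight each term per document, then compare term vectors) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the record does not give their formulas, so the TF-IDF form `tf * log(N / df)`, the real-valued (extended) forms of Jaccard and Dice, the toy English tokens standing in for preprocessed Arabic terms, and all function names are assumptions made for illustration only.

```python
import math
from collections import Counter


def tfidf_term_vectors(docs):
    """Map each term to its per-document weights, tf * log(N / df) (assumed variant)."""
    n_docs = len(docs)
    counts = [Counter(doc) for doc in docs]
    # Document frequency: number of documents in which each term occurs.
    df = Counter(term for c in counts for term in c)
    return {
        term: [c[term] * math.log(n_docs / df[term]) for c in counts]
        for term in df
    }


def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    norm = math.sqrt(_dot(u, u)) * math.sqrt(_dot(v, v))
    return _dot(u, v) / norm if norm else 0.0


def jaccard(u, v):
    # Extended Jaccard (Tanimoto) coefficient for real-valued vectors.
    denom = _dot(u, u) + _dot(v, v) - _dot(u, v)
    return _dot(u, v) / denom if denom else 0.0


def dice(u, v):
    # Extended Dice coefficient for real-valued vectors.
    denom = _dot(u, u) + _dot(v, v)
    return 2 * _dot(u, v) / denom if denom else 0.0


def related_terms(term, vectors, sim, k=3):
    """Return the k terms most similar to `term`: one thesaurus entry."""
    scores = [
        (other, sim(vectors[term], vec))
        for other, vec in vectors.items()
        if other != term
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]


def precision_recall(retrieved, relevant):
    """Precision and recall of a retrieved set against a relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Toy corpus: each document is a list of already-tokenized terms (hypothetical data).
docs = [
    ["film", "actor", "film", "director"],
    ["film", "actor", "award"],
    ["election", "vote", "parliament"],
    ["election", "parliament", "minister"],
]
vectors = tfidf_term_vectors(docs)

for name, sim in [("cosine", cosine), ("jaccard", jaccard), ("dice", dice)]:
    print(name, related_terms("film", vectors, sim, k=2))
```

In this toy corpus, terms that co-occur in the same documents ("film" and "actor") score high under all three measures, while terms from unrelated documents ("film" and "election") score zero. A PMI or LSA weighting would replace only `tfidf_term_vectors`; the similarity functions and the lookup in `related_terms` stay the same.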
Pages: 243-247
Number of pages: 5
Related papers
50 in total
  • [1] ON AUTOMATIC THESAURUS CONSTRUCTION
    IVANOVA, NS
    NAUCHNO-TEKHNICHESKAYA INFORMATSIYA SERIYA 2-INFORMATSIONNYE PROTSESSY I SISTEMY, 1969, (06): : 17 - &
  • [2] AUTOMATIC THESAURUS CONSTRUCTION AND RELATION OF A THESAURUS TO INDEXING TERMS
    SPARCK-JONES, K
    ASLIB PROCEEDINGS, 1970, 22 (05): : 226 - +
  • [3] PLSI utilization for automatic thesaurus construction
    Hagiwara, M
    Ogawa, Y
    Toyama, K
    NATURAL LANGUAGE PROCESSING - IJCNLP 2005, PROCEEDINGS, 2005, 3651 : 334 - 345
  • [4] Automatic Thesaurus Construction for Modern Hebrew
    Liebeskind, Chaya
    Dagan, Ido
    Schler, Jonathan
    PROCEEDINGS OF THE ELEVENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION (LREC 2018), 2018, : 1446 - 1451
  • [5] Automatic thesaurus construction using Bayesian networks
    Park, YC
    Choi, KS
    INFORMATION PROCESSING & MANAGEMENT, 1996, 32 (05) : 543 - 553
  • [6] AUTOMATIC THESAURUS CONSTRUCTION BASED ON TERM CENTROIDS
    CRAWFORD, RG
    CANADIAN JOURNAL OF INFORMATION SCIENCE-REVUE CANADIENNE DES SCIENCES DE L INFORMATION, 1979, 4 (MAY): : 124 - 136
  • [7] Use of automatic keyphrase generation for creation of a construction thesaurus
    Kosovac, B
    Vanier, DJ
    DURABILITY OF BUILDING MATERIALS AND COMPONENTS 8, VOLS 1-4, PROCEEDINGS, 1999, : 2507 - 2516
  • [8] IMPROVED STATISTICAL METHODS FOR AUTOMATIC CONSTRUCTION OF A MEDICAL THESAURUS
    WOLFFTER.M
    ROUAULT, B
    RIMBERT, D
    METHODS OF INFORMATION IN MEDICINE, 1972, 11 (02) : 104 - &
  • [9] AUTOMATIC CONSTRUCTION OF A THESAURUS FOR LEGAL DATABASES ON CD ROM
    CLEMENTI, F
    DAMIANI, E
    DANTONA, O
    ORTOLANI, B
    ELETTROTECNICA, 1992, 79 (7-8): : 689 - 699
  • [10] AUTOMATIC THESAURUS CONSTRUCTION BY MACHINE LEARNING FROM RETRIEVAL SESSIONS
    GUNTZER, U
    JUTTNER, G
    SEEGMULLER, G
    SARRE, F
    INFORMATION PROCESSING & MANAGEMENT, 1989, 25 (03) : 265 - 273