Using simulated microhaplotype genotyping data to evaluate the value of machine learning algorithms for inferring DNA mixture contributor numbers

Cited by: 2
Authors
Wang, Haoyu [1]
Zhu, Qiang [1]
Huang, Yuguo [1]
Cao, Yueyan [1]
Hu, Yuhan [1]
Wei, Yifan [1]
Wang, Yuting [1]
Hou, Tingyun [1]
Shan, Tiantian [1]
Dai, Xuan [1]
Zhang, Xiaokang [1]
Wang, Yufang [1]
Zhang, Ji [1]
Affiliation
[1] Sichuan Univ, West China Sch Basic Med Sci & Forens Med, Chengdu, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Microhaplotypes; Complex DNA mixtures; Inference of the number of contributors; Machine learning; STR; UNCERTAINTY; PROFILES; LOCI
DOI
10.1016/j.fsigen.2024.103008
Chinese Library Classification number
Q3 [Genetics]
Discipline classification codes
071007; 090102
Abstract
Inferring the number of contributors (NoC) is a crucial step in interpreting DNA mixtures, as it directly affects the accuracy of the likelihood ratio calculation and the assessment of evidence strength. However, obtaining the correct NoC in complex DNA mixtures remains challenging because of the high degree of allele sharing and dropout. This study aimed to analyze the impact of allele sharing and dropout on NoC inference in complex DNA mixtures when using microhaplotypes (MH). The effectiveness and value of highly polymorphic MH for NoC inference in complex DNA mixtures were evaluated by comparing the performance of three NoC inference methods: the maximum allele count (MAC) method, the maximum likelihood estimation (MLE) method, and the random forest classification (RFC) algorithm. We selected the 100 most polymorphic MH from the Southern Han Chinese (CHS) population and simulated over 40 million complex DNA mixture profiles with the NoC ranging from 2 to 8. These profiles involve unrelated individuals (RM type) and related pairs of individuals, including parent-offspring pairs (PO type), full-sibling pairs (FS type), and second-degree kinship pairs (SE type). Our results showed how the number of detected alleles in DNA mixture profiles varied with marker polymorphism, the involvement of kinship, NoC, and dropout settings. Across the different types of DNA mixtures, the MAC and MLE methods performed best in the RM type, followed by the SE, FS, and PO types, while the RFC models performed best in the PO type, followed by the RM, SE, and FS types. The recall of all three methods for NoC inference decreased as the NoC and dropout levels increased. Furthermore, the MLE method performed better at low NoC, whereas the RFC models excelled at high NoC and/or high dropout levels, regardless of whether a priori information about related pairs of individuals in the DNA mixtures was available. However, the RFC models that took this a priori information into account and were trained specifically on each type of DNA mixture profile outperformed the RFC_ALL model, which did not consider such information. Finally, we provide recommendations for model building when applying machine learning algorithms to NoC inference.
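As an illustration of the simplest of the three methods named in the abstract, the following is a minimal sketch of the maximum allele count (MAC) idea. It assumes diploid contributors, so each person contributes at most two distinct alleles per locus and the locus with the most detected alleles sets a lower bound on the NoC. The function name `mac_noc`, the dictionary layout, and the locus and allele labels are hypothetical and are not taken from the paper, whose implementation may differ.

```python
from math import ceil


def mac_noc(profile):
    """Lower bound on the number of contributors from the maximum allele count.

    `profile` maps a locus name to the alleles detected at that locus
    (hypothetical layout, chosen for illustration only).
    """
    # Count distinct alleles per locus and take the largest count.
    max_alleles = max((len(set(alleles)) for alleles in profile.values()), default=0)
    # Each diploid contributor carries at most two distinct alleles per locus.
    return ceil(max_alleles / 2)


# Hypothetical mixture profile with three microhaplotype loci.
profile = {
    "MH01": ["A", "B", "C"],
    "MH02": ["A", "B", "C", "D", "E"],
    "MH03": ["A"],
}
print(mac_noc(profile))  # 5 distinct alleles at MH02 -> at least 3 contributors
```

Because shared alleles and dropout both reduce the number of detected alleles, this bound can only underestimate the true NoC, which is consistent with the abstract's observation that recall falls as NoC and dropout increase.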
Pages: 13