Using simulated microhaplotype genotyping data to evaluate the value of machine learning algorithms for inferring DNA mixture contributor numbers

Cited by: 2

Authors:

Wang, Haoyu [1]

Zhu, Qiang [1]

Huang, Yuguo [1]

Cao, Yueyan [1]

Hu, Yuhan [1]

Wei, Yifan [1]

Wang, Yuting [1]

Hou, Tingyun [1]

Shan, Tiantian [1]

Dai, Xuan [1]

Zhang, Xiaokang [1]

Wang, Yufang [1]

Zhang, Ji [1]

Affiliation:

[1] Sichuan Univ, West China Sch Basic Med Sci & Forens Med, Chengdu, Peoples R China

Source:

FORENSIC SCIENCE INTERNATIONAL-GENETICS | 2024 / Vol. 69

Funding:

National Natural Science Foundation of China;

Keywords:

Microhaplotypes; Complex DNA mixtures; Inference of the number of contributors; Machine learning; STR; UNCERTAINTY; PROFILES; LOCI;

DOI:

10.1016/j.fsigen.2024.103008

Chinese Library Classification (CLC) number:

Q3 [Genetics];

Subject classification codes:

071007 ; 090102 ;

Abstract:

Inferring the number of contributors (NoC) is a crucial step in interpreting DNA mixtures, as it directly affects the accuracy of the likelihood ratio calculation and the assessment of evidence strength. However, obtaining the correct NoC in complex DNA mixtures remains challenging due to the high degree of allele sharing and dropout. This study aimed to analyze the impact of allele sharing and dropout on NoC inference in complex DNA mixtures when using microhaplotypes (MH). The effectiveness and value of highly polymorphic MH for NoC inference in complex DNA mixtures were evaluated by comparing the performance of three NoC inference methods: the maximum allele count (MAC) method, the maximum likelihood estimation (MLE) method, and the random forest classification (RFC) algorithm. We selected the 100 most polymorphic MH from the Southern Han Chinese (CHS) population and simulated over 40 million complex DNA mixture profiles with the NoC ranging from 2 to 8. These profiles involved unrelated individuals (RM type) and related pairs of individuals, including parent-offspring pairs (PO type), full-sibling pairs (FS type), and second-degree kinship pairs (SE type). Our results showed how the number of detected alleles in DNA mixture profiles varied with marker polymorphism, the involvement of kinship, NoC, and dropout settings. Across the different types of DNA mixtures, the MAC and MLE methods performed best in the RM type, followed by the SE, FS, and PO types, while the RFC models performed best in the PO type, followed by the RM, SE, and FS types. The recall of all three methods for NoC inference decreased as the NoC and dropout levels increased. Furthermore, the MLE method performed better at low NoC, whereas the RFC models excelled at high NoC and/or high dropout levels, regardless of the availability of a priori information about related pairs of individuals in DNA mixtures. However, the RFC models that considered this a priori information and were trained specifically on each type of DNA mixture profile outperformed the RFC_ALL model, which did not consider such information. Finally, we provided recommendations for model building when applying machine learning algorithms to NoC inference.
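
For orientation, the MAC method named in the abstract can be illustrated with a minimal sketch. It assumes the commonly used rule that each diploid contributor carries at most two alleles per locus, so the locus with the most detected alleles gives a lower bound on the NoC. The function name `mac_noc`, the locus names, and the allele labels are hypothetical and are not taken from the study.

```python
from math import ceil


def mac_noc(profile):
    """Estimate a minimum NoC with the maximum allele count (MAC) rule.

    profile: dict mapping a locus name to the alleles detected at that locus.
    Assumption: each contributor is diploid, i.e. at most 2 alleles per locus.
    """
    # Count distinct alleles per locus and take the largest count.
    max_alleles = max((len(set(alleles)) for alleles in profile.values()), default=0)
    # Every contributor explains at most two alleles, so round up.
    return ceil(max_alleles / 2)


# Hypothetical mixture profile with three microhaplotype loci.
profile = {
    "MH01": ["A1", "A2", "A3"],
    "MH02": ["B1", "B2", "B3", "B4", "B5"],
    "MH03": ["C1", "C2"],
}
print(mac_noc(profile))  # 5 alleles at MH02 -> at least 3 contributors
```

Because this rule yields only a lower bound, allele sharing among contributors and allele dropout can both cause it to underestimate the true NoC, which is the difficulty the abstract describes.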


Pages: 13
