Faster model-based estimation of ancestry proportions

Bibliographic Details
Main Authors: Santander, Cindy G., Refoyo Martinez, Alba, Meisner, Jonas
Format: Article
Language: English
Published: Peer Community In 2024-12-01
Series: Peer Community Journal
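Subjects: Ancestry estimation, population structure, population genetics, evolutionary genetics, bioinformatics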
Online Access: https://peercommunityjournal.org/articles/10.24072/pcjournal.503/
collection DOAJ
description Ancestry estimation from genotype data in unrelated individuals has become an essential tool in population and medical genetics for understanding demographic population histories and for modeling or correcting for population structure. The ADMIXTURE software is a widely used model-based approach to account for population stratification; however, it struggles with convergence issues and does not scale to modern human datasets or to the large number of variants in whole-genome sequencing data. Likelihood-free approaches optimize a least-squares objective and have gained popularity in recent years due to their scalability. However, this scalability comes at the cost of accuracy in the ancestry estimates in more complex admixture scenarios. We present a new model-based approach, fastmixture, which adopts aspects of likelihood-free approaches for parameter initialization, followed by a mini-batch expectation-maximization procedure to model the standard likelihood. In a simulation study, we demonstrate that the model-based approaches of fastmixture and ADMIXTURE are significantly more accurate than recent and likelihood-free approaches. We further show that fastmixture runs approximately 30 times faster than ADMIXTURE on both simulated and empirical data from the 1000 Genomes Project, such that our model-based approach scales to much larger sample sizes than previously possible.
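To make the "standard likelihood" and the expectation-maximization idea in the description concrete, here is a minimal NumPy sketch of the textbook binomial admixture model with a full-batch EM update. It is not the fastmixture implementation: the function names (`log_likelihood`, `em_step`), the matrix names (`G` for genotypes coded 0/1/2, `Q` for ancestry proportions, `P` for ancestral allele frequencies), and the clipping constant `eps` are assumptions chosen for illustration. A mini-batch scheme would, presumably, apply the same kind of update to subsets of variants rather than to all of them at once.

```python
import numpy as np


def log_likelihood(G, Q, P, eps=1e-6):
    """Binomial admixture log-likelihood (up to an additive constant).

    G: (N, M) genotypes in {0, 1, 2}; Q: (N, K) ancestry proportions,
    rows summing to 1; P: (M, K) ancestral allele frequencies.
    """
    # Individual-specific allele frequencies h_ij = sum_k q_ik * p_jk.
    H = np.clip(Q @ P.T, eps, 1 - eps)
    return float(np.sum(G * np.log(H) + (2 - G) * np.log(1 - H)))


def em_step(G, Q, P, eps=1e-6):
    """One full-batch EM update of Q and P for the admixture model.

    Illustrative sketch only; Q is assumed strictly positive on input.
    """
    H = np.clip(Q @ P.T, eps, 1 - eps)
    A = G / H              # weight of observed alternate alleles
    B = (2 - G) / (1 - H)  # weight of observed reference alleles

    # Expected allele counts attributed to each ancestral source.
    alt = P * (A.T @ Q)        # (M, K)
    ref = (1 - P) * (B.T @ Q)  # (M, K)
    P_new = np.clip(alt / (alt + ref), eps, 1 - eps)

    # Expected fraction of each individual's 2M alleles per source.
    Q_new = Q * (A @ P + B @ (1 - P)) / (2 * G.shape[1])
    Q_new /= Q_new.sum(axis=1, keepdims=True)  # guard against rounding drift
    return Q_new, P_new
```

Starting from strictly positive `Q` and from `P` inside (0, 1), repeated calls to `em_step` should not decrease `log_likelihood` beyond small numerical effects from clipping, which is the usual sanity check for an EM implementation. How the initial `Q` and `P` are obtained (the description mentions a likelihood-free initialization) is outside the scope of this sketch.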
id doaj-art-567b4c8927af4e54a17905fd6b26dcdf
institution Kabale University
issn 2804-3871
spelling doaj-art-567b4c8927af4e54a17905fd6b26dcdf | 2025-02-07T10:17:17Z | eng | Peer Community In | Peer Community Journal | 2804-3871 | 2024-12-01 | 4 | 10.24072/pcjournal.503
Santander, Cindy G. (https://orcid.org/0000-0003-3021-6809): Department of Biology, University of Copenhagen, Denmark
Refoyo Martinez, Alba (https://orcid.org/0000-0002-3674-4007): Center for Health Data Science, University of Copenhagen, Denmark
Meisner, Jonas (https://orcid.org/0000-0002-9540-6673): Mental Health Centre Copenhagen, Copenhagen University Hospital, Denmark; Novo Nordisk Foundation Center for Basic Metabolic Research, University of Copenhagen, Denmark
topic Ancestry estimation, population structure, population genetics, evolutionary genetics, bioinformatics
url https://peercommunityjournal.org/articles/10.24072/pcjournal.503/