Development of an Efficient and Generalized MTSCAM Model to Predict Liquid Chromatography Retention Times of Organic Compounds

Accurate prediction of liquid chromatographic retention times is becoming increasingly important in nontargeted screening applications. Traditional retention time approaches heavily rely on the use of standard compounds, which is limited by the speed of synthesis and manufacture of standard products...

Bibliographic Details
Main Authors: Mengdie Fan, Chenhui Sang, Hua Li, Yue Wei, Bin Zhang, Yang Xing, Jing Zhang, Jie Yin, Wei An, Bing Shao
Format: Article
Language: English
Published: American Association for the Advancement of Science (AAAS) 2025-01-01
Series: Research
Online Access: https://spj.science.org/doi/10.34133/research.0607
collection DOAJ
description Accurate prediction of liquid chromatographic retention times is becoming increasingly important in nontargeted screening applications. Traditional approaches to determining retention times rely heavily on standard compounds; they are limited by the speed at which standards can be synthesized and manufactured, and they are time-consuming and labor-intensive. Recently, machine learning and artificial intelligence algorithms have been applied to retention time prediction and show unparalleled advantages over traditional experimental methods. However, existing retention time prediction methods usually suffer from a scarcity of comprehensive training datasets, sparsity of valid data, and a lack of classification within datasets, resulting in poor generalization capability and accuracy. In this study, a dataset of 10,905 compounds and their retention times was constructed. Next, an innovative classification system was implemented that sorts the 10,905 compounds into a 3-tier hierarchy of 141 classes based on functional group weighting. Then, data augmentation was performed within each category using simplified molecular input line entry system (SMILES) enumeration combined with structural similarity expansion. Finally, a novel and universal high-throughput retention time prediction model was established by training the optimal quantitative structure–retention relationship (QSRR) model for each category of compounds and, at prediction time, selecting the best-fitting model via discriminant analysis. The results demonstrate that this model achieves an R² of 0.98 and an average prediction error of 23 s, outperforming currently published models. This study provides a scientific basis for high-throughput, rapid prediction of unknown pollutants, data mining, and nontargeted screening.
id doaj-art-856bb3313a3144cea7b1596d16bda604
institution Kabale University
issn 2639-5274
spelling Record ID: doaj-art-856bb3313a3144cea7b1596d16bda604
Record timestamp: 2025-02-07T08:00:37Z
Language: eng
Publisher: American Association for the Advancement of Science (AAAS)
Journal: Research (ISSN 2639-5274)
Published: 2025-01-01
Volume: 8
DOI: 10.34133/research.0607
Title: Development of an Efficient and Generalized MTSCAM Model to Predict Liquid Chromatography Retention Times of Organic Compounds
Author affiliations:
Mengdie Fan, Bing Shao: National Key Laboratory of Veterinary Public Health Security, College of Veterinary Medicine, China Agricultural University, Beijing Key Laboratory of Detection Technology for Animal-Derived Food Safety, and Beijing Laboratory for Food Quality and Safety, Beijing 100193, China.
Chenhui Sang, Yang Xing, Jing Zhang, Jie Yin: Beijing Key Laboratory of Diagnostic and Traceability Technologies for Food Poisoning, Beijing Center for Disease Prevention and Control, Beijing 100013, China.
Hua Li, Yue Wei, Bin Zhang, Wei An: National Engineering Research Center of Industrial Wastewater Detoxication and Resource Recovery, Research Center for Eco-Environmental Sciences, Chinese Academy of Sciences, Beijing 100085, China.
URL: https://spj.science.org/doi/10.34133/research.0607
title Development of an Efficient and Generalized MTSCAM Model to Predict Liquid Chromatography Retention Times of Organic Compounds