Creation and interpretation of machine learning models for aqueous solubility prediction

Aim: Solubility prediction is essential in rational drug design, and many models have been developed with machine learning (ML) methods to improve predictive ability. However, most ML models are hard to interpret, which limits the insights they can give in the lead optimization pr...

Bibliographic Details
Main Authors: Minyi Su, Enric Herrero
Format: Article
Language: English
Published: Open Exploration 2023-10-01
Series: Exploration of Drug Science
Subjects: aqueous solubility; machine learning; fragment-coloring; property prediction
Online Access: https://www.explorationpub.com/uploads/Article/A100826/100826.pdf
author Minyi Su
Enric Herrero
collection DOAJ
description Aim: Solubility prediction is essential in rational drug design, and many models have been developed with machine learning (ML) methods to improve predictive ability. However, most ML models are hard to interpret, which limits the insights they can give in the lead optimization process. Here, an approach to construct and interpret solubility models with a combination of physicochemical properties and ML algorithms is presented. Methods: The models were trained, optimized, and tested on a dataset containing 12,983 compounds from two public datasets and further evaluated on two external test sets. More importantly, SHapley Additive exPlanations (SHAP) and heat map coloring approaches were used to explain the predictive models and assess their suitability to guide compound optimization. Results: Among the different ML methods, random forest (RF) models obtained the best performance across the different test sets. From the interpretability perspective, fragment-based coloring offers a more robust interpretation than atom-based coloring, and normalizing the values further improves it. Conclusions: Overall, for certain applications, simple ML algorithms such as RF work well and can outperform more complex methods, and combining them with fragment-coloring can offer guidance for chemists to modify a structure toward a desired property. This interpretation strategy is publicly available at https://github.com/Pharmacelera/predictive-model-coloring and could be further applied to other property predictions to improve the interpretability of ML models.
format Article
id doaj-art-eeeb6bbebd384a46a00b23cda7539746
institution Kabale University
issn 2836-7677
language English
publishDate 2023-10-01
publisher Open Exploration
record_format Article
series Exploration of Drug Science
spelling doaj-art-eeeb6bbebd384a46a00b23cda7539746; 2025-02-08T03:49:05Z; eng; Open Exploration; Exploration of Drug Science; ISSN 2836-7677; 2023-10-01; vol. 1, no. 5, pp. 388-404; DOI 10.37349/eds.2023.00026; Creation and interpretation of machine learning models for aqueous solubility prediction; Minyi Su (https://orcid.org/0000-0001-5830-059X), Pharmacelera, 08028 Barcelona, Spain; Enric Herrero (https://orcid.org/0000-0001-7837-3593), Pharmacelera, 08028 Barcelona, Spain; https://www.explorationpub.com/uploads/Article/A100826/100826.pdf; aqueous solubility; machine learning; fragment-coloring; property prediction
title Creation and interpretation of machine learning models for aqueous solubility prediction
topic aqueous solubility
machine learning
fragment-coloring
property prediction
url https://www.explorationpub.com/uploads/Article/A100826/100826.pdf
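
The description field contrasts atom-based with fragment-based coloring and notes that normalizing the values helps. The abstract does not say how either step is computed, so the following is only a minimal sketch under stated assumptions: per-atom attribution values are taken as given (the numbers are made up), fragments are supplied as lists of atom indices, fragment values are plain sums, and normalization divides by the largest absolute value. The function names `fragment_scores` and `normalize` are hypothetical and are not taken from the authors' repository.

```python
from typing import Dict, List


def fragment_scores(atom_scores: List[float],
                    fragments: Dict[str, List[int]]) -> Dict[str, float]:
    """Sum per-atom attributions over the atoms of each fragment.

    Assumption: a fragment's contribution is the plain sum of its atoms.
    """
    return {name: sum(atom_scores[i] for i in atoms)
            for name, atoms in fragments.items()}


def normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Scale scores into [-1, 1] by the largest absolute value.

    Assumption: max-abs scaling, which keeps the sign and leaves zero at zero,
    so a diverging color map stays centered.
    """
    peak = max((abs(v) for v in scores.values()), default=0.0)
    if peak == 0.0:
        return {name: 0.0 for name in scores}
    return {name: value / peak for name, value in scores.items()}


# Hypothetical per-atom attributions for a six-atom molecule
# (positive = raises predicted solubility, negative = lowers it).
atom_scores = [0.25, -0.125, 0.5, -1.0, -0.25, 0.125]

# Hypothetical fragment assignment: fragment name -> atom indices.
fragments = {"hydroxyl": [0, 1, 2], "phenyl": [3, 4], "methyl": [5]}

raw = fragment_scores(atom_scores, fragments)
colors = normalize(raw)

for name in fragments:
    print(f"{name}: raw={raw[name]:+.3f} normalized={colors[name]:+.3f}")
```

In this toy case the atoms inside the hydroxyl fragment carry mixed signs, while the fragment as a whole receives a single positive value, which illustrates why a fragment-level view can be easier to read than an atom-level one. The normalized values could then be passed to any diverging color map.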