Improving drug repositioning with negative data labeling using large language models

Abstract Introduction Drug repositioning offers numerous advantages, such as faster development timelines, reduced costs, and lower failure rates in drug development. Supervised machine learning is commonly used to score drug candidates but is hindered by the lack of reliable negative data—drugs tha...

Bibliographic Details
Main Authors: Milan Picard, Mickael Leclercq, Antoine Bodein, Marie Pier Scott-Boyer, Olivier Perin, Arnaud Droit
Format: Article
Language: English
Published: BMC 2025-02-01
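Series: Journal of Cheminformatics
Subjects: AI-driven drug discovery; Computational drug scoring; Negative data labeling; Drug repurposing; Castration resistant prostate cancer; Biomedical text mining
Online Access: https://doi.org/10.1186/s13321-025-00962-0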
author Milan Picard
Mickael Leclercq
Antoine Bodein
Marie Pier Scott-Boyer
Olivier Perin
Arnaud Droit
collection DOAJ
description Abstract. Introduction: Drug repositioning offers numerous advantages, such as faster development timelines, reduced costs, and lower failure rates in drug development. Supervised machine learning is commonly used to score drug candidates, but it is hindered by the lack of reliable negative data (drugs that failed because of inefficacy or toxicity). Such data are difficult to access, which lowers the prediction accuracy and generalization of the resulting models. Positive-Unlabeled (PU) learning has been used to overcome this issue, either by randomly sampling unlabeled drugs or by identifying probable negatives, but it still suffers from misclassification or oversimplified decision boundaries. Results: We proposed a novel strategy that uses a large language model (GPT-4) to analyze all clinical trials on prostate cancer and systematically identify true negatives. On independent test sets, this approach reached a Matthews Correlation Coefficient of 0.76 (± 0.33), compared with 0.55 (± 0.15) and 0.48 (± 0.18) for two commonly used PU learning approaches. Using our labeling strategy, we created a training set of 26 positive and 54 experimentally validated negative drugs. We then applied a machine learning ensemble to this new dataset to assess the repurposing potential of the remaining 11,043 drugs in the DrugBank database. This analysis identified 980 potential candidates for prostate cancer. A detailed review of the top 30 revealed 9 promising drugs targeting various mechanisms such as genomic instability, p53 regulation, or TMPRSS2-ERG fusion. Conclusion: If the negative data labeling approach is expanded to all diseases within the ClinicalTrials.gov database, our method could greatly advance supervised drug repositioning, offering a more accurate and data-driven path for discovering new treatments.
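The Matthews Correlation Coefficient quoted in the abstract can be computed from the four confusion-matrix counts. The following is a minimal sketch of the standard formula only; it is not the authors' evaluation code, and the function names and the toy labels are hypothetical, chosen purely for illustration.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = positive, 0 = negative)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def mcc(y_true, y_pred):
    """Matthews Correlation Coefficient; returns 0.0 when the denominator is zero."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical toy labels: 3 positive drugs and 5 negative drugs.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
score = mcc(y_true, y_pred)  # (2*4 - 1*1) / sqrt(3*3*5*5) = 7/15
```

MCC ranges from -1 (total disagreement) to 1 (perfect prediction), and it uses all four cells of the confusion matrix, which makes it a common choice for imbalanced sets such as a 26-positive, 54-negative training set.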
format Article
id doaj-art-f2b48bfe9f3942cea9a71a3a59c23c1a
institution Kabale University
issn 1758-2946
language English
publishDate 2025-02-01
publisher BMC
record_format Article
series Journal of Cheminformatics
spelling doaj-art-f2b48bfe9f3942cea9a71a3a59c23c1a
2025-02-09T12:52:16Z
eng
BMC
Journal of Cheminformatics
1758-2946
2025-02-01
vol. 17, no. 1, pp. 1-12
10.1186/s13321-025-00962-0
Improving drug repositioning with negative data labeling using large language models
Milan Picard (Molecular Medicine Department, CHU de Québec Research Center, Université Laval)
Mickael Leclercq (Molecular Medicine Department, CHU de Québec Research Center, Université Laval)
Antoine Bodein (Molecular Medicine Department, CHU de Québec Research Center, Université Laval)
Marie Pier Scott-Boyer (Molecular Medicine Department, CHU de Québec Research Center, Université Laval)
Olivier Perin (Digital Transformation and Innovation Department, L'Oréal Advanced Research)
Arnaud Droit (Molecular Medicine Department, CHU de Québec Research Center, Université Laval)
https://doi.org/10.1186/s13321-025-00962-0
AI-driven drug discovery
Computational drug scoring
Negative data labeling
Drug repurposing
Castration resistant prostate cancer
Biomedical text mining
title Improving drug repositioning with negative data labeling using large language models
topic AI-driven drug discovery
Computational drug scoring
Negative data labeling
Drug repurposing
Castration resistant prostate cancer
Biomedical text mining
url https://doi.org/10.1186/s13321-025-00962-0