XGBoost-enhanced ensemble model using discriminative hybrid features for the prediction of sumoylation sites
Abstract Posttranslational modifications (PTMs) are essential for regulating protein localization and stability, significantly affecting gene expression, biological functions, and genome replication. Among these, sumoylation, a PTM that attaches a chemical group to protein sequences, plays a critical...
Main Authors: | Salman Khan, Sumaiya Noor, Tahir Javed, Afshan Naseem, Fahad Aslam, Salman A. AlQahtani, Nijad Ahmad |
---|---|
Format: | Article |
Language: | English |
Published: | BMC, 2025-02-01 |
Series: | BioData Mining |
Subjects: | Pseudo position-specific score matrix; Sumoylation; Post-translation modification; XGBoost; SHAP |
Online Access: | https://doi.org/10.1186/s13040-024-00415-8 |
_version_ | 1825197657903071232 |
---|---|
author | Salman Khan, Sumaiya Noor, Tahir Javed, Afshan Naseem, Fahad Aslam, Salman A. AlQahtani, Nijad Ahmad |
author_sort | Salman Khan |
collection | DOAJ |
description | Abstract Posttranslational modifications (PTMs) are essential for regulating protein localization and stability, and they significantly affect gene expression, biological functions, and genome replication. Among these, sumoylation, a PTM that attaches a chemical group to protein sequences, plays a critical role in protein function. Identifying sumoylation sites is particularly important because of their links to Parkinson’s and Alzheimer’s diseases. This study introduces XGBoost-Sumo, a model that predicts sumoylation sites by integrating protein structure and sequence data. The model encodes peptides with a transformer-based attention mechanism and extracts evolutionary features with the PsePSSM-DWT approach. After fusing the word embeddings with the evolutionary descriptors, it applies the SHapley Additive exPlanations (SHAP) algorithm for feature selection and uses eXtreme Gradient Boosting (XGBoost) for classification. XGBoost-Sumo achieved an accuracy of 99.68% on benchmark datasets using 10-fold cross-validation and 96.08% on independent samples, outperforming existing models by 10.31% on training data and 2.74% on independent tests. The model’s reliability and high performance make it a valuable resource for researchers, with potential for applications in pharmaceutical development. |
format | Article |
id | doaj-art-af46290e133248f2ab4f09bd7be5149d |
institution | Kabale University |
issn | 1756-0381 |
language | English |
publishDate | 2025-02-01 |
publisher | BMC |
record_format | Article |
series | BioData Mining |
spelling | doaj-art-af46290e133248f2ab4f09bd7be5149d; 2025-02-09T12:15:55Z; eng; BMC; BioData Mining; 1756-0381; 2025-02-01; 18; 1; 1; 18; 10.1186/s13040-024-00415-8; XGBoost-enhanced ensemble model using discriminative hybrid features for the prediction of sumoylation sites; Authors: Salman Khan [0], Sumaiya Noor [1], Tahir Javed [2], Afshan Naseem [3], Fahad Aslam [4], Salman A. AlQahtani [5], Nijad Ahmad [6]; Affiliations: [0] New Emerging Technologies and 5G Network and Beyond Research Chair, Department of Computer Engineering, College of Computer and Information Sciences, King Saud University; [1] Business and Management Sciences Department, Purdue University; [2] Department of Computer Science, Allama Iqbal Open University; [3] Institute of Oceanography and Environment (INOS), Universiti Malaysia Terengganu; [4] Institute of Oceanography and Environment (INOS), Universiti Malaysia Terengganu; [5] New Emerging Technologies and 5G Network and Beyond Research Chair, Department of Computer Engineering, College of Computer and Information Sciences, King Saud University; [6] Department of Computer Science, Khurasan University Jalalabad; https://doi.org/10.1186/s13040-024-00415-8; Keywords: Pseudo position-specific score matrix, Sumoylation, Post-translation modification, XGBoost, SHAP |
title | XGBoost-enhanced ensemble model using discriminative hybrid features for the prediction of sumoylation sites |
topic | Pseudo position-specific score matrix; Sumoylation; Post-translation modification; XGBoost; SHAP |
url | https://doi.org/10.1186/s13040-024-00415-8 |