PIPENN-EMB ensemble net and protein embeddings generalise protein interface prediction beyond homology

Bibliographic Details
Main Authors: David P. G. Thomas, Carlos M. Garcia Fernandez, Reza Haydarlou, K. Anton Feenstra
Format: Article
Language: English
Published: Nature Portfolio 2025-02-01
Series: Scientific Reports
Online Access: https://doi.org/10.1038/s41598-025-88445-y
Description
Summary: Protein interactions are crucial for understanding biological functions and disease mechanisms, but predicting them remains a complex task in computational biology. Deep learning models are increasingly successful at interface prediction. This study presents PIPENN-EMB, which explores the added value of using embeddings from the ProtT5-XL protein language model. Our results show substantial improvement over the previously published PIPENN model for protein interaction interface prediction, reaching an MCC of 0.313 vs. 0.249 and an AUROC of 0.800 vs. 0.755 on the BIO_DL_TE test set. Ablation studies furthermore show that these embeddings cover a broad range of ‘hand-crafted’ protein features. PIPENN-EMB reaches state-of-the-art performance on the ZK448 dataset for protein-protein interface prediction. We showcase predictions on 25 resistance-related proteins from Mycobacterium tuberculosis. Furthermore, whereas other state-of-the-art sequence-based methods perform worse for proteins that have little recognisable homology in their training data, PIPENN-EMB generalises to remote homologs, yielding stable AUROC across all three test sets with less than 30% sequence identity to the training dataset, and even to proteins with less than 15% sequence identity.
ISSN: 2045-2322
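
The summary reports results as MCC (Matthews correlation coefficient) and AUROC (area under the ROC curve). For readers unfamiliar with these metrics, the following is a minimal sketch of how each can be computed for per-residue interface labels. It is not the authors' evaluation code: the function names, the 0.5 decision threshold, and the toy labels and scores are all assumptions made for illustration.

```python
from math import sqrt


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (1 = interface residue)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


def auroc(y_true, scores):
    """AUROC as the probability that a random positive outscores a random negative.

    Ties count as half a win. This pairwise form is quadratic in the number
    of residues, which is fine for a small illustration.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: six residues with made-up labels and predicted scores.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.6, 0.4, 0.2, 0.3, 0.7]
# Assumed threshold of 0.5 to turn scores into binary predictions for MCC.
preds = [1 if s >= 0.5 else 0 for s in scores]

print(f"MCC   = {mcc(labels, preds):.3f}")
print(f"AUROC = {auroc(labels, scores):.3f}")
```

MCC depends on the chosen threshold, while AUROC is computed from the raw scores and is threshold-independent, which is one reason the two are often reported together.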