sORFdb – a database for sORFs, small proteins, and small protein families in bacteria
Abstract Small proteins with fewer than 100, particularly fewer than 50, amino acids are still largely unexplored. Nonetheless, they represent an essential part of bacteria’s often neglected genetic repertoire. In recent years, the development of ribosome profiling protocols has led to the detection...
Main Authors: | Julian M. Hahnfeld, Oliver Schwengers, Lukas Jelonek, Sonja Diedrich, Franz Cemič, Alexander Goesmann |
---|---|
Format: | Article |
Language: | English |
Published: | BMC, 2025-02-01 |
Series: | BMC Genomics |
Subjects: | Small proteins; Protein families; Short open reading frames; SORF; Database; Bacteria |
Online Access: | https://doi.org/10.1186/s12864-025-11301-w |
_version_ | 1823863249027727360 |
---|---|
author | Julian M. Hahnfeld; Oliver Schwengers; Lukas Jelonek; Sonja Diedrich; Franz Cemič; Alexander Goesmann |
author_sort | Julian M. Hahnfeld |
collection | DOAJ |
description | Abstract Small proteins with fewer than 100, particularly fewer than 50, amino acids are still largely unexplored. Nonetheless, they represent an essential part of bacteria’s often neglected genetic repertoire. In recent years, the development of ribosome profiling protocols has led to the detection of an increasing number of previously unknown small proteins. Despite this, they are overlooked in many cases by automated genome annotation pipelines, and often, no functional descriptions can be assigned due to a lack of known homologs. To understand and overcome these limitations, the current abundance of small proteins in existing databases was evaluated, and a new dedicated database for small proteins and their potential functions, called ’sORFdb’, was created. To this end, small proteins were extracted from annotated bacterial genomes in the GenBank database. Subsequently, they were quality-filtered, compared, and complemented with proteins from Swiss-Prot, UniProt, and SmProt to ensure reliable identification and characterization of small proteins. Families of similar small proteins were created using bidirectional best BLAST hits followed by Markov clustering. Analysis of small proteins in public databases revealed that their number is still limited due to historical and technical constraints. Additionally, functional descriptions were often missing despite the presence of potential homologs. As expected, a taxonomic bias was evident in over-represented clinically relevant bacteria. This new and comprehensive database is accessible via a feature-rich website providing specialized search features for sORFs and small proteins of high quality. Additionally, small protein families with Hidden Markov Models and information on taxonomic distribution and other physicochemical properties are available. 
In conclusion, the novel small protein database sORFdb is a specialized, taxonomy-independent database that improves the findability and classification of sORFs, small proteins, and their functions in bacteria, thereby supporting their future detection and consistent annotation. All sORFdb data is freely accessible via https://sorfdb.computational.bio. |
format | Article |
id | doaj-art-19e2acb7d4264318b5f5020ef371f712 |
institution | Kabale University |
issn | 1471-2164 |
language | English |
publishDate | 2025-02-01 |
publisher | BMC |
record_format | Article |
series | BMC Genomics |
spelling | doaj-art-19e2acb7d4264318b5f5020ef371f712; 2025-02-09T12:13:51Z; eng; BMC; BMC Genomics; 1471-2164; 2025-02-01; 261114; 10.1186/s12864-025-11301-w; sORFdb – a database for sORFs, small proteins, and small protein families in bacteria; Julian M. Hahnfeld (Bioinformatics and Systems Biology, Justus Liebig University Giessen); Oliver Schwengers (Bioinformatics and Systems Biology, Justus Liebig University Giessen); Lukas Jelonek (Bioinformatics and Systems Biology, Justus Liebig University Giessen); Sonja Diedrich (Bioinformatics and Systems Biology, Justus Liebig University Giessen); Franz Cemič (Department of Computer Science, University of Applied Sciences Giessen); Alexander Goesmann (Bioinformatics and Systems Biology, Justus Liebig University Giessen); https://doi.org/10.1186/s12864-025-11301-w; Small proteins; Protein families; Short open reading frames; SORF; Database; Bacteria |
title | sORFdb – a database for sORFs, small proteins, and small protein families in bacteria |
topic | Small proteins; Protein families; Short open reading frames; SORF; Database; Bacteria |
url | https://doi.org/10.1186/s12864-025-11301-w |
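
The abstract states that families of similar small proteins were built from bidirectional best BLAST hits followed by Markov clustering. The sketch below illustrates that two-step idea on toy data; it is not the authors' pipeline. Several points are assumptions made only for illustration: the input rows are `(query, subject, bitscore)` tuples such as one might parse from tabular BLAST output, "best" is taken per query and per subject genome (the record does not say how the authors scoped it), and the function names, the `genome_of` mapping, the inflation value, and the iteration count are all hypothetical.

```python
def bidirectional_best_hits(hits, genome_of):
    """Keep protein pairs that are each other's best hit between two genomes.

    hits: iterable of (query, subject, bitscore) tuples (assumed input shape).
    genome_of: dict mapping protein id -> genome id (hypothetical helper data).
    Returns {(a, b): score} with a < b and score = the lower of both directions.
    """
    best = {}  # (query, subject genome) -> (subject, score)
    for query, subject, score in hits:
        genome = genome_of[subject]
        if genome_of[query] == genome:
            continue  # ignore hits within the same genome
        key = (query, genome)
        if key not in best or score > best[key][1]:
            best[key] = (subject, score)
    edges = {}
    for (query, _), (subject, score) in best.items():
        back = best.get((subject, genome_of[query]))
        if back is not None and back[0] == query:
            edges[tuple(sorted((query, subject)))] = min(score, back[1])
    return edges


def markov_cluster(edges, inflation=2.0, iterations=30):
    """Plain Markov clustering (expansion + inflation) on a weighted graph.

    A dense pure-Python version for tiny examples only; real data sets
    need a sparse implementation.
    """
    nodes = sorted({n for pair in edges for n in pair})
    index = {n: i for i, n in enumerate(nodes)}
    size = len(nodes)
    m = [[0.0] * size for _ in range(size)]
    for (a, b), weight in edges.items():
        m[index[a]][index[b]] = m[index[b]][index[a]] = float(weight)
    for i in range(size):
        m[i][i] = max(m[i]) or 1.0  # self-loop so every column has mass

    def normalize(mat):
        # Make every column sum to 1 (column-stochastic matrix).
        for j in range(size):
            total = sum(mat[i][j] for i in range(size))
            for i in range(size):
                mat[i][j] /= total

    normalize(m)
    for _ in range(iterations):
        # Expansion: square the matrix (two-step random walks).
        m = [[sum(m[i][k] * m[k][j] for k in range(size))
              for j in range(size)] for i in range(size)]
        # Inflation: raise entries to a power, then renormalize columns.
        m = [[value ** inflation for value in row] for row in m]
        normalize(m)
    # Rows that still carry mass act as attractors; the columns they
    # reach form one cluster. Identical clusters collapse in the set.
    clusters = set()
    for i in range(size):
        members = frozenset(nodes[j] for j in range(size) if m[i][j] > 1e-6)
        if members:
            clusters.add(members)
    return clusters


# Toy data: six proteins from three made-up genomes A, B, C.
genome_of = {"a1": "A", "b1": "B", "c1": "C", "d1": "A", "e1": "B", "f1": "C"}
hits = [
    ("a1", "b1", 50), ("b1", "a1", 50),
    ("a1", "c1", 50), ("c1", "a1", 50),
    ("b1", "c1", 50), ("c1", "b1", 50),
    ("d1", "e1", 60), ("e1", "d1", 60),
    ("a1", "e1", 20),  # weaker hit, loses to b1 as a1's best in genome B
    ("f1", "d1", 30),  # one-directional only, so no edge is created
]
families = markov_cluster(bidirectional_best_hits(hits, genome_of))
for family in sorted(sorted(f) for f in families):
    print(family)
```

On this toy input the reciprocal filter leaves a triangle (`a1`, `b1`, `c1`) and a separate pair (`d1`, `e1`), and clustering returns those two groups as families; `f1` is dropped because its hit is not reciprocated. Restricting the graph to reciprocal best hits before clustering keeps one-sided, possibly spurious similarities out of the families.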