sORFdb – a database for sORFs, small proteins, and small protein families in bacteria

Abstract Small proteins with fewer than 100, particularly fewer than 50, amino acids are still largely unexplored. Nonetheless, they represent an essential part of bacteria’s often neglected genetic repertoire. In recent years, the development of ribosome profiling protocols has led to the detection...

Bibliographic Details
Main Authors: Julian M. Hahnfeld, Oliver Schwengers, Lukas Jelonek, Sonja Diedrich, Franz Cemič, Alexander Goesmann
Format: Article
Language: English
Published: BMC 2025-02-01
Series: BMC Genomics
Subjects: Small proteins; Protein families; Short open reading frames; SORF; Database; Bacteria
Online Access: https://doi.org/10.1186/s12864-025-11301-w
author Julian M. Hahnfeld
Oliver Schwengers
Lukas Jelonek
Sonja Diedrich
Franz Cemič
Alexander Goesmann
collection DOAJ
description Abstract Small proteins with fewer than 100, particularly fewer than 50, amino acids are still largely unexplored. Nonetheless, they represent an essential part of bacteria’s often neglected genetic repertoire. In recent years, the development of ribosome profiling protocols has led to the detection of an increasing number of previously unknown small proteins. Despite this, they are often overlooked by automated genome annotation pipelines, and in many cases no functional description can be assigned because known homologs are lacking. To understand and overcome these limitations, the current abundance of small proteins in existing databases was evaluated, and a new dedicated database for small proteins and their potential functions, called ‘sORFdb’, was created. To this end, small proteins were extracted from annotated bacterial genomes in the GenBank database. Subsequently, they were quality-filtered, compared, and complemented with proteins from Swiss-Prot, UniProt, and SmProt to ensure reliable identification and characterization of small proteins. Families of similar small proteins were created using bidirectional best BLAST hits followed by Markov clustering. Analysis of small proteins in public databases revealed that their number is still limited due to historical and technical constraints. Additionally, functional descriptions were often missing despite the presence of potential homologs. As expected, a taxonomic bias was evident, with clinically relevant bacteria over-represented. This new and comprehensive database is accessible via a feature-rich website providing specialized search features for sORFs and small proteins of high quality. Additionally, small protein families with Hidden Markov Models and information on taxonomic distribution and physicochemical properties are available.
In conclusion, the novel small protein database sORFdb is a specialized, taxonomy-independent database that improves the findability and classification of sORFs, small proteins, and their functions in bacteria, thereby supporting their future detection and consistent annotation. All sORFdb data is freely accessible via https://sorfdb.computational.bio.
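The family-building step named in the abstract (bidirectional best BLAST hits followed by Markov clustering) can be illustrated with a minimal sketch. This is not the authors' pipeline: the grouping of best hits per source genome, the `genome_of` mapping, the toy protein names and bit scores, the self-loop weighting, and the inflation value of 2.0 are all assumptions made for illustration, and a production workflow would more likely rely on a dedicated Markov clustering tool.

```python
def bidirectional_best_hits(hits, genome_of):
    """Return {frozenset({a, b}): weight} for pairs that are mutual best hits.

    hits: iterable of (query, subject, bitscore) tuples, e.g. parsed from
    tabular BLAST output. genome_of: hypothetical mapping protein -> genome.
    Assumption: the best hit is taken per (query, subject genome).
    """
    best = {}  # (query, subject genome) -> (subject, bitscore)
    for query, subject, bitscore in hits:
        if genome_of[query] == genome_of[subject]:
            continue  # ignore hits within the same genome
        key = (query, genome_of[subject])
        if key not in best or bitscore > best[key][1]:
            best[key] = (subject, bitscore)
    edges = {}
    for (query, _), (subject, bitscore) in best.items():
        back = best.get((subject, genome_of[query]))
        if back is not None and back[0] == query:
            edges[frozenset((query, subject))] = min(bitscore, back[1])
    return edges


def markov_cluster(edges, inflation=2.0, iterations=50):
    """Tiny dense Markov clustering: alternate expansion and inflation."""
    nodes = sorted({node for pair in edges for node in pair})
    index = {node: i for i, node in enumerate(nodes)}
    n = len(nodes)
    m = [[0.0] * n for _ in range(n)]
    for pair, weight in edges.items():
        a, b = (index[node] for node in pair)
        m[a][b] = m[b][a] = weight
    for i in range(n):
        m[i][i] = max(m[i])  # self-loop weighted like the strongest edge

    def normalize(mat):
        # Scale every column so that it sums to 1 (column-stochastic).
        for j in range(n):
            total = sum(mat[i][j] for i in range(n))
            for i in range(n):
                mat[i][j] /= total
        return mat

    m = normalize(m)
    for _ in range(iterations):
        # Expansion: matrix squaring spreads flow along longer paths.
        m = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
             for i in range(n)]
        # Inflation: element-wise power strengthens strong flows.
        m = normalize([[value ** inflation for value in row] for row in m])
    clusters = set()
    for i in range(n):
        # Rows that retain flow are attractors; their columns are the members.
        members = frozenset(nodes[j] for j in range(n) if m[i][j] > 1e-6)
        if members:
            clusters.add(members)
    return clusters


# Toy data: two families spread over three hypothetical genomes A, B, C.
genome_of = {"a1": "A", "b1": "B", "c1": "C", "a2": "A", "b2": "B"}
hits = [
    ("a1", "b1", 90), ("b1", "a1", 90), ("a1", "c1", 80), ("c1", "a1", 80),
    ("b1", "c1", 85), ("c1", "b1", 85), ("a2", "b2", 70), ("b2", "a2", 70),
    ("a1", "b2", 20), ("b2", "a1", 20),  # weak cross-family hits
]
edges = bidirectional_best_hits(hits, genome_of)
families = markov_cluster(edges)
print(sorted(sorted(family) for family in families))
```

In this toy input, the weak hits between `a1` and `b2` are not best hits in either direction, so they never become graph edges; the clustering then only has to separate the two remaining connected groups.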
format Article
id doaj-art-19e2acb7d4264318b5f5020ef371f712
institution Kabale University
issn 1471-2164
language English
publishDate 2025-02-01
publisher BMC
record_format Article
series BMC Genomics
spelling doaj-art-19e2acb7d4264318b5f5020ef371f712; 2025-02-09T12:13:51Z; eng; BMC; BMC Genomics; 1471-2164; 2025-02-01; 26; 1; 1; 14; 10.1186/s12864-025-11301-w
sORFdb – a database for sORFs, small proteins, and small protein families in bacteria
Julian M. Hahnfeld (0): Bioinformatics and Systems Biology, Justus Liebig University Giessen
Oliver Schwengers (1): Bioinformatics and Systems Biology, Justus Liebig University Giessen
Lukas Jelonek (2): Bioinformatics and Systems Biology, Justus Liebig University Giessen
Sonja Diedrich (3): Bioinformatics and Systems Biology, Justus Liebig University Giessen
Franz Cemič (4): Department of Computer Science, University of Applied Sciences Giessen
Alexander Goesmann (5): Bioinformatics and Systems Biology, Justus Liebig University Giessen
https://doi.org/10.1186/s12864-025-11301-w
Small proteins; Protein families; Short open reading frames; SORF; Database; Bacteria
title sORFdb – a database for sORFs, small proteins, and small protein families in bacteria
topic Small proteins
Protein families
Short open reading frames
SORF
Database
Bacteria
url https://doi.org/10.1186/s12864-025-11301-w