BioLake: an RNA expression analysis framework for prostate cancer biomarker powered by data lakehouse

Bibliographic Details
Main Authors: Qiaowang Li, Yaser Gamallat, Jon George Rokne, Tarek A. Bismar, Reda Alhajj
Format: Article
Language: English
Published: BMC 2025-02-01
Series: BMC Bioinformatics
Subjects: Parallel computing; Data lakehouse; Expression analysis; Data visualization; Prostate cancer
Online Access:https://doi.org/10.1186/s12859-025-06050-2
author Qiaowang Li
Yaser Gamallat
Jon George Rokne
Tarek A. Bismar
Reda Alhajj
collection DOAJ
description Abstract: Biomedical researchers often work with large volumes of raw data whose analysis can yield significant insights, but the sheer size of the data can make those insights hard to uncover. This paper presents BioLake, a data framework that provides minimalist interactive methods to help researchers conduct bioinformatics data analysis. Unlike some existing analytical tools, BioLake supports a wide range of web-based bioinformatics analyses on public datasets while also allowing researchers to analyze their private datasets instantly. The tool also improves result interpretability by providing the source code and detailed instructions. For data storage, BioLake adopts the data lakehouse architecture to provide storage scalability and analysis flexibility. To further improve analysis efficiency, BioLake supports online analysis of custom data: researchers upload their own data through a designed procedure without waiting for server-side approval. A single upload of custom data can be up to 500 MB, so that researchers avoid issues with data being too large to upload. For the built-in dataset, BioLake applies reactive continuous data integration, which removes most preprocessing steps from the analysis pipeline. The only built-in dataset in the first public version is the TCGA-PRAD mRNA expression data for prostate cancer research, which is the primary focus of the BioLake development team. In summary, BioLake is a lightweight online tool that facilitates bioinformatic mRNA data analysis with support for custom online data processing.
id doaj-art-09a6eccb4b4f4757a32dbafa8536d8a9
institution Kabale University
issn 1471-2105
spelling doaj-art-09a6eccb4b4f4757a32dbafa8536d8a9
2025-02-09T12:56:57Z
eng
BMC
BMC Bioinformatics
1471-2105
2025-02-01
volume 26, issue 1, pages 1-17
10.1186/s12859-025-06050-2
BioLake: an RNA expression analysis framework for prostate cancer biomarker powered by data lakehouse
Qiaowang Li (Department of Computer Science, University of Calgary)
Yaser Gamallat (Department of Pathology and Laboratory Medicine, University of Calgary)
Jon George Rokne (Department of Computer Science, University of Calgary)
Tarek A. Bismar (Department of Pathology and Laboratory Medicine, University of Calgary)
Reda Alhajj (Department of Computer Science, University of Calgary)
https://doi.org/10.1186/s12859-025-06050-2
Parallel computing
Data lakehouse
Expression analysis
Data visualization
Prostate cancer
title BioLake: an RNA expression analysis framework for prostate cancer biomarker powered by data lakehouse
topic Parallel computing
Data lakehouse
Expression analysis
Data visualization
Prostate cancer