BioLake: an RNA expression analysis framework for prostate cancer biomarker powered by data lakehouse
Abstract Biomedical researchers must often deal with large amounts of raw data, and analysis of this data might provide significant insights. However, if the raw data size is large, it might be difficult to uncover these insights. In this paper, a data framework named BioLake is presented that provi...
Main Authors: | Qiaowang Li, Yaser Gamallat, Jon George Rokne, Tarek A. Bismar, Reda Alhajj |
---|---|
Format: | Article |
Language: | English |
Published: | BMC, 2025-02-01 |
Series: | BMC Bioinformatics |
Subjects: | Parallel computing; Data lakehouse; Expression analysis; Data visualization; Prostate cancer |
Online Access: | https://doi.org/10.1186/s12859-025-06050-2 |
_version_ | 1823861523487916032 |
---|---|
author | Qiaowang Li; Yaser Gamallat; Jon George Rokne; Tarek A. Bismar; Reda Alhajj |
author_sort | Qiaowang Li |
collection | DOAJ |
description | Abstract Biomedical researchers must often deal with large amounts of raw data, and analysis of this data might provide significant insights. However, if the raw data size is large, it might be difficult to uncover these insights. In this paper, a data framework named BioLake is presented that provides minimalist interactive methods to help researchers conduct bioinformatics data analysis. Unlike some existing analytical tools on the market, BioLake supports a wide range of web-based bioinformatics data analysis for public datasets, while allowing researchers to analyze their private datasets instantly. The tool also significantly enhances result interpretability by providing the source code and detailed instructions. In terms of data storage design, BioLake adopts the data lakehouse architecture to provide storage scalability and analysis flexibility. To further enhance the analysis efficiency, BioLake supports online analysis for custom data, allowing researchers to upload their own data via a designed procedure without waiting for server-side approval. BioLake allows a one-time upload of custom data of up to 500 MB to ensure that researchers avoid issues with data being too large for upload. In terms of the built-in dataset, BioLake applies reactive continuous data integration, helping the analysis pipeline to get rid of most preprocessing steps. The only pre-built-in dataset of BioLake in the first public version is TCGA-PRAD mRNA expression data for prostate cancer research, which is the primary focus of the development team of BioLake. In summary, BioLake offers a lightweight online tool to facilitate bioinformatic mRNA data analysis with the support of custom online data processing. |
format | Article |
id | doaj-art-09a6eccb4b4f4757a32dbafa8536d8a9 |
institution | Kabale University |
issn | 1471-2105 |
language | English |
publishDate | 2025-02-01 |
publisher | BMC |
record_format | Article |
series | BMC Bioinformatics |
spelling | doaj-art-09a6eccb4b4f4757a32dbafa8536d8a9; 2025-02-09T12:56:57Z; eng; BMC; BMC Bioinformatics; 1471-2105; 2025-02-01; 26; 1; 1; 17; 10.1186/s12859-025-06050-2; BioLake: an RNA expression analysis framework for prostate cancer biomarker powered by data lakehouse; Qiaowang Li (0), Yaser Gamallat (1), Jon George Rokne (2), Tarek A. Bismar (3), Reda Alhajj (4); (0) Department of Computer Science, University of Calgary; (1) Department of Pathology and Laboratory Medicine, University of Calgary; (2) Department of Computer Science, University of Calgary; (3) Department of Pathology and Laboratory Medicine, University of Calgary; (4) Department of Computer Science, University of Calgary; https://doi.org/10.1186/s12859-025-06050-2; Parallel computing; Data lakehouse; Expression analysis; Data visualization; Prostate cancer |
title | BioLake: an RNA expression analysis framework for prostate cancer biomarker powered by data lakehouse |
topic | Parallel computing; Data lakehouse; Expression analysis; Data visualization; Prostate cancer |
url | https://doi.org/10.1186/s12859-025-06050-2 |