The Data Artifacts Glossary: a community-based repository for bias on health datasets

Abstract Background The deployment of Artificial Intelligence (AI) in healthcare has the potential to transform patient care through improved diagnostics, personalized treatment plans, and more efficient resource management. However, the effectiveness and fairness of AI are critically dependent on t...

Bibliographic Details
Main Authors:	Rodrigo R. Gameiro, Naira Link Woite, Christopher M. Sauer, Sicheng Hao, Chrystinne Oliveira Fernandes, Anna E. Premo, Alice Rangel Teixeira, Isabelle Resli, An-Kwok Ian Wong, Leo Anthony Celi
Format:	Article
Language:	English
Published:	BMC 2025-02-01
Series:	Journal of Biomedical Science
Subjects:	Bias; Health equity; Dataset; Data Artifacts Glossary; Artificial intelligence; Machine learning
Online Access:	https://doi.org/10.1186/s12929-024-01106-6

author	Rodrigo R. Gameiro, Naira Link Woite, Christopher M. Sauer, Sicheng Hao, Chrystinne Oliveira Fernandes, Anna E. Premo, Alice Rangel Teixeira, Isabelle Resli, An-Kwok Ian Wong, Leo Anthony Celi
author_sort	Rodrigo R. Gameiro
collection	DOAJ
description	Abstract
Background: The deployment of Artificial Intelligence (AI) in healthcare has the potential to transform patient care through improved diagnostics, personalized treatment plans, and more efficient resource management. However, the effectiveness and fairness of AI are critically dependent on the data it learns from. Biased datasets can lead to AI outputs that perpetuate disparities, particularly affecting social minorities and marginalized groups.
Objective: This paper introduces the “Data Artifacts Glossary”, a dynamic, open-source framework designed to systematically document and update potential biases in healthcare datasets. The aim is to provide a comprehensive tool that enhances the transparency and accuracy of AI applications in healthcare and contributes to understanding and addressing health inequities.
Methods: Utilizing a methodology inspired by the Delphi method, a diverse team of experts conducted iterative rounds of discussions and literature reviews. The team synthesized insights to develop a comprehensive list of bias categories and designed the glossary’s structure. The Data Artifacts Glossary was piloted using the MIMIC-IV dataset to validate its utility and structure.
Results: The Data Artifacts Glossary adopts a collaborative approach modeled on successful open-source projects like Linux and Python. Hosted on GitHub, it utilizes robust version control and collaborative features, allowing stakeholders from diverse backgrounds to contribute. Through a rigorous peer review process managed by community members, the glossary ensures the continual refinement and accuracy of its contents. The implementation of the Data Artifacts Glossary with the MIMIC-IV dataset illustrates its utility. It categorizes biases and facilitates their identification and understanding.
Conclusion: The Data Artifacts Glossary serves as a vital resource for enhancing the integrity of AI applications in healthcare by providing a mechanism to recognize and mitigate dataset biases before they impact AI outputs. It not only aids in avoiding bias in model development but also contributes to understanding and addressing the root causes of health disparities.
format	Article
id	doaj-art-11b22dc444284626a5789a7067a3ef45
institution	Kabale University
issn	1423-0127
language	English
publishDate	2025-02-01
publisher	BMC
record_format	Article
series	Journal of Biomedical Science
spelling	doaj-art-11b22dc444284626a5789a7067a3ef45 | 2025-02-09T12:48:55Z | eng | BMC | Journal of Biomedical Science | 1423-0127 | 2025-02-01 | 32119 | 10.1186/s12929-024-01106-6 | The Data Artifacts Glossary: a community-based repository for bias on health datasets | Rodrigo R. Gameiro (Laboratory for Computational Physiology, Massachusetts Institute of Technology); Naira Link Woite (Laboratory for Computational Physiology, Massachusetts Institute of Technology); Christopher M. Sauer (Laboratory for Computational Physiology, Massachusetts Institute of Technology); Sicheng Hao (Division of Pulmonary, Allergy, and Critical Care Medicine, Duke University); Chrystinne Oliveira Fernandes (Laboratory for Computational Physiology, Massachusetts Institute of Technology); Anna E. Premo (Learning Research and Development Center, University of Pittsburgh); Alice Rangel Teixeira (Department of Philosophy, Universitat Autónoma de Barcelona); Isabelle Resli (School of Electrical Engineering and Computer Science, Oregon State University); An-Kwok Ian Wong (Division of Pulmonary, Allergy, and Critical Care Medicine, Duke University); Leo Anthony Celi (Laboratory for Computational Physiology, Massachusetts Institute of Technology) | https://doi.org/10.1186/s12929-024-01106-6 | Bias; Health equity; Dataset; Data Artifacts Glossary; Artificial intelligence; Machine learning
title	The Data Artifacts Glossary: a community-based repository for bias on health datasets
topic	Bias; Health equity; Dataset; Data Artifacts Glossary; Artificial intelligence; Machine learning
url	https://doi.org/10.1186/s12929-024-01106-6


Similar Items