Comparison of anonymization techniques regarding statistical reproducibility.

<h4>Background</h4>Anonymization opens up innovative ways of using secondary data without the requirements of the GDPR, as anonymized data no longer affects the privacy of data subjects. Anonymization requires data alteration, and this project aims to compare the ability of such pr...

Bibliographic Details
Main Authors: David Pau, Camille Bachot, Charles Monteil, Laetitia Vinet, Mathieu Boucher, Nadir Sella, Romain Jegou
Format: Article
Language: English
Published: Public Library of Science (PLoS) 2025-02-01
Series: PLOS Digital Health
Online Access: https://doi.org/10.1371/journal.pdig.0000735
author David Pau
Camille Bachot
Charles Monteil
Laetitia Vinet
Mathieu Boucher
Nadir Sella
Romain Jegou
collection DOAJ
description <h4>Background</h4>Anonymization opens up innovative ways of using secondary data without the requirements of the GDPR, as anonymized data no longer affects the privacy of data subjects. Anonymization requires data alteration, and this project aims to compare the ability of such privacy protection methods to maintain the reliability and utility of scientific data for secondary research purposes.<h4>Methods</h4>The French data protection authority (CNIL) defines anonymization as a processing activity that consists of using methods to make any identification of people, by any means, impossible in an irreversible manner. To meet the project's objective, a series of analyses was performed on a cohort and reproduced on four sets of anonymized data for comparison. Four assessment levels were used to evaluate the impact of anonymization: level 1 referred to the replication of statistical outputs, level 2 referred to the accuracy of statistical results, level 3 assessed data alteration (using Hellinger distances) and level 4 assessed privacy risks (using WP29 criteria).<h4>Results</h4>87 items were produced on the raw cohort data and then reproduced on each of the four anonymized datasets. The overall level 1 replication score ranged from 67% to 100% depending on the anonymization solution. The most difficult analyses to replicate were regression models (sub-score ranging from 78% to 100%) and survival analysis (sub-score ranging from 0% to 100%). The overall level 2 accuracy score ranged from 22% to 79% depending on the anonymization solution. For level 3, three methods had some variables with different probability distributions (Hellinger distance = 1). For level 4, all methods reduced the privacy risk of singling out, with relative risk reductions ranging from 41% to 65%.<h4>Conclusion</h4>None of the anonymization methods reproduced all outputs and results. A trade-off has to be found between context risk and the usefulness of the data to answer the research question.
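The level 3 assessment in the abstract relies on Hellinger distances between the distribution of a variable in the raw data and in the anonymized data. The abstract does not say how the authors computed them, so the following is only a minimal sketch for a categorical variable, using the standard definition H = sqrt(1 - sum(sqrt(p_i * q_i))). The function name `hellinger_distance` and the sample values are hypothetical and do not come from the article.

```python
import math
from collections import Counter


def hellinger_distance(sample_a, sample_b):
    """Hellinger distance between the empirical distributions of two
    categorical samples (illustrative helper, not the authors' code)."""
    counts_a, counts_b = Counter(sample_a), Counter(sample_b)
    n_a, n_b = len(sample_a), len(sample_b)
    categories = set(counts_a) | set(counts_b)
    # Bhattacharyya coefficient: overlap between the two distributions.
    # Counter returns 0 for categories missing from one sample.
    overlap = sum(
        math.sqrt((counts_a[c] / n_a) * (counts_b[c] / n_b))
        for c in categories
    )
    # max() guards against a tiny negative value caused by rounding.
    return math.sqrt(max(0.0, 1.0 - overlap))


# Hypothetical values of one variable before and after anonymization.
raw = ["stage I", "stage II", "stage II", "stage III"]
anonymized = ["early", "late", "late", "late"]
print(hellinger_distance(raw, anonymized))
```

With this definition the distance lies between 0 and 1. It reaches 1 only when the two samples share no category at all, which is the situation reported as "Hellinger distance = 1" in the results: the anonymized variable no longer takes any of the values observed in the raw data.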
format Article
id doaj-art-f909a44af9c4494f8d58d3b4feeb464a
institution Kabale University
issn 2767-3170
language English
publishDate 2025-02-01
publisher Public Library of Science (PLoS)
record_format Article
series PLOS Digital Health
spelling doaj-art-f909a44af9c4494f8d58d3b4feeb464a; 2025-02-07T05:31:11Z; eng; Public Library of Science (PLoS); PLOS Digital Health; ISSN 2767-3170; 2025-02-01; volume 4, issue 2; e0000735; 10.1371/journal.pdig.0000735; Comparison of anonymization techniques regarding statistical reproducibility.; David Pau, Camille Bachot, Charles Monteil, Laetitia Vinet, Mathieu Boucher, Nadir Sella, Romain Jegou; https://doi.org/10.1371/journal.pdig.0000735
title Comparison of anonymization techniques regarding statistical reproducibility.
url https://doi.org/10.1371/journal.pdig.0000735