Re-identification Risk of Online Data Repositories

dc.contributor.advisor Milne, B en
dc.contributor.author Roberts, Shaun en
dc.date.accessioned 2020-02-17T01:01:22Z en
dc.date.issued 2019 en
dc.identifier.uri http://hdl.handle.net/2292/50065 en
dc.description Full Text is available to authenticated members of The University of Auckland only. en
dc.description.abstract This project aimed to raise awareness of the importance of data protection among people who wish to make their data available to others, by investigating the re-identification risk of openly available datasets and by making it easier for others to carry out such investigations. We investigated the disclosure risk of datasets held in online data repositories at both the file and record levels, using the Data Intrusion Simulation (DIS) method for file-level disclosure risk and the DIS-adjusted Special Uniques Detection Algorithm (DIS-SUDA) method for record-level disclosure risk. We analysed 209 datasets from 2017 across four Dataverses within The Dataverse Project: the Australian Data Archive (ADA) Dataverse (28 files), the Harvard Dataverse (113 files), the Scholars Portal Dataverse (21 files), and the UNC Dataverse (47 files). Disclosure risks were generally very low, with only a few datasets having very high disclosure risks. These very high risks were due to the inclusion of directly identifying variables or very specific geographic information. We found no evidence of a difference in disclosure risk between datasets intended for publication and those not (p-value = 0.509). However, we found evidence of a difference between the Dataverses, with the ADA Dataverse having higher file-level disclosure risks than the other three Dataverses (p-values all 0). The ADA Dataverse has methods to ensure both Safe people and Safe projects, so it can accept a higher disclosure risk. We have laid out a framework to assist both industry professionals and academics in investigating file- and record-level disclosure risks. The disclosure risk calculation methods and statistical disclosure control techniques we suggest are reasonably simple to implement and understand for people across a wide range of industries and disciplines.
The disclosure risks at both the file and the individual level for datasets stored in online data repositories are reasonably low, except in a few cases. Care should always be taken when uploading data, and many existing public datasets may have high disclosure risks. en
dc.publisher ResearchSpace@Auckland en
dc.relation.ispartof Masters Thesis - University of Auckland en
dc.relation.isreferencedby UoA99265295611802091 en
dc.rights Items in ResearchSpace are protected by copyright, with all rights reserved, unless otherwise indicated. Previously published items are made available in accordance with the copyright policy of the publisher. en
dc.rights Restricted Item. Full Text is available to authenticated members of The University of Auckland only. en
dc.rights.uri https://researchspace.auckland.ac.nz/docs/uoa-docs/rights.htm en
dc.rights.uri http://creativecommons.org/licenses/by-nc-sa/3.0/nz/ en
dc.title Re-identification Risk of Online Data Repositories en
dc.type Thesis en
thesis.degree.discipline Statistics en
thesis.degree.grantor The University of Auckland en
thesis.degree.level Masters en
dc.rights.holder Copyright: The author en
pubs.elements-id 794557 en
pubs.record-created-at-source-date 2020-02-17 en
dc.identifier.wikidata Q112950082

