
In document Anonymization of Health Data (pages 126-0)

user-configured hierarchies for the various attributes. This form of generalization is non-perturbative, meaning the transformed data remains truthful to the original data. For example, generalizing age 45 into the age range 40 to 50 still tells the truth about the age; a perturbative transformation changing the age from 45 to 46, however, would make that value untruthful to the original data. Perturbative transformation methods, employed in a clever fashion, may result in different utility in the resulting data set, but they have not been considered for this project.
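The distinction can be made concrete with a small sketch. This is purely illustrative Python, not part of the project's tooling, and the function names are my own: a non-perturbative generalization maps an age onto the interval containing it, while a perturbative transformation replaces it with a value that may be false.

```python
import random

def generalize_age(age: int, width: int = 10) -> str:
    """Non-perturbative: map an age onto the interval that contains it.
    The result stays truthful -- age 45 really does lie in [40, 50)."""
    low = (age // width) * width
    return f"{low}-{low + width}"

def perturb_age(age: int, spread: int = 2) -> int:
    """Perturbative: add random noise, so the result may be untruthful."""
    return age + random.randint(-spread, spread)

print(generalize_age(45))  # -> "40-50", still true of the original record
print(perturb_age(45))     # e.g. 46, which is no longer the true age
```

Both outputs hide the exact value, but only the first can be trusted to still describe the original record, which is the property the thesis relies on.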

Furthermore, the strategy used for applying the generalization transformations is simple. There exist algorithms which employ specific strategies for attempting to comply with the requirements of various privacy models. Examples include BUREL and the perturbative methods briefly covered in the literature review section on the β-likeness privacy model. Those algorithms were specifically designed to ensure compliance with the β-likeness privacy model that the paper introduced while retaining a high degree of utility in the resulting data sets. Other algorithms exist which focus on other privacy models.
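As a rough illustration of what such a privacy model demands, the following sketch checks the basic β-likeness condition as I understand it: within each equivalence class, no sensitive value's frequency may exceed its overall frequency in the data set by a relative factor of more than β. This is a simplified illustration of the condition only, not the BUREL algorithm itself, and the helper name is my own.

```python
from collections import Counter

def satisfies_beta_likeness(classes, beta):
    """Basic beta-likeness check. `classes` is a list of equivalence
    classes, each a list of sensitive values."""
    all_values = [v for cls in classes for v in cls]
    overall = {v: c / len(all_values) for v, c in Counter(all_values).items()}
    for cls in classes:
        for value, count in Counter(cls).items():
            p = overall[value]    # overall frequency of the value
            q = count / len(cls)  # frequency inside this class
            # Relative gain over the background distribution must stay <= beta
            if q > p and (q - p) / p > beta:
                return False
    return True

# Classes mirroring the overall distribution satisfy any positive beta:
print(satisfies_beta_likeness([["flu", "hiv"], ["flu", "hiv"]], beta=0.5))  # True
```

A class consisting entirely of one sensitive value, by contrast, would double that value's frequency relative to a 50/50 background and fail the check for β = 0.5.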

I initially considered implementing some such algorithms in a self-created tool, but moved away from that when I abandoned the plan of creating my own tool in favor of using a pre-existing one. As such, this project does not cover the use of such specialized algorithms, which may or may not influence utility in the resulting data sets.

9.4.4 Tool

The only tools considered for use in this project are free, publicly available tools. I ended up choosing ARX, which includes many useful features and proved highly valuable to my project. However, other non-free alternatives for data anonymization exist; I have not extensively researched their capabilities, but they may offer features that could help produce better results than those I produced using ARX.

ARX is somewhat limited in how it allows users to apply transformations to the attributes of an input data set: it only supports generalization, microaggregation and microaggregation with clustering.

In addition, the strategies it allows for applying the generalization transformation are limited to following a user-specified hierarchy. This limitation makes it difficult to implement algorithms for optimizing transformations, such as those mentioned in the previous section.
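Hierarchy-following generalization of this kind is simple to picture: each raw value maps to exactly one generalized value per level, and the anonymization process only chooses the level. A minimal sketch, using hypothetical ZIP codes rather than the thesis's actual hierarchy:

```python
# A user-specified hierarchy maps each raw value to one generalized
# value per level; level 0 is the raw value itself.
zip_hierarchy = {
    "0470": ["0470", "047*", "0***", "*"],
    "0471": ["0471", "047*", "0***", "*"],
    "0512": ["0512", "051*", "0***", "*"],
}

def generalize(value: str, level: int, hierarchy: dict) -> str:
    """Replace a value with its generalization at the requested level."""
    return hierarchy[value][level]

print(generalize("0471", 1, zip_hierarchy))  # -> "047*"
print(generalize("0512", 3, zip_hierarchy))  # -> "*" (full suppression)
```

Because the search space is restricted to the levels of such tables, an optimizing algorithm that wants to pick transformations outside the hierarchy has nowhere to express them.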

9.5 Recommendations for approaching anonymization of health data

CHAPTER 9. DISCUSSION

Through working on this project, I have learned a lot about the field of anonymization, the GDPR, and other related subjects. With my goal of assisting system administrators of the DHIS2 in anonymization efforts, I wanted to share the lessons I've learned by compiling a set of recommendations for approaching anonymization of health data.

Appendix B lists the recommendations. They mainly consist of what to be aware of when doing anonymization: which steps to go through, which considerations to make, what knowledge is required about both the data to be released and the relevant legislation, and some potential pitfalls. While these recommendations are not, and are not intended to be, a complete set of guidelines for anonymization, they can serve as a resource to lean on throughout the anonymization process.

The recommendations cover the following topics:

1. The purpose for releasing the data.

2. The type of data that is to be released.

3. What makes the data useful.

4. Relevant legislation.

5. What risks might be faced.

6. The identifiability of the data.

7. Privacy models to inform the anonymization process.

8. How to transform the data to achieve the goals of the anonymization process.

9. Consideration of tools to assist with the process.

10. Ethical considerations regarding releasing sensitive data.

11. Potential pitfalls.


Chapter 10

Conclusion

The aim of this research project was to establish a starting point which researchers, developers and system administrators of DHIS2 could lean on when wishing to publish data for a variety of purposes. The GDPR serves as a common piece of legislation for the EU which regulates the processing of personal data; that made it a good candidate for contextualizing the publishing of health data, a type of data covered by its scope. To achieve my goal, I posited the following research question:

How can existing approaches to data anonymization be applied to health data to sufficiently comply with privacy and data protection regulations stipulated in the General Data Protection Regulation (GDPR), while preserving utility in the resulting data?

Through this project, I have worked towards answering this question: building on previous literature, constructing an approach to testing various anonymization methods, and doing preparatory work to facilitate such testing, including gathering test data, examining privacy models and evaluating various tools before selecting one to use. All of this culminated in the testing of three approaches to anonymization, utilizing several privacy models, which produced results regarding both the compliance of the resulting anonymized data sets with the requirements of the GDPR and their utility.

Based on the considerations made while examining the test data set and selecting the identifying and quasi-identifying attributes, together with the results showing that the various privacy models provide strong protection against identity disclosure under the prosecutor and journalist scenarios as well as against attribute disclosure, we can conclude that all three approaches can be used to ensure compliance with the GDPR.
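For intuition, the prosecutor-scenario risk for a record is commonly taken as the reciprocal of the size of its equivalence class over the quasi-identifiers (the journalist scenario is analogous but measured against the population). The following sketch, with made-up records and a helper name of my own, illustrates the idea; it is not the computation ARX performs internally.

```python
from collections import Counter

def prosecutor_risks(quasi_identifiers):
    """Per-record prosecutor re-identification risk: 1 / size of the
    record's equivalence class over the quasi-identifying attributes."""
    sizes = Counter(quasi_identifiers)
    return [1 / sizes[q] for q in quasi_identifiers]

# Hypothetical (age range, generalized ZIP) quasi-identifier tuples:
records = [("40-50", "047*"), ("40-50", "047*"), ("50-60", "051*")]
risks = prosecutor_risks(records)
print(max(risks))  # 1.0 -- the lone ("50-60", "051*") record is unique
```

A maximum risk of 1.0 flags a unique record; privacy models such as k-anonymity cap this value at 1/k by forcing every equivalence class to contain at least k records.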

Our experiments indicate similar utility in the anonymized data sets resulting from the three anonymization approaches, but the results are subject to potential weaknesses. These mainly stem from two factors: the test data consisted of synthetic data, and the chosen tool was restricted in the ways it could transform the original data set to achieve the goals of the various anonymization approaches. Nevertheless, the k-approach preserves utility slightly better for the medium and large data sets, while the δ- and β-approaches score slightly better for the small data set. Beyond that, a larger original data set ensures a better utility score.

Finally, to assist system administrators of DHIS2 in anonymization efforts, lessons learned through working on this project have been compiled as a set of recommendations for how to approach anonymization of health data, listed in Appendix B.

10.1 Future research

This thesis has focused on a narrow section of the health data concept, utilized a limited approach to anonymization and experimented using synthetic test data. Building on my findings, each of those factors could be explored further in future work. Expanding the types of health data tested on could further inform the applicability of the anonymization approaches in the field of health data. To directly improve upon my findings, however, testing on real rather than synthetic data would lend more legitimacy to the results, and utilizing purpose-built algorithms for the transformations related to the various anonymization approaches could have a significant impact on utility scores, perhaps better showcasing the strengths and weaknesses of the different privacy models compared to one another. Expanding testing to cover more privacy models could also lead to interesting results.

Finally, this thesis covered compliance with the GDPR; however, it did not establish a minimum requirement for such compliance. Studying the GDPR to determine a more concrete measure of when data is anonymous, rather than the criterion that an individual is not reasonably likely to be identified, could facilitate less strict transformations of a data set, likely increasing the utility of the resulting anonymized data set.


Appendices

Appendix A

Transformation hierarchies for quasi-identifiers

Figure A.1: Age hierarchy


Figure A.2: ZIP code hierarchy

Figure A.3: Health facility location hierarchy


Figure A.4: Time of encounter hierarchy

Appendix B

Recommendations for approaching anonymization of health data

The following are recommendations for approaching anonymization, aimed at data controllers intending to release data:

1. Purposes:

Before doing anything else, you should consider why you want to release data. Are there specific purposes you want the data to be used for, or do you simply think the data is useful and therefore want to publish it so that anyone who wants to may utilize it? Is it important that the released data be truthful to the original data, or does the data have properties which make it useful regardless of its truthfulness?

2. Release type:

Having considered how your data could be useful, decide upon a type of release that may facilitate those purposes. Does the data need to be in the form of records concerning individuals, or is statistical information about your data set sufficient to fulfill the purposes from the previous point? Do you have the intention and resources to follow up on your data post-release? If so, do you want to publish an interactive database instead?

3. Data:

The original data set contains some information or properties which you want to make available to more parties. What in this data set is important? Examine the different attributes and how their presence in the data set affects what the data may be used for. Consider whether any attributes in the data set could be excluded from the result without hampering its usefulness. If your data is about a patient's health, perhaps their hair color isn't important? The more data which is included in the data to be anonymized, the harsher the transformation of the data is likely to be. Limiting such data may be useful.

4. Legislation:

Make absolutely sure you understand the requirements of relevant legislation, such as the GDPR, and what consequences might follow should such requirements not be met. Different pieces of legislation