
7.1.3 Attribute utility

Having created hierarchies for the relevant quasi-identifiers, we need to decide the weight of each attribute with regard to its impact on the utility of the data set. To weight the different attributes, however, we must make some assumptions about the purposes of the data. This is not necessarily easy to do, but from the context in which the release happens and from what is being released, we can guess, if not the exact purposes for which the data will be used, then at least what makes the data interesting.

Our data set is specifically a data set of diagnoses made on patients at various health facilities, for a set of specific diseases. We therefore assume that the most interesting information in the data set concerns the diseases themselves and the patients who were diagnosed with them. The data set also includes information contextualizing each diagnosis, such as where it took place and which health worker was responsible for it. Information about health workers and the diagnoses they make might be useful for several purposes, including statistics on how many of the various diagnoses are made by different occupations in the health sector. It might also be interesting to know how the various hospitals differ in their rate of diagnosing the diseases. Another piece of information which might be useful is the rate of infection with the diseases, which would make the time of the diagnosis important, as well as the geographical location of either the hospital or the patient's residence; information about the patients themselves, such as age and sex, might also be helpful in that regard.

Figure 7.9: Attribute weights

Figure 7.10: Utility metric - information loss

We are going to use four different levels of weight for the attributes. The patient's specific diagnosis will not be transformed in any way, to keep the truthfulness of the data. Beyond that, information about the patient will be weighted the heaviest, including their age, sex and place of residence. The next level will be information about the diagnosis itself, including where it was made and when. Third will be information about the health worker, and fourth will be the facility name. The facility's location can largely be inferred from its name, and the location in turn gives some indication of which facility it is, which means the facility name attribute adds little information beyond what its location already provides. We weight the four levels 0.8, 0.6, 0.4 and 0.2 in the weight distribution in ARX, as demonstrated in Figure 7.9.
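The same weighting can also be expressed through ARX's Java API rather than the graphical interface. The snippet below is a minimal sketch under our assumptions: the column names are placeholders for the actual attributes in our data set, and setAttributeWeight is the configuration method we understand ARX to expose for this purpose.

```java
import org.deidentifier.arx.ARXConfiguration;

public class AttributeWeights {

    /**
     * Sketch of the four weight levels from Figure 7.9.
     * Column names are placeholders for the actual quasi-identifiers.
     */
    public static void applyWeights(ARXConfiguration config) {
        // Level 1 (0.8): information about the patient
        config.setAttributeWeight("age", 0.8d);
        config.setAttributeWeight("sex", 0.8d);
        config.setAttributeWeight("residence", 0.8d);

        // Level 2 (0.6): where and when the diagnosis was made
        config.setAttributeWeight("diagnosis_date", 0.6d);
        config.setAttributeWeight("facility_location", 0.6d);

        // Level 3 (0.4): the responsible health worker
        config.setAttributeWeight("health_worker", 0.4d);

        // Level 4 (0.2): the facility name
        config.setAttributeWeight("facility_name", 0.2d);
    }
}
```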

We also specify how the utility of the data set is to be measured, namely in terms of information loss. We measure the information loss of the various attributes using an arithmetic mean, ensuring that outliers in the attribute distribution are properly accounted for; see Figure 7.10.
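In the API, selecting the quality model could look roughly as follows. This is a sketch based on our assumptions: we assume the information-loss measure chosen in Figure 7.10 corresponds to ARX's "Loss" model, and that the metric factory accepts an aggregate function for combining the per-attribute values.

```java
import org.deidentifier.arx.ARXConfiguration;
import org.deidentifier.arx.metric.Metric;
import org.deidentifier.arx.metric.Metric.AggregateFunction;

public class UtilityMetric {

    /** Measures information loss per attribute and aggregates with an arithmetic mean. */
    public static void applyQualityModel(ARXConfiguration config) {
        config.setQualityModel(Metric.createLossMetric(AggregateFunction.ARITHMETIC_MEAN));
    }
}
```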

We can also specify that ARX may perform local suppression to reach the requirements specified in the privacy models. This enables ARX to delete outlier records in the result data set which prevent the data from conforming to, for example, various distribution requirements for the equivalence classes.

We decide to specify a 5% limit (see Figure 7.11) on the number of suppressed records in the output data set. This allows ARX to remove some outlier records which prevent otherwise high-utility solutions from fulfilling the privacy model requirements, while also ensuring that the anonymized data set remains largely representative of the original data.

Figure 7.11: Record suppression limit
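In API terms this is a single configuration call; the sketch below assumes the setSuppressionLimit method that we understand current ARX versions provide.

```java
import org.deidentifier.arx.ARXConfiguration;

public class SuppressionLimit {

    /** Allows ARX to suppress up to 5% of the records in the output data set. */
    public static void applySuppressionLimit(ARXConfiguration config) {
        config.setSuppressionLimit(0.05d);
    }
}
```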

7.1.4 Privacy models

The final step before applying the anonymization process is configuring the privacy models. We will be using three different combinations of privacy models: one featuring k-anonymity, l-diversity and t-closeness, one featuring δ-disclosure privacy, and finally one featuring β-likeness, more specifically enhanced β-likeness.

The main objective of these privacy models is to obtain an anonymized data set which sufficiently protects against information disclosure, but which is not distorted so much that the utility of the data set is unnecessarily reduced. We therefore choose the values for the different privacy model parameters with this goal in mind.

The configurations will be as seen in Table 7.1.

                 Approach 1     Approach 2             Approach 3
Privacy model    k-anonymity    δ-disclosure privacy   Enhanced β-likeness
                 l-diversity
                 t-closeness
Configuration    k: 10          δ: 2                   β: 2
                 l: 3
                 t: 0.2

Table 7.1: Configuration of anonymization approaches
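As a sketch of how these three approaches could be instantiated through ARX's Java API: the criterion class names below are the ones we believe ARX provides, the sensitive-attribute column name is passed as a placeholder parameter, and the choice of the distinct l-diversity and equal-distance t-closeness variants is our assumption, since the text does not specify which variants are used.

```java
import org.deidentifier.arx.ARXConfiguration;
import org.deidentifier.arx.criteria.DDisclosurePrivacy;
import org.deidentifier.arx.criteria.DistinctLDiversity;
import org.deidentifier.arx.criteria.EnhancedBLikeness;
import org.deidentifier.arx.criteria.EqualDistanceTCloseness;
import org.deidentifier.arx.criteria.KAnonymity;

public class PrivacyModels {

    /** Approach 1: k-anonymity (k=10), l-diversity (l=3) and t-closeness (t=0.2). */
    public static void approach1(ARXConfiguration config, String sensitiveAttribute) {
        config.addPrivacyModel(new KAnonymity(10));
        config.addPrivacyModel(new DistinctLDiversity(sensitiveAttribute, 3));
        config.addPrivacyModel(new EqualDistanceTCloseness(sensitiveAttribute, 0.2d));
    }

    /** Approach 2: δ-disclosure privacy (δ=2). */
    public static void approach2(ARXConfiguration config, String sensitiveAttribute) {
        config.addPrivacyModel(new DDisclosurePrivacy(sensitiveAttribute, 2d));
    }

    /** Approach 3: enhanced β-likeness (β=2). */
    public static void approach3(ARXConfiguration config, String sensitiveAttribute) {
        config.addPrivacyModel(new EnhancedBLikeness(sensitiveAttribute, 2d));
    }
}
```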

Figure 7.12: Executing the anonymization process

Figure 7.13: Waiting time for anonymization of large data set

A k-value of 10 should sufficiently protect against identity disclosure, giving an attacker only a 10% chance of correctly guessing an identity, given that the attacker already knows an individual is in the data set. Beyond protecting against identity disclosure, the l-value of 3 will ensure some variety in each equivalence class, making exact attribute disclosure impossible, and a t-value of 0.2 will ensure that even should an equivalence class be large, the amount of information disclosed will not be disproportionate to the information gained from knowing only the distribution over the entire data set.
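Concretely, the identity-disclosure bound behind this choice of k is simply the inverse of k: if every equivalence class contains at least k records, then

```latex
\Pr[\text{correct re-identification}] \;\le\; \frac{1}{k} \;=\; \frac{1}{10} \;=\; 10\,\%
```

assuming, as above, that the attacker already knows the target individual is present in the data set.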

We choose a δ-value of 2 with a similar strategy to our t-value, ensuring that the distributions of sensitive values in the equivalence classes of the anonymized data set stay close to the distribution in the data set as a whole. It is for this same reason that we choose a β-value of 2.
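For reference, the guarantees behind these two parameters can be made precise as follows. This is our paraphrase of the definitions from the literature, not the thesis's own notation: let p(s) denote the relative frequency of a sensitive value s in the whole data set and p(s|E) its relative frequency within an equivalence class E.

```latex
% delta-disclosure privacy: no sensitive value may be much more
% (or much less) frequent in a class than in the data set overall.
\left|\, \log \frac{p(s \mid E)}{p(s)} \,\right| \;<\; \delta
\quad \text{for every class } E \text{ and sensitive value } s.

% (basic) beta-likeness: the relative increase in frequency of any
% sensitive value within a class is bounded by beta.
\frac{p(s \mid E) - p(s)}{p(s)} \;\le\; \beta
\quad \text{whenever } p(s \mid E) > p(s).
```

Enhanced β-likeness, the variant used in Approach 3, is a stricter refinement of this latter requirement.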
