Anonymisation and pseudonymisation of textual documents

Candidate number: 6005

Submission deadline: 31.05.2021 Number of words: 15879

1 Introduction

The topics of Artificial Intelligence, Big Data and the protection of personal data have boomed over the last several years, mainly because of the pandemic and the increased use of digital tools. Questions of privacy and security, and the tools that can achieve them, are gaining more attention from both professionals and the general public. Among such tools are anonymisation and pseudonymisation. Discussions of them take place primarily within the expert community. They are not widely understood, which is apparent from common misunderstandings, for example, that appropriate anonymisation is easily achievable and resistant to attacks.

Anonymisation and pseudonymisation are used in relevant cases to protect data subjects' identities, lower the risk and damage from security breaches, and provide a higher level of data protection. They apply to many areas, from industry to governmental bodies and research projects.

There is an array of industry standards, guidelines and some legislative acts that address the use of such tools, but untangling them and applying them effectively still presents challenges.

The amount of personal data is growing. While techniques like anonymisation and pseudonymisation can be applied, certain fields, such as the legal field, add another level of complexity. Vast numbers of textual documents are generated every day, and many of them contain personal data: contracts of all kinds, court decisions, police reports, etc. Textual documents are harder to process with the same effectiveness as numerical data. Here, natural language processing (NLP), a field that combines linguistics and computer science, finds its application. NLP develops techniques that allow textual data to be processed and turned into numerical values in order to achieve various tasks, such as identifying personal data in a document. NLP also finds new utilisation in the anonymisation and pseudonymisation of legal (and other) textual documents, and can be a way to make document sharing more privacy- and security-friendly.

The particular interest and, at the same time, the challenge of this thesis lie in the attempt to bring together data protection and technology, in particular anonymisation and pseudonymisation, with the employment of NLP. Such a combination is rarely researched in depth, so this thesis aims to address this gap. The topic is relevant and interesting because the number of textual documents containing personal data is growing, and there is a need to process them in a secure and privacy-friendly way.

To ensure the achievement of this task, the scope of the work is relatively narrow, extracting only necessary concepts from different fields, such as the use of anonymisation and pseudonymisation in the data protection field and specific use of NLP techniques in order to achieve it.

The interdisciplinary approach brings such challenges as the difference between the notion of personal data in law and in computer science (the definition of personal data is discussed in the legal framework section). There is also confusion about anonymisation and pseudonymisation. The latter is not part of the former, and these processes employ different techniques and approaches (this and other misunderstandings are analysed in more detail in the relevant sections).

NLP is of particular interest because it works with textual data, and many fields generate a lot of textual documents, which contain personal data (legal sector, health sector and others).

However, to extract benefits from these possibilities, legal professionals have to know more about the technical side of things, and technical experts need to learn more about legal requirements and needs. Thus, to explore and illustrate this dimension and bridge the gap between the legal and the technical, the following research question was developed.

1.1 Research question

My research question is: What are the GDPR requirements for data anonymisation and pseudonymisation, and how can existing NLP techniques be applied to legal documents to comply with the requirements of the GDPR? Essentially, the thesis analyses whether NLP is a good tool to deal with textual data from legal and practical points of view to make it more secure and privacy-friendly.

My specific target includes:

1) the legal requirements for anonymisation and pseudonymisation of personal data;

2) the main challenges (high costs, heavy demands on time and resources, unavailability for many private and public actors, etc.);

3) modern technical approaches (with a focus on NLP) and the ways they can benefit public and private actors.

The methodology used:

- identify relevant sources, which address the processing of personal data and specifically its further anonymisation or pseudonymisation, establish where such techniques are appropriate and which legal acts give additional incentives to do it;

- examine such relevant sources and their provisions regarding anonymisation and pseudonymisation, and the legal requirements for such processing;

- analyse the issues arising from anonymisation and pseudonymisation of textual data and whether the tools offered by NLP can resolve them in both a legal and a practical way;

- review the overall application of NLP services in the legal field (LegalTech) and review concrete cases of NLP application to anonymise and pseudonymise data to analyse whether the NLP solution is applicable to real-world situations;

- put the topic of anonymisation and pseudonymisation in a wider context, such as approaches to privacy, how these techniques can help to achieve safety and security of data, and the reasons for anonymisation and pseudonymisation;

- give the final overview of the legal sources, applications of anonymisation and pseudonymisation, and whether NLP is a successful tool for anonymisation and pseudonymisation of textual documents.

1.2 Structure of the thesis

To answer my research question according to the methodology established, I will start with the legal analysis. First follows the overview of data protection law, mainly the GDPR. It will put the research question in the context of the overall discussion of privacy and data protection. The section includes an analysis of the relevant provisions of the Open Data Directive regarding the sharing of data by the public sector, and of the proposals for the ePrivacy Regulation and the AI Act, to put the research question within the legal framework.

The next part starts with a brief overview of anonymisation and pseudonymisation, the reasons to undergo these processes, the differences between them and the challenges arising from them. As the challenges of anonymisation and pseudonymisation in general differ from those of applying them to textual data, they will be analysed separately.

The section will continue with a brief introduction to NLP itself, an overview of the use of NLP in the LegalTech field, the opportunities it presents when working with textual documents, and the anonymisation and pseudonymisation of personal data in textual documents.

The section is concluded with the research of relevant examples. It serves as an illustration of concrete applications and indicates the future of NLP employment for anonymisation and pseudonymisation of textual documents.

The next section puts the overall issue into a global perspective to see how different the perception of privacy is and how the failure to ensure data security can affect businesses and individuals. It sums up the reasons for anonymisation and pseudonymisation.

The last section is dedicated to the final overview, which includes summaries of the legal analysis, the technical sides of anonymisation, pseudonymisation and NLP, and how they can be used to solve, legally and practically, the problem of processing textual documents containing personal data in a secure and privacy-friendly way. It is followed by a conclusion.

2 Legal Framework

This section will give an overview of the legal sources, starting from the GDPR. Other legal acts that are important for understanding the context include the European Convention on Human Rights, the Charter of Fundamental Rights of the European Union, the California Consumer Privacy Act of 2018, etc. Guidelines are also used for a better understanding of the data protection field, such as the OECD Guidelines Governing the Protection of Privacy and Transborder Flows of Personal Data. They will be discussed in the section about privacy, so they are not included in this section.

As technologies constantly develop and the legislative process lags behind, it seems relevant to review not only the already enacted legal instruments (the GDPR in the section on privacy and the Open Data Directive further in this chapter) but also new proposals, such as the Regulation on Privacy and Electronic Communications and the Regulation laying down harmonised rules on Artificial Intelligence, as they also underline the careful handling of personal data and the use of anonymisation and pseudonymisation to ensure more privacy-friendly and secure data processing. This will help to put things in perspective, underline the importance of the research in this sphere and highlight the multiple purposes that anonymisation may have.

2.1 Selected provisions of the GDPR

While the provisions of the GDPR that explicitly address anonymisation and pseudonymisation are reviewed in the relevant chapter, it is still vital to give a brief overview of the Regulation. This section does not aim to explain the GDPR concepts exhaustively, but instead to place anonymisation and pseudonymisation in the broader data protection perspective with references to the relevant legal provisions.

In order to anonymise or pseudonymise textual data, one needs to establish what personal data is. The GDPR defines personal data as "any information relating to an identified or identifiable natural person,"¹ which consists of four elements: "any information", "relating to", "identified or identifiable", and "natural person", or "data subject."

Let's take a closer look at these elements. The first one, "any information", has a wording that indicates a broad scope covering an indefinite number of possible cases, with both objective and subjective information and not only private or sensitive information, as confirmed by the Court of Justice of the European Union (CJEU) in the Nowak case.² The second element, "relating to", indicates the linkage of the information to a specific person³ and is also to be understood broadly. The next element is "identified or identifiable", which covers the possibility of identification and not just actual identification by the controller or another person. Within this element, the possible means of identification should be considered, including costs, time and technological development.⁴ The last element, "natural person", indicates that the GDPR concerns only human beings and excludes information about legal persons.

1 GDPR Article 4(1).

2 C-434/16 Peter Nowak v Data Protection Commissioner, sec. 34

3 C-434/16 Peter Nowak v Data Protection Commissioner, sec. 35

Another point to consider when talking about personal data is sensitive data, or the special categories of personal data, which are "particularly sensitive in relation to fundamental rights and freedoms."⁵ Such categories are listed separately and include: "personal data revealing racial or ethnic origin, political opinions, religious or philosophical beliefs, or trade union membership, and the processing of genetic data, biometric data for the purpose of uniquely identifying a natural person, data concerning health or data concerning a natural person's sex life or sexual orientation."⁶

However, there is also another definition of personal data used by businesses, specifically in the USA: personally identifiable information. As this definition is used within the tech industry and by computer science experts, including when conducting anonymisation or pseudonymisation, it is beneficial to discuss it as well. According to the Guide to Protecting the Confidentiality of Personally Identifiable Information prepared by the National Institute of Standards and Technology (NIST) (USA), it is "Any information about an individual maintained by an agency, including (1) any information that can be used to distinguish or trace an individual's identity, such as name, social security number, date and place of birth, mother's maiden name, or biometric records; and (2) any other information that is linked or linkable to an individual, such as medical, educational, financial, and employment information."⁷ This term is also used in ISO standards, which are popular within the industry. In comparison, the EU takes "an expansionist approach to personally identifiable information".⁸ The "identifiable" part of the personal data definition is important here and makes the GDPR definition more comprehensive, but as some of the research discussed further focuses on personally identifiable information, its definition should also be highlighted.

After considering what personal data is, it is useful to see where anonymised and pseudonymised data stand, because this is where the main distinction between the two processes lies. In Recital 26, the GDPR establishes that it still applies to pseudonymised data, while the processing of personal data rendered anonymous is outside of its concern.⁹ This will be discussed in more detail in the related chapters, but this clear difference between anonymised and pseudonymised data is important throughout the thesis.

4 GDPR Recital 26.

5 GDPR Recital 51.

6 GDPR Art. 9.

7 NIST

8 Schwartz (2011), p. 1873.

9 GDPR Recital 26.

Another crucial definition is that of processing, meaning "any operation or set of operations which is performed on personal data or on sets of personal data, whether or not by automated means".¹⁰ Thus anonymisation and pseudonymisation are also processing. It is also stipulated that "any processing of personal data should be lawful and fair",¹¹ including anonymisation and pseudonymisation. So, after the anonymisation process, the data is no longer personal and falls outside the scope of the GDPR, but the anonymisation process itself should follow all the data protection principles.

Requirements for processing of personal data are based on principles such as (a) lawfulness, fairness and transparency, (b) purpose limitation, (c) data minimisation, (d) accuracy, (e) storage limitation, (f) integrity and confidentiality and (g) accountability.¹² They were introduced in the OECD Guidelines in 1980¹³ and are still relevant and applicable, even for relatively new processes such as anonymisation and pseudonymisation, and thus should be followed.

"Every use of personal data is a potential interference with, or limitation of the right to, data protection."¹⁴ The list of legal grounds for processing is exhaustive; it is stipulated in a separate article and consists of:

a) consent, which has to be given freely, be specific, informed and unambiguous;¹⁵

b) fulfilment of the contract, which establishes that as long as the data controller needs to process the data subject's personal data to fulfil the contract (as data controller has the legal obligation to), such processing is lawful;¹⁶

c) fulfilment of a legal obligation, which deals with legal obligations under EU or Member State law that make the processing of personal data necessary (for example, obligations of banks to process personal data under money-laundering laws; the processing of employee personal data by the employer for social insurance purposes);¹⁷

d) protection of vital interests, with "vital" being a word that sets a high threshold for applying this norm. Examples include: "when processing is necessary for humanitarian purposes, including for monitoring epidemics and their spread or in situations of humanitarian emergencies, in particular in situations of natural and man-made disasters";¹⁸

11 GDPR Recital (39).

12 GDPR Article 5.

13 OECD Guidelines 2013.

14 Kotschy (2020), p.329.

16 Kotschy (2020), p.330.

17 Kotschy (2020), p.332.

e) performance of a task carried out in the public interest, which does not necessarily cover determined obligations for the controller, but acts more like a general authorisation to act as necessary to fulfil the task;¹⁹

f) prevailing legitimate interests,²⁰ which refers solely to the interests of data controllers from the private sector and covers an interest that "is visibly, although not necessarily explicitly, recognised by law, more precisely by Union or Member State law". A mere commercial interest is not enough to establish such an interest.²¹

For the processing to be lawful (and lawfulness is one of the principles of data protection law), it is a prerequisite that it is based on the data subject's consent or on one of the other legitimate bases named above.²² As mentioned above, anonymisation and pseudonymisation are considered processing and thus must have a lawful ground, with each case assessed individually.

The GDPR defines two main actors in personal data processing: data controllers and data processors. The data controller is "the natural or legal person, public authority, agency or other body which, alone or jointly with others, determines the purposes and means of the processing of personal data…", while a data processor is "a natural or legal person, public authority, agency or other body which processes personal data on behalf of the controller".²³ These entities, among other obligations, must ensure that the process of anonymisation and pseudonymisation is done properly, in a privacy and security-friendly way, with the protection of data and interests of data subjects.

The influence and importance of the GDPR also come from its territorial scope. The GDPR establishes that it applies to any entity (regardless of whether it is located within or outside of the EU), which targets data subjects in the EU and processes their personal data.²⁴ Thus, it creates obligations in the sphere of data protection for data controllers and processors across the world. It can also be an incentive for the anonymisation of personal data, as the GDPR does not apply to it.

The GDPR establishes rights and obligations for data controllers, data processors and data subjects. It is one of the groundbreaking developments in data protection law. While it does not explicitly name all the technical measures that can be used to ensure privacy and data protection, it gives some examples and stays open to new developments. The processes of anonymisation and pseudonymisation are among such technological measures. Combined with NLP, these technologies can be used to ensure the protection of personal data when processing legal textual documents.

18 GDPR Recital 46.

19 Kotschy (2020), p.336.

21 Kotschy (2020), p.337.

23 GDPR Article 4(7), (8).

24 GDPR Article 3.

2.2 Open Data Directive

Full name: Directive (EU) 2019/1024 of the European Parliament and of the Council of the 20th of June 2019 on open data and the re-use of public sector information.²⁵

The public sector in the EU also recognises the importance and benefits of data sharing, while balancing it with the right to the protection of personal data. The public sector collects, produces, reproduces and disseminates²⁶ a great amount of information in various sectors (basically, any area it operates in) and thus represents an exceptional source of data. The data can, in turn, be used to the benefit of the internal market and the development of new products and applications for both consumers and legal entities.²⁷

However, the possible benefits do not override personal data protection. So, for public bodies to fulfil this commitment, they need to ensure that published data does not include personal or sensitive data or data which can help to identify individuals.²⁸ This means that purpose limitation should be followed (GDPR Art. 5 and Art. 6) or the documents should be anonymised: "Rendering information anonymous is a means of reconciling the interests in making public sector information as reusable as possible with the obligations under data protection law…".²⁹ As the process of anonymisation is costly, these expenses are included in the marginal cost of reusable data.³⁰ The Directive does not mention pseudonymisation as an alternative.

Data sharing by the public sector is an ongoing process, and many documents and data sets are already available on the official portal for European data.³¹ When searching by country,³² one can find datasets from countries outside of the EU/EEA area, such as Ukraine, Serbia and some others, which joined the efforts to provide reusable information from the public sector.

25 Directive 2019/1024.

26 Open Data Directive Recital 8.

27 Ibid Recital 9.

28 Henriksen-Bulmer (2019), p. 271.

29 Open Data Directive Recital 52.

30 Open Data Directive Recital 36, 38, 52.

31 The official portal for European data.

32 The official portal for European data; by country.

There is a proposal which complements the Open Data Directive: the Proposal for a Regulation of the European Parliament and of the Council on European data governance (Data Governance Act).³³ This proposal aims to facilitate data sharing and build trust in data sharing intermediaries. In Recital 6, it names anonymisation and pseudonymisation as techniques which enable privacy-friendly analyses of datasets.

This Directive and the proposal that complements it are part of a bigger picture, the European data strategy, which aims to create a single market for data and to give businesses in the EU the opportunity to use it for their development, making the EU a leader in a data-driven society. The goal is to establish rules and efficient enforcement mechanisms which will ensure the flow of data within the EU, while other European rules and values (personal data protection, competition law, consumer protection, intellectual property) are fully respected.

2.3 ePrivacy Regulation (proposal)

Full name: Proposal for a Regulation of the European Parliament and of the Council concerning the respect for private life and the protection of personal data in electronic communications and repealing Directive 2002/58/EC (Regulation on Privacy and Electronic Communications).³⁴ To additionally emphasise the importance of pseudonymisation and anonymisation, I would like to review some points from the latest proposal (made publicly available on the 10th of February 2021) regarding the protection of personal data in electronic communication, in the sections relevant to the topic of this thesis.

Electronic communication is an important part of everyday life, as people use various messaging services, for example, Gmail, to send legal documents.³⁵ As the current ePrivacy Directive covers only traditional telecom providers, there is a need for new rules. Thus, the reasons for the adoption of a new ePrivacy Regulation include keeping up with technological development and aligning the rules on electronic communication with the privacy rules and the GDPR.

One of the central concepts here is metadata, that is, information about data, which includes information about numbers called and geographical location during these calls, websites visited, and the time, date and duration of these activities.³⁶ Metadata is important for communication because, without it, messages or calls would not be able to reach their destination. While seemingly general, this kind of data can lead to conclusions being drawn about private lives, including medical conditions, sexual preferences and political views,³⁷ which are considered to be special categories of data. There are many possible scenarios in which metadata can shed light on personal affairs; for example, a person receives a call from a phone number attached to the headquarters of a big company and later that day visits a governmental website which has an application for unemployment benefits. It is not hard to come to a conclusion about what happened in such a scenario.

33 Proposal for Data Governance Act

34 ePrivacy proposal.

35 European Commission: Infographic.

36 Regulation on Privacy and Electronic Communications proposal Recital (2).

Why is this proposal relevant to the topic of this thesis? Firstly, besides the sender and recipient, email headers also include textual information such as the subject line, where we usually summarise the content of the email (like "working contract Parker", "tax report from Connor", etc.). While the subject line does not include the document itself, it may raise the security risk by attracting the attention of third parties (hackers, etc.). NLP, and specifically Named Entity Recognition, discussed in the NLP section, can be used to work with this kind of information, for example, to turn "working contract Parker" into "working contract AB" or "file type A for AB", and help to make it more secure.
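To make this idea more concrete, the following is a minimal Python sketch of pseudonymising a subject line. It is an illustration under stated assumptions, not an implementation taken from the proposal or from any particular NLP library: a small hand-written set of surnames (KNOWN_SURNAMES, hypothetical) stands in for the person names that a trained Named Entity Recognition model would detect, and the placeholder format PERSON_n is likewise an assumption.

```python
import re

# Hypothetical stand-in for the PERSON entities a trained NER model would find.
KNOWN_SURNAMES = {"Parker", "Connor"}


def pseudonymise_subject(subject, mapping):
    """Replace each known surname with a stable placeholder (PERSON_1, ...).

    `mapping` stores surname -> placeholder so that the same person always
    receives the same placeholder across subject lines.
    """
    def replace(match):
        word = match.group(0)
        if word not in KNOWN_SURNAMES:
            return word  # capitalised, but not a known name: leave unchanged
        if word not in mapping:
            mapping[word] = f"PERSON_{len(mapping) + 1}"
        return mapping[word]

    # Candidate tokens: capitalised words (a crude proxy for name detection).
    return re.sub(r"\b[A-Z][a-z]+\b", replace, subject)


mapping = {}
print(pseudonymise_subject("working contract Parker", mapping))
print(pseudonymise_subject("tax report from Connor", mapping))
```

Note that the mapping dictionary is exactly the kind of additional information that allows re-identification, so the output of this sketch is pseudonymised rather than anonymised.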

Secondly, for the purposes of the thesis, it is interesting to see the remedies proposed in this version of the ePrivacy Regulation, in particular, such security measures as encryption and pseudonymisation.³⁸ Data that is no longer used for the purposes it was originally collected for should be anonymised (if shared with third parties)³⁹ or erased.⁴⁰ The usefulness of anonymised data is lower in this case because anonymised metadata loses its original value.⁴¹ This is another example of how applying anonymisation and pseudonymisation techniques can make the process of data sharing (in this case, communication data, which may also include textual data) more privacy- and security-friendly. It also underlines the cross-disciplinary approach and the implementation of the same tools across different fields.

2.4 AI Act (proposal)

Full name: Proposal for a regulation of the European Parliament and of the Council laying Down Harmonised Rules on Artificial Intelligence (Artificial Intelligence Act) and Amending Certain Union Legislative Acts.

The long-awaited Proposal for a Regulation laying down harmonised rules on artificial intelligence (Artificial Intelligence Act),⁴² among other topics, addresses the processing of personal data in the sphere of AI and the use of state-of-the-art security and privacy-preserving measures, for example, pseudonymisation (or encryption), in cases when anonymisation may have a strong effect on the purposes pursued.⁴³ Every case and task is different, and there is no one-size-fits-all decision; anonymisation and pseudonymisation provide different trade-offs, which will have to be weighed individually. In any case, these measures will help to ensure security and preserve privacy.

37 Ibid.

38 Ibid Recital 17(b).

39 Ibid Recital 17(aa).

40 Ibid Recital 25.

41 Ibid Recital 17.

42 EU Commission

It also mentions anonymisation and pseudonymisation of "judicial decisions, documents or data, communication between personnel, administrative tasks or allocation of resources"⁴⁴ and treats AI systems developed for such purposes as not high-risk (in comparison to the other AI systems used for the administration of justice and democratic processes).⁴⁵ This underlines the importance of such AI systems, which will be able to automate the anonymisation and pseudonymisation of judicial decisions and free up the resources (both monetary and manual) currently allocated to this kind of task.

So, several legislative acts include anonymisation and pseudonymisation as possible ways to protect personal data and to make its processing more privacy- and security-friendly.

43 AI Act Proposal Art. 10.5.

44 AI Act Proposal Recital 40.

45 Ibid.

3 An introduction to anonymisation and pseudonymisation

This section is dedicated to the overview of anonymisation and pseudonymisation, their use and development, which is followed by the challenges, both general and connected with the processing of textual data.

3.1 What is anonymisation and pseudonymisation?

This chapter aims to review the general concepts, practices and challenges in the sphere of data anonymisation and pseudonymisation. Understanding anonymisation and pseudonymisation is important for the purposes of this thesis; thus, the work is built on diverse and relevant sources, such as the WP29 papers (on anonymisation techniques and on the concept of personal data), reports and guidelines from ENISA concerning pseudonymisation, codes of practice (the ICO Code of Practice on anonymisation), guidelines on de-identification (from the OECD and the Spanish Data Protection Agency), research papers, etc. It starts with a description of anonymisation and pseudonymisation and the reasons to perform these processes on personal data. For a better understanding of the topic, it continues with the challenges present in general and when working specifically with textual data which contains personal information. It is followed by an overview of the use of NLP in the legal field in general, NLP technical implementation and concrete cases where NLP is used to anonymise or pseudonymise data in legal textual documents.

Recently, the Spanish Data Protection Agency published a Guide, "10 Misunderstandings Related to Anonymisation", which sums up the most common misconceptions in this sphere. One point it clarifies is that pseudonymisation is not the same as anonymisation: with the use of additional information, the former can lead to the data subject's identification, while the latter renders it impossible.⁴⁶ However, pseudonymisation often relies on techniques from anonymisation to increase efficiency.⁴⁷
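The role of "additional information" can be illustrated with a short Python sketch. It assumes keyed hashing (HMAC-SHA256 from the Python standard library) as the pseudonymisation technique, which is only one of several possible choices and is not prescribed by the sources above; the function names, the example names and the key are hypothetical.

```python
import hashlib
import hmac


def pseudonymise(identifier: str, secret_key: bytes) -> str:
    """Derive a stable pseudonym from an identifier using a secret key."""
    digest = hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:12]  # shortened only for readability


def reidentify(pseudonym: str, candidates, secret_key: bytes):
    """Whoever holds the key can re-link a pseudonym to a known candidate."""
    for candidate in candidates:
        if pseudonymise(candidate, secret_key) == pseudonym:
            return candidate
    return None


key = b"hypothetical-secret-key"  # the 'additional information', kept separately
token = pseudonymise("Jane Parker", key)
print(token)
print(reidentify(token, ["John Connor", "Jane Parker"], key))
```

As long as the key exists, the link to the data subject can be restored, which is why data treated this way is pseudonymised and stays within the scope of data protection law.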

Anonymisation was perceived as a panacea in the age of Big Data, as it promised the best of both worlds by opening "the data floodgates while ensuring that no one was unexpectedly swept up or away by the deluge."⁴⁸ This is possible because anonymised data is no longer within the data protection law domain, as was highlighted in the section about selected provisions of the GDPR.

On the other hand, there are challenges to fulfil the criteria needed for the data to be recognised as anonymous, and there are discussions whether it is possible at all, thus preventing anonymisation from becoming a solution to all privacy problems.

46 The Spanish Data Protection Agency

47 ENISA 2018, p.14.

48 Barocas (2014), p. 45.

Pseudonymisation, while keeping the data within the data protection law domain, may be preferable in cases when it is not possible to anonymise data and, at the same time, keep the information needed for the purposes of processing.⁴⁹ This is the case, for example, when the data processing purposes concern "processing for archiving purposes in the public interest, scientific or historical research purposes or statistical purposes."⁵⁰

The anonymisation process can provide privacy guarantees when done with appropriate measures and fulfilling the criteria, while pseudonymisation is a useful security measure;⁵¹ thus, the areas of application may differ. For both processes, there is a variety of possible techniques, and there is no "one-size-fits-all" solution, so each situation and scenario will have to be evaluated individually. For example, one business cannot just copy the anonymisation process of another company and expect the same result, as it should be tailored to the scope, context, nature and, of course, purpose.⁵²

3.2 Anonymisation

Anonymous information, according to the GDPR, is "information which does not relate to an identified or identifiable natural person or to personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable."⁵³ Once the subject is no longer identified or identifiable, the GDPR does not apply to such information. Thus, one of the perks of working with anonymised data is that there is no need to follow data protection principles and requirements. On the other hand, this type of information may not be very useful for some kinds of tasks (for example, the average age of burglars cannot be analysed if even the age group has been removed by anonymisation).

There are also international standards used within the industry, for example, ISO 29100:2011.

According to the standard, data anonymisation is a "Process by which personally identifiable information is irreversibly altered in such a way that a personally identifiable information principal can no longer be identified directly or indirectly, either by the personally identifiable information controller alone or in collaboration with any other party."⁵⁴ Here, "personally identifiable information" is used instead of "personal data" as in the GDPR. The main criteria that can be picked out from this definition are that identifiable information should be irreversibly altered in a way that the person can no longer be directly or indirectly identified. Irreversibility is another difference between anonymisation and pseudonymisation, as the latter can

49 Bolognini (2017), p. 177.

50 GDPR Art. 89

51 WP 29 2014, p.3.

52 The Spanish Data Protection Agency, p. 7.

53 GDPR Recital 26.

54 ISO standard (ISO 29100:2011).

be reversed. Anonymisation, on the other hand, cannot be reversed: it is a permanent action with a strong protection threshold (identification is no longer possible, or is possible only with strong attacker efforts).⁵⁵

The process of anonymisation is a processing of personal data and has to fulfil the GDPR requirements, such as "the specific purposes for which personal data are processed should be explicit and legitimate and determined at the time of the collection of the personal data."⁵⁶ So, the process of anonymisation itself has to be lawful.

There are risk factors associated with identification: singling out, inference and linkability. In the analysis of some anonymisation techniques, these are also used as criteria.⁵⁷ Singling out corresponds to the possibility of isolating some records, which helps to identify an individual in the dataset. Inference corresponds to the possibility of deriving the value of an attribute (a characteristic such as an IP address or numerical ID) from the values of a set of other attributes. Linkability is the possibility of linking two or more records about the same data subject. Even where information can only be linked between two records at the level of a group of individuals, so that the individuals are protected against singling out, linkability is still an issue.⁵⁸ These criteria can be used to assess the robustness of anonymisation techniques. There is a variety of such techniques, and some of them are described by the Article 29 WP, such as randomisation (noise addition and differential privacy, which aim to weaken the link between a person and the data) and generalisation (aggregation, k-anonymity, l-diversity and t-closeness, which aim to dilute the features of data subjects by changing the magnitude, like using a month instead of a day).⁵⁹ None of these techniques guarantees the absence of risk of singling out, inference or linkability, but they can reduce the risk of re-identification.⁶⁰ The techniques are chosen depending on the task and type of data; thus, when anonymisation of textual data is needed, it is relevant to use approaches from the field specifically concerned with computer processing of such data: natural language processing (NLP).
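To make the idea of generalisation more concrete, the following is a minimal sketch in Python, not a method taken from the cited sources. The toy records, the choice of date of birth and postcode as quasi-identifiers, and the function names `generalise` and `k_anonymity` are all assumptions made for illustration. It coarsens the date to a month and the postcode to its first two digits, and then measures the size of the smallest group of records that share the same values.

```python
from collections import Counter

# Hypothetical toy records: (date of birth, postcode, diagnosis).
records = [
    ("1980-03-14", "0150", "flu"),
    ("1980-03-29", "0151", "asthma"),
    ("1975-11-02", "0450", "flu"),
    ("1975-11-20", "0457", "diabetes"),
]


def generalise(record):
    """Coarsen the quasi-identifiers: keep year and month, and two postcode digits."""
    birth, postcode, diagnosis = record
    return (birth[:7], postcode[:2] + "**", diagnosis)


def k_anonymity(rows, quasi_identifier_count=2):
    """Return the size of the smallest group sharing the same quasi-identifiers."""
    groups = Counter(row[:quasi_identifier_count] for row in rows)
    return min(groups.values())


generalised = [generalise(r) for r in records]
k_before = k_anonymity(records)      # every record is unique
k_after = k_anonymity(generalised)   # every record shares its values with another
```

In this toy data set, each original record can be singled out, while after generalisation every record shares its quasi-identifiers with one other record. This only reduces the risk: inference from the remaining attributes, or linkage with other data sets, is not addressed by the sketch.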

Some of the common misunderstandings about anonymisation are that it is always possible or that it is forever. Anonymisation is about balancing the risk of re-identification and the utility of the data set. There are possible scenarios when it is not possible, like when the number of data subjects included is small⁶¹ (for example, containing information about the current

55 Stam (2020), p.3.

56 GDPR Recital 39.

57 WP29 2014, p. 3.

58 WP29 2014, p. 11-12.

59 WP29 2014.

60 WP29 2014, p.27.

members of The Norwegian Parliament with only 169 representatives).⁶² In addition, anonymisation may be reversed due to the technological development or availability of new information.⁶³

Like many other things in the data protection field, the anonymisation process is not a one-time exercise or a routine established once and forever. Instead, it needs to be periodically reassessed, tested and evaluated, taking into account technological and other developments.⁶⁴ The Article 29 WP also addresses this and adds it to the responsibilities of data controllers: "… data controllers should consider that an anonymised dataset can still present residual risks to data subjects. Indeed, on the one hand, anonymisation and re-identification are active fields of research. New discoveries are regularly published. On the other hand, even anonymised data, like statistics, may be used to enrich existing profiles of individuals creating new data protection issues. Thus, anonymisation should not be regarded as a one-off exercise, and the attending risks should be reassessed regularly by data controllers."⁶⁵

The use of the NLP and NER in anonymisation will be analysed in the NLP chapter.

3.3 Pseudonymisation

Pseudonymisation is discussed in WP29 2007 and defined as "the process of disguising identities"⁶⁶ by replacing a personal identifier (e.g., name, social security number, date of birth, etc.) in a dataset with another attribute (e.g., a randomly assigned code). In WP29 2014, the definition stays similar, establishing that pseudonymisation "consists of replacing one attribute (typically a unique attribute) in a record by another."⁶⁷ In 2017, data pseudonymisation was defined in an ISO standard as a "particular type of deidentification that both removes the association with a data subject and adds an association between a particular set of characteristics relating to the data subject and one or more pseudonyms", with deidentification meaning a "general term for any process of reducing the association between a set of identifying data and the data subject."⁶⁸ Here another term, "deidentification", appears, which may bring additional confusion, especially because it is sometimes used interchangeably with "anonymisation" and "pseudonymisation" or to sum up both of them.

62 Stortinget.

64 GDPR Article 32(1)(d).

65 WP29 2014, p. 4.

66 WP29 2007, p. 18.

67 WP29 2014, p.20.

68 ISO/TS 25237:2017.

Later, the GDPR established the legal definition of pseudonymisation as "the processing of personal data in such a manner that the personal data can no longer be attributed to a specific data subject without the use of additional information, provided that such additional information is kept separately and is subject to technical and organisational measures to ensure that the personal data are not attributed to an identified or identifiable natural person."⁶⁹

It sets up two cumulative conditions that need to be fulfilled for the data to be recognised as pseudonymised. The first is that personal data is processed so that it is no longer linkable to the data subject without additional information (usually by replacing a personal identifier, as shown above). The second is that this additional information is kept separately, with all the appropriate technical and organisational measures.⁷⁰ The separation should be possible "within the same controller."⁷¹ While the definition of pseudonymisation in the GDPR is similar to the technical ones, it establishes a stricter framework, as it covers not only the protection of the person's identity but also indirect identifiers, which relate to the data subject.⁷²
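The two conditions can be illustrated with a short Python sketch. It is an assumption-laden illustration rather than a compliant implementation: the function names `pseudonymise` and `reidentify`, the `P-` prefix and the record layout are hypothetical. The sketch replaces one identifying attribute with a randomly assigned code and returns the key table as a separate object, which stands in for the "additional information" that has to be stored apart from the data.

```python
import secrets


def pseudonymise(records, identifier_field):
    """Replace one identifying attribute with a randomly assigned code.

    Returns the pseudonymised records and the key table. The key table is
    the 'additional information' and is meant to be stored separately.
    """
    key_table = {}  # original identifier -> pseudonym
    output = []
    for record in records:
        original = record[identifier_field]
        if original not in key_table:
            code = "P-" + secrets.token_hex(4)
            while code in key_table.values():  # avoid an (unlikely) collision
                code = "P-" + secrets.token_hex(4)
            key_table[original] = code
        # Build a new dict so that the input records are left untouched.
        output.append({**record, identifier_field: key_table[original]})
    return output, key_table


def reidentify(pseudonym, key_table):
    """Reverse lookup, possible only for whoever holds the key table."""
    for original, code in key_table.items():
        if code == pseudonym:
            return original
    return None


cases = [
    {"name": "Kari Nordmann", "case": "A"},
    {"name": "Ola Nordmann", "case": "B"},
    {"name": "Kari Nordmann", "case": "C"},
]
pseudonymised_cases, keys = pseudonymise(cases, "name")
```

The same person receives the same code in every record, so the records remain linkable for analysis, and anyone holding `keys` can reverse the process. This is why such data remains personal data.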

The process of pseudonymisation has been known in the information security field for a while (and, as shown above, it was reviewed by the WP29 several times). It can be a useful security measure;⁷³ it has also appeared in various guidelines (such as Convention 108, OECD Privacy Guidelines, APEC Privacy Framework) and in the national legislation of Germany and Austria.⁷⁴ In EU legislative acts, the term was first introduced in the GDPR. The GDPR was also the one to resolve the discussion of whether pseudonymised data is still personal data, establishing that yes, after pseudonymisation, the data stays personal data (at least, in Europe).

One of the outcomes and benefits of pseudonymisation is hiding the data subject's identity from any third party while performing various data processing operations.⁷⁵ Applying pseudonymisation techniques can help to reduce the risks and to comply with obligations in the sphere of data protection.⁷⁶ However, the data still remains personal data, which is a significant distinction between pseudonymised and anonymised data.

There is also another layer to the use of the pseudonymisation process. Besides helping to reach the appropriate level of security,⁷⁷ it may also contribute to fulfilling the general

72 ENISA 2018, p. 10-11.

73 WP 29 2014, p.20.

74 Tosoni (2020), pp.134-135.

75 ENISA 2019, p. 1.

77 GDPR Article 32(1)(a).

duty of "data protection by design"⁷⁸ and data minimisation safeguards related to the processing for archiving purposes, scientific or historical research purposes or statistical purposes.⁷⁹ Thus, pseudonymisation appears in many GDPR articles and seems to be generally encouraged. However, even among researchers, there may be confusion between anonymised and pseudonymised data. Many believe that the data is anonymous even if they still hold the keys, have previous versions or related information (like contacts) stored, allowing them to re-identify individuals from their research.⁸⁰

Additionally, there is no express authorisation for its adoption in the national legislation of the Member States.⁸¹ In the end, pseudonymisation is one of many ways to ensure data protection, encouraged but not obligatory. Countries may choose whether they want to adopt a legal requirement for it. Businesses or academics can also choose whether they want to employ pseudonymisation and whether it is relevant to their case.

Thus, while the data stays personal after the pseudonymisation, it provides more security, assisting in implementing data protection measures.

3.4 Challenges with anonymisation (general)

As already mentioned, personal data include ANY information relating to an identified or identifiable natural person,⁸² and it brings a number of challenges when anonymising the data.

First is listing all the possible informational connections that both authorised and unauthorised subjects can use to identify a person.⁸³ It can be an exercise where a person tasked with the anonymisation of the personal data reviews all the possibilities to identify the person using the available data. The question is: will this piece of data lead to the natural person? For example, the date of birth, will it still lead to identification once reduced only to the year, or should an age group be used to ensure anonymisation?

The second is the analysis of domain-specific information.⁸⁴ An illustrative example can be health data: while some diseases are quite frequent, others are rare, and this alone (or combined with other data) can lead to identification. This challenge is also present in legal documents, as is evident from the anonymisation of public sector documents in Finland. The authors expressly point out that different public bodies have different needs and different

80 Stam, p.4.

81 Tosoni (2020), p. 126.

82 GDPR Art. 4.

83 Martinelli (2020)

84 Martinelli (2020).

stakeholders and the difficulty of evaluating the anonymisation itself.⁸⁵ Governmental bodies or businesses, when setting out to perform anonymisation, have to consider their goal and what information specifically they will need to achieve it.

These two challenges lead to the third: the necessity of human experts who possess this kind of domain-specific knowledge.⁸⁶ While a lot of automated tools may be used to achieve anonymisation (for example, the machine can highlight all the personal data which it considers relevant, while a human expert checks and confirms these highlights), the importance of the exercise itself and of its assessment does not allow human experts to be left out,⁸⁷ at least at the current stage of technological development. If it is hard to automate, it becomes costly and time-consuming.

The fourth challenge is that the procedure itself is "too expensive for being applied with effectiveness."⁸⁸ All the previous challenges stacked together lead to the simple question: "is it worth it?" This will be considered in more detail in a separate section below.

There are many research groups, including one at the University of Oslo (the CLEANUP project)⁸⁹, working on ways to automatically anonymise personal data, and many of them are using NLP to some degree. It is an important area of research, as such techniques will contribute to privacy and data protection, will help companies and organisations to comply with legal rules and make day-to-day operations with personal data more secure.

Another challenge is the constant risk of re-identification. While the goal from the data protection perspective is always to achieve complete (100%) anonymisation, there is still a residual risk of re-identification.⁹⁰ This risk only grows with time as technology develops, making anonymisation a challenging task. It is a significant problem, and something that those who hold the anonymised or pseudonymised data have to consider. However, it is not the only type of attack; others include hacking, surveillance, etc. Identification is one of several privacy problems, together with, for example, discrimination.⁹¹ Thus, privacy, personal data protection, anonymisation and pseudonymisation are all interconnected parts of a broader context rather than isolated ones. The diversity of risks to consider only adds to the complexity. There is a suggestion to establish appropriate bans on de-anonymisation with an established penalty. It may cover cases when individuals were identified from a formerly anonymous data set due to technological

85 Tamper (2018), p. 2.

86 Martinelli (2020).

87 The Spanish Data Protection Agency, p.6.

88 Martinelli (2020).

89 The CLEANUP Project.

90 The Spanish Data Protection Agency, p.5.

91 Rubinstein (2015), p. 756.

development.⁹² This is an interesting and ambitious proposal, which may address this challenge, but there are still questions about implementation and enforcement of such standards and law.

Moreover, there is a lack of legal certainty regarding the anonymisation and pseudonymisation of data, and the linkage between individuals and allegedly anonymised data sets.⁹³ The field is wide and interdisciplinary, with no unified approach. This issue may be addressed by the development of standardised technologies and procedures. Such standards have to consider the fast pace of technological development. Therefore, the German Data Ethics Commission even recommends lobbying at the EU level for "easy-to-use anonymisation standards that would benefit both data subjects and users and for pseudonymisation measures that are commensurate with the level of risk faced by the data subjects in their private lives."⁹⁴ The encouragement of the development of such standards for anonymisation and pseudonymisation underlines the importance of such techniques in the long run. Despite all the challenges, when dealing with personal data, especially in textual documents, these two techniques are valid options to make processing more secure and privacy friendly.

One more dimension to consider is the ethical side of the challenge. Many gladly turned to anonymisation as a solution to the processing of personal data, concerned mostly with its feasibility and the utility of the data afterwards. However, the question stands whether the process of anonymisation actually addresses privacy and ethical issues connected with big data.⁹⁵ What the best and most ethical way to handle personal data is remains under discussion and is outside the scope of this thesis.

3.5 Challenges with anonymisation (textual data)

Automated text anonymisation or pseudonymisation is one way to deal with the personal data in documents. However, textual data presents several challenges. First of all, it is unstructured (or structured in a way readable for humans but not for machines, as many legal documents are far from the neat table of values used in data science). Secondly, there is a certain ambiguity in natural language,⁹⁶ and it is not always straightforward. While this may be relatively easy to untangle for a human, it is a challenge for a machine.

Anonymisation and pseudonymisation of textual data present a particular challenge. It is due to the nature of textual data itself, as computers "think" or operate with numbers, and it adds a

92 The Federal Government’s Data Ethics Commission, p. 132.

95 Barocas (2014), p.50-51.

96 Mamede (2016), p. 1287.

layer of complexity when working with text, which is one of the tasks that NLP addresses. Turning the text into numbers is one way of dealing with it, discussed further in the NLP section.

Machine learning and neural networks boost the capabilities of already known NLP tools and allow the development of new ones. While several years ago even admission to a literary competition was considered a success,⁹⁷ the new and not yet widely available GPT-3 model has now written an article for The Guardian.⁹⁸ However, the computer still does not understand the text in the sense that humans do: it can distinguish between words, but they do not carry much meaning for the machine.

It means that a machine learning model can "learn" what a personal name, a date of birth and an address are, and there are NLP techniques already working on this kind of task (specifically, Named Entity Recognition, discussed in the NLP section). However, even advanced systems will have problems with a sentence like "the richest man in the country has cancer", as it can be hard to detect a relatively general expression such as "the richest man in the country" as personal data, while a human can identify this person really easily. Expressions like this make a person identifiable, and legal documents are full of this kind of information. An example can be a high-profile criminal case: even when anonymised, such a case can still contain a lot of facts that can indirectly identify the person involved, making automation of anonymisation especially difficult.⁹⁹
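The difference between direct and indirect identifiers can be shown with a deliberately naive Python sketch. The two regular expressions, the label names and the example sentences (apart from the "richest man" sentence quoted above) are assumptions made for illustration, and pattern matching of this kind is a much simpler stand-in for the trained NER systems discussed later.

```python
import re

# Deliberately naive patterns, chosen for illustration only.
PATTERNS = {
    "DATE": re.compile(r"\b\d{2}\.\d{2}\.\d{4}\b"),          # e.g. 01.02.1970
    "NAME": re.compile(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b"),      # two capitalised words
}


def find_direct_identifiers(text):
    """Return (label, matched text) pairs for every pattern match."""
    hits = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((label, match.group()))
    return hits


explicit = "Kari Nordmann was born on 01.02.1970."
indirect = "The richest man in the country has cancer."

explicit_hits = find_direct_identifiers(explicit)
indirect_hits = find_direct_identifiers(indirect)
```

The first sentence yields a name and a date, while the second yields nothing, although a human reader may identify the person at once. A more capable detector would face the same underlying difficulty, since the identifying power of the expression comes from knowledge outside the text.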

97 Brogan (2016).

98 The Guardian, GPT-3.

99 Gianola (2020), p. 224.

4 An Introduction into NLP

4.1 NLP

In this section, I will briefly discuss the NLP basics and techniques used for the text. It will also be helpful for understanding the examples in the next section.

Here, it is important to remember that humans and computers "speak" different languages. Behind all the fancy and human-readable interfaces, there are still zeros and ones. In defining NLP, Armour summarises multiple researchers and describes it as a field which changes unstructured textual data into numeric vectors, which can later be analysed by ML. It builds and relies on statistical relationships between words or their patterns. While NLP is good at retrieving information from textual documents, it lacks "social intelligence" or "understanding" of semantic context, which is easily understood or picked up by humans from the textual information.¹⁰⁰ This is evident in cases with complex queries expressed in natural language (e.g., plain English), which systems have problems responding to.¹⁰¹ The central part of NLP processing is to transform the text into mathematical objects,¹⁰² where textual data has to be converted into numerical values for the machine to "understand."

One of the solutions to bridge this gap is Word2Vec, a way to turn a word into a vector, that is, a list of numbers (for example, a list of 25 numbers), which can be placed in a coordinate system. After this, a machine can learn from these numbers and compare them. There may also be a rule-based approach, where rules of grammar serve as a decision tree program, thus helping to navigate the text. NLP proposes several other approaches, but none of them was quite good enough for proficient work with textual data until machine learning (ML) arrived.
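How a machine can "compare" words once they are lists of numbers may be sketched in a few lines of Python. The three-dimensional vectors below are hand-made assumptions for illustration; they are not the output of Word2Vec, which would learn vectors with many more dimensions from a large corpus. Cosine similarity is used here as one common way of comparing vectors.

```python
import math

# Hand-made toy vectors, for illustration only (not trained embeddings).
vectors = {
    "judge": [0.9, 0.8, 0.1],
    "court": [0.8, 0.9, 0.2],
    "banana": [0.1, 0.0, 0.9],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: close to 1 means similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


related = cosine_similarity(vectors["judge"], vectors["court"])
unrelated = cosine_similarity(vectors["judge"], vectors["banana"])
```

With these toy values, "judge" and "court" come out as far more similar than "judge" and "banana". In a trained model, such closeness would reflect the statistical relationships between words in the training texts.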

Before any work with textual documents starts, they should be pre-processed. A machine-readable format is one of the requirements (and also a challenge) for this, which is followed by a number of NLP techniques developed throughout the years. These steps or modules are usually united in an NLP pipeline (software where the input is processed in a specific order). The original unstructured text is loaded into the first module and leaves the last module structured and with the necessary information. The structure of the pipeline depends on the pursued task.

100 Armour (2019), p.30.

101 Ashley (2017), p. 339.

102 Kedia (2020), p. 26.

The typical NLP pipeline starts with tokenisation, where words and other textual units (like punctuation) are identified. It is the first step to vectorisation¹⁰³ (turning text into numerical values and further processing). A sentence detector may be used to separate the sentences. Whether or not to do it will depend on the specific task at hand and the choice of technical ways to achieve it. Part-of-speech tagging (POS tagging) finds the lexical category (verb, noun, etc.) of every word. Knowing the POS of every word helps to deduce the contextual meaning. It is also important for the performance of Named Entity Recognition (NER).¹⁰⁴ Contextual meaning is important because the same words may have different meanings, depending on their position and the words next to them.

Figure 1. Stages in a typical/simple NLP pipeline may look like this. There are different tools to achieve it, including almost "no-code" solutions. Development is ongoing.
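The stages of such a pipeline can be sketched in plain Python. This is a toy illustration under stated assumptions, not how production tools work: the tiny `LEXICON`, the capitalisation rule that stands in for a trained POS tagger, and all function names are hypothetical. The point is only to show how the output of one module becomes the input of the next.

```python
import re


def tokenise(text):
    """Tokeniser: words and punctuation marks become separate tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def detect_sentences(tokens):
    """Sentence detector: close a sentence after '.', '!' or '?'."""
    sentences, current = [], []
    for token in tokens:
        current.append(token)
        if token in {".", "!", "?"}:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


# Tiny hand-written lexicon standing in for a trained POS tagger.
LEXICON = {"the": "DET", "in": "ADP", "lives": "VERB", "works": "VERB", "she": "PRON"}


def pos_tag(tokens):
    """Toy POS tagger: lexicon lookup first, then a capitalisation rule."""
    tagged = []
    for token in tokens:
        if not token[0].isalnum():
            tag = "PUNCT"
        elif token.lower() in LEXICON:
            tag = LEXICON[token.lower()]
        elif token[0].isupper():
            tag = "PROPN"  # assumption: unknown capitalised words are proper nouns
        else:
            tag = "NOUN"
        tagged.append((token, tag))
    return tagged


def recognise_entities(tagged):
    """Toy NER: consecutive proper nouns are grouped into one candidate entity."""
    entities, current = [], []
    for token, tag in tagged:
        if tag == "PROPN":
            current.append(token)
        elif current:
            entities.append(" ".join(current))
            current = []
    if current:
        entities.append(" ".join(current))
    return entities


def pipeline(text):
    """Run the modules in order and return the candidate entities per sentence."""
    return [recognise_entities(pos_tag(s)) for s in detect_sentences(tokenise(text))]


result = pipeline("Kari Nordmann lives in Oslo. She works in the court.")
```

Here `result` contains the candidate entities "Kari Nordmann" and "Oslo" for the first sentence and none for the second. The sketch would wrongly treat any unknown capitalised word at the start of a sentence as a proper noun, which illustrates why the quality of POS tagging matters for NER.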

Named Entity Recognition (NER) helps to identify and classify essential players in the sentence, such as names, places and organisations. A named entity is "anything that can be referred to with a proper name".¹⁰⁵ For humans, this may seem like a trivial task, and we can almost instantly identify the proper name in the sentence. However, the machine needs a lot of context for it. NER opens up a range of applications. For example, in text summarisation, the program can scan the text and identify key features (processing of CVs or contracts with highlighting of relevant information); automatic indexing is used to organise data for efficient retrieval (again, it can be used for contract review or detecting relevant information in a legal document even when the wording is different); and information extraction allows faster processing of a document (for example, what product, store location or person the document talks about).¹⁰⁶ Concrete examples relevant to the research question are given in the next section, after this overview, which aims to be a short introduction to NLP.

NER has a wide range of applications, and it can be used for document de-identification, machine translation and also conversational models.¹⁰⁷ A popular open-source NER tool is

103 Kedia (2020), p. 27.

104 Kedia (2020), p.30.

105 Jurafsky.

106 Kedia (2020), p.17.

107 Lison (2020), p.1518.

[Figure 1 diagram: Tokenizer → Sentence detector → POS tagger → NER]

Stanford Named Entity Tagger,¹⁰⁸ but there are many others, and users can also train their own, which better suits their needs. These examples are given to illustrate the wide range of tasks that NLP and NER can handle, which can also serve legal experts.

Figure 2. Screenshot of an online demo version of Stanford Named Entity Tagger.¹⁰⁹ This is an illustration of how NLP can help to highlight named entities - certain information (specifically, proper names) for further processing. This example has identified date, person, organisation and location.

It is important to consider the role of machine learning (ML) in NLP. NLP is one of many fields influenced by the development of ML. ML is a broad field of research, which covers a number of topics and algorithms with the aim of finding patterns or making predictions. ML leverages frequency information statistically to "learn" how features and target outcomes correspond to each other.¹¹⁰ ML is combined with NLP and gives a boost to converting the unstructured textual data to numeric vectors that can be analysed further using ML techniques.¹¹¹ Basically, once NLP started using ML, its capabilities grew together with the amount of processing it

108 The Stanford Natural Language Processing Group

109 The Stanford Natural Language Processing Group.

110 Ashley (2017), p. 107.

111 Jurafsky.

can do and the results it can achieve. There are still a lot of challenges, especially because of the nature of textual data, which will be discussed in the dedicated chapter.

Just to give an example, here are some of the ML algorithms that are also used in NLP: Hidden Markov Models, Naive Bayes classifiers, Convolutional and Recurrent Neural Networks, etc.

These are not necessary for a legal expert to know. However, experts in ML and NLP are usually very well acquainted with them.

Any ML technique requires data, and the combination of ML and NLP is no exception. Data may be labelled or unlabelled, which will influence the way of handling and processing such data. Many ML models, such as supervised ML models, which aim to make predictions, require labelled data to be trained on. An unsupervised ML model, which aims to find patterns, may be used on an unlabelled data set. Labelling data is an expensive and time-consuming, but in many cases necessary, exercise. The value of data and these concepts will be illustrated in the sections with concrete examples below.
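What labelled data looks like, and why a supervised model depends on it, can be illustrated with a very small Python sketch. The training sentences, the label names and the "most frequent label per token" rule are assumptions chosen for simplicity; real supervised models are far more elaborate, but they learn from token and label pairs of the same kind.

```python
from collections import Counter, defaultdict

# Hypothetical labelled training data: each token is paired with a label.
labelled = [
    [("Kari", "PERSON"), ("lives", "O"), ("in", "O"), ("Oslo", "LOCATION")],
    [("Ola", "PERSON"), ("visited", "O"), ("Oslo", "LOCATION")],
]


def train(sentences):
    """'Learn' the most frequent label seen for each token in the training data."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for token, label in sentence:
            counts[token][label] += 1
    return {token: c.most_common(1)[0][0] for token, c in counts.items()}


def predict(model, tokens, default="O"):
    """Label each token; tokens never seen in training fall back to the default."""
    return [(token, model.get(token, default)) for token in tokens]


model = train(labelled)
prediction = predict(model, ["Kari", "visited", "Bergen"])
```

The model labels "Kari" as a person because it has seen that example, but it misses "Bergen", which never appeared in the labelled data. This gives a sense of why large amounts of labelled data are valuable and why producing them is costly.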

4.2 NLP and Legal Services

LegalTech is a broad term, which covers the use of technology, computer analytics, and software to facilitate justice and provide legal services.¹¹² In recent years, new techniques such as machine learning (ML) and Big Data, which come from various disciplines, have been adopted within law. While they are yet to be widely deployed in legal practice, they still provoke new and interesting developments in such tasks as analysing sets of documents or summarising the provisions of contracts.¹¹³ There are many innovative companies in the legal sphere; one of the websites that track them is run by Stanford Law School. It lists 1761 companies who are "changing the way legal is done,"¹¹⁴ and the number is slowly growing. The project divides the companies into nine areas: analytics, compliance, document automation, legal education, legal research, marketplace, online dispute resolution, practice management and eDiscovery. However, there are other possible divisions, for example, by areas where NLP has an important role: legal research, electronic discovery, contract review, document automation and legal advice.¹¹⁵ We can see that NLP (the technical side of it is analysed in the next chapter) is increasingly entering the legal field and is used by traditional law firms in new ways, especially for document processing. The development of anonymisation and pseudonymisation techniques with the use of NLP may add a further dimension to document processing, in which documents can be de-identified.

112 Sandvik (2020)

113 Susskind (2017), p. 53.

114 Stanford Law School.

115 Dale (2019), pp.211-212.

However, there are challenges to the growth of LegalTech and to the wider employment of such techniques. Probably the main one is the way legal services are traditionally operated and paid for, by billable hours. As technology contributes to more effective and faster fulfilment of the task, it reduces the number of hours that can be billed. However, disruption of the current situation is inevitable for many reasons, for instance, the increasing demands from worldwide movements focused on access to justice.¹¹⁶ Also, in the case of the anonymisation and pseudonymisation of textual legal documents, it is not only law firms that are interested, but also governmental bodies, courts and the research community.

A recent study shows that many LegalTech projects are based on activities which are already performed by lawyers; they are not out of touch with the existing situation but are inspired by current challenges. It is also lawyers who initiate these new projects.¹¹⁷ Real-world problems motivate people who experience them regularly to look for solutions. As the courts in many countries are obliged or at least encouraged to publish their decisions, de-identifying them becomes a serious and relevant task. Thus, projects like those in Finland, Canada, and Norway are emerging (concrete cases are illustrated below).

Moreover, it is often a joint effort, driven not by an individual but by collective rationality. These projects are often put in context and make use of local opportunities. Overall, LegalTech projects influence both the legal professional knowledge and the practices (making them more automated) of the lawyers taking part in them. Such projects require technological skills, and this consequently assists in making legal information more accessible online, and also faster and cheaper for the clients.¹¹⁸ There is interaction and convergence between law and technology, with new ideas coming from both legal and technology areas, and more opportunities are explored. NLP is part of it as well.

This can be seen in the way legal search has changed over the last decade under the influence of NLP, moving from simple keyword-based techniques to context-oriented methodologies.¹¹⁹ It made search more effective for lawyers and their complex inquiries.

4.3 Concrete cases of NLP application in the legal field

In this section, I will review some of the recently published research papers, which combine anonymisation and NLP, when processing textual documents. The papers are united by one goal – to make anonymisation of the textual files easier, minimising the number of hours the

116 Dale (2019), p.217.

117 Dubois (2020), p.12.

118 Dubois (2020), p.12.

119 Buddarapu (2019), p.1.
