Last Sunday, an expert committee appointed by the government and chaired by former Infosys CEO Kris Gopalakrishnan published its preliminary recommendations on the regulation of non-personal data.
The broad idea behind such an exercise is to figure out safe ways in which “India can create a modern framework for creation of economic value” from the use of non-personal data collected by private and government organisations.
The report has invited stakeholders to share their input and concerns, all of which is expected to help pave the way towards eventual regulation.
Efforts towards regulating non-personal data are taking place in parallel to deliberations surrounding the regulation of personal data by the Joint Parliamentary Committee on the Personal Data Protection Bill, 2019 (PDP Bill).
The committee has outlined some techniques of anonymisation in the primer appended to its report (see appendix 3). These techniques may be relevant in determining the appropriate standards of anonymisation. Since personal data, once anonymised, falls outside the scope of the PDP Bill, knowing exactly when data becomes anonymised is crucial.
Certainty about when data becomes anonymised is an important concern for both the data fiduciary (organisation) in terms of business certainty, and the data principal (individual) in terms of privacy harms which could arise if the PDP Bill ceases to apply.
What is Non-Personal Data?
The PDP Bill defines personal data as data about or relating to a natural person who is directly or indirectly identifiable.
Accordingly, non-personal data could be of two types. First, data or information which was never about an individual (e.g. weather data). Second, data or information that once was related to an individual (e.g. mobile number) but has now irreversibly ceased to be identifiable due to the removal of certain identifiers through the process of ‘anonymisation’.
In practice, however, the distinction between personal data and non-personal data is fairly murky. The degree to which data is de-identified can lie anywhere on a spectrum between clearly personal and clearly anonymous.
What is Anonymisation?
Anonymisation is currently an unclear standard of de-identification that is to be determined by the Data Protection Authority of India (to be established under the PDP Bill once it is enacted). De-identification is a process by which identifiers that help in attributing data to an individual are removed so that the data is delinked from the individual.
The PDP Bill defines anonymisation as the “irreversible process of transforming or converting personal data to a form in which a data principal cannot be identified, which meets the standards of irreversibility specified by the Authority.” Even though the PDP Bill is yet to be enacted, the characterisation of the process as irreversible indicates that the standard must be fairly high. To be clear, there have been studies which show that personal data can never be truly irreversibly anonymised.
To better understand the process of de-identification, let us consider one of the techniques mentioned in the Gopalakrishnan Committee Report: k-anonymity. K-anonymity helps prevent attempts to link data to a particular person by generalising existing attributes.
Let us assume that a digital contact tracing app collects some personal information at the time of registration. This could include identifiers such as date of birth, name, city, health condition and gender, as represented in table 1 below:
| Date of Birth | Name | City | COVID Status | Gender |
|---|---|---|---|---|
| 04.04.1976 | Ankit | New Delhi | COVID-19 Negative | Male |
Table 2 generalises and de-identifies the information collected by the app, as represented earlier in table 1, to illustrate the process of k-anonymity. Comparing the two tables closely, the names of the individuals and their exact dates of birth have been omitted to attain some degree of generalisation. Only their year of birth, city, gender and COVID status are accessible now:
| Date of Birth | Name | City | COVID Status | Gender |
|---|---|---|---|---|
| XX.XX.1967 | Patient 1 | Mumbai | COVID-19 Positive | Female |
| XX.XX.1976 | Patient 2 | New Delhi | COVID-19 Negative | Male |
To some (albeit a limited) extent, therefore, the information in table 1 has been de-identified in table 2. Does this mean that the data (as represented in table 2) has really been anonymised?
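One way to start answering that question is to measure k directly. The following is a minimal Python sketch, not a method taken from the committee's report. It reproduces the generalisation step on the one row shown in table 1 and then counts how many rows in table 2 share the same combination of attributes. The function names and the choice of date of birth, city and gender as the "quasi-identifiers" (with COVID status treated as the sensitive value) are assumptions made for illustration.

```python
from collections import Counter


def generalise(record, patient_number):
    """Replace the name with a patient label and keep only the year of a DD.MM.YYYY date."""
    year = record["Date of Birth"].split(".")[-1]
    return {
        "Date of Birth": f"XX.XX.{year}",
        "Name": f"Patient {patient_number}",
        "City": record["City"],
        "COVID Status": record["COVID Status"],
        "Gender": record["Gender"],
    }


def k_anonymity(rows, quasi_identifiers):
    """Return k: the size of the smallest group of rows sharing the same quasi-identifier values."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values())


# The single row shown in table 1.
table_1_row = {
    "Date of Birth": "04.04.1976", "Name": "Ankit", "City": "New Delhi",
    "COVID Status": "COVID-19 Negative", "Gender": "Male",
}

# The two rows shown in table 2.
table_2 = [
    {"Date of Birth": "XX.XX.1967", "Name": "Patient 1", "City": "Mumbai",
     "COVID Status": "COVID-19 Positive", "Gender": "Female"},
    {"Date of Birth": "XX.XX.1976", "Name": "Patient 2", "City": "New Delhi",
     "COVID Status": "COVID-19 Negative", "Gender": "Male"},
]

# ASSUMPTION: these three columns are the ones an outsider could match on.
QUASI_IDENTIFIERS = ["Date of Birth", "City", "Gender"]

generalised = generalise(table_1_row, 2)
k = k_anonymity(table_2, QUASI_IDENTIFIERS)

print(generalised == table_2[1])  # True: the table 1 row becomes "Patient 2"
print(k)                          # 1: every row in table 2 is still unique
```

A value of k = 1 means that no row in table 2 is hidden among others with the same year of birth, city and gender, which is the weakest possible outcome under this measure.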
Back in the 1990s, a private corporation in the US that was fairly active in the health sector reportedly did something very similar. It decided to release a dataset publicly after anonymising it using the k-anonymity process. The names of the individuals were omitted from the dataset; however, some attributes, including ZIP code, gender and full date of birth, remained. Even with such a limited dataset, researchers had a re-identification success rate of over 80%: publicly available registers such as voter lists contained the same attributes, which allowed the records to be linked back to specific individuals.
Similarly, while it may seem impossible to identify an individual from table 2 on the basis of just the year of birth, gender, city and COVID status, it is, in fact, possible to re-identify a particular individual. This could take place by combining the table with other background information that the government or an organisation may possess, even though the information has undergone the process of k-anonymity. Removing two identifiers (i.e. personal name and exact date of birth) is therefore visibly inadequate to make what is ostensibly non-personal data truly anonymised or non-personal.
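The kind of linkage described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a reconstruction of any actual study: the `register` list stands in for a public record such as a voter list, and its entries ("Resident A" and so on) are entirely hypothetical. Only the two de-identified rows come from table 2.

```python
# De-identified rows as they appear in table 2.
released = [
    {"Date of Birth": "XX.XX.1967", "Name": "Patient 1", "City": "Mumbai",
     "COVID Status": "COVID-19 Positive", "Gender": "Female"},
    {"Date of Birth": "XX.XX.1976", "Name": "Patient 2", "City": "New Delhi",
     "COVID Status": "COVID-19 Negative", "Gender": "Male"},
]

# HYPOTHETICAL public register: every entry below is invented for illustration.
register = [
    {"Name": "Resident A", "Year of Birth": "1967", "City": "Mumbai", "Gender": "Female"},
    {"Name": "Resident B", "Year of Birth": "1976", "City": "New Delhi", "Gender": "Male"},
    {"Name": "Resident C", "Year of Birth": "1976", "City": "New Delhi", "Gender": "Female"},
]


def link(released_rows, register_rows):
    """For each released row, list register names sharing its year of birth, city and gender."""
    matches = {}
    for row in released_rows:
        year = row["Date of Birth"].split(".")[-1]
        key = (year, row["City"], row["Gender"])
        matches[row["Name"]] = [
            entry["Name"]
            for entry in register_rows
            if (entry["Year of Birth"], entry["City"], entry["Gender"]) == key
        ]
    return matches


matches = link(released, register)

# A released row is re-identified when exactly one register entry matches it.
reidentified = {patient: names[0] for patient, names in matches.items() if len(names) == 1}
print(reidentified)  # {'Patient 1': 'Resident A', 'Patient 2': 'Resident B'}
```

In this toy register each released row matches exactly one named entry, so the COVID status attached to "Patient 1" and "Patient 2" can be tied back to a named person. A larger register with several matching residents per row would make the link less certain, which is the intuition behind requiring a higher k.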
The susceptibility of various anonymisation techniques (such as k-anonymity in our example above) to risks of re-identification has been recognised by the Article 29 Working Party in the European Union as well. Consistent with this, the Gopalakrishnan Committee Report has also acknowledged that a risk of re-identification remains even after anonymisation.
Even so, while anonymisation of personal data is important to protect privacy, over-generalisation through an extremely high standard of anonymisation can also render the data less useful and, in some instances, perhaps not useful at all. For instance, if the year of birth is removed from table 2, generalisation could happen to a greater extent. However, it could reduce the usefulness of the data by limiting the ability to gather insights into prevalent health vulnerabilities that are present in specific age groups.
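The trade-off can be made concrete with a small Python sketch. The four rows below are hypothetical and are not drawn from the article's tables; the column names and helper functions are likewise assumptions chosen for illustration. The sketch compares k with and without the year of birth, and checks whether a breakdown of positive cases by birth year is still possible.

```python
from collections import Counter

# HYPOTHETICAL de-identified rows, invented purely to illustrate the trade-off.
rows = [
    {"Year of Birth": "1950", "City": "New Delhi", "Gender": "Male", "COVID Status": "Positive"},
    {"Year of Birth": "1952", "City": "New Delhi", "Gender": "Male", "COVID Status": "Positive"},
    {"Year of Birth": "1990", "City": "New Delhi", "Gender": "Male", "COVID Status": "Negative"},
    {"Year of Birth": "1991", "City": "New Delhi", "Gender": "Male", "COVID Status": "Negative"},
]


def k_anonymity(rows, quasi_identifiers):
    """Return k: the size of the smallest group of rows sharing the same quasi-identifier values."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values())


def positives_by(rows, column):
    """Count positive cases per value of `column`, or return None if the column was removed."""
    if any(column not in row for row in rows):
        return None
    return dict(Counter(row[column] for row in rows if row["COVID Status"] == "Positive"))


# Drop the year of birth entirely, as in the example discussed above.
without_year = [
    {key: value for key, value in row.items() if key != "Year of Birth"} for row in rows
]

k_with_year = k_anonymity(rows, ["Year of Birth", "City", "Gender"])
k_without_year = k_anonymity(without_year, ["City", "Gender"])

print(k_with_year, k_without_year)                  # 1 4
print(positives_by(rows, "Year of Birth"))          # {'1950': 1, '1952': 1}
print(positives_by(without_year, "Year of Birth"))  # None
```

Dropping the column raises k from 1 to 4 in this toy dataset, so each person is harder to single out, but the observation that both positive cases fall among the older individuals can no longer be recovered.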
Of course, companies, organisations and governments (i.e. data fiduciaries) would likely prefer to reduce generalisation to increase the usefulness of the data. At the same time, it is in the interest of the data principal to ensure that their data is generalised and not identifiable. An unclear standard of anonymisation can create conflict between these competing interests. Accordingly, there is a strong need to develop a bright line test which delineates personal and non-personal data. A bright line test will be vital to determining whether the PDP Bill or the non-personal data framework will apply to (anonymised) personal data. Regulatory certainty about the applicable law is crucial, since non-compliance or compliance with the incorrect law could involve significant fines or penalties.
The first step towards developing a bright line test would be for the Gopalakrishnan Committee and the Joint Parliamentary Committee to jointly deliberate upon developing indicative principles (and consequently specific rules) which guide entities handling personal data in determining the exact degree to which data needs to be anonymised.
While developing contextually applicable bright line tests, the committees must carefully balance the individual’s right and interest in securing and protecting privacy against the societal and organisational interest in gleaning insights and learnings from data, which is vital to unlocking the digital economy in the Global South.
Samraat Basu is a technology and data protection lawyer. Siddharth Sonkar is graduating from the National University of Juridical Sciences (NUJS), Kolkata with the class of 2020 and has a strong interest in law, technology and regulatory policy. They can be reached at @samraat_basu and @ssiddharth96 on Twitter.