Around August 30, draft guidelines on how government projects can anonymise and harness e-governance-related data were opened for public consultation by the Ministry of Electronics and Information Technology (MeitY), with little to no fanfare.
Why would the Ministry propose anonymising data?
Large datasets are useful for research, governance, or commerce. They often contain a mix of personally identifiable data and descriptive data related to the same individuals. The descriptive data, which supposedly does not identify a person, can be useful to access and analyse. However, as long as it sits alongside personal data (which is usually protected by data protection laws), processing it poses privacy risks to individuals.
So, to make use of this information, organisations and governments ‘scrub’ datasets of personal data. This supposedly leaves them full of ‘non-personal data’ that is ‘anonymised’ because it doesn’t actually link back to an individual anymore or harm their privacy. The datasets are then released for public use.
The guidelines were another step for non-personal data governance in India
Like many countries, the Indian government is pushing for the utilisation of anonymised non-personal data to improve governance, research and competition between businesses. State governments are keen on the idea of utilising anonymised non-personal data too – in April, the Tamil Nadu government released ‘masked’ data on the Tamil Nadu Public Service Commission selection process under its open data policy.
However, around September 6, the guidelines were withdrawn from the ‘e-Governance Standards and Guidelines’ website, almost as unceremoniously as when they were first uploaded. Reports suggest that the Ministry withdrew the guidelines because “they were released without adequate expert consultation”. A new document will be released soon.
Perhaps the Ministry’s decision was wise. As experts told MediaNama, the ‘anonymisation’ of personal data does not guarantee individual privacy – the techniques used to protect information can often be reversed. This can lead to the ‘re-identification’ or ‘deanonymisation’ of a dataset – revealing the identity of an individual or a group of people, while violating their privacy and exposing them to a wide range of harms.
What’s worse, deanonymisation is a relatively easy exercise for well-trained malicious actors. And, in the event that it does happen, Indian citizens have no recourse to protect themselves given the absence of a data protection law.
“With data being the new gold, cybercriminals or other individuals will target areas with large stores of personally identifiable data [or potentially identifiable data],” argues Utsav Mittal, a Certified Information Systems Security Professional and founder-CEO of Indian cybersecurity firm Xiarch.
“Ultimately, the likelihood of a sector or organisation being hacked [or its available information being used for malicious purposes] is directly proportional to the amount of personal data they have,” Mittal said.
The privacy risks posed by large anonymised datasets – which capture the characteristics of India’s population, businesses and landscapes in granular detail – are self-evident. However, none of these risks can be mitigated – or penalised – without a data protection law in force. Even the withdrawn e-governance guidelines were “voluntary” and lacked statutory backing.
“Ultimately, data-driven policymaking will be successful only if we address the triad of intertwined deficits India is facing in the fields of democracy, data and development,” surmises Vikas Kumar, faculty at the School of Development at Azim Premji University, whose recent work includes Numbers in India’s Periphery: The Political Economy of Government Statistics.
“Think of these like a tripod. If you raise the height of one leg [data processing] but do not raise the height of the other two legs [democracy and development], then data-driven policymaking will be ineffective. You need to raise all three legs at the same time – but for that, you need to be concerned about the other two legs in the first place.”
What is deanonymisation and how does it work?
“[To explain anonymisation] In the simplest of terms, once collected by an entity, the data is stripped of Personal Identifiers (PI) and released in small segments (only 1% of a larger dataset or only the anonymised medical information of 1000 patients in a hospital housing 100,000),” explains Ayushman Kaul, senior analyst at Logically.
“Once treated in this manner, the data is widely distributed, and even modern research institutions and academicians are encouraged to release anonymised datasets of their work to have their workings independently verified by the broader community. In fact, after a dataset has been through the anonymisation process, it is no longer considered to be ‘personal information’ and thus is often exempt from many of the judicial safeguards designed to safeguard an individual’s privacy. These datasets can subsequently be freely used, shared, and sold.”
Deanonymisation, on the other hand, is performed by combining scrubbed datasets to identify information about the same user in different contexts. This linking of datasets can reveal layered and comprehensive personal information about an individual, which is why experts suggest that anonymisation is not a foolproof technique to protect privacy.
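The linking step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a description of any real attack: both datasets, the field names and the `link` helper are hypothetical and invented for this example. It joins a ‘scrubbed’ dataset to an auxiliary one on shared indirect identifiers and keeps only unambiguous matches.

```python
# Hypothetical records, invented for illustration only.
scrubbed_health = [
    {"zip": "110001", "birth_date": "1961-03-14", "gender": "F", "diagnosis": "diabetes"},
    {"zip": "110002", "birth_date": "1985-07-02", "gender": "M", "diagnosis": "asthma"},
    {"zip": "110001", "birth_date": "1990-11-23", "gender": "F", "diagnosis": "hypertension"},
]
public_roll = [
    {"name": "Person A", "zip": "110001", "birth_date": "1961-03-14", "gender": "F"},
    {"name": "Person B", "zip": "110002", "birth_date": "1985-07-02", "gender": "M"},
    {"name": "Person C", "zip": "110003", "birth_date": "1972-01-30", "gender": "M"},
]

# Indirect identifiers assumed to be present in both datasets.
QUASI_IDENTIFIERS = ("zip", "birth_date", "gender")


def link(scrubbed, auxiliary, keys=QUASI_IDENTIFIERS):
    """Join two datasets on shared indirect identifiers."""
    # Index the auxiliary dataset by its combination of indirect identifiers.
    index = {}
    for row in auxiliary:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)

    matches = []
    for row in scrubbed:
        candidates = index.get(tuple(row[k] for k in keys), [])
        # Only an unambiguous match attaches a name to the scrubbed record.
        if len(candidates) == 1:
            matches.append({**candidates[0], **row})
    return matches


reidentified = link(scrubbed_health, public_roll)
for person in reidentified:
    print(person["name"], "->", person["diagnosis"])
```

In this toy example, two of the three scrubbed records pick up a name because their combination of postcode, birth date and gender appears exactly once in the auxiliary list. The third stays unlinked only because no matching auxiliary record happens to exist.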
“The term ‘anonymised data’ can convey a false sense of security because it’s almost impossible to be sure that personal data has been made truly anonymous and will always be anonymous,” notes Christine Runnegar, senior director of Internet Trust at the Internet Society. “A better term to use when trying to anonymise personal data by removing identifying information is ‘de-identified data’. It conveys the idea that known identifying information has been removed. Be aware though that there may still be some unrecognised identifying information, or the data could be re-identified when combined with other data.”
As far back as 2006, researchers studying US census data found that 63% of the population sampled could be identified by combining just three demographic indicators: gender, zip code and birth date. The researchers were building on studies from the early 2000s – which found that 87% of the US population could be identified by using the same indirect identifiers.
In 2019, researchers found that “99.98% of Americans would be correctly re-identified in any [anonymised] dataset using 15 demographic attributes”. The Belgium- and London-based researchers concluded that “even heavily sampled anonymised datasets are unlikely to satisfy the modern standards for anonymisation set forth by GDPR (the European Union’s privacy-protecting law)”.
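The kind of measurement behind these figures can be illustrated with a short Python sketch. It is a simplified stand-in, not the method used in the studies cited above: the `unique_fraction` function and the five sample records are hypothetical. It counts what share of records become unique as more demographic attributes are combined.

```python
from collections import Counter


def unique_fraction(records, attributes):
    """Share of records whose combination of `attributes` occurs exactly once."""
    keys = [tuple(record[a] for a in attributes) for record in records]
    counts = Counter(keys)
    unique = sum(1 for key in keys if counts[key] == 1)
    return unique / len(records)


# Hypothetical sample, invented for illustration only.
people = [
    {"gender": "F", "zip": "600001", "birth_date": "1980-01-01"},
    {"gender": "F", "zip": "600001", "birth_date": "1983-04-12"},
    {"gender": "M", "zip": "600001", "birth_date": "1975-05-05"},
    {"gender": "F", "zip": "600002", "birth_date": "1980-01-01"},
    {"gender": "M", "zip": "600002", "birth_date": "1992-09-09"},
]

print(unique_fraction(people, ["gender"]))                       # 0.0
print(unique_fraction(people, ["gender", "zip"]))                # 0.6
print(unique_fraction(people, ["gender", "zip", "birth_date"]))  # 1.0
```

No single attribute singles anyone out here, yet the three together make every record in the sample unique. A data publisher could run a check of this sort before release to see how many records stand alone on a given set of attributes.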
What’s more, accessing anonymised data and performing these analyses on it isn’t necessarily difficult – at least, for those in the know.
“Any dataset is very easy to access for a reasonably resourceful cybercriminal or person,” explains Mittal. “This data can be easily bought off the dark web using cryptocurrencies like Bitcoin, which provide these actors with a degree of anonymity too. The data is ‘cheap’ and its price often decreases – after all, this is a marketplace of information.”
For example, BBC recently reported that 80GB of NATO’s confidential security data was being sold online for 15 Bitcoin (around £273,000). India has seen many private and public datasets breached online too.
“Once procured, technical actors deanonymise the data,” says Mittal. As Kaul explains, these actors can include “governments, law enforcement agencies, data brokerage firms, social media platforms, digital marketers, scammers, journalists, and security researchers (..) The complexity of this process is directly correlated with the ‘granularity’ of the anonymous dataset and the number of ‘auxiliary’ datasets available for cross-referencing.”
“Then, the technical actors simply sell the data to the next groups looking to buy this information on the ground,” adds Mittal.
How does deanonymisation harm people?
“De-identified data, even if it cannot ever be linked back to a particular known individual, can still have serious privacy implications, if it can be used to single out an individual or section of the community,” warns Runnegar. “One of the key risks is discrimination.”
For example, in 2006, AOL released “anonymised” search logs of half a million of its users. While names weren’t included, reporters from the New York Times were still able to quickly identify 62-year-old Thelma Arnold from the dataset.
In 2014, New York City released “anonymised” data on over 173 million taxi rides for public use. However, some experts were able to identify which taxis carried out specific trips (and thus who drove them). This was because the dataset wasn’t anonymised strongly enough, making it easier to triangulate identities.
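One weakness often cited in cases like this is replacing an identifier with an unsalted hash when the set of possible identifiers is small enough to enumerate. The Python sketch below illustrates that general problem only. The four-character plate format, the choice of SHA-256 and the function names are assumptions made for this example, not details of the New York dataset.

```python
import hashlib
from itertools import product
from string import ascii_uppercase, digits


def pseudonymise(plate):
    """Replace an identifier with its unsalted hash (the weak approach)."""
    return hashlib.sha256(plate.encode("utf-8")).hexdigest()


def build_lookup():
    """Hash every possible plate once and map each hash back to its plate.

    Hypothetical format: digit, letter, digit, digit.
    That is only 10 * 26 * 10 * 10 = 26,000 possibilities.
    """
    table = {}
    for d1, letter, d2, d3 in product(digits, ascii_uppercase, digits, digits):
        plate = f"{d1}{letter}{d2}{d3}"
        table[pseudonymise(plate)] = plate
    return table


lookup = build_lookup()

released_value = pseudonymise("7B42")  # what a published row would contain
recovered = lookup[released_value]     # reversing it is a dictionary lookup
print(recovered)
```

Because the whole space of identifiers can be hashed in well under a second, the hash hides nothing. Random identifiers unrelated to the original, or a keyed hash whose secret key is never published, would avoid this particular failure, though not the linking risks described earlier.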
More recently, in 2021, the sexual activities of a high-ranking US Catholic priest, Monsignor Jeffrey Burrill, were exposed by triangulating his location using “aggregated” Grindr usage data procured from data brokers. All of this information was procured legally by a newsletter covering the Catholic Church – and deanonymised successfully by it too. Burrill resigned after the news broke.
In India specifically, the identity-based risks of combined public datasets are already apparent.
“I’ve seen census data being misused in the run-up to violence while on the field,” recalls Kumar. “Vulnerable communities anticipate this, so they sometimes falsify the information they provide to government data collectors to prevent this from happening. For example, in the 1984 riots in the aftermath of Indira Gandhi’s assassination, electoral rolls were used to identify and target Sikhs living in Delhi.” Rioters reportedly used school registration and ration lists too.
During the 2020 communal riots in Northeast Delhi, reports further suggested that data from the Ministry of Road Transport and Highways’ vehicle registration database ‘Vahan’ may have been used to identify vehicles owned by Muslims and set them ablaze.
The probability of deanonymisation taking place is also constantly shifting.
“Given the sheer amount of data collected on individuals, the growing sophistication of machine learning algorithms that can be trained on incomplete or heavily segmented datasets, and the ease of access to auxiliary datasets, the methodology of ‘anonymisation’ is becoming largely redundant,” argues Kaul.
“In fact, a research paper published in 2013 analysing mobility data [one of the most sensitive forms of anonymous datasets as it contains the approximate location of an individual and can be used to reconstruct individuals’ movements across space and time] found that due to the uniqueness of human mobility traces, little outside information was needed to re-identify and trace targeted individuals in sparse, large-scale and coarse mobility datasets.”
“The problem (..) is that you can never be sure what other data is out there and how someone might map it against your anonymous data set. Neither can you tell what data will surface tomorrow, or how re-identification techniques might evolve. Data brokers readily selling location access data without the owners’ knowledge amplifies the dangers.”
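The point about mobility traces can be made concrete with a toy Python sketch. It is an illustration under stated assumptions rather than the method of the 2013 paper: the traces, tower names and the `candidates` helper are hypothetical. Each trace is a set of (place, hour) points with no name attached, and the question is how many traces remain consistent with a handful of points an outsider already knows.

```python
def candidates(traces, known_points):
    """Return the pseudonymous users whose trace contains every known point."""
    known = set(known_points)
    return [user for user, points in traces.items() if known <= points]


# Hypothetical coarse traces: (cell tower, hour of day), invented for illustration.
traces = {
    "user_1": {("tower_A", 8), ("tower_B", 13), ("tower_C", 19)},
    "user_2": {("tower_A", 8), ("tower_D", 13), ("tower_C", 19)},
    "user_3": {("tower_A", 8), ("tower_B", 13), ("tower_E", 21)},
}

one_point = [("tower_A", 8)]
two_points = one_point + [("tower_B", 13)]
three_points = two_points + [("tower_C", 19)]

print(candidates(traces, one_point))     # ['user_1', 'user_2', 'user_3']
print(candidates(traces, two_points))    # ['user_1', 'user_3']
print(candidates(traces, three_points))  # ['user_1']
```

Each extra observation, such as a tagged photo or a known workplace, shrinks the list of possible traces. In this toy case three points are enough to isolate one, after which the rest of that person's movements in the dataset are exposed.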
Citizens in India currently have no recourse against such harms. “We’re discussing this next phase of data governance in India [of non-personal data] before we even have a data protection law in place,” argues Shashank Mohan, project manager at the Centre for Communication Governance at the National Law University, Delhi.
“Other countries may promote the sharing and processing of non-personal data, but they have mature, evolving, and robust data protection laws in place. This conversation on non-personal data in India becomes merely academic as a result – leaving aside non-personal data, even if my personal data is breached today, or if an entity is not adhering to basic data protection principles, as a citizen I have next to no redressal mechanisms.”
Why is there a push for utilising non-personal data, especially in the notable absence of a data protection law?
The use of anonymised non-personal data for India’s governance and economic growth was notably fleshed out in 2020’s Kris Gopalakrishnan Committee report, which defined non-personal data as “any data that is not related to an identified or identifiable natural person, or is personal data that has been anonymised”. The Committee’s vision partially translated to subsequent draft data protection laws, indicating a clear governmental enthusiasm for collecting and processing citizens’ data for economic benefits.
“There is significant value to anonymising datasets for better governance,” explains Astha Kapoor, co-founder and director of the Aapti Institute. “Ultimately, this value is defined in two ways: public value and economic value. These are not mutually exclusive, at least in the eyes of the Kris Gopalakrishnan Committee’s report on non-personal data. There’s an economic value to improved efficiency.”
“The UNDP-incubated innovation lab ‘Pintig’ really ramped up efforts to collect non-personal data about COVID-19 infection patterns in the Philippines,” illustrates Soujanya Sridharan, research analyst at the Aapti Institute. “Data on the infections was used to create a dashboard primarily for policymakers and municipal administrators to determine how to deliver aid and care. In Finland, legislation for the ‘Secondary Use of Health and Social Data’ unlocks non-personal data to use for specific purposes. Among them is obviously planning and governance, but also research, innovation, and education.”
Kapoor adds that in the Indian context, the NITI Aayog has also developed the National Data and Analytics Platform, which seeks to “democratise data delivery by making government datasets readily accessible, implementing rigorous data sharing standards, enabling interoperability across the Indian data landscape, and providing a seamless user interface and user-friendly tools”.
The Gopalakrishnan Committee also pushed for the anonymisation and sharing of company data to spur innovation within Indian industries and reduce the often hegemonic advantages large companies can have within a sector.
Besides companies and governments, non-personal data is also useful to localised groups and communities. “We’ve seen patients with Multiple Sclerosis pool their data, anonymise it and share it with researchers investigating the disease. Indigenous communities in Australia and Canada also use non-personal data on their water bodies or land to negotiate with the government on specific issues,” shares Kapoor.
However, regulators the world over may also be interested in utilising anonymised datasets because they lie outside the usually stringent provisions of personal data protection law. Deanonymisation can act as a ‘workaround’ in the face of laws protecting personal data.
“Traditionally, data protection law has covered personal data, as that is what’s intrinsically linked to your privacy and can lead to your identification,” says Mohan. “But, as technology evolves, multiple players in the data ecosystem have realised there is immense value in processing data. That’s where the whole conversation around NPD and data governance is hinged on – the processing of data that’s not largely protected under the ‘burdens’ of data protection laws.”
“Right now, if you obtain personal data in India and use it without consent, there still might be certain drawbacks,” adds Anushka Jain, associate policy counsel (surveillance and transparency) at the Internet Freedom Foundation.
“For example, the Information Technology (Reasonable security practices and procedures and sensitive personal data or information) Rules, 2011 [2011 Rules] prevent companies from doing anything with sensitive personal data that the person has not consented to. The enforcement of those rules is another question, but they at least exist. So, the moment you’re processing personal data without consent, you’re doing it illegally. Why do that, when you can deanonymise non-personal data, and then do whatever you want with it?”
What are the legal and practical gaps in how the government views anonymised data?
The government suggested a wide array of anonymisation techniques government departments could use in its now-withdrawn guidelines, perhaps in order to mitigate the privacy risks of deanonymisation. However, while a strong anonymisation technique may make a malicious actor’s job that much harder, it may not always protect datasets from deanonymisation either.
“Poor de-identification methods such as simply removing an individual’s name from the data pose a high risk of re-identification, but even better methods of de-identification may not prevent re-identification some day in the future,” notes Runnegar. “However, organisations can reduce privacy risks of re-identification by treating the de-identified data as personal data, applying good data protection practices such as data minimisation, limitations on use, access and sharing, and applying security such as encryption. Also, for data collected from groups or populations, secure multiparty computation (MPC) can be used to analyse aggregate data while protecting the privacy of the data.”
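The idea behind secure multiparty computation that Runnegar mentions can be shown with a toy version of one common building block, additive secret sharing. The Python sketch below is a single-process simulation under stated assumptions: the function names, the modulus and the sample values are hypothetical, there is no networking, and every party is assumed to follow the protocol honestly. It is not a production protocol.

```python
import secrets

# All arithmetic is done modulo a large prime; inputs must be non-negative
# and their true total must stay below this value.
MODULUS = 2**61 - 1


def share(value, n_parties):
    """Split `value` into n random-looking shares that sum to it (mod MODULUS)."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    last = (value - sum(shares)) % MODULUS
    return shares + [last]


def secure_sum(private_values):
    """Compute the total without any single party seeing another's input."""
    n = len(private_values)
    # Row i holds the shares created by participant i, one per party.
    distributed = [share(value, n) for value in private_values]
    # Party j only ever sees the j-th share from each participant.
    partial_sums = [sum(row[j] for row in distributed) % MODULUS for j in range(n)]
    # Publishing the partial sums reveals the total, not the individual inputs.
    return sum(partial_sums) % MODULUS


# Hypothetical private counts held by three hospitals.
hospital_counts = [12, 7, 30]
print(secure_sum(hospital_counts))  # 49
```

Each share on its own is a uniformly random number, so a party holding one share learns nothing about the value it came from. Only the aggregate is revealed, which is the property that makes this approach attractive for group or population statistics.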
The fact that India doesn’t have a data protection law in place – whether for personal or non-personal data – perhaps renders such good practices redundant. As the government drags its feet on introducing the law, companies themselves remain unsure as to what information should be anonymised and how, leading to a patchwork approach to privacy protection. This points to a larger classification problem in India, where definitions of personal and non-personal data, and what constitutes privacy, remain in a state of flux.
For example, as seen above, deanonymisation may allow for the identification of not only individuals, but of larger groups of people centred around specific characteristics. In this light, as Varunavi Bangia has previously argued for MediaNama, the state must “fundamentally alter (..) [its] conception of the subjects of the right to privacy from protecting individuals to protecting groups (..) The focus must (..) be on conceptualising a right available to a group not merely because each individual in that group has an independent right to privacy, but a right that belongs to the group as a group. The most important regulatory intervention is to ensure that collective rights are neither subordinated to nor seen in conflict with individual rights.”
While Indian courts have recognised that the right to privacy extends beyond individual privacy to collective rights too, iterations of India’s proposed data protection law continued to club personal and non-personal data together for regulatory purposes. “This is because of a lack of understanding of the value of non-personal data to corporations in creating group profiles,” argues Bangia.
With different re-identification scenarios outside the regulator’s ambit, accountability for breaches and harms is hard to come by.
Some hope appeared in the now-withdrawn draft Data Protection Bill, 2021. “While the Bill combined the protection of personal and non-personal data, it acknowledged that anonymisation can fail, which is why it brought all data under its regulatory ambit,” recalls Kapoor. “It imposed penalties for deanonymisation, making it a punishable offence.”
With the Bill now withdrawn, MeitY is murmuring that non-personal data will not be governed by the personal data protection law. A potential successor is months (if not years) away from being passed. In the meantime, how users are supposed to file grievances or hold data processors accountable for re-identification also remains murky.
How do citizens seek redressal for re-identification now?
“Any re-identification of non-personal data will be counted as a data breach in India,” explains Sridharan. “While we don’t have a personal data protection law, we do have the 2011 Rules. But, there are no penalties or costs actually attached to data breaches or re-identification. The only avenue people have is filing writ petitions in courts to protect their right to privacy. But, institutional bottlenecks arise in that context too.”
Currently, then, data processors may be liable for re-identification. “How these actors will be held liable is using a cocktail of existing laws,” says Jain. “This may involve stretching the provisions of the Indian Penal Code and the Information Technology Act, 2000, and applying them to individual instances of re-identification.”
Tejasi Panjiar, associate policy counsel at the Internet Freedom Foundation, adds that given this regulatory vacuum, “neither private actors nor governments are held accountable or mandated to follow international best practices like data minimisation and storage and purpose limitation. Even the way the Gopalakrishnan Committee report approached deanonymisation is very post-facto. If it occurred, the data would then fall under the personal data protection legislation and penal provisions would be imposed. We don’t have robust, enforceable policies on anonymisation that ensure that identifiable aspects of personal data are removed to a high degree.”
This approach may be at odds with global standards for anonymisation imbued in privacy laws, and against the privacy-protecting model MeitY purportedly aims to embed in its new batch of Internet laws.
For example, the European Union’s General Data Protection Regulation (GDPR) considers a dataset “anonymous” when each person is individually protected. Even when datasets are scrubbed of identifiers, if they do contain data that could lead to re-identification, then this ‘anonymised’ dataset would fall within the provisions of the GDPR.
The stringency of the GDPR’s provisions on anonymisation came to light in March 2019 when the Danish Data Protection Regulator fined taxi company Taxa approximately $180,000 for retaining data on nine million taxi rides across five years. The company argued that it was exempted from the GDPR’s provisions on data minimisation and storage limitations because the dataset was anonymised by deleting individual names. This meant it could use and store the anonymised data for much longer.
While Taxa’s actions were in line with Recital 26 of the GDPR, the Danish regulator argued that the company failed to meet the high standards of anonymisation set out in the same recital. The fact that individuals could still be easily re-identified meant that the dataset was not anonymised, subjecting it to the personal data protections of the GDPR.
How can the state approach regulation?
“There’s always a risk of fire consuming a building. But, does that mean you don’t put laws in place to mitigate the chance of a fire?” asks Mittal. “Just because a risk of deanonymisation exists doesn’t mean we throw [the potential benefits] of anonymisation out of the window. It means we need to push for stronger standards and laws to be brought in.”
“In some ways, personal data protection laws risk becoming obsolete unless they keep up with data processing technologies,” argues Mohan provocatively. “Scholars are now suggesting that we need to have a right to reasonable inference, instead. This concept discards the distinction between personal data and non-personal data, arguing that if a data processor is making an inference about a user using their data and if there is potential harm that could arise, then the user needs to be protected against that.”
Given the existing rights framework, however, the government faces a crossroads in terms of how to regulate the sector, which largely hinges on how personal and non-personal data are defined and processed.
“To govern personal and non-personal data under the same law, India’s regulatory structure for data governance may need to develop and mature quickly by learning from other countries and experiences,” says Mohan, alluding to the different ways in which group privacy can be harmed by data processing. “I appreciate that there’s an attempt by the government to shift the power dynamics of business models and data gathering practices in India. But for such non-personal data policies to work for everyone involved, we need a multi-pronged approach: robust laws for personal data protection, non-personal data, and antitrust need to work together.”
The ways in which non-personal data can be protected under a new law are nuanced and purpose-driven, argue Sridharan and Kapoor.
“The purpose of data use should be the starting point for protection,” explains Kapoor. “A farmer, for example, may use a certain kind of fertiliser for their soil. That may be because that fertiliser – and the underlying farming style – is their intellectual property. So, even non-human NPD pertaining to that fertiliser may need to be anonymised and protected to honour their intellectual property rights.”
Sridharan adds that there is a need for separate legislation that “clearly defines the rules and responsibilities for not just the state, but also the rights of the community who helped produce this non-personal data in the first place.”
Panjiar diverges, arguing that as long as non-personal data is viewed through the prism of economic value, as is the case in India, there is a need to bring it under the stringencies of personal data protection.
“As opposed to how personal data is approached, the Gopalakrishnan Committee report and the draft National Data Governance Framework very explicitly premised the regulation of non-personal data on commercial and financial motives instead of data privacy and user safety,” says Panjiar.
“When you separate the two to the extent that you don’t have strict provisions in place to regulate and protect non-personal data, then the risks that arise out of the deanonymisation of data become even more real. So, in my opinion, we need to consider the regulation of non-personal data through an expert independent body, which would most likely be the proposed Data Protection Authority (DPA) [introduced in the Data Protection Bill, 2021], and is focused on promoting user safety and privacy over commercial motives.”
“There was a reason why India’s founding fathers provided a high degree of privacy to the sharing of census data [as seen in Section 15 of the Census Act],” concludes Kumar. “There is a need to embed that similar notion of privacy into how we approach data now too. Currently, we are looking at things in an abstract fashion. That needs to change. The level of protection provided should be informed by a good understanding of where data processing policies are headed.”