The illusion of user privacy in anonymous datasets: when sharing too much anonymous information ends up creating new PII
The GDPR in Europe is a useful legal framework that has brought some good rules to what was, until a year and a half ago, a totally unregulated space - the management of user privacy when it comes to digital data. It's far from perfect, but it's something.
However, since its very first draft I've been concerned about the possibility of freely exchanging bulk "anonymized data". I strongly advocate the value of sharing large anonymized datasets, especially with research institutes whose models get better as more data is fed into them (with a high potential for making our lives better as well). But I've always had one question in mind: what's the maximum amount of anonymous, aggregated information about a particular person that can be shared in a dataset without that individual being uniquely identifiable?

In other words: we all agree that no dataset containing my first and last name, email address or phone number can be shared with third parties without my consent - that's the textbook definition of personally identifiable information (PII). But if Google shares with a third party a dataset containing anonymous records about a male user aged 30-35, born in Italy, who lives in Amsterdam, works in fintech, regularly purchases sensors and robot motors from Pimoroni, likes Japanese and Thai cuisine, plays guitar, is signed up to IEEE and regularly flies to Portugal for surfing, what are the odds that a filter built that way would return more than a single record? Note that none of these attributes is PII by itself: you can't identify a single individual just by knowing that he's a male aged between 30 and 35. But if the combination of all these "public" attributes lets a third party pinpoint a single individual, then we should also treat their combination as PII.
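The narrowing effect is easy to see on a toy example. The data and attribute names below are entirely hypothetical, just a minimal sketch of how each attribute alone matches many records while their combination can collapse to one:

```python
# Hypothetical anonymous records: no name, email or phone number anywhere.
records = [
    {"sex": "M", "age_band": "30-35", "country": "IT", "city": "Amsterdam", "sector": "fintech"},
    {"sex": "M", "age_band": "30-35", "country": "IT", "city": "Rome", "sector": "retail"},
    {"sex": "M", "age_band": "30-35", "country": "NL", "city": "Amsterdam", "sector": "fintech"},
    {"sex": "F", "age_band": "30-35", "country": "IT", "city": "Amsterdam", "sector": "fintech"},
]

def matches(filters):
    """Return the records matching every attribute in `filters`."""
    return [r for r in records if all(r[k] == v for k, v in filters.items())]

# Each extra "harmless" attribute shrinks the candidate set.
print(len(matches({"sex": "M"})))                                        # 3 records
print(len(matches({"sex": "M", "country": "IT"})))                       # 2 records
print(len(matches({"sex": "M", "country": "IT", "city": "Amsterdam"})))  # 1 record: unique
```

With just three coarse attributes the filter already collapses to a single record here; with the dozen attributes in the example above, a real-world dataset behaves the same way.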
My question so far wasn't so much whether such cross-attribute identifications were possible, but how common they were in the anonymized datasets shared today. A paper recently published in Nature analyzed some of those datasets and calculated a "uniqueness index" for the individuals behind the samples. The results are quite disturbing: 99.98% of Americans can be correctly re-identified in a dataset that shares at least 15 attributes about them, even in anonymized datasets with added noise and range fuzziness.

Working on stripping PII out of datasets has certainly been a big step forward, but now we need to iterate by expanding the definition of PII to aggregate anonymous data that can still uniquely pinpoint an individual. Before being shared, each dataset should be screened to make sure that there's no set of values that, grouped together, makes a query collapse onto a single individual. It's not technically difficult to do, but it's challenging to enforce, and we need the tech companies that collect and share tons of user data to be on board with this process.
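The screening step is essentially a k-anonymity check, and a minimal version fits in a few lines. This is a sketch under assumed record and attribute names, not a production anonymization tool: count how many records share each combination of quasi-identifier values, and flag any combination that occurs fewer than k times, since those records are re-identifiable:

```python
from collections import Counter

def rare_combinations(records, quasi_identifiers, k=2):
    """Return the attribute-value combinations shared by fewer than k records.

    Any combination returned here pinpoints (or nearly pinpoints) an
    individual and should be suppressed or generalized before sharing.
    """
    counts = Counter(tuple(r[a] for a in quasi_identifiers) for r in records)
    return [combo for combo, n in counts.items() if n < k]

records = [
    {"sex": "M", "age_band": "30-35", "city": "Amsterdam"},
    {"sex": "M", "age_band": "30-35", "city": "Amsterdam"},
    {"sex": "F", "age_band": "30-35", "city": "Amsterdam"},
]

# ("F", "30-35", "Amsterdam") appears only once -> that record is unique.
print(rare_combinations(records, ["sex", "age_band", "city"]))
```

A real screen would also have to consider every subset of attributes an adversary might filter on, and combine suppression with generalization (wider age bands, coarser locations), which is where the enforcement challenge lies.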