Discovered Personal Information Will be a Big Problem for Big Data

Just in case you didn’t read about the study done by Cambridge University with Facebook data, let me summarize the story. Cambridge University researchers looked at data from 58,466 Facebook members in the U.S. Specifically, they looked at Likes and demographic profiles. They also looked at test results obtained through the myPersonality application, which is a Myers-Briggs type of personality trait evaluation test.

Among other things, this is what they found:

Models proved 88% accurate for determining male sexuality, 95% accurate distinguishing African-American from Caucasian American and 85% accurate differentiating Republican from Democrat. Christians and Muslims were correctly classified in 82% of cases, and good prediction accuracy was achieved for relationship status and substance abuse – between 65 and 73%.

Of course this is personal information. It was not given to the researchers; they derived it by analyzing other information that was given to them.
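To make the idea of derived personal information concrete, here is a minimal sketch of how a trait nobody disclosed can be estimated from Likes alone. It is not the researchers' model. It is a simple naive Bayes classifier that I am swapping in for illustration, and the page names, the `TRAINING` data, and the `train` and `predict` functions are all invented.

```python
import math
from collections import defaultdict

# Invented data: each user is (set of Likes, trait known for this training user).
TRAINING = [
    ({"page_a", "page_b"}, True),
    ({"page_a", "page_c"}, True),
    ({"page_a", "page_b"}, True),
    ({"page_d", "page_e"}, False),
    ({"page_d", "page_c"}, False),
    ({"page_e"}, False),
]


def train(rows):
    """Count how often each Like appears among users with and without the trait."""
    like_counts = {True: defaultdict(int), False: defaultdict(int)}
    class_counts = {True: 0, False: 0}
    for likes, label in rows:
        class_counts[label] += 1
        for like in likes:
            like_counts[label][like] += 1
    vocabulary = {like for likes, _ in rows for like in likes}
    return like_counts, class_counts, vocabulary


def predict(model, likes):
    """Estimate the probability that a user with these Likes has the trait."""
    like_counts, class_counts, vocabulary = model
    total = sum(class_counts.values())
    log_scores = {}
    for label in (True, False):
        score = math.log(class_counts[label] / total)
        for like in vocabulary:
            # Laplace-smoothed chance that a user in this class Likes the page.
            p = (like_counts[label][like] + 1) / (class_counts[label] + 2)
            score += math.log(p if like in likes else 1 - p)
        log_scores[label] = score
    # Turn the two log scores into a probability for the "has trait" class.
    top = max(log_scores.values())
    weights = {label: math.exp(s - top) for label, s in log_scores.items()}
    return weights[True] / (weights[True] + weights[False])


model = train(TRAINING)
# A new user who never stated the trait, only their Likes.
estimate = predict(model, {"page_a", "page_b"})
```

The new user supplied nothing but Likes, yet the model produces a confident guess about the trait. That guess is the "discovered" personal information the rest of this post is concerned with.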

Assume that you obtain personal information from consumers under the terms of a privacy policy, and that privacy policy (and possibly applicable law) places limits on how you can disclose that personal information to others.  Does that privacy policy apply to different personal information that you derive from the analysis of the personal information that you received from the consumers?

What about non-personal information that can be used in data analysis to discover personal information? Will that non-personal information be treated under the law as though it were personal information for that reason?

This is not a hypothetical issue. Consider the views of Jan Philipp Albrecht, a rapporteur with respect to the European Commission’s proposed Data Protection Regulation. As reported in Spiegel Online:

Jan Philipp Albrecht isn’t pleased with the European Commission definition of personal data as laid out in Reding’s draft. The reason is that, taken individually, many pieces of data may not be considered to be personal. If combined, however, it may be possible to clearly identify the end user using these bits of data. These are defined as “online identifiers provided by their devices, applications, tools and protocols, such as IP addresses or cookie identifiers.”

That’s a reference to data combinations.  The data discovery issue is not far from that.
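A small sketch shows why combinations matter. The records, field names, and the `uniquely_identified` helper below are hypothetical and chosen only for illustration. Neither field singles anyone out on its own, but the pair of fields does.

```python
from collections import Counter

# Hypothetical log records with no names attached.
RECORDS = [
    {"ip_address": "203.0.113.5", "browser": "Firefox"},
    {"ip_address": "203.0.113.5", "browser": "Chrome"},
    {"ip_address": "198.51.100.7", "browser": "Firefox"},
    {"ip_address": "198.51.100.7", "browser": "Chrome"},
]


def uniquely_identified(records, fields):
    """Return the records whose combination of values in `fields` occurs exactly once."""
    def key(record):
        return tuple(record[field] for field in fields)

    counts = Counter(key(record) for record in records)
    return [record for record in records if counts[key(record)] == 1]


by_ip = uniquely_identified(RECORDS, ["ip_address"])
by_browser = uniquely_identified(RECORDS, ["browser"])
by_both = uniquely_identified(RECORDS, ["ip_address", "browser"])
```

Here `by_ip` and `by_browser` are both empty, because every value is shared by two records, while `by_both` contains all four records. Each item of data looks harmless alone, and the combination points to a single user.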

The logical approach to dealing with this as a matter of law is for privacy law to be clearly applicable to data in one’s possession and not just personal information received from another party.

In any event, and even if it’s just limited to Mr. Albrecht’s data combination concern, this presents a huge data governance problem that will have to be solved.   If it can’t be solved in a balanced manner, we’ll likely lose opportunities for the use of data in scientific research among other things.

And EU regulators have sometimes chosen privacy over science in situations that don’t make sense – to me at least.

Consider what happened (as described in an article in Nature) when the EU prevented fisheries scientists from using data gathered in a fisheries study because that data included specific information about the fishing boats involved in the study. 

At the heart of the problem is information from devices called Vessel Monitoring Systems, which are attached to many European fishing boats to record their position, direction and speed. From these data, the boats’ fishing patterns can be reconstructed, allowing researchers to assess fishing activity and, for example, examine the environmental impact on specific areas.

That information is critical to the success of the research being conducted. See this article for a more detailed explanation of why that is the case:

The balance of science and privacy has been an issue for a long time and will remain so. 

But personal information discovery is a relatively new issue. It’s yet another thorny problem that hasn’t been resolved in the law, and it will be an operational problem for Big Data no matter how the law resolves it.
