Last month Pentaho announced its Business Analytics 5.0 product. Pentaho is emphasizing the product's data blending capability. Pentaho isn’t the only player in this particular game, but let’s stick with them for this discussion.

In this article on Datanami, Chuck Yarbrough of Pentaho is quoted as saying:

It has become very apparent that the real value of big data is not necessarily in the data, but in the combination of that data with other relevant data from existing internal and external systems and sources.

(A discussion of the meaning of data blending, as well as some discussion of Pentaho’s product, here.) 

Combining data is a potential legal issue.

Here’s the thing: much of the data being analyzed comes with rules attached. The rules could be in an agreement with a data provider that limits the uses that can be made of the data. The rules could be in a privacy policy that governs the collection of the data and the permitted uses of the collected data. Or the rules could be in laws or regulations.

So the problem is:

You want to use Data Set X for an analysis project.  The rules attached to Data Set X permit that, but you also want to blend some Data Set Z data into Data Set X for that project.  The rules attached to Data Set Z don’t allow Data Set Z to be used for that project. 

How are you going to prevent the unauthorized blending?

I happily note that Pentaho also references the data governance capability of its product, which I hope will give their customers the ability to maintain the separation the applicable rules require. 

They say:

Since the data is left in its original landing place, it maintains whatever level of data governance and security it was given when it was first stored, making audits easier.

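However, having the capability and using the capability are obviously two different things. So the concern here is the phrase “whatever level of data governance and security it was given when it was first stored.” If little or no governance was applied when the data first landed, that is all the data carries with it.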

For data governance to address this blending issue, users have to go to the trouble of establishing and maintaining the link between each data set and the rules that apply to it. This is where I get concerned. I was concerned before, but with each advance in capability comes an advance in the risk of bad outcomes.
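
To make that link a little more concrete, here is a rough sketch in Python of the Data Set X and Data Set Z problem described above. It is my own illustration, not anything taken from Pentaho's product. The names DataSet, permitted_purposes, check_blend and BlendNotPermitted are all hypothetical, and it assumes the rules for each data set can be boiled down to a list of permitted purposes, which real agreements and regulations often can't be.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataSet:
    """A data set together with the purposes its rules allow (hypothetical model)."""
    name: str
    permitted_purposes: frozenset


class BlendNotPermitted(Exception):
    """Raised when at least one data set's rules do not allow the purpose."""


def check_blend(purpose, *data_sets):
    """Allow a blend only if every data set involved permits the purpose."""
    blocked = [ds.name for ds in data_sets
               if purpose not in ds.permitted_purposes]
    if blocked:
        raise BlendNotPermitted(
            f"{', '.join(blocked)} may not be used for '{purpose}'"
        )
    return True


# Assumed example rules: X may be used for the project, Z may not.
data_set_x = DataSet("Data Set X", frozenset({"churn_analysis", "billing"}))
data_set_z = DataSet("Data Set Z", frozenset({"billing"}))

try:
    check_blend("churn_analysis", data_set_x, data_set_z)
except BlendNotPermitted as err:
    print(f"Blend refused: {err}")
```

The check itself is trivial. The hard part is the part the code takes for granted: someone has to record the permitted purposes for each data set in the first place, and keep them current as agreements, policies and laws change.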

That’s the simple data blending issue. 

There’s a harder one. 

Where this is heading, of course, is the ability to use existing data sets to discover data that is otherwise unavailable. Yes, I know that’s kind of the whole point, but the question is which otherwise unavailable data. Maybe it’s personal data. We’ve previously discussed this here at Big Data and the Law. Regulating the possible is always a challenge, but as noted in that earlier post, regulators are thinking about it.

All of that aside, looking at Pentaho’s website has pretty much sold me on the idea that one day mere mortals like myself will be able to use data analytics products. That’ll be fun, I think.

