Big Data in College Admissions – Is This Wrong?

As described in this article from Fast Company, Wichita State University is using Big Data analytics in its recruiting and admissions program.  IBM is using Wichita State’s Big Data program as a case study in a White Paper you can find here.

I like this quote in the IBM White Paper, attributed to David Wright, Assistant Vice-President for Strategic Planning and Business Intelligence:

Ultimately, business analytics helps our students succeed by matching them closely to the courses that we predict will go well for them, which is healthy for their careers and healthy for our long-term future.

I get this:

Managing the business affairs of WSU requires attention to both academic standards and financial stability, and the two are closely linked.

I’m concerned about this:

Predicts the chances of success for potential students, enabling marketing teams to focus on high-quality applicants.

I’m always concerned when I see Big Data predictive tools used as a screening tool with respect to people.  I’m concerned about false certainty, asking the wrong questions and most importantly the intent behind the analysis.

Sooner or later some school is going to get caught using Big Data analytics to weed out applicants for unlawful reasons.  

In the meantime, there is at least one more legal issue to think about here.   

Schools using Big Data in their admissions practices or to develop their admissions practices will need to understand how their use of Big Data affects those practices.  Then they will need to consider whether, in that light, they are accurately communicating their admissions standards and practices to applicants. 

Here’s an example of what I mean.  Although not a case of Big Data abuse, consider what George Washington University has been doing as described here.  Apparently George Washington has been basing admission decisions on ability to pay, while at the same time asserting that its admissions process is need-blind.

George Washington might have been within its rights to discriminate among applicants based on ability to pay – but the lack of candor is a problem.  What George Washington did was take money for applications doomed to fail.  That the applicants had financial needs obviously makes it worse.  I think the Federal Trade Commission would call it fraud.


Posted in Big Data, Education

You Didn’t Use Big Data to Research That New Product. You Get Sued. You Lose.

If Big Data is supposed to be anything, it’s supposed to be a tool for gaining knowledge and insight.  There are times when having knowledge and insight is non-optional.  One of those times is when you make products that can hurt people.

As we have previously noted here at Big Data and the Law, it’s dangerous to make broad, general, simple statements about legal issues, but this is a blog – so here we go.

American Law of Products Liability 3d Treatise tells us this:

For the purpose of determining whether a product manufacturer has sufficient knowledge to give rise to a duty to warn, the manufacturer is held to the degree of knowledge and skill of an expert. In their capacity as experts, manufacturers must keep abreast of scientific knowledge, discoveries, and advances, and are presumed to know what is imparted thereby. They must be aware of all current information that may be gleaned from research, adverse reaction reports, scientific literature, and other available methods. This high standard ensures that the public will be protected from dangers as those dangers are discovered.

I give particular significance to “and other available methods.”

Take medical research.  The work described in this New York Times article pretty much puts to rest any doubt that Big Data is now a legitimate available method – to be ignored at one’s own peril.

The short version:

A Stanford graduate student created an algorithm that, when used with medical records from the Stanford University Medical Center, identified a combination of two commonly prescribed drugs that causes a rise in blood sugar – one that neither drug causes on its own.

And now courts are beginning to connect Big Data as a source of knowledge and insight with manufacturers’ obligation to, as described above:

… be aware of all current information that may be gleaned from research, adverse reaction reports, scientific literature, and other available methods

This past April, in a product liability case dealing with the drug Fosamax, a Federal judge ruled on the qualifications of an individual as an expert witness.  That person is Dr. David Madigan, Professor and Chair of Statistics at Columbia University.  In qualifying him as an expert [oversimplification alert], the court also  endorsed the idea that data analysis like the work at Stanford is one way to determine what a manufacturer should know.

The court said:

In fact, “[p]harmaceutical companies, health authorities, and drug monitoring centers use SRS databases for global screening for signals of new adverse events or changes in the frequency, character, or severity of existing adverse events (AEs) after regulatory authorization for use in clinical practice.” “SRS systems provide the primary data for day-to-day drug safety surveillance by regulators and manufacturers worldwide.” In addition, the QScan software Dr. Madigan used in formulating his opinion is generally accepted by the scientific community because it “has been in widespread use for over 10 years and has been validated extensively.” Moreover, “[m]any peer-reviewed publications report results derived from QScan.”

When you read that quote, it seems obvious that the manufacturer wasn’t up to date, but I guess it wasn’t obvious to the manufacturer – even though the tool Dr. Madigan used “has been in widespread use for over 10 years.”

It makes you wonder how many businesses are doing it the same way they’ve always done it.  They might be one curious grad student away from disaster.




Posted in Big Data, Negligence, Product Liability, Technology

Manipulating Search Engine Results – Picking Winners and Losers

I didn’t know this was a thing, but apparently there is (or was) money to be made publishing mug shots.  You make that money by charging for taking the mug shots down from your web site.  Innocent people that were arrested before being exonerated (and probably some guilty people too) will pay you money to avoid having their lives ruined. 

In response to something of an outcry about this, Google has modified its search algorithm to demote these mug shot sites in search results.  In that way, the site owners lose revenue and the subjects of the mug shots are less likely to be affected by the publication of their mug shots. 

That’s nice, but it also reminds us that Google, Yahoo, Bing and the others have the power to decide winners and losers in other contexts.  We’ve talked about the power of search engine results here in the past.  Staying above the fold is life.

Posted in Privacy, Search Engine Bias

The Balkanization of the Internet Begins

The European regulatory response to NSA-gate is gathering momentum.  The focus du jour is the cloud.  This New York Times article is a nice summary of the nonsensical thinking of some Europeans.  For example, provided in the article:

Viviane Reding, the European Commission’s justice minister, said in her own statement that she wanted to see “the development of European clouds” certified to strict new European standards.

She said that European governments could promote such a move “by making sure that data processed by them are only stored in clouds to which E.U. data protection laws and European jurisdiction applies.”

“For the private sector, such European clouds could become also attractive as they could advertise, ‘These are European clouds, so your personal data is safe,’ ” she said.

That’s great.  I didn’t know that just being in Europe guarantees that your data is safe.  None of those nasty hackers from outside the EU can break through the magical EU firewall, and of course the NSA will be repelled as well.

I wonder if there will be a new EU regulation that also protects data in Europe from Europeans, like the UK, or Germany, or France and those people. 

Read a little further into the article and you’ll find some more rational thinking.  I hope the rational actors in the EU have more control over policy there than the rational actors do here.

Posted in Big Data, Data Security, EU, Privacy, Regulation

Blended Data and Data Governance

Last month Pentaho announced its Business Analytics 5.0 product.  Pentaho is emphasizing the data blending capability of the product.  Pentaho isn’t the only player in this particular game, but let’s stick with them for this discussion.

In this article on datanami, Chuck Yarbrough of Pentaho is quoted as saying:

It has become very apparent that the real value of big data is not necessarily in the data, but in the combination of that data with other relevant data from existing internal and external systems and sources.

(A discussion of the meaning of data blending, as well as some discussion of Pentaho’s product, here.) 

Combining data is a potential legal issue.

Here’s the thing: much of the data being analyzed comes with rules attached.  The rules could be in an agreement with a data provider that limits the uses that can be made of the data.  The rules could be in a privacy policy that governs the collection of the data and the permitted uses of the collected data.  Or the rules could be in laws or regulations.

So the problem is:

You want to use Data Set X for an analysis project.  The rules attached to Data Set X permit that, but you also want to blend some Data Set Z data into Data Set X for that project.  The rules attached to Data Set Z don’t allow Data Set Z to be used for that project. 

How are you going to prevent the unauthorized blending?

I happily note that Pentaho also references the data governance capability of its product, which I hope will give their customers the ability to maintain the separation the applicable rules require. 

They say:

Since the data is left in its original landing place, it maintains whatever level of data governance and security it was given when it was first stored, making audits easier.

However, having the capability and using the capability are obviously two different things.  So here the concern is: …whatever level of data governance and security it was given when it was first stored….

For data governance to address this blending issue, users have to go to the trouble of establishing and maintaining the link between each data set and the rules that apply to it.  This is where I get concerned.  I was concerned before, but with each advance in capability comes an advance in the risk of bad outcomes.
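To make that link concrete, here is a minimal sketch in Python of what tying rules to data sets might look like.  Everything in it is a hypothetical illustration: the DataSet class, the check_blend function, and the purpose names are assumptions made for this example, not anything taken from Pentaho’s product.  The idea is simply that each data set carries the purposes its rules permit, and a blend is allowed only if every source permits the project’s purpose.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataSet:
    """A data set plus the purposes its rules allow (hypothetical model)."""
    name: str
    permitted_purposes: frozenset


class BlendingNotPermitted(Exception):
    """Raised when at least one source data set may not be used for the purpose."""


def check_blend(purpose, *datasets):
    """Return the data set names if every data set permits `purpose`, else raise."""
    blocked = [d.name for d in datasets if purpose not in d.permitted_purposes]
    if blocked:
        raise BlendingNotPermitted(
            f"{', '.join(blocked)} may not be used for '{purpose}'"
        )
    return [d.name for d in datasets]


# Illustrative rules only: X may be used for both projects, Z for billing alone.
data_set_x = DataSet("Data Set X", frozenset({"churn analysis", "billing"}))
data_set_z = DataSet("Data Set Z", frozenset({"billing"}))

if __name__ == "__main__":
    print(check_blend("billing", data_set_x, data_set_z))
    try:
        check_blend("churn analysis", data_set_x, data_set_z)
    except BlendingNotPermitted as exc:
        print("Refused:", exc)
```

In the Data Set X and Data Set Z scenario, a blend for a purpose that only X permits is refused, and Z is named as the blocker.  A check like this is only as good as the rules someone went to the trouble of recording.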

That’s the simple data blending issue. 

There’s a harder one. 

Where this is heading, of course, is the ability to use existing data sets to discover data that is otherwise unavailable.  Yes, I know that’s kind of the whole point, but the question is what otherwise unavailable data.  Maybe it’s personal data.  We’ve previously discussed this here at Big Data and the Law.   Regulating the possible is always a challenge, but as noted in that earlier post, regulators are thinking about it.

All of that aside, looking at Pentaho’s website has pretty much sold me on the idea that one day mere mortals like myself will be able to use data analytics products.  That’ll be fun, I think.

Extra points for getting this reference:

“What’s all this about truth in blending?”

Emily Litella

Posted in Big Data, Data Blending, Data Governance, Privacy

The Third Element of Today’s Privacy Problem


It’s people.  We’ve talked about the NSA and governments generally.  We’ve talked about Google and Facebook and other corporate actors.  Finally now, we have to consider people.  Not their actions in the conduct of their work, however much we might dislike those (or not), but their personal actions.

For example, we have a problem with government workers snooping in places they aren’t supposed to be.  We also have a problem with employees in the private sector playing with information collected by the government.

Now let’s talk about people in the tech industry – people who work for the businesses that collect and hold lots of personal information.

There is a not-really-so-much-of-a-secret in the tech industry.  There are too many people in the tech industry with, shall we say, issues.  Often this problem manifests itself in the form of misogyny – and it goes right up to the executive ranks.   It’s not the only form of the problem though.

Another form of the problem is double standards.  Some people believe it’s OK to do things that they believe governments and corporations should not do.  In an earlier post, we discussed this in the context of using other people’s computers (without consent) to do something they think is cool.  The “greater good” it was called:

Are people willing to sacrifice their privacy for the greater good? And, if so, does this count?

Obviously this story came out before the NSA thing became a thing.

How can smart people be so stupid?  Maybe all the coddling they get from their employers gets in the way of adapting to the adult world – a world where actions and consequences have a relationship.  And perhaps, as noted in this Wired article:

When people with likeminded beliefs congregate together, they collectively move to a more extreme position.

Whatever the reason, we should be concerned about the amount of information and the computational tools accessible to people who might not have the kind of judgment we’d like them to have.  As noted in an earlier post, one wonders what kind of screening such employees go through before they get access to those tools and that information.

This is a gap in privacy regulation that must be considered.  What screening should be required before we give people access to personal information? 

So what is going on out there that we don’t know about?  I’m sure there is more than one good story to tell.

I wonder if Mr. Snowden would have told us if he worked at Google.

Posted in Big Data, Diatribe, Privacy

Wearables in the Workplace – The Internet of Creepy Things

In this Big Data world we have to remember that all of the data we collect can be misused and improperly disclosed.  We should not be collecting all the information we can collect.  We should be collecting the information we need to collect – only.  

I’ve just read a Harvard Business Review article about employers using wearables to collect data from and about their employees.  It’s behind the paywall.  The article describes a number of different ways that wearables can be used by employers, and uses the term physiolytics, which the author (H. James Wilson) defines as:

 … the practice of linking wearable computing devices with data analysis and quantified feedback to improve performance.

Here’s the creepy part: 

It’s early days for physiolytics. But over time managers in many types of companies will embrace the opportunities it offers to improve workers’ output. As with Taylor’s time and motion studies, predicting all the effects will be difficult: Although Taylorism is best remembered for sparking the age of scientific management, it was also a factor in the rise of organized labor. As wearable technology spreads, managers should keep the emphasis on creating a better team—as it was during Cam Newton’s dash. Physiolytics could then fulfill its promise as a new management science that increases organizational efficiency while heightening individual motivation.

I like the bit about how unions were in part the result of time and motion studies.  Very encouraging.  Does that mean that being attached to a bunch of sensors will result in more worker rights?  Well, consider this from the same article:

At a distribution center in Ireland, Tesco workers move among 87 aisles of three-story shelves. Many wear armbands that track the goods they’re gathering, freeing up time they would otherwise spend marking clipboards. A band also allots tasks to the wearer, forecasts his completion time, and quantifies his precise movements among the facility’s 9.6 miles of shelving and 111 loading bays. A 2.8-inch display provides analytical feedback, verifying the correct fulfillment of an order, for instance, or nudging a worker whose order is short.

The grocer has been tapping such tools since 2004, when it signed a $9 million deal for an earlier generation of wearables to put into service in 300 locations across the UK. The efficiency gains it hoped for have been realized: From 2007 to 2012, the number of full-time employees needed to run a 40,000-square-foot store dropped by 18%. That pleases managers and shareholders—but not all workers, some of whom have complained about the surveillance and charged that the system measures only speed, not quality of work.

That doesn’t please all workers.  Does the number of displeased workers include the 18% that lost their jobs, or just the ones that are getting nudged by the armband?

There’s also this from an article in Business Week about the same subject:

Some companies are using this approach to boost productivity. Bank of America (BAC) analyzed their call center operation to change how their employees took breaks, reducing turnover and increasing performance dramatically. Cubist Pharmaceuticals (CBST) found that it had too many coffee machines. By introducing centralized coffee areas it was able to increase serendipitous interactions and sales.

I guess only coffee machines lost their jobs at Cubist Pharmaceuticals, so that’s good. 

Beyond the risk of losing one’s job as a result of workforce data collection, being strapped to sensors all day is creepy. 

On the other hand, maybe managers should be concerned about the Hawthorne Effect.

Posted in Big Data, Internet of Things, Privacy, Wearables

There is Something Else to Discuss about Facebook’s Proposed New Rules – It’s Inferred Information

Our last post about Facebook’s proposed new rules was focused on the commercialization of Facebook user content.  That’s been the general focus of the discussion of Facebook’s action.  Take this from the Wall Street Journal for example.

There is something else to discuss though.  In a previous post here at Big Data and the Law we raised the issue of discovered or inferred information.  In particular, we referenced a study conducted with Facebook data.  That study proved it is possible to use disclosed personal information to discover (or infer) additional and undisclosed personal information.

With that in mind, what seems to be missing from the discussion of Facebook’s proposed new rules is Facebook’s addition of the word infer.  It appears in several places in Facebook’s proposed new Data Use Policy, so one has to assume there was some thought and purpose behind it.  What might that be?

Perhaps Facebook learned something from that study about inferred information.  Perhaps Facebook learned that they could create a whole universe of personal information that was never actually disclosed to Facebook, and they decided it would be great if they could use it.

So here is where Facebook added infer.

It’s here:

So we can show you content that you may find interesting, we may use all of the information we receive about you to serve ads that are more relevant to you. For example, this includes:

    • information you provide at registration or add to your account or timeline,
    • things you share and do on Facebook, such as what you like, and your interactions with advertisements, partners, or apps,
    • keywords from your stories, and
    • things we infer from your use of Facebook.

It’s here:

For many ads we serve, advertisers may choose their audience by location, demographics, likes, keywords, and any other information we receive or infer about users. Here are some of the ways advertisers may target relevant ads:

    • demographics and interests: for example, 18 to 35 year-old women who live in the United States and like basketball;
    • topics or keywords: for example, “music” or people who like a particular song or artist;
    • Page likes (including topics such as products, brands, religion, health status, or political views): for example, if you like a Page about Gluten-free food, you may receive ads about relevant food products; or
    • categories (including things like “moviegoer” or a “sci-fi fan”): for example, if a person “likes” the “Star Trek” Page and mentions “Star Wars” when they check into a movie theater, we may infer that this person is likely to be a sci-fi fan and advertisers of sci-fi movies could ask us to target that category.

And it’s here:

We use the information we receive, including the information you provide at registration or add to your account or timeline, to deliver ads and to make them more relevant to you. This includes all of the things you share and do on Facebook, such as the Pages you like or key words from your stories, and the things we infer from your use of Facebook.

The inference of personal data is going to be a huge deal at some point.  In the present situation with Facebook’s proposed new rules, it’s a subtle thing – likely because Facebook is seeking only a limited right to inferred information, and of course because the more obvious issues are so troublesome.  When the inferred information issue becomes more widely discussed though, people will start becoming concerned.  As noted in our previous post, that has started to happen in Europe.


Posted in Big Data, Facebook, Privacy, Social Media

Does Mark Zuckerberg Read Big Data and the Law?

Probably not.  On the other hand, we here at Big Data and the Law have said more than once that your information is the price of participating in social media.  For example – here.  Now Facebook has finally made it clear that your personal information is the price of using Facebook.   

This admission comes in revisions that Facebook has proposed to make to its Statement of Rights and Responsibilities.  Specifically:

You give us permission to use your name, profile picture, content, and information in connection with commercial, sponsored, or related content (such as a brand you like) served or enhanced by us.  This means, for example, that you permit a business or other entity to pay us to display your name and/or profile picture with your content or information, without any compensation to you.

There you have it.  You agree that Facebook can monetize your stuff. 

In all fairness, this provision is in the current Facebook Data Use Policy:

Granting us this permission to use your information not only allows us to provide Facebook as it exists today, but it also allows us to provide you with innovative features and services we develop in the future that use the information we receive about you in new ways.

That kind of says it, but this latest revelation puts the matter beyond question.  At least they’re honest, I guess.  

But that’s not really the point.   The point is that giving Facebook these rights is a necessity if you have to be on Facebook for some reason.  To get timely information about your investments for example – and that’s wrong.  Although you could maintain a blank Facebook page, which some people do.

Two more points.  First, Facebook’s current Statement of Rights and Responsibilities says this:

For content that is covered by intellectual property rights, like photos and videos (IP content), you specifically give us the following permission, subject to your privacy and application settings: you grant us a non-exclusive, transferable, sub-licensable, royalty-free, worldwide license to use any IP content that you post on or in connection with Facebook (IP License). This IP License ends when you delete your IP content or your account unless your content has been shared with others, and they have not deleted it.

This differs from the proposed new provisions in that the rights you grant are more limited.  More importantly, in theory at least those rights end at some time.   Although, as a practical matter, they live in perpetuity unless everyone that you have shared your content with actually goes to the trouble of deleting it. 

Which brings us to this final observation – as far as I can see, the new language about selling your stuff doesn’t include a time limit. 

So, if you think the poetry you put out there on Facebook might be valuable someday – well don’t put it out there.  One day you’ll wake up and find out that Facebook has sold your best work to some vacuum cleaner company.  Then you’ll be sorry.



Posted in Big Data, Facebook, Privacy

The Future of Search Results is more of the Same Problems

Search engines are important.  We use them to find goods and services, and we use them for research, as I am doing while putting this post together.

At a minimum, as a matter of commerce it is obvious that search results have a lot of power. Consider this from research done by Public Relations firm Fleishman-Hillard:

Internet search engines continue to be the most prominent tool consumers rely upon to help make purchase decisions (89 percent), indicating the ongoing importance of a strong search engine optimization strategy.

Also consider this from an article at Search Engine Watch:

New findings from online ad network Chitika confirm it’s anything but lonely at the top. According to the study, the top listing in Google’s organic search results receives 33 percent of the traffic, compared to 18 percent for the second position, and the traffic only degrades from there….

Now consider this from an article on Search Engine Journal:

While hardly a new product, Google Now is paving the way for their deep search experience by delivering information based on your life, not keywords. For example, as I leave my house Google may recommend I take an umbrella if it looks like rain, or an alternate travel route if there is a traffic jam up ahead.

Now consider how this data might be used to enhance a common search such as one for “new car.” Today, a normal search for this term might return several paid listings and 10 organic listings that cover car reviews, videos, manufacturer websites, and local car dealers.

But if you were to layer-on a level of deep personalization, that same search might factor in the following: how far you drive to work, how many kids you have, what cars your friends drive, what car photos you’ve looked at online, current dealer incentives, local dealer inventory, price range based on recent spending habits, best insurance rates, and even your favorite color.

The end result? That search would return an info card with a single car perfectly tailored to who Google thinks you are based on their in-depth user profile.

This is, of course, an expansion of search result bias in the form of personalization, and personalization of search results cuts us off from differing ideas and perspectives, as we discussed in an earlier post.  Far more comprehensive than that humble post is The Filter Bubble: How the New Personalized Web Is Changing What We Read and How We Think, a book written by Eli Pariser.  If you are interested in this topic you should read it.

I shouldn’t pick on just Google though.  Bing has been guilty of the same thing, albeit in a much less comprehensive way.  Did you know that during the 2012 election Bing made it possible to revise the content of its election page based on your political views?  You could slide a bar from right to left (or left to right if that’s your thing) and the perspective of the news would change as you did it.

There.  I’m not biased.  I’m working on a search engine though.  It will analyze your prejudices and return only those search results that directly conflict with them.

A couple more points to consider.  First, I cut this paragraph from the above quote from Search Engine Journal:

To make these recommendations Google must have access to your email, calendar, location and travel history. For most companies, this would be an impossible task, but with Google’s vast array of products they can easily get this information and much more.

So once again the price for participation is the exposure of personal information.

Finally, isn’t the disclosure of information by Google and others one of the things that has people worked up about the NSA?  As noted in an earlier post, Google can’t disclose what you don’t give them.

Posted in Big Data, Search Engine Bias, Technology