Visualization of all editing activity on Wikipedia, as illustration of Big Data/Fernanda B. Viégas

As noted earlier, the Internet can be a tool for oppression or freedom, depending on who is using it and how. Recently, I’ve been writing about Big Data, the belief that the world will know itself better if government and private concerns can just monitor our lives cheaply and efficiently in sufficient detail, via our actions and transactions on the Internet. That belief was the basis of Facebook’s experiment on one in 2,500 of its unsuspecting users.

There are good statistics-based reasons for doubting the claim. Big Data also risks joining evolutionary psychology and pop neuroscience in giving a preferred hypothesis the authority of science.

Does Big Data actually produce more accurate analyses, and therefore more confident decision making? Let’s look at the famous example of Google Flu Trends.

As Tim Harford, author of The Undercover Economist Strikes Back, explains, five years ago, Google researchers announced in Nature that

Without needing the results of a single medical check-up, they were nevertheless able to track the spread of influenza across the US. What’s more, they could do it more quickly than the Centers for Disease Control and Prevention (CDC).

All they did was correlate search terms such as “flu symptoms” or “pharmacies near me,” drawn from the top 50 million search terms, with the spread of the disease. Who could dispute the power of Big Data at that point? However,

Four years after the original Nature paper was published, Nature News had sad tidings to convey: the latest flu outbreak had claimed an unexpected victim: Google Flu Trends.

That time, Google’s estimates were overstated by almost a factor of two.

What happened? Possibly, by 2012, healthy people were using the search terms Google itself suggested due to widespread publicity about flu. Which points up a problem: The engineers had used only a correlation between search terms and flu. Correlation is not causation. Coming down with the flu does not cause people to use Google search terms on the subject, and being healthy does not cause them to avoid the terms. In the absence of a true causal relationship, the correlation was unstable.
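To make the instability concrete, here is a minimal sketch in Python. It is not Google's model or data: the weekly case counts, the assumed ten searches per flu case, and the doubling of searches under publicity are all invented for illustration, with the doubling chosen only to echo the factor-of-two overstatement described above. A straight line is fitted to search volume during a period when searches track flu closely, then applied to a week in which healthy people are searching too.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of ys = slope * xs + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical training weeks: searches rise and fall with flu cases.
train_cases = [100, 200, 300, 400, 500]
train_searches = [10 * c for c in train_cases]  # assumed 10 searches per case

slope, intercept = fit_line(train_searches, train_cases)

# Hypothetical later week: still 300 true cases, but publicity about flu
# doubles the search volume because healthy people are searching as well.
true_cases = 300
searches_with_publicity = 2 * 10 * true_cases

estimate = slope * searches_with_publicity + intercept
overstatement = estimate / true_cases

print(f"Estimated cases: {estimate:.0f}, true cases: {true_cases}")
print(f"Overstated by a factor of {overstatement:.1f}")
```

The fitted line is a perfectly good summary of the training weeks. It fails only because the link between searching and being sick changed, which a correlation alone cannot detect.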

There’s also the problem of unnoticed false positives. Harford also recounts the 2012 New York Times story by Charles Duhigg about how a father

stormed into a Target near Minneapolis and complained to the manager that the company was sending coupons for baby clothes and maternity wear to his teenage daughter. The manager apologised profusely and later called to apologise again – only to be told that the teenager was indeed pregnant. Her father hadn’t realised. Target, after analysing her purchases of unscented wipes and magnesium supplements, had.

Data magic? Not really, he notes. Lots of women who were not pregnant were probably “Targeted” with such ads, mixed with others, but just did not respond.
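A little arithmetic shows how many misses can hide behind one memorable hit. The sketch below uses made-up numbers, not Target's: a hypothetical 100,000 shoppers, 2 percent of them pregnant, and a model assumed to flag 80 percent of pregnant shoppers while also wrongly flagging 10 percent of everyone else.

```python
# All figures are hypothetical, chosen only to illustrate the base-rate effect.
shoppers = 100_000
pregnant = shoppers * 2 // 100          # assumed 2% of shoppers are pregnant
not_pregnant = shoppers - pregnant

true_positives = pregnant * 80 // 100        # assumed 80% of pregnant shoppers flagged
false_positives = not_pregnant * 10 // 100   # assumed 10% of the rest flagged in error

flagged = true_positives + false_positives
precision = true_positives / flagged    # share of flagged shoppers who are pregnant

print(f"Flagged shoppers: {flagged}")
print(f"Correctly flagged: {true_positives}")
print(f"Wrongly flagged: {false_positives}")
print(f"Share of flags that are correct: {precision:.1%}")
```

Under these assumptions, most of the shoppers who receive the coupons are not pregnant. Nobody writes a news story about them, so only the hits get noticed.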

There are fascinating problems with all types of statistical analysis, and Big Data, as it happens, does not resolve them just by being Bigger.

Incidentally, when politicians get involved with Big Data, the results can be…memorable. Commentator Jonah Goldberg recounts,

In the run-up to the midterms, the Democrats sent out letters to presumed Democratic voters in an effort to shame them into voting. “Who you vote for is your secret,” read a letter sent out by the New York State Democratic Committee. “But whether or not you vote is public record.”

“We will be reviewing voting records … to determine whether you joined your neighbors who voted in 2014.” The letter ends with a creepy, if not outright threatening, warning: “If you do not vote this year, we will be interested to hear why not.”

In that atmosphere, one can only wonder how long voters’ choices will remain a secret.

Microsoft researcher Kate Crawford warns,

As Virginia Eubanks’s work has shown, if you want to see the future of surveillance, look to poor communities. Her work with low-income Americans on welfare benefits has shown the degree to which tracking for them has been normalized, from Electronic Benefit Transfer cards recording every purchase to higher levels of neighborhood police scrutiny and camera surveillance. While these tools and techniques of data tracking have now been broadened to ensnare the whole population, their greatest impact is still felt by marginalized communities.

But she also notes something else:

…what do you do when you realize that all that data is not enough? From the Boston bombings to Malaysian Airlines flight 370, we know that data black holes exist. Even when there were direct tip-offs about the Tsarnaevs, the data didn’t set off the right red flags. These moments demonstrate why the epistemic big-data ambition — to collect it all — is both never-ending and deeply flawed. The bigger the data gets, the more small things can be overlooked.

In later columns, I hope to explore ways of addressing abuses. For now, the main thing to see is that the claims for Big Data are often unfounded, and its practical effectiveness can be tragically less than believed.

Next: What’s this about Net Neutrality? Will it work? Is it a good thing?


Denyse O’Leary is a Canadian journalist, author, and blogger.
