By Cathy O’Neil, a data scientist and a member of the Occupy Wall Street Alternative Banking Group. Cross posted from mathbabe
I recently read an article off the newsstand called “The Rise of Big Data.” It was written by Kenneth Neil Cukier and Viktor Mayer-Schoenberger and published in the May/June 2013 edition of Foreign Affairs, which is put out by the Council on Foreign Relations (CFR). I mention this because CFR is an influential think tank, filled with powerful insiders, including people like Robert Rubin himself, and for that reason I want to take this view on big data very seriously: it might reflect the policy view before long.
And if I think about it, compared to the uber naive view I came across last week when I went to the congressional hearing about big data and analytics, that would be good news. I’ll write more about it soon, but let’s just say it wasn’t everything I was hoping for.
At least Cukier and Mayer-Schoenberger discuss their reservations regarding “big data” in this article. By contrast, last week it seemed like the only background material for the hearing, at least for the congressmen, was the McKinsey report talking about how sexy data science is and how we’ll need to train an army of data scientists to stay competitive.
So I’m glad it’s not all rainbows and sunshine when it comes to big data in this article. Unfortunately, whether because they’re tied to successful business interests, or because they just haven’t thought too deeply about the dark side, their concerns seem almost token, and their examples bizarre.
The article is unfortunately behind the pay wall, but I’ll do my best to explain what they’ve said.
First they discuss the concept of datafication, which they exemplify with the idea that we quantify friendships with “likes”: it’s the way everything we do, online or otherwise, ends up recorded for later examination in someone’s data storage units. Or maybe multiple storage units, and maybe for sale.
They formally define datafication later in the article as a process:
… taking all aspects of life and turning them into data. Google’s augmented-reality glasses datafy the gaze. Twitter datafies stray thoughts. LinkedIn datafies professional networks.
Datafication is an interesting concept, although as far as I can tell they did not coin the word, and it has led me to consider its importance with respect to intentionality of the individual.
Here’s what I mean. We are being datafied, or rather our actions are, and when we “like” someone or something online, we are intending to be datafied, or at least we should expect to be. But when we merely browse the web, we are unintentionally, or at least passively, being datafied through cookies that we might or might not be aware of. And when we walk around in a store, or even on the street, we are being datafied in a completely unintentional way, via sensors or Google glasses.
This spectrum of intentionality ranges from us gleefully taking part in a social media experiment we are proud of to all-out surveillance and stalking. But it’s all datafication. Our intentions may run the gamut but the results don’t.
They follow up their definition in the article, once they get to it, with a line that speaks volumes about their perspective:
Once we datafy things, we can transform their purpose and turn the information into new forms of value.
But who is “we” when they write it? What kinds of value do they refer to? As you will see from the examples below, mostly that translates into increased efficiency through automation.
So if at first you assumed they mean we, the American people, you might be forgiven for re-thinking the “we” in that sentence to be the owners of the companies which become more efficient once big data has been introduced, especially if you’ve recently read this article from Jacobin by Gavin Mueller, entitled “The Rise of the Machines” and subtitled “Automation isn’t freeing us from work — it’s keeping us under capitalist control.” From the article (which you should read in its entirety):
In the short term, the new machines benefit capitalists, who can lay off their expensive, unnecessary workers to fend for themselves in the labor market. But, in the longer view, automation also raises the specter of a world without work, or one with a lot less of it, where there isn’t much for human workers to do. If we didn’t have capitalists sucking up surplus value as profit, we could use that surplus on social welfare to meet people’s needs.
The big data revolution and the assumption that N=ALL
According to Cukier and Mayer-Schoenberger, the Big Data revolution consists of three things:
- Collecting and using a lot of data rather than small samples.
- Accepting messiness in your data.
- Giving up on knowing the causes.
They describe these steps in rather grand fashion, by claiming that big data doesn’t need to understand cause because the data is so enormous. It doesn’t need to worry about sampling error because it is literally keeping track of the truth. The way the article frames this is by claiming that the new approach of big data is letting “N = ALL”.
But here’s the thing, it’s never all. And we are almost always missing the very things we should care about most.
So for example, as this InfoWorld post explains, internet surveillance will never really work, because the very clever and tech-savvy criminals that we most want to catch are the very ones we will never be able to catch, since they’re always a step ahead.
Even the example from their own article, election night polls, is itself a great non-example: even if we poll absolutely everyone who leaves the polling stations, we still don’t count people who decided not to vote in the first place. And those might be the very people we’d need to talk to to understand our country’s problems.
Indeed, I’d argue that the assumption we make that N=ALL is one of the biggest problems we face in the age of Big Data. It is, above all, a way of excluding the voices of people who don’t have the time or don’t have the energy or don’t have the access to cast their vote in all sorts of informal, possibly unannounced, elections.
Those people, busy working two jobs and spending time waiting for buses, become invisible when we tally up the votes without them. To you this might just mean that the recommendations you receive on Netflix don’t seem very good, because most of the people who bother to rate things on Netflix are young and have different tastes than you, which skews the recommendation engine towards them. But there are plenty of much more insidious consequences stemming from this basic idea.
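To make the skew concrete, here’s a toy sketch of that rating scenario. The numbers are entirely invented (they have nothing to do with Netflix’s actual data); the point is just that an average computed over “everyone who rated” can sit far from the average over everyone, even when the rating dataset feels complete:

```python
# Hypothetical population: 60% of people never rate anything,
# 40% rate actively -- and the two groups have different tastes.
population = (
    [{"rates": False, "score": 2.0}] * 6000   # would give the show 2/5, but stay silent
    + [{"rates": True, "score": 4.5}] * 4000  # actually rate it 4.5/5
)

# The average over the whole population (what we'd like to know).
true_mean = sum(p["score"] for p in population) / len(population)

# The average over the data we actually collect (what "N = ALL" really is).
observed = [p["score"] for p in population if p["rates"]]
observed_mean = sum(observed) / len(observed)

print(f"true average:     {true_mean:.2f}")      # 3.00
print(f"observed average: {observed_mean:.2f}")  # 4.50 -- N was never ALL
```

The “complete” dataset reports 4.5 when the population average is 3.0; no amount of extra volume from the same raters fixes the gap, because the missing people are missing systematically.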
Another way in which the assumption that N=ALL can matter is that it often gets translated into the idea that data is objective. Indeed the article encourages exactly that attitude:
… we need to be particularly on guard to prevent our cognitive biases from deluding us; sometimes, we just need to let the data speak.
And later in the article,
In a world where data shape decisions more and more, what purpose will remain for people, or for intuition, or for going against the facts?
This is a bitch of a problem for people like me who work with models, know exactly how they work, and know exactly how wrong it is to believe that “data speaks”.
I wrote about this misunderstanding here, in the context of Bill Gates, but I was recently reminded of it in a terrifying way by this New York Times article on big data and recruiter hiring practices. From the article:
“Let’s put everything in and let the data speak for itself,” Dr. Ming said of the algorithms she is now building for Gild.
If you read the whole article, you’ll learn that this algorithm tries to find “diamond in the rough” types to hire. A worthy effort, but one that you have to think through.
Why? Say you compare women and men with the exact same qualifications who have been hired in the past. Looking into what happened next, you learn that those women tended to leave more often, get promoted less often, and give more negative feedback on their environments than the men did. Your model might then be tempted to hire the man over the woman the next time the two showed up, rather than looking into the possibility that the company doesn’t treat female employees well.
In other words, ignoring causation can be a flaw, rather than a feature. Models that ignore causation can add to historical problems instead of addressing them. And data doesn’t speak for itself, data is just a quantitative, pale echo of the events of our society.
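Here’s a minimal, purely hypothetical sketch of that failure mode (invented numbers, not Gild’s actual algorithm): a “let the data speak” scorer that ranks candidates by their group’s historical retention rate, with no notion of *why* the rates differ.

```python
# Hypothetical historical hires: equal qualifications, but women left more
# often -- say, because of a hostile environment the data never records.
history = (
    [{"gender": "M", "stayed": True}] * 80 + [{"gender": "M", "stayed": False}] * 20
    + [{"gender": "F", "stayed": True}] * 50 + [{"gender": "F", "stayed": False}] * 50
)

def retention_rate(gender):
    """Fraction of past hires of this gender who stayed."""
    group = [h for h in history if h["gender"] == gender]
    return sum(h["stayed"] for h in group) / len(group)

def score(candidate):
    """A naive 'data speaks for itself' model: rank candidates by their
    group's historical retention, ignoring the cause of the difference."""
    return retention_rate(candidate["gender"])

alice = {"gender": "F", "qualifications": "identical"}
bob = {"gender": "M", "qualifications": "identical"}

print(score(bob) > score(alice))  # True: the model reproduces the discrimination
```

The model is “accurate” about the past in a narrow sense, which is exactly the problem: it launders a causal story (the company drives women out) into a predictive one (women are riskier hires).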
Some cherry-picked examples
One of the most puzzling things about the Cukier and Mayer-Schoenberger article is how they chose their “big data” examples.
One of them, the ability of big data to spot infection in premature babies, I recognized from the congressional hearing last week. Who doesn’t want to save premature babies? Heartwarming! Big data is da bomb!
But if you’re going to talk about medicalized big data, let’s go there for reals. Specifically, take a look at this New York Times article from last week where a woman traces the big data footprints, such as they are, back in time after receiving a pamphlet on living with Multiple Sclerosis. From the article:
Now she wondered whether one of those companies had erroneously profiled her as an M.S. patient and shared that profile with drug-company marketers. She worried about the potential ramifications: Could she, for instance, someday be denied life insurance on the basis of that profile? She wanted to track down the source of the data, correct her profile and, if possible, prevent further dissemination of the information. But she didn’t know which company had collected and shared the data in the first place, so she didn’t know how to have her entry removed from the original marketing list.
Two things about this. First, it happens all the time, to everyone, but especially to people who don’t know better than to search online for diseases they actually have. Second, the article seems particularly spooked by the idea that a woman who does not have a disease might be targeted as being sick and face crazy consequences down the road. But what about a woman who actually is sick? Does that person somehow deserve to have her life insurance denied?
The real worries about the intersection of big data and medical records, at least the ones I have, are completely missing from the article. They did mention that “improving and lowering the cost of health care for the world’s poor” will inevitably make it “necessary to automate some tasks that currently require human judgment.” Increased efficiency once again.
To be fair, they also talked about how Google tried to predict the flu in February 2009 but got it wrong. I’m not sure what they were trying to say except that it’s cool what we can try to do with big data.
Also, they discussed a Tokyo research team that collects data on 360 pressure points with sensors in a car seat, “each on a scale of 0 to 256.” I think that last part about the scale was added just so they’d have more numbers in the sentence – so mathematical!
And what do we get in exchange for all these sensor readings? The ability to distinguish drivers, so I guess you’ll never have to share your car, and the ability to sense if a driver slumps, to either “send an alert or automatically apply brakes.” I’d call that a questionable return for my investment of total body surveillance.
Big data, business, and the government
Make no mistake: this article is about how to use big data for your business. It goes ahead and suggests that whoever has the biggest big data has the biggest edge in business.
Of course, if you’re interested in treating your government office like a business, that’s gonna give you an edge too. Their example, Bloomberg’s big data initiative, boils down to efficiency gains (read: we can do more with less, i.e. we can start firing government workers, or at least never hire more).
As for regulation, it is pseudo-dealt with via the discussion of market dominance. We are meant to understand that the only role government can or should have with respect to data is how to make sure the market is working efficiently. The darkest projected future is that of market domination by Google or Facebook:
But how should governments apply antitrust rules to big data, a market that is hard to define and is constantly changing form?
In particular, there’s no discussion of how we might want to protect privacy.
Big data, big brother
I want to be fair to Cukier and Mayer-Schoenberger, because they do at least bring up the idea of big data as big brother. Their topic is serious. But their examples, once again, are incredibly weak.
Should we find likely-to-drop-out boys or likely-to-get-pregnant girls using big data? Should we intervene? Note the intention of this model would be the welfare of poor children. But how many models currently in production are targeting that demographic with that goal? Is this in any way at all a reasonable example?
Here’s another weird one: they talked about the bad metric used by US Secretary of Defense Robert McNamara in the Vietnam War, namely the number of casualties. By framing this in the current language of statistics, though, the article gives us the impression that we could just be super careful about our metrics in the future and: problem solved. As we experts in data know, however, it’s a political decision, not a statistical one, to choose a metric of success. And it’s the guy in charge who makes that decision, not some quant.
If you end up reading the Cukier and Mayer-Schoenberger article, please also read Julie Cohen’s draft of a soon-to-be published Harvard Law Review article called “What Privacy is For” where she takes on big data in a much more convincing and skeptical light than Cukier and Mayer-Schoenberger were capable of summoning up for their big data business audience.
I’m actually planning a post soon on Cohen’s article, which contains many nuggets of thoughtfulness, but for now I’ll simply juxtapose two ideas surrounding big data and innovation, giving Cohen the last word. First from the Cukier and Mayer-Schoenberger article:
Big data enables us to experiment faster and explore more leads. These advantages should produce more innovation.
Second from Cohen, where she uses the term “modulation” to describe, more or less, the effect of datafication on society:
When the predicate conditions for innovation are described in this way, the problem with characterizing privacy as anti-innovation becomes clear: it is modulation, not privacy, that poses the greater threat to innovative practice. Regimes of pervasively distributed surveillance and modulation seek to mold individual preferences and behavior in ways that reduce the serendipity and the freedom to tinker on which innovation thrives. The suggestion that innovative activity will persist unchilled under conditions of pervasively distributed surveillance is simply silly; it derives rhetorical force from the cultural construct of the liberal subject, who can separate the act of creation from the fact of surveillance. As we have seen, though, that is an unsustainable fiction. The real, socially-constructed subject responds to surveillance quite differently—which is, of course, exactly why government and commercial entities engage in it. Clearing the way for innovation requires clearing the way for innovative practice by real people, by preserving spaces within which critical self-determination and self-differentiation can occur and by opening physical spaces within which the everyday practice of tinkering can thrive.
….aaah at last the worlds largest focus group, bliss, everything will be perfect now…
….original thought flew out the window long ago and no one even knew…..
Time for ethics in data collection and analysis.
I know when somebody leaves your house, when you take a shower, the temperature in your house, when you open a window, when you get up; your boiler told me. I know when you boil a kettle, switch on an electric fire, leave things on standby, charge your phone, watch television; your electric meter told me. I know your route to work, when you drive too fast, when you don’t walk, when you go to the gym or not; the road cameras and your car told me. I know who your friends are, what they like, what you like that is different to them, when you sit at the computer and don’t enter anything, what TV shows, books, and games you like; Google, Facebook, your computer and your TV told me. I know what food you buy, what treats you like, when you have visitors, your taste in clothing; your loyalty card told me. I probably know enough to blackmail you if I choose to.
I don’t know everything about you, but I am working on better monitoring. Perhaps by using individual room temperature monitoring to keep all rooms at an even temperature I will be able to tell when you have sex. Perhaps by fitting tracking into pills I will know whether you are following a course of medication correctly, and of course whether you should be insurable. Perhaps by linking together big data I will know whether you are a skier, boxer or sharp shooter and adjust your insurance accordingly. Perhaps by fitting tracking into your food I will know when and how much you eat and drink and whether it poses an insurance risk. Perhaps I will fit tracking into your shoes, so that I can tell how often you wear them and your replacement cycle, and of course where you go and when and how much walking you do. I can dream about this level of information from big data; still, I will not have to wait long, since many of these are currently in trial.
Ok so that was a bit tin foil hat, but I must confess to becoming more aware of the data that I am giving out. I deliberately choose when I use my loyalty cards, I will deliberately choose different devices to access certain information (work/personal), I will deliberately switch things off if I don’t need them. It’s not that I am particularly concerned about people knowing all the individual aspects, but a supposedly complete picture of yourself can lead to the wrong stereotyping, and I would rather that there were conflicting views out there about who I am than an incorrect stereotype. If enough people become data aware, then the concept of big data will have limitations, even if big data ignores people who deliberately give conflicting or incomplete data.
Personally, I take it as my civic duty to throw as many kinks in corporate data-collection as possible. I know it’s just little ol’ me, but it’s something (and it’s amusing).
I purposefully “like” things I don’t and claim things are “relevant to me” when they aren’t. Because of this, Hulu thinks I’m Latino (Mexican advertising is more entertaining/less annoying than American advertising, especially as I don’t speak Spanish) and Facebook thinks I’m much older/younger than I really am (can’t remember which).
Lie to big corporations whenever possible, they lie to you all the time.
Funny. You just changed my whole attitude. I used to chastise all those cold callers from India. Now it’s all computers asking automated questions. Much easier to fool. This could be downright fun.
Yes. I do this all the time – for example, sign up for Spotify – completely invented profile, and where possible, nonsense entries.
The financial benefit created through digital identity can be massive not just to Web 2.0 companies, but to the economy as a whole; where the public sector and health care industry stand to profit most from personal data applications, driven by two important factors:
The exponentially expanding datafication, driven not only by increased volume but also by new types of data sets – the amount of available data is expected to reach roughly 7 zettabytes by 2015, more than 1,000 gigabytes (twice the capacity of a standard laptop) for every person on earth. Additionally, the rapidly improving ability to process, analyze, commoditize and market that data.
For instance, Twitter sold access to the billions of tweets its users openly published through Twitter’s service. Third-party companies syndicating this data used it, among other things, to capture the stock market sentiment. The public sector too has tried to monetize personal data; in the US the state of Oklahoma realizes about $13 million annually by selling information it collects from drivers. This and similar European examples of secondary monetization raise questions about the balance of the expected return (compensation) and the potential loss to citizens. It seems to me that citizens are the primary source of this data; they should also be primary beneficiaries of its commoditization.
Especially given that this data is being marketed/commoditized without the ‘owner’s’ explicit understanding or consent (but still requires citizens’ trust and collaboration), there should be remuneration or ‘fair compensation’ to individuals or communities – after all, it’s personal data worth billions of dollars. If GE makes money by marketing personal data collected from drivers throughout California, perhaps GE owes California something in return – a share of the profit. Maybe the answer is don’t fight the market: play the market and make it pay for the use of our personal data.
The water company recently installed a meter on the outside of my house so they can monitor my water use from headquarters without having to send out a man to read the meter. It is probably not long into the future when monitors will be installed on our bodies to monitor how much air we breathe. Then we’ll get a monthly bill for oxygen consumption.
On the other hand, big data epigenetics and bioinformatics promises to result in gains of understanding on how to treat cancer, etc. at least for those few of us who are wealthy enough to pay for the personalized medicine treatments.
I have to think that any data collected will of course be fitted to the master plan. Sort of becoming another reason to fleece the flock. If the data showed that 90% of citizens opposed war efforts, do you suppose that’d drive any decisions to halt it? Suppose that 90% opposed corrupt officials and thought they should be strung from lamp posts. Do you think that’d inspire lamp post manufacturing businesses? Suppose that this Data stuff is just a bunch of crap?
Here is my admittedly crude analysis.
I think big data is used for two main purposes: there is the government surveillance and then there is the private sector surveillance which is about collecting and selling data with the eventual end use being to sell you something, right?
The returns on the investment to those in the private sector buying the data are going to be less and less. I guess one answer will be “collect more data!”. But eventually they have to run up against lack of demand– and I’m speaking of lack of demand in the economy.
We’ll have this huge infrastructure built up and the industry will turn increasingly to the government as the buyer of their product.
And yes, it is a little silly in this day and age to speak of the private sector and government as two separate entities, but it’s the best I could do.
That’s a really good point I was going to mention. Big data (as they refer to big data) is largely already here. I wouldn’t be surprised if the returns on datafication are already diminishing. I’ve heard facebook has had incredible difficulty getting the yields it wants on the data it collects.
I’m not sure why the government in the end would end up buying the excess supply from big data. Perhaps the government will, but I feel the government already receives most of the data it can work with.
Well, I suppose that private firms are already the ones doing most of the analysis for the government. They’ll come up with all sorts of reasons that their services are needed– whether overselling the ability to solve social problems or for plain ol’ spying. I think politicians are always way too impressed with anything high tech.
That’s the economy we have– the government is the customer, not the employer, of last resort.
What if I don’t want to participate in this brave new world endeavor?
Why do I think things such as datafication and gamification will just result in more harm than good in the long run? There are certain areas where they work remarkably well and certain areas where they do not. I honestly expect there to be some blowback in the near future against these trends. At the same time though… never underestimate the power of being a boiling frog.
Great read! As a data analyst with an ill-informed boss, I felt like you just summed up my life. Few of the people who make decisions using data or data analysis everyday really consider the collection of that data or the thought behind the analysis. Usually the people making the decisions are not the people who understand the collection and analysis. Some of these people have too much power. Since this is Naked Capitalism, two relevant examples are Ken Rogoff, who thinks he can throw generic economic indicators into SPSS and meaningfully explain recessions, and the OCC, whose ignorance (perhaps willful) led them to the world of simple first-n samples.
To view the exponential growth of data and computing power by itself as a discrete part of our culture is a mistake–not that you are doing that, but that’s the tendency. You cannot evaluate or make decisions about technology unless you come to an understanding as to what the purpose of it is. Yes, I know that technology, at this point, has a “mind” of its own. But what is it? It is the mechanization of a part of our consciousness that needs to solve concrete problems.
In the I Ching, this part of us is called the “Inferior Man” not as a put down but as a function. The “Superior Man” represents the functioning of our higher brain and things like poetry, music, philosophy, spirituality, love, ideas, science (in the old sense of the word) live in that part of our consciousness and it, according to the ancient Chinese, ought to be running the show. However, we made a bizarre turn in the modern age in that we made the business of practical affairs the ruler of our society, i.e. capitalism/consumerism. Thus the big Inferior Man corporations made stuff for the little Inferior Men inside of us to consume thus crowding out the higher-brain functions when, in fact, the original idea was to use technology to solve our practical problems so we could have time to be philosophers and artists.
In the 1950s serious thinkers began to wonder what we would do with all the time we would have when we had cheap energy and automation meeting all our needs. What happened? We decided, collectively, that we had to live as if we lived in a culture of scarcity, because the prospect of spending time with our higher brain functions just seemed, well, elitist or something. So what happened? We are now slaves to technology–we work for the system, it doesn’t work for us. We acquiesce because we’ve stupefied ourselves with the toxic (to the human spirit) consumerism we’re all bound to, because we fail to even inquire whether there is some other meaning to life.
So data will keep being put into repositories just because it’s there. It will be used and misused depending on the degree of fear and ignorance that exists in the hearts of those who control the data and write the routines and meta-data. How it’s used, eventually, will be out of our control as systems digest this stuff and feed into the singularity, which will have its own agenda if we refuse, as we are doing, to give the Inferior Man direction.
In hexagram after hexagram the I Ching warns us that if we put the Inferior Man in charge we are headed for disaster.
“There is always danger in circumstances of abundance. The inferior man pushes forward through excessive ambition, thereby losing touch with men of talent and virtue in positions below him.”
The assumption that surveillance will cow everyone is incorrect. It will provoke a certain number of people into acts of petty and not so petty rebellion.
Of course propaganda will paint these individuals as enemies of the people rather than enemies of the state. But they will persist.
On a side note, apparently 44% of Republicans believe that armed revolution in order to protect liberties might be necessary in the next few years. (27% of Independents and 18% of Democrats agree)
Granted it was a small poll, but that’s shocking.
yves, I think the time has come to start thinking about a Podcast. Nothing which would take a lot of your or anyone’s time. 5 or 10 minutes a week from everyone who contributes or guest speakers. Roll it into 30 minute weekly program. It’s time to start exploiting other media. I am volunteering my time to edit, email your spot and wrap some music around it and push it out to iTunes. Contact me if you would like to do some trial runs.
I gave up reading this article about one fourth of the way through. The style of writing made it incomprehensible for me.
Examples: “The article is unfortunately behind the pay wall”
“First they discuss the concept of datafication, which they exemplify by the idea that we quantifying friendships with “likes”.
“They formally define later in the article as a process”
“Our intentions may run the gambit but the results don’t.”
Ms. O’Neil: Please get a copy of Strunk and White’s book, “The Elements of Style”.
““Let’s put everything in and let the data speak for itself,” Dr. Ming said of the algorithms she is now building for Gild.”
The very idea that “the data speaks for itself” is patently absurd.
Everyone who has ever written an undergraduate lab report knows that there is a standard section called “interpretation of results.” It’s not in there yelling at your 18 year old brain for no reason.
Could it be that we jump through so many putatively educational hoops that we’ve forgotten our foundations?
Also, great article on de-skilling from The Jacobin, itself becoming a note of sanity in a world of minds set on automatic pilot.