# Cathy O’Neil: Data Science – The Problem Isn’t Statisticians, It’s Too Many Poseurs

By Cathy O’Neil, a data scientist who lives in New York City and writes at mathbabe.org

I recently was hugely flattered by my friend Cosma Shalizi’s articulate argument against my position that data science distinguishes itself from statistics in various ways.

Cosma is a well-read, broadly educated guy, and a role model for what a statistician can be, not that every statistician lives up to his standard. I've enjoyed talking to him about data, big data, and working in industry, and I've blogged about his blogposts as well.

That's not to say I agree with absolutely everything Cosma says in his post: in particular, there's a difference between being a master at visualizations for the statistics audience and being able to put together a PowerPoint presentation for a board meeting, which some data scientists in the internet start-up scene definitely need to do (mostly this is a study in how to dumb stuff down without letting it become vapid, and in reading other people's minds in advance to see what they find sexy).

And communications skills are a funny thing; my experience is communicating with an academic or a quant is a different kettle of fish than communicating with the Head of Product. Each audience has its own dialect.

But I totally believe that any statistician who willingly takes a job titled "Data Scientist" would be able to do these things; it's a self-selection process, after all.

## Statistics and Data Science are on the same team

I think that casting statistics as the enemy of data science is a straw man play. The truth is, an earnest, well-trained and careful statistician in a data scientist role would adapt very quickly to it and flourish as well, if he or she could learn to stomach the business-speak and hype (which changes depending on the role, and for certain data science jobs is really not a big part of it, but for others may be).

It would be a petty argument indeed to try to make this into a real fight. As long as academic statisticians are willing to admit that they typically don't spend as much time (which isn't to say they never do) worrying about how long it will take to train a model as they spend worrying about the exact conditions under which a paper will get published, and as long as data scientists admit that they mostly just redo linear regression in weirder and weirder ways, there's no need for a heated debate at all.

Let’s once and for all shake hands and agree that we’re here together, and it’s cool, and we each have something to learn from the other.

## Posers

What I really want to rant about today though is something else, namely posers. There are far too many posers out there in the land of data scientists, and it’s getting to the point where I’m starting to regret throwing my hat into that ring.

Without naming names, I'd like to characterize the problematic pseudo-mathematical behavior that I witness often enough that it consistently riles me up. I'll put aside hyped-up, bullshit publicity stunts and generalized political maneuvering, because I believe that stuff speaks for itself.

My basic mathematical complaint is that it's not enough to just know how to run a black-box algorithm. You actually need to know how and why it works, so that when it doesn't work, you can adjust. Let me explain this a bit by analogy with the Rubik's cube, which I taught my beloved math nerd high school students to solve using group theory just last week.

First we solved the "position problem" for the 3-by-3-by-3 cube using 3-cycles, and proved it worked by exhibiting the group acting on the cube, understanding it as a subgroup of $S_8 \times S_{12}$, and thinking hard about things like the sign of basic actions to prove we'd thought of and resolved everything that could happen. We solved the "orientation problem" similarly, with 3-cycles.
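
The sign constraints alluded to here can be stated precisely (this is standard cube theory, not from the post itself): a configuration of the 3-by-3-by-3, written as a corner permutation and an edge permutation $(\sigma, \tau) \in S_8 \times S_{12}$ together with corner twists $x \in (\mathbb{Z}/3)^8$ and edge flips $y \in (\mathbb{Z}/2)^{12}$, is reachable by legal moves exactly when

$$\operatorname{sgn}(\sigma) = \operatorname{sgn}(\tau), \qquad \sum_{i=1}^{8} x_i \equiv 0 \pmod{3}, \qquad \sum_{j=1}^{12} y_j \equiv 0 \pmod{2}.$$

The first condition is the "position problem" check; the other two govern the "orientation problem."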

I did this three times, with the three classes, and each time a student would ask me if the algorithm is efficient. No, it's not efficient; it takes about 4 minutes, and other people can solve the cube way faster, I'd explain. But the great thing about this algorithm is that it seamlessly generalizes to other problems. Using similar sign arguments and basic 3-cycle moves, you can solve the 7-by-7-by-7 (or any size, actually) and many other Rubik's-like puzzles of other shapes as well, which none of the "efficient" algorithms can do.

Something I could have mentioned but didn't is that the efficient algorithms are memorized by their users; they are basically black-box algorithms. I don't think people understand to any degree why they work. And when they are confronted with a new puzzle, some of those tricks generalize but not all of them, and they need new tricks to deal with centers that get scrambled with "invisible orientations." And it's not at all clear they can solve, say, a tetrahedron puzzle with any success.

Back to data science. It’s a good thing that data algorithms are getting democratized, and I’m all for there being packages in R or Octave that let people run clustering algorithms or steepest descent.
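
To make concrete what the button-press hides, here is a minimal steepest-descent sketch in pure Python (a toy objective of my own, not anything from a particular package):

```python
# Steepest descent on the toy objective f(w) = (w0 - 3)^2 + 2*(w1 + 1)^2,
# whose unique minimizer is w = [3, -1].

def grad(w):
    """Gradient of the toy objective at w."""
    return [2 * (w[0] - 3), 4 * (w[1] + 1)]

def steepest_descent(w, lr=0.1, steps=200):
    """Repeatedly step in the direction of the negative gradient."""
    for _ in range(steps):
        g = grad(w)
        w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w

w = steepest_descent([0.0, 0.0])
print(w)  # converges to (approximately) [3, -1]
```

The update rule itself is one line, which is exactly why packages can hide it behind a button; but with a step size above 0.5 this particular iteration diverges, and diagnosing that kind of failure is where understanding the math stops being optional.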

But, contrary to the message sent by much of Andrew Ng’s class on machine learning, you actually do need to understand how to invert a matrix at some point in your life if you want to be a data scientist. And, I’d add, if you’re not smart enough to understand the underlying math, then you’re not smart enough to be a data scientist.
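
As a toy illustration (my own hypothetical example, not taken from the course): fitting a line by least squares means, in effect, inverting the 2-by-2 matrix $X^T X$ in the normal equations, and knowing that this is what's happening tells you exactly when the black box must fail, namely when that matrix is singular.

```python
# Least-squares fit of y = a + b*x via the normal equations, done by
# hand so the matrix inversion is visible (illustrative sketch only).

def fit_line(xs, ys):
    n = len(xs)
    # Entries of X^T X and X^T y for the design matrix X = [[1, x_i]].
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    det = n * sxx - sx * sx  # determinant of X^T X
    if abs(det) < 1e-12:
        # X^T X is singular, e.g. when every x_i is the same value.
        # A black box would silently return garbage or crash here.
        raise ValueError("design matrix is rank-deficient")
    a = (sxx * sy - sx * sxy) / det  # intercept
    b = (n * sxy - sx * sy) / det    # slope
    return a, b

print(fit_line([0, 1, 2, 3], [1, 3, 5, 7]))  # recovers (1.0, 2.0)
```

Feed it x-values that are all identical and it raises instead of "succeeding": the failure mode is predictable once you know the linear algebra underneath.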

I'm not being a snob. I'm not saying this because I want people to work hard. It's not a laziness thing; it's a matter of knowing your shit and being for real. If your model fails, you want to be able to figure out why it failed, and the only way to do that is to know how it works to begin with. Even if it worked in a given situation, when you train on slightly different data you might run into something that throws it for a loop, and you'd better be able to figure out what that is. That's your job.

As I see it, there are three problems with the democratization of algorithms:

1. As described already, it lets people who can load data and press a button describe themselves as data scientists.
2. It tempts companies to never hire anyone who actually knows how these things work, because they don’t see the point. This is a mistake, and could have dire consequences, both for the company and for the world, depending on how widely their crappy models get used.
3. Businesses might think they have awesome data scientists when they don’t. That’s not an easy problem to fix from the business side: posers can be fantastically successful exactly because non-data scientists who hire data scientists in business, i.e. business people, don’t know how to test for real understanding.

## How do we purge the posers?

We need to come up with a plan to purge the posers; they are annoying and giving data science a bad name.

One thing that will be helpful in this direction is Rachel Schutt's Data Science class at Columbia next semester, which is going to be a much-needed bullshit-free zone. Note there's been a time change that isn't reflected in the announcement yet: the class will meet once a week, on Wednesdays, for three hours starting at 6:15pm. I'm looking forward to blogging about the contents of these lectures.

1. Mike Kimel

Take a look at job listings in the "analytics" field. More often than not, it's like they're trying to hire a race car driver by asking people whether they can tell you what happens when you press that button over there on the dashboard of a Honda Odyssey. The result is that companies end up with a lot of people who have memorized the commands needed to run some of the algorithms with the sexier names in SAS, and who will happily run those algorithms whether they are appropriate or not. Such people also have the advantage of being able to state their conclusions as categorical fact, with no caveats, since they don't have a clue what implicit assumptions they've just made.

Because of that, you sometimes run into smart undergrads who haven’t been trained to know better who can extract a lot more useful information about a problem using a simple spreadsheet than a trained economist using SAS or Stata. In my personal experience, the person I know with the best SAS skills (and a Ph.D. in Economics from a good school) happens to be someone whose conclusions from the data I would not trust on anything, ever.

2. jake chase

I'm not sure why this post is here or what it means, but the business problem is easily articulated in the ancient meme: figures don't lie, but liars figure.

1. Yves Smith Post author

In case you missed it, modelers run our lives. Faith in quant models (FICO-based scoring, derivatives pricing, bank risk management) was at the heart of the crisis.

And economists also rely heavily on models, as we've discussed at length, although highly abstract ones, while data scientists get their hands dirty.

1. Capo Regime

Data and data analysis are a huge component of the economy. Analysis results support many key decisions, and when the analysis is poorly done the decisions are poor. The economy is not so much politicians braying but thousands of organizations making decisions, and many of the larger ones (as is true of government) rely on various types of decision support systems informed by analytic modeling of one type or another. As with many key unsexy things like civil engineering, water treatment, food science, sewage, and transportation/logistics, people tend to underestimate the importance of data storage and modeling. Entire firms have sunk due to poor data management and analysis; think of Cigna. This is a major issue, but because it can't be readily simplified it will not get the attention it warrants.

1. jake chase

Don't ignore the possibility that the purpose of the analysis is to justify decisions being made for reasons which have nothing to do with the data. This is particularly true in business and finance, where none of the data is hard. As for so-called "risk management," its techniques failed utterly because they proceed in blissful ignorance of the distinction between risk and uncertainty which Keynes explained in his 1921 Treatise on Probability.

1. Ruben

“And economists also rely heavily on models, as we’ve discussed at length. although highly abstract ones, while data scientists get their hands dirty.”

Modeling without abstraction will not get you far. The best model strikes a good balance between abstraction and realism.

2. Fifi

"In case you missed it, modelers run our lives. Faith in quant models (FICO-based scoring, derivatives pricing, bank risk management) was at the heart of the crisis."

Please restate: Modelers try to pretend to run our lives and miserably fail, as you will note in each of the three examples you give.

Applied black-box modeling is cargo-cult science. Period. Very much like most of economics, it gives the appearance of science to whatever self-serving conclusion fits your prejudices, and the moral cover to do whatever one wants to do anyway (or to "prove" to your boss that you are innovating, optimizing, meeting/improving your "metrics," etc.). "Hey, the data says so" is just like "Hey, a bunch of Nobel Prizes in Economics say so." You can pick the "right data" just like you can pick the "right Nobel Prize."

For an honest practitioner, black-box modeling has one use and one use only: it lets you know quickly whether there is a good chance of there being a "there" there. From there, you have to dig in and understand what the **bleep** is really going on, be it for analyzing street crime or buying patterns. Otherwise, you have at best precious little in your hands.

1. Cathy O'Neil, mathbabe

I’d agree with most of this. But unfortunately, even really crappy models are in wide use right now, and it’s scary how many of our big life decisions (mortgages, medical care, insurance, credit card offers) and small life decisions (environment on the web, offers, recommendations) are controlled by models, good and bad.

In any case, I’m just agreeing with you that, if people were being honest, they’d dig in to understand something. But people aren’t being honest, either about what they understand or their ability to dig in.

1. craazyman

Even the Math is Easy

if you take 100,000 people and divide them by 500 jobs that pay \$500,000 per year, how many people per job?

I’d say something like 200, approximately. It could vary by the day or the month or the person. Since some of the 100,000 people are likely excessively ambitious and desirous of as much money as they can acquire in any way possible, the ratio might rise to 300 or 400 to 1. This is called “Energetically Amplified Quantification” and departs from traditional mathematics in several theoretical respects.

So if there’s 200 to 400 people who will do anything, almost, for that 1 job, and the dude or dudette who gets the job can’t do the math, the question is: Will the lucky guy or girl feel bad about that?

The answer is "Probably not," as long as they don't hang out with data scientists. And what are the odds of that happening? hahaha. At least 100,000 to 1.

1. Capo Regime

Indeed, the customer relationship management (CRM) departments at corporations are full of MBAs who took one semester of watered-down data mining, who don't know what they don't know, blowing smoke up management's butt. I know of more than one person with no real math or stats training who, by virtue of software and a knack for self-promotion, is able to call themselves a knowledge officer or VP of analytics. Interestingly, contra Mathbabe's reference to the poli sci prof (of course), a lot of our problems stem from extensive innumeracy even among Ivy League graduates; think of the number of Ivy League graduates with communications degrees and law degrees who head risk management departments. John Allen Paulos wrote the best accessible book on this, Innumeracy. Or old Herb Stein: the difference between 5% and 10% is not 5% but 100%. If only more people in the managerial/policy ranks knew a bit of algebra or calculus, we would be far better off.

2. run75441

Jake:

The quants trusted an investment model in the LTCM collapse, and again in the root causes leading up to the 2008 collapse. The models are not foolproof and still require human intervention.

3. Capo Regime

Ha, ha. Reminds me of when I was an academic. Many non-math-trained (i.e., never even took freshman calculus) "social scientists," especially political scientists and criminal justice types, would undertake all sorts of complex analysis driven by their software (SPSS, Stata). They really did not know what they were doing, and often it was comical, but they and some of their peers took themselves so seriously. I have met some of these frauds in the private sector and in the corporate world. They have no idea what they are doing, but oddly enough do not know that they don't. This sad state of affairs is probably due to the proliferation of easy-to-use analytical software, and thus we have a lot of surface knowledge. The obvious downside is that if they do not understand the underlying mathematical foundations, they invariably use the wrong estimator and get a nice clean solution which is wrong. Needless to say, analytics has taken a dive in its promise.

1. dirtbagger

There is often as much art as science/mathematics in taking data and applying it in a meaningful and useful manner. Often the hardest part is to maintain clinical detachment from the data.

One of my first jobs was as a geologist with a major gold mining company. I was part of a team of ambitious rookies who were trying desperately to prove how smart we were. One of our more mundane tasks was to take thousands of samples for assays. The assay results (Ag, Au, Hg, As) would be plotted on a map with their corresponding values to see if there was any discernible trend.

We rookies would get all excited at the plotted data, as we would invariably see mineralization trends that were going to prove to be the next major gold mine find. The more seasoned geologists would humor us for a short while and then point out the myriad errors in our interpretation of the data. As for the less ethically inclined geologists and mine promoters, it is easy to understand how the old saying "a gold mine is a hole in the ground with a liar on top" came about.

4. dodahman

If you have to put "science" after it, it probably is not science. Chemistry, physics, etc., vs. social science, computer science, math science, etc.

1. Capo Regime

And social science and political science: you make a good point, dodahman. Even erstwhile computer science was contrived as a government program at several large universities, and the faculty were drawn largely from electrical engineering and mathematics.

2. Fifi

For “Computer Science”, it’s largely a side effect of language, culture and history.

In German, Dutch, French and Spanish, “Computer Science” is called “Informatik”, “Informatica”, “Informatique”, “Informática”. And it’s not considered to be a science but an engineering discipline (which it is).

3. Ian Ollmann

Even some parts of science aren’t actually science in my opinion. Quite a bit of it is just engineering. True basic science is a rare thing indeed, and usually doesn’t get the funding it deserves because few really understand its value, if it has any at all. You have to believe in the intrinsic value of knowledge per se, to see it.

5. liberal

"But, contrary to the message sent by much of Andrew Ng's class on machine learning, you actually do need to understand how to invert a matrix at some point in your life if you want to be a data scientist."

Heh. I’m about to take that class, and I certainly know how to invert a matrix. Where does A.N. suggest that?

6. Ruben

I’m glad Cathy O’Neil mentioned R and Octave. They represent openness and freedom in the data analysis and modeling world.

7. duffolonious

How much of this is a problem of specialization? I see the same problem in other fields. CompSci is full of "posers"; you really have to be in the know to ferret them out.

Things are so specialized that _no one_ is smart enough to see through all BS. One of the reasons I read this blog is because a lot of this finance stuff (or lawyerly readings of PR) is outside “my pay grade” :).

8. curlydan

Totally agree with Yves’s comment “In case you missed it, modelers run our lives.”

As Mandelbrot pointed out in the mid-60s, commodity and market prices do not follow the presumed normally distributed "random walk." These prices have undefined variance, but Wall St. quants and data analysts relied far too much on models with constraints and small variances. Guess who pays the price for their mistakes? The taxpayers are left to subsidize bad models and bad behavior.

It’s not enough to put some data into packaged software and press run.

http://www.scientificamerican.com/blog/post.cfm?id=benoit-mandelbrot-and-the-wildness-2009-03-13
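
Mandelbrot's point shows up even in a quick simulation (a sketch of my own, with a standard Cauchy standing in for a fat-tailed price process): the sample variance of a normal sample settles near the true value, while the Cauchy, whose variance is undefined, gives a sample variance dominated by a few extreme draws.

```python
# Compare sample variances: normal (thin-tailed) vs. Cauchy (fat-tailed,
# variance undefined). Seeded so the run is reproducible.
import math
import random

random.seed(42)
n = 10_000
normal = [random.gauss(0, 1) for _ in range(n)]
# Standard Cauchy draws via the inverse-CDF transform of a uniform.
cauchy = [math.tan(math.pi * (random.random() - 0.5)) for _ in range(n)]

def sample_var(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

print(sample_var(normal))  # close to the true variance, 1
print(sample_var(cauchy))  # far larger, driven by a handful of extreme draws
```

Rerun the Cauchy half with a different seed and the sample variance jumps around by orders of magnitude, which is exactly the behavior a small-variance model assumes away.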

9. LAS

I'm not committed to blaming modelers for the financial crisis, etc. Rather, I lump the modelers in with staff doing what they're paid to do, sort of like a secretary or paralegal. Look for the clients and promoters of the modeler if you want to finger the evil; they are invariably salespeople of some kind.

Lots of people love data analysis and stat programming. It isn't the worst sin, for goodness' sake. It's just that far rarer is the individual who understands and appreciates that bias, confounding, and the lack of temporal or random ordering of exposures in the data effectively invalidate everything else. Modeling does not overcome these fundamental design flaws; that needs to be more widely reinforced.

Working scientifically with data associations is soooo very unlike doing a Euclidean proof. With data, we only disprove alternative explanations for the associations we observe. After we have systematically disproved a lot of alternative explanations, then we start to believe we're on to something. Our very, very best science is less than 100% categorical.

10. Glen

The easiest way to knock out "posers" would have been to force those banks and firms that used bad models to pay the price rather than get bailed out by Timmy and Ben.

Who cares how bad a model you have when you keep making money? But models (and quants) that cause these same firms to cease to exist get dumped pretty quickly.

11. Joel3000

Even if you get the model mathematically correct you are still usually making the assumption that the past is a good predictor of the future. That might work for disease markers, but not for commodity prices. Prediction is hard, especially about the future, as the saying goes.

12. SME MOFO

I used to ask three questions when hiring analyst crews in order to quickly identify the posers.

Question #1
If I pick a number between 1 and 10, and you pick a number between 1 and 10, what is the chance that it is the same number?
Weeds out those who can't map 9 years of advanced mathematics onto a simple real-world problem. You would be shocked how many advanced degree holders failed this one. Actually, after thinking about it later, it may be a test of how nervous the candidate was and how well they performed under stress.

Question #2
What is the first derivative of e to the x?

Answer – e to the x
This question identifies those who studied math and likely had experience with the proofs of central limit theorems; it weeds out the SPSS t-test button-pressing types.

Question #3
How do you remove duplicate observations using SAS?

Answer: a number of ways; most commonly, PROC SORT with NODUP.

This question identifies those who have experience with shitty data, meaning people who actually have experience.
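
For what it's worth, Question #1 can be settled by brute-force enumeration (assuming both picks are uniform and independent, the intended answer is 1/10):

```python
# Enumerate all 10 x 10 equally likely pairs of picks and count matches.
from fractions import Fraction

matches = sum(1 for a in range(1, 11) for b in range(1, 11) if a == b)
total = 10 * 10
print(Fraction(matches, total))  # 1/10
```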

13. Paul Tulloch

gotta love the censoring of comments on this "progressive site"

I read this page almost every day and like it a lot, but this topic is so misleading I may just start boycotting this site.

14. Paul Tulloch

Yes, that is what we need; let's get rid of workers who can actually do some of the math! The hard-core stats people need to keep their "high knowledge" from clouding their vision! We need to engage thousands more "posers," not purge them.
