Big, Thick, Throbbing, Data

I'm going to try to keep this short and sweet.

On Freakonomics the other day I finally heard someone echo my concerns about the "big data" approach to scientific problems.  The theme for the show was "this idea needs to die," and Emanuel Derman, a professor at Columbia University, former Wall Street financial analyst, and former particle physicist, had the following to say:

"The scientific idea that I believe is ready for retirement is one that’s very fashionable now, and that’s the use and the power of statistics. It’s a subject that’s become increasingly popular with increasing power of computers, computer science, information technology, and everybody’s interest in economics and big data, which have all come together in some sort of nexus to make people think that just looking at data is going to be enough to tell you truths about the world. And I don’t really believe that."

He goes on to describe how Kepler, after years of studying the work of Tycho Brahe (a hell of a character if you've never heard of him), managed to arrive at the equal-area, equal-time rule, and how it was tremendous insight and intuition that guided him to that knowledge, not churning through endless amounts of data.

We're at a wonderful crossroads right now in the sciences.  We have access to tools that are almost more powerful than we know what to do with.  Nvidia just today released a $15,000 supercomputer that would make the likes of John von Neumann or Alan Turing faint.  Heck, we model entire physical worlds full of reflecting light and sound and surfaces and characters that talk to us... and that's just for entertainment (video games).  More amazing than anything is that it keeps getting more and more powerful.  Short of the most computationally intensive problems, you can spend a few thousand bucks and do powerful scientific modeling in your own home as a hobby; it's truly a beautiful thing.

There's this big push right now in science to generate tons and tons of data and then create these "machine learning" or "deep learning" algorithms to process it.  This approach is how Google gives you search results, it's how Siri knows what to return when you ask it for movie times, and it's how more and more folks are starting to approach science.  There's sort of nothing wrong with these approaches in a pragmatic sense: they get the job done quite well, we have the computing power, and they will adapt as we adapt to give us what we want.  I should also say here that I'm fascinated by this stuff and have a lot of personal interest in machine learning, so I don't want people to think that I'm opposed to it overall.

Part of the reason I got into science was that I was always fascinated by those moments of baffling intuition or sheer cleverness from great scientists and engineers.  I honestly don't believe that those are the same as shoving a big CSV file into an algorithm someone else wrote and getting some tuning parameters out.
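To be concrete about what I mean by "data in, tuning parameters out," here's a toy sketch of that workflow (synthetic points standing in for the big CSV file, and everything here is made up for illustration): we cook up noisy measurements along a known line, then grind out the slope and intercept with a textbook least-squares formula.  The machine does all the work; no insight required.

```python
# Toy version of the "data in, parameters out" workflow:
# generate noisy points along a known line, then recover the
# slope and intercept with ordinary least squares.
import random

random.seed(0)
true_slope, true_intercept = 2.5, -1.0

# 100 synthetic "measurements" with Gaussian noise
xs = [x / 10 for x in range(100)]
ys = [true_slope * x + true_intercept + random.gauss(0, 0.5) for x in xs]

# closed-form least-squares fit: no thinking, just arithmetic
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

print(f"fit: slope={slope:.2f}, intercept={intercept:.2f}")
```

And it works: the fitted parameters land right on top of the true ones.  But notice what it can't do; nothing in that formula would ever tell you *why* the line is there, or whether a line was even the right model to reach for in the first place.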

Machine learning on a truly powerful scale is still very much in its infancy, and we need to keep that in mind.  The power of statistics lies in sheer volume of measurements and data, and in many cases people are trying to apply these ideas without those things.  We also can't rely on machine learning as a replacement for intuition and study... at least not yet.  Data can and absolutely should guide intuition, but it's no substitute for actually thinking through these problems.  Data and intuition are inseparable: data would be worthless without folks trying to interpret it, and intuition would guide you to nothing if you didn't have data.  Let's just be sure not to forget that.

... but computers ARE fucking awesome.  Don't get it twisted.