Tuesday, November 12, 2013

An ambitious experiment in Data Science takes off: a biased, Open Source view from Berkeley

Today, during a White House OSTP event combining government, academia and industry, the Gordon and Betty Moore Foundation and the Alfred P. Sloan Foundation announced a $37.8M funding commitment to build new data science environments. This caps a year's worth of hard work for us at Berkeley, and even more for the Moore and Sloan teams, led by Vicki Chandler, Chris Mentzel and Josh Greenberg: they ran a very thorough selection process to choose three universities to participate in this effort. The Berkeley team was led by Saul Perlmutter, and we are now thrilled to join forces with teams at the University of Washington and NYU, respectively led by Ed Lazowska and Yann LeCun. We have worked very hard on this in private, so it's great to finally be able to publicly discuss what this ambitious effort is all about.

Berkeley BIDS team

Most of the UC Berkeley BIDS team, from left to right: Josh Bloom, Cathryn Carson, Jas Sekhon, Saul Perlmutter, Erik Mitchell, Kimmen Sjölander, Jim Sethian, Mike Franklin, Fernando Perez. Not present: Henry Brady, David Culler, Philip Stark and Ion Stoica (photo credit: Kaja Sehrt, VCRO).

As Joshua Greenberg from the Sloan Foundation says, "What this partnership is trying to do is change the culture of universities to create a data science culture." For us at Berkeley, the whole story has two interlocking efforts:

  1. The Moore and Sloan foundations are supporting a cross-institution initiative, where we will tackle the challenges that the rise of data-intensive science is posing.

  2. Spurred by this, Berkeley is announcing the creation of the new Berkeley Institute for Data Science (BIDS), scheduled to start full operations in Spring 2014 (once the renovations of the Doe 190 space are completed). BIDS will be the hub of our activity in the broader Moore/Sloan initiative, as a partner with the UW eScience Institute and the newly minted NYU Center for Data Science.

Since the two Foundations, Berkeley and our university partners will provide ample detail elsewhere (see link summary at the bottom), I want to give my own perspective. This process has been, as one can imagine, a complex one: we were putting together a campus-wide effort that was very different from any traditional grant proposal, as it involved not only a team of PIs from many departments who normally don't work together, but also serious institutional commitment. But I have seen real excitement in the entire team: there is a sense that we have been given the chance to tackle a big and meaningful problem, and that people are willing to move way out of their comfort zone and take risks. The probability of failure is non-negligible, but these are the kinds of problems worth failing on.

The way I think about it, we're a startup in the Eric Ries sense, a "human institution designed to deliver a new product or service under conditions of extreme uncertainty". Our outcomes are new forms of doing science in academic settings, and we're going against the grain of large institutional priors. We've been given nearly $38M of VC funding and a five-year runway; now it's our job to make it beyond this.


Saul Perlmutter speaks during our internal launch event on Oct 25, at the location of the newly created BIDS in the campus Doe Library. This area, when renovated, will be set up in a similar fashion for seminars and workshops (photo credit: Kaja Sehrt, VCRO).

Why is "Data Science" a different problem?

The original mandate from our funders identified a set of challenges brought about by the rise of data-intensive research to the modern university. I would summarize the problem as: the incentive mechanisms of academic research are at sharp odds with the rising need for highly collaborative interdisciplinary research, where computation and data are first-class citizens. The creation of usable, robust computational tools, and the work of data acquisition and analysis, must be treated as equal partners to methodological advances or domain-specific results.

Here are, briefly stated, what I see as the main mechanisms driving this problem in today's academic environment:

  • An incentive structure that favors individualism, hyper-specialization and "novelty" to a toxic extreme. The lead-author-paper is the single currency of the realm, and work that doesn't lead to this outcome is ultimately undertaken as pure risk. On the altar of "novelty", intellectual integrity is often sacrificed.

  • Collaboration, tool reuse and cross-domain abstractions are punished.

  • The sharing of tools that enable new science is discouraged/punished: again, it's much better to put out a few more "novel" first-author papers than to work with others to amplify their scientific outcomes.

  • Building a CV spread across disciplines is typically career suicide: no single hiring/tenure committee will understand it, despite the potentially broad and significant impact.

  • Most scientists are taught to treat computation as an afterthought. Similarly, most methodologists are taught to treat applications as an afterthought (thanks to Bill Howe, one of our UW partners, for pointing out this mirror problem).

  • Rigor in algorithmic usage, application and implementation by practitioners isn't taught anywhere: people grab methods like shirts from a rack, to see if they work with the pants they are wearing that day. Critical understanding of algorithmic assumptions and of the range of validity and constraints of new methods is a rarity in many applied domains.

  • The fault does not lie only with domain practitioners: methodologists tend to offer only proof-of-concept, synthetic examples, staying largely shielded from real-world concerns. This means that reading their papers often leaves domain users with precious little guidance on what to actually do with the method (and I'm just as guilty as the rest).

  • Computation and data skills are all of a sudden everybody's problem. Yet, many academic disciplines are not teaching their students these tools in a coherent way, leaving the job to a haphazard collage of student groups, ad-hoc workshops and faculty who create bootcamps and elective seminar courses to patch up the problem.

  • While academia punishes all the behaviors we need, industry rewards them handsomely (and often in the context of genuinely interesting problems). We have a new form of brain drain that we had never seen before at this scale. Jake VanderPlas' recent blog post articulates the problem with clarity.

Two overarching questions

The above points frame, at least for me, the "why do we have this problem?" part. As practitioners, for the next five years we will tackle it in a variety of ways. But all our work will be inscribed in the context of two large-scale questions: an epistemological one about the nature of science and research, and a very practical one about the structure of the institutions where science gets done.

1: Is the nature of science itself really changing?

"Data Science" is, in some ways, an over-abused buzzword that can mean anything and everything. I'm roughly taking the viewpoint nicely summarized by Drew Conway's well-known Venn Diagram:

It is not clear to me that a new discipline is arising, or at least that we should approach the question by creating yet another academic silo. I see this precisely as one of the questions we need to explore, both through our concrete practice and by taking a step back to reflect on the problem itself. The fact that we have an amazing interdisciplinary team, that includes folks who specialize in the philosophy and anthropology of knowledge creation, makes this particularly exciting. This is a big and complex issue, so I'll defer more writing on it for later.

2: How do the institutions of science need to change in this context?

The deeper epistemological question of whether Data Science "is a thing" may still be open. But at this point, nobody in their right mind challenges the fact that the praxis of scientific research is under major upheaval because of the flood of data and computation. This is creating the tensions in reward structure, career paths, funding, publication and the construction of knowledge that have been amply discussed in multiple contexts recently. Our second "big question" will then be: can we actually change our institutions in a way that responds to these new conditions?

This is a really, really hard problem: there are few organizations more proud of their traditions and more resistant to change than universities (churches and armies might be worse, but that's about it). That pride and protective instinct have in some cases good reasons to exist. But it's clear that unless we do something about this, and soon, we'll be in real trouble. I have seen many talented colleagues leave academia in frustration over the last decade because of all these problems, and I can't think of a single one who wasn't happier years later. They are better paid, better treated, less stressed, and actually working on interesting problems (not every industry job is a boring, mindless rut, despite what we in the ivory tower would like to delude ourselves with).

One of the unique things about this Moore/Sloan initiative was precisely that the foundations required that our university officials be involved in a serious capacity, acknowledging that success on this project wouldn't be measured just by the number of papers published at the end of five years. I was delighted to see Berkeley put its best foot forward, and come out with unambiguous institutional support at every step of the way. From providing us with resources early on in the competitive process, to engaging with our campus Library so that our newly created institute could be housed in a fantastic, central campus location, our administration has really shown that they take this problem seriously. To put it simply, this wouldn't be happening if it weren't for the amazing work of Graham Fleming (our Vice Chancellor for Research) and Kaja Sehrt, from his office.

We hope that by pushing for disruptive change not only at Berkeley, but also with our partners at UW and NYU, we'll be able to send a clear message to the nation's universities on this topic. Education, hiring, tenure and promotion will all need to be considered as fair game in this discussion, and we hope that by the end of our initial working period, we will have already created some lasting change.

Open Source: a key ingredient in this effort

One of the topics that we will focus on at Berkeley, building off strong traditions we already have on that front, is the contribution of open source software and ideas to the broader Data Science effort. This is a topic I've spoken about at length at conferences and other venues, and I am thrilled to now have an official mandate to work on it.

Much of my academic career has been spent living a "double life", split between the responsibilities of a university scientist and trying to build valuable open source tools for scientific computing. I started writing IPython when I was a graduate student, and I immediately got pulled into the birth of the modern Scientific Python ecosystem. I worked on a number of the core packages, helped with conferences, wrote papers, taught workshops and courses, helped create a Foundation, and in the process met many other like-minded scientists who were committed to the same ideas.

We all felt that we could build a better computational foundation for science that would be, compared to the existing commercial alternatives and traditions:

  • Technically superior, built upon a better language.
  • Openly developed from the start: no more black boxes in science.
  • Based on validated, tested, documented, reusable code whose problems are publicly discussed and tracked.
  • Built collaboratively, with everyone participating as equal partners and worrying first about solving the problems, not about who would be first author or grant PI.
  • Licensed as open source but in a way that would encourage industry participation and collaboration.

In this process, I've learned a lot, built some hopefully interesting tools, made incredible friends, and in particular I realized that in many ways, the open source community practiced many of the ideals of science better than academia.

What do I mean by this? Well, in the open source software (OSS) world, people genuinely build upon each other's work: we all know that citing an opaque paper that hides key details and doesn't include any code or data is not really building off it, it's simply preemptive action to ward off a nasty comment from a reviewer. Furthermore, the open source community's work is reproducible by necessity: they are collaborating across the internet on improving the same software, which means they need to be able to really replicate what each other is doing to continually integrate new work. So this community has developed a combination of technical tools and social practices that, taken as a whole, produces a stunningly effective environment for the creation of knowledge.

But a crucial reason behind the success of the OSS approach is that its incentive structure is aligned to encourage genuine collaboration. As I stated above, the incentive structure of academia is more or less perfectly optimized to be as destructive and toxic to real collaboration as imaginable: everything favors the lead author or grant PI, and therefore every collaboration conversation tends to focus first on who will get credit and resources, not on what problems are interesting and how to best solve them. Everyone in academia knows of (or has been involved in) battles that leave destroyed relations, friendships and collaborations because of concerns over attribution and/or control.

It's not that the OSS world is a utopia of perfect harmony, far from it. "Flame wars" on public mailing lists are famous, projects split in sometimes acrimonious ways, nasty things get said in public and in private, etc. Humanity is still humanity. But I have seen first hand the difference between a baseline of productive engagement and the constant mistrust of acrid competitiveness that is academia. And I know I'm not the only one: I have heard many friends over the years tell me how much more they enjoy scientific open source conferences like SciPy than their respective, discipline-specific ones.

We want to build a space to bring the strengths of the university together with the best ideas of the OSS world. Despite my somewhat stern words above, I am a staunch believer in the fundamental value to society of our universities, and I am not talking about damaging them, only about helping them adapt to a rapidly changing landscape. The OSS world is distributed, collaborative, mostly online and often volunteer; it has great strengths but it also often produces very uneven outcomes. In contrast, universities have a lasting presence, physical space where intense human, face-to-face collaboration can take place, and deep expertise in many domains. I hope we can leverage those strengths of the university together with the practices of the OSS culture, to create a better future of computational science.

For the next few years, I hope to continue building great open source tools for scientific computing (IPython and much more), but also to bring these questions and ideas into the heart of the debate about what it means to be an academic scientist. Ask me in five years how it went :)

Partnerships at Berkeley

In addition to our partners at UW and NYU, this effort at Berkeley isn't happening in a vacuum, quite the opposite. We hope BIDS will not only produce its own ideas, but that it will also help catalyze a lot of the activity that's already happening. Some of the people in these partner efforts are also co-PIs in the Moore/Sloan project, so we have close connections between all:

Note: as always, this post was written as an IPython Notebook, which can be obtained from my GitHub repository. If you are reading the notebook version, the blog post is available here.