Tuesday, November 12, 2013

An ambitious experiment in Data Science takes off: a biased, Open Source view from Berkeley

Today, during a White House OSTP event combining government, academia and industry, the Gordon and Betty Moore Foundation and the Alfred P. Sloan Foundation announced a $37.8M funding commitment to build new data science environments. This caps a year's worth of hard work for us at Berkeley, and even more for the Moore and Sloan teams, led by Vicki Chandler, Chris Mentzel and Josh Greenberg: they ran a very thorough selection process to choose three universities to participate in this effort. The Berkeley team was led by Saul Perlmutter, and we are now thrilled to join forces with teams at the University of Washington and NYU, respectively led by Ed Lazowska and Yann LeCun. We have worked very hard on this in private, so it's great to finally be able to publicly discuss what this ambitious effort is all about.

Berkeley BIDS team

Most of the UC Berkeley BIDS team, from left to right: Josh Bloom, Cathryn Carson, Jas Sekhon, Saul Perlmutter, Erik Mitchell, Kimmen Sjölander, Jim Sethian, Mike Franklin, Fernando Perez. Not present: Henry Brady, David Culler, Philip Stark and Ion Stoica (photo credit: Kaja Sehrt, VCRO).

As Joshua Greenberg from the Sloan Foundation says, "What this partnership is trying to do is change the culture of universities to create a data science culture." For us at Berkeley, the whole story has two interlocking efforts:

  1. The Moore and Sloan foundations are supporting a cross-institution initiative, where we will tackle the challenges that the rise of data-intensive science is posing.

  2. Spurred by this, Berkeley is announcing the creation of the new Berkeley Institute for Data Science (BIDS), scheduled to start full operations in Spring 2014 (once the renovations of the Doe 190 space are completed). BIDS will be the hub of our activity in the broader Moore/Sloan initiative, as a partner with the UW eScience Institute and the newly minted NYU Center for Data Science.

Since the two Foundations, Berkeley and our university partners will provide ample detail elsewhere (see link summary at the bottom), I want to give my own perspective. This process has been, as one can imagine, a complex one: we were putting together a campus-wide effort that was very different from any traditional grant proposal, as it involved not only a team of PIs from many departments who normally don't work together, but also serious institutional commitment. But I have seen real excitement in the entire team: there is a sense that we have been given the chance to tackle a big and meaningful problem, and that people are willing to move way out of their comfort zone and take risks. The probability of failure is non-negligible, but these are the kinds of problems worth failing on.

The way I think about it, we're a startup in the Eric Ries sense, a "human institution designed to deliver a new product or service under conditions of extreme uncertainty". Our outcomes are new forms of doing science in academic settings, and we're going against the grain of large institutional priors. We've been given nearly $38M of VC funding and a 5-year runway; now it's our job to make it beyond this.

Saul Perlmutter speaks during our internal launch event on Oct 25, at the location of the newly created BIDS in the campus Doe Library. This area, when renovated, will be set up in a similar fashion for seminars and workshops (photo credit: Kaja Sehrt, VCRO).

Why is "Data Science" a different problem?

The original mandate from our funders identified a set of challenges that the rise of data-intensive research poses to the modern university. I would summarize the problem as: the incentive mechanisms of academic research are at sharp odds with the rising need for highly collaborative interdisciplinary research, where computation and data are first-class citizens. The creation of usable, robust computational tools, and the work of data acquisition and analysis, must be treated as equal partners to methodological advances or domain-specific results.

Here are, briefly stated, what I see as the main mechanisms driving this problem in today's academic environment:

  • An incentive structure that favors individualism, hyper-specialization and "novelty" to a toxic extreme. The lead-author-paper is the single currency of the realm, and work that doesn't lead to this outcome is ultimately undertaken as pure risk. On the altar of "novelty", intellectual integrity is often sacrificed.

  • Collaboration, tool reuse and cross-domain abstractions are punished.

  • The sharing of tools that enable new science is discouraged/punished: again, it's much better to put out a few more "novel" first-author papers than to work with others to amplify their scientific outcomes.

  • Building a CV spread across disciplines is typically career suicide: no single hiring/tenure committee will understand it, despite the potentially broad and significant impact.

  • Most scientists are taught to treat computation as an afterthought. Similarly, most methodologists are taught to treat applications as an afterthought (thanks to Bill Howe, one of our UW partners, for pointing out this mirror problem).

  • Rigor in algorithmic usage, application and implementation by practitioners isn't taught anywhere: people grab methods like shirts from a rack, to see if they work with the pants they are wearing that day. Critical understanding of algorithmic assumptions and of the range of validity and constraints of new methods is a rarity in many applied domains.

  • Not to fault only domain practitioners, methodologists tend to only offer proof-of-concept, synthetic examples, staying largely shielded from real-world concerns. This means that reading their papers often leaves domain users with precious little guidance on what to actually do with the method (and I'm just as guilty as the rest).

  • Computation and data skills are all of a sudden everybody's problem. Yet, many academic disciplines are not teaching their students these tools in a coherent way, leaving the job to a haphazard collage of student groups, ad-hoc workshops and faculty who create bootcamps and elective seminar courses to patch up the problem.

  • While academia punishes all the behaviors we need, industry rewards them handsomely (and often in the context of genuinely interesting problems). We have a new form of brain drain we had never seen before at this scale. Jake VanderPlas' recent blog post articulates the problem with clarity.

Two overarching questions

The above points frame, at least for me, the "why do we have this problem?" part. As practitioners, for the next five years we will tackle it in a variety of ways. But all our work will be inscribed in the context of two large-scale questions, an epistemological one about the nature of science and research, and a very practical one about the structure of the institutions where science gets done.

1: Is the nature of science itself really changing?

"Data Science" is, in some ways, an over-abused buzzword that can mean anything and everything. I'm roughly taking the viewpoint nicely summarized by Drew Conway's well-known Venn Diagram:

It is not clear to me that a new discipline is arising, or at least that we should approach the question by creating yet another academic silo. I see this precisely as one of the questions we need to explore, both through our concrete practice and by taking a step back to reflect on the problem itself. The fact that we have an amazing interdisciplinary team, that includes folks who specialize in the philosophy and anthropology of knowledge creation, makes this particularly exciting. This is a big and complex issue, so I'll defer more writing on it for later.

2: How do the institutions of science need to change in this context?

The deeper epistemological question of whether Data Science "is a thing" may still be open. But at this point, nobody in their sane mind challenges the fact that the praxis of scientific research is under major upheaval because of the flood of data and computation. This is creating the tensions in reward structure, career paths, funding, publication and the construction of knowledge that have been amply discussed in multiple contexts recently. Our second "big question" will then be, can we actually change our institutions in a way that responds to these new conditions?

This is a really, really hard problem: there are few organizations more proud of their traditions and more resistant to change than universities (churches and armies might be worse, but that's about it). That pride and protective instinct have in some cases good reasons to exist. But it's clear that unless we do something about this, and soon, we'll be in real trouble. I have seen many talented colleagues leave academia in frustration over the last decade because of all these problems, and I can't think of a single one who wasn't happier years later. They are better paid, better treated, less stressed, and actually working on interesting problems (not every industry job is a boring, mindless rut, despite what we in the ivory tower would like to delude ourselves with).

One of the unique things about this Moore/Sloan initiative was precisely that the foundations required that our university officials were involved in a serious capacity, acknowledging that success on this project wouldn't be measured just by the number of papers published at the end of five years. I was delighted to see Berkeley put its best foot forward, and come out with unambiguous institutional support at every step of the way. From providing us with resources early on in the competitive process, to engaging with our campus Library so that our newly created institute could be housed in a fantastic, central campus location, our administration has really shown that they take this problem seriously. To put it simply, this wouldn't be happening if it weren't for the amazing work of Graham Fleming (our Vice Chancellor for Research) and Kaja Sehrt, from his office.

We hope that by pushing for disruptive change not only at Berkeley, but also with our partners at UW and NYU, we'll be able to send a clear message to the nation's universities on this topic. Education, hiring, tenure and promotion will all need to be considered as fair game in this discussion, and we hope that by the end of our initial working period, we will have already created some lasting change.

Open Source: a key ingredient in this effort

One of the topics that we will focus on at Berkeley, building off strong traditions we already have on that front, is the contribution of open source software and ideas to the broader Data Science effort. This is a topic I've spoken at length about at conferences and other venues, and I am thrilled to have now an official mandate to work on it.

Much of my academic career has been spent living a "double life", split between the responsibilities of a university scientist and trying to build valuable open source tools for scientific computing. I started writing IPython when I was a graduate student, and I immediately got pulled into the birth of the modern Scientific Python ecosystem. I worked on a number of the core packages, helped with conferences, wrote papers, taught workshops and courses, helped create a Foundation, and in the process met many other like-minded scientists who were committed to the same ideas.

We all felt that we could build a better computational foundation for science that would be, compared to the existing commercial alternatives and traditions:

  • Technically superior, built upon a better language.
  • Openly developed from the start: no more black boxes in science.
  • Based on validated, tested, documented, reusable code whose problems are publicly discussed and tracked.
  • Built collaboratively, with everyone participating as equal partners and worrying first about solving the problems, not about who would be first author or grant PI.
  • Licensed as open source but in a way that would encourage industry participation and collaboration.

In this process, I've learned a lot, built some hopefully interesting tools, made incredible friends, and in particular I realized that in many ways, the open source community practiced many of the ideals of science better than academia.

What do I mean by this? Well, in the open source software (OSS) world, people genuinely build upon each other's work: we all know that citing an opaque paper that hides key details and doesn't include any code or data is not really building off it, it's simply preemptive action to ward off a nasty comment from a reviewer. Furthermore, the open source community's work is reproducible by necessity: they are collaborating across the internet on improving the same software, which means they need to be able to really replicate what each other is doing to continually integrate new work. So this community has developed a combination of technical tools and social practices that, taken as a whole, produces a stunningly effective environment for the creation of knowledge.

But a crucial reason behind the success of the OSS approach is that its incentive structure is aligned to encourage genuine collaboration. As I stated above, the incentive structure of academia is more or less perfectly optimized to be as destructive and toxic to real collaboration as imaginable: everything favors the lead author or grant PI, and therefore every collaboration conversation tends to focus first on who will get credit and resources, not on what problems are interesting and how to best solve them. Everyone in academia knows of (or has been involved in) battles that leave destroyed relations, friendships and collaborations because of concerns over attribution and/or control.

It's not that the OSS world is a utopia of perfect harmony, far from it. "Flame wars" on public mailing lists are famous, projects split in sometimes acrimonious ways, nasty things get said in public and in private, etc. Humanity is still humanity. But I have seen first hand the difference between a baseline of productive engagement and the constant mistrust of acrid competitiveness that is academia. And I know I'm not the only one: I have heard many friends over the years tell me how much more they enjoy scientific open source conferences like SciPy than their respective, discipline-specific ones.

We want to build a space to bring the strengths of the university together with the best ideas of the OSS world. Despite my somewhat stern words above, I am a staunch believer in the fundamental value to society of our universities, and I am not talking about damaging them, only about helping them adapt to a rapidly changing landscape. The OSS world is distributed, collaborative, mostly online and often volunteer; it has great strengths but it also often produces very uneven outcomes. In contrast, universities have a lasting presence, physical space where intense human, face-to-face collaboration can take place, and deep expertise in many domains. I hope we can leverage those strengths of the university together with the practices of the OSS culture, to create a better future of computational science.

For the next few years, I hope to continue building great open source tools for scientific computing (IPython and much more), but also to bring these questions and ideas into the heart of the debate about what it means to be an academic scientist. Ask me in five years how it went :)

Partnerships at Berkeley

In addition to our partners at UW and NYU, this effort at Berkeley isn't happening in a vacuum, quite the opposite. We hope BIDS will not only produce its own ideas, but that it will also help catalyze a lot of the activity that's already happening. Some of the people in these partner efforts are also co-PIs in the Moore/Sloan project, so we have close connections between all:

Note: as always, this post was written as an IPython Notebook, which can be obtained from my github repository. If you are reading the notebook version, the blog post is available here.

Monday, July 1, 2013

In Memoriam, John D. Hunter III: 1968-2012

I just returned from the SciPy 2013 conference, whose organizers kindly invited me to deliver a keynote. For me this was a particularly difficult, yet meaningful edition of SciPy, my favorite conference. It was only a year ago that John Hunter, creator of matplotlib, had delivered his keynote shortly before being diagnosed with terminal colon cancer, from which he passed away on August 28, 2012 (if you haven't seen his talk, I strongly recommend it for its insights into scientific open source work).

On October 1st 2012, a memorial service was held at the University of Chicago's Rockefeller Chapel, the location of his PhD graduation. On that occasion I read a brief eulogy, but for obvious reasons only a few members from the SciPy community were able to attend. At this year's SciPy conference, Michael Droettboom (the new project leader for matplotlib) organized the first edition of the John Hunter Excellence in Plotting Contest, and before the awards ceremony I read a slightly edited version of the text I had delivered in Chicago (you can see the video here). I only made a few changes for brevity and to better suit the audience of the SciPy conference. I am reproducing it below.

I also went through my photo albums and found images I had of John. A memorial fund has been established in his honor to help with the education of his three daughters Clara, Ava and Rahel (Update: the fund was closed in late 2012 and its proceeds given to the family; moving forward, NumFOCUS sponsors the John Hunter Technology Fellowship, to which anyone can make contributions).

Dear friends and colleagues,

I used to tease John by telling him that he was the man I aspired to be when I grew up. I am not sure he knew how much I actually meant that. I first met him over email in 2002, when IPython was in its infancy and had rudimentary plotting support via Gnuplot. He sent me a patch to support a plotting syntax more akin to that of MATLAB, but I was buried in my effort to finish my PhD and couldn’t deal with his contribution for at least a few months. In the first example of what I later came to know as one of his signatures, he kindly replied and then simply routed around this blockage by single-handedly creating matplotlib. For him, building an entire new visualization library from scratch was the sensible solution: he was never one to be stopped by what many would consider an insurmountable obstacle.

Our first personal encounter was at SciPy 2004 at Caltech. I was immediately taken by his unique combination of generous spirit, sharp wit and technical prowess, and over the years I would grow to love him as a brother. John was a true scholar, equally at ease in a conversation about monetary policy, digital typography or the intricacies of C++ extensions in Python. But never once would you feel from him a hint of arrogance or condescension, something depressingly common in academia. John was driven only by the desire to work on interesting questions and to always engage others in a meaningful way, whether solving their problems, lifting their spirits or simply sharing a glass of wine. Beneath a surface of technical genius, there lay a kind, playful and fearless spirit, who was quietly comfortable in his own skin and let the power of his deeds speak for him.

Beyond the professional context, John had a rich world populated by the wonders of his family, his wife Miriam and his daughters Clara, Ava and Rahel. His love for his daughters knew no bounds, and yet I never once saw him clip their wings out of apprehension. They would be up on trees, dangling from monkeybars or riding their bikes, and he would always be watchful but encouraging of all their adventures. In doing so, he taught them to live like he did: without fear that anything could be too difficult or challenging to accomplish, and guided by the knowledge that small slips and failures were the natural price of being bold and never settling for the easy path.

A year ago in this same venue, John drew lessons from a decade’s worth of his own contributions to our community, from the vantage point of matplotlib. Ten years earlier at U. Chicago, his research on pediatric epilepsy required either expensive and proprietary tools or immature free ones. Along with a few similarly-minded folks, many of whom are in this room today, John believed in a future where science and education would be based on openly available software developed in a collaborative fashion. This could be seen as a fool’s errand, given that the competition consisted of products from companies with enormous budgets and well-entrenched positions in the marketplace. Yet a decade later, this vision is gradually becoming a reality. Today, the Scientific Python ecosystem powers everything from history-making astronomical discoveries to large financial modeling companies. Since all of this is freely available for anyone to use, it was possible for us to end up a few years ago in India, teaching students from distant rural colleges how to work with the same tools that NASA uses to analyze images from the Hubble Space Telescope. In recognition of the breadth and impact of his contributions, the Python Software Foundation awarded him posthumously the first installment of its highest distinction, the PSF Distinguished Service Award.

John’s legacy will be far-reaching. His work in scientific computing happened in a context of turmoil in how science and education are conducted, financed and made available to the public. I am absolutely convinced that in a few decades, historians of science will describe the period we are in right now as one of deep and significant transformations to the very structure of science. And in that process, the rise of free openly available tools plays a central role. John was on the front lines of this effort for a decade, and with his accomplishments he shone brighter than most.

John’s life was cut far, far too short. We will mourn him for time to come, and we will never stop missing him. But he set the bar high, and the best way in which we can honor his incredible legacy is by living up to his standards: uncompromising integrity, never-ending intellectual curiosity, and most importantly, unbounded generosity towards all who crossed his path. I know I will never grow up to be John Hunter, but I know I must never stop trying.

Fernando Pérez

June 27th 2013, SciPy Conference, Austin, Tx.

Thursday, May 30, 2013

Exploring Open Data with Pandas and IPython at the Berkeley I School

"Working with Open Data", a course by Raymond Yee

This will be a guest post, authored by Raymond Yee from the UC Berkeley School of Information (or I School, as it is known around here). This spring, Raymond has been teaching a course titled "Working with Open Data", where students learn how to work with openly available data sets with Python.

Raymond has been using IPython and the notebook since the start of the course, as well as hosting lots of materials directly using github. He kindly invited me to lecture in his course a few weeks ago, and I gave his students an overview of the IPython project as well as our vision of reproducible research and of building narratives that are anchored in code and data that are always available for inspection, discussion and further modification.

Towards the end of the course, his students had to develop a final project, organizing themselves in groups of 2-4 and producing a final working system that would use open datasets to produce an interesting analysis of their choice.

I recently had the chance to see the final projects, and I have to say that I walked out blown away by the results. This is a group of students who don't come from traditionally computationally-intensive backgrounds, as the only requirement was some basic Python experience. And in a matter of a few weeks, they created very compelling tools that dove into different problem domains from health care to education and even sports, producing interesting results complete with a narrative, plots and even interactive JavaScript controls and SVG output elements. Keep in mind that the tools to do some of this stuff aren't even really documented or explained much in IPython yet, as we haven't really dug into that part in earnest (that is our Fall 2013 plan).

The students obviously did run into some issues, and I took notes on what we can do to improve the situation. We had a follow-up meeting on campus where we gave them pointers on how to do certain things more easily. But to me, these results validate the idea that we can construct computational narratives based on code and data that will ultimately lead to a more informed discourse.

I am very much looking forward to future collaborations with Raymond: he has shown that we can create an educational experience around data-driven discovery, using IPython, pandas, matplotlib and the rest of the SciPy stack, that is completely accessible to students without major computational training, and that lets them produce interesting results around socially interesting and relevant datasets.

I will now pass this on to Raymond, so he can describe a little bit more about the course, challenges they encountered, and highlight some of the course projects that are already available on github for further forking and collaboration.

As always, this post is available as an IPython notebook. The rest of the post is Raymond's authorship.

The course

Over a 15-week period, my students and I met twice a week to study open data, using Python to access, process, and interpret that data. Twenty-four students completed the course: 17 Masters students from the I School and 7 undergraduates from electrical engineering/computer science, statistics, and business.

We covered about half of our textbook Python for Data Analysis by Wes McKinney. Accordingly, a fair amount of our energy was directed to studying the pandas library. The prerequisite programming background was the one semester Python minimum requirement for I School Masters students. The students learned a lot (while having a good time overall, so I'm told) about how to program pandas and use the IPython notebook by working through a series of assignments. Students filled in missing code in IPython notebooks I had created to illustrate various techniques for data analysis. (Most of the resources for the course are contained in the course github repository.)

My students and I were particularly grateful for the line-up of guest speakers (which included Fernando Perez), who shared their expertise on topics ranging from scraping the web to archive legal cases, the UC Berkeley Library Data Lab, open data in archaeology, open access web crawling, and scientific reproducibility.

Final projects

The culmination of the course came in the final projects, where groups of two to four students designed and implemented data analysis projects. The final deliverable for the course was an IPython notebook that was expected to contain the following attributes:

  • a clear articulation of the problem being addressed in the project
  • a clear description of what was originally proposed and what the students ended up doing, describing what led the students to go from the proposed work to the final outcome
  • a thorough description of what was behind-the-scenes: the sources of data from which the students drew and the code they wrote to analyze the data
  • a clear description of the results and precise details on how to reproduce the results
  • if the students were to continue their project beyond the course, what would be this future work
  • a paragraph outlining how the work was split among group members.

The students and I welcomed enthusiastic visitors to the Open House, to which the I School community and, in fact, the larger campus community were invited.

Working with Open Data 2013 Open House

Here are abstracts for the eight projects, where each screenshot is a link to its IPython notebook. Enjoy!

Stock Performance of Product Releases
Edward Lee, Eugene Kim

We draw connections among openly available data pertaining to Apple in order to examine how Apple's stock performance was affected by a given product release. We examine Wikipedia data for detailed information on Apple's product releases, make use of Yahoo Finance's API for specific stock performance metrics, and use openly available Form 10-Qs for internal (financial) changes at Apple. The main purpose is to examine available data to draw new conclusions centered on the time around the product release date.

Education First
Carl Shan, Bharathkumar Gunasekaran, Haroon Rasheed Paul Mohammed, Sumedh Sawant

Most parents nowadays have a general sense of the significant factors for choosing a school for their children. However, with a lack of existing tools and information sources, most of these parents have a hard time measuring, weighing and comparing these factors in relation to geographical areas when they are trying to pick the best place to live in with the best schools.

Thus, our team aims to address this problem by visualizing the statistical data from the NCES with geo-data to help the parents through the process of picking the best area to live in with the best schools. Parents can specify exactly what parameters they consider important in their decision process and we will generate a heat-map of the state they’re interested in living in and dynamically color it according to how closely each county matches their preferences. The heat map will be displayed with a web browser.

The League of Champions
Natarajan Chakrapani, Mark Davidoff, Kuldeep Kapade

In the soccer world, there is a lot of money involved in the transfer of players in the premier leagues around the world. Focusing on the English Premier League, our project, “The League Of Champions”, aims to analyze the return on investment of the transfers made by its teams. It measures each club's return on every dollar spent on acquired players for a season, on parameters like goals scored, active time on the field, assists, etc. In addition, we analyze how big a factor player age is in commanding a high transfer fee, and whether clubs prefer to pay large amounts for specialist players in specific field positions.
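
As a rough illustration of the "return per dollar spent" idea (this is not the students' code, and every club, fee and statistic below is a made-up placeholder), one could total the fees and the on-field metrics per club and divide one by the other. The sketch uses plain Python to stay dependency-free; the same steps map naturally onto a pandas groupby followed by a sum.

```python
from collections import defaultdict

# Hypothetical transfer records: all names and numbers are invented
# for illustration and are not real Premier League data.
transfers = [
    {"club": "Club A", "fee_musd": 20.0, "goals": 10, "assists": 4},
    {"club": "Club A", "fee_musd": 5.0, "goals": 5, "assists": 1},
    {"club": "Club B", "fee_musd": 10.0, "goals": 2, "assists": 3},
    {"club": "Club B", "fee_musd": 40.0, "goals": 8, "assists": 2},
]


def return_per_musd(records, metrics=("goals", "assists")):
    """Sum fees and metrics per club, then divide each metric by the total fee."""
    totals = defaultdict(lambda: defaultdict(float))
    for rec in records:
        club = totals[rec["club"]]
        club["fee_musd"] += rec["fee_musd"]
        for metric in metrics:
            club[metric] += rec[metric]
    return {
        name: {metric: club[metric] / club["fee_musd"] for metric in metrics}
        for name, club in totals.items()
    }


roi = return_per_musd(transfers)
for club, values in sorted(roi.items()):
    print(club, values)
```

Dividing by the summed fee weights expensive signings more heavily than averaging per-player ratios would; which of the two is more meaningful is a modeling choice the abstract does not specify.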

All About TEDx
Chan Kim, JT Huang

TED is a nonprofit devoted to Ideas Worth Spreading. It started out (in 1984) as a conference bringing together people from three worlds: Technology, Entertainment, Design. The TED Open Translation Project brings TED Talks beyond the English-speaking world by offering subtitles, interactive transcripts and the ability for any talk to be translated by volunteers worldwide. The project was launched with 300 translations, 40 languages and 200 volunteer translators; now, there are more than 32,000 completed translations from the thousands-strong community. The TEDx program is designed to give communities the opportunity to stimulate dialogue through TED-like experiences at the local level.

Our project aims to encourage people to translate TEDx Talks as well, by showing how TEDx Talk videos are translated and spread across different languages, places and topics, and by comparing their spread with that of TED Talk videos.

Environmental Health Gap
Rohan Salantry, Deborah Linton, Alec Hanefeld, Eric Zan

There is growing evidence that environmental factors trigger chronic diseases such as asthma, which result in billions in health care costs. However, a gap in knowledge exists concerning the extent of the link and where it is more prevalent. We aim to create a framework for closing this gap by integrating health and environmental condition data sets. Specifically, the project will link emissions data from the EPA and the California Department of Public Health in an attempt to find a correlation between incidences of asthma treatments and emissions seen as triggers for asthma.

The project hopes to be a stepping stone for policy decisions concerning the value tradeoff between health care treatment and environmental regulation as well as where to concentrate resources based on severity of need.

World Bank Data Analysis
Aisha Kigongo, Sydney Friedman, Ignacio Pérez

Our goal was to use a variety of tools to investigate the impact of project funding in developing countries. To do so, we looked at open data from the World Bank, which keeps close track of every project that gets funded, who funds it, and the goal of the project, whether agricultural, economic or related to health. Using Python, we used an index of the World Bank to see where the most funded countries were and how they related to various indicators such as the Human Development Index, the Freedom Index, and, in the future, health, educational and other economic indexes. Our secondary goal is to analyze what insight open data can give us as to how effective initiatives and funding actually are, as opposed to what they are meant to be.

Dr. Book
AJ Renold, Shohei Narron, Alice Wang

When we read a book, all the information is contained in that resource. But what if you could learn more about a concept, historical figure, or location presented in a chapter? Dr. Book expands your reading experience by connecting people, places, topics and concepts within a book to render a webpage linking these resources to Wikipedia.

Book Hunters
Fred Chasen, Luis Aguilar, Sonali Sharma

When we search for books on the internet we are often overwhelmed with results coming from various sources. It’s difficult to get direct, trusted URLs to books. Project Gutenberg, HathiTrust and Open Library all provide an extensive library of books online, each with its own large repository of titles. By combining their catalogs, Book Hunters enables querying for a book across those different sources. Our project will highlight key statistics about the three datasets, including the number of books in all three data sources, formats, languages and publishing dates. Apart from that, we will ask users to search for a particular book of interest, and we will return combined results from all three resources and also provide the direct link to the PDF, text or EPUB format of the book. This will be an exercise in filtering results for users and giving them easy access to the books that they are looking for.

Friday, April 19, 2013

"Literate computing" and computational reproducibility: IPython in the age of data-driven journalism

As "software eats the world" and we become awash in the flood of quantitative information denoted by the "Big Data" buzzword, it's clear that informed debate in society will increasingly depend on our ability to communicate information that is based on data. And for this communication to be a truly effective dialog, it is necessary that the arguments made based on data can be deconstructed, analyzed, rebutted or expanded by others. Since these arguments in practice often rely critically on the execution of code (whether an Excel spreadsheet or a proper program), it means that we really need tools to effectively communicate narratives that combine code, data and the interpretation of the results.

I will point out here two recent examples, taken from events in the news this week, where IPython has helped this kind of discussion, in the hopes that it can motivate a more informed style of debate where all the moving parts of a quantitative argument are available to all participants.

Insight, not numbers: from literate programming to literate computing

The computing community has for decades known about the "literate programming" paradigm introduced by Don Knuth in the 70's and fully formalized in his famous 1992 book. Briefly, Knuth's approach proposes writing computer programs in a format that mixes the code and a textual narrative together, and from this format generating separate files that will contain either actual code that can be compiled/executed by the computer, or a narrative document that explains the program and is meant for human consumption. The idea is that by allowing the authors to maintain a close connection between code and narrative, a number of benefits will ensue (clearer code, fewer programming errors, more meaningful descriptions than mere comments embedded in the code, etc).
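
To make the two outputs concrete, here is a minimal sketch in Python. It assumes a toy source format in which code chunks sit between `@code` and `@end` marker lines; this format and the names `split_chunks`, `tangle` and `weave` are illustrative assumptions, not the syntax or tooling of Knuth's actual system.

```python
def split_chunks(source):
    """Split a literate source into (kind, text) chunks; kind is 'doc' or 'code'."""
    chunks, kind, lines = [], "doc", []
    for line in source.splitlines():
        marker = line.strip()
        if marker in ("@code", "@end"):
            # A marker closes the current chunk and switches the chunk kind.
            if lines:
                chunks.append((kind, "\n".join(lines)))
            kind = "code" if marker == "@code" else "doc"
            lines = []
        else:
            lines.append(line)
    if lines:
        chunks.append((kind, "\n".join(lines)))
    return chunks


def tangle(source):
    """Keep only the code chunks: the file meant for the computer."""
    code = [text for kind, text in split_chunks(source) if kind == "code"]
    return "\n".join(code) + "\n"


def weave(source):
    """Keep everything: the document meant for humans, with code indented."""
    parts = []
    for kind, text in split_chunks(source):
        if kind == "code":
            text = "\n".join("    " + line for line in text.splitlines())
        parts.append(text)
    return "\n".join(parts) + "\n"
```

Both functions read the same single source, which is the point of the paradigm: the narrative and the program cannot drift apart because neither exists as a separately edited file.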

I don't take any issue with this approach per se, but I don't personally use it because it's not very well suited to the kinds of workflows that I need in practice. These require the frequent execution of small fragments of code, in an iterative cycle where code is run to obtain partial results that inform the next bit of code to be written. Such is the nature of interactive exploratory computing, which is the bread and butter of many practicing scientists. This is the kind of workflow that led me to create IPython over a decade ago, and it continues to inform basically every decision we make in the project today.

As Hamming famously said in 1962, "The purpose of computing is insight, not numbers." IPython tries to help precisely in this kind of usage pattern of the computer, in contexts where there is no clear notion in advance of what needs to be done, so the user is the one driving the computation. However, IPython also tries to provide a way to capture this process, and this is where we join back with the discussion above: while literate programming focuses on providing a narrative description of the structure of an algorithm, our working paradigm is one where the act of computing occupies the center stage.

From this perspective, we therefore refer to the workflow exposed by these kinds of computational notebooks (not just IPython, but also Sage, Mathematica and others) as "literate computing": it is the weaving of a narrative directly into a live computation, interleaving text with code and results to construct a complete piece that relies equally on the textual explanations and the computational components. For the goals of communicating results in scientific computing and data analysis, I think this model is a better fit than the literate programming one, which is rather aimed at developing software in tight concert with its design and explanatory documentation. I should note that we have some ideas on how to make IPython stronger as a tool for "traditional" literate programming, but it's a bit early for us to focus on that, as we first want to solidify the computational workflows possible with IPython.

As I mentioned in a previous blog post about the history of the IPython notebook, the idea of a computational notebook is neither new nor ours. Several IPython developers used other similar systems extensively for a long time and we took lots of inspiration from them. What we have tried to do, however, is to take a fresh look at these ideas, so that we can build a computational notebook that provides the best possible experience for computational work today. That means taking the existence of the Internet as a given in terms of using web technologies, an architecture based on well-specified protocols and reusable low-level formats (JSON), a language-agnostic view of the problem and a concern about the entire cycle of computing from the beginning. We want to build a tool that is just as good for individual experimentation as it is for collaboration, communication, publication and education.

Government debt, economic growth and a buggy Excel spreadsheet: the code behind the politics of fiscal austerity

In the last few years, extraordinarily contentious debates have raged in the circles of political power and fiscal decision making around the world, regarding the relation between government debt and economic growth. One of the centerpieces of this debate was a paper from Harvard economists C. Reinhart and K. Rogoff, later turned into a best-selling book, that argued that beyond 90% debt ratios, economic growth would plummet precipitously.

This argument was used (amongst others) by politicians to justify some of the extreme austerity policies that have been foisted upon many countries in the last few years. On April 15, a team of researchers from U. Massachusetts published a re-analysis of the original data where they showed how Reinhart and Rogoff had made both fairly obvious coding errors in their original Excel spreadsheets as well as some statistically questionable manipulations of the data. Herndon, Ash and Pollin (the U. Mass authors) published all their scripts in R so that others could inspect their calculations.

Two posts from the Economist and the Roosevelt Institute nicely summarize the story with a more informed policy and economics discussion than I can make. James Kwak has a series of posts that dive into technical detail and question the horrible choice of using Excel, a tool that should for all intents and purposes be banned from serious research as it entangles code and data in ways that more or less guarantee serious errors in anything but trivial scenarios. Victoria Stodden just wrote an excellent new post with specific guidance on practices for better reproducibility; here I want to take a narrow view of these same questions focusing strictly on the tools.

As reported in Mike Konczal's piece at the Roosevelt Institute, Herndon et al. had to reach out to Reinhart and Rogoff for the original code, which hadn't been made available before (apparently causing much frustration in economics circles). It's absolutely unacceptable that major policy decisions that impact millions worldwide had until now hinged effectively on the unverified word of two scientists: no matter how competent or honorable they may be, we know everybody makes mistakes, and in this case there were both egregious errors and debatable assumptions. As Konczal says, "all I can hope is that future historians note that one of the core empirical points providing the intellectual foundation for the global move to austerity in the early 2010s was based on someone accidentally not updating a row formula in Excel." To that I would add the obvious: this should never have happened in the first place, as we should have been able to inspect that code and data from the start.

Now, moving over to IPython, something interesting happened: when I saw the report about the Herndon et al. paper and realized they had published their R scripts for all to see, I posted this request on Twitter:

It seemed to me that the obvious thing to do would be to create a document that combined the analysis with a bit of narrative using IPython, hopefully more easily used as a starting point for further discussion. What I didn't really expect is that it would take less than three hours for Vincent Arel-Bundock, a PhD student in Political Science at U. Michigan, to come through with a solution:

I suggested that he turn this example into a proper repository on github with the code and data, which he quickly did:

So now we have a full IPython notebook, kept in a proper github repository. This repository can enable an informed debate about the statistical methodologies used for the analysis, and now anyone who simply installs the SciPy stack can not only run the code as-is, but explore new directions and contribute to the debate in a properly informed way.

On to the heavens: the New York Times' infographic on NASA's Kepler mission

As I was discussing the above with Vincent on Twitter, I came across this post by Jonathan Corum, an information designer who works as NY Times science graphics editor:

The post links to a gorgeous, animated infographic that summarizes the results that NASA's Kepler spacecraft has obtained so far, and which accompanies a full article at the NYT on Kepler's most recent results: a pair of planets that seem to have just the right features to possibly support life, a quick 1200 light-years hop from us.

Jonathan indicated that he converted his notebook to a Python script later on for version control and automation, though I explained to him that he could have continued using the notebook, since the --script flag would give him a .py file if needed, and it's also possible to execute a notebook just like a script, with a bit of additional support code:
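
As a minimal sketch of that idea (not the original support code), assuming the notebook is a JSON file whose code cells store their text under `source` (nbformat 4) or `input` inside `worksheets` (nbformat 3), one could pull out the code cells and run them in order. The names `load_code_cells` and `run_notebook` are hypothetical helpers, not part of IPython:

```python
import json


def load_code_cells(path):
    """Return the source text of every code cell in a notebook file, in order."""
    with open(path, encoding="utf-8") as f:
        nb = json.load(f)

    # nbformat 4 keeps cells at the top level; nbformat 3 nests them in worksheets.
    if "cells" in nb:
        cells = nb["cells"]
    else:
        cells = [c for ws in nb.get("worksheets", []) for c in ws.get("cells", [])]

    sources = []
    for cell in cells:
        if cell.get("cell_type") != "code":
            continue
        # Code is stored as a single string or as a list of lines.
        src = cell.get("source", cell.get("input", ""))
        if isinstance(src, list):
            src = "".join(src)
        sources.append(src)
    return sources


def run_notebook(path):
    """Execute all code cells in one shared namespace and return that namespace."""
    namespace = {"__name__": "__main__"}
    for src in load_code_cells(path):
        exec(compile(src, path, "exec"), namespace)
    return namespace
```

Calling `run_notebook("analysis.ipynb")` (a placeholder filename) would then behave roughly like running a script with the same statements. This sketch only handles plain Python: cells that use IPython-specific syntax such as `%` magics or `!` shell commands would raise a `SyntaxError` here.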

In this case Jonathan's code isn't publicly available, but I am still very happy to see this kind of usage: it's a step in the right direction already and as more of this analysis is done with open-source tools, we move further towards the possibility of an informed discussion around data-driven journalism.

I also hope he'll perhaps release some of the code later on, so that others can build upon it for similar analyses. I'm sure lots of people would be interested, and it wouldn't detract in any way from the interest in his own work, which is strongly tied to the rest of the NYT editorial resources and strengths.

Looking ahead from IPython's perspective

Our job with IPython is to think deeply about questions regarding the intersection of computing, data and science, but it's clear to me at this point that we can contribute in contexts beyond pure scientific research. I hope we'll be able to provide folks who have a direct intersection with the public, such as journalists, with tools that help foster a more informed and productive debate.

Coincidentally, UC Berkeley will be hosting on May 4 a symposium on data and journalism, and in recent days I've had very productive interactions with folks in this space on campus. Cathryn Carson currently directs the newly formed D-Lab, whose focus is precisely the use of quantitative and data methods in the social sciences, and her team has recently been teaching workshops on using Python and R for social scientists. And just last week I lectured in Raymond Yee's course (from the School of Information) where they are using the notebook extensively, following Wes McKinney's excellent Python for Data Analysis as the class textbook. Given all this, I'm fairly optimistic about the future of productive dialog and collaborations on campus, especially since we have a lot of the IPython team working full-time here.

Note: as usual, this post is available as an IPython notebook in my blog repo.