Tuesday, November 12, 2013

An ambitious experiment in Data Science takes off: a biased, Open Source view from Berkeley

Today, during a White House OSTP event combining government, academia and industry, the Gordon and Betty Moore Foundation and the Alfred P. Sloan Foundation announced a $37.8M funding commitment to build new data science environments. This caps a year's worth of hard work for us at Berkeley, and even more for the Moore and Sloan teams, led by Vicki Chandler, Chris Mentzel and Josh Greenberg: they ran a very thorough selection process to choose three universities to participate in this effort. The Berkeley team was led by Saul Perlmutter, and we are now thrilled to join forces with teams at the University of Washington and NYU, respectively led by Ed Lazowska and Yann LeCun. We have worked very hard on this in private, so it's great to finally be able to publicly discuss what this ambitious effort is all about.

Berkeley BIDS team

Most of the UC Berkeley BIDS team, from left to right: Josh Bloom, Cathryn Carson, Jas Sekhon, Saul Perlmutter, Erik Mitchell, Kimmen Sjölander, Jim Sethian, Mike Franklin, Fernando Perez. Not present: Henry Brady, David Culler, Philip Stark and Ion Stoica (photo credit: Kaja Sehrt, VCRO).

As Joshua Greenberg from the Sloan Foundation says, "What this partnership is trying to do is change the culture of universities to create a data science culture." For us at Berkeley, the whole story has two interlocking efforts:

  1. The Moore and Sloan foundations are supporting a cross-institution initiative, where we will tackle the challenges that the rise of data-intensive science is posing.

  2. Spurred by this, Berkeley is announcing the creation of the new Berkeley Institute for Data Science (BIDS), scheduled to start full operations in Spring 2014 (once the renovations of the Doe 190 space are completed). BIDS will be the hub of our activity in the broader Moore/Sloan initiative, as a partner with the UW eScience Institute and the newly minted NYU Center for Data Science.

Since the two Foundations, Berkeley and our university partners will provide ample detail elsewhere (see link summary at the bottom), I want to give my own perspective. This process has been, as one can imagine, a complex one: we were putting together a campus-wide effort that was very different from any traditional grant proposal, as it involved not only a team of PIs from many departments who normally don't work together, but also serious institutional commitment. But I have seen real excitement in the entire team: there is a sense that we have been given the chance to tackle a big and meaningful problem, and that people are willing to move way out of their comfort zone and take risks. The probability of failure is non-negligible, but these are the kinds of problems worth failing on.

The way I think about it, we're a startup in the Eric Ries sense, a "human institution designed to deliver a new product or service under conditions of extreme uncertainty". Our outcomes are new forms of doing science in academic settings, and we're going against the grain of large institutional priors. We've been given nearly $38M of VC funding and a five-year runway; now it's our job to make it beyond that.

Berkeley BIDS team

Saul Perlmutter speaks during our internal launch event on Oct 25, at the location of the newly created BIDS in the campus Doe Library. This area, when renovated, will be set up in a similar fashion for seminars and workshops (photo credit: Kaja Sehrt, VCRO).

Why is "Data Science" a different problem?

The original mandate from our funders identified a set of challenges brought about by the rise of data intensive research to the modern university. I would summarize the problem as: the incentive mechanisms of academic research are at sharp odds with the rising need for highly collaborative interdisciplinary research, where computation and data are first-class citizens. The creation of usable, robust computational tools, and the work of data acquisition and analysis must be treated as equal partners to methodological advances or domain-specific results.

Here are, briefly stated, what I see as the main mechanisms driving this problem in today's academic environment:

  • An incentive structure that favors individualism, hyper-specialization and "novelty" to a toxic extreme. The lead-author-paper is the single currency of the realm, and work that doesn't lead to this outcome is ultimately undertaken as pure risk. On the altar of "novelty", intellectual integrity is often sacrificed.

  • Collaboration, tool reuse and cross-domain abstractions are punished.

  • The sharing of tools that enable new science is discouraged/punished: again, it's much better to put out a few more "novel" first-author papers than to work with others to amplify their scientific outcomes.

  • Building a CV spread across disciplines is typically career suicide: no single hiring/tenure committee will understand it, despite the potentially broad and significant impact.

  • Most scientists are taught to treat computation as an afterthought. Similarly, most methodologists are taught to treat applications as an afterthought (thanks to Bill Howe, one of our UW partners, for pointing out this mirror problem).

  • Rigor in algorithmic usage, application and implementation by practitioners isn't taught anywhere: people grab methods like shirts from a rack, to see if they work with the pants they are wearing that day. Critical understanding of algorithmic assumptions and of the range of validity and constraints of new methods is a rarity in many applied domains.

  • Not to fault only domain practitioners, methodologists tend to only offer proof-of-concept, synthetic examples, staying largely shielded from real-world concerns. This means that reading their papers often leaves domain users with precious little guidance on what to actually do with the method (and I'm just as guilty as the rest).

  • Computation and data skills are all of a sudden everybody's problem. Yet, many academic disciplines are not teaching their students these tools in a coherent way, leaving the job to a haphazard collage of student groups, ad-hoc workshops and faculty who create bootcamps and elective seminar courses to patch up the problem.

  • While academia punishes all the behaviors we need, industry rewards them handsomely (and often in the context of genuinely interesting problems). We have a new form of brain drain we had never seen before at this scale. Jake VanderPlas's recent blog post articulates the problem with clarity.

Two overarching questions

The above points frame, at least for me, the "why do we have this problem?" part. As practitioners, for the next five years we will tackle it in a variety of ways. But all our work will be inscribed in the context of two large-scale questions, an epistemological one about the nature of science and research, and a very practical one about the structure of the institutions where science gets done.

1: Is the nature of science itself really changing?

"Data Science" is, in some ways, an over-abused buzzword that can mean anything and everything. I'm roughly taking the viewpoint nicely summarized by Drew Conway's well-known Venn Diagram:

It is not clear to me that a new discipline is arising, or at least that we should approach the question by creating yet another academic silo. I see this precisely as one of the questions we need to explore, both through our concrete practice and by taking a step back to reflect on the problem itself. The fact that we have an amazing interdisciplinary team, that includes folks who specialize in the philosophy and anthropology of knowledge creation, makes this particularly exciting. This is a big and complex issue, so I'll defer more writing on it for later.

2: How do the institutions of science need to change in this context?

The deeper epistemological question of whether Data Science "is a thing" may still be open. But at this point, nobody in their right mind challenges the fact that the praxis of scientific research is under major upheaval because of the flood of data and computation. This is creating the tensions in reward structure, career paths, funding, publication and the construction of knowledge that have been amply discussed in multiple contexts recently. Our second "big question" will then be: can we actually change our institutions in a way that responds to these new conditions?

This is a really, really hard problem: there are few organizations more proud of their traditions and more resistant to change than universities (churches and armies might be worse, but that's about it). That pride and protective instinct have in some cases good reasons to exist. But it's clear that unless we do something about this, and soon, we'll be in real trouble. I have seen many talented colleagues leave academia in frustration over the last decade because of all these problems, and I can't think of a single one who wasn't happier years later. They are better paid, better treated, less stressed, and actually working on interesting problems (not every industry job is a boring, mindless rut, despite what we in the ivory tower would like to delude ourselves with).

One of the unique things about this Moore/Sloan initiative was precisely that the foundations required that our university officials were involved in a serious capacity, acknowledging that success on this project wouldn't be measured just by the number of papers published at the end of five years. I was delighted to see Berkeley put its best foot forward, and come out with unambiguous institutional support at every step of the way. From providing us with resources early on in the competitive process, to engaging with our campus Library so that our newly created institute could be housed in a fantastic, central campus location, our administration has really shown that they take this problem seriously. To put it simply, this wouldn't be happening if it weren't for the amazing work of Graham Fleming (our Vice Chancellor for Research) and Kaja Sehrt, from his office.

We hope that by pushing for disruptive change not only at Berkeley, but also with our partners at UW and NYU, we'll be able to send a clear message to the nation's universities on this topic. Education, hiring, tenure and promotion will all need to be considered as fair game in this discussion, and we hope that by the end of our initial working period, we will have already created some lasting change.

Open Source: a key ingredient in this effort

One of the topics that we will focus on at Berkeley, building off strong traditions we already have on that front, is the contribution of open source software and ideas to the broader Data Science effort. This is a topic I've spoken at length about at conferences and other venues, and I am thrilled to have now an official mandate to work on it.

Much of my academic career has been spent living a "double life", split between the responsibilities of a university scientist and trying to build valuable open source tools for scientific computing. I started writing IPython when I was a graduate student, and I immediately got pulled into the birth of the modern Scientific Python ecosystem. I worked on a number of the core packages, helped with conferences, wrote papers, taught workshops and courses, helped create a Foundation, and in the process met many other like-minded scientists who were committed to the same ideas.

We all felt that we could build a better computational foundation for science that would be, compared to the existing commercial alternatives and traditions:

  • Technically superior, built upon a better language.
  • Openly developed from the start: no more black boxes in science.
  • Based on validated, tested, documented, reusable code whose problems are publicly discussed and tracked.
  • Built collaboratively, with everyone participating as equal partners and worrying first about solving the problems, not about who would be first author or grant PI.
  • Licensed as open source but in a way that would encourage industry participation and collaboration.
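As a toy illustration of those ideals of validated, tested, documented code (the function below is my own hypothetical example, not from any particular package), this is the shape such code tends to take in the Scientific Python world: a docstring that states the contract, and small automated checks that run on every change:

```python
import numpy as np

def zscore(x):
    """Standardize data to zero mean and unit variance.

    Parameters
    ----------
    x : array_like
        Input data.

    Returns
    -------
    ndarray
        (x - mean(x)) / std(x), using the population standard deviation.
    """
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

# Small tests of the kind run automatically by a project's test suite:
result = zscore([1.0, 2.0, 3.0])
assert np.allclose(result.mean(), 0.0)
assert np.allclose(result.std(), 1.0)
```

The point is not the arithmetic, of course, but the practice: the assumptions are written down, and the checks are public, so problems can be discussed and tracked in the open rather than hidden inside a black box.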

In this process, I've learned a lot, built some hopefully interesting tools, made incredible friends, and in particular I realized that in many ways, the open source community practiced many of the ideals of science better than academia.

What do I mean by this? Well, in the open source software (OSS) world, people genuinely build upon each other's work: we all know that citing an opaque paper that hides key details and doesn't include any code or data is not really building off it, it's simply preemptive action to ward off a nasty comment from a reviewer. Furthermore, the open source community's work is reproducible by necessity: they are collaborating across the internet on improving the same software, which means they need to be able to really replicate what each other is doing to continually integrate new work. So this community has developed a combination of technical tools and social practices that, taken as a whole, produces a stunningly effective environment for the creation of knowledge.

But a crucial reason behind the success of the OSS approach is that its incentive structure is aligned to encourage genuine collaboration. As I stated above, the incentive structure of academia is more or less perfectly optimized to be as destructive and toxic to real collaboration as imaginable: everything favors the lead author or grant PI, and therefore every collaboration conversation tends to focus first on who will get credit and resources, not on what problems are interesting and how to best solve them. Everyone in academia knows of (or has been involved in) battles that leave destroyed relations, friendships and collaborations because of concerns over attribution and/or control.

It's not that the OSS world is a utopia of perfect harmony, far from it. "Flame wars" on public mailing lists are famous, projects split in sometimes acrimonious ways, nasty things get said in public and in private, etc. Humanity is still humanity. But I have seen first hand the difference between a baseline of productive engagement and the constant mistrust of acrid competitiveness that is academia. And I know I'm not the only one: I have heard many friends over the years tell me how much more they enjoy scientific open source conferences like SciPy than their respective, discipline-specific ones.

We want to build a space to bring the strengths of the university together with the best ideas of the OSS world. Despite my somewhat stern words above, I am a staunch believer in the fundamental value to society of our universities, and I am not talking about damaging them, only about helping them adapt to a rapidly changing landscape. The OSS world is distributed, collaborative, mostly online and often volunteer; it has great strengths but it also often produces very uneven outcomes. In contrast, universities have a lasting presence, physical space where intense human, face-to-face collaboration can take place, and deep expertise in many domains. I hope we can leverage those strengths of the university together with the practices of the OSS culture, to create a better future of computational science.

For the next few years, I hope to continue building great open source tools for scientific computing (IPython and much more), but also to bring these questions and ideas into the heart of the debate about what it means to be an academic scientist. Ask me in five years how it went :)

Partnerships at Berkeley

In addition to our partners at UW and NYU, this effort at Berkeley isn't happening in a vacuum, quite the opposite. We hope BIDS will not only produce its own ideas, but that it will also help catalyze a lot of the activity that's already happening. Some of the people in these partner efforts are also co-PIs in the Moore/Sloan project, so we have close connections between all:

Note: as always, this post was written as an IPython Notebook, which can be obtained from my github repository. If you are reading the notebook version, the blog post is available here.

Monday, July 1, 2013

In Memoriam, John D. Hunter III: 1968-2012

I just returned from the SciPy 2013 conference, whose organizers kindly invited me to deliver a keynote. For me this was a particularly difficult, yet meaningful edition of SciPy, my favorite conference. It was only a year ago that John Hunter, creator of matplotlib, had delivered his keynote shortly before being diagnosed with terminal colon cancer, from which he passed away on August 28, 2012 (if you haven't seen his talk, I strongly recommend it for its insights into scientific open source work).

On October 1st 2012, a memorial service was held at the University of Chicago's Rockefeller Chapel, the location of his PhD graduation. On that occasion I read a brief eulogy, but for obvious reasons only a few members from the SciPy community were able to attend. At this year's SciPy conference, Michael Droettboom (the new project leader for matplotlib) organized the first edition of the John Hunter Excellence in Plotting Contest, and before the awards ceremony I read a slightly edited version of the text I had delivered in Chicago (you can see the video here). I only made a few changes for brevity and to better suit the audience of the SciPy conference. I am reproducing it below.

I also went through my photo albums and found images I had of John. A memorial fund has been established in his honor to help with the education of his three daughters Clara, Ava and Rahel (Update: the fund was closed in late 2012 and its proceeds given to the family; moving forward, NumFOCUS sponsors the John Hunter Technology Fellowship, to which anyone can contribute).

Dear friends and colleagues,

I used to tease John by telling him that he was the man I aspired to be when I grew up. I am not sure he knew how much I actually meant that. I first met him over email in 2002, when IPython was in its infancy and had rudimentary plotting support via Gnuplot. He sent me a patch to support a plotting syntax more akin to that of matlab, but I was buried in my effort to finish my PhD and couldn’t deal with his contribution for at least a few months. In the first example of what I later came to know as one of his signatures, he kindly replied and then simply routed around this blockage by single-handedly creating matplotlib. For him, building an entire new visualization library from scratch was the sensible solution: he was never one to be stopped by what many would consider an insurmountable obstacle.

Our first personal encounter was at SciPy 2004 at Caltech. I was immediately taken by his unique combination of generous spirit, sharp wit and technical prowess, and over the years I would grow to love him as a brother. John was a true scholar, equally at ease in a conversation about monetary policy, digital typography or the intricacies of C++ extensions in Python. But never once would you feel from him a hint of arrogance or condescension, something depressingly common in academia. John was driven only by the desire to work on interesting questions and to always engage others in a meaningful way, whether solving their problems, lifting their spirits or simply sharing a glass of wine. Beneath a surface of technical genius, there lay a kind, playful and fearless spirit, who was quietly comfortable in his own skin and let the power of his deeds speak for him.

Beyond the professional context, John had a rich world populated by the wonders of his family, his wife Miriam and his daughters Clara, Ava and Rahel. His love for his daughters knew no bounds, and yet I never once saw him clip their wings out of apprehension. They would be up on trees, dangling from monkeybars or riding their bikes, and he would always be watchful but encouraging of all their adventures. In doing so, he taught them to live like he did: without fear that anything could be too difficult or challenging to accomplish, and guided by the knowledge that small slips and failures were the natural price of being bold and never settling for the easy path.

A year ago in this same venue, John drew lessons from a decade’s worth of his own contributions to our community, from the vantage point of matplotlib. Ten years earlier at U. Chicago, his research on pediatric epilepsy required either expensive and proprietary tools or immature free ones. Along with a few similarly-minded folks, many of whom are in this room today, John believed in a future where science and education would be based on openly available software developed in a collaborative fashion. This could be seen as a fool’s errand, given that the competition consisted of products from companies with enormous budgets and well-entrenched positions in the marketplace. Yet a decade later, this vision is gradually becoming a reality. Today, the Scientific Python ecosystem powers everything from history-making astronomical discoveries to large financial modeling companies. Since all of this is freely available for anyone to use, it was possible for us to end up a few years ago in India, teaching students from distant rural colleges how to work with the same tools that NASA uses to analyze images from the Hubble Space Telescope. In recognition of the breadth and impact of his contributions, the Python Software Foundation awarded him posthumously the first installment of its highest distinction, the PSF Distinguished Service Award.

John’s legacy will be far-reaching. His work in scientific computing happened in a context of turmoil in how science and education are conducted, financed and made available to the public. I am absolutely convinced that in a few decades, historians of science will describe the period we are in right now as one of deep and significant transformations to the very structure of science. And in that process, the rise of free openly available tools plays a central role. John was on the front lines of this effort for a decade, and with his accomplishments he shone brighter than most.

John’s life was cut far, far too short. We will mourn him for time to come, and we will never stop missing him. But he set the bar high, and the best way in which we can honor his incredible legacy is by living up to his standards: uncompromising integrity, never-ending intellectual curiosity, and most importantly, unbounded generosity towards all who crossed his path. I know I will never grow up to be John Hunter, but I know I must never stop trying.

Fernando Pérez

June 27th 2013, SciPy Conference, Austin, TX.

Thursday, May 30, 2013

Exploring Open Data with Pandas and IPython at the Berkeley I School

"Working with Open Data", a course by Raymond Yee

This will be a guest post, authored by Raymond Yee from the UC Berkeley School of Information (or I School, as it is known around here). This spring, Raymond has been teaching a course titled "Working with Open Data", where students learn how to work with openly available data sets with Python.

Raymond has been using IPython and the notebook since the start of the course, as well as hosting lots of materials directly using github. He kindly invited me to lecture in his course a few weeks ago, and I gave his students an overview of the IPython project as well as our vision of reproducible research and of building narratives that are anchored in code and data that are always available for inspection, discussion and further modification.

Towards the end of the course, his students had to develop a final project, organizing themselves in groups of 2-4 and producing a final working system that would use open datasets to produce an interesting analysis of their choice.

I recently had the chance to see the final projects, and I have to say that I walked out blown away by the results. This is a group of students who don't come from traditionally computation-intensive backgrounds, as the only requirement was some basic Python experience. And in a matter of a few weeks, they created very compelling tools that dove into problem domains ranging from health care to education and even sports, producing interesting results complete with a narrative, plots and even interactive JavaScript controls and SVG output elements. Keep in mind that the tools for some of this aren't even really documented or explained much in IPython yet, as we haven't dug into that part in earnest (that is our Fall 2013 plan).

The students obviously did run into some issues, and I took notes on what we can do to improve the situation. We had a follow-up meeting on campus where we gave them pointers on how to do certain things more easily. But to me, these results validate the idea that we can construct computational narratives, based on code and data, that will ultimately lead to a more informed discourse.

I am very much looking forward to future collaborations with Raymond: he has shown that we can create an educational experience around data-driven discovery, using IPython, pandas, matplotlib and the rest of the SciPy stack, that is completely accessible to students without major computational training, and that lets them produce interesting results around socially interesting and relevant datasets.

I will now pass this on to Raymond, so he can describe a little bit more about the course, challenges they encountered, and highlight some of the course projects that are already available on github for further forking and collaboration.

As always, this post is available as an IPython notebook. The rest of the post is Raymond's authorship.

The course

Over a 15-week period, my students and I met twice a week to study open data, using Python to access, process, and interpret that data. Twenty-four students completed the course: 17 Masters students from the I School and 7 undergraduates from electrical engineering/computer science, statistics, and business.

We covered about half of our textbook, Python for Data Analysis by Wes McKinney. Accordingly, a fair amount of our energy was directed to studying the pandas library. The prerequisite programming background was the one-semester Python minimum requirement for I School Masters students. The students learned a lot (while having a good time overall, so I'm told) about using pandas and the IPython notebook by working through a series of assignments, filling in missing code in IPython notebooks I had created to illustrate various techniques for data analysis. (Most of the resources for the course are contained in the course github repository.)
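To give a flavor of those fill-in-the-blank pandas assignments (this is a hypothetical reconstruction of the style, not one of the actual course notebooks, and the data is made up), a typical exercise might hand students a small table and ask them to complete a grouped aggregation:

```python
import pandas as pd

# A small open-data-style table; in the course, data came from real sources.
df = pd.DataFrame({
    "state": ["CA", "CA", "NY", "NY", "TX"],
    "school": ["A", "B", "C", "D", "E"],
    "enrollment": [500, 800, 650, 300, 900],
})

# Exercise: compute total and mean enrollment per state.
# Students would fill in the groupby/aggregation call below.
summary = df.groupby("state")["enrollment"].agg(["sum", "mean"])
print(summary)
```

The pedagogical value of this pattern is that the notebook already contains the data loading and the narrative; the student supplies only the one analytical step being taught, and gets immediate visual feedback in the notebook.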

My students and I were particularly grateful for the line-up of guest speakers (who included Fernando Perez), who shared their expertise on topics including web scraping to archive legal cases, the UC Berkeley Library Data Lab, open data in archaeology, open-access web crawling, and scientific reproducibility.

Final projects

The culmination of the course came in the final projects, where groups of two to four students designed and implemented data analysis projects. The final deliverable for the course was an IPython notebook that was expected to contain the following attributes:

  • a clear articulation of the problem being addressed in the project
  • a clear description of what was originally proposed and what the students ended up doing, describing what led the students to go from the proposed work to the final outcome
  • a thorough description of what was behind-the-scenes: the sources of data from which the students drew and the code they wrote to analyze the data
  • a clear description of the results and precise details on how to reproduce the results
  • a description of the future work the students would pursue if they were to continue the project beyond the course
  • a paragraph outlining how the work was split among group members.

The students and I welcomed enthusiastic visitors to the Open House, to which the I School community and, indeed, the larger campus community were invited.

Working with Open Data 2013 Open House


Here are abstracts for the eight projects, where each screenshot is a link to its IPython notebook. Enjoy!

Stock Performance of Product Releases
Edward Lee, Eugene Kim

We draw connections among openly available data pertaining to Apple in order to examine how Apple's stock performance was impacted by particular products. We examine Wikipedia data for detailed information on Apple's product releases, use Yahoo Finance's API for specific stock performance metrics, and use openly available Form 10-Qs for internal (financial) changes at Apple. The main purpose is to examine available data to draw new conclusions about the period around a product's release date.

Education First
Carl Shan, Bharathkumar Gunasekaran, Haroon Rasheed Paul Mohammed, Sumedh Sawant

Most parents nowadays have a general sense of the significant factors in choosing a school for their children. However, lacking adequate tools and information sources, most of these parents have a hard time measuring, weighing and comparing these factors across geographical areas when trying to pick the best place to live with the best schools.

Thus, our team aims to address this problem by visualizing statistical data from the NCES together with geo-data, helping parents through the process of picking the best area to live in with the best schools. Parents can specify exactly which parameters they consider important in their decision, and we generate a heat map of the state they're interested in living in, dynamically colored according to how closely each county matches their preferences. The heat map is displayed in a web browser.

The League of Champions
Natarajan Chakrapani, Mark Davidoff, Kuldeep Kapade

In the soccer world, there is a lot of money involved in the transfer of players among the premier leagues around the world. Focusing on the English Premier League, our project, "The League Of Champions", aims to analyze the return on investment on soccer transfers made by teams in that league. It measures each club's return on every dollar spent on acquired players for a season, using parameters like goals scored, active time on the field, assists, etc. In addition, we analyze how big a factor player age is in commanding a high transfer fee, and whether clubs prefer to pay large amounts for specialist players in specific field positions.

All About TEDx
Chan Kim, JT Huang

TED is a nonprofit devoted to Ideas Worth Spreading. It started out (in 1984) as a conference bringing together people from three worlds: Technology, Entertainment, Design. The TED Open Translation Project brings TED Talks beyond the English-speaking world by offering subtitles, interactive transcripts and the ability for any talk to be translated by volunteers worldwide. The project was launched with 300 translations, 40 languages and 200 volunteer translators; now, there are more than 32,000 completed translations from the thousands-strong community. The TEDx program is designed to give communities the opportunity to stimulate dialogue through TED-like experiences at the local level.

Our project aims to encourage people to translate TEDx Talks as well, by showing how TEDx Talk videos are translated and spread across different languages, places and topics, and by comparing their spread with that of TED Talk videos.

Environmental Health Gap
Rohan Salantry, Deborah Linton, Alec Hanefeld, Eric Zan

There is growing evidence that environmental factors trigger chronic diseases such as asthma, which result in billions in health care costs. However, a gap in knowledge exists concerning the extent of the link and where it is most prevalent. We aim to create a framework for closing this gap by integrating health and environmental-condition data sets. Specifically, the project links emissions data from the EPA with data from the California Department of Public Health, in an attempt to find a correlation between incidences of asthma treatment and emissions regarded as asthma triggers.

The project hopes to be a stepping stone for policy decisions concerning the value tradeoff between health care treatment and environmental regulation as well as where to concentrate resources based on severity of need.
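The core of the analysis — joining two data sets on a shared geographic key and correlating them — can be sketched as follows. The county names and numbers here are made up purely for illustration; the real project would use EPA emissions records and California Department of Public Health asthma data.

```python
# Invented county-level figures: emissions (tons) and asthma treatment rates.
emissions = {"Alameda": 12.1, "Fresno": 30.5, "Kern": 28.9, "Marin": 5.2}
asthma_rate = {"Alameda": 48.0, "Fresno": 95.3, "Kern": 90.1, "Marin": 30.7}

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Join the two data sets on county name, then correlate.
counties = sorted(set(emissions) & set(asthma_rate))
r = pearson([emissions[c] for c in counties],
            [asthma_rate[c] for c in counties])
print(round(r, 3))
```

A real analysis would of course need far more care (confounders, population adjustment, spatial resolution), but the join-then-correlate skeleton is the same.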

World Bank Data Analysis
Aisha Kigongo, Sydney Friedman, Ignacio Pérez

Our goal was to use a variety of tools to investigate the impact of project funding in developing countries. To do so, we looked at open data from the World Bank, which keeps detailed records of every project that gets funded, who funds it, and whether the project's goal is agricultural, economic or health-related. Using Python, we worked from a World Bank index to see which countries received the most funding and how they related to various indicators such as the Human Development Index and the Freedom Index, with health, educational and other economic indexes to follow. Our secondary goal is to analyze what insight open data can give us into how effective initiatives and funding actually are, as opposed to what they're meant to be.

Dr. Book
AJ Renold, Shohei Narron, Alice Wang

When we read a book, all the information is contained in that resource. But what if you could learn more about a concept, historical figure, or location presented in a chapter? Dr. Book expands your reading experience by connecting people, places, topics and concepts within a book to render a webpage linking these resources to Wikipedia.

Book Hunters
Fred Chasen, Luis Aguilar, Sonali Sharma

When we search for books on the internet, we are often overwhelmed with results from various sources, and it's difficult to get direct, trusted URLs to books. Project Gutenberg, HathiTrust and Open Library each provide an extensive online library with its own large repository of titles. By combining their catalogs, Book Hunters enables querying for a book across these different sources. Our project will also highlight key statistics about the three datasets: the number of books in each source, formats, languages and publishing dates. In addition, users can search for a particular book of interest; we return combined results from all three sources along with direct links to the PDF, text or EPUB version of the book. This is an exercise in filtering results for users and giving them easy access to the books they are looking for.
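The combined-catalog idea reduces to merging per-source indexes at query time. A minimal sketch, with invented titles and holdings standing in for the real Project Gutenberg, HathiTrust and Open Library catalogs:

```python
# Each source maps a title to the formats it offers (toy data).
catalogs = {
    "gutenberg": {"Moby Dick": ["epub", "txt"]},
    "hathitrust": {"Moby Dick": ["pdf"], "Walden": ["pdf"]},
    "openlibrary": {"Walden": ["epub"]},
}

def search(title):
    """Return {source: formats} for every source that holds the title."""
    return {src: books[title]
            for src, books in catalogs.items() if title in books}

print(search("Moby Dick"))  # found in gutenberg and hathitrust
```

The real system would issue queries against each provider's API or a pre-merged index, but the result shape — one hit list per source, merged for the user — is the same.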

Friday, April 19, 2013

"Literate computing" and computational reproducibility: IPython in the age of data-driven journalism

As "software eats the world" and we become awash in the flood of quantitative information denoted by the "Big Data" buzzword, it's clear that informed debate in society will increasingly depend on our ability to communicate information that is based on data. And for this communication to be a truly effective dialog, it is necessary that the arguments made based on data can be deconstructed, analyzed, rebutted or expanded by others. Since these arguments in practice often rely critically on the execution of code (whether an Excel spreadsheet or a proper program), it means that we really need tools to effectively communicate narratives that combine code, data and the interpretation of the results.

I will point out here two recent examples, taken from events in the news this week, where IPython has helped this kind of discussion, in the hopes that it can motivate a more informed style of debate where all the moving parts of a quantitative argument are available to all participants.

Insight, not numbers: from literate programming to literate computing

The computing community has known for decades about the "literate programming" paradigm introduced by Don Knuth in the early 80's and fully formalized in his famous 1992 book. Briefly, Knuth's approach proposes writing computer programs in a format that mixes the code and a textual narrative together, and generating from this format separate files that contain either actual code that can be compiled/executed by the computer, or a narrative document that explains the program and is meant for human consumption. The idea is that by allowing authors to maintain a close connection between code and narrative, a number of benefits will ensue (clearer code, fewer programming errors, more meaningful descriptions than mere comments embedded in the code, etc).
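The "tangle" step of this paradigm — extracting the machine-readable code out of the mixed narrative — can be illustrated with a toy example. Real systems (Knuth's WEB, noweb) are far richer; the `<<chunk>>=` marker below merely mimics their flavor.

```python
# A literate source mixing prose and one named code chunk (toy format).
literate_source = """\
The function below squares its input.
<<code>>=
def square(x):
    return x * x
@
That concludes the program.
"""

def tangle(text):
    """Extract only the code lines between a <<...>>= marker and '@'."""
    out, in_code = [], False
    for line in text.splitlines():
        if line.startswith("<<") and line.endswith(">>="):
            in_code = True
        elif line == "@":
            in_code = False
        elif in_code:
            out.append(line)
    return "\n".join(out)

print(tangle(literate_source))
```

The complementary "weave" step would format the same source for human reading, typesetting the prose and pretty-printing the code.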

I don't take any issue with this approach per se, but I don't personally use it because it's not very well suited to the kinds of workflows that I need in practice. These require the frequent execution of small fragments of code, in an iterative cycle where code is run to obtain partial results that inform the next bit of code to be written. Such is the nature of interactive exploratory computing, which is the bread and butter of many practicing scientists. This is the kind of workflow that led me to creating IPython over a decade ago, and it continues to inform basically every decision we make in the project today.

As Hamming famously said in 1962, "The purpose of computing is insight, not numbers." IPython tries to help precisely in this kind of usage pattern of the computer, in contexts where there is no clear notion in advance of what needs to be done, so the user is the one driving the computation. However, IPython also tries to provide a way to capture this process, and this is where we join back with the discussion above: while LP focuses on providing a narrative description of the structure of an algorithm, our working paradigm is one where the act of computing occupies the center stage.

From this perspective, we therefore refer to the workflow exposed by these kinds of computational notebooks (not just IPython, but also Sage, Mathematica and others) as "literate computing": it is the weaving of a narrative directly into a live computation, interleaving text with code and results to construct a complete piece that relies equally on the textual explanations and the computational components. For the goals of communicating results in scientific computing and data analysis, I think this model is a better fit than the literate programming one, which is rather aimed at developing software in tight concert with its design and explanatory documentation. I should note that we have some ideas on how to make IPython stronger as a tool for "traditional" literate programming, but it's a bit early for us to focus on that, as we first want to solidify the computational workflows possible with IPython.

As I mentioned in a previous blog post about the history of the IPython notebook, the idea of a computational notebook is neither new nor ours. Several IPython developers used other similar systems extensively for a long time, and we took lots of inspiration from them. What we have tried to do, however, is take a fresh look at these ideas, so that we can build a computational notebook that provides the best possible experience for computational work today. That means taking the existence of the Internet as a given in terms of using web technologies, an architecture based on well-specified protocols and reusable low-level formats (JSON), a language-agnostic view of the problem, and a concern for the entire cycle of computing from the beginning. We want to build a tool that is just as good for individual experimentation as it is for collaboration, communication, publication and education.
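Because the notebook file format is just JSON, any language with a JSON parser can consume it. The dict below is a simplified stand-in for the shape of an `.ipynb` file (real files carry format-version fields and richer metadata), just to show why the choice of a low-level, well-specified format matters:

```python
import json

# A simplified notebook structure: metadata plus a list of typed cells.
nb = {
    "metadata": {"language": "python"},
    "cells": [
        {"cell_type": "markdown", "source": "# A narrative heading"},
        {"cell_type": "code", "source": "x = 2 + 2", "outputs": []},
    ],
}

serialized = json.dumps(nb)      # what lives on disk
loaded = json.loads(serialized)  # what any tool, in any language, reads back

# A converter or test runner only needs to walk the cell list:
code_cells = [c["source"] for c in loaded["cells"] if c["cell_type"] == "code"]
print(code_cells)  # ['x = 2 + 2']
```

This is what makes tools like nbviewer and nbconvert possible without any coupling to the Python kernel itself.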

Government debt, economic growth and a buggy Excel spreadsheet: the code behind the politics of fiscal austerity

In the last few years, extraordinarily contentious debates have raged in the circles of political power and fiscal decision making around the world regarding the relation between government debt and economic growth. One of the centerpieces of this debate was a paper from Harvard economists C. Reinhart and K. Rogoff, later turned into a best-selling book, which argued that beyond 90% debt ratios, economic growth would plummet precipitously.

This argument was used (amongst others) by politicians to justify some of the extreme austerity policies that have been foisted upon many countries in the last few years. On April 15, a team of researchers from U. Massachusetts published a re-analysis of the original data, showing that Reinhart and Rogoff had made both fairly obvious coding errors in their original Excel spreadsheets and some statistically questionable manipulations of the data. Herndon, Ash and Pollin (the U. Mass authors) published all their scripts in R so that others could inspect their calculations.

Two posts from the Economist and the Roosevelt Institute nicely summarize the story with a more informed policy and economics discussion than I can make. James Kwak has a series of posts that dive into technical detail and question the horrible choice of using Excel, a tool that should for all intents and purposes be banned from serious research as it entangles code and data in ways that more or less guarantee serious errors in anything but trivial scenarios. Victoria Stodden just wrote an excellent new post with specific guidance on practices for better reproducibility; here I want to take a narrow view of these same questions focusing strictly on the tools.

As reported in Mike Konczal's piece at the Roosevelt Institute, Herndon et al. had to reach out to Reinhart and Rogoff for the original code, which hadn't been made available before (apparently causing much frustration in economics circles). It's absolutely unacceptable that major policy decisions that impact millions worldwide had until now hinged effectively on the unverified word of two scientists: no matter how competent or honorable they may be, we know everybody makes mistakes, and in this case there were both egregious errors and debatable assumptions. As Konczal says, "all I can hope is that future historians note that one of the core empirical points providing the intellectual foundation for the global move to austerity in the early 2010s was based on someone accidentally not updating a row formula in Excel." To that I would add the obvious: this should never have happened in the first place, as we should have been able to inspect that code and data from the start.

Now, moving over to IPython, something interesting happened: when I saw the report about the Herndon et al. paper and realized they had published their R scripts for all to see, I posted this request on Twitter:

It seemed to me that the obvious thing to do would be to create a document that explained together the analysis and a bit of narrative using IPython, hopefully more easily used as a starting point for further discussion. What I didn't really expect is that it would take less than three hours for Vincent Arel-Bundock, a PhD Student in Political Science at U. Michigan, to come through with a solution:

I suggested that he turn this example into a proper repository on github with the code and data, which he quickly did:

So now we have a full IPython notebook, kept in a proper github repository. This repository can enable an informed debate about the statistical methodologies used for the analysis, and now anyone who simply installs the SciPy stack can not only run the code as-is, but explore new directions and contribute to the debate in a properly informed way.

On to the heavens: the New York Times' infographic on NASA's Kepler mission

As I was discussing the above with Vincent on Twitter, I came across this post by Jonathan Corum, an information designer who works as NY Times science graphics editor:

The post links to a gorgeous, animated infographic that summarizes the results NASA's Kepler spacecraft has obtained so far, and which accompanies a full article at the NYT on Kepler's most recent results: a pair of planets that seem to have just the right features to possibly support life, a quick 1,200 light-year hop from us.

Jonathan indicated that he converted his notebook to a Python script later on for version control and automation, though I explained to him that he could have continued using the notebook, since the --script flag would give him a .py file if needed, and it's also possible to execute a notebook just like a script, with a bit of additional support code:
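The "bit of additional support code" needed to run a notebook like a script amounts to loading the JSON, pulling out the code cells, and executing them in a shared namespace. This is only a sketch of the core idea — the actual support code in IPython also handles outputs, errors and display — and the notebook content below is invented:

```python
import json

# A tiny stand-in for an .ipynb file (simplified structure).
notebook_json = json.dumps({
    "cells": [
        {"cell_type": "markdown", "source": "Compute a value."},
        {"cell_type": "code", "source": "result = 6 * 7"},
    ]
})

def run_notebook(text):
    """Execute every code cell in order, sharing one namespace."""
    ns = {}
    for cell in json.loads(text)["cells"]:
        if cell["cell_type"] == "code":
            exec(cell["source"], ns)  # later cells see names from earlier ones
    return ns

print(run_notebook(notebook_json)["result"])  # 42
```

With something like this in a small driver script, the notebook can stay the single source of truth while still fitting into version control and automation.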

In this case Jonathan's code isn't publicly available, but I am still very happy to see this kind of usage: it's a step in the right direction already and as more of this analysis is done with open-source tools, we move further towards the possibility of an informed discussion around data-driven journalism.

I also hope he'll release some of the code later on, so that others can build upon it for similar analyses. I'm sure lots of people would be interested, and it wouldn't detract in any way from the interest in his own work, which is strongly tied to the rest of the NYT's editorial resources and strengths.

Looking ahead from IPython's perspective

Our job with IPython is to think deeply about questions regarding the intersection of computing, data and science, but it's clear to me at this point that we can contribute in contexts beyond pure scientific research. I hope we'll be able to provide folks who interact directly with the public, such as journalists, with tools that support a more informed and productive debate.

Coincidentally, UC Berkeley will be hosting a symposium on data and journalism on May 4, and in recent days I've had very productive interactions with folks in this space on campus. Cathryn Carson currently directs the newly formed D-Lab, whose focus is precisely the use of quantitative and data-intensive methods in the social sciences, and her team has recently been teaching workshops on using Python and R for social scientists. And just last week I lectured in Raymond Yee's course (from the School of Information), where they are using the notebook extensively, following Wes McKinney's excellent Python for Data Analysis as the class textbook. Given all this, I'm fairly optimistic about the future of a productive dialog and collaborations on campus, given that we have a lot of the IPython team working full-time here.

Note: as usual, this post is available as an IPython notebook in my blog repo.

Tuesday, November 20, 2012

Back from PyCon Canada 2012

I just got back a few days ago from the 2012 edition of PyCon Canada, which was a great success. I wanted to thank the team who invited me for a fantastic experience: Diana Clarke, who as conference chair did an incredible job; Greg Wilson from Software Carpentry, with whom I had a chance to interact a lot (he already has a long list of ideas for the IPython notebook in teaching contexts that we're discussing); Mike DiBernardo; and the rest of the PyConCa team. They ran a conference with a great vibe and tons of opportunity for engaging discussion.

Thanks to Greg I also had a chance to give a couple of more academically-oriented talks at U. Toronto facilities, both at the Sunnybrook hospital and their SciNet HPC center, where we had some great discussions. I look forward to future collaborations with some of the folks there.

The PyConCa team kindly invited me to deliver the closing keynote for the conference, and I tried to give a presentation on the part of the Python world that I've been involved with, namely scientific computing, that would still be of interest to the broader Python development community in attendance. I tried to illustrate where Python has been a great success for modern scientific research, and in doing so I took a deliberately biased view, spending a good amount of time discussing IPython, which is how I entered that world in the first place.

This is the video of the talk:

and here are the accompanying slides.

I'm too far behind to do a proper recap of the conference itself, but I want to mention one of the highlights for me: a fantastic talk by Elizabeth Leddy, a prominent figure in the Plone world, on how to build sustainable communities. She had a ton of useful insight from in-the-trenches experience with the Plone foundation, and I fortunately got to pick her brain for a while after the talk on these topics. As we gradually build up somewhat similar efforts in the scientific Python world with NumFOCUS, I think she'll be a great person for us to bug every now and then for wisdom.

IPython at the sprints

I managed to stay for the two days of sprints after the end of the main conference, and we had a great time: a number of people made contributions to IPython for the first time, so I'd like to quickly recap here what happened.

Nose extension

Taavi Burns and Greg Ward of distutils fame worked hard on a fairly tricky but extremely useful idea suggested by Greg Wilson: easy in-place use of nose to run tests inside a notebook. This was done by taking inspiration (and I think code) from Catherine Devlin's recent work on integrating doctesting inside the notebook.

The new nose extension hasn't been merged yet, but you can already get the code from github, as usual. Briefly (from Taavi's instructions), this little IPython extension gives you the ability to discover and run tests using Nose in an IPython Notebook.

You start with a cell containing:

%load_ext ipython_nose

Then write tests that conform to Nose conventions, e.g.

  def test_arithmetic():
      assert 1+1 == 2

And where you want to run your tests, you add a cell consisting of

%nose
and run it: that will discover your test_* functions, run them, and report how many passed and how many failed, with stack traces for each failure.

WebGL-based 3d protein visualization

RishiRamraj, Christopher Ing and Jonathan Villemaire-Krajden implemented an extremely cool visualization widget that can, using the IPython display protocol, render a protein structure directly in a 3d interactive window. They used Konrad Hinsen's MMTK toolkit, and the resulting code is as simple as:

from MMTK.Proteins import Protein

You can see what the output looks like in this short video shot by Taavi Burns just as they got it working and we were all very excited looking at the result; the code is already available on github.
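The display protocol this widget relies on is simple: any object can define a `_repr_html_` method (or `_repr_png_`, `_repr_svg_`, etc.), and the notebook will render that rich representation instead of the plain `repr`. The toy class below illustrates the mechanism; it is not the actual MMTK/WebGL widget:

```python
# Any object with a _repr_html_ method gets rendered as HTML in the notebook.
class ColoredBox:
    def __init__(self, color):
        self.color = color

    def _repr_html_(self):
        # The notebook calls this and injects the returned HTML into the page.
        return ('<div style="background:%s;width:50px;height:50px"></div>'
                % self.color)

box = ColoredBox("teal")
print(box._repr_html_())
```

The protein widget works the same way at heart, except its rich representation carries the markup and scripts needed to spin up an interactive WebGL view.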

I very much look forward to many more tools of this kind being developed, and in fact Cyrille Rossant wasted no time at all building off this to provide fast 2-d visualizations rendered via WebGL with his Galry library:

Software Carpentry

In addition to the Nose extension above, Greg Wilson had a ton of ideas on things that could be added to the notebook that he thinks would help in the context of teaching workshops such as those that Software Carpentry presents. Their audience is typically composed of beginning programmers, scientists who may be experts in their discipline but who have little to no formal computational training and are now tasked with managing often quite complex computational workflows. Since SWC recently announced they would be switching to the notebook as their main teaching platform, they obviously are thinking deeply about how to make the best use of it and where the notebook can improve for this kind of use case.

These are still conversations that I hope will soon turn into concrete issues and code repositories so we can begin testing them, and that kind of validated testing is very useful for us. Since at this point we have too many feature requests from multiple fronts to satisfy them all, we are trying to focus on ensuring that IPython can support individual projects building their own custom tools and extensions. We can't possibly merge every last idea from every front into IPython, but we can work to ensure it's a flexible and coherent enough foundation that others can build their own highly customized experiences on top of it. Once these get widely tested and validated, pieces of clearly generic value may percolate into the core, but in the meantime this approach means that other projects (SWC being just one example among many) don't need to wait for us to add every feature they need.

We will focus on addressing any limitations our architecture may have, so that such extensibility works well and the life of third-party projects isn't a fight against our interfaces.

A first-time contributor to open source

Last, but not least, I had the great experience of working with David Kua, a CS student from U. Toronto who had never made a contribution to open source and wanted to work on IPython. Right during the sprints we were able to merge his first pull request into nbconvert, and he immediately started working on a new one for IPython that by now has also been merged.

That last one required that he learn how to rebase his git repo (he had some extraneous commits originally) and go through a fair amount of feedback before merging: this is precisely the real-world cycle of open source contributions. It's always great to see a brand new contributor in the making, and I very much look forward to much more work from David, whether he decides to do it in IPython or in any other open source project that catches his interest.


Since I am now writing all my posts as IPython notebooks (even when there's no code, it's a really nice way to get instant feedback on markdown), you can get the notebook for this post from my repo.

Sunday, October 14, 2012

Help save open space in the Bay Area by protecting Knowland Park from development

Vote NO on new Tax Measure A1

Update: there is now evidence that Zoo officials have actually violated election laws in their zeal to promote measure A1.

I normally only blog about technical topics, but the destruction of a beautiful piece of open space in the Bay Area is imminent, and I want to at least do a little bit to help prevent this disaster.

In short: there's a tax measure on the November ballot, Measure A1, that would impose a parcel tax on all residences and businesses in Alameda County to fund the Oakland Zoo for the next 25 years.  The way the short text on the ballot is worded makes it appear as something geared towards animal care for a cash-strapped Zoo.  The sad reality is that the full text of the measure allows the Zoo to use these funds for a very controversial expansion plan that includes a 34,000 sq. ft. visitor center, gift shop and restaurant serviced by a ski gondola atop one of the last pristine remaining ridges in Knowland Park, an Oakland city park that sits above the Zoo.

Yes, it's as bad as it sounds; the beautiful ridge in the background:

that is today part of an unspoiled open space, would be closed off by a fence, and a restaurant would be built atop it, serviced by a ski gondola reaching it from the bottom of the hill. Here are a few more pics from the same album, as well as a great photo essay on the park from the AllThingsOakland blog, and some more history of the park.

Restaurant development disguised as animal care

The Zoo claims to be strapped for cash, yet it is spending over $1 million on a media blitz to get this measure passed, presenting it only as an animal-care issue. I am a huge animal lover and donate regularly to the San Diego Zoo, but the situation with the Oakland Zoo is unfortunately a different story: they see the 525-acre Knowland Park above the Zoo as their personal backyard, not as a resource that belongs to all of us. It has been impossible, in years of negotiations, to get the Zoo to sign anything that commits them to respect the boundaries of the park in the future. They see this tax measure as their strategic "nuclear weapon" to destroy the park, and to get it they are willing to burn through cash they should instead be using for animal care.

I urge you to consider this as you go to the polls in November: all Alameda county voters will end up having a say on whether "nature preservation" in the East Bay is spelled "huge restaurant and a ski gondola on open space". By voting NO on A1 you will help prevent such madness.

More information

Here are a few relevant links with details and further info:

A final note: the citizen's group fighting to save the park can use all the help in the world. You can make donations or join the effort in many other ways; don't hesitate to ask me for more info.  And please share this post as widely as possible!

Friday, September 7, 2012

Blogging with the IPython notebook

Update (May 2014): Please note that these instructions are outdated. While it is still possible (and in fact easier) to blog with the Notebook, the exact process has changed now that IPython has an official conversion framework. However, Blogger isn't the ideal platform for this (though it can be made to work). If you are interested in using the Notebook as a tool for technical blogging, I recommend looking at Jake van der Plas' Pelican support or Damián Avila's support in Nikola.

Update: made full github repo for blog-as-notebooks, and updated instructions on how to more easily configure everything and use the newest nbconvert for a more streamlined workflow.

Since the notebook was introduced with IPython 0.12, it has proved to be very popular, and we are seeing great adoption of the tool and the underlying file format in research and education. One persistent question we've had since the beginning (even prior to its official release) was whether it would be possible to easily write blog posts using the notebook. The combination of easy editing in markdown with the notebook's ability to contain code, figures and results, makes it an ideal platform for quick authoring of technical documents, so being able to post to a blog is a natural request.

Today, in answering a query about this from a colleague, I decided to check again on the status of our conversion pipeline, and I'm happy to report that with a bit of elbow grease, at least on Blogger, things work pretty well!

This post was entirely written as a notebook, and in fact I have now created a github repo, which means that you can see it directly rendered in IPython's nbviewer app.

The purpose of this post is to quickly provide a set of instructions on how I got it to work, and to test things out. Please note: this requires code that isn't quite ready for prime-time and is still under heavy development, so expect some assembly.

Converting your notebook to html with nbconvert

The first thing you will need is our nbconvert tool that converts notebooks across formats. The README file in the repo contains the requirements for nbconvert (basically python-markdown, pandoc, docutils from SVN and pygments).

Once you have nbconvert installed, you can convert your notebook to Blogger-friendly html with:

nbconvert -f blogger-html your_notebook.ipynb

This will leave two files on your computer, one named your_notebook.html and one named your_notebook_header.html; it might also create a directory called your_notebook_files if needed for ancillary files. The first file contains the body of your post and can be pasted wholesale into the Blogger editing area. The second file contains the CSS and Javascript material needed for the notebook to display correctly; you should only need to use it once, to configure your Blogger setup (see below):

# Only one notebook so far
(master)longs[blog]> ls
120907-Blogging with the IPython Notebook.ipynb  fig/  old/

# Now run the conversion:
(master)longs[blog]> nbconvert.py -f blogger-html 120907-Blogging\ with\ the\ IPython\ Notebook.ipynb

# This creates the header and html body files
(master)longs[blog]> ls
120907-Blogging with the IPython Notebook_header.html  fig/
120907-Blogging with the IPython Notebook.html         old/
120907-Blogging with the IPython Notebook.ipynb

Configuring your Blogger blog to accept notebooks

The notebook uses a lot of custom CSS for formatting input and output, as well as Javascript from MathJax to display mathematical notation. You will need all this CSS and the Javascript calls in your blog's configuration for your notebook-based posts to display correctly:

  1. Once authenticated, go to your blog's overview page by clicking on its title.
  2. Click on templates (left column) and customize using the Advanced options.
  3. Scroll down the middle column until you see an "Add CSS" option.
  4. Copy the entire contents of the _header file into the CSS box.

That's it, and you shouldn't need to do anything else as long as the CSS we use in the notebooks doesn't drastically change. This customization of your blog needs to be done only once.

While you are at it, I recommend you change the width of your blog so that cells have enough space for clean display; in experimenting I found out that the default template was too narrow to properly display code cells, producing a lot of text wrapping that impaired readability. I ended up using a layout with a single column for all blog contents, putting the blog archive at the bottom. Otherwise, if I kept the right sidebar, code cells got too squished in the post area.

I also had problems using some of the fancier templates available from 'Dynamic Views', in that I could never get inline math to render. But sticking to those from the Simple or 'Picture Window' categories worked fine and they still allow for a lot of customization.

Note: if you change blog templates, Blogger does destroy your custom CSS, so you may need to repeat the above steps in that case.

Adding the actual posts

Now, whenever you want to write a new post as a notebook, simply convert the .ipynb file to blogger-html and copy its entire contents to the clipboard. Then go to the 'raw html' view of the post, remove anything Blogger may have put there by default, and paste. You should also click on the 'options' tab (right hand side) and select both Show HTML literally and Use <br> tag, else your paragraph breaks will look all wrong.

That's it!

What can you put in?

I will now add a few bits of code, plots, math, etc, to show which kinds of content can be put in and work out of the box. These are mostly bits copied from our example notebooks so the actual content doesn't matter, I'm just illustrating the kind of content that works.

In [1]:
# Let's initialize pylab so we can plot later
%pylab inline
Welcome to pylab, a matplotlib-based Python environment [backend: module://IPython.zmq.pylab.backend_inline].
For more information, type 'help(pylab)'.

With pylab loaded, the usual matplotlib operations work

In [2]:
x = linspace(0, 2*pi)
plot(x, sin(x), label=r'$\sin(x)$')
plot(x, cos(x), 'ro', label=r'$\cos(x)$')
title(r'Two familiar functions')
legend()
Out [2]:
<matplotlib.legend.Legend at 0x3128610>

The notebook, thanks to MathJax, has great LaTeX support, so that you can type inline math $(1,\gamma,\ldots, \infty)$ as well as displayed equations:

$$ e^{i \pi}+1=0 $$

but by loading the sympy extension, it's easy to showcase math output from Python computations, where we don't type the math expressions as text; instead, the results of code execution are displayed in mathematical format:

In [3]:
%load_ext sympyprinting
import sympy as sym
from sympy import *
x, y, z = sym.symbols("x y z")

From simple algebraic expressions

In [4]:
Rational(3,2)*pi + exp(I*x) / (x**2 + y)
Out [4]:
$$\frac{3}{2} \pi + \frac{e^{\mathbf{\imath} x}}{x^{2} + y}$$
In [5]:
eq = ((x+y)**2 * (x+1))
eq
Out [5]:
$$\left(x + 1\right) \left(x + y\right)^{2}$$
In [6]:
expand(eq)
Out [6]:
$$x^{3} + 2 x^{2} y + x^{2} + x y^{2} + 2 x y + y^{2}$$

To calculus

In [7]:
diff(cos(x**2)**2 / (1+x), x)
Out [7]:
$$- 4 \frac{x \operatorname{sin}\left(x^{2}\right) \operatorname{cos}\left(x^{2}\right)}{x + 1} - \frac{\operatorname{cos}^{2}\left(x^{2}\right)}{\left(x + 1\right)^{2}}$$

For more examples of how to use sympy in the notebook, you can see our example sympy notebook or go to the sympy website for much more documentation.

You can easily include formatted text and code with markdown

You can italicize, boldface

  • build
  • lists

and embed code meant for illustration instead of execution in Python:

def f(x):
    """a docstring"""
    return x**2

or other languages:

for (i=0; i<n; i++) {
  printf("hello %d\n", i);
  x += 4;
}

And since the notebook can store displayed images in the file itself, you can show images which will be embedded in your post:

In [8]:
from IPython.display import Image
Out [8]:

You can embed YouTube videos using IPython's YouTubeVideo display object; this is my recent talk at SciPy'12 about IPython:

In [9]:
from IPython.display import YouTubeVideo
Out [9]:

Including code examples from other languages

Using our various script cell magics, it's easy to include code in a variety of other languages

In [10]:
%%ruby
puts "Hello from Ruby #{RUBY_VERSION}"
Hello from Ruby 1.8.7
In [11]:
%%bash
echo "hello from $BASH"
hello from /bin/bash
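These script magics essentially pipe the cell body through the named interpreter and capture its output. Outside the notebook, the same effect can be had from plain Python with the standard library (a minimal sketch, here calling the shell directly):

```python
import subprocess

# Run a shell command and capture its stdout as bytes, much like a
# %%bash cell does behind the scenes
out = subprocess.check_output("echo hello from the shell", shell=True)
print(out.decode().strip())  # hello from the shell
```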

And tools like the Octave and R magics let you interface with entire computational systems directly from the notebook. Here is the Octave magic; our example notebook contains more details:

In [12]:
%load_ext octavemagic
In [13]:
%%octave -s 500,500

# butterworth filter, order 2, cutoff pi/2 radians
b = [0.292893218813452  0.585786437626905  0.292893218813452];
a = [1  0  0.171572875253810];
freqz(b, a, 32);

The rmagic extension does a similar job, letting you call R directly from the notebook, passing variables back and forth between Python and R.

In [14]:
%load_ext rmagic 

Start by creating some data in Python

In [15]:
X = np.array([0,1,2,3,4])
Y = np.array([3,5,4,6,7])

Which can then be manipulated in R, with results available back in Python (in XYcoef):

In [16]:
%%R -i X,Y -o XYcoef
XYlm = lm(Y~X)
XYcoef = coef(XYlm)
print(summary(XYlm))

Call:
lm(formula = Y ~ X)

Residuals:
   1    2    3    4    5 
-0.2  0.9 -1.0  0.1  0.2 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)   3.2000     0.6164   5.191   0.0139 *
X             0.9000     0.2517   3.576   0.0374 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 0.7958 on 3 degrees of freedom
Multiple R-squared:  0.81, Adjusted R-squared: 0.7467 
F-statistic: 12.79 on 1 and 3 DF,  p-value: 0.03739 

In [17]:
XYcoef
Out [17]:
[ 3.2  0.9]
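As a sanity check (a small numpy aside, not part of the original notebook), an ordinary least-squares fit done entirely on the Python side recovers the same slope and intercept that R reported:

```python
import numpy as np

X = np.array([0, 1, 2, 3, 4])
Y = np.array([3, 5, 4, 6, 7])

# polyfit with degree 1 returns [slope, intercept] of the least-squares line;
# these match R's Estimate column: slope 0.9, intercept 3.2
slope, intercept = np.polyfit(X, Y, 1)
print(slope, intercept)
```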

And finally, in the same spirit, the cython magic extension lets you call Cython code directly from the notebook:

In [18]:
%load_ext cythonmagic
In [19]:
%%cython -lm
from libc.math cimport sin
print 'sin(1)=', sin(1)
sin(1)= 0.841470984808

Keep in mind, this is still experimental code!
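For comparison (a pure-Python aside, not in the original notebook), the standard library's math module gives the same value, so the Cython call to the C library's sin is easy to check directly:

```python
import math

# Python's math.sin wraps the same C library function that the
# Cython cell above calls, so the values agree
print('sin(1)=', math.sin(1))  # sin(1)= 0.8414709848078965
```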

Hopefully this post shows that the system is already useful for communicating technical content in blog form with a minimal amount of effort. But please note that many of these features are still under heavy development, so things are subject to change in the near future. By all means join the IPython dev mailing list if you'd like to participate and help us make IPython a better tool!