
Thursday, May 30, 2013

Exploring Open Data with Pandas and IPython at the Berkeley I School

"Working with Open Data", a course by Raymond Yee

This will be a guest post, authored by Raymond Yee from the UC Berkeley School of Information (or I School, as it is known around here). This spring, Raymond has been teaching a course titled "Working with Open Data", where students learn how to work with openly available data sets with Python.

Raymond has been using IPython and the notebook since the start of the course, as well as hosting lots of materials directly using github. He kindly invited me to lecture in his course a few weeks ago, and I gave his students an overview of the IPython project as well as our vision of reproducible research and of building narratives that are anchored in code and data that are always available for inspection, discussion and further modification.

Towards the end of the course, his students had to develop a final project, organizing themselves in groups of 2-4 and producing a final working system that would use open datasets to produce an interesting analysis of their choice.

I recently had the chance to see the final projects, and I have to say that I walked out blown away by the results. This is a group of students who don't come from traditionally computationally-intensive backgrounds, as the only requirement was some basic Python experience. And in a matter of a few weeks, they created very compelling tools that dove into different problem domains, from health care to education and even sports, producing interesting results complete with a narrative, plots and even interactive JavaScript controls and SVG output elements. Keep in mind that the tools to do some of this stuff aren't even really documented or explained much in IPython yet, as we haven't really dug into that part in earnest (that is our Fall 2013 plan).

The students obviously did run into some issues, and I took notes on what we can do to improve the situation. We had a follow-up meeting on campus where we gave them pointers on how to do certain things more easily. But to me, these results validate the idea that we can construct computational narratives based on code and data that will ultimately lead to a more informed discourse.

I am very much looking forward to future collaborations with Raymond: he has shown that we can create an educational experience around data-driven discovery, using IPython, pandas, matplotlib and the rest of the SciPy stack, that is completely accessible to students without major computational training, and that lets them produce interesting results around socially interesting and relevant datasets.

I will now pass this on to Raymond, so he can describe a little bit more about the course, challenges they encountered, and highlight some of the course projects that are already available on github for further forking and collaboration.

As always, this post is available as an IPython notebook. The rest of the post is Raymond's authorship.

The course

Over a 15-week period, my students and I met twice a week to study open data, using Python to access, process, and interpret that data. Twenty-four students completed the course: 17 Masters students from the I School and 7 undergraduates from electrical engineering/computer science, statistics, and business.

We covered about half of our textbook Python for Data Analysis by Wes McKinney. Accordingly, a fair amount of our energy was directed to studying the pandas library. The prerequisite programming background was the one semester Python minimum requirement for I School Masters students. The students learned a lot (while having a good time overall, so I'm told) about how to program pandas and use the IPython notebook by working through a series of assignments. Students filled in missing code in IPython notebooks I had created to illustrate various techniques for data analysis. (Most of the resources for the course are contained in the course github repository.)

My students and I were particularly grateful for the line-up of guest speakers (which included Fernando Perez), who shared their expertise on topics ranging from scraping the web to archive legal cases, the UC Berkeley Library Data Lab, open data in archaeology, open access web crawling, and scientific reproducibility.

Final projects

The culmination of the course came in the final projects, where groups of two to four students designed and implemented data analysis projects. The final deliverable for the course was an IPython notebook that was expected to contain the following attributes:

  • a clear articulation of the problem being addressed in the project
  • a clear description of what was originally proposed and what the students ended up doing, describing what led the students to go from the proposed work to the final outcome
  • a thorough description of what was behind-the-scenes: the sources of data from which the students drew and the code they wrote to analyze the data
  • a clear description of the results and precise details on how to reproduce the results
  • if the students were to continue their project beyond the course, what would be this future work
  • a paragraph outlining how the work was split among group members.

The students and I welcomed enthusiastic visitors to the Open House, which the I School community and, in fact, the larger campus community were invited to attend.

Working with Open Data 2013 Open House

Here are abstracts for the eight projects, where each screenshot is a link to its IPython notebook. Enjoy!

Stock Performance of Product Releases
Edward Lee, Eugene Kim

We draw connections among open data sources pertaining to Apple in order to examine how Apple's stock performance was affected by a given product release. We examine Wikipedia data for detailed information on Apple's product releases, use Yahoo Finance's API for specific stock performance metrics, and consult openly available Form 10-Qs for internal (financial) changes at Apple. The main purpose is to examine available data to draw new conclusions centered on the time around the product release date.

Education First
Carl Shan, Bharathkumar Gunasekaran, Haroon Rasheed Paul Mohammed, Sumedh Sawant

Most parents nowadays have a general sense of the significant factors for choosing a school for their children. However, with a lack of existing tools and information sources, most of these parents have a hard time measuring, weighing and comparing these factors in relation to geographical areas when they are trying to pick the best place to live in with the best schools.

Thus, our team aims to address this problem by visualizing the statistical data from the NCES with geo-data to help the parents through the process of picking the best area to live in with the best schools. Parents can specify exactly what parameters they consider important in their decision process and we will generate a heat-map of the state they’re interested in living in and dynamically color it according to how closely each county matches their preferences. The heat map will be displayed with a web browser.

The League of Champions
Natarajan Chakrapani, Mark Davidoff, Kuldeep Kapade

In the soccer world, there is a lot of money involved in the transfer of players in the premier leagues around the world. Focusing on the English Premier League, our project, “The League of Champions”, aims to analyze the return on investment on soccer transfers made by teams in the English Premier League. It aims to measure each club's return on every dollar spent on its acquired players for a season, on parameters like goals scored, active time on the field, assists, etc. In addition, we also look to analyze how big a factor player age is in commanding a high transfer fee, and whether clubs prefer to pay large amounts for specialist players in specific field positions.

All About TEDx
Chan Kim, JT Huang

TED is a nonprofit devoted to Ideas Worth Spreading. It started out (in 1984) as a conference bringing together people from three worlds: Technology, Entertainment, Design. The TED Open Translation Project brings TED Talks beyond the English-speaking world by offering subtitles, interactive transcripts and the ability for any talk to be translated by volunteers worldwide. The project was launched with 300 translations, 40 languages and 200 volunteer translators; now, there are more than 32,000 completed translations from the thousands-strong community. The TEDx program is designed to give communities the opportunity to stimulate dialogue through TED-like experiences at the local level.

Our project aims to encourage people to translate TEDx Talks as well, by showing how TEDx Talk videos are translated and spread across different languages, places and topics, and by comparing their spread with that of TED Talk videos.

Environmental Health Gap
Rohan Salantry, Deborah Linton, Alec Hanefeld, Eric Zan

There is growing evidence that environmental factors trigger chronic diseases such as asthma that result in billions in health care costs. However, a gap in knowledge exists concerning the extent of the link and where it is more prevalent. We aim to create a framework for closing this gap by integrating health and environmental condition data sets. Specifically, the project will link emissions data from the EPA and the California Department of Public Health in an attempt to find a correlation between incidences of asthma treatments and emissions seen as triggers for asthma.

The project hopes to be a stepping stone for policy decisions concerning the value tradeoff between health care treatment and environmental regulation as well as where to concentrate resources based on severity of need.
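
The linking step described in this abstract (join two data sets on a shared key, then measure how the two quantities move together) can be sketched in a few lines of plain Python. This is only an illustration under loud assumptions: the county labels and every number below are invented, not EPA or Department of Public Health data, and the `pearson` helper is written for this sketch rather than taken from the students' notebook.

```python
from math import sqrt

# Invented figures for illustration only; these are NOT real emissions or
# asthma-treatment data, and the county labels are placeholders.
emissions = {"County A": 10.0, "County B": 20.0, "County C": 30.0, "County D": 40.0}
asthma_visits = {"County B": 25.0, "County C": 35.0, "County D": 40.0, "County E": 12.0}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Join the two data sets on county, keeping only counties present in both.
shared = sorted(set(emissions) & set(asthma_visits))
r = pearson([emissions[c] for c in shared], [asthma_visits[c] for c in shared])
print(shared, round(r, 3))
```

A high coefficient on a join like this would only flag counties worth a closer look; it says nothing by itself about whether emissions cause the treatments.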

World Bank Data Analysis
Aisha Kigongo, Sydney Friedman, Ignacio Pérez

Our goal was to use a variety of tools to investigate the impact of project funding in developing countries. In order to do so, we looked at open data from the World Bank, which keeps close track of every project that gets funded, who funds it, and the goal of the project, whether agricultural, economic or related to health. Using Python, we used an index of the World Bank to see where the most funded countries were and how they related to various indicators such as the Human Development Index, the Freedom Index, and, for the future, health, educational and other economic indexes. Our secondary goal is to analyze what insight open data can give us as to how effective initiatives and funding actually are, as opposed to what they are meant to be.

Dr. Book
AJ Renold, Shohei Narron, Alice Wang

When we read a book, all the information is contained in that resource. But what if you could learn more about a concept, historical figure, or location presented in a chapter? Dr. Book expands your reading experience by connecting people, places, topics and concepts within a book to render a webpage linking these resources to Wikipedia.

Book Hunters
Fred Chasen, Luis Aguilar, Sonali Sharma

When we search for books on the internet we are often overwhelmed with results coming from various sources. It’s difficult to get direct, trusted URLs to books. Project Gutenberg, HathiTrust and Open Library all provide an extensive library of books online, each with its own large repository of titles. By combining their catalogs, Book Hunters enables querying for a book across those different sources. Our project will also highlight key statistics about the three datasets. These statistics include: number of books in all three data sources, formats, language, and publishing date. Apart from that, we will ask users to search for a particular book of interest, and we will return combined results from all three resources and also provide the direct link to the pdf, text or epub format of the book. This will be an exercise in filtering results for users and providing them with easy access to the books that they are looking for.
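
To give a feel for what querying across combined catalogs involves, here is a minimal sketch in plain Python. The records, their fields and the `normalize` and `search` helpers are all hypothetical; the real catalogs have their own schemas and would each need source-specific parsing, so this is not the students' implementation.

```python
# Hypothetical, hand-made catalog records for illustration only; the real
# Project Gutenberg, HathiTrust and Open Library data look different.
CATALOGS = {
    "Project Gutenberg": [
        {"title": "Moby Dick", "formats": ["epub", "txt"]},
        {"title": "Emma", "formats": ["txt"]},
    ],
    "HathiTrust": [
        {"title": "Moby-Dick", "formats": ["pdf"]},
    ],
    "Open Library": [
        {"title": "Emma", "formats": ["pdf", "epub"]},
    ],
}


def normalize(title):
    """Lowercase and keep only letters and digits, so title variants match."""
    return "".join(ch for ch in title.lower() if ch.isalnum())


def search(query, catalogs=CATALOGS):
    """Return (source, record) pairs whose normalized title equals the query's."""
    key = normalize(query)
    return [(source, record)
            for source, records in catalogs.items()
            for record in records
            if normalize(record["title"]) == key]


hits = search("moby dick")
for source, record in hits:
    print(source, record["formats"])
```

Normalizing titles before comparing is the design choice that matters here: without it, "Moby Dick" and "Moby-Dick" would be treated as different books.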

Friday, April 19, 2013

"Literate computing" and computational reproducibility: IPython in the age of data-driven journalism

As "software eats the world" and we become awash in the flood of quantitative information denoted by the "Big Data" buzzword, it's clear that informed debate in society will increasingly depend on our ability to communicate information that is based on data. And for this communication to be a truly effective dialog, it is necessary that the arguments made based on data can be deconstructed, analyzed, rebutted or expanded by others. Since these arguments in practice often rely critically on the execution of code (whether an Excel spreadsheet or a proper program), it means that we really need tools to effectively communicate narratives that combine code, data and the interpretation of the results.

I will point out here two recent examples, taken from events in the news this week, where IPython has helped this kind of discussion, in the hopes that it can motivate a more informed style of debate where all the moving parts of a quantitative argument are available to all participants.

Insight, not numbers: from literate programming to literate computing

The computing community has for decades known about the "literate programming" paradigm introduced by Don Knuth in the 70's and fully formalized in his famous 1992 book. Briefly, Knuth's approach proposes writing computer programs in a format that mixes the code and a textual narrative together, and from this format generating separate files that will contain either actual code that can be compiled/executed by the computer, or a narrative document that explains the program and is meant for human consumption. The idea is that by allowing the authors to maintain a close connection between code and narrative, a number of benefits will ensue (clearer code, fewer programming errors, more meaningful descriptions than mere comments embedded in the code, etc).
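
As a rough illustration of that split into two outputs, here is a toy sketch in Python. The `<<code>>` and `<<end>>` markers and the function names `tangle` and `weave` are made up for this example (the names echo the traditional terms for the two steps); real literate programming tools use their own syntax and do far more.

```python
# Toy illustration of literate programming: one source, two outputs.
# The "<<code>>" / "<<end>>" markers are invented for this sketch; they are
# not the syntax of any real literate-programming tool.

SOURCE = """\
We want the area of a circle, which is pi times the radius squared.
<<code>>
import math

def area(radius):
    return math.pi * radius ** 2
<<end>>
A quick check: a unit circle has area pi.
<<code>>
unit_area = area(1.0)
<<end>>
"""


def split_chunks(source):
    """Return a list of (kind, text) pairs, kind being 'prose' or 'code'."""
    chunks, kind, lines = [], "prose", []
    for line in source.splitlines():
        if line.strip() == "<<code>>":
            next_kind = "code"
        elif line.strip() == "<<end>>":
            next_kind = "prose"
        else:
            lines.append(line)
            continue
        # A marker closes the current chunk and opens one of the other kind.
        if lines:
            chunks.append((kind, "\n".join(lines)))
        kind, lines = next_kind, []
    if lines:
        chunks.append((kind, "\n".join(lines)))
    return chunks


def tangle(source):
    """Extract only the code, in order, as one executable program."""
    return "\n".join(text for kind, text in split_chunks(source) if kind == "code")


def weave(source):
    """Produce the human-readable document, with code chunks indented."""
    parts = []
    for kind, text in split_chunks(source):
        if kind == "code":
            text = "\n".join("    " + line for line in text.splitlines())
        parts.append(text)
    return "\n\n".join(parts)


namespace = {}
exec(tangle(SOURCE), namespace)  # run the tangled program
print(weave(SOURCE))
```

Note that nothing here runs until the whole source has been tangled, which is exactly the batch-style workflow the next paragraph contrasts with interactive exploration.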

I don't take any issue with this approach per se, but I don't personally use it because it's not very well suited to the kinds of workflows that I need in practice. These require the frequent execution of small fragments of code, in an iterative cycle where code is run to obtain partial results that inform the next bit of code to be written. Such is the nature of interactive exploratory computing, which is the bread and butter of many practicing scientists. This is the kind of workflow that led me to creating IPython over a decade ago, and it continues to inform basically every decision we make in the project today.

As Hamming famously said in 1962, "The purpose of computing is insight, not numbers." IPython tries to help precisely in this kind of usage pattern of the computer, in contexts where there is no clear notion in advance of what needs to be done, so the user is the one driving the computation. However, IPython also tries to provide a way to capture this process, and this is where we join back with the discussion above: while literate programming focuses on providing a narrative description of the structure of an algorithm, our working paradigm is one where the act of computing occupies the center stage.

From this perspective, we therefore refer to the workflow exposed by these kinds of computational notebooks (not just IPython, but also Sage, Mathematica and others), as "literate computing": it is the weaving of a narrative directly into a live computation, interleaving text with code and results to construct a complete piece that relies equally on the textual explanations and the computational components. For the goals of communicating results in scientific computing and data analysis, I think this model is a better fit than the literate programming one, which is rather aimed at developing software in tight concert with its design and explanatory documentation. I should note that we have some ideas on how to make IPython stronger as a tool for "traditional" literate programming, but it's a bit early for us to focus on that, as we first want to solidify the computational workflows possible with IPython.

As I mentioned in a previous blog post about the history of the IPython notebook, the idea of a computational notebook is neither new nor ours. Several IPython developers used other similar systems extensively for a long time, and we took lots of inspiration from them. What we have tried to do, however, is to take a fresh look at these ideas, so that we can build a computational notebook that provides the best possible experience for computational work today. That means taking the existence of the Internet as a given in terms of using web technologies, an architecture based on well-specified protocols and reusable low-level formats (JSON), a language-agnostic view of the problem and a concern about the entire cycle of computing from the beginning. We want to build a tool that is just as good for individual experimentation as it is for collaboration, communication, publication and education.

Government debt, economic growth and a buggy Excel spreadsheet: the code behind the politics of fiscal austerity

In the last few years, extraordinarily contentious debates have raged in the circles of political power and fiscal decision making around the world, regarding the relation between government debt and economic growth. One of the centerpieces of this debate was a paper from Harvard economists C. Reinhart and K. Rogoff, later turned into a best-selling book, that argued that beyond 90% debt ratios, economic growth would plummet precipitously.

This argument was used (amongst others) by politicians to justify some of the extreme austerity policies that have been foisted upon many countries in the last few years. On April 15, a team of researchers from U. Massachusetts published a re-analysis of the original data where they showed how Reinhart and Rogoff had made both fairly obvious coding errors in their original Excel spreadsheets and some statistically questionable manipulations of the data. Herndon, Ash and Pollin (the U. Mass authors) published all their scripts in R so that others could inspect their calculations.

Two posts from the Economist and the Roosevelt Institute nicely summarize the story with a more informed policy and economics discussion than I can make. James Kwak has a series of posts that dive into technical detail and question the horrible choice of using Excel, a tool that should for all intents and purposes be banned from serious research as it entangles code and data in ways that more or less guarantee serious errors in anything but trivial scenarios. Victoria Stodden just wrote an excellent new post with specific guidance on practices for better reproducibility; here I want to take a narrow view of these same questions focusing strictly on the tools.

As reported in Mike Konczal's piece at the Roosevelt Institute, Herndon et al. had to reach out to Reinhart and Rogoff for the original code, which hadn't been made available before (apparently causing much frustration in economics circles). It's absolutely unacceptable that major policy decisions that impact millions worldwide had until now hinged effectively on the unverified word of two scientists: no matter how competent or honorable they may be, we know everybody makes mistakes, and in this case there were both egregious errors and debatable assumptions. As Konczal says, "all I can hope is that future historians note that one of the core empirical points providing the intellectual foundation for the global move to austerity in the early 2010s was based on someone accidentally not updating a row formula in Excel." To that I would add the obvious: this should never have happened in the first place, as we should have been able to inspect that code and data from the start.
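
The kind of slip Konczal describes is easy to reproduce in miniature: an average taken over a range that silently stops a few rows short. The sketch below uses invented placeholder numbers and hypothetical country labels, not the Reinhart and Rogoff data, and the spreadsheet cell ranges in the comments are only there to suggest the analogy.

```python
from statistics import mean

# Invented growth figures (percent) for illustration only; these are NOT
# values from the spreadsheet discussed in the post.
growth_by_country = [
    ("Country A", 2.0),
    ("Country B", 3.0),
    ("Country C", -1.0),
    ("Country D", 4.0),
    ("Country E", 2.0),
]

values = [growth for _, growth in growth_by_country]

# The intended computation: average over every row.
full_average = mean(values)

# The spreadsheet-style mistake: the formula's range stops two rows short,
# as if it read AVERAGE(B1:B3) instead of AVERAGE(B1:B5).
truncated_average = mean(values[:3])

print(f"all rows:       {full_average:.2f}")
print(f"truncated rows: {truncated_average:.2f}")
```

In a script, the truncating slice sits in plain view in the source and shows up in any review or diff; in a spreadsheet, the equivalent range is tucked inside a single cell.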

Now, moving over to IPython, something interesting happened: when I saw the report about the Herndon et al. paper and realized they had published their R scripts for all to see, I posted this request on Twitter:

It seemed to me that the obvious thing to do would be to create a document that combined the analysis with a bit of narrative using IPython, hopefully one that could more easily serve as a starting point for further discussion. What I didn't really expect is that it would take less than three hours for Vincent Arel-Bundock, a PhD Student in Political Science at U. Michigan, to come through with a solution:

I suggested that he turn this example into a proper repository on github with the code and data, which he quickly did:

So now we have a full IPython notebook, kept in a proper github repository. This repository can enable an informed debate about the statistical methodologies used for the analysis, and now anyone who simply installs the SciPy stack can not only run the code as-is, but explore new directions and contribute to the debate in a properly informed way.

On to the heavens: the New York Times' infographic on NASA's Kepler mission

As I was discussing the above with Vincent on Twitter, I came across this post by Jonathan Corum, an information designer who works as NY Times science graphics editor:

The post links to a gorgeous, animated infographic that summarizes the results that NASA's Kepler spacecraft has obtained so far, and which accompanies a full article at the NYT on Kepler's most recent results: a pair of planets that seem to have just the right features to possibly support life, a quick 1200 light-years hop from us.

Jonathan indicated that he converted his notebook to a Python script later on for version control and automation, though I explained to him that he could have continued using the notebook, since the --script flag would give him a .py file if needed, and it's also possible to execute a notebook just like a script, with a bit of additional support code:
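
As a rough sketch of what such support code could look like (this is not the code the post originally pointed to), the function below runs a notebook's code cells in order inside a single namespace. It assumes the JSON layout of the later nbformat 4, with a top-level `cells` list whose entries carry `cell_type` and `source`, which is not the format that was current when this post was written. It also ignores IPython magics, and the name `run_notebook_code` is made up.

```python
import json


def run_notebook_code(notebook_json, namespace=None):
    """Execute the code cells of a notebook, in order, in one namespace.

    Assumes the nbformat 4 layout; cells using IPython magics (lines starting
    with % or !) are not handled by this sketch.
    """
    namespace = {} if namespace is None else namespace
    notebook = json.loads(notebook_json)
    for cell in notebook["cells"]:
        if cell["cell_type"] != "code":
            continue  # skip markdown and raw cells
        source = cell["source"]
        # The format allows either one string or a list of line strings.
        if isinstance(source, list):
            source = "".join(source)
        exec(source, namespace)
    return namespace


# A tiny hand-written notebook standing in for a real .ipynb file.
example = json.dumps({
    "nbformat": 4,
    "nbformat_minor": 0,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Demo"]},
        {"cell_type": "code", "metadata": {}, "outputs": [],
         "execution_count": None, "source": ["x = 2\n", "y = x * 21\n"]},
        {"cell_type": "code", "metadata": {}, "outputs": [],
         "execution_count": None, "source": "z = y + 1"},
    ],
})

result = run_notebook_code(example)
print(result["y"], result["z"])
```

With a real file, the JSON string would come from reading the `.ipynb` from disk instead of from `json.dumps`.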

In this case Jonathan's code isn't publicly available, but I am still very happy to see this kind of usage: it's a step in the right direction already and as more of this analysis is done with open-source tools, we move further towards the possibility of an informed discussion around data-driven journalism.

I also hope he'll release perhaps some of the code later on, so that others can build upon it for similar analyses. I'm sure lots of people would be interested and it wouldn't detract in any way from the interest in his own work which is strongly tied to the rest of the NYT editorial resources and strengths.

Looking ahead from IPython's perspective

Our job with IPython is to think deeply about questions regarding the intersection of computing, data and science, but it's clear to me at this point that we can contribute in contexts beyond pure scientific research. I hope we'll be able to provide folks who have a direct intersection with the public, such as journalists, with tools that help a more informed and productive debate.

Coincidentally, UC Berkeley will be hosting on May 4 a symposium on data and journalism, and in recent days I've had very productive interactions with folks in this space on campus. Cathryn Carson currently directs the newly formed D-Lab, whose focus is precisely the use of quantitative and data methods in the social sciences, and her team has recently been teaching workshops on using Python and R for social scientists. And just last week I lectured in Raymond Yee's course (from the School of Information) where they are using the notebook extensively, following Wes McKinney's excellent Python for Data Analysis as the class textbook. Given all this, I'm fairly optimistic about the future of a productive dialog and collaborations on campus, given that we have a lot of the IPython team working full-time here.

Note: as usual, this post is available as an IPython notebook in my blog repo.

Tuesday, November 20, 2012

Back from PyCon Canada 2012

I just got back a few days ago from the 2012 edition of PyCon Canada, which was a great success. I wanted to thank the team who invited me for a fantastic experience: Diana Clarke who as conference chair did an incredible job, Greg Wilson from Software Carpentry with whom I had a chance to interact a lot (he already has a long list of ideas for the IPython notebook in teaching contexts we're discussing), Mike DiBernardo and the rest of the PyConCa team. They ran a conference with a great vibe and tons of opportunity for engaging discussion.

Thanks to Greg I also had a chance to give a couple of more academically-oriented talks at U. Toronto facilities, both at the Sunnybrook hospital and their SciNet HPC center, where we had some great discussions. I look forward to future collaborations with some of the folks there.

The PyConCa team kindly invited me to deliver the closing keynote for the conference, and I tried to provide a presentation on the part of the Python world that I've been involved with, namely scientific computing, but one that would be of interest to the broader Python development community in attendance. I tried to illustrate where Python has been a great success for modern scientific research, and in doing so I took a deliberately biased view where I spent a good amount of time discussing IPython, which is how I entered that world in the first place.

This is the video of the talk:


and here are the accompanying slides.

I'm too far behind to do a proper recap of the conference itself, but I want to mention one of the highlights for me: a fantastic talk by Elizabeth Leddy, a prominent figure in the Plone world, on how to build sustainable communities. She had a ton of useful insight from in-the-trenches experience with the Plone foundation, and I fortunately got to pick her brain for a while after the talk on these topics. As we gradually build up somewhat similar efforts in the scientific Python world with NumFOCUS, I think she'll be a great person for us to bug every now and then for wisdom.

IPython at the sprints

I managed to stay for the two days of sprints after the end of the main conference, and we had a great time: a number of people made contributions to IPython for the first time, so I'd like to quickly recap here what happened.

Nose extension

Taavi Burns and Greg Ward of distutils fame fought hard on a fairly tricky but extremely useful idea on a suggestion from Greg Wilson: easy in-place use of nose to run tests inside a notebook. This was done by taking inspiration (and I think code) from Catherine Devlin's recent work on integrating doctesting inside the notebook.

The new nose extension hasn't been merged yet, but you can already get the code from github, as usual. Briefly (from Taavi's instructions), this little IPython extension gives you the ability to discover and run tests using Nose in an IPython Notebook.

You start with a cell containing:

%load_ext ipython_nose

Then write tests that conform to Nose conventions, e.g.

  def test_arithmetic():
      assert 1+1 == 2

And where you want to run your tests, you add a cell consisting of

  %nose

and run it: that will discover your test_* functions, run them, and report how many passed and how many failed, with stack traces for each failure.
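
To make the discovery step concrete, here is a small sketch in plain Python of what "find the test_* functions, run them, and report" amounts to. It is not how ipython_nose or Nose is implemented; `run_tests` and `make_namespace` are hypothetical helpers, and the sketch only looks at callables in a dictionary rather than at a live notebook namespace.

```python
import traceback


def run_tests(namespace):
    """Run every callable whose name starts with 'test_'.

    Returns (passed, failures), where failures maps a test name to its
    formatted stack trace.
    """
    passed, failures = [], {}
    for name, obj in sorted(namespace.items()):
        if not (name.startswith("test_") and callable(obj)):
            continue
        try:
            obj()
        except Exception:
            failures[name] = traceback.format_exc()
        else:
            passed.append(name)
    return passed, failures


def make_namespace():
    """Build a small namespace holding one passing and one failing test."""
    def test_arithmetic():
        assert 1 + 1 == 2

    def test_broken():
        assert 1 + 1 == 3  # deliberately wrong, to show a failure report

    return {"test_arithmetic": test_arithmetic,
            "test_broken": test_broken,
            "helper": len}  # ignored: name does not start with test_


passed, failures = run_tests(make_namespace())
print(f"{len(passed)} passed, {len(failures)} failed")
for name, trace in failures.items():
    print(name)
    print(trace)
```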

WebGL-based 3d protein visualization

Rishi Ramraj, Christopher Ing and Jonathan Villemaire-Krajden implemented an extremely cool visualization widget that can, using the IPython display protocol, render a protein structure directly in a 3d interactive window. They used Konrad Hinsen's MMTK toolkit, and the resulting code is as simple as:

from MMTK.Proteins import Protein
Protein('insulin')

You can see what the output looks like in this short video shot by Taavi Burns just as they got it working and we were all very excited looking at the result; the code is already available on github.
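
The display protocol mentioned above boils down to an object offering a rich representation of itself through a specially named method; IPython looks for methods such as `_repr_html_` when it displays a value. The toy class below is hypothetical and unrelated to the actual MMTK widget (which renders with WebGL); it only shows the hook, under the assumption that plain HTML is enough for the illustration.

```python
from html import escape


class Molecule:
    """Toy object with a rich HTML representation.

    The class and its fields are invented for this sketch; only the
    _repr_html_ method name comes from IPython's display protocol.
    """

    def __init__(self, name, atoms):
        self.name = name
        self.atoms = list(atoms)

    def __repr__(self):
        # Plain-text fallback, used by a regular Python prompt.
        return f"Molecule({self.name!r}, {len(self.atoms)} atoms)"

    def _repr_html_(self):
        # In a notebook, IPython calls this and renders the returned HTML.
        rows = "".join(f"<li>{escape(atom)}</li>" for atom in self.atoms)
        return f"<b>{escape(self.name)}</b><ul>{rows}</ul>"


water = Molecule("water", ["H", "O", "H"])
print(repr(water))
print(water._repr_html_())
```

In a notebook, simply ending a cell with `water` would show the rendered list instead of the plain-text repr, which is the same mechanism that lets `Protein('insulin')` produce a visualization.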

I very much look forward to many more tools of this kind being developed, and in fact Cyrille Rossant wasted no time at all building off this to provide fast 2-d visualizations rendered via WebGL with his Galry library:


Software Carpentry

In addition to the Nose extension above, Greg Wilson had a ton of ideas on things that could be added to the notebook that he thinks would help in the context of teaching workshops such as those that Software Carpentry presents. Their audience is typically composed of beginning programmers, scientists who may be experts in their discipline but who have little to no formal computational training and are now tasked with managing often quite complex computational workflows. Since SWC recently announced they would be switching to the notebook as their main teaching platform, they obviously are thinking deeply about how to make the best use of it and where the notebook can improve for this kind of use case.

These are still conversations that I hope will turn soon into concrete issues/code repositories to begin testing them, but that kind of validated testing is very useful for us. Since at this point we have too many feature requests from multiple fronts to be able to satisfy them all, we are trying to focus on ensuring that IPython can support individual projects building their own custom tools and extensions. We can't possibly merge every last idea from every front into IPython, but we can work to ensure it's a flexible and coherent enough foundation that others can build their own highly customized experiences on top. Once these get widely tested and validated, it may be that pieces are clearly of generic enough value to percolate into the core, but in the meantime this approach means that other projects (SWC being just one example among many) don't need to wait for us to add every feature they need.

What we will focus on is addressing any limitations that our architecture may have for such extensibility to work well, so the life of third party projects isn't a fight against our interfaces.

A first-time contributor to open source

Last, but not least, I had the great experience of working with David Kua, a CS student from U. Toronto who had never made a contribution to open source and wanted to work on IPython. Right during the sprints we were able to merge his first pull request into nbconvert, and he immediately started working on a new one for IPython that by now has also been merged.

That last one required that he learn how to rebase his git repo (he had some extraneous commits originally) and go through a fair amount of feedback before merging: this is precisely the real world cycle of open source contributions. It's always great to see a brand new contributor in the making, and I very much look forward to much more work from David, whether he decides to do it in IPython or in any other open source project that catches his interest.

Note

Since I am now writing all my posts as IPython notebooks (even when there's no code, it's a really nice way to get instant feedback on markdown), you can get the notebook for this post from my repo.

Sunday, October 14, 2012

Help save open space in the Bay Area by protecting Knowland Park from development

Vote NO on new Tax Measure A1

Update: there is now evidence that Zoo officials have actually violated election laws in their zeal to promote measure A1.

I normally only blog about technical topics, but the destruction of a beautiful piece of open space in the Bay Area is imminent, and I want to at least do a little bit to help prevent this disaster.

In short: there's a tax measure on the November ballot, Measure A1, that would impose a parcel tax on all residences and businesses in Alameda County to fund the Oakland Zoo for the next 25 years.  The way the short text on the ballot is worded makes it appear as something geared towards animal care for a cash-strapped Zoo.  The sad reality is that the full text of the measure allows the Zoo to use these funds for a very controversial expansion plan that includes a 34,000 sq. ft. visitor center, gift shop and restaurant serviced by a ski gondola atop one of the last pristine remaining ridges in Knowland Park, an Oakland city park that sits above the Zoo.

Yes, it's as bad as it sounds; the beautiful ridge in the background:


that is today part of an unspoiled open space, would be closed off by a fence and a restaurant would be built atop it, serviced by a ski gondola that would reach it from the bottom of the hill.  Here are a few more pics from the same album as well as a great photo essay on the park from the AllThingsOakland blog, and some more history of the park.

Restaurant development disguised as animal care


The Zoo claims to be strapped for cash, yet they are spending over $1 million on a media blitz to get this measure passed, and only presenting it as an animal-care issue.  I am a huge animal lover and donate regularly to the San Diego Zoo, but unfortunately the situation with the Oakland Zoo is a different story: they see the 525-acre Knowland Park above the Zoo as their personal back yard, not as a resource that belongs to all of us.  It has been impossible, in years of negotiations, to get the Zoo to sign anything that commits them to respect the boundaries of the park in the future.  They see this tax measure as their strategic "nuclear weapon" to destroy the park, and in order to get it, they are willing to burn through cash they should instead be using for animal care.

I urge you to consider this as you go to the polls in November: all Alameda county voters will end up having a say on whether "nature preservation" in the East Bay is spelled "huge restaurant and a ski gondola on open space". By voting NO on A1 you will help prevent such madness.



More information

 Here are a few relevant links with details and further info

A final note: the citizens' group fighting to save the park can use all the help in the world. You can make donations or join the effort in many other ways; don't hesitate to ask me for more info.  And please share this post as widely as possible!

Friday, September 7, 2012

Blogging with the IPython notebook

Update: made full github repo for blog-as-notebooks, and updated instructions on how to more easily configure everything and use the newest nbconvert for a more streamlined workflow.

Since the notebook was introduced with IPython 0.12, it has proved to be very popular, and we are seeing great adoption of the tool and the underlying file format in research and education. One persistent question we've had since the beginning (even prior to its official release) was whether it would be possible to easily write blog posts using the notebook. The combination of easy editing in markdown with the notebook's ability to contain code, figures and results, makes it an ideal platform for quick authoring of technical documents, so being able to post to a blog is a natural request.

Today, in answering a query about this from a colleague, I decided to check again on the status of our conversion pipeline, and I'm happy to report that with a bit of elbow grease, things work pretty well, at least on Blogger!

This post was entirely written as a notebook, and in fact I have now created a github repo, which means that you can see it directly rendered in IPython's nbviewer app.

The purpose of this post is to quickly provide a set of instructions on how I got it to work, and to test things out. Please note: this requires code that isn't quite ready for prime-time and is still under heavy development, so expect some assembly.

Converting your notebook to html with nbconvert

The first thing you will need is our nbconvert tool that converts notebooks across formats. The README file in the repo contains the requirements for nbconvert (basically python-markdown, pandoc, docutils from SVN and pygments).

Once you have nbconvert installed, you can convert your notebook to Blogger-friendly html with:

nbconvert -f blogger-html your_notebook.ipynb

This will leave two files on your computer, one named your_notebook.html and one named your_notebook_header.html; it might also create a directory called your_notebook_files if needed for ancillary files. The first file will contain the body of your post and can be pasted wholesale into the Blogger editing area. The second file contains the CSS and Javascript material needed for the notebook to display correctly; you should only need to use this once to configure your blogger setup (see below):

# Only one notebook so far
(master)longs[blog]> ls
120907-Blogging with the IPython Notebook.ipynb  fig/  old/

# Now run the conversion:
(master)longs[blog]> nbconvert.py -f blogger-html 120907-Blogging\ with\ the\ IPython\ Notebook.ipynb

# This creates the header and html body files
(master)longs[blog]> ls
120907-Blogging with the IPython Notebook_header.html  fig/
120907-Blogging with the IPython Notebook.html         old/
120907-Blogging with the IPython Notebook.ipynb
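
If you want to script around the conversion, the naming pattern in the listing above is easy to predict. The following is only a sketch in Python 3: `blogger_output_names` is a hypothetical helper of my own, not part of nbconvert, and it does nothing more than mirror the body/header file names shown above without running any conversion.

```python
from pathlib import Path


def blogger_output_names(notebook_path):
    """Return the (body, header) file names expected for a notebook.

    Hypothetical helper: it only mirrors the naming pattern shown in the
    listing above and never calls nbconvert itself.
    """
    nb = Path(notebook_path)
    body = nb.with_suffix(".html")
    header = nb.with_name(nb.stem + "_header.html")
    return body.name, header.name


body, header = blogger_output_names("120907-Blogging with the IPython Notebook.ipynb")
print(body)    # body of the post, to paste into the Blogger editor
print(header)  # CSS and Javascript, used once to configure the blog
```

Working with `pathlib` also sidesteps the shell quoting needed for notebook names that contain spaces, as this one does.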

Configuring your Blogger blog to accept notebooks

The notebook uses a lot of custom CSS for formatting input and output, as well as Javascript from MathJax to display mathematical notation. You will need all this CSS and the Javascript calls in your blog's configuration for your notebook-based posts to display correctly:

  1. Once authenticated, go to your blog's overview page by clicking on its title.
  2. Click on templates (left column) and customize using the Advanced options.
  3. Scroll down the middle column until you see an "Add CSS" option.
  4. Copy the entire contents of the _header file into the CSS box.

That's it, and you shouldn't need to do anything else as long as the CSS we use in the notebooks doesn't drastically change. This customization of your blog needs to be done only once.

While you are at it, I recommend you change the width of your blog so that cells have enough space for clean display; in experimenting I found out that the default template was too narrow to properly display code cells, producing a lot of text wrapping that impaired readability. I ended up using a layout with a single column for all blog contents, putting the blog archive at the bottom. Otherwise, if I kept the right sidebar, code cells got too squished in the post area.

I also had problems using some of the fancier templates available from 'Dynamic Views', in that I could never get inline math to render. But sticking to those from the Simple or 'Picture Window' categories worked fine and they still allow for a lot of customization.

Note: if you change blog templates, Blogger does destroy your custom CSS, so you may need to repeat the above steps in that case.

Adding the actual posts

Now, whenever you want to write a new post as a notebook, simply convert the .ipynb file to blogger-html and copy its entire contents to the clipboard. Then go to the 'raw html' view of the post, remove anything Blogger may have put there by default, and paste. You should also click on the 'options' tab (right hand side) and select both Show HTML literally and Use <br> tag, else your paragraph breaks will look all wrong.

That's it!

What can you put in?

I will now add a few bits of code, plots, math, etc., to show which kinds of content can be put in and work out of the box. These are mostly bits copied from our example notebooks, so the actual content doesn't matter; I'm just illustrating the kind of content that works.

In [1]:
# Let's initialize pylab so we can plot later
%pylab inline
Welcome to pylab, a matplotlib-based Python environment [backend: module://IPython.zmq.pylab.backend_inline].
For more information, type 'help(pylab)'.

With pylab loaded, the usual matplotlib operations work

In [2]:
x = linspace(0, 2*pi)
plot(x, sin(x), label=r'$\sin(x)$')
plot(x, cos(x), 'ro', label=r'$\cos(x)$')
title(r'Two familiar functions')
legend()
Out [2]:
<matplotlib.legend.Legend at 0x3128610>

The notebook, thanks to MathJax, has great LaTeX support, so that you can type inline math $(1,\gamma,\ldots, \infty)$ as well as displayed equations:

$$ e^{i \pi}+1=0 $$

but by loading the sympy extension, it's easy to showcase math output from Python computations, where we don't type the math expressions in text, and instead the results of code execution are displayed in mathematical format:

In [3]:
%load_ext sympyprinting
import sympy as sym
from sympy import *
x, y, z = sym.symbols("x y z")

From simple algebraic expressions

In [4]:
Rational(3,2)*pi + exp(I*x) / (x**2 + y)
Out [4]:
$$\frac{3}{2} \pi + \frac{e^{\mathbf{\imath} x}}{x^{2} + y}$$
In [5]:
eq = ((x+y)**2 * (x+1))
eq
Out [5]:
$$\left(x + 1\right) \left(x + y\right)^{2}$$
In [6]:
expand(eq)
Out [6]:
$$x^{3} + 2 x^{2} y + x^{2} + x y^{2} + 2 x y + y^{2}$$

To calculus

In [7]:
diff(cos(x**2)**2 / (1+x), x)
Out [7]:
$$- 4 \frac{x \operatorname{sin}\left(x^{2}\right) \operatorname{cos}\left(x^{2}\right)}{x + 1} - \frac{\operatorname{cos}^{2}\left(x^{2}\right)}{\left(x + 1\right)^{2}}$$
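
As a sanity check that needs nothing beyond the standard library, the two results above can be spot-checked numerically. This is a sketch in Python 3, not part of the original notebook: the function names are my own, the expansion is compared on a small grid of integers, and the derivative is compared against a central finite difference at a few arbitrarily chosen points.

```python
import math


def eq(x, y):
    # Factored form entered in In [5]
    return (x + y) ** 2 * (x + 1)


def eq_expanded(x, y):
    # Expanded form reported in Out [6]
    return x**3 + 2 * x**2 * y + x**2 + x * y**2 + 2 * x * y + y**2


def f(x):
    # Function differentiated in In [7]
    return math.cos(x**2) ** 2 / (1 + x)


def df_reported(x):
    # Derivative reported in Out [7]
    s, c = math.sin(x**2), math.cos(x**2)
    return -4 * x * s * c / (x + 1) - c**2 / (x + 1) ** 2


def central_difference(func, x, h=1e-6):
    # Second-order finite-difference approximation of func'(x)
    return (func(x + h) - func(x - h)) / (2 * h)


# Integer inputs keep the polynomial comparison exact.
expansion_ok = all(
    eq(x, y) == eq_expanded(x, y) for x in range(-3, 4) for y in range(-3, 4)
)
derivative_ok = all(
    math.isclose(central_difference(f, x), df_reported(x), rel_tol=1e-5, abs_tol=1e-7)
    for x in (0.3, 0.7, 1.1, 1.9)
)
print(expansion_ok, derivative_ok)
```

A numerical agreement like this is of course not a proof, but it is a cheap way to catch a transcription error when copying symbolic output into a post.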

For more examples of how to use sympy in the notebook, you can see our example sympy notebook or go to the sympy website for much more documentation.

You can easily include formatted text and code with markdown

You can italicize, boldface

  • build
  • lists

and embed code meant for illustration instead of execution in Python:

def f(x):
    """a docstring"""
    return x**2

or other languages:

for (i=0; i<n; i++) {
  printf("hello %d\n", i);
  x += 4;
}

And since the notebook can store displayed images in the file itself, you can show images which will be embedded in your post:

In [8]:
from IPython.display import Image
Image(filename='fig/img_4926.jpg')
Out [8]:

You can embed YouTube videos using IPython's YouTubeVideo object; this is my recent talk at SciPy'12 about IPython:

In [9]:
from IPython.display import YouTubeVideo
YouTubeVideo('iwVvqwLDsJo')
Out [9]:

Including code examples from other languages

Using our various script cell magics, it's easy to include code in a variety of other languages

In [10]:
%%ruby
puts "Hello from Ruby #{RUBY_VERSION}"
Hello from Ruby 1.8.7
In [11]:
%%bash
echo "hello from $BASH"
hello from /bin/bash

And tools like the Octave and R magics let you interface with entire computational systems directly from the notebook; this is the Octave magic for which our example notebook contains more details:

In [12]:
%load_ext octavemagic
In [13]:
%%octave -s 500,500

# butterworth filter, order 2, cutoff pi/2 radians
b = [0.292893218813452  0.585786437626905  0.292893218813452];
a = [1  0  0.171572875253810];
freqz(b, a, 32);
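
For readers without Octave at hand, the quantity that freqz plots can be sketched in plain Python 3. This is my own illustration, not the octavemagic output: it evaluates H(e^{jw}) = B(e^{jw}) / A(e^{jw}) directly from the coefficients in the cell above, and it assumes that the 32 points requested are spread evenly over [0, pi). The function name `frequency_response` is hypothetical.

```python
import cmath
import math

# Coefficients copied from the Octave cell above.
b = [0.292893218813452, 0.585786437626905, 0.292893218813452]
a = [1.0, 0.0, 0.171572875253810]


def frequency_response(b, a, w):
    """Evaluate H(e^{jw}) = B(e^{jw}) / A(e^{jw}) at angular frequency w."""
    z_inv = cmath.exp(-1j * w)
    num = sum(bk * z_inv**k for k, bk in enumerate(b))
    den = sum(ak * z_inv**k for k, ak in enumerate(a))
    return num / den


# Assumed grid: 32 evenly spaced points on [0, pi); print every eighth one.
n_points = 32
for i in range(0, n_points, 8):
    w = math.pi * i / n_points
    print("w = %.4f rad  |H| = %.4f" % (w, abs(frequency_response(b, a, w))))
```

The gain is 1 at w = 0 and drops to 1/sqrt(2), that is about -3 dB, at the pi/2 cutoff named in the cell's comment, which is what one expects from a Butterworth design.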

The rmagic extension does a similar job, letting you call R directly from the notebook, passing variables back and forth between Python and R.

In [14]:
%load_ext rmagic 

Start by creating some data in Python

In [15]:
X = np.array([0,1,2,3,4])
Y = np.array([3,5,4,6,7])

Which can then be manipulated in R, with results available back in Python (in XYcoef):

In [16]:
%%R -i X,Y -o XYcoef
XYlm = lm(Y~X)
XYcoef = coef(XYlm)
print(summary(XYlm))
par(mfrow=c(2,2))
plot(XYlm)
Call:
lm(formula = Y ~ X)

Residuals:
   1    2    3    4    5 
-0.2  0.9 -1.0  0.1  0.2 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)   3.2000     0.6164   5.191   0.0139 *
X             0.9000     0.2517   3.576   0.0374 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 0.7958 on 3 degrees of freedom
Multiple R-squared:  0.81, Adjusted R-squared: 0.7467 
F-statistic: 12.79 on 1 and 3 DF,  p-value: 0.03739 

In [17]:
XYcoef
Out [17]:
[ 3.2  0.9]
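
The coefficients that come back from R can be cross-checked with the closed-form least-squares formulas, using nothing but the Python 3 standard library. This is a sketch of my own rather than part of the rmagic workflow, and `fit_line` is a hypothetical helper name; the data are the X and Y values defined above.

```python
X = [0, 1, 2, 3, 4]
Y = [3, 5, 4, 6, 7]


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


intercept, slope = fit_line(X, Y)
residuals = [y - (intercept + slope * x) for x, y in zip(X, Y)]
print(round(intercept, 4), round(slope, 4))
print([round(r, 4) for r in residuals])
```

The intercept and slope agree with the 3.2 and 0.9 stored in XYcoef, and the residuals agree with the ones in R's summary.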

And finally, in the same spirit, the cython magic extension lets you call Cython code directly from the notebook:

In [18]:
%load_ext cythonmagic
In [19]:
%%cython -lm
from libc.math cimport sin
print 'sin(1)=', sin(1)
sin(1)= 0.841470984808

Keep in mind, this is still experimental code!

Hopefully this post shows that the system is already useful to communicate technical content in blog form with a minimal amount of effort. But please note that we're still in heavy development of many of these features, so things are susceptible to changing in the near future. By all means join the IPython dev mailing list if you'd like to participate and help us make IPython a better tool!