Thursday, May 19, 2011

Austin trip: IPython at TACC and DataArray summit at Enthought

TACC Software Days

I recently had the chance to speak at UT Austin's Texas Advanced Computing Center, during their Fifth Annual Scientific Software Day, thanks to a kind invitation by Sergey Fomel and Victor Eijkhout.  Since the audience wasn't specifically composed of Python users, I gave a general introduction to Python's role in scientific computing, but then spent most of my time presenting some of the recent work we've been doing on IPython, extending the model of basic interactive computing in what I think are interesting directions with multiple client models and new parallel computing interfaces. Sergey had asked me to provide a somewhat personal account, so the presentation is fairly biased towards my own path (and therefore IPython) through the scipy effort.  Since the focus of TACC is high-performance computing, I hope some of our new functionality on the IPython front will be useful to such users.

There were some very interesting presentations about the FLAME linear algebra library. I did my best to convince Robert van de Geijn of the interest there would be in the scientific Python community for exposing FLAME to Python, possibly as an alternative backend for the linear algebra machinery in scipy.  Since FLAME uses a fair amount of code generation, I think the dynamic properties of Python would make it a great fit for the FLAME paradigm.  We had some interesting discussions on this; we'll see where it goes...

Datarray summit at Enthought

In conjunction with this visit, we had been trying to organize a meeting at Enthought to make some progress on the datarray effort that was started after last year's scipy conference.  I'd like to thank Sergey for kindly allowing my UT-funded visit to extend into the datarray summit.  Enthought brought a large contingent of their in-house team (developers and interns like Mark Wiebe of numpy fame), and invited Wes McKinney (pandas) and Matthew Brett (nipy) to participate (as well as others who couldn't make it).  In all we had roughly 15 people, with Travis Oliphant, Eric Jones, Robert Kern, Peter Wang and Corran Webster --all very experienced numpy users/developers-- participating for most of the meeting, which was both a lot of fun and very productive.  The datarray effort has continued to progress but fairly slowly, mostly due to my being timesliced into oblivion.  But I remain convinced it's a really important piece of the puzzle for scientific Python to really push hard into high-level data analysis, and I'm thrilled that Enthought is putting serious resources into this.

We spent a lot of time on the first day going over use cases coming from many different fields, and then dove into the API design questions, using the current code in the datarray repository as a starting point.  It has become very clear that one key piece of functionality we can't ignore is allowing for axis labels to be integers (without any assumption of ordering, continuity or monotonicity). This proves to be surprisingly tricky to accommodate, because the assumption that integers are indices for an array ranging from 0 to n-1 in any dimension goes very deep in all slicing operations, so allowing for integers with different semantics requires a fair amount of API gymnastics.
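
To make the ambiguity concrete, here is a minimal pure-Python sketch. It is not the datarray API: the names LabeledAxis, by_position and by_label are hypothetical, invented only for this illustration. With integer labels that are neither ordered nor contiguous, the same integer means different things depending on whether it is read as an offset or as a label:

```python
class LabeledAxis:
    """Toy 1-d container with integer labels (hypothetical, not datarray)."""

    def __init__(self, labels, values):
        if len(labels) != len(values):
            raise ValueError("labels and values must have the same length")
        if len(set(labels)) != len(labels):
            raise ValueError("labels must be unique")
        self.labels = list(labels)
        self.values = list(values)
        # Map each label to its position along the axis.
        self._pos = {label: i for i, label in enumerate(self.labels)}

    def by_position(self, i):
        """Usual array semantics: i is an offset in 0..n-1."""
        return self.values[i]

    def by_label(self, label):
        """Label semantics: the integer is a name, not an offset."""
        return self.values[self._pos[label]]


# Integer labels with no ordering, continuity or monotonicity.
axis = LabeledAxis(labels=[10, 3, 0, 7], values=["a", "b", "c", "d"])

first = axis.by_position(0)  # the element stored at offset 0
zero = axis.by_label(0)      # the element whose label is 0
print(first, zero)
```

A plain axis[0] would have to pick one of these two meanings silently, which is why separate, explicitly named entry points look attractive.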

Not everything is fully settled (and I missed some of the details because I could not stay until the very end), but it's clear that we will have integer labels, and that the main entry point for all axis slicing will be a .axes attribute. This will likely provide the basic named axis attribute-based access, as well as dictionary-style access to computed axes.  We will also probably have one method ('slice' was the running name when I left) for slicing with more elaborate semantics, so that we can resolve the ambiguities of integer labeling without resorting to obscure magic overloads (numpy already has enough of those with things like ogrid[10:20:3j], where that "complex slice step" is always fun to explain to newcomers).

We just completed an episode of the inSCIght podcast that contains some more discussion about this with Travis and Wes.  I really hope we won't lose the momentum we seem to have picked up and that over the next few months this will start taking shape into production code.  I know we need it, badly.

Tuesday, April 5, 2011

Python goes to Reno: SIAM CSE 2011

In what's becoming a bit of a tradition, Simula's Hans-Petter Langtangen, U. Washington's Randy LeVeque and I co-organized yet another minisymposium on Python for Scientific computing at a SIAM conference.

At the Computational Science and Engineering 2011 meeting, held in Reno February 28-March 4, we had 2 sessions with 4 talks each (part I and II).  I have put together a page with all the slides I got from the various speakers, that also includes slides from python-related talks in other minisymposia.  I have also posted some pictures from our sessions and from the minisymposium on reproducible research that my friend and colleague Jarrod Millman organized during the same conference.

We had great attendance, with a standing-room-only crowd for the first session, something rather unusual during the parallel sessions of a SIAM conference.  But more importantly, this year there were three other sessions entirely devoted to Python in scientific computing at the conference, organized completely independently from ours.  One focused on PDEs and the other on optimization.  Furthermore, there were scattered talks at several other sessions where Python was explicitly discussed in the title or abstract.  For all of these, I have collected the slides I was able to get; if you have slides for one such talk I failed to include, please contact me and I'll be happy to post them there. 

Unfortunately for our audience, we had last-minute logistical complications that prevented Robert Bradshaw and John Hunter from attending, so I had to deliver the Cython and matplotlib talks (in addition to my IPython one).  Having a speaker give three back-to-back talks isn't ideal, but both of them kindly prepared all the materials and "delivered" them to me over Skype the day before, so hopefully the audience got a reasonable approximation of their original intent. It's a shame, since I know first-hand how good both of them are as speakers, but canceling talks on these two key tools would really have been a disservice to everyone; my thanks go to the SIAM organizers who were flexible enough to allow for this to happen.  Given how packed the room was, I'm sure we made the right choice.

It's now abundantly clear from this level of interest that Python is being very successful in solving real problems in scientific computing.  We've come a long way from the days when some of us (I have painful memories of this) had to justify to our colleagues/advisors why we wanted to 'play' with this newfangled 'toy' instead of just getting on with our job using the existing tools (in my case it was IDL, a hodgepodge of homegrown shell/awk/sed/perl scripting, custom C and some Gnuplot thrown in the mix for good measure).  Things are by no means perfect, and there are plenty of problems to solve, but we have a great foundation, a number of good-quality tools that continue to improve, and our most important asset: a rapidly growing community that is solving new problems, creating new libraries and coming up with innovative approaches to computational and mathematical questions, often facilitated by Python's tremendous flexibility. It's been a fun ride so far, but I suspect the next decade is going to be even more interesting.  If you missed this, try to make it to SciPy 2011 or EuroSciPy 2011!

Wednesday, March 30, 2011

IPython and scientific Python go to Sage Days 29

Last week I was in Seattle, attending part of the Sage Days 29 workshop, which had a strong focus on the more numerical/applied topics and the 'scipy ecosystem', as it's funded by William Stein's Sage: Unifying Mathematical Software for Scientists, Engineers, and Mathematicians grant.

I gave a talk that covered topics that are fairly familiar to many in the scipy community, about using Python for numerical work, but was intended to address many from the Sage group who come from a pure mathematics/number theory background.  I have posted the PDF of my slides; William recorded the talk and posted it online (the anecdote about how I unplugged Colombia from the internet when I was a physics undergrad is from 0:11:20 to 0:14:12), and also posted some pictures from the last day.

For me, the bulk of the workshop's focus was to make progress on IPython.  Thomas Kluyver, a new core IPython developer, was able to take a week off from his studies and attend from his home in Sheffield; it was great to meet Thomas in person, given the fantastic work he has done on the project in recent months.  He's been extremely productive and got us out of a number of areas where we had been somewhat stuck, so I'm looking forward to a lot more collaboration with him in the future.  Min Ragan-Kelley was also present, and Brian Granger unfortunately couldn't make it but joined us for a number of long Skype-based discussions.  Paul Ivanov, who was officially on matplotlib duty, also helped us a lot with a number of bugs and new pull requests.

In the end, we had a tremendously productive three days.  We closed 66 tickets in total.  Many of these were triage work, but that's still very useful, and we also did major code review to merge most of the outstanding pull requests, as well as having detailed design discussions on the newparallel branch that will provide zeromq-based high-level parallel tools.   We now see light at the end of the tunnel, and I think we'll be able to release the massive amount of work we've been calling IPython 0.11 in a matter of weeks.  I'll write in more detail later on that topic, but at least I think those willing to run code from git master can start playing with it now.  Things are shaping up quickly, and we'd love to get feedback from early adopters to solidify the APIs.

I'm extremely pleased with this workshop: for IPython the progress was massive, and I think the outcome was similar for numpy, matplotlib and the others.  These highly focused development meetings, when held in a good environment (and the UW facilities we had, in their gorgeous new PACCAR Hall, were spectacular), can really be amazingly productive.  A big thank you to William for the funding and all the organization/logistical work!

Now I just have to get caught up with the other million things that piled up in my inbox/todo list in the meantime...

Saturday, February 19, 2011

Reproducible Research at the AAAS 2011 meeting in Washington, DC

Update: added links to other related posts, significantly expanded the section on Git and Github for scientific work.

Link summary: Page with abstracts and slide links, Victoria Stodden's blog, Mark Liberman's blog, my slides and extended abstract, audio (my talk is at time 53:25 to 1:10:47).

At this year's AAAS meeting, currently taking place in DC (in unseasonably warm and sunny weather), Victoria Stodden, from the statistics department at Columbia, organized a symposium titled The Digitization of Science: Reproducibility and Interdisciplinary Knowledge Transfer that was very well attended.

Lessons from the Open Source software world

I have tagged this post with "Python" because my take on the matter was to contrast the world of classic research/academic publishing with the practices of open source software development, and what little I know about that (as well as some specific tools I mentioned, like Sphinx), I picked up from the world of open source scientific Python projects I'm involved with, from IPython onwards. My argument is that the tools and practices from the open source community in fact come much closer to the scientific ideals of reproducibility than much of what is published in scientific journals today.

The OSS world is basically forced to do this, because people across the world are collaborating on developing a project from different computing environments, operating systems, library versions, compilers, etc. Without very strong systems for provenance tracking (aka version control), automatic testing and good quality documentation, this task would be simply impossible. But many of these tools can be adapted for use in everyday scientific work; for some use cases they work extremely well, for others there's still room for improvement, but overall we can and should take these lessons into everyday scientific practice.

In my talk, I spent a fair amount of time discussing the Git version control system, not in terms of its technical details, but rather trying to point out how it should not be viewed just as a tool for software development, but instead as something that can be an integral part of all aspects of the research process. Git is a powerful and sophisticated system for provenance tracking that automatically validates data integrity by design: Linus Torvalds wanted to ensure that every commit is identified by a hash of its contents plus the hash of its dependencies (for details on this, his sometimes abrasive Google Tech Talk about Git is an excellent reference). This simple idea ensures that a single byte change anywhere in the entire repository can be detected automatically.  I keep an informal page of Git resources for those looking to get started.
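
For readers who like to see the idea in code, here is a small Python sketch of hash chaining. The blob_id function follows the way Git hashes file contents (SHA-1 over a "blob <size>" header, a NUL byte and the data); toy_commit_id, on the other hand, is a deliberately simplified stand-in of my own and not Git's real commit format, which also records a tree, author information and more:

```python
import hashlib


def blob_id(data: bytes) -> str:
    """SHA-1 id that Git assigns to file contents stored as a blob."""
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    return hashlib.sha1(header + data).hexdigest()


def toy_commit_id(content_ids, parent_id, message):
    """Simplified stand-in for a commit id (NOT Git's real commit format).

    It hashes the ids of the contents together with the parent's id, so
    the result depends on the whole history leading up to this commit.
    """
    h = hashlib.sha1()
    for cid in sorted(content_ids):
        h.update(cid.encode("ascii"))
    h.update((parent_id or "").encode("ascii"))
    h.update(message.encode("utf-8"))
    return h.hexdigest()


data = b"x = 1.0\n"
first = toy_commit_id([blob_id(data)], None, "initial result")
second = toy_commit_id([blob_id(data)], first, "rerun analysis")

# Change a single byte in the first snapshot: its id changes, and so does
# the id of every commit that descends from it.
tampered = b"x = 1.1\n"
first_t = toy_commit_id([blob_id(tampered)], None, "initial result")
second_t = toy_commit_id([blob_id(data)], first_t, "rerun analysis")

print(first != first_t, second != second_t)
```

Because each id folds in the id of its parent, comparing the most recent id is enough to check the integrity of everything that came before it.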

I use Git for just about all my activities at the computer that require manually creating content, with repositories not only for research projects that involve writing standalone libraries, but also for papers, grant proposals, data analysis research, and even teaching. Its distributed nature (every copy of the repository has all the project's history) makes it automatically much more resilient to failures than a more limited legacy tool like Subversion, and its strong branching and merging capabilities make it great for exploratory work (something that is painful to achieve with SVN). Git is also the perfect way to collaborate on projects: all members have full version control, can commit work as they need it, and can make visible to collaborators only what they deem ready for sharing (this is impossible to do with SVN). Writing a multiauthor paper or grant proposal with Git is a far saner, more efficient and less error-prone process than the common madness of emailing dozens or hundreds of attachments every which way between multiple people (for those who think Dropbox suffices for collaborative writing: it's like using a wood saw for brain surgery; Dropbox is great for many things and I love it, but it's not the tool for this problem). I have also used Git for teaching, by creating a public repository for all course content and individual repositories for each student that only the student, the teaching assistants and I can access. This enables students to fetch all new class content with a simple:
git pull
instead of clicking on tens of files in some web-based interface (the typical system used by many universities). A single clone operation can reconstruct the entire class repository on another computer if they need to use it in more than one place or lose their old copy. And when it's time to submit the homework, instead of emailing or uploading anything, all they need to do is:
git push
and the TAs have immediate access to all their work, including its development history. In this manner, not only is the process vastly smoother and simpler for all involved, but the students learn to use version control as a natural tool that is simply part of their daily workflow.

I also tried to highlight the role played by the GitHub service as an enabler of collaboration. While Git can be used (and it is extremely useful in this mode) on a single computer without any server support, the moment several people want to share their repositories for collaborative work, some kind of persistent server is the best solution. GitHub, a service that is free for Open Source projects and that offers paid plans for non-public work, has a number of brilliant features that make the collaboration process amazingly smooth. GitHub makes it trivial for new contributors to begin participating in a project by forking it (i.e. getting their personal copy to work on), and if they want their work to be incorporated into the project, they can make a pull request. The original authors then review the proposed changes, comment on them (including making line-specific comments with a single click), and once all are satisfied with the outcome, integrate them. This is effectively a public peer review system that, while created for software development, can be equally useful for collaborative authorship of a research project.

I should add, however, that I think there's still room for improvement regarding Git as a tool for pervasive use in the scientific workflow. As much as I absolutely love Git, it's a tool tailored for source code tracking and its atomic unit of change is the line of code. As such, it doesn't work as conveniently when tracking, for example, changes in a paper (even if written in TeX), where a small change can reflow a whole paragraph, showing a diff that is much larger than the real change. In this case, the "track changes" features of word processors actually work better at showing the specific changes made (despite the fact that I think they make for a horrible interface for the overall workflow) [Note: in the comments below, a reader indicates that the --word-diff option solves this problem, though I think it requires a very new version of Git, 1.7.2 at least. But it's fantastic to see this kind of improvement being already available]. And for tracking changes to binary files, there's simply no meaningful diff available. It would be interesting to see new ideas for improving something like Git for these kinds of use cases.
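
The reflow problem is easy to reproduce with Python's standard difflib module. The sketch below only illustrates the idea of comparing by lines versus by words; it makes no claim about how Git implements --word-diff, and the two helper functions and the sample sentences are made up for this example:

```python
import difflib

# The same sentence before and after inserting one word and reflowing.
old = ("We measured the decay rate of the signal\n"
       "and found it to be consistent with theory.\n")
new = ("We measured the decay rate of the filtered\n"
       "signal and found it to be consistent with\n"
       "theory.\n")


def changed_lines(a, b):
    """Count the lines a line-based comparison marks as removed or added."""
    diff = difflib.ndiff(a.splitlines(), b.splitlines())
    return sum(1 for line in diff if line.startswith(("- ", "+ ")))


def changed_words(a, b):
    """List the edits found when comparing word by word, ignoring line breaks."""
    a_words, b_words = a.split(), b.split()
    matcher = difflib.SequenceMatcher(None, a_words, b_words, autojunk=False)
    return [(tag, a_words[i1:i2], b_words[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]


print(changed_lines(old, new))
print(changed_words(old, new))
```

Compared line by line, every line of both versions counts as changed, while the word-level comparison reports only the single inserted word.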

I wrapped things up with a short mention of the new Open Research Computation journal, where Victoria and I are members of the editorial board, as well as several well-known contributors to the scientific Python ecosystem, including Titus Brown, Hans-Petter Langtangen, Jarrod Millman, Prabhu Ramachandran and Gaël Varoquaux.

Other presentations

I spoke after Keith Baggerly and Victoria. Keith presented an amazing dissection of the (ongoing) scandal with the Duke University cancer clinical trials that has seen extensive media coverage. This case is a bone-chilling example of the worst that can happen when unreproducible research is used as the base for decisions that impact the health and lives of human beings. Yet, despite the rather dark subject, Keith's talk was one of the most lively and entertaining presentations I've seen at a conference in a long time. Victoria discussed the legal framework in which we can begin considering the problem of reproducible computational research; she was instrumental in the NSF's new grant guidelines now having a mandatory data management plan section. She has the unique combination of both a computational and a legal background, which is very necessary to tackle this problem in a meaningful way (since licensing, copyright and other legal issues are central to the discussion).

Afterwards, Michael Reich from the Broad Institute presented the GenePattern project, an impressive genomic analysis platform that includes provenance tracking and workflow execution, as well as a plug-in for Microsoft Word to connect documents with the execution engine. While the Word graphical user interface would likely not be my environment of choice, the GenePattern system seems to be very well thought out and useful. The last three talks were by Robert Gentleman of BioConductor fame, David Donoho --Victoria's PhD advisor and, together with Jon Claerbout, a pioneer in posing the problem of reproducibility in computational work-- and finally Mark Liberman of U. Penn (see Mark's blog for his take on the symposium).

I think the symposium went very well; there was lively discussion with the audience and good attendance. A journalist made a good point on how improvements on the reproducibility front are important for them, when they are trying to do their job of reporting to a sometimes skeptical public the results of scientific work. If our work is made available with strong, credible guarantees of reproducibility, it will be that much more easily presented to a society which ultimately decides whether to support the scientific endeavor (or not).

There is a lot of room for improvement, as Keith Baggerly's talk painfully reminded us. But I think that finally the climate is changing, and in this case in the right direction: the tools are improving, people are interested, funding agencies are modifying their policies and so are journals.