Thursday, May 19, 2011

Austin trip: IPython at TACC and DataArray summit at Enthought

TACC Software Days

I recently had the chance to speak at the Texas Advanced Computing Center (TACC) at UT Austin, during their Fifth Annual Scientific Software Day, thanks to a kind invitation by Sergey Fomel and Victor Eijkhout.  Since the audience wasn't specifically composed of Python users, I gave a general introduction to Python's role in scientific computing, but then spent most of my time presenting some of the recent work we've been doing on IPython, extending the model of basic interactive computing in what I think are interesting directions with multiple client models and new parallel computing interfaces. Sergey had asked me to provide a somewhat personal account, so the presentation is fairly biased towards my own path (and therefore IPython) through the scipy effort.  Since the focus of TACC is high-performance computing, I hope some of our new functionality on the IPython front will be useful to such users.

There were some very interesting presentations about the FLAME linear algebra library. I did my best to convince Robert van de Geijn of the interest there would be in the scientific Python community for exposing FLAME to Python, possibly as an alternative backend for the linear algebra machinery in scipy.  Since FLAME uses a fair amount of code generation, I think the dynamic properties of Python would make it a great fit for the FLAME paradigm.  We had some interesting discussions on this; we'll see where this develops...

Datarray summit at Enthought

In conjunction with this visit, we had been trying to organize a meeting at Enthought to make some progress on the datarray effort that was started after last year's scipy conference.  I'd like to thank Sergey for kindly allowing my UT-funded visit to extend into the datarray summit.  Enthought brought a large contingent of their in-house team (developers and interns like Mark Wiebe of numpy fame), and invited Wes McKinney (pandas) and Matthew Brett (nipy) to participate (as well as others who couldn't make it).  In all we had roughly 15 people, with Travis Oliphant, Eric Jones, Robert Kern, Peter Wang and Corran Webster (all very experienced numpy users/developers) participating for most of the meeting, which was both a lot of fun and very productive.  The datarray effort has continued to progress, but fairly slowly, mostly due to my being timesliced into oblivion.  But I remain convinced it's a really important piece of the puzzle for scientific Python to really push hard into high-level data analysis, and I'm thrilled that Enthought is putting serious resources into this.

We spent a lot of time on the first day going over use cases coming from many different fields, and then dove into the API design questions, using the current code in the datarray repository as a starting point.  It has become very clear that one key piece of functionality we can't ignore is allowing for axis labels to be integers (without any assumption of ordering, continuity or monotonicity). This proves to be surprisingly tricky to accommodate, because the assumption that integers are indices into an array, ranging from 0 to n-1 along any dimension, goes very deep in all slicing operations, so allowing for integers with different semantics requires a fair amount of API gymnastics.
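To make the ambiguity concrete, here is a minimal sketch in plain Python. The `LabeledAxis` class and its `position` method are hypothetical names made up for this illustration, not the datarray API, and a plain list stands in for an array. It only shows why the same integer can mean two different things once labels are allowed to be integers.

```python
class LabeledAxis:
    """Hypothetical helper (not the datarray API): maps labels to positions."""

    def __init__(self, labels):
        self.labels = list(labels)
        # Build a label -> position lookup; labels must be unique.
        self._positions = {label: i for i, label in enumerate(self.labels)}
        if len(self._positions) != len(self.labels):
            raise ValueError("axis labels must be unique")

    def position(self, label):
        """Return the 0-based position of the given label."""
        return self._positions[label]


data = [1.5, 2.5, 3.5]
# Integer labels with no ordering, continuity or monotonicity.
axis = LabeledAxis([10, 0, 7])

# The integer 0 read as a position picks the first element...
by_position = data[0]
# ...but read as a label it picks the element labeled 0, which sits at position 1.
by_label = data[axis.position(0)]

print(by_position, by_label)
```

Any API that accepts a bare integer in a slicing expression has to pick one of these two readings, which is why an explicit entry point for label-based access looks attractive.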

Not everything is fully settled (and I missed some of the details because I could not stay until the very end), but it's clear that we will have integer labels, and that the main entry point for all axis slicing will be a .axes attribute. This will likely provide the basic named axis attribute-based access, as well as dictionary-style access to computed axes.  We will also probably have one method ('slice' was the running name when I left) for slicing with more elaborate semantics, so that we can resolve the ambiguities of integer labeling without resorting to obscure magic overloads (numpy already has enough of those with things like ogrid[10:20:3j], where that "complex slice step" is always fun to explain to newcomers).
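For readers who haven't run into that overload, a short sketch of what the "complex slice step" does in numpy: an integer step behaves like `range()`, while an imaginary step is reinterpreted as the number of points to generate, with the stop value included.

```python
import numpy as np

# Integer step: start at 10, advance by 3, stop value excluded.
integer_step = np.ogrid[10:20:3]

# "Complex" step: 3j means 3 evenly spaced points, stop value included.
complex_step = np.ogrid[10:20:3j]

print(integer_step)  # [10 13 16 19]
print(complex_step)  # [10. 15. 20.]
```

The two expressions differ by a single character yet produce arrays of different length, dtype and endpoint behavior, which is the kind of magic a dedicated slicing method is meant to avoid.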

We just completed an episode of the inSCIght podcast that contains some more discussion about this with Travis and Wes.  I really hope we won't lose the momentum we seem to have picked up and that over the next few months this will start taking shape into production code.  I know we need it, badly.


6 comments:

Andy said...

Glad to hear you are on van de Geijn's case, I've been telling him this for a couple of years now. The real problem is licensing (libflame is LGPL) and making sure van de Geijn doesn't have to support the software.

I would say wrapping up the current libflame with Cython would be no trouble at all. Ignition has an interface for defining the operations and generating the algorithms but not the low-level implementation.

Fernando said...

Hey @Andy, I would go as far as saying I'm really on his case, but I certainly think it's the right way to go. LGPL licensing isn't that big of a deal, there's a fair amount of LGPL dependencies in the scipy stack (WX, Qt), so I don't think that would worry anyone too much.

Maybe you can take that project up in your copious spare time ;)

Fernando said...

s/would/wouldn't/

Gaël said...

Hi Fernando,

I am really glad that you are still putting some energy into DataArrays. I think, as you do, that they are a very important piece of the puzzle.

chuck said...

Besides DataArrays, I think better low-level support for masked arrays would be very useful. There was some demo code posted, oh, maybe 2 years ago that added support for them at the ufunc level. I was sorry that it didn't go any further than that.

Ondřej Čertík said...

@Fernando, nice blog post.

Btw, I don't use Wx nor Qt, just a terminal and Chrome/Firefox. And it seems to have (or will have very soon) everything I need for scientific computing.