Monday, February 23, 2015

linalg support in pypy/numpy

Introduction

PyPy's numpy support has matured enough that it can now expose the lapack/blas libraries through the numpy.linalg module. To use the version of numpy this blog post refers to, install PyPy version 2.5.0 or newer, and run this:

pypy -m pip install git+https://bitbucket.org/pypy/numpy.git

This update is a major step forward for PyPy's numpy support. Many of the basic matrix operations depend on linalg, even matplotlib requires it to display legends (a pypy-friendly version of matplotlib 1.3 is available at https://github.com/mattip/matplotlib).

A number of improvements and adaptations, some of which are in the newly-released PyPy 2.5.0, made this possible:
  • Support for an extended frompyfunc(), which in the PyPy version implements much of the ufunc API (signatures, multiple dtypes), allowing creation of pure-python, jit-friendly ufuncs. An additional keyword allows choosing between out = func(in) and func(in, out) ufunc signatures. More explanation follows below.
  • Support for GenericUfuncs via PyPy's (slow) capi-compatibility layer. The underlying mechanism actually calls the internal implementation of frompyfunc().
  • A cffi version of _umath_linalg. Since cffi uses dlopen() to call into shared objects, we added support in the numpy build system to create non-python shared libraries from source code in the numpy tree. We also rewrote parts of the c-based _umath_linalg.c.src in python, renamed numpy's umath_linalg capi module to umath_linalg_capi, and use it as a shared object through cffi (a minimal sketch of the cffi mechanism follows this list).
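
To make the dlopen() mechanism concrete, here is a minimal sketch of cffi's ABI mode as we use it: declare a C prototype, dlopen() a shared object, then call the symbol like a python function. To keep the sketch self-contained it calls sqrt from the system math library rather than the real umath_linalg_capi object; the library filename is platform-specific.

import cffi

ffi = cffi.FFI()
# Declare the C prototype of the function we want to call.
ffi.cdef("double sqrt(double x);")
# dlopen() the shared object; the filename varies per platform.
libm = ffi.dlopen("libm.so.6")
assert libm.sqrt(16.0) == 4.0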

Status

We have not completely implemented all the linalg features. dtype resolution via casting is missing, especially for complex ndarrays, which leads to slight numerical errors where numpy uses a more precise type for intermediate calculations. Other missing features in PyPy's numpy support may have implications for complete linalg support.

Some OSX users have noticed they need to update pip to version 6.0.8 to overcome a regression in pip, and it is not clear if we support all combinations of blas/lapack implementations on all platforms.

Over the next few weeks we will be ironing out these issues.

Performance

A simple benchmark is shown below, but let's state the obvious: PyPy's JIT and the iterators built into PyPy's ndarray implementation will in most cases be no faster than CPython's numpy. The JIT can help where there is a mixture of python and numpy-array code. We do have plans to implement lazy evaluation and to further optimize PyPy's support for numeric python, but numpy is quite good at what it does.
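
As a rough illustration of that kind of mixed code (not one of our benchmarks), consider a pure-python loop over ndarray elements: CPython pays interpreter overhead on every element access, while PyPy can trace and compile the whole loop. The function here is made up for the example.

from numpy import arange, zeros

def running_mean3(a):
    # Scalar python code mixed with ndarray element access: the kind
    # of loop PyPy's JIT can speed up relative to CPython.
    out = zeros(a.size)
    for i in range(1, a.size - 1):
        out[i] = (a[i - 1] + a[i] + a[i + 1]) / 3.0
    return out

smoothed = running_mean3(arange(1000, dtype=float))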

HowTo for PyPy's extended frompyfunc

The magic enabling blas support is a rewrite of the _umath_linalg c-based module as a cffi-python module that creates ufuncs via frompyfunc. We extended the numpy frompyfunc to allow it to function as a replacement for the generic ufunc available in numpy only through the c-api.

We start with the basic frompyfunc, which wraps a python function into a ufunc:
 
def times2(in0):
    return in0 * 2
ufunc = frompyfunc(times2, 1, 1)

In cpython's numpy the dtype of the result is always object, which is not implemented (yet) in PyPy, so this example will fail. While the utility of object dtypes can be debated, in the meantime we add a non-numpy-compatible keyword argument dtypes to frompyfunc. If dtypes=['match'] the output dtype will match the dtype of the first input ndarray:

ufunc = frompyfunc(times2, 1, 1, dtypes=['match'])
ai = arange(24).reshape(3, 4, 2)
ao = ufunc(ai)
assert (ao == ai * 2).all()

I hear you ask "why is the dtypes keyword argument a list?" This is so we can support the Generalized Universal Function API, which allows specifying a number of specialized functions and the input-output dtypes each specialized function accepts.
Note that the ufunc machinery feeds the function the values of ai one at a time; the function operates on scalar values. To support more complicated ufunc calls, the generalized ufunc API allows defining a signature, which specifies the layout of the ndarray inputs and outputs. So we extended frompyfunc with a signature keyword as well.
We add one further extension to frompyfunc: we allow a Boolean keyword stack_inputs to specify the argument layout of the function itself. If the function is of the form:
 
out0, out1, ... = func(in0, in1,...)

then stack_inputs is False. If it is True the function is of the form:
 
func(in0, in1, ... out0, out1, ...)

Here is a complete example of using frompyfunc to create a ufunc:
 
from numpy import arange, dtype, frompyfunc

def times2(in_array, out_array):
    # With stack_inputs=True the inputs and outputs arrive together as
    # whole arrays, so we loop over the flat iterators ourselves.
    in_flat = in_array.flat
    out_flat = out_array.flat
    for i in range(in_array.size):
        out_flat[i] = in_flat[i] * 2

# Two specializations (int and float), each declaring one input dtype
# and one output dtype, with a one-dimensional signature.
ufunc = frompyfunc([times2, times2], 1, 1,
                signature='(i)->(i)',
                dtypes=[dtype(int), dtype(int),
                        dtype(float), dtype(float),
                       ],
                stack_inputs=True,
                )
ai = arange(10, dtype=int)
ai2 = ufunc(ai)
assert all(ai2 == ai * 2)

Using this extended syntax, we rewrote the lapack calls into the blas functions in pure python, with no C needed. Benchmarking showed this approach to be much slower than using the upstream umath_linalg module via cpyext, as can be seen in the benchmarks below. This is due to the need to copy c-aligned data into Fortran-aligned format: our __getitem__ and __setitem__ iterators are not as fast as pointer arithmetic in C. So we next tried a hybrid approach: compile and use numpy's umath_linalg c module as a shared object, and call the optimized specific wrapper function from it.
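
A very rough sketch of that hybrid approach follows. The shared-object filename, the wrapper prototype and the way the buffer pointer is obtained are hypothetical stand-ins for what the real module does; the actual code resolves one specialized wrapper per dtype.

import cffi
from numpy import dtype, frompyfunc

ffi = cffi.FFI()
# Hypothetical prototype standing in for one of the compiled,
# dtype-specialized lapack wrapper functions.
ffi.cdef("void wrap_dinv(double *in, double *out, long n);")
lib = ffi.dlopen("libumath_linalg_capi.so")

def inv_impl(in_array, out_array):
    # Python only forwards raw buffer pointers; the optimized C
    # wrapper does the actual work.
    in_ptr = ffi.cast("double *", in_array.__array_interface__['data'][0])
    out_ptr = ffi.cast("double *", out_array.__array_interface__['data'][0])
    lib.wrap_dinv(in_ptr, out_ptr, in_array.shape[0])

inv_ufunc = frompyfunc([inv_impl], 1, 1,
                       signature='(i,i)->(i,i)',
                       dtypes=[dtype(float), dtype(float)],
                       stack_inputs=True,
                       )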

Benchmarks

Here are some benchmarks, running a tight loop of the different versions of linalg.inv(a), where a is a 10x10 double ndarray. The benchmark ran on an i7 processor running ubuntu 14.04 64 bit:
Impl.                                               Time after warmup
CPython 2.7 + numpy 1.10.dev + lapack               8.9 msec/1000 loops
PyPy 2.5.0 + numpy + lapack via cpyext              8.6 msec/1000 loops
PyPy 2.5.0 + numpy + lapack via pure python + cffi  19.9 msec/1000 loops
PyPy 2.5.0 + numpy + lapack via python + c + cffi   9.5 msec/1000 loops


While no general conclusions may be drawn from a single micro-benchmark, it does indicate that there is some merit in the approach taken.

Conclusion

PyPy's numpy now includes a working linalg module. There are still some rough corners, but hopefully we have implemented the parts you need. While the isolated linalg functions are no faster than in CPython with upstream numpy, they should not be significantly slower either. Your use case may see an improvement if you use a mix of python and lapack, which is the usual case.

Please let us know how it goes. We love to hear success stories too.

We still have challenges at all levels of programming, and are always looking for people willing to contribute, so stop by on IRC at #pypy.

mattip and the PyPy Team

23 comments:

Olivier Grisel said...

Interesting work, although benchmarking linear algebra routines on 10x10 arrays feels wrong: typical linear algebra applications use hundreds or thousands of dimensions. Would you mind re-running those benchmarks on 1000x1000 arrays instead? The use of the CPU cache and multiple threads can have a big impact on such workloads.

Also, some numpy / scipy developers are working on making OpenBLAS the default BLAS/LAPACK for the Windows wheel packages, and maybe later for the OSX packages as well.

Under Linux (Debian / Ubuntu) it's pretty easy to have libblas.so / liblapack.so be symlinks to either ATLAS or OpenBLAS using the update-alternatives system.

Maciej Fijalkowski said...

What the blog post somehow fails to mention is that we do not reimplement those routines but reuse whatever underlying library is there. The measurement of the actual speed is then not that interesting, because we're only interested in the overhead of the call.

Olivier Grisel said...

It might still be interesting to run that kind of benchmark on more realistic workloads (maybe in addition to some micro-workloads) to see the importance of the remaining overhead in a typical usage scenario.

mattip said...

The most interesting benchmark is probably the one only you can run, i.e. how does pypy perform for you on your workload.

As far as lapack vs openblas, we will try to imitate what numpy does. If cpython/numpy supports a variation of lapack and pypy/numpy doesn't, that should be considered a bug.

Please let us know how it works for you.

Olivier Grisel said...

> The most interesting benchmark is probably the one only you can run, i.e. how does pypy perform for you on your workload.

I agree, but inverting a 10x10 matrix is probably not representative of anybody's workload.

While it's important not to introduce too much overhead in the bindings, I think it's also good to keep in mind that an overhead of the order of the micro-second is completely negligible compared to the execution time of a typical linear algebra operation running on realistically sized data. Hence my original remark.

> As far as lapack vs openblas, we will try to imitate what numpy does. If cpython/numpy supports a variation of lapack and pypy/numpy doesn't, that should be considered a bug.

Just to clarify OpenBLAS is an implementation of the standard BLAS API that also includes the official LAPACK implementation from netlib linked against its own optimized BLAS routines. The 2 main open source optimized implementations of BLAS/LAPACK supported by numpy & scipy are ATLAS and OpenBLAS.

Romain Guillebert said...

> While it's important not to introduce too much overhead in the bindings, I think it's also good to keep in mind that an overhead of the order of the micro-second is completely negligible compared to the execution time of a typical linear algebra operation running on realistically sized data. Hence my original remark.

But then you're just benchmarking the underlying library, which is the exact same library as numpy.

Olivier Grisel said...

> But then you're just benchmarking the underlying library, which is the exact same library as numpy.

Yes, I agree. I just want to highlight that for most common real-life use cases, a small performance overhead in those LAPACK bindings is almost never a problem.

Otherwise your readers might be misled into thinking that the "PyPy 2.5.0 + numpy + lapack via pure python + cffi" version is significantly suboptimal (2x slowdown!) while in practice a couple of additional microseconds might be completely undetectable compared to the actual execution time of the "inv" function, which typically lasts more than a millisecond on anything that is non-toy data.

Romain Guillebert said...

ok, makes sense :)

Anonymous said...

Additional data point: repeatedly inverting a ~10x10 matrix is exactly what I need performance on - for running an extended Kalman Filter. : )

Olivier Grisel said...

> Additional data point: repeatedly inverting a ~10x10 matrix is exactly what I need performance on - for running an extended Kalman Filter. : )

Fair enough: so there actually exists a use case for that benchmark. Optimizing the bindings overhead might thus be worthwhile in the end.

Yaacov said...

I love hearing about the progress and wish I could test on my benchmarks. Any chance of windows support?

mattip said...

Yaacov, what is missing for you to try it?
Here is the way I verify that the code works on windows 7 64 bit and windows 8.1:

download and install compiler
http://www.microsoft.com/en-us/download/details.aspx?id=44266

download pypy and open the zip
https://bitbucket.org/pypy/pypy/downloads/pypy-2.5.0-win32.zip

install pip into pypy
https://bootstrap.pypa.io/get-pip.py

install numpy into pypy
pip install git+https://bitbucket.org/pypy/numpy.git

Yaacov said...

I get a traceback ending in

\appdata\local\temp\pip-wdyqtr-build\numpy\distutils\misc_util.py", line 872, in _get_configuration_from_setup_py
    config = setup_module.configuration(*args)
  File "numpy\linalg\setup.py", line 85, in configuration
    library_dirs = [sys.real_prefix + '/include',
AttributeError: 'module' object has no attribute 'real_prefix'

in a warning "using unoptimized lapack"

Unknown said...

I still get non-existing conjugate method error when using, e.g., linalg.pinv. Any plan on getting this working?

mattip said...

fixed, try a nightly (from tomorrow)

Derek Z said...

I got an error message:

OSError: Cannot load library /usr/local/Cellar/pypy/2.5.0/libexec/site-packages/numpy/linalg/libumath_linalg_cffi.so: dlopen(/usr/local/Cellar/pypy/2.5.0/libexec/site-packages/numpy/linalg/libumath_linalg_cffi.so, 2): image not found

Is there anything I am not doing right for the installation? I have pypy 2.5, and Mac OS 10.10.

mattip said...

Are you installing via pip? If so, we have had reports of older versions of pip failing. You should have pip 6.0.8 or later. See https://bitbucket.org/pypy/numpy/issue/21

melbic said...

Same problem here. (OSX 10.10) I've got the newest pip (6.0.8) and setuptools (14.0.3) version installed.

mattip said...

I can't reproduce this as I do not have a MacOS machine. The place to follow this up is on our issue tracker, https://bitbucket.org/pypy/numpy/issue/21

It would be most helpful to attach a full log from "pip install" and 'pypy -c "import numpy"' to that issue

Nimrod said...

One way pypy might be able to outperform numpy is by eliminating temporaries.

Just converting the BLAS functions to chain operations efficiently and sometimes update in-place rather than allocating and de-allocating arrays should help a lot.

Unknown said...

This is great! But I can't use this for almost any of my code before np.einsum is supported :/ IMO, it is a super useful function for almost anything. Any plans for supporting it?

mattip said...

Koos Zevenhoven - we have plans to implement all of numpy. With that, it looks like einsum will take quite a bit of work