Wednesday, February 22, 2012

Compile Numpy and Scipy with Intel Math Kernel Library

Intel Math Kernel Library (Intel MKL) comes with its own LAPACK and BLAS implementations, optimized for Intel processors. Compiling Numpy and Scipy and linking them against Intel MKL can be a good way to speed up your computations.

For example, in my case, Intel MKL's DGETRF function, which performs LU decomposition, takes about 40% less time than the same function in Debian's default LAPACK and ATLAS implementation. Specifically, the running time of one LU decomposition call on a (3444, 2846) matrix dropped from ~2.5s (the default LAPACK) to ~1.5s (Intel MKL). The computation ran on an Intel Core 2 Quad Q9550 CPU with 4 GB of RAM. This is a worthwhile speedup for me, since I need to perform the LU decomposition many times (hundreds to thousands).
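
A timing like this can be reproduced with a short script. The following is only a minimal sketch, not the original benchmark: it assumes Numpy and Scipy are importable, uses a random matrix of the same shape, and reaches DGETRF through scipy.linalg.get_lapack_funcs. The helper names best_time and time_lu are made up for this illustration.

```python
import timeit


def best_time(func, repeats=3):
    """Return the fastest wall-clock time (in seconds) over `repeats` calls."""
    times = []
    for _ in range(repeats):
        start = timeit.default_timer()
        func()
        times.append(timeit.default_timer() - start)
    return min(times)


def time_lu(shape=(3444, 2846), repeats=3):
    """Time the LAPACK getrf routine on a random double-precision matrix."""
    import numpy as np
    from scipy.linalg import get_lapack_funcs

    a = np.asfortranarray(np.random.rand(*shape))
    # For a float64 array, 'getrf' resolves to DGETRF.
    getrf, = get_lapack_funcs(('getrf',), (a,))
    return best_time(lambda: getrf(a), repeats)


if __name__ == "__main__":
    try:
        print("getrf, best of 3: %.3f s" % time_lu())
    except ImportError:
        print("Numpy/Scipy are not importable; build and install them first.")
```

Running it once against the default LAPACK build and once against the MKL build gives two numbers to compare. Taking the best of a few repeats reduces noise from other processes.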

This tutorial lists the steps needed to compile Numpy and Scipy with the Intel C compiler and link them with Intel Math Kernel Library (Intel MKL) to make use of its optimized LAPACK and BLAS implementations for Intel processors. The Python version is 2.7.

To prepare, clone the git repositories of Numpy and Scipy to your local machine and check out the latest stable versions. At the time of writing, these are v1.6.1 for Numpy and v0.10.1rc1 for Scipy. Note that each checkout has to be run inside the corresponding repository.
$ git clone https://github.com/numpy/numpy.git
$ (cd numpy && git checkout v1.6.1)
$ git clone https://github.com/scipy/scipy.git
$ (cd scipy && git checkout v0.10.1rc1)

1. Compiling Numpy
To compile Numpy, we need Intel C++ Composer XE, which bundles the C/C++ compiler and also includes Intel MKL. To compile Scipy, we additionally need Intel Fortran Composer. These compiler suites are free for non-commercial use and can be downloaded from Intel's website. Unpack the tar packages, run ./install.sh, and follow the on-screen instructions to install the compilers.

Let us assume that the compilers are installed at /opt/intel/composer_xe_2011_sp1.9.293/. The version suffix in this path (sp1.9.293) is the release used in this tutorial. You may have a newer version, in which case adjust the paths below accordingly.

Now let's compile Numpy first. Go to the numpy/numpy/distutils folder and edit the intelccompiler.py file. This file contains the compiler flags for each processor architecture. On a 32-bit system, the class to modify is IntelCCompiler, whose compiler type is intel. Since we are building for a 64-bit system, we will modify the IntelEM64TCCompiler class, whose compiler type is intelem. I set self.cc_exe to 'icc -m64 -fPIC -O2 -g -openmp' and commented out the cc_exe and cc_args defined just before the __init__ function, since these two variables do not seem to be used. Remember that we will later set both --compiler and --fcompiler (the Fortran compiler, needed for Scipy) to intelem. If you are compiling on a 32-bit machine, set them to intel instead.

Next, copy site.cfg.example in Numpy's source folder and name it site.cfg. Add the following lines to site.cfg.
[DEFAULT]
library_dirs = /opt/intel/composer_xe_2011_sp1.9.293/compiler/lib/intel64:/opt/intel/composer_xe_2011_sp1.9.293/mkl/lib/intel64
include_dirs = /opt/intel/composer_xe_2011_sp1.9.293/mkl/include

[mkl]
mkl_libs = mkl_def, mkl_intel_lp64, mkl_intel_thread, mkl_core
lapack_libs = mkl_lapack95_lp64
libraries = iomp5
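
Since a mistyped path in site.cfg only shows up later as a failed library detection, it can help to check the directories before building. The following is a small sketch under a few assumptions: the helper name missing_dirs is hypothetical, it only looks at the [DEFAULT] section, and it splits the entries on os.pathsep (the ":" used above), which is how I understand numpy.distutils to read them.

```python
import os

try:
    import configparser  # Python 3
except ImportError:
    import ConfigParser as configparser  # Python 2.7


def missing_dirs(cfg_path, options=("library_dirs", "include_dirs")):
    """Return directories from the [DEFAULT] section that do not exist."""
    parser = configparser.RawConfigParser()
    parser.read(cfg_path)  # a missing file is silently ignored
    defaults = parser.defaults()
    missing = []
    for option in options:
        value = defaults.get(option, "")
        # Several directories are joined with os.pathsep (":" on Linux).
        for d in value.split(os.pathsep):
            d = d.strip()
            if d and not os.path.isdir(d):
                missing.append(d)
    return missing


if __name__ == "__main__":
    for d in missing_dirs("site.cfg"):
        print("missing directory: %s" % d)
```

Run from the folder that contains site.cfg, it prints nothing when every listed directory exists.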

Save site.cfg. Remember to remove any existing build folders before compiling.
$ rm -rf build
Compile Numpy using
$ python setup.py config --compiler=intelem build_clib --compiler=intelem build_ext --compiler=intelem build

Now we can proceed to install Numpy
$ sudo python setup.py install

You can now test whether the new Numpy works. First, the LD_LIBRARY_PATH environment variable has to include the library folders of the Intel C and Fortran compiler suite. This can be done easily by:
$ export LD_LIBRARY_PATH=
$ source /opt/intel/composer_xe_2011_sp1.9.293/bin/compilervars.sh intel64
$ source /opt/intel/composer_xe_2011_sp1.9.293/mkl/bin/mklvars.sh intel64
LD_LIBRARY_PATH is set automatically by compilervars.sh and mklvars.sh. Now start IPython in the same bash shell and import Numpy. It should import without any errors. If you want to run numpy.test(), make sure the python-nose package is installed on your system. The test suite checks that Numpy computes a set of example problems correctly.
>>> import numpy
>>> numpy.show_config()
>>> numpy.test() 
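
If the import fails because a shared library cannot be found, a quick look at LD_LIBRARY_PATH can narrow things down. The sketch below is only an illustration: the helper name dirs_containing is hypothetical, and the file names are my assumption of how the libraries named in site.cfg appear on disk (lib prefix, .so suffix).

```python
import os


def dirs_containing(filename, search_path=None):
    """Return the directories on the search path that contain `filename`."""
    if search_path is None:
        search_path = os.environ.get("LD_LIBRARY_PATH", "")
    hits = []
    for d in search_path.split(os.pathsep):
        if d and os.path.isfile(os.path.join(d, filename)):
            hits.append(d)
    return hits


if __name__ == "__main__":
    # Assumed file names, derived from the library names in site.cfg.
    for lib in ("libmkl_core.so", "libmkl_intel_thread.so", "libiomp5.so"):
        print("%s: %s" % (lib, dirs_containing(lib) or "NOT FOUND"))
```

Run it in the same shell in which compilervars.sh and mklvars.sh were sourced. A library reported as NOT FOUND suggests that the corresponding script was not sourced or that the install path differs.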

2. Compiling Scipy
After Numpy is compiled and installed, Scipy can be compiled in almost the same way. First, the site.cfg file we made for Numpy compilation can be used for Scipy, too. Copy site.cfg from the Numpy folder to the Scipy folder, remove any existing build folder, and build Scipy using:
$ python setup.py config --compiler=intelem --fcompiler=intelem build_clib --compiler=intelem --fcompiler=intelem build_ext --compiler=intelem --fcompiler=intelem build

After Scipy is compiled, we can now install it. Notice that I was *not* able to execute
$ sudo python setup.py install
successfully, because sudo does not pass the previously set LD_LIBRARY_PATH on to setup.py, so the needed libraries cannot be found. I ended up running setup.py as root, which requires setting LD_LIBRARY_PATH one more time.
$ su
# export LD_LIBRARY_PATH=
# source /opt/intel/composer_xe_2011_sp1.9.293/bin/compilervars.sh intel64
# source /opt/intel/composer_xe_2011_sp1.9.293/mkl/bin/mklvars.sh intel64
# python setup.py install

Everything should now be working. You can test Scipy with
>>> import scipy
>>> scipy.show_config()
>>> scipy.test()


Appendix

Note that the library iomp5 is necessary in site.cfg. It contains the symbols used by the mkl_intel_thread library. If iomp5 is not linked when Numpy is compiled, the compilation still succeeds, but when you import Numpy (say, in IPython):
>>> import numpy
you will get an exception saying undefined reference to `__kmpc_reduce_nowait' in mkl_intel_thread.so. You can verify this with a quick check of the symbols defined in mkl_intel_thread.so and libiomp5.so. You will see that
$ nm mkl_intel_thread.so | grep nowait
returns nothing, while
$ nm libiomp5.so | grep nowait 
shows that the __kmpc_reduce_nowait is defined in libiomp5.
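
When reading nm output, the one-letter type in front of each name matters: U means the file only references the symbol, while T means the symbol is defined in its text (code) section. The following sketch makes that distinction explicit. The helper name symbol_status and the two sample lines are hypothetical, and as a simplification any type other than U is counted as defined. If plain nm prints nothing for a shared library, the dynamic symbol table (nm -D) may still list the symbol.

```python
def symbol_status(nm_output, symbol):
    """Classify `symbol` as 'defined', 'undefined' or 'absent' in nm output."""
    kinds = set()
    for line in nm_output.splitlines():
        fields = line.split()
        # nm prints "[address] TYPE NAME"; undefined symbols have no address.
        if len(fields) >= 2 and fields[-1] == symbol:
            kinds.add(fields[-2])
    if not kinds:
        return "absent"
    if kinds == set(["U"]):
        return "undefined"
    # Simplification: every type other than U is treated as a definition.
    return "defined"


if __name__ == "__main__":
    # Hypothetical nm lines for illustration; real addresses will differ.
    user = "                 U __kmpc_reduce_nowait\n"
    provider = "0000000000045a10 T __kmpc_reduce_nowait\n"
    print(symbol_status(user, "__kmpc_reduce_nowait"))
    print(symbol_status(provider, "__kmpc_reduce_nowait"))
```

A library that reports the symbol as undefined depends on some other library, here libiomp5, to provide it at load time.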

Chapter 5 of the Intel MKL documentation also indicates that libiomp5, being a runtime library, is necessary in addition to the mkl_intel_lp64 (interface), mkl_intel_thread (threading), and mkl_core (computation) layers.

A few sources on the web suggest adding iomp5 to mkl_libs. However, I found that this does not work. I needed to add the line libraries = iomp5 to make it link correctly.

Also notice that mkl_def is needed. Otherwise, the runtime error MKL_FATAL_ERROR: Cannot load neither xxxx will appear when some Intel MKL functions are called from Numpy.

2 comments :

  1. Hi.

    Have you done any benchmarks to see how fast your numpy is?
    I did the same as you described, but my numpy.dot function is 2 times slower than matlab matrix product...

  2. I only tested the LU decomposition using Intel MKL and default LAPACK, as mentioned in the start of my post.

    I didn't test matrix multiplication, but I believe Intel MKL performance should be comparable to that in MATLAB. What is your matrix size?
