This article describes the new features in the Intel® Math Kernel Library Sparse Matrix Vector Multiply Format Prototype Package (Intel® MKL SpMV Format Prototype Package) for use on the Intel® Xeon Phi™ coprocessor. The package includes a new two-stage API for select SpMV operations as well as support for the ELLPACK Sparse Block (ESB) format.
Introduction
Sparse Matrix Vector Multiply (SpMV) is an important operation in many scientific applications, and its performance can be a critical part of overall application performance. SpMV is typically a bandwidth-limited operation, and its performance often depends on the sparsity structure of the matrix. This means that to optimize SpMV fully, we need to choose computational kernels and balancing algorithms that take the structure of the sparse input matrices into account.
The improved memory bandwidth of the Intel Xeon Phi coprocessors helps accelerate SpMV operations. Intel MKL 11.0 and later provides highly tuned SpMV kernels for the compressed sparse row (CSR) format on the Intel Xeon Phi coprocessor. While experiments show that the performance of Intel MKL CSR SpMV is close to optimal in many cases, for certain matrices additional performance improvements are possible. For example, we found that balancing work across the many cores (hereafter "workload balancing") improved CSR SpMV performance on Intel Xeon Phi coprocessors more than tuning the computational kernels did.
For sparse matrices, especially those with non-uniform structures, workload balancing helps improve the performance of SpMV on many-core architectures. It is important to note, however, that determining a suitable workload balance is itself time-consuming; if applied inappropriately, for example to a single SpMV call, it can degrade overall performance.
For repeated SpMV calls on matrices with the same structure, it is often advantageous to split the computation into stages. If we first analyze the matrix, an appropriate computational kernel and workload balancing algorithm can be selected; the results of this analysis stage can then be used to boost the performance of the SpMV calls that follow. This approach pays off as long as the total time for the analysis stage plus the optimized SpMV calls is less than the time for the same number of generic SpMV calls. After the last SpMV call, the data and structures created during the analysis stage should be released.
The current Intel MKL Sparse BLAS has Fortran-style interfaces (based on the NIST* interfaces) organized around single-step function calls: each function takes many parameters describing the input matrix and performs its computation in one step, with no assumptions made about the sparsity structure or storage details. Given these limitations, there is no obvious way for the current Intel MKL Sparse BLAS to retain matrix analysis information between function calls without significantly impacting performance. So in this new package, we extend the Intel MKL Sparse BLAS interfaces to use a staged approach that
- analyzes the matrix structure and selects the optimal computational kernels for a given sparse matrix
- provides user-controlled options for kernels or workload balancing algorithm selection
for a limited set of functions and also introduce a new sparse matrix format suitable for the Intel Xeon Phi coprocessor.
Description of the Intel MKL SpMV Format Prototype Package
The Intel MKL SpMV Format Prototype Package supports only general, non-transposed SpMV functionality on Intel Xeon Phi coprocessors for native and offload execution. A sparse matrix in this implementation is stored in a structure (handle). This approach allows us to investigate the input matrix only once, at the stage of creating the internal matrix representation, and to retain the results of the investigation for subsequent calls.
The Intel MKL SpMV Format Prototype Package supports two sparse formats: ELLPACK Sparse Block (ESB) (see http://dl.acm.org/citation.cfm?id=2465013 for details) and Compressed Sparse Row (CSR).
Let us briefly describe the ESB format, in which a sparse matrix is stored in slices; each slice consists of 8 rows and is stored in ELLPACK format. This means that the width of each slice equals the maximum number of non-zeros in any of its rows; shorter rows are padded with zeros to fill out the dense array. This format was specifically tuned for the Intel Xeon Phi coprocessor: each slice is stored in memory column-wise, so it can be processed column by column with SIMD instructions, packing 8 double precision elements into one register. In addition, a bit mask is stored for each column, with zero bits marking padded elements; this allows the efficient use of masked vector operations.
The Intel MKL SpMV Format Prototype Package operates on an internal matrix representation in CSR or ESB format. For both formats, the internal representation is created from an external CSR matrix. Additionally, a workload balancing algorithm can be chosen. Three algorithms are supported: static scheduling, dynamic scheduling, and (for the CSR format only) blocked scheduling. With static and dynamic scheduling, the input matrix is divided into many small chunks (around 2000), which are then assigned to threads either statically or dynamically at run time. With blocked scheduling, each thread processes one block of the input matrix, and all blocks hold roughly equal numbers of non-zeros.
Note: The sparse input matrix is actually duplicated in the internal structure.
Examples
Example of ELLPACK format (for a single 8-row slice).
Suppose the original sparse matrix is:
11 | 12 | 0 | 0 | 0 | 16 | 0 | 0 | 0 |
21 | 0 | 0 | 0 | 25 | 0 | 0 | 0 | 29 |
0 | 0 | 33 | 0 | 0 | 0 | 0 | 0 | 0 |
0 | 0 | 43 | 44 | 0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 55 | 0 | 57 | 0 | 59 |
61 | 0 | 0 | 0 | 0 | 66 | 0 | 0 | 0 |
0 | 72 | 0 | 0 | 0 | 0 | 0 | 78 | 0 |
0 | 0 | 0 | 84 | 0 | 0 | 87 | 0 | 0 |
Then sparse ELLPACK format for Intel Xeon Phi looks like:
Val:
11 | 12 | 16
21 | 25 | 29
33 |  0 |  0
43 | 44 |  0
55 | 57 | 59
61 | 66 |  0
72 | 78 |  0
84 | 87 |  0

Cols:
1 | 2 | 6
1 | 5 | 9
3 | * | *
3 | 4 | *
5 | 7 | 9
1 | 6 | *
2 | 8 | *
4 | 7 | *

Bit mask:
1 | 1 | 1
1 | 1 | 1
1 | 0 | 0
1 | 1 | 0
1 | 1 | 1
1 | 1 | 0
1 | 1 | 0
1 | 1 | 0
The spmv_new.c file, located in the __release_lnx/examples folder, demonstrates the implemented functionality on Linux platforms. To build and run the example, set up the proper compiler environment and run these make commands:
- make clean – clean the workspace
- make build – create the executable file
- make execute – run the executable on the Intel Xeon Phi coprocessor (mic0 by default)
The following example demonstrates conversion of a matrix in CSR format to the internal CSR and ESB representations used by the Intel MKL SpMV Format Prototype Package followed by SpMV compute routines.
/*
********************************************************************************
  Consider the matrix A (see 'Sparse Storage Formats for Sparse BLAS Level 2
  and Level 3' in the Intel MKL Reference Manual)

        |  1  -1   0  -3   0 |
        | -2   5   0   0   0 |
    A = |  0   0   4   6   4 |,
        | -4   0   2   7   0 |
        |  0   8   0   0  -5 |

  The matrix A is represented in a zero-based compressed sparse row storage
  scheme with three arrays (see 'Sparse Matrix Storage Schemes' in the
  Intel MKL Reference Manual) as follows:

        values   = ( 1 -1 -3 -2  5  4  6  4 -4  2  7  8 -5 )
        columns  = ( 0  1  3  0  1  2  3  4  0  2  3  1  4 )
        rowIndex = ( 0  3  5  8 11 13 )

  The test performs the following operations:

  The code computes A*S = F using sparseDesbmv and sparseDcsrmv,
  where A is a general sparse matrix and S and F are vectors.
********************************************************************************
*/
#include <stdio.h>
#include <assert.h>
#include <math.h>
#include "spmv_interface.h"

#define M   5
#define N   5
#define NNZ 13

int main()
{
    /*************************************************************************
     * Declaration and initialization of parameters for sparse
     * representation of the matrix A in the compressed sparse row format:
     *************************************************************************/
    int m = M, n = N, nnz = NNZ;

    /*************************************************************************
     * Sparse representation of the matrix A
     *************************************************************************/
    double csrVal[NNZ]    = { 1.0, -1.0, -3.0, -2.0, 5.0, 4.0, 6.0, 4.0,
                              -4.0, 2.0, 7.0, 8.0, -5.0 };
    int    csrColInd[NNZ] = { 0, 1, 3, 0, 1, 2, 3, 4, 0, 2, 3, 1, 4 };
    int    csrRowPtr[M+1] = { 0, 3, 5, 8, 11, 13 };

    // Matrix descriptor, new API variable
    sparseMatDescr_t descrA;
    // Internal CSR matrix representation, new API variable
    sparseCSRMatrix_t csrA;
    // Internal ESB matrix representation, new API variable
    sparseESBMatrix_t esbA;

    /*************************************************************************
     * Declaration of local variables:
     *************************************************************************/
    double x[M] = { 1.0, 5.0, 1.0, 4.0, 1.0 };
    double y[M] = { 0.0, 0.0, 0.0, 0.0, 0.0 };
    double alpha = 1.0, beta = 0.0;
    int i;

    // Create matrix descriptor
    sparseCreateMatDescr ( &descrA );

    // Create CSR matrix with static workload balancing algorithm
    sparseCreateCSRMatrix ( &csrA, SPARSE_SCHEDULE_STATIC );

    // Analyze input matrix and create its internal representation in the
    // csrA structure optimized for static workload balancing
    sparseDcsr2csr ( m, n, descrA, csrVal, csrRowPtr, csrColInd, csrA );

    // Compute y = alpha * A * x + beta * y
    sparseDcsrmv ( SPARSE_OPERATION_NON_TRANSPOSE, &alpha, csrA, x, &beta, y );

    // Release internal representation of CSR matrix
    sparseDestroyCSRMatrix ( csrA );

    // Create ESB matrix with static workload balancing algorithm
    sparseCreateESBMatrix ( &esbA, SPARSE_SCHEDULE_STATIC );

    // Analyze input CSR matrix and create its internal ESB representation in
    // the esbA structure optimized for static workload balancing
    sparseDcsr2esb ( m, n, descrA, csrVal, csrRowPtr, csrColInd, esbA );

    // Compute y = alpha * A * x + beta * y
    sparseDesbmv ( SPARSE_OPERATION_NON_TRANSPOSE, &alpha, esbA, x, &beta, y );

    // Release internal representation of ESB matrix
    sparseDestroyESBMatrix ( esbA );
    sparseDestroyMatDescr ( descrA );

    return 0;
}
The performance results (see chart below) show that the SpMV implementation for the ESB format is on average 30% faster than the implementation available in Intel MKL 11.1 Update 1, and for some matrices the computation is several times faster, which agrees with the article referenced previously. It should be noted that:
- ESB SpMV performance for some matrices is not as good as CSR SpMV;
- The workload scheduling algorithm significantly affects performance and should be chosen experimentally for a sparse matrix to achieve the best performance.
Note: General SpMV routines are used even though some of the matrices in the cases below are symmetric (self-adjoint).
We are seeking interested parties to evaluate this prototype implementation and provide us with feedback. If you are interested, please send a request to intel.mkl@intel.com to download the Intel MKL SpMV Format Prototype Package.
Copyright © 2013, Intel Corporation. All rights reserved. Intel, Xeon, and Intel Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries.