
The Intel® Math Kernel Library Sparse Matrix Vector Multiply Format Prototype Package


This article describes the new features in the Intel® Math Kernel Library Sparse Matrix Vector Multiply Format Prototype Package (Intel® MKL SpMV Format Prototype Package) for use on the Intel® Xeon Phi™ coprocessor. The package includes a new two-stage API for select SpMV operations as well as support for the ELLPACK Sparse Block (ESB) format.

Introduction

Sparse Matrix Vector Multiply (SpMV) is an important operation in many scientific applications, and its performance can be a critical part of overall application performance. Because SpMV is typically a bandwidth-limited operation, its performance is often dependent on the sparsity structure of the matrix. This means that to optimize SpMV fully, we need to choose computational kernels and balancing algorithms that take the structure of the sparse input matrices into account.

The improved memory bandwidth of the Intel Xeon Phi coprocessors helps accelerate SpMV operations. Intel MKL 11.0 and later provides highly tuned compressed sparse row (CSR) SpMV kernels for the Intel Xeon Phi coprocessor. While experiments show that the performance of Intel MKL CSR SpMV is close to optimal in many cases, there are certain matrices where additional performance improvements are possible. For example, we found that improving how work is balanced across the many cores (hereafter "workload balancing") in CSR SpMV on Intel Xeon Phi coprocessors benefited performance more than tuning the computational kernels did.

For sparse matrices, especially those with non-uniform structures, workload balancing should help improve the performance of SpMV on many-core architectures. It is important to note, however, that determining a suitable workload balance is itself time-consuming; if used incorrectly, for example for a single SpMV call, it may degrade performance.

For repeated SpMV calls on matrices with the same structure, it is often advantageous to do the computations in multiple stages. That is, if we first analyze the matrix, the appropriate computational kernel and workload balancing algorithm can be determined. The results of this analysis stage can then be used to boost the performance of the multiple SpMV calls that follow. This approach pays off as long as the total time for the analysis stage plus the repeated SpMV calls is less than that of the same number of generic SpMV calls. At the end of the SpMV calls, the data and structures created during the analysis stage should be released.
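To make the staged pattern concrete, here is a minimal sketch using the API this package provides (the routines appear in full in the example later in this article; the matrix data, vectors, and iteration count are placeholders):

    // Analysis stage: investigate the matrix once and build an internal
    // representation optimized for the chosen balancing algorithm.
    sparseCSRMatrix_t csrA;
    sparseCreateCSRMatrix ( &csrA, SPARSE_SCHEDULE_STATIC );
    sparseDcsr2csr ( m, n, descrA, csrVal, csrRowPtr, csrColInd, csrA );

    // Execution stage: amortize the analysis cost over many SpMV calls.
    for ( int iter = 0; iter < niters; iter++ )
        sparseDcsrmv ( SPARSE_OPERATION_NON_TRANSPOSE, &alpha, csrA, x, &beta, y );

    // Final stage: release the data created during analysis.
    sparseDestroyCSRMatrix ( csrA );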

The current Intel MKL Sparse BLAS has Fortran-style interfaces (based on the NIST* interfaces) and is organized around single-shot function calls: each function takes many parameters describing the input matrix and performs its computation in one step, with no assumptions made about the sparsity structure or storage details. Given these limitations, there is no obvious way to preserve matrix analysis information between function calls in the current Intel MKL Sparse BLAS without significantly impacting performance. So in this new package, we extend the Intel MKL Sparse BLAS interfaces to use a staged approach that

  • analyzes the matrix structure and selects the optimal computational kernels for a given sparse matrix
  • provides user-controlled options for kernels or workload balancing algorithm selection 

for a limited set of functions. The package also introduces a new sparse matrix format suitable for the Intel Xeon Phi coprocessor.

 Description of the Intel MKL SpMV Format Prototype Package

The Intel MKL SpMV Format Prototype Package supports only general, non-transposed SpMV functionality on Intel Xeon Phi coprocessors for native and offload execution. A sparse matrix in this implementation is stored in a structure (handle). This approach allows us to investigate the input matrix only once, at the stage of creating the internal matrix representation, and to retain the results of the investigation for further calls.

The Intel MKL SpMV Format Prototype Package supports two sparse formats: ELLPACK Sparse Block (ESB) (see http://dl.acm.org/citation.cfm?id=2465013 for details) and Compressed Sparse Row (CSR).

Let us briefly describe the ESB format, in which a sparse matrix is stored in slices; each slice consists of 8 rows and is stored in ELLPACK format. This means that the width of each slice equals the maximum number of non-zeros in any of its rows; shorter rows are padded with zeros to fill out the dense array. This format was specifically tuned for Intel Xeon Phi: each slice is stored in memory column-wise, so it can be processed column-by-column by SIMD instructions, with 8 double precision elements packed into one register. Also, for each column a bit mask is stored in which a 0 bit marks a padded element, allowing the efficient use of masked vector operations.
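As an illustration of this layout, here is a small sketch (our own illustration, not the package's internal code; for simplicity it stores one mask byte per element rather than one packed 8-bit mask per column) that packs one 8-row slice of a zero-based CSR matrix into column-major ELLPACK arrays:

    #define SLICE 8

    /* Pack rows [first, first + SLICE) of a CSR matrix into one ESB-style
       slice. width must be the maximum number of non-zeros among these rows;
       val, col, and mask are width * SLICE output arrays. */
    void pack_esb_slice(int first, int width,
                        const double *csrVal, const int *csrRowPtr,
                        const int *csrColInd,
                        double *val, int *col, unsigned char *mask)
    {
        for (int c = 0; c < width; c++) {       /* slice column            */
            for (int r = 0; r < SLICE; r++) {   /* row within the slice    */
                int start = csrRowPtr[first + r];
                int len   = csrRowPtr[first + r + 1] - start;
                int k     = c * SLICE + r;      /* column-major position   */
                if (c < len) {                  /* real element            */
                    val[k]  = csrVal[start + c];
                    col[k]  = csrColInd[start + c];
                    mask[k] = 1;
                } else {                        /* zero padding            */
                    val[k]  = 0.0;
                    col[k]  = 0;                /* any valid column index  */
                    mask[k] = 0;                /* 0 marks a padded slot   */
                }
            }
        }
    }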

The Intel MKL SpMV Format Prototype Package operates on an internal matrix representation in CSR or ESB format. For both formats the internal representation is created from an external CSR matrix. Additionally, a workload balancing algorithm can be chosen. The following algorithms are supported: static scheduling, dynamic scheduling, and (for the CSR format) blocked scheduling. With static and dynamic scheduling, the input matrix is divided into many small chunks (around 2000), which are then assigned to threads either statically at compile time or dynamically at run time. With blocked scheduling, each thread processes one block of the input matrix, and all blocks have roughly equal numbers of non-zeros; a sketch of this idea follows the note below.

Note: The sparse input matrix is actually duplicated in the internal structure. 
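For illustration, equal-non-zero blocks can be derived from the CSR row pointer alone (our own sketch of the idea, not the library's actual algorithm):

    /* Split m rows into nblocks contiguous blocks with roughly equal numbers
       of non-zeros; rowPtr is the zero-based CSR row pointer and blockStart
       receives nblocks + 1 entries. */
    void balance_by_nnz(int m, int nblocks, const int *rowPtr, int *blockStart)
    {
        long nnz = rowPtr[m];
        int  row = 0;
        blockStart[0] = 0;
        for (int b = 1; b < nblocks; b++) {
            long target = nnz * b / nblocks;    /* ideal non-zero boundary */
            while (row < m && rowPtr[row] < target)
                row++;
            blockStart[b] = row;
        }
        blockStart[nblocks] = m;
    }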

Examples

Example of ELLPACK format (for a single block).

Suppose the original sparse matrix is:

    11 12  0  0  0 16  0  0  0
    21  0  0  0 25  0  0  0 29
     0  0 33  0  0  0  0  0  0
     0  0 43 44  0  0  0  0  0
     0  0  0  0 55  0 57  0 59
    61  0  0  0  0 66  0  0  0
     0 72  0  0  0  0  0 78  0
     0  0  0 84  0  0 87  0  0

 

Then the sparse ELLPACK format for Intel Xeon Phi looks like (each row padded to the maximum of 3 non-zeros, with * denoting a padded column index):

Val:

    11 12 16
    21 25 29
    33  0  0
    43 44  0
    55 57 59
    61 66  0
    72 78  0
    84 87  0

Cols:

    1 2 6
    1 5 9
    3 * *
    3 4 *
    5 7 9
    1 6 *
    2 8 *
    4 7 *

Bit mask:

    1 1 1
    1 1 1
    1 0 0
    1 1 0
    1 1 1
    1 1 0
    1 1 0
    1 1 0


The spmv_new.c file, located in the __release_lnx/examples folder, demonstrates the implemented functionality on Linux platforms. To build an example, set the proper compiler environment and run these make commands:

  1. make clean – clean the workspace
  2. make build – create the executable file
  3. make execute – run the executable (on the Intel Xeon Phi coprocessor mic0 by default)

The following example demonstrates conversion of a matrix in CSR format to the internal CSR and ESB representations used by the Intel MKL SpMV Format Prototype Package, followed by the SpMV compute routines.

/*******************************************************************************
// Consider the matrix A (see 'Sparse Storage Formats for Sparse BLAS Level 2
// and Level 3' in the Intel MKL Reference Manual)
//
//                 |   1   -1    0   -3    0 |
//                 |  -2    5    0    0    0 |
//   A    =        |   0    0    4    6    4 |,
//                 |  -4    0    2    7    0 |
//                 |   0    8    0    0   -5 |
//
// The matrix A is represented in a zero-based compressed sparse row storage
// scheme with three arrays (see 'Sparse Matrix Storage Schemes' in the
// Intel MKL Reference Manual) as follows:
//
//        values   = ( 1 -1 -3 -2 5 4 6 4 -4 2 7 8 -5 )
//        columns  = ( 0  1  3  0 1 2 3 4  0 2 3 1  4 )
//        rowIndex = ( 0  3  5  8  11  13 )
//
// The test performs the following operations:
//
//        The code computes A*S = F using sparseDesbmv and sparseDcsrmv,
//        where A is a general sparse matrix and S and F are vectors.
*******************************************************************************/
#include <stdio.h>
#include <assert.h>
#include <math.h>
#include "spmv_interface.h"

#define M   5
#define N   5
#define NNZ 13

int main() {
    //*************************************************************************
    //    Declaration and initialization of parameters for sparse
    //    representation of the matrix A in the compressed sparse row format:
    //*************************************************************************
    int m = M, n = N, nnz = NNZ;
    //*************************************************************************
    //    Sparse representation of the matrix A
    //*************************************************************************
    double csrVal[NNZ]    = {  1.0, -1.0,       -3.0,
                              -2.0,  5.0,
                                           4.0,  6.0,  4.0,
                              -4.0,        2.0,  7.0,
                                     8.0,             -5.0 };
    int    csrColInd[NNZ] = { 0, 1,    3,
                              0, 1,
                                    2, 3, 4,
                              0,    2, 3,
                                 1,       4 };
    int    csrRowPtr[M+1] = { 0, 3, 5, 8, 11, 13 };
    // Matrix descriptor, new API variable
    sparseMatDescr_t    descrA;
    // Internal CSR matrix representation, new API variable
    sparseCSRMatrix_t   csrA;
    // Internal ESB matrix representation, new API variable
    sparseESBMatrix_t   esbA;
    //*************************************************************************
    //    Declaration of local variables:
    //*************************************************************************
    double x[M]  = { 1.0, 5.0, 1.0, 4.0, 1.0 };
    double y[M]  = { 0.0, 0.0, 0.0, 0.0, 0.0 };
    double alpha = 1.0, beta = 0.0;
    // Create matrix descriptor
    sparseCreateMatDescr ( &descrA );
    // Create CSR matrix with static workload balancing algorithm
    sparseCreateCSRMatrix ( &csrA, SPARSE_SCHEDULE_STATIC );
    // Analyze input matrix and create its internal representation in the
    // csrA structure optimized for static workload balancing
    sparseDcsr2csr ( m, n, descrA, csrVal, csrRowPtr, csrColInd, csrA );
    // Compute y = alpha * A * x + beta * y
    sparseDcsrmv ( SPARSE_OPERATION_NON_TRANSPOSE, &alpha, csrA, x, &beta, y );
    // Release internal representation of CSR matrix
    sparseDestroyCSRMatrix ( csrA );
    // Create ESB matrix with static workload balancing algorithm
    sparseCreateESBMatrix ( &esbA, SPARSE_SCHEDULE_STATIC );
    // Analyze input CSR matrix and create its internal ESB representation in
    // the esbA structure optimized for static workload balancing
    sparseDcsr2esb ( m, n, descrA, csrVal, csrRowPtr, csrColInd, esbA );
    // Compute y = alpha * A * x + beta * y
    sparseDesbmv ( SPARSE_OPERATION_NON_TRANSPOSE, &alpha, esbA, x, &beta, y );
    // Release internal representation of ESB matrix
    sparseDestroyESBMatrix ( esbA );
    sparseDestroyMatDescr ( descrA );
    return 0;
}

The performance of the ESB SpMV implementation in our measurements is on average 30% better than the CSR implementation available in Intel MKL 11.1 Update 1, and for some matrices the computation is several times faster, which agrees with the article referenced previously. It should be noted that:

  1. ESB SpMV performance for some matrices is not as good as CSR SpMV;
  2. The workload scheduling algorithm significantly affects performance and should be chosen experimentally for a sparse matrix to achieve the best performance. 

Note: General SpMV routines are used even for the self-adjoint/symmetric matrices in some of the cases measured.

 We are seeking interested parties to evaluate this prototype implementation and provide us with feedback. If you are interested, please send a request to intel.mkl@intel.com to download the Intel MKL SpMV Format Prototype Package.

 Copyright © 2013, Intel Corporation. All rights reserved. Intel, Xeon, and Intel Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries.


Intel® Optimized Technology Preview for High Performance Conjugate Gradient Benchmark


    The Intel® Optimized Technology Preview for High Performance Conjugate Gradient Benchmark (Intel® Optimized Technology Preview for HPCG) provides an early implementation of the HPCG benchmark (https://software.sandia.gov/hpcg) optimized for Intel® Advanced Vector Extensions (Intel® AVX) and Intel® Advanced Vector Extensions 2 (Intel® AVX2) enabled Intel® processors and Intel® Xeon Phi™ coprocessors. The HPCG benchmark is intended to complement the High Performance LINPACK benchmark used in the TOP500 (http://www.top500.org) system ranking by providing a metric that better aligns with a broader set of important cluster applications.

    For more information on this implementation, getting started, performance measurements, and system requirements, please refer to the attachment.

Intel® MKL Automatic Offload enabled functions for Intel Xeon Phi coprocessors


    Intel® MKL now supports the Intel® Xeon Phi™ coprocessor, based on the Intel® Many Integrated Core Architecture (Intel® MIC Architecture), on Linux* and Windows* OS. There are three Intel MKL usage models on the Intel Xeon Phi coprocessor: automatic offload, compiler-assisted offload, and native execution.

    Here is the list of Automatic Offload enabled functions in Intel MKL on Intel® Xeon Phi™ coprocessors:

    BLAS:

    • BLAS Level-3 subroutines - ?SYMM, ?TRMM, ?TRSM, ?GEMM

    LAPACK:

    • LU (?GETRF), Cholesky ((S/D)POTRF), and QR (?GEQRF) factorization functions

    As of the current release, the following numbers give the matrix sizes for which Automatic Offload applies to the functions listed above (these thresholds could change in future Intel MKL releases); a usage sketch follows these thresholds:

    BLAS LEVEL 3

    • GEMM:

      • SGEMM: M, N > 2048, K > 256
      • DGEMM NN, NT: M, N > 1280, K > 256
      • DGEMM TN, TT: M, N > 2048, K > 256
      • C, Z GEMM: M, N > 2048, K > 256
    • TRxM:

      • S, D TRxM: M, N > 512
      • C, Z TRxM: M, N > 512, M % 16 == 0, N % 16 == 0
    • ?SYMM: M, N > 512

    ?GETRF: M, N > 8192

    [S/D/C]POTRF: N >= 6144

    [S/D/C/Z]GEQRF: M = N >= 8192

    SSYEV, SSYEVD, SSYRDB: N >= 9216

    DSYEV, DSYEVD, DSYRDB: N >= 8000
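    As a sketch of how these thresholds come into play, the following hedged example enables Automatic Offload programmatically and then calls a DGEMM that is large enough, per the numbers above, to be offloaded. The same effect can be had without code changes by setting the MKL_MIC_ENABLE=1 environment variable; the sizes are illustrative:

        #include <stdlib.h>
        #include <stdio.h>
        #include <mkl.h>

        int main(void)
        {
            /* DGEMM NN is offloaded when M, N > 1280 and K > 256 (see above) */
            const int m = 2048, n = 2048, k = 512;
            double *a = (double *)calloc((size_t)m * k, sizeof(double));
            double *b = (double *)calloc((size_t)k * n, sizeof(double));
            double *c = (double *)calloc((size_t)m * n, sizeof(double));

            if (mkl_mic_enable() != 0)      /* turn on Automatic Offload      */
                printf("Automatic Offload not available, running on host\n");

            cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                        m, n, k, 1.0, a, k, b, n, 0.0, c, n);

            free(a); free(b); free(c);
            return 0;
        }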

     

    Apart from the AO-enabled functions listed above, the following functions benefit from the AO-enabled BLAS and LAPACK functions.

    In the table below, the AO-enabled functions are listed across the header row and the functions that benefit from them are listed down the first column; an 'x' at an intersection indicates which function benefits from which.

     

     

                    GETRF  GEQRF  POTRF  TRSM   TRMM   SYMM   GEMM
    ?gesv             x                    x
    ?gesvx            x                    x
    ?gesvxx           x                    x
    (ds/zc)gesv       x                    x                    x
    ?sysv                                  x
    (c/z)hesv                              x
    ?gegs                    x
    ?gegv                    x
    (d/s)gejsv               x             x
    ?gels                    x             x
    ?gelsd                   x                                  x
    ?gelss                   x                                  x
    ?geqp3                   x
    ?gesdd                   x                                  x
    ?gesvd                   x                                  x
    ?gges                    x
    ?ggesx                   x
    ?ggev                    x
    ?ggevx                   x
    ?ggqrf                   x
    ?ggrqf                   x
    ?ggglm                   x             x
    ?gglse                   x             x
    ?gelsy                   x             x
    ?gelsx                                 x
    ?pftrf                          x      x
    ?posv                           x      x
    ?posvx                          x      x
    ?posvxx                         x      x
    (ds/zc)posv                     x      x             x
    ?(sy/he)gv                      x      x      x
    ?(sy/he)gvd                     x      x      x
    ?(sy/he)gvx                     x      x      x
    ?(sy/he)gst                                          x
    ?(sy/he)trs2                           x
    ?tfsm                                  x                    x
    ?trtrs                                 x
    ?potrs                                 x
    ?pftri                                        x
    ?pftrs                                 x                    x
    ?tftri                                        x
    ?(s/h)frk                                                   x
    ?geqrt3                                       x             x
    ?lalsd                                                      x
    ?larfb                                        x             x

    Please refer to the White Paper for more information on how to use Intel MKL Automatic Offload.

    Please refer to other articles related to Intel MKL on Intel Xeon Phi at Intel® Math Kernel Library on the Intel® Xeon Phi™ Coprocessor.

Intel® MKL Cluster Support for IA-32 is now Deprecated


    Title: 32-bit support in Intel® Math Kernel Library (Intel® MKL) for Clusters is deprecated in Intel MKL 11.2, and support will be removed starting with the next major release.

    Reason: Intel MKL is aligned with Intel MPI 5.0, which supports only 64-bit. The majority of the customer base has already selected Intel 64 as its main development platform.

    Impact: Customers of Intel MKL for Clusters will have to move to 64-bit, as 32-bit support is now deprecated.

    Advantages:

    • Higher performance from the richer register and instruction set on Intel 64
    • Removal of the 4 GB limitation on input sizes, allowing up to one terabyte (TB) of platform address space

Intel® MKL Support for New Functionality - Schur Complement


    Introduction

    There is a wide range of domains in which there is a need to use a Schur complement matrix or a partial solver corresponding to it. For example, in mathematical statistics the Schur complement matrix is important in computation of the probability density function, and in computational mechanics the Schur complement matrix correlates to media stiffness. Partial solving also plays an important role in linear algebra for efficient preconditioner implementation based on domain decomposition algorithms. The application area is huge and has one point in common: sparse matrices. That is why computation of the Schur complement and partial solving have been implemented as new functionality in the sparse solver of the Intel Math Kernel Library (Intel MKL), based on the Intel MKL PARDISO* solver.

    Product Overview

    Starting from Intel MKL 11.2 Update 1, the Intel MKL PARDISO solver supports additional functionality covered by one main topic: the Schur complement. Using this new functionality, you can obtain the Schur complement of selected rows/columns of the initial matrix, solve a system with the Schur complement matrix using the Intel MKL PARDISO interface, and solve the lower and upper triangular subsystems produced by the matrix factorization with Schur complement calculation. This functionality can be useful for customers who use the Schur complement in other sparse solver packages.

    Let A be a sparse square matrix partitioned as

    $$A = \begin{pmatrix} A_{loc} & B_1 \\ B_2 & C \end{pmatrix},$$

    where $A_{loc}$ and $C$ are square and sparse and $B_1$ and $B_2$ are sparse rectangular matrices. Then we can make the following decomposition of the matrix A, which is formally an LDU decomposition:

    $$A = \begin{pmatrix} I & 0 \\ B_2 A_{loc}^{-1} & I \end{pmatrix} \begin{pmatrix} A_{loc} & 0 \\ 0 & S \end{pmatrix} \begin{pmatrix} I & A_{loc}^{-1} B_1 \\ 0 & I \end{pmatrix}, \qquad S = C - B_2 A_{loc}^{-1} B_1.$$

    The matrix S is the Schur complement.

    To use the Schur complement functionality you need to:

    1. Set iparm(36) to 1 if you want to calculate the Schur complement only, or to 2 if you also want to use the computed factorization of the initial matrix A during the solver step.
    2. Set the columns/rows that specify the Schur complement submatrix (matrix C in the decomposition above).
    3. Provide an array of n² elements to the factorization step of the pardiso routine, where n is the size of the Schur complement matrix; on output the Schur complement is returned via this array (see the sketch after this list).
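    A hedged C sketch of these steps follows. It is our reading of the steps above, not verbatim Intel MKL example code: in C the Fortran iparm(36) is iparm[35], the names N, SCHUR_N, and the CSR arrays a, ia, ja are placeholders, and the argument position through which pardiso returns the n² array is an assumption; please verify the exact parameter usage against the Intel MKL Reference Manual:

        #include <mkl_pardiso.h>
        #include <mkl_types.h>

        #define N       10000              /* hypothetical matrix size        */
        #define SCHUR_N  5000              /* hypothetical Schur block size   */

        extern double  a[];                /* CSR values, assumed to exist    */
        extern MKL_INT ia[], ja[];         /* CSR row/column index arrays     */

        void compute_schur(double *schur)  /* SCHUR_N * SCHUR_N output array  */
        {
            void   *pt[64]    = { 0 };     /* PARDISO internal data           */
            MKL_INT iparm[64] = { 0 };
            MKL_INT perm[N];
            MKL_INT n = N, maxfct = 1, mnum = 1, mtype = 11; /* real unsymmetric */
            MKL_INT nrhs = 0, msglvl = 0, error = 0, phase;
            double  ddum = 0.0;            /* dummy right-hand side           */

            iparm[0]  = 1;                 /* supply values, not defaults     */
            iparm[34] = 1;                 /* zero-based indexing             */
            iparm[35] = 1;                 /* step 1: Schur complement only   */

            /* step 2: mark the rows/columns of the Schur block (here the
               last SCHUR_N ones, as in the experiments described below)     */
            for (MKL_INT i = 0; i < n; i++)
                perm[i] = (i >= n - SCHUR_N) ? 1 : 0;

            /* step 3: analysis + factorization (phase 12); the Schur
               complement is returned through the n*n array passed at the
               solution-vector position (our assumption - see the manual)    */
            phase = 12;
            pardiso(pt, &maxfct, &mnum, &mtype, &phase, &n, a, ia, ja,
                    perm, &nrhs, iparm, &msglvl, &ddum, schur, &error);
        }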

    The charts (not reproduced here) compared Intel MKL PARDISO [1] with MUMPS [2] in terms of the time needed to calculate the Schur complement. All experiments used a computation node with two Intel® Xeon® E5-2697 v3 processors (35M cache, 2.60 GHz) and 64 GB of RAM, KMP_AFFINITY set to "compact", MUMPS version 4.10.0, and Intel MKL 11.2 Update 1. For the Schur complement matrix we chose the last 5000 rows/columns of each matrix. The tested matrices were taken from the University of Florida Sparse Matrix Collection [3].

    [1] Intel® Math Kernel Library

    [2] MUMPS

    [3] T. A. Davis and Y. Hu, The University of Florida Sparse Matrix Collection, ACM Transactions on Mathematical Software, Vol. 38, Issue 1, 2011, pp. 1:1-1:25.


    Intel® Math Kernel Library (Intel® MKL) 11.2 Update 1 for Windows*


    Intel® Math Kernel Library (Intel® MKL) is a highly optimized, extensively threaded, and thread-safe library of mathematical functions for engineering, scientific, and financial applications that require maximum performance. Intel MKL 11.2 Update 1 packages are now ready for download. Intel MKL is available as part of Intel® Parallel Studio XE 2015. Please visit the Intel® Math Kernel Library Product Page.

    Intel® MKL 11.2 Update 1 Bug fixes

    What's New in Intel® MKL 11.2 Update 1:

    Check out the Release Notes for MKL 11.2 Update 1

    Contents

    • File:  w_mkl_11.2.1.148_online.exe

      Online Installer for Windows

    • File: w_mkl_11.2.1.148.exe

      A file containing the complete product installation for Windows (32-bit/64-bit development)


    Intel® Math Kernel Library (Intel® MKL) 11.2 Update 1 for Linux*


    Intel® Math Kernel Library (Intel® MKL) is a highly optimized, extensively threaded, and thread-safe library of mathematical functions for engineering, scientific, and financial applications that require maximum performance. Intel MKL 11.2 Update 1 packages are now ready for download. Intel MKL is available as part of Intel® Parallel Studio XE 2015. Please visit the Intel® Math Kernel Library Product Page.

    Intel® MKL 11.2 Update 1 Bug fixes

    What's New in Intel® MKL 11.2 Update 1:

    Check out the Release Notes for MKL 11.2 Update 1

    Contents

    • File:  l_mkl_online_11.2.1.133.sh

      Online Installer for Linux

    • File: l_mkl_11.2.1.133.tgz

      A file containing the complete product installation for Linux (32-bit/64-bit development)


    Intel® Math Kernel Library (Intel® MKL) 11.2 Update 2 for Windows*


    Intel® Math Kernel Library (Intel® MKL) is a highly optimized, extensively threaded, and thread-safe library of mathematical functions for engineering, scientific, and financial applications that require maximum performance. Intel MKL 11.2 Update 2 packages are now ready for download. Intel MKL is available as part of Intel® Parallel Studio XE 2015. Please visit the Intel® Math Kernel Library Product Page.

    Intel® MKL 11.2 Update 2 Bug fixes

    What's New in Intel® MKL 11.2 Update 2:

    • Improved symmetric eigensolver performance by up to 3x for cases when eigenvectors are not needed.
    • Improved ?GESVD performance by 2-3x for cases when singular vectors are required.
    • Improved ?GETRF performance for Intel AVX2 by up to 14x for non-square matrices.
    • Improved parallel and serial performance of ?HEMM/?SYMM for Intel® Advanced Vector Extensions 2 (Intel® AVX2) for the 64-bit Intel MKL.
    • Improved parallel and serial performance of ?HERK/?SYRK and ?HER2K/?SYR2K for Intel AVX2.
    • Added MKL_DIRECT_CALL support for CBLAS interfaces and ?GEMM3M routines.
    • Deprecations: Intel® MKL cluster support for IA-32 is now deprecated, and support will be removed starting with Intel® MKL 11.3.

    Check out the Release Notes for MKL 11.2 Update 2

    Contents

    • File:  w_mkl_11.2.2.179_online.exe

      Online Installer for Windows

    • File: w_mkl_11.2.2.179.exe

      A file containing the complete product installation for Windows (32-bit/64-bit development)


    Intel® Math Kernel Library (Intel® MKL) 11.2 Update 2 for Linux*


    Intel® Math Kernel Library (Intel® MKL) is a highly optimized, extensively threaded, and thread-safe library of mathematical functions for engineering, scientific, and financial applications that require maximum performance. Intel MKL 11.2 Update 2 packages are now ready for download. Intel MKL is available as part of Intel® Parallel Studio XE 2015. Please visit the Intel® Math Kernel Library Product Page.

    Intel® MKL 11.2 Update 2 Bug fixes

    What's New in Intel® MKL 11.2 Update 2:

    • Improved symmetric eigensolver performance by up to 3x for cases when eigenvectors are not needed.
    • Improved ?GESVD performance by 2-3x for cases when singular vectors are required.
    • Improved ?GETRF performance for Intel AVX2 by up to 14x for non-square matrices.
    • Improved parallel and serial performance of ?HEMM/?SYMM for Intel® Advanced Vector Extensions 2 (Intel® AVX2) for the 64-bit Intel MKL.
    • Improved parallel and serial performance of ?HERK/?SYRK and ?HER2K/?SYR2K for Intel AVX2.
    • Added MKL_DIRECT_CALL support for CBLAS interfaces and ?GEMM3M routines.
    • Deprecations: Intel® MKL cluster support for IA-32 is now deprecated, and support will be removed starting with Intel® MKL 11.3.

    Check out the Release Notes for MKL 11.2 Update 2

    Contents

    • File:  l_mkl_online_11.2.2.164.sh

      Online Installer for Linux

    • File: l_mkl_11.2.2.164.tgz

      A file containing the complete product installation for Linux (32-bit/64-bit development)


    Using Intel® MKL MPI wrapper with the Intel® MKL cluster functions


     

    1. Introduction.

    Intel MKL 11.3 introduced wrapper code for the MPI interface to Intel MKL, which helps users employ the Intel® MKL cluster functions with customized MPI libraries.

    While different MPI libraries are compatible at the application programming interface (API) level, they are often incompatible at the application binary interface (ABI) level, so Intel MKL provides several libraries to support the different MPIs. For example, one should link with libmkl_blacs_lp64.a to use an application with MPICH*, or with libmkl_blacs_openmpi_lp64.a for Open MPI*. If users link the Intel MKL cluster functions against a customized MPI library that is not supported by Intel MKL, the result may be unexpected behavior.

    Note that only the Intel MKL cluster functions (Cluster Sparse Solver, Cluster FFT, ScaLAPACK, BLACS) depend on the MPI implementation; other Intel MKL functions have no such dependency.

    The MPI wrapper code added in Intel MKL 11.3 lets users extend the Intel MKL cluster functions to customized MPI libraries: users can build a custom BLACS library from the wrapper code to extend support of the MKL cluster functions to their MPI library.

     

    2. Using Intel MKL MPI wrapper code

    2.1 Build the Intel MKL wrapper code.

    The MKL MPI adaptor is provided as source code, which can be found in the MKLROOT/interfaces/mklmpi directory. To build a custom BLACS library, run make (or nmake on Windows) from this directory with the corresponding parameters. For example:

       >make sointel64 interface=ilp64 MPICC='mpicc'

    This command builds a custom dynamic MKL BLACS library with the default name libmkl_blacs_custom_ilp64.so and puts it in the default directory ../../lib/intel64. For more information, use "make help" (or "nmake help" on Windows).

    2.2 Linking with custom MKL BLACS library

    To link with the custom MKL BLACS library, you only need to replace the default Intel MKL BLACS library with the custom one (for example, replace mkl_blacs_intelmpi with mkl_blacs_custom).

    Note that when you link with the dynamic library on Windows*, you should still use mkl_blacs_lp64_dll.lib, but before running the application you need to set the MKL_BLACS_MPI environment variable. For example, set it to CUSTOM when the custom MKL BLACS library has the default name (mkl_blacs_custom_lp64.dll for the LP64 interface, mkl_blacs_custom_ilp64.dll for ILP64), or set it to whatever file name you use for the custom MKL BLACS library.

    • Please note that the dynamic library should be located in the Intel MKL redist directory or in the application directory.
    • Besides the MKL_BLACS_MPI environment variable, you can use the mkl_set_mpi function introduced in Intel MKL 11.3 (see the sketch below).
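    A minimal sketch of the mkl_set_mpi route follows; we believe, per the Intel MKL 11.3 documentation, that it takes a vendor constant plus (for MKL_BLACS_CUSTOM) the custom library name, and that it must be called before any cluster function, but please verify the exact signature in your mkl.h:

        #include <mkl.h>

        int main(void)
        {
            /* Select the custom BLACS library before any BLACS/ScaLAPACK/
               Cluster FFT/Cluster Sparse Solver call is made. */
            mkl_set_mpi(MKL_BLACS_CUSTOM, "mkl_blacs_custom_lp64.dll");

            /* ... initialize BLACS and call the cluster functions ... */
            return 0;
        }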

    3. An example:  Using Intel MKL MPI wrapper with Open MPI* at Windows*

    Suppose Open MPI is installed at C:\Program Files (x86)\OpenMPI_v1.6.2-x64 and the %MPIROOT% environment variable is set to the installation path.

    3.1 Build custom MKL BLACS library:

    1. Go to the MKL MPI adaptor directory:
      \>cd %MKLROOT%\interfaces\mklmpi\
    2. Set the environment for Open MPI:
      \>set PATH=%MPIROOT%\bin;%PATH%
      \>set LIB=%MPIROOT%\lib;%LIB%
      \>set INCLUDE=%MPIROOT%\include;%INCLUDE%
    3. Build the static MKL BLACS library for the LP64 interface with the default name mkl_blacs_custom_lp64.lib:

        \>nmake libintel64 MPICC="mpicc -DOMPI_IMPORTS -DOPAL_IMPORTS -DORTE_IMPORTS" INSTALL_DIR=%MKLROOT%\lib\intel64 (**)

    4. Build the dynamic MKL BLACS library for the LP64 interface with the default name mkl_blacs_custom_lp64.dll:
      \>nmake dllintel64 MPICC="mpicc -DOMPI_IMPORTS -DOPAL_IMPORTS -DORTE_IMPORTS" INSTALL_DIR=%MKLROOT%\..\redist\intel64\mkl

    3.2 Using the library with the Intel MKL example:

     

    1. Go to the Parallel Direct Sparse Solver for Clusters example directory:
      \>cd %MKLROOT%\examples\cluster_sparse_solverc\source
       
    2. Set environment for MKL:
      \> %MKLROOT%\bin\mkl\mklvars.bat intel64
       
    3. Build and run the application in static linking mode:
      \>mpicc -DOMPI_IMPORTS -DOPAL_IMPORTS -DORTE_IMPORTS cl_solver_sym_sp_0_based_c.c /link mkl_blacs_custom_lp64.lib mkl_intel_lp64.lib mkl_intel_thread.lib mkl_core.lib libiomp5md.lib

      \>mpiexec -np 4 cl_solver_sym_sp_0_based_c.exe
       
    4. Build the code, set correct MKL_BLACS_MPI environment variable and run the application in dynamic linking mode:
      \>mpicc -DOMPI_IMPORTS -DOPAL_IMPORTS -DORTE_IMPORTS cl_solver_sym_sp_0_based_c.c /link mkl_blacs_lp64_dll.lib mkl_intel_lp64.lib mkl_intel_thread.lib mkl_core.lib libiomp5md.lib

      \>set MKL_BLACS_MPI=CUSTOM (*)

      \>mpiexec -np 4 cl_solver_sym_sp_0_based_c.exe

     

            *) We can also set MKL_BLACS_MPI=mkl_blacs_custom_lp64.dll.

            **) The compiler flags -DOMPI_IMPORTS -DOPAL_IMPORTS -DORTE_IMPORTS were needed because of a peculiarity of our Open MPI* installation and may or may not be needed in each specific case.
     

     

     


    Co-authors: 

    Evarist Fomenko (Intel)

    Intel® Xeon Phi optimizations in Intel MKL


    The following components of Intel® MKL 11.0.1 and higher are tuned for the Intel® Xeon Phi Architecture:

    • Several BLAS routines (levels 1, 2, and 3)
    • Sparse BLAS
    • LAPACK routines
    • Vector Math Library (VML)
    • All Vector Statistical Library (VSL) routines, including random number generators (RNG)
    • Fast Fourier transforms (please refer to the FFT tuning article for Intel Xeon Phi for more details)

    Many other routines that use ?gemm also get performance benefits due to the ?gemm optimizations.

    Please refer to other articles related to Intel MKL on Intel Xeon Phi at Intel® Math Kernel Library on the Intel® Xeon Phi™ Coprocessor.


    Using Intel MKL and Intel TBB in the same application


    Intel MKL 11.3 Beta introduced Intel TBB support.

    Intel MKL 11.3 Beta can increase the performance of applications threaded using Intel TBB. Applications using Intel TBB can benefit from the following Intel MKL functions:

    • BLAS - GEMM, SYMM/HEMM, TRMM, TRSM, SYRK/HERK, SYR2K/HER2K
    • LAPACK - GETRF, GETRS, GESV, POTRS
    • Sparse BLAS - CSRMM, BSRMM
    • Intel MKL Poisson Solver
    • Intel MKL PARDISO

    If such applications call functions not listed above, Intel MKL 11.3 beta executes sequential code. Depending on feedback from customers, future versions of Intel MKL may support Intel TBB in more functions.

    Linking applications to Intel TBB and Intel MKL

    The simplest way to link applications to Intel TBB and Intel MKL is to use the Intel C/C++ Compiler. While Intel MKL supports static and dynamic linking, only a dynamic Intel TBB library is available.

    Under Linux, use the following commands to compile your application app.c and link it to Intel TBB and Intel MKL:

    • Dynamic Intel TBB, dynamic Intel MKL: icc app.c -mkl -tbb
    • Dynamic Intel TBB, static Intel MKL: icc app.c -static -mkl -tbb

    Under Windows, use the following command to compile your application app.c and link it to dynamic Intel TBB and Intel MKL:

    • Dynamic Intel TBB, dynamic Intel MKL: icl.exe app.c -mkl -tbb

    Improving Intel MKL performance with Intel TBB

    Performance of Intel MKL can be improved by telling Intel TBB to ensure thread affinity to processor cores. Use the tbb::affinity_partitioner class to this end.

    To improve the performance of Intel MKL for small input data, you may limit the number of threads allocated by Intel TBB for Intel MKL. Use the tbb::task_scheduler_init class to do so, as in the sketch below.
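    For example, here is a small C++ sketch (the thread count of 4 and the problem size are arbitrary) that caps the TBB worker pool before calling an Intel MKL BLAS routine:

        #include "tbb/task_scheduler_init.h"
        #include "mkl.h"

        int main()
        {
            // Limit TBB workers; MKL's TBB layer then uses at most 4 threads.
            tbb::task_scheduler_init init(4);

            const int n = 32;                  // small input data
            double a[n * n], b[n * n], c[n * n];
            for (int i = 0; i < n * n; i++) { a[i] = 1.0; b[i] = 1.0; c[i] = 0.0; }

            cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                        n, n, n, 1.0, a, n, b, n, 0.0, c, n);
            return 0;
        }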

    For more information on controlling behavior of Intel TBB, see the Intel TBB documentation at https://www.threadingbuildingblocks.org/documentation.

    Chart: LAPACK (LU factorization) performance in applications using Intel TBB and Intel MKL 11.3 Beta.

    For more information about using Intel® MKL with threaded applications, please refer to the Knowledge Base article at the following link:

    https://software.intel.com/en-us/articles/intel-math-kernel-library-intel-mkl-using-intel-mkl-with-threaded-applications

    or to the MKL User's Guide

    * Each call is a single run of a single size over the range from 1000 to 10000 with step 1000. Performance (GFlops) is computed as the cumulative number of floating point operations for all 10 calls divided by the wall clock time from the start of the very first call to the end of the very last call.

    ** In the Intel MKL 11.3 time frame, we are planning to extend the list of functions threaded with TBB to AXPY, (S/D)DOT, and GEMV (in addition to the ones listed above).

     

  • Co-authors: 

    Alexander Kobotov (Intel)
    Evgeny P. (Intel)

    Intel® MKL and Intel® IPP: Choosing a High Performance FFT


    Note: This document applies to Intel® MKL 11.0 or later and Intel® IPP 7.1 or later.

    Objective

    The purpose of this document is to help developers determine which FFT (Intel® MKL or Intel® IPP) is best suited for their application.


    Overview

    Fourier transforms are used in signal processing, image processing, physics, statistics, finance, cryptography, and many other areas. The Discrete Fourier transform (DFT) mathematical operation converts a signal from the time domain to the frequency domain and back.

    DFT processing time can dominate a software application. Using a fast algorithm, the Fast Fourier transform (FFT), reduces the number of arithmetic operations from O(N²) to O(N log₂ N). Intel® MKL and Intel® IPP are highly optimized for Intel® architecture-based multi-core processors using the latest instruction sets, parallelism, and algorithms.

    Read further to decide which FFT is best for your application.

    Below is a brief summary of the Intel® MKL and Intel® IPP libraries. For additional details on these products, visit the Intel® MKL web site and the Intel® IPP web site.


    Table 1: Comparison of Intel® MKL and Intel® IPP Functionality

     

     

    Target Applications
      • Intel MKL: mathematical applications for engineering, scientific, and financial applications
      • Intel IPP: media and communications applications for audio, video, imaging, speech recognition, and signal processing

    Library Structure
      • Intel MKL: linear algebra (BLAS, LAPACK, ScaLAPACK), fast Fourier transforms, vector math, vector statistics, random number generators, convolution and correlation, partial differential equations, optimization solvers, sparse solvers
      • Intel IPP: signal processing; image processing, compression, and color conversion; string processing; cryptography; computer vision; data compression; matrix and vector math; audio coding; speech coding and recognition; video coding

    Linkage Models
      • Intel MKL: static, dynamic, custom dynamic
      • Intel IPP: static, dynamic, custom dynamic

    Operating Systems
      • Intel MKL: Windows*, Linux*, Mac OS X*
      • Intel IPP: Windows, Linux, Mac OS X, QNX*

    Processor Support
      • Intel MKL: IA-32 and Intel® 64 architecture-based and compatible platforms (2*)
      • Intel IPP: IA-32 and Intel® 64 architecture-based and compatible platforms (1*)

    Both of these libraries contain generic code optimized for processors with Intel® Streaming SIMD Extensions (Intel® SSE) and code optimized for processors with the Intel® SSE2, SSE3, SSE4.1, SSE4.2, AVX, and AVX2 instruction sets.

    1* - Intel IPP provides optimized code for the Intel® Atom™ processor.

    2* - Starting with version 11.1, Intel® MKL supports the Intel® Xeon Phi™ coprocessor.


    Intel® MKL and Intel® IPP Fourier Transform Feature

    The Fourier transforms provided by Intel MKL and Intel IPP are targeted, respectively, at the types of applications targeted by MKL (engineering and scientific) and IPP (media and communications). In the table below, we highlight specifics of the MKL and IPP Fourier transforms that will help you decide which may be best for your application.
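    For orientation, here is a minimal Intel MKL DFT example: a 1024-point, single-precision, complex, in-place forward transform through the DFTI API (the size and the zero-filled input are placeholders):

        #include "mkl_dfti.h"

        int main(void)
        {
            MKL_Complex8 x[1024];                  /* in-place input/output  */
            for (int i = 0; i < 1024; i++) { x[i].real = 0.0f; x[i].imag = 0.0f; }

            DFTI_DESCRIPTOR_HANDLE hand = NULL;
            MKL_LONG status;

            status = DftiCreateDescriptor(&hand, DFTI_SINGLE, DFTI_COMPLEX,
                                          1, (MKL_LONG)1024);
            status = DftiCommitDescriptor(hand);   /* pick computation path  */
            status = DftiComputeForward(hand, x);  /* time -> frequency      */
            DftiFreeDescriptor(&hand);
            return (int)status;
        }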

    Table 2: Comparison of Intel® MKL and Intel® IPP DFT Features

     

    API
      • Intel MKL: DFT, Cluster FFT, FFTW 2.x and 3.x interfaces
      • Intel IPP: FFT, DFT

    Interfaces
      • Intel MKL: C and Fortran; LP64 (64-bit long and pointer) and ILP64 (64-bit int, long, and pointer)
      • Intel IPP: C

    Dimensions
      • Intel MKL: 1-D up to 7-D
      • Intel IPP: 1-D (signal processing), 2-D (image processing)

    Transform Sizes
      • Intel MKL: maximum size 2^31-1 on 32-bit platforms, 2^64 on 64-bit platforms
      • Intel IPP: FFT - powers of 2 only; DFT - 2^32 maximum size (*)

    Mixed Radix Support
      • Intel MKL: 2, 3, 5, 7, 11, and 13 kernels (**)
      • Intel IPP: DFT - 2, 3, 5, and 7 kernels (**)

    Data Types (see Table 3 for detail)
      • Intel MKL: real and complex; single and double precision
      • Intel IPP: real and complex; single and double precision

    Scaling
      • Intel MKL: transforms can be scaled by an arbitrary floating point number (with the same precision as the input data)
      • Intel IPP: integer ("fixed") scaling: forward 1/N, inverse 1/N, forward + inverse sqrt(1/N)

    Threading
      • Intel MKL: platform dependent. IA-32: all (except 1D when performing a single transform of non-power-of-two size); Intel® 64: all (except in-place power of two)
      • Intel IPP: 1D and 2D

      

    Data Types and Formats

    The Intel® MKL and Intel® IPP Fourier transform functions support a variety of data types and formats for storing signal values. Mixed-type interfaces are also supported. Please see the product documentation for details.

    Table 3: Comparison of Intel® MKL and Intel® IPP Data Types and Formats

     

    Real FFTs

    Precision
      • Intel MKL: single, double
      • Intel IPP: single, double

    1D Data Types
      • Intel MKL: real for all dimensions
      • Intel IPP: signed short, signed int, float, double

    2D Data Types
      • Intel MKL: real for all dimensions
      • Intel IPP: unsigned char, signed int, float

    1D Packed Formats
      • Intel MKL: CCS, Pack, Perm, CCE
      • Intel IPP: CCS, Pack, Perm

    2D Packed Formats
      • Intel MKL: CCS, Pack, Perm, CCE
      • Intel IPP: RCPack2D

    3D Packed Formats
      • Intel MKL: CCE
      • Intel IPP: n/a

    Format Conversion Functions
      • Intel MKL: n/a
      • Intel IPP: n/a

    Complex FFTs

    Precision
      • Intel MKL: single, double
      • Intel IPP: single, double

    1D Data Types
      • Intel MKL: complex for all dimensions
      • Intel IPP: signed short, complex short, signed int, complex integer, complex float, complex double

    2D Data Types
      • Intel MKL: complex for all dimensions
      • Intel IPP: complex float

    Formats Legend:
    CCE - stores the values of the first half of the output complex conjugate-even signal
    CCS - same format as CCE for 1D; slightly different for multi-dimensional real transforms. CCS, Pack, and Perm are not supported for 3D and higher rank transforms
    Pack - compact representation of a complex conjugate-symmetric sequence
    Perm - same as the Pack format for odd lengths; an arbitrary permutation of the Pack format for even lengths
    RCPack2D - exploits the complex conjugate symmetry of the transformed data to store only half of the resulting Fourier coefficients


    Performance

    Intel® MKL and Intel® IPP are optimized for current and future Intel® processors, and are specifically tuned for two different usage areas:

    • Intel® MKL is suitable for large problem sizes typical to FORTRAN and C/C++ high-performance computing software such as engineering, scientific, and financial applications.
    • Intel® IPP is specifically designed for smaller problem sizes including those used in multimedia, data processing, communications, and embedded C/C++ applications.


    Choosing the Best FFT for Your Application

    Before making a decision, developers must understand the specific requirements and constraints of the application. Developers should consider these questions:

    • What are the performance requirements for the application? How is performance measured, and what are the measurement criteria? Is a specific benchmark being used? What are the known performance bottlenecks?
    • What type of application is being developed? What are the main operations being performed, and on what kind of data?
    • What API is currently being used in the application for transforms? What programming language(s) is the application code written in?
    • Does the FFT output data need to be scaled (normalized)? What type of scaling is required?
    • What kind of input and output data does the transform process? What are the valid and invalid values? What type of precision is required?


     

    Summary

    Intel® MKL and Intel® IPP both provide optimized Fourier transform functions. For more detailed information on the FFT APIs, parameters, and formats, please refer to the product documentation.

    30-day evaluation packages are available for free download from the Intel® Software Trial Download site.

    To see all “no cost” options, visit this page: https://software.intel.com/en-us/free_tools_and_libraries

     


     

    Optimization Notice

    Optimization Notice: Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804
