This article describes the new features in the Intel® Math Kernel Library Sparse Matrix Vector Multiply Format Prototype Package (Intel® MKL SpMV Format Prototype Package) for use on the Intel® Xeon Phi™ coprocessor. The package includes a new two-stage API for select SpMV operations as well as support for the ELLPACK Sparse Block (ESB) format.
Introduction
Sparse Matrix Vector Multiply (SpMV) is an important operation in many scientific applications, and its performance can be a critical part of overall application performance. SpMV is typically a bandwidth-limited operation, and its performance often depends on the sparsity structure of the matrix. This means that to optimize SpMV fully, we need to choose computational kernels and balancing algorithms that take the structure of the sparse input matrices into account.
The improved memory bandwidth of the Intel Xeon Phi coprocessors helps accelerate SpMV operations. Intel MKL 11.0 and later provides highly tuned SpMV kernels for the compressed sparse row (CSR) format on the Intel Xeon Phi coprocessor. While experiments show that the performance of Intel MKL CSR SpMV is close to optimal in many cases, for certain matrices additional performance improvements are possible. For example, we found that balancing work across the many cores (hereafter "workload balancing") improved CSR SpMV performance on Intel Xeon Phi coprocessors more than tuning the computational kernels did.
For sparse matrices, especially those with non-uniform structures, workload balancing helps improve the performance of SpMV on many-core architectures. It is important to note, however, that determining a suitable workload balance is itself time-consuming; if applied inappropriately, for example to a single SpMV call, it can degrade overall performance.
For repeated SpMV calls on matrices with the same structure, it is often advantageous to split the computation into stages. If we first analyze the matrix, an appropriate computational kernel and workload balancing algorithm can be selected; the results of this analysis stage can then be used to boost the performance of the SpMV calls that follow. This approach pays off as long as the total time for the analysis stage plus the optimized SpMV calls is less than the time for the same number of generic SpMV calls. After the last SpMV call, the data and structures created during the analysis stage should be released.
The current Intel MKL Sparse BLAS has Fortran-style interfaces (based on the NIST* interfaces) organized around single-step function calls: each function takes many parameters describing the input matrix and performs its computation in one step, with no assumptions made about the sparsity structure or storage details. Given these limitations, there is no obvious way for the current Intel MKL Sparse BLAS to retain matrix analysis information between function calls without significantly impacting performance. So in this new package, we extend the Intel MKL Sparse BLAS interfaces to use a staged approach that
- analyzes the matrix structure and selects the optimal computational kernels for a given sparse matrix
- provides user-controlled options for kernels or workload balancing algorithm selection
for a limited set of functions and also introduce a new sparse matrix format suitable for the Intel Xeon Phi coprocessor.
Description of the Intel MKL SpMV Format Prototype Package
The Intel MKL SpMV Format Prototype Package supports only general, non-transposed SpMV functionality on Intel Xeon Phi coprocessors for native and offload execution. A sparse matrix in this implementation is stored in a structure (handle). This approach allows us to investigate the input matrix only once, at the stage of creating the internal matrix representation, and to retain the results of the investigation for subsequent calls.
The Intel MKL SpMV Format Prototype Package supports two sparse formats: ELLPACK Sparse Block (ESB) (see http://dl.acm.org/citation.cfm?id=2465013 for details) and Compressed Sparse Row (CSR).
Let us briefly describe the ESB format, in which a sparse matrix is stored in slices; each slice consists of 8 rows and is stored in ELLPACK format. This means that the width of each slice equals the maximum number of non-zeros in any of its rows; shorter rows are padded with zeros to fill out the dense array. This format was specifically tuned for the Intel Xeon Phi coprocessor: each slice is stored in memory column-wise, so it can be processed column by column with SIMD instructions, packing 8 double precision elements into one register. In addition, a bit mask is stored for each column, with zero bits marking padded elements; this allows the efficient use of masked vector operations.
The Intel MKL SpMV Format Prototype Package operates on an internal matrix representation in CSR or ESB format. For both formats, the internal representation is created from an external CSR matrix. Additionally, a workload balancing algorithm can be chosen. Three algorithms are supported: static scheduling, dynamic scheduling, and (for the CSR format only) blocked scheduling. With static and dynamic scheduling, the input matrix is divided into many small chunks (around 2000), which are then assigned to threads either statically or dynamically at run time. With blocked scheduling, each thread processes one block of the input matrix, and all blocks hold roughly equal numbers of non-zeros.
Note: The sparse input matrix is actually duplicated in the internal structure.
Examples
Example of ELLPACK format (for a single 8-row slice).
Suppose the original sparse matrix is:
11 | 12 | 0 | 0 | 0 | 16 | 0 | 0 | 0 |
21 | 0 | 0 | 0 | 25 | 0 | 0 | 0 | 29 |
0 | 0 | 33 | 0 | 0 | 0 | 0 | 0 | 0 |
0 | 0 | 43 | 44 | 0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 55 | 0 | 57 | 0 | 59 |
61 | 0 | 0 | 0 | 0 | 66 | 0 | 0 | 0 |
0 | 72 | 0 | 0 | 0 | 0 | 0 | 78 | 0 |
0 | 0 | 0 | 84 | 0 | 0 | 87 | 0 | 0 |
Then sparse ELLPACK format for Intel Xeon Phi looks like:
Val:
11 | 12 | 16
21 | 25 | 29
33 |  0 |  0
43 | 44 |  0
55 | 57 | 59
61 | 66 |  0
72 | 78 |  0
84 | 87 |  0

Cols:
1 | 2 | 6
1 | 5 | 9
3 | * | *
3 | 4 | *
5 | 7 | 9
1 | 6 | *
2 | 8 | *
4 | 7 | *

Bit mask:
1 | 1 | 1
1 | 1 | 1
1 | 0 | 0
1 | 1 | 0
1 | 1 | 1
1 | 1 | 0
1 | 1 | 0
1 | 1 | 0
The spmv_new.c file, located in the __release_lnx/examples folder, demonstrates the implemented functionality on Linux platforms. To build and run the example, set up the proper compiler environment and run these make commands:
- make clean – clean the workspace
- make build – create the executable file
- make execute – run the executable on the Intel Xeon Phi coprocessor (mic0 by default)
The following example demonstrates conversion of a matrix in CSR format to the internal CSR and ESB representations used by the Intel MKL SpMV Format Prototype Package followed by SpMV compute routines.
/*
********************************************************************************
  Consider the matrix A (see 'Sparse Storage Formats for Sparse BLAS Level 2
  and Level 3' in the Intel MKL Reference Manual)

        |  1  -1   0  -3   0 |
        | -2   5   0   0   0 |
    A = |  0   0   4   6   4 |,
        | -4   0   2   7   0 |
        |  0   8   0   0  -5 |

  The matrix A is represented in a zero-based compressed sparse row storage
  scheme with three arrays (see 'Sparse Matrix Storage Schemes' in the
  Intel MKL Reference Manual) as follows:

        values   = ( 1 -1 -3 -2  5  4  6  4 -4  2  7  8 -5 )
        columns  = ( 0  1  3  0  1  2  3  4  0  2  3  1  4 )
        rowIndex = ( 0  3  5  8 11 13 )

  The test performs the following operations:

  The code computes A*S = F using sparseDesbmv and sparseDcsrmv,
  where A is a general sparse matrix and S and F are vectors.
********************************************************************************
*/
#include <stdio.h>
#include <assert.h>
#include <math.h>
#include "spmv_interface.h"

#define M   5
#define N   5
#define NNZ 13

int main()
{
    /*************************************************************************
     * Declaration and initialization of parameters for sparse
     * representation of the matrix A in the compressed sparse row format:
     *************************************************************************/
    int m = M, n = N, nnz = NNZ;

    /*************************************************************************
     * Sparse representation of the matrix A
     *************************************************************************/
    double csrVal[NNZ]    = { 1.0, -1.0, -3.0, -2.0, 5.0, 4.0, 6.0, 4.0,
                              -4.0, 2.0, 7.0, 8.0, -5.0 };
    int    csrColInd[NNZ] = { 0, 1, 3, 0, 1, 2, 3, 4, 0, 2, 3, 1, 4 };
    int    csrRowPtr[M+1] = { 0, 3, 5, 8, 11, 13 };

    // Matrix descriptor, new API variable
    sparseMatDescr_t descrA;
    // Internal CSR matrix representation, new API variable
    sparseCSRMatrix_t csrA;
    // Internal ESB matrix representation, new API variable
    sparseESBMatrix_t esbA;

    /*************************************************************************
     * Declaration of local variables:
     *************************************************************************/
    double x[M] = { 1.0, 5.0, 1.0, 4.0, 1.0 };
    double y[M] = { 0.0, 0.0, 0.0, 0.0, 0.0 };
    double alpha = 1.0, beta = 0.0;
    int i;

    // Create matrix descriptor
    sparseCreateMatDescr ( &descrA );

    // Create CSR matrix with static workload balancing algorithm
    sparseCreateCSRMatrix ( &csrA, SPARSE_SCHEDULE_STATIC );

    // Analyze input matrix and create its internal representation in the
    // csrA structure optimized for static workload balancing
    sparseDcsr2csr ( m, n, descrA, csrVal, csrRowPtr, csrColInd, csrA );

    // Compute y = alpha * A * x + beta * y
    sparseDcsrmv ( SPARSE_OPERATION_NON_TRANSPOSE, &alpha, csrA, x, &beta, y );

    // Release internal representation of CSR matrix
    sparseDestroyCSRMatrix ( csrA );

    // Create ESB matrix with static workload balancing algorithm
    sparseCreateESBMatrix ( &esbA, SPARSE_SCHEDULE_STATIC );

    // Analyze input CSR matrix and create its internal ESB representation in
    // the esbA structure optimized for static workload balancing
    sparseDcsr2esb ( m, n, descrA, csrVal, csrRowPtr, csrColInd, esbA );

    // Compute y = alpha * A * x + beta * y
    sparseDesbmv ( SPARSE_OPERATION_NON_TRANSPOSE, &alpha, esbA, x, &beta, y );

    // Release internal representation of ESB matrix
    sparseDestroyESBMatrix ( esbA );
    sparseDestroyMatDescr ( descrA );

    return 0;
}
The performance results (see chart below) show that the SpMV implementation for the ESB format is on average 30% faster than the implementation available in Intel MKL 11.1 Update 1, and for some matrices the computation is several times faster, which agrees with the article referenced previously. It should be noted that:
- ESB SpMV performance for some matrices is not as good as CSR SpMV;
- The workload scheduling algorithm significantly affects performance and should be chosen experimentally for a sparse matrix to achieve the best performance.
Note: General SpMV routines are used even though some of the matrices in the cases below are symmetric (self-adjoint).
We are seeking interested parties to evaluate this prototype implementation and provide us with feedback. If you are interested, please send a request to intel.mkl@intel.com to download the Intel MKL SpMV Format Prototype Package.
Copyright © 2013, Intel Corporation. All rights reserved. Intel, Xeon, and Intel Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries.