Introduction

Presentation

STATPACK is a Fortran 95/2003 multi-threaded library for solving the most commonly occurring mathematical and statistical problems in the processing of climate model outputs and datasets, and, more generally, in the analysis of huge datasets found in a large variety of scientific fields. It is a freely-available software, and started from version 2, STATPACK is released under the GNU LGPL license. Details about the license can be found at LGPL License.

All the information related to STATPACK can be found at the following web site STATPACK. The distribution tar file of the software is available for download at this site. Other information is provided there, such as installation instructions and contact information. Instructions for installing the software can also be found below, in the chapter Installation.

The distribution tar file of STATPACK contains the Fortran 95/2003 sources of the library and the associated test and example programs. It also contains an ensemble of portable makefiles, which allows fast and automatic compilation of the STATPACK library on most UNIX/Linux systems, AIX and Mac OSX. This ensemble of portable makefiles also defines a friendly and useful environment for compiling (and executing) Fortran 95/2003 programs on a computer.

The routines available in STATPACK currently include (in version 2.3):

  • Numerical linear algebra subroutines for statistical computations (LU, Cholesky, QR, QL and QLP decompositions, linear solvers, least square solvers, full and partial eigenvalue and singular value decompositions, generalized inverses of full or symmetric matrices, determinant of a square matrix, …)
  • Randomized and deterministic (full or partial) QR decomposition with Column Pivoting (QRCP), Complete Orthogonal Decomposition (COD) or QB decomposition of a matrix and associated linear least square solvers
  • State-of-the-art approximate partial eigenvalue and singular value decompositions based on randomized power subspace and block Krylov iterations, or a preliminary QRCP or QLP factorization
  • A large set of very fast randomized or deterministic routines for solving accurately the fixed-rank and fixed-precision problems
  • Randomized and deterministic routines for computing the column Interpolative (ID), the two-sided Interpolative (tsID) and CUR decompositions of a matrix
  • Probability functions and their inverses
  • Out of core statistical univariate and multivariate functions and subroutines
  • Random number or matrix generation and Monte Carlo procedures
  • Fast Fourier Transforms for both real and complex data of general length
  • Time series analysis functions and subroutines
  • Utilities for printing and sorting matrices and vectors
  • Utilities for manipulating strings, dates and times

The emphasis of the software is on fast and robust methods appropriate for problems in which the associated matrices are large and dense, for example, those arising in the statistical analysis of very high resolution climate model outputs for which standard commercial or free softwares currently used in the climate scientific community (e.g. MATLAB, IDL, Grads, Ferret, Ncl, Python, …) may have difficulties.

Note also that prior to building STATPACK library, none other packages are required or must be installed. In other words, STATPACK is a fully portable software and the efficiency of STATPACK on a specific machine does not depend directly on the implementation of other pre-installed softwares such as the BLAS (e.g., Basic Linear Algebra Subprograms; [blas] ), but only on the quality/performance of the Fortran 95/2003 compiler (in particular of the performance of the built-in intrinsic procedures like the Fortran 90 matmul() and dot_product() functions) and its OpenMP [openmp] parallelism features. Optionally, however, the STATPACK library can also benefit from an optimized/multi-threaded BLAS library [blas] for enhanced performance at the user option. Many open source or vendor versions of the BLAS library are currently available depending on your system and can be used in conjunction with STATPACK to enhance its performance [openblas] [atlas] [blis] [mkl] [aocl] [arm] [accelerate]

With OpenMP and BLAS supports activated, the current version of STATPACK delivers maximum efficiency and can compete with state-of-the-art Dense Linear Algebra libraries such as OpenBLAS [openblas], LIBFLAME [libflame], Intel Math Kernel Library [mkl], AMD Optimizing CPU Libraries [aocl], Arm Performance Libraries [arm] or Apple Accelerate framework [accelerate] for many problems, especially those involving the computations of the full or partial Singular Value Decomposition (SVD) of a real matrix or those related to the full or partial EigenValue Decomposition (EVD) of a real symmetric matrix. See the section Parallelism and BLAS for more details on how activating OpenMP and using an optimized BLAS library.

STATPACK has been built successfully on a variety of UNIX systems (including Mac OSX and AIX) and with many different Fortran 95/2003 compilers. It is believed that STATPACK is a portable software. Currently, STATPACK does not use Graphics Processing Units (GPUs), but this possibility may be included in future releases of STATPACK.

Finally, for most users in the climate community, best flexibility and usability are simply achieved with the use of the NCSTAT software [ncstat] in addition to the STATPACK library. The NCSTAT software is a collection of many UNIX stand-alone operators for statistical processing and analysis of huge climate model outputs and datasets stored in the NetCDF format [netcdf]. These stand-alone operators are also written in Fortran 95/2003 using the NetCDF Fortran 90 interface [netcdf-f90] of the NetCDF library [netcdf] for input/output data transfer and the STATPACK software for numerical and parallel computations. More information about NCSTAT can be found at the following web site NCSTAT.

Language

The STATPACK library consists of numerical algorithms written in pure and portable Fortran 95/2003 language constructs without any obsolescent Fortran77 features [Fortran]. STATPACK uses all the new features of the Fortran 95/2003 standard, including:

  • Fortran 90 array data types
  • Symbolic names for the parameterization of the kind parameters for real, complex, integer and logical data
  • Fortran 90 modules
  • Fortran 90 explicit interfaces
  • Overloading of procedures and functions for ease of use (e.g., generic routines)
  • Allocatable and automatic arrays (dynamical storage)
  • Optional arguments to functions and subroutines
  • Assumed-shape arrays for arguments passing

At the user option, new Fortran 2003 constructs and intrinsic modules, like the IEEE_EXCEPTIONS, IEEE_ARITHMETIC and IEEE_FEATURES modules [Fortran], can also be used in STATPACK. See the section Preprocessor cpp macros for more details.

Thus, to use this product you should be familiar with the Fortran 95/2003 language and you must have access to a Fortran 95/2003 compiler to build the STATPACK library. For more information on Fortran 95/2003, see [Fortran] or consult one of the many tutorials available on the Web, for example Fortran tutorial.

While many current standard Fortran packages, like LAPACK or SCALAPACK (e.g., two famous libraries of Fortran routines for solving problems in numerical linear algebra on shared-memory and distributed-memory architectures, respectively), include different versions of the subroutines and functions for different Fortran data types (e.g., real or complex single- and double-precision arithmetic), the systematic use of Fortran 95/2003 parameterized data types in STATPACK allows the user to choose exactly the precision of the version of the STATPACK library he wants [Buckley:1994a] [Buckley:1994b]. See the Fortran program ex1_svd_cmp.F90 for an illustration of the use of Fortran 95/2003 parameterized data types in STATPACK.

Currently, it is possible to build single, double and also quadruple precision versions of STATPACK, as well as specific versions requesting precise and portable precision specifications for real and complex computations included in the software [Buckley:1994a]. The choice for the precision specification for a particular version of the library is done when building the library. See the chapter Installation for more details.

The interface loading features of the Fortran 95/2003 language are also heavily used in STATPACK, an useful feature, which is also missing in standard Fortran packages. As an illustration, the generic solve_lin() function exported by the module Lin_Procedures, which can be used to solve a linear system with one or multiple right hand sides, accepts the following calls:

x(:n)    = solve_lin( mat(:n,:n) , b(:n)    , tol=tol )
x(:n,:m) = solve_lin( mat(:n,:n) , b(:n,:m) , tol=tol )

The advantages of the overloading are obvious since the same interface is used for one or several right hand sides, or for different precisions.

Parallelism and BLAS

STATPACK is a parallel, multi-threaded software based on the OpenMP standard. Therefore, it will run on multi-core or, more generally, shared-memory multi-processor computers. It is also possible to build sequential versions of STATPACK (e.g., if OpenMP compilation is disabled or if an OpenMP-enabled Fortran compiler is not available), even if it is not at all recommended for efficiency reasons. STATPACK does not run on distributed memory (e.g., clusters) parallel computers.

In both the LAPACK [lapack] and ScaLAPACK [scalapack] libraries, the exploitation of parallelism comes from the availability of a parallel BLAS implementation. In the LAPACK case, a number of BLAS libraries can be used to take advantage of multiple processing units on shared-memory systems; for example, the freely distributed ATLAS [atlas], GotoBLAS [gotoblas], OpenBLAS [openblas], BLIS [blis] libraries or other vendor BLAS like Intel Math Kernel Library [mkl], Arm Performance Libraries [arm], AMD Optimizing CPU Libraries [aocl] or Apple Accelerate framework [accelerate] are popular choices. In the ScaLAPACK case, parallelism is exploited by PBLAS [pblas], which is a parallel BLAS implementation that uses the Message Passing Interface (MPI; [mpi]) for communications on a distributed memory system.

In both cases, parallelism is enclosed inside the BLAS routines for most implementations of LAPACK or ScaLAPACK. In a typical multi-core implementation this means that each BLAS routine contains at least one parallel section and for each call to BLAS, a whole set of threads is started and stopped at least with each BLAS call. This thread management overhead is relatively small for level 3 BLAS routines [blas3], but it could be very significant for level 1 and 2 BLAS operations [blas1] [blas2] due to the low computational intensity in these level 1 and 2 BLAS kernels.

On the other hand, STATPACK does not rely directly on the BLAS to obtain high performance in its numerical Dense Linear Algebra subroutines, since STATPACK uses parameterized data types and allows the user to build quadruple-precision version of the library (remember that standard BLAS, LAPACK and ScaLAPACK libraries exist only in single- and double-precision). Instead, relatively good performance is obtained in STATPACK by reformulating old algorithms or developing new algorithms in a way that their implementations can be easily mapped on recent multi-core systems and take advantage of shared-memory parallelization at a level well-above to the BLAS level. To the best of our knowledge, only the LAPACK version included in the Intel Math Kernel Library [mkl] has improved multi-core and OpenMP implementations of some of its Dense Linear Algebra drivers well above the BLAS level, especially those related to the QR factorization, EigenValue Decomposition (EVD) or Singular Value Decomposition (SVD).

In this spirit, most of the computer intensive methods offered by the STATPACK library are parallelized with the OpenMP Application Program Interface (OpenMP API; [openmp]), which is one of the most common techniques for shared-memory parallelization available with modern Fortran 95/2003/2008 compilers. The OpenMP API is a collection of compiler directives, library routines, and environment variables that can be used to specify shared-memory parallelism in C, C++ and Fortran programs. More information about OpenMP can be found at the following web site OpenMP or in the more friendly tutorial available at OpenMP tutorial. The best place to view OpenMP support by a large range of Fortran compilers is OpenMP compilers. Support for at least OpenMP 3 standard is requested for activation of OpenMP parallelism in version 2.3 of STATPACK and most Fortran compilers are currently supporting the OpenMP 3 standard.

As explained above, the STATPACK library has been designed to take advantages of OpenMP shared-memory parallelism at a high level (e.g., well above the BLAS level) in the algorithms offered in the software. In most cases, this means that each STATPACK routine contains only one or two “global” OpenMP parallel regions and the critical parts of the algorithm are implemented with OpenMP synchronization and barrier primitives inside each parallel region. The programming challenge is thus to minimize the number of these artificial synchronization points in each OpenMP parallel region/routine. In this way, STATPACK users can benefit of good speedup in their programs on all (shared-memory) parallel computers for all choices of the parameterized Fortran real/complex data types (and not only on Intel compatible platforms on which Intel Math Kernel Library is the standard choice for Dense Linear Algebra computations in single- or double-precision).

Note, however, that the parallel paradigm used in STATPACK is still based on the classical fork-and-join scheduling model and, thus, differs from the more advanced methods used in the ongoing PLASMA project. PLASMA lays out matrices in small square tiles, such that each tile occupies a continuous memory region, and is based on new algorithms working on tiles [plasma] [Dongarra_etal:2019].

Currently, PLASMA also relies on the BLAS and OpenMP for dynamic, task-based, scheduling [YarKhan_etal:2016] [Dongarra_etal:2019], but offers only a collection of routines for solving linear systems of equations and linear least square problems [Dongarra_etal:2019], which are not sufficient for the goals of STATPACK. Furthermore, at least OpenMP 4.0 is required for compiling recent versions of PLASMA and not all Fortran compilers are currently supporting the full OpenMP 4.0 standard. Additionally, the systematic use of the PLASMA tile format is challenging in many applications despite that PLASMA provides efficient routines to convert from/to the tile format and the conventional column/row order layout of a matrix. This is why STATPACK is not currently using PLASMA for Dense Linear Algebra computations for maximum portability across current Fortran compilers/platforms.

On the other hand, through the use of the OpenMP 3.0 API, it is expected that the STATPACK software will achieve portability to a wide range of platforms with good (parallel) performance and maximum flexibility and usability as STATPACK still uses the conventional column order layout of a matrix, which is the standard in Fortran.

Moreover, optionally for single- and double-precision versions of the library, STATPACK can also benefit from an optimized/multi-threaded BLAS library (e.g., [openblas], [mkl], [blis], …) and includes generic interfaces for several drivers in LAPACK [lapack], if optimized versions of these libraries are available on your computer. Note, however, that STATPACK does not share any code with the BLAS and LAPACK packages and it is a completely independent software. An optimized BLAS library will provide very significant enhanced speedup with STATPACK if the quality of your Fortran 95/2003 compiler is not enough to obtain the best performance on your computer. This is typical for many compilers, for example with the GNU gfortran compiler, but will also restrict the available precisions of STATPACK since the standard BLAS (and also LAPACK) software is only available in single- and double-precisions. Extended- and quadruple-precision versions of the BLAS and LAPACK libraries are available in QPBLAS and XBLAS packages, but they both use a double-double algorithm in which one quadruple precision number is represented by a combination of two double precision numbers [qpblas] [xblas], which makes their use in conjunction with STATPACK very challenging, if not impossible. On the other hand, IEEE 754- compliant quadruple precision (128 bit) floating point operations have already been implemented in modern Fortran, but many hardwares have not been optimized for those operations yet, excepted for Intel x86 processors in which extended double precision is native and provides improved accuracy at full hardware speed. For other processors, quadruple precision (128 bit) calculation speed is currently extremely slow compared to double precision operations. Despite these caveats, STATPACK can and has been compiled in IEEE 754- compliant quadruple precision (128 bit) without BLAS support and is able to provide OpenMP multi-threaded Dense Linear Algebra routines at this IEEE 754- quadruple precision, which can be useful for many applications in which very accurate results are needed.

Most of the modern Fortran 95/2003 compilers (e.g., gfortran, flang, ifort, ifx, pgfortran, nvfortran, xlf2003, nagfor, …) include a simple command line option to the compiler that activates and allows interpretation of all OpenMP directives included in the STATPACK library. If parallel computing is required, it is the responsibility of the user to include the relevant OpenMP command line options to compile the STATPACK library on his (parallel) computer. See the section OpenMP compilation below, on how to do this.

Finally, the number of processors used when executing an OpenMP conforming program using the STATPACK library and, more generally, the behaviour of such programs are determined by setting some OpenMP environment variables (e.g., OMP_NUM_THREADS, OMP_MAX_ACTIVE_LEVELS, OMP_NESTED, OMP_DYNAMIC, OMP_STACKSIZE, …) just before the execution of the program. See the OpenMP documentation available at OpenMP, OpenMP tutorial or the section Parallel Execution for more details and some examples.

Flag Counter