comp_eof_4d

Authors

Pascal Terray (LOCEAN/IPSL)

Latest revision

08/06/2021

Purpose

Compute an Empirical Orthogonal Function (EOF) analysis , also known as Principal Component Analysis (PCA) from a fourdimensional variable extracted from a NetCDF dataset. The procedure first transforms the input fourdimensional NetCDF variable as a ntime by nv rectangular matrix, X, of observed variables (e.g., the selected cells of the 3-D grid-mesh associated with the fourdimensional NetCDF variable) and then performs an EOF analysis of this rectangular matrix.

The eigenvalues, eigenvectors and principal components time series of the EOF analysis are computed by a full or partial Singular Value Decomposition (SVD) of the rectangular matrix of the observed variables [Bjornsson_Venegas] [Hannachi] [vonStorch_Zwiers]. Both algorithms find square roots of eigenvalues (e.g., singular values of the data matrix X) and associated eigenvectors of the sums of squares and cross-products, covariance or correlation matrix between the observed variables without actually computing this symmetric positive-definite matrix. At the user option, very fast randomized partial SVD algorithms [Halko_etal] [Li_etal] [Martinsson] can be used, which will be highly efficient on huge datasets. See description of the -alg= argument for more details.

Optionally, the eigenvectors and associated principal component time series can be estimated with the help of a metric such that the results are weighted by the surface (or volume) associated with each cell in the grid associated with the input fourdimensional NetCDF variable so that equal areas (or volumes) carry equal weights in the results of the EOF analysis (see the -d= argument description).

An output NetCDF dataset containing singular values, eigenvectors and standardized principal component time series is created. The eigenvectors are repacked as a fourdimensional variable in the output NetCDF dataset.

You should use EOF analysis if you are interested in summarizing data and/or detecting linear relationships between the observed variables. EOF analysis can also be used to reduce the number of variables or the noise in a dataset before a regression, cluster or Maximum Covariance Analysis (MCA). More specifically, the first k principal component time series and eigenvectors give a least-squares solution to the model

X = AB + E

where

  • X is the ntime by nv matrix of observed variables
  • A is the ntime by k matrix of the first k principal component time series
  • B is the k by nv matrix of the first k eigenvectors (stored rowwise)
  • E is an ntime by nv matrix of residuals

and you want to minimize the squared Frobenius norm of E (e.g., the sum of all the squared elements of E).

Refer to comp_invert_eof_4d, if you want to compute such approximation of your dataset.

Refer to comp_svd_3d , comp_reg_3d and comp_reg_4d for more details on MCA and regression procedures available in NCSTAT, respectively.

If the NetCDF variable is tridimensional use comp_eof_3d instead of comp_eof_4d.

This procedure is parallelized if OpenMP is used. Moreover, this procedure may use (randomized) partial SVD algorithms, which are highly efficient on huge datasets if you are interested only in the few leading terms of the SVD of the data matrix X.

Further Details

Usage

$ comp_eof_4d \
  -f=input_netcdf_file \
  -v=netcdf_variable \
  -m=input_mesh_mask_netcdf_file \
  -g=grid_type                         (optional : n, t, u, v, w, f) \
  -r=resolution                        (optional : r2, r4) \
  -b=nlon_orca, nlat_orca, nlevel_orca (optional) \
  -x=lon1,lon2                         (optional) \
  -y=lat1,lat2                         (optional) \
  -z=level1,level2                     (optional) \
  -t=time1,time2                       (optional) \
  -a=type_of_analysis                  (optional : scp, cov, cor) \
  -c=input_climatology_netcdf_file     (optional) \
  -d=type_of_distance                  (optional : dist2, dist3, ident) \
  -alg=algorithm                       (optional : svd, inviter, deflate, rsvd) \
  -n=number_of_eofs                    (optional) \
  -o=output_eof_netcdf_file            (optional) \
  -mi=missing_value                    (optional) \
  -explvar                             (optional) \
  -double                              (optional) \
  -bigfile                             (optional) \
  -hdf5                                (optional) \
  -tlimited                            (optional)

By default

-g=
the grid_type is set to n which means that the 3-D grid-mesh associated with the input NetCDF variable is assumed to be regular or Gaussian
-r=
if the input netcdf_variable is from the NEMO or ORCA model (e.g., if -n= argument is not set to n) the resolution is assumed to be r2
-b=
if -n= is not set to n, the dimensions of the 3-D grid-mesh, nlon_orca, nlat_orca and nlevel_orca are determined from the -r= argument. However, you may override this choice by default with the -b= argument
-x=
the whole longitude domain associated with the netcdf_variable
-y=
the whole latitude domain associated with the netcdf_variable
-z=
the whole vertical resolution associated with the netcdf_variable
-t=
the whole time period associated with the netcdf_variable
-a=
the type_of_analysis is set to scp. This means that the eigenvectors and eigenvalues are computed from the sums of squares and cross-products matrix between the observed variables
-c=
an input_climatology_netcdf_file is not needed if the type_of_analysis is set to scp
-d=
the type_of_distance is set to dist3. This means that distances and scalar products in the EOF analysis are computed with the diagonal metric associated with the 3-D grid-mesh associated with the input NetCDF variable
-alg=
the algorithm option is set to inviter. This means that the EOF model is computed by a partial SVD analysis of the matrix of the observed variables using an inverse iteration algorithm
-n=
number_of_eofs is set to 10 and a 10-component EOF model is stored in the output NetCDF file output_eof_netcdf_file
-o=
the output_eof_netcdf_file is named eof_netcdf_variable.nc
-mi=
the missing_value attribute in the output NetCDF file is set to 1.e+20
-explvar
the -n= option specifies the number of eofs to be computed and stored in the output NetCDF file. If -explvar is activated, the -n= option specifies the minimum value of explained variance by the EOF model for selecting the order (e.g., the number of components) of this EOF model
-double
the results of the EOF analysis are stored as single-precision floating point numbers in the output NetCDF file. If -double is activated, the results are stored as double-precision floating point numbers
-bigfile
a NetCDF classical format file is created. If -bigfile is activated, the output NetCDF file is a 64-bit offset format file
-hdf5
a NetCDF classical format file is created. If -hdf5 is activated, the output NetCDF file is a NetCDF-4/HDF5 format file
-tlimited
the time dimension is defined as unlimited in the output NetCDF file. However, if -tlimited is activated, the time dimension is defined as limited in the output NetCDF file

Remarks

  1. The -v=netcdf_variable argument specifies the NetCDF variable for which an EOF analysis must be computed and the -f=input_netcdf_file argument specifies that this NetCDF variable must be extracted from the NetCDF file, input_netcdf_file.

  2. The argument -m=input_mesh_mask_netcdf_file specifies the land-sea mask to apply to the netcdf_variable for transforming this fourdimensional NetCDF variable as a rectangular matrix before computing the EOF analysis. The scale factors associated with the 3-D grid-mesh of this NetCDF variable (needed if -d=dist2 or dist3 are specified when calling the procedure) are also read from the input_mesh_mask_netcdf_file.

  3. If the -x=lon1,lon2, -y=lat1,lat2 and -z=level1,level2 arguments are missing, the geographical domain used in the EOF analysis is determined from the attributes of the input mesh mask NetCDF variable named grid_typemask (e.g., lon1_Eastern_limit, lon2_Western_limit, lat1_Southern_limit , lat2_Northern_limit, level1_First_level and level2_Last_level ) which is read from the input NetCDF file input_mesh_mask_netcdf_file. If these attributes are missing, the whole geographical domain and vertical resolution associated with the netcdf_variable is used in the EOF analysis.

    The longitude, latitude or level range must be a vector of two integers specifying the first and last selected indices along each dimension. The indices are relative to 1. Negative values are allowed for lon1. In this case the longitude domain is from nlon+lon1+1 to lon2 where nlon is the number of longitude points in the grid associated with the NetCDF variable and it is assumed that the grid is periodic.

    Refer to comp_mask_4d for transforming geographical coordinates as indices or generating an appropriate mesh-mask before using comp_eof_4d.

  4. If the -t=time1,time2 argument is missing the whole time period associated with the netcdf_variable is used to estimate eigenvectors and principal component time series.

    The selected time period is a vector of two integers specifying the first and last time observations. The indices are relative to 1. Note that the output NetCDF file will have ntime = time2 - time1 + 1 time observations.

  5. It is assumed that the data has no missing values excepted those associated with a constant land-sea mask.

  6. If -g= is set to t, u, v, w or f it is assumed that the NetCDF variable is from an experiment with the NEMO or ORCA model. In this case, the duplicate points from the ORCA grid are removed before the EOF analysis, as far as possible, and, in particular, if the 3-D grid-mesh of the input NetCDF variable covers the whole globe. On output, the duplicate points are restored when writing the EOFs (e.g., the eigenvectors), if the geographical domain of the input NetCDF variable is the whole globe.

    If -g= is set to n, it is assumed that the 3-D grid-mesh is regular or Gaussian and as such has no duplicate points.

    The -g= argument is also used to determine the name of the NetCDF variables which contain the mesh-mask and the scale factors in the input_mesh_mask_netcdf_file (e.g., these variables are named grid_typemask, e1grid_type, e2grid_type and e3grid_type, respectively). This input_mesh_mask_netcdf_file may be created by comp_clim_4d if the 3-D grid-mesh is regular or gaussian.

  7. If -g= is set to t, u, v, w or f (e.g., if the NetCDF variable is from an experiment with the NEMO or ORCA model), the -r= argument gives the resolution used. If:

    • -r=r2, the NetCDF variable is from an experiment with the ORCA R2 model
    • -r=r4, the NetCDF variable is from an experiment with the ORCA R4 model.
  8. If the NetCDF variable is from an experiment with the NEMO or ORCA model, but the resolution is not r2 or r4, the dimensions of the ORCA grid must be specified explicitly with the -b= argument.

  9. The -a= argument specifies if the observed variables are centered or standardized with an input climatology (specified with the -c= argument) before the EOF analysis. If:

    • -a=scp, the EOF analysis is done on the raw data
    • -a=cov, the EOF analysis is done on the anomalies
    • -a=cor, the EOF analysis is done on the standardized anomalies.
  10. The input_climatology_netcdf_file specified with the -c= argument is needed only if -a=cov or -a=cor.

  11. If -a=cov or -a=cor, the selected time period must agree with the climatology. This means that the first selected time observation (time1 if the -t= argument is present) must correspond to the first day, month, season of the climatology specified with the -c= argument.

  12. The geographical shapes of the netcdf_variable (in the input_netcdf_file), the mask (input_mesh_mask_netcdf_file), the scale factors (in the input_mesh_mask_netcdf_file), and the climatology (in the input_climatology_netcdf_file) must agree.

  13. The -n=number_of_eofs argument specifies the number of components of the EOF model which must be stored (and also computed if -alg=inviter or -alg=deflate is specified) in the output NetCDF file specified by the -o= argument. See also the -explvar argument.

  14. If the -explvar argument is activated, the number specified with the -n= argument is not the number of requested EOFs, but the minimum of variance that these EOFs must describe. The number of EOFs is then determined by reference with the minimum of explained variance required by the -n= argument. Express this explained variance in percentage (e.g., with an integer from 0 to 100).

    If -explvar is specified, the -n= argument must be less or equal to 100. Note also that, if a randomized partial SVD algorithm is used (e.g., if -alg=rsvd is specified), the number of computed EOFs is limited to 200 even if these 200 EOFs describe less variance than the minimum requested with the -n= argument.

  15. The -d= argument specifies the metric and scalar product used in the EOF analysis. If:

    • -d=dist2, the EOF analysis is done with the diagonal distance associated with the horizontal 2-D grid-mesh (e.g., each grid point is weighted accordingly to the surface associated with it)
    • -d=dist3, the analysis is done with the diagonal distance associated with the whole 3D grid-mesh (e.g., each grid point is weighted accordingly to the volume or weight associated with it)
    • -d=ident, the EOF analysis is done with the identity metric: the Euclidean distance and the usual scalar product are used in the EOF analysis.

    By default, the -d= argument is set to dist3.

  16. The -alg= argument determines how eigenvectors and principal components are computed. If:

    • -alg=svd, a full SVD of the data matrix is computed, even if you ask only for the leading eigenvectors
    • -alg=inviter, a partial SVD of the data matrix is computed by inverse iteration
    • -alg=deflate, a partial SVD of the data matrix is computed by a deflation technique.
    • -alg=rsvd, a partial SVD of the data matrix is computed by a randomized algorithm.

    All algorithms are parallelized if OpenMP is used. The default is -alg=inviter since computing a partial SVD is generally much faster than computing a full SVD. But, -alg=deflate is generally as fast as -alg=inviter. -alg=rsvd is generally much faster than all other options, but the computed eigenvectors and principal components may be less accurate depending on the distribution of the singular values of the input matrix.

  17. The -bigfile argument is allowed only if the NCSTAT software has been compiled with the _USE_NETCDF36 or _USE_NETCDF4 CPP macros (e.g., -D_USE_NETCDF36 or -D_USE_NETCDF4) and linked to the NetCDF 3.6 library or higher. If this argument is specified, the output_netcdf_file will be a 64-bit offset format file instead of a NetCDF classic format file. However, this argument is recognized in the procedure only if the NCSTAT software has been built with the _USE_NETCDF36 or _USE_NETCDF4 CPP macros.

  18. The -hdf5 argument is allowed only if the NCSTAT software has been compiled with the _USE_NETCDF4 CPP macro (e.g., -D_USE_NETCDF4) and linked to the NetCDF 4 library or higher. If this argument is specified, the output_netcdf_file will be a NetCDF-4/HDF5 format file instead of a NetCDF classic format file. However, this argument is recognized in the procedure only if the NCSTAT software has been built with the _USE_NETCDF4 CPP macro.

  19. The -mi=missing_value argument specifies the missing value indicator associated with the netcdf_variables in the output_netcdf_file. If the -mi= argument is not specified missing_value is set to 1.e+20.

  20. The -double argument specifies that the results are stored as double-precision floating point numbers in the output NetCDF file.

    By default, the results are stored as single-precision floating point numbers in the output NetCDF file.

  21. Duplicate parameters are allowed, but this is always the last occurrence of a parameter which will be used for the computations. Moreover, the number of specified parameters must not be greater than the total number of allowed parameters.

  22. For more details on EOF analysis in the climate literature and randomized SVD algorithms, see

    • “A manual for EOF and SVD analyses of climate data”, by Bjornsson, H., and Venegas, S.A., McGill University, CCGCR Report No. 97-1, Montréal, Québec, 52pp, 1997. https://www.jsg.utexas.edu/fu/files/EOFSVD.pdf
    • “A primer for EOF analysis of climate data”, by Hannachi, A., Reading University, Reading, UK, 33pp, 2004. http://www.met.reading.ac.uk/~han/Monitor/eofprimer.pdf
    • “Statistical Analysis in Climate Research”, by von Storch, H., and Zwiers, F.W., Cambridge University press, Cambridge, UK, Chapter 13, 484 pp., 2002. ISBN: 9780521012300
    • “Finding structure with randomness: probabilistic algorithms for constructing approximate matrix decompositions”, By Halko, N., Martinsson, P.G., and Tropp, J.A., SIAM Rev., 53, 217-288, 2011. https://epubs.siam.org/doi/abs/10.1137/090771806
    • “Algorithm 971: An implementation of a randomized algorithm for principal component analysis”, by Li, H.,Linderman, G.C., Szlam, A., Stanton, K.P., Kluger, Y., and Tygert, M., ACM Trans. Math. Softw. 43, 3, Article 28 (January 2017). https://pubmed.ncbi.nlm.nih.gov/28983138
    • “Randomized methods for matrix computations”, by Martinsson, P.G., arXiv.1607.01649, 2019. https://arxiv.org/abs/1607.01649

Outputs

comp_eof_4d creates an output NetCDF file that contains the principal component time series, the eigenvectors and the eigenvalues of the EOF analysis. The number of principal components, eigenvectors and eigenvalues stored in the output NetCDF dataset is determined by the -n=number_of_eofs argument. The output NetCDF dataset contains the following NetCDF variables (in the description below, nlev, nlat and nlon are the length of the dimensions of the input NetCDF variable) :

  1. netcdf_variable_eof(number_of_eofs,nlev,nlat,nlon) : the selected eigenvectors of the sums of squares and cross-products (-a=scp), covariance (-a=cov) or correlation (-a=cor) matrix between the observed variables. The eigenvectors are sorted by descending order of the associated eigenvalues. The eigenvectors are scaled such that they give the scalar products (-a=scp), covariances (-a=cov) or correlations (-a=cor) between the original observed variables and the associated principal component time series.

    The eigenvectors are packed in a fourdimensional variable whose first, second and third dimensions are exactly the same as those associated with the input NetCDF variable netcdf_variable even if you restrict the geographical domain with the -x=, -y= and -z= arguments. However, outside the selected domain, the output NetCDF variable is filled with missing values. If this is a problem, you can use comp_norm_4d for restricting the geographical domain in the input dataset before using comp_eof_4d.

  2. netcdf_variable_pc(ntime,number_of_eofs) : the principal component time series sorted by descending order of the eigenvalues (e.g., the square of the singular values of the input matrix of observed variables). The principal component time series are always standardized to unit variance.

  3. netcdf_variable_sing(number_of_eofs) : the singular values of the input data matrix in decreasing order, up to a constant scaling factor (equals to the square root of 1/ntime). More precisely, these singular values are the square roots of the eigenvalues of the sums of squares and cross-products (-a=scp) or covariance (-a=cov) or correlation (-a=cor) matrix between the observed variables.

    These eigenvalues (e.g., the squares of the singular values) are equal to the raw variance described by the principal component time series.

  4. netcdf_variable_var(number_of_eofs) : the proportion of variance explained by each principal component time series.

Examples

  1. For computing an EOF analysis from the NetCDF file z.1970_2002.apm.nc, which includes a NetCDF variable z, and store the results in a NetCDF file named eof_era40_1m_z850_1979_2001.nc, use the following command (the analysis is done on the centered data and only the level 21, which corresponds to 850 hPa, is considered in the analysis) :

    $ comp_eof_4d \
      -f=z.1970_2002.apm.nc  \
      -v=z \
      -z=21,21 \
      -m=mask_era40_z.nc  \
      -g=n \
      -c=clim_era40_1m_z_1979_2001.nc  \
      -d=dist2 \
      -a=cov \
      -n=10 \
      -o=eof_era40_1m_z850_1979_2001.nc
    
  2. For computing an EOF analysis from the NetCDF file ST7_1m_0101_20012_grid_T_votemper.nc, which includes a NetCDF variable votemper from a numerical simulation with the ORCA R2 model, and store the results in a NetCDF file named eof_votemper.nc, use the following command (the analysis is done on the standardized data and all the depths/levels are included in the analysis) :

    $ comp_eof_4d \
      -f=ST7_1m_0101_20012_grid_T_votemper.nc \
      -v=votemper \
      -c=clim_ST7_1m_0101_20012_grid_T_votemper.nc \
      -a=cor \
      -d=dist3 \
      -m=meshmask.orca2.nc \
      -g=t