comp_eof_4d

Authors

Pascal Terray (LOCEAN/IPSL)

Latest revision

05/05/2026

Purpose

Compute an Empirical Orthogonal Function (EOF) analysis, also known as Principal Component Analysis (PCA), of a four-dimensional variable extracted from a NetCDF dataset.

The procedure first transforms the input four-dimensional NetCDF variable into an ntime by nv rectangular matrix, X, of observed variables (i.e., the selected cells of the 3-D grid-mesh associated with the four-dimensional NetCDF variable) and then performs an EOF analysis or PCA of this matrix [Jolliffe] [Jackson].

The eigenvalues, eigenvectors and standardized Principal Component (PC) time series of the EOF analysis (or PCA) are computed by a full or partial Singular Value Decomposition (SVD) of the rectangular matrix of the observed variables [Jolliffe] [Bjornsson_Venegas] [Hannachi] [vonStorch_Zwiers]. These algorithms directly find the square roots of the eigenvalues (i.e., the singular values of the data matrix X) and the associated eigenvectors of the sums of squares and cross-products, covariance or correlation matrix between the observed variables (i.e., the right singular vectors of X), without actually computing this symmetric positive semi-definite matrix. The standardized PC time series, which are the left singular vectors of X up to a constant scaling factor, are also obtained directly from the (partial) SVD of X.

Note that very fast randomized partial SVD algorithms [Halko_etal] [Li_etal] [Martinsson] can also be used, which will be highly efficient on huge datasets. See description of the -alg= argument for more details on the numerical algorithms available for performing the EOF analysis.

Optionally, the eigenvalues, eigenvectors and associated PC time series can be estimated with a metric that weights the results by the surface (or volume) of each cell of the grid-mesh associated with the input four-dimensional NetCDF variable, so that equal areas (or volumes) carry equal weights in the results of the EOF analysis (see the description of the -d= argument for more details).

Optionally, a moving block bootstrap approach [Efron] [Politis] can also be used to obtain confidence intervals for the variances explained by the PC time series and to test the stability of the computed leading EOFs [Jolliffe]. More precisely, statistics (e.g., confidence intervals, means, standard deviations, skewness, …) of explained variances, cosines of the angles between the pairwise observed and bootstrapped EOFs, as well as spectral and (scaled) Frobenius distances between the pairwise observed and bootstrapped leading EOF subspaces are computed and can be used to select the number of EOFs for further analysis. The -nb=, -bl=, -bp=, -bm=, -bt= and -nobias arguments allow the user to determine the exact form of the blockwise bootstrap algorithm which is used. At least one of the -nb=, -bl=, -bp=, -bm= and -bt= arguments must be present if you want to test the stability of the computed leading EOFs by this bootstrap approach and store its results in the output NetCDF dataset.

The notions of distance between two (EOF) subspaces of the same dimension used in the moving block bootstrap approach are based on the unique orthogonal projectors onto these two (EOF) subspaces and on the distance between the two square matrices representing these orthogonal projectors, as measured in some matrix norm [Golub_VanLoan]. Here, we use both the spectral (i.e., the 2-norm) and Frobenius matrix norms to quantify the distances between two (EOF) subspaces of the same dimension [Golub_VanLoan].

The spectral distance, S_dist, between two subspaces of the same dimension is simply equal to the sine of the largest principal angle between these two subspaces [Golub_VanLoan] and is always greater than or equal to 0 and less than or equal to 1. It can further be shown that the spectral distance between two subspaces is equal to 0 if and only if the two subspaces are the same. Conversely, if the spectral distance between two subspaces is equal to 1, then there exists at least one vector belonging to the first subspace which is orthogonal to all vectors belonging to the second subspace, and vice versa.

On the other hand, it can be shown that the Frobenius distance, F_dist, between two subspaces of dimension n is equal to

F_dist = sqrt( 2 * sum( [ sin( PA(:n) ) ]**2 ) )

where the elements of the n-vector PA(:) are the n principal angles between the two subspaces of dimension n. The Frobenius distance is equal to 0 if the two subspaces are the same and never exceeds sqrt( 2 * n ); this maximum is attained only when the two subspaces are orthogonal to each other. Since this upper bound is a function of the dimension n of the two subspaces, it is convenient to define the “scaled” Frobenius distance between two subspaces of dimension n as the ratio between the Frobenius distance and its upper bound. In this way, the “scaled” Frobenius distance is always greater than or equal to 0 and less than or equal to 1, like the spectral distance, and can be used to compare pairs of EOF subspaces of different dimensions in order to determine the most stable EOF subspaces with the help of the moving block bootstrap method.
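The relations between principal angles and the three distances can be checked numerically. The following minimal sketch is only an illustration and not the NCSTAT implementation: the function names are hypothetical, and the two test subspaces of R^4 are constructed so that their principal angles are known in advance (30 and 60 degrees). The Frobenius distance obtained from the angles is compared with the Frobenius norm of the difference between the two orthogonal projectors.

```python
import math

def subspace_distances(principal_angles):
    """Return (S_dist, F_dist, scaled F_dist) from principal angles in radians."""
    n = len(principal_angles)
    sines = [math.sin(angle) for angle in principal_angles]
    s_dist = max(sines)                                   # sine of the largest principal angle
    f_dist = math.sqrt(2.0 * sum(s * s for s in sines))   # sqrt( 2 * sum( sin(PA)**2 ) )
    scaled_f_dist = f_dist / math.sqrt(2.0 * n)           # divide by the upper bound sqrt(2*n)
    return s_dist, f_dist, scaled_f_dist

def projector(basis):
    """Orthogonal projector Q Q^T built from a list of orthonormal basis vectors."""
    dim = len(basis[0])
    return [[sum(q[i] * q[j] for q in basis) for j in range(dim)] for i in range(dim)]

def frobenius_norm_of_difference(p1, p2):
    """Frobenius norm of the matrix p1 - p2."""
    return math.sqrt(sum((x - y) ** 2 for r1, r2 in zip(p1, p2) for x, y in zip(r1, r2)))

# Two 2-D subspaces of R^4 whose principal angles are a and b by construction.
a, b = math.radians(30.0), math.radians(60.0)
basis1 = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
basis2 = [[math.cos(a), 0.0, math.sin(a), 0.0], [0.0, math.cos(b), 0.0, math.sin(b)]]

s_dist, f_dist, scaled_f_dist = subspace_distances([a, b])
f_check = frobenius_norm_of_difference(projector(basis1), projector(basis2))
print(s_dist, f_dist, scaled_f_dist, f_check)
```

In this example f_dist and f_check agree to rounding error, and both S_dist and the scaled Frobenius distance lie between 0 and 1.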

An output NetCDF dataset containing singular values (i.e., the square roots of the eigenvalues, which are equal to the standard deviations of the corresponding PC time series), percentages of variance explained by each PC time series, eigenvectors scaled by the corresponding singular values and standardized PC time series is created. The scaled eigenvectors are repacked as a four-dimensional variable in the output NetCDF dataset. If the bootstrap approach is requested, the output NetCDF dataset will also contain confidence intervals for the variances explained by the PC time series as well as statistics for explained variances, cosines of the angles between the pairwise observed and bootstrapped EOFs and distances between the pairwise observed and bootstrapped leading EOF subspaces.

You should use EOF analysis if you are interested in summarizing data and/or detecting linear relationships between the observed variables [Jolliffe] [Jackson] [vonStorch_Zwiers]. EOF analysis (or PCA) can also be used to reduce the number of variables or the noise in a dataset before a regression, cluster or Maximum Covariance Analysis (MCA). More specifically, the first k standardized PC time series and associated eigenvectors (scaled by the corresponding singular values) give a least-squares solution to the model

X = AB + E

where

X is the ntime by nv matrix of observed variables

A is the ntime by k matrix of the first k (standardized) PC time series

B is the k by nv matrix of the first k (scaled) eigenvectors (stored rowwise)

E is an ntime by nv matrix of residuals

and the criterion minimized is the squared Frobenius norm of E (i.e., the sum of all the squared elements of E).
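The least-squares model X = AB + E can be illustrated with NumPy on a small synthetic matrix. This is a sketch under stated assumptions rather than the NCSTAT code: the data are random, the columns are centered (as with -a=cov), the identity metric is used (as with -d=ident), and all variable names are chosen for the example only.

```python
import numpy as np

rng = np.random.default_rng(0)
ntime, nv, k = 50, 8, 3

# Synthetic ntime by nv data matrix, centered column by column (assumption: -a=cov case).
X = rng.standard_normal((ntime, nv)) @ rng.standard_normal((nv, nv))
X -= X.mean(axis=0)

# Thin SVD of the data matrix: X = U diag(s) Vt.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

sing = s / np.sqrt(ntime)          # singular values scaled by sqrt(1/ntime)
A = np.sqrt(ntime) * U[:, :k]      # first k PC time series, standardized to unit variance
B = sing[:k, None] * Vt[:k, :]     # first k eigenvectors (rowwise), scaled by the singular values

E = X - A @ B                      # residuals of the k-component EOF model
explained = sing**2 / np.sum(sing**2)   # proportions of variance explained by each PC

print(np.sum(E**2), np.sum(s[k:]**2))   # both give the minimized squared Frobenius norm
```

With these conventions the rows of B are also the covariances between the observed variables and the standardized PC time series, since B = A^T X / ntime.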

Refer to comp_invert_eof_4d if you want to compute such an approximation of your dataset, and to comp_ortho_rot_eof_4d, comp_filt_rot_eof_4d or comp_loess_rot_eof_4d if you want to perform an orthogonal rotation of all or some of the computed standardized PC time series in order to simplify their physical interpretation [Jolliffe] [Jackson] [Wills_etal].

Refer to comp_svd_3d, comp_svd2_3d, comp_reg_3d and comp_reg_4d for more details on MCA and regression procedures available in NCSTAT, respectively.

Finally, if the NetCDF variable is three-dimensional, use comp_eof_3d instead of comp_eof_4d.

This procedure is parallelized if OpenMP is used. Moreover, as noted above, this procedure may use (randomized) partial SVD algorithms, which are highly efficient on huge datasets if you are interested only in the few leading terms of the SVD of the data matrix X.

Further Details

Usage

$ comp_eof_4d \
  -f=input_netcdf_file \
  -v=netcdf_variable \
  -m=input_mesh_mask_netcdf_file \
  -g=grid_type                         (optional : n, t, u, v, w, f) \
  -r=resolution                        (optional : r2, r4) \
  -b=nlon_orca, nlat_orca, nlevel_orca (optional) \
  -x=lon1,lon2                         (optional) \
  -y=lat1,lat2                         (optional) \
  -z=level1,level2                     (optional) \
  -t=time1,time2                       (optional) \
  -a=type_of_analysis                  (optional : scp, cov, cor) \
  -c=input_climatology_netcdf_file     (optional) \
  -d=distance                          (optional : dist2, dist3, ident) \
  -alg=algorithm                       (optional : svd, inviter, deflate, rsvd) \
  -n=number_of_eofs                    (optional) \
  -o=output_eof_netcdf_file            (optional) \
  -bm=bootstrap_method                 (optional : normal, percentile) \
  -nb=number_of_shuffles               (optional) \
  -bp=bootstrap_periodicity            (optional) \
  -bl=bootstrap_block_length           (optional) \
  -bt=bootstrap_trend_length           (optional) \
  -pro=critical_probability            (optional : 0. < critical_probability < 1.) \
  -mi=missing_value                    (optional) \
  -ortho                               (optional) \
  -explvar                             (optional) \
  -nobias                              (optional) \
  -double                              (optional) \
  -bigfile                             (optional) \
  -hdf5                                (optional) \
  -cdf5                                (optional) \
  -tlimited                            (optional)

By default

-g=

the grid_type is set to n which means that the 3-D grid-mesh associated with the input NetCDF variable is assumed to be regular or Gaussian

-r=

if the input netcdf_variable is from the NEMO ocean model (i.e., if the -g= argument is not set to n), the resolution is assumed to be r2

-b=

if -g= is not set to n, the dimensions of the 3-D grid-mesh, nlon_orca, nlat_orca and nlevel_orca, are determined from the -r= argument. However, you may override this default choice with the -b= argument

-x=

the whole longitude domain associated with the netcdf_variable

-y=

the whole latitude domain associated with the netcdf_variable

-z=

the whole vertical resolution associated with the netcdf_variable

-t=

the whole time period associated with the netcdf_variable

-a=

the type_of_analysis is set to scp. This means that the eigenvectors and eigenvalues are computed from the sums of squares and cross-products matrix between the observed variables

-c=

an input_climatology_netcdf_file is not needed if the type_of_analysis is set to scp

-d=

the distance is set to dist3. This means that distances and scalar products in the EOF analysis are computed with the diagonal metric associated with the 3-D grid-mesh associated with the input NetCDF variable

-alg=

the algorithm option is set to inviter. This means that the EOF model is computed by a partial SVD analysis of the matrix of the observed variables using an inverse iteration algorithm

-n=

number_of_eofs is set to 10 and a 10-component EOF model is stored in the output NetCDF file output_eof_netcdf_file

-o=

the output_eof_netcdf_file is named eof_netcdf_variable.nc

-bm=

bootstrap_method is set to normal. This means that bootstrap confidence intervals of explained variances by the EOFs are based on asymptotic normality

-nb=

number_of_shuffles is set to 99. This means that 99 bootstrap samples are generated in the moving block bootstrap algorithm for computing the explained variance statistics and their sampling distributions, as well as the statistics related to the stability of the computed EOFs

-bp=

the time series are assumed to be stationary and bootstrap_periodicity is set to 1 in the moving block bootstrap procedure for testing the significance of the singular triplets. This means that the blocks in the bootstrap algorithm are not forced to begin at specific observations. Use this parameter if the time series are cyclostationary; see the remarks below for further details

-bl=

bootstrap_block_length is set to bootstrap_periodicity*2

-bt=

a stationary moving block bootstrap algorithm is used and the bootstrap_trend_length parameter has no effect

-pro=

the critical_probability is set to 0.05, which means that the bootstrap confidence intervals for the explained variance statistics are 95% confidence intervals

-mi=

the missing_value attribute in the output NetCDF file is set to 1.e+20

-ortho

the computed EOFs and associated PC time series are not automatically reorthogonalized if a (partial) SVD is computed by the deflation or inverse iteration methods, i.e., if -alg=inviter or -alg=deflate

-explvar

the -n= option specifies the number of EOFs to be computed and stored in the output NetCDF file. If -explvar is activated, the -n= option specifies the minimum value of explained variance by the EOF model for selecting the order (i.e., the number of components) of this EOF model

-nobias

biased bootstrap confidence intervals for explained variances are computed. However, if -nobias is activated, unbiased bootstrap confidence intervals for explained variances are computed

-double

the results of the EOF analysis are stored as single-precision floating point numbers in the output NetCDF file. If -double is activated, the results are stored as double-precision floating point numbers

-bigfile

a NetCDF classical format file is created. If -bigfile is activated, the output NetCDF file is a 64-bit offset format file

-hdf5

a NetCDF classical format file is created. If -hdf5 is activated, the output NetCDF file is a NetCDF-4/HDF5 format file

-cdf5

a NetCDF classical format file is created. If -cdf5 is activated, the output NetCDF file is a CDF5 format file

-tlimited

the time dimension is defined as unlimited in the output NetCDF file. However, if -tlimited is activated, the time dimension is defined as limited in the output NetCDF file

Remarks

The -v=netcdf_variable argument specifies the NetCDF variable for which an EOF analysis must be computed and the -f=input_netcdf_file argument specifies that this NetCDF variable must be extracted from the NetCDF file, input_netcdf_file.

The argument -m=input_mesh_mask_netcdf_file specifies the land-sea mask to apply to the netcdf_variable for transforming this four-dimensional NetCDF variable into a rectangular matrix before computing the EOF analysis. The scale factors associated with the 3-D grid-mesh of this NetCDF variable (needed if -d=dist2 or -d=dist3 is specified when calling the procedure) are also read from the input_mesh_mask_netcdf_file.

If the -x=lon1,lon2, -y=lat1,lat2 and -z=level1,level2 arguments are missing, the geographical domain used in the EOF analysis is determined from the attributes of the input mesh mask NetCDF variable named grid_typemask (e.g., lon1_Eastern_limit, lon2_Western_limit, lat1_Southern_limit, lat2_Northern_limit, level1_First_level and level2_Last_level ) which is read from the input NetCDF file input_mesh_mask_netcdf_file. If these attributes are missing, the whole geographical domain and vertical resolution associated with the netcdf_variable is used in the EOF analysis.

The longitude, latitude or level range must be a vector of two integers specifying the first and last selected indices along each dimension. The indices are relative to 1. Negative values are allowed for lon1. In this case the longitude domain is from nlon+lon1+1 to lon2 where nlon is the number of longitude points in the grid associated with the NetCDF variable and it is assumed that the grid is periodic.
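The index convention for a negative lon1 can be sketched as follows. The function name is hypothetical, and the way the selection wraps from nlon back to 1 is an interpretation of the periodicity assumption stated above, not a description of the actual NCSTAT code.

```python
def selected_longitude_indices(lon1, lon2, nlon):
    """Return the 1-based longitude indices selected by -x=lon1,lon2."""
    if lon1 < 0:
        # Periodic grid assumed: start at nlon + lon1 + 1, run up to nlon,
        # then continue from 1 to lon2.
        start = nlon + lon1 + 1
        return list(range(start, nlon + 1)) + list(range(1, lon2 + 1))
    # Ordinary case: a contiguous range of indices.
    return list(range(lon1, lon2 + 1))

# Example with a grid of 180 longitude points.
wrapped = selected_longitude_indices(-9, 10, 180)
print(wrapped[0], wrapped[-1], len(wrapped))
```

For nlon = 180 and -x=-9,10, the selection starts at index 180 - 9 + 1 = 172 and ends at index 10.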

Refer to comp_mask_4d for transforming geographical coordinates into indices or generating an appropriate mesh-mask before using comp_eof_4d.

If the -t=time1,time2 argument is missing the whole time period associated with the netcdf_variable is used to estimate eigenvalues, eigenvectors and PC time series.

The selected time period is a vector of two integers specifying the first and last time observations. The indices are relative to 1. Note that the output NetCDF file will have ntime = time2 - time1 + 1 time observations.

It is assumed that the data have no missing values except those associated with a constant land-sea mask.

If -g= is set to t, u, v, w or f it is assumed that the input NetCDF variable is from an experiment with the NEMO model (ORCA configuration and R2, R4 or R05 resolutions).

In this case, the duplicate points from the ORCA grid are removed before the EOF analysis, as far as possible, and, in particular, if the 3-D grid-mesh of the input NetCDF variable covers the whole globe. On output, the duplicate points are restored when writing the EOFs (i.e., the eigenvectors), if the geographical domain of the input NetCDF variable is the whole globe.

If -g= is set to n, it is assumed that the 3-D grid-mesh is regular or Gaussian and as such has no duplicate points.

The -g= argument is also used to determine the names of the NetCDF variables which contain the mesh-mask and the scale factors in the input_mesh_mask_netcdf_file (i.e., these variables are named grid_typemask, e1grid_type, e2grid_type and e3grid_type, respectively). This input_mesh_mask_netcdf_file may be created by comp_clim_4d if the 3-D grid-mesh is regular or Gaussian.

If -g= is set to t, u, v, w or f (i.e., if the input NetCDF variable is from an experiment with the NEMO ocean model), the -r= argument gives the resolution used. If:

-r=r2, the NetCDF variable is from an experiment with the ORCA R2 configuration of the NEMO ocean model

-r=r4, the NetCDF variable is from an experiment with the ORCA R4 configuration of the NEMO ocean model.

If the input NetCDF variable is from an experiment with the NEMO model, but the resolution is not r2 or r4, the dimensions of the ORCA grid must be specified explicitly with the -b= argument.

The -a= argument specifies if the observed variables are centered or standardized with an input climatology (specified with the -c= argument) before the EOF analysis. If:

-a=scp, the EOF analysis is done on the raw data

-a=cov, the EOF analysis is done on the anomalies

-a=cor, the EOF analysis is done on the standardized anomalies.

By default, the -a= argument is set to scp.

The input_climatology_netcdf_file specified with the -c= argument is needed only if -a=cov or -a=cor.

If -a=cov or -a=cor, the selected time period must agree with the climatology. This means that the first selected time observation (time1 if the -t= argument is present) must correspond to the first day, month or season of the climatology specified with the -c= argument.

The geographical shapes of the netcdf_variable (in the input_netcdf_file), the mask (in the input_mesh_mask_netcdf_file), the scale factors (in the input_mesh_mask_netcdf_file), and the climatology (in the input_climatology_netcdf_file) must agree.

The -d=distance argument specifies the metric and scalar product used in the EOF analysis. If:

-d=dist2, the EOF analysis is done with the diagonal distance associated with the horizontal 2-D grid-mesh (i.e., each grid point is weighted according to the surface associated with it)

-d=dist3, the EOF analysis is done with the diagonal distance associated with the whole 3-D grid-mesh (i.e., each grid point is weighted according to the volume or weight associated with it)

-d=ident, the EOF analysis is done with the identity metric, i.e., the Euclidean distance and the usual scalar product are used in the EOF analysis.

By default, the -d= argument is set to dist3.
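As an illustration of the three metrics, the sketch below assumes that the surface of a cell is the product e1*e2 of its horizontal scale factors and that its volume is the product e1*e2*e3. The function names and the absence of any normalization of the weights are assumptions made for the example, not the NCSTAT implementation.

```python
def cell_weight(distance, e1, e2, e3):
    """Weight of one grid cell for the metric selected with -d= (hypothetical helper)."""
    if distance == "dist2":
        return e1 * e2          # surface of the cell (assumed to be e1*e2)
    if distance == "dist3":
        return e1 * e2 * e3     # volume of the cell (assumed to be e1*e2*e3)
    if distance == "ident":
        return 1.0              # identity metric: every cell has the same weight
    raise ValueError(f"unknown distance: {distance}")

def scalar_product(x, y, weights):
    """Scalar product of two fields under a diagonal metric."""
    return sum(w * a * b for w, a, b in zip(weights, x, y))

# Two cells with different sizes: (e1, e2, e3) scale factors of each cell.
cells = [(2.0, 3.0, 10.0), (1.0, 1.0, 50.0)]
field = [1.0, 2.0]
for distance in ("dist2", "dist3", "ident"):
    weights = [cell_weight(distance, *cell) for cell in cells]
    print(distance, scalar_product(field, field, weights))
```

With -d=ident the squared norm of the field is the usual Euclidean one, whereas with dist2 or dist3 larger cells contribute more.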

The -alg=algorithm argument determines how eigenvalues, eigenvectors and PC time series are computed. If:

-alg=svd, a full SVD of the data matrix is computed, even if you ask only for the leading eigenvectors

-alg=inviter, a partial SVD of the data matrix is computed by inverse iteration

-alg=deflate, a partial SVD of the data matrix is computed by a deflation technique

-alg=rsvd, a partial SVD of the data matrix is computed by a randomized algorithm.

All algorithms are parallelized if OpenMP is used. The default is -alg=inviter since computing a partial SVD is generally much faster than computing a full SVD. However, -alg=deflate is generally as fast as -alg=inviter. Note that -alg=rsvd is generally much faster than all other options, but the computed eigenvectors and PC time series may be less accurate depending on the shape of the distribution of the singular values of the input matrix.

The -n=number_of_eofs argument specifies the number of components of the EOF model which must be stored (and also computed if -alg=inviter, -alg=deflate or -alg=rsvd is specified) in the output NetCDF file specified by the -o= argument. See also the description of the -explvar argument below.

If the -explvar argument is activated, the number specified with the -n= argument is not the number of requested EOFs, but the minimum proportion of variance that these EOFs must describe. The number of EOFs is then determined from this minimum of explained variance. Express this explained variance as a percentage (e.g., with an integer from 0 to 100).

If -explvar is specified, the -n= argument must be less than or equal to 100. Note also that, if a randomized partial SVD algorithm is used (i.e., if -alg=rsvd is specified), the number of computed EOFs is limited to 200 even if these 200 EOFs describe less variance than the minimum requested with the -n= argument.
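The selection rule implied by -explvar can be sketched as follows, assuming that the procedure keeps the smallest number of leading EOFs whose cumulative explained variance reaches the requested percentage. The function name and the max_eofs parameter (which mimics the limit of 200 EOFs mentioned for -alg=rsvd) are hypothetical.

```python
from itertools import accumulate

def number_of_eofs_for_variance(explained, minimum_percent, max_eofs=None):
    """Smallest number of leading EOFs whose cumulative explained variance
    (given as proportions between 0 and 1) reaches minimum_percent."""
    limit = len(explained) if max_eofs is None else min(max_eofs, len(explained))
    for count, cumulative in enumerate(accumulate(explained[:limit]), start=1):
        if 100.0 * cumulative >= minimum_percent:
            return count
    # The requested minimum is not reached within the allowed number of EOFs.
    return limit

# Proportions of variance explained by the leading EOFs (illustrative values).
explained = [0.5, 0.25, 0.125, 0.0625]
print(number_of_eofs_for_variance(explained, 80))
```

With these illustrative proportions, two EOFs describe 75% of the variance and three describe 87.5%, so a request of 80% selects three EOFs.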

If the -ortho argument is used, the computed EOFs and associated PC time series are always reorthogonalized if a (partial) SVD is computed by the deflation or inverse iteration methods.

By default, they are only partially orthogonalized if the computed singular values are not well separated. Note that this argument has no effect if -alg=svd or -alg=rsvd.

If any of the -nb=, -bl=, -bp=, -bm= and -bt= arguments is specified, a moving block bootstrap algorithm is used to test the stability of the computed EOFs and to help select the number of significant EOFs for further analysis.

The -bm=bootstrap_method argument specifies how bootstrap confidence intervals for explained variances are computed. If:

-bm=normal, bootstrap confidence intervals of explained variances by the EOFs are based on asymptotic normality [Efron] [Politis]. These standard intervals will be bias-corrected [Politis] if the -nobias argument is also specified.

-bm=percentile, bootstrap confidence intervals of explained variances by the EOFs are based on the percentile or bias-corrected percentile methods [Efron]. The bias-corrected percentile method will be used if the -nobias argument is also specified.

Note that the percentile method requires a larger number of shuffles (specified with the -nb= argument), especially if the critical_probability (specified with the -pro= argument) is small. More precisely, number_of_shuffles*critical_probability*0.5 must be greater than or equal to 1. in the case of the percentile method. If this is not the case, the procedure will exit with an error message. As an illustration, with the default critical_probability (i.e., 0.05), number_of_shuffles must be at least 40 to use the percentile method for estimating confidence intervals for the variances explained by the EOFs.
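The constraint on the number of shuffles can be checked before launching a long computation. The helper names below are hypothetical; the sketch only encodes the inequality number_of_shuffles*critical_probability*0.5 >= 1 and uses exact fractions to avoid floating-point rounding at the boundary.

```python
import math
from fractions import Fraction

def percentile_method_allowed(number_of_shuffles, critical_probability):
    """True if number_of_shuffles * critical_probability * 0.5 >= 1."""
    p = Fraction(str(critical_probability))
    return number_of_shuffles * p * Fraction(1, 2) >= 1

def minimum_number_of_shuffles(critical_probability):
    """Smallest number of shuffles satisfying the percentile-method constraint."""
    p = Fraction(str(critical_probability))
    return math.ceil(2 / p)

print(minimum_number_of_shuffles(0.05))        # default critical probability
print(percentile_method_allowed(99, 0.05))     # default number of shuffles
```

The default of 99 shuffles therefore satisfies the constraint for the default critical probability, whereas a critical probability of 0.01 would require at least 200 shuffles.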

By default, the -bm= argument is set to normal.

The -nb=number_of_shuffles argument specifies the number of shuffles used in the bootstrap algorithm (by default 99).

The -bp=bootstrap_periodicity argument specifies that the index, i, of the first observation of each selected block in the moving block bootstrap algorithm verifies the condition i = 1 + bootstrap_periodicity*j where j is a random positive integer. bootstrap_periodicity must be greater than zero and less than the length of the time series. This parameter is useful if the time series are cyclostationary instead of stationary.

By default, bootstrap_periodicity is set to 1 so that the time series is assumed to be stationary (see the description of the -bt= argument if you want to relax this strong assumption in the bootstrap algorithm).

The -bl=bootstrap_block_length argument specifies the size of the blocks in the moving block bootstrap algorithm. bootstrap_block_length must be greater than or equal to bootstrap_periodicity and less than the length of the time series. By default, bootstrap_block_length is set to bootstrap_periodicity*2. If -a=cov or -a=cor is specified, it is highly recommended to set bootstrap_block_length to a multiple of the periodicity in the data, as this will properly take into account the cyclostationarity of the analyzed time series.
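The role of the periodicity and block length parameters can be illustrated by generating the time indices of one bootstrap sample. This is a minimal sketch with several assumptions that are not taken from the NCSTAT source: the function name is hypothetical, blocks are not allowed to run past the end of the series, j = 0 is permitted so that a block may start at the first observation, and the last block is truncated so that the sample has ntime observations.

```python
import random

def moving_block_indices(ntime, block_length, periodicity=1, rng=None):
    """Return ntime 1-based time indices drawn by a moving block bootstrap."""
    rng = rng or random.Random()
    # Admissible block starts: i = 1 + periodicity * j, with the whole block
    # lying inside the series (assumption).
    starts = list(range(1, ntime - block_length + 2, periodicity))
    indices = []
    while len(indices) < ntime:
        i = rng.choice(starts)                     # random start of the next block
        indices.extend(range(i, i + block_length))  # consecutive observations of the block
    return indices[:ntime]                          # truncate the last block if needed

# Ten years of monthly data, blocks of two years starting in the same calendar month.
sample = moving_block_indices(ntime=120, block_length=24, periodicity=12,
                              rng=random.Random(0))
print(sample[:24])
```

Because every block starts at an index of the form 1 + 12*j and spans whole years, the calendar month of each observation is preserved in the bootstrap sample, which is the point of the periodicity parameter for cyclostationary series.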

The -bt=bootstrap_trend_length argument specifies that a non-stationary bootstrap algorithm must be used and sets the length of the trend smoother, bootstrap_trend_length, in this non-stationary bootstrap algorithm. The non-stationary bootstrap algorithm consists of two steps:

First, trends are extracted from the time series by using a LOESS smoother [Cleveland] [Cleveland_Devlin]. This LOESS smoother is similar to the one performed in comp_trend_4d and the -bt= argument here has the same meaning as the -nt= argument in comp_trend_4d.

Secondly, residuals from the estimated trends are computed and are used in a stationary moving block bootstrap algorithm. The non-stationary bootstrapped time series are finally generated by adding the trend components to the bootstrapped residual time series.

The value of bootstrap_trend_length should be an odd integer greater than or equal to 3. As bootstrap_trend_length increases, the values of the trend component become smoother and the amplitude of the residuals increases, generating more variability in the bootstrapped time series.
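The two steps of the non-stationary bootstrap can be sketched on a single time series. To keep the example self-contained, a centered moving average is used here in place of the LOESS smoother, so this is a simplified stand-in and not the algorithm used by comp_eof_4d; the function names and the block sampling details are also assumptions.

```python
import random

def moving_average_trend(series, window):
    """Centered moving average, used here as a simple stand-in for LOESS."""
    half = window // 2
    n = len(series)
    trend = []
    for t in range(n):
        lo, hi = max(0, t - half), min(n, t + half + 1)   # window shrinks at the ends
        trend.append(sum(series[lo:hi]) / (hi - lo))
    return trend

def nonstationary_bootstrap(series, trend_length, block_length, rng):
    """One bootstrapped series: trend + block-resampled residuals."""
    n = len(series)
    # Step 1: extract the trend and compute the residuals.
    trend = moving_average_trend(series, trend_length)
    residuals = [x - m for x, m in zip(series, trend)]
    # Step 2: stationary moving block bootstrap of the residuals.
    resampled = []
    while len(resampled) < n:
        start = rng.randrange(0, n - block_length + 1)
        resampled.extend(residuals[start:start + block_length])
    # Add the trend back to the bootstrapped residuals.
    return [m + r for m, r in zip(trend, resampled[:n])]

series = [0.1 * t + (-1) ** t for t in range(30)]   # linear trend plus oscillation
boot = nonstationary_bootstrap(series, trend_length=7, block_length=6,
                               rng=random.Random(2))
print(len(boot))
```

Only the residuals are shuffled, so the slow evolution captured by the trend is the same in every bootstrapped series.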

The -pro=critical_probability argument specifies the critical probability which is used to determine the lower and upper bounds of the confidence intervals for the variances explained by the EOFs if the bootstrap method is used. More precisely, lower and upper (1 - critical_probability)*100% bootstrap confidence bounds for explained variances are computed based on asymptotic normality or the percentile method, depending on the value of the -bm=bootstrap_method argument. critical_probability must be greater than 0. and less than 1. The default value is 0.05, meaning that 95% confidence intervals are computed by default.

If -nobias is specified, unbiased bootstrap confidence intervals for explained variances are computed. If the -nobias argument is absent, biased bootstrap confidence intervals are estimated by default.

If -bm=percentile is also specified, the confidence intervals are obtained by the bias-corrected percentile method instead of the percentile method.

The -mi=missing_value argument specifies the missing value indicator associated with the NetCDF variables in the output_eof_netcdf_file.

If the -mi= argument is not specified, missing_value is set to 1.e+20.

The -double argument specifies that the results are stored as double-precision floating point numbers in the output NetCDF file.

By default, the results are stored as single-precision floating point numbers in the output NetCDF file.

The -bigfile argument is allowed only if the NCSTAT software has been compiled with the _USE_NETCDF36, _USE_NETCDF4 or _USE_NETCDF44 CPP macros (e.g., -D_USE_NETCDF36 or -D_USE_NETCDF4 or -D_USE_NETCDF44) and linked to the NetCDF 3.6 library or higher.

If this argument is specified, the output_eof_netcdf_file will be a 64-bit offset format file instead of a NetCDF classic format file. However, this argument is recognized in the procedure only if the NCSTAT software has been built with the _USE_NETCDF36, _USE_NETCDF4 or _USE_NETCDF44 CPP macros.

The -hdf5 argument is allowed only if the NCSTAT software has been compiled with the _USE_NETCDF4 or _USE_NETCDF44 CPP macros (e.g., -D_USE_NETCDF4 or -D_USE_NETCDF44) and linked to the NetCDF 4 library or higher.

If this argument is specified, the output_eof_netcdf_file will be a NetCDF-4/HDF5 format file instead of a NetCDF classic format file. However, this argument is recognized in the procedure only if the NCSTAT software has been built with the _USE_NETCDF4 or _USE_NETCDF44 CPP macros.

The -cdf5 argument is allowed only if the NCSTAT software has been compiled with the _USE_NETCDF44 CPP macro (e.g., -D_USE_NETCDF44) and linked to the NetCDF 4.4 library or higher.

If this argument is specified, the output_eof_netcdf_file will be a CDF5 format file instead of a NetCDF classic format file. However, this argument is recognized in the procedure only if the NCSTAT software has been built with the _USE_NETCDF44 CPP macro.

Duplicate parameters are allowed, but it is always the last occurrence of a parameter that is used for the computations. Moreover, the number of specified parameters must not be greater than the total number of allowed parameters.

For more details on PCA and EOF analysis, the bootstrap approach, distances between subspaces, the LOESS smoother and randomized SVD algorithms, see

“Principal Component Analysis”, by Jolliffe, I.T., Springer-Verlag, New York, USA, 2nd Ed, 2002. ISBN: 978-0-387-22440-4. https://www.springer.com/gp/book/9780387954424

“A user’s guide to principal components”, by Jackson, J.E., John Wiley and Sons, New York, USA, 592 pp., 2003. ISBN: 978-0-471-47134-9

“Matrix Computations”, by Golub, G.H., and Van Loan, C., The Johns Hopkins University Press, Baltimore, MD, 4th Edition, 756 pp., 2013. ISBN: 978-1-4214-0794-4

“A manual for EOF and SVD analyses of climate data”, by Bjornsson, H., and Venegas, S.A., McGill University, CCGCR Report No. 97-1, Montréal, Québec, 52pp, 1997. https://www.jsg.utexas.edu/fu/files/EOFSVD.pdf

“A primer for EOF analysis of climate data”, by Hannachi, A., Reading University, Reading, UK, 33pp, 2004. http://www.met.reading.ac.uk/~han/Monitor/eofprimer.pdf

“Statistical Analysis in Climate Research”, by von Storch, H., and Zwiers, F.W., Cambridge University Press, Cambridge, UK, Chapter 13, 484 pp., 2002. ISBN: 9780521012300

“Nonparametric standard-errors and confidence intervals”, by Efron, B., The Canadian Journal of Statistics, 9, 2, 139-172, (1981). https://doi.org/10.2307/3314608

“Computer-Intensive Methods in Statistical Analysis”, by Politis, D.N., IEEE Signal Processing Magazine, 15, 1, 39-55, (January 1998). https://ieeexplore.ieee.org/document/647042

“Robust Locally Weighted Regression and Smoothing Scatterplots”, by Cleveland, W.S., Journal of the American Statistical Association, Vol. 74, 829-836, 1979. doi: 10.1080/01621459.1979.10481038

“Locally Weighted Regression: An Approach to Regression Analysis by Local Fitting”, by Cleveland, W.S., and Devlin, S.J., Journal of the American Statistical Association, Vol. 83, 596-610, 1988. doi: 10.1080/01621459.1988.10478639

“Finding structure with randomness: probabilistic algorithms for constructing approximate matrix decompositions”, by Halko, N., Martinsson, P.G., and Tropp, J.A., SIAM Rev., 53, 217-288, 2011. https://epubs.siam.org/doi/abs/10.1137/090771806

“Algorithm 971: An implementation of a randomized algorithm for principal component analysis”, by Li, H., Linderman, G.C., Szlam, A., Stanton, K.P., Kluger, Y., and Tygert, M., ACM Trans. Math. Softw. 43, 3, Article 28 (January 2017). https://pubmed.ncbi.nlm.nih.gov/28983138

“Randomized methods for matrix computations”, by Martinsson, P.G., arXiv:1607.01649, 2019. https://arxiv.org/abs/1607.01649

“Disentangling Global Warming, Multidecadal Variability, and El Niño in Pacific Temperatures”, by Wills, R.C., Schneider, T., Wallace, J.M., Battisti, D.S., and Hartmann, D.L., Geophysical Research Letters, Vol. 45, 2487-2496, 2018. https://doi.org/10.1002/2017GL076327

Outputs

comp_eof_4d creates an output NetCDF file that contains the standardized PC time series, the scaled eigenvectors and the (square roots of) eigenvalues of the EOF analysis. The number of PC time series, eigenvectors and eigenvalues stored in the output NetCDF dataset is determined by the -n=number_of_eofs argument. The output NetCDF dataset contains the following NetCDF variables (in the description below, nlev, nlat and nlon are the lengths of the geographical dimensions of the input NetCDF variable) :

netcdf_variable_eof(number_of_eofs,nlev,nlat,nlon) : the selected eigenvectors of the sums of squares and cross-products (-a=scp), covariance (-a=cov) or correlation (-a=cor) matrix between the observed variables. The eigenvectors are sorted by descending order of the associated eigenvalues. The eigenvectors are scaled such that they give the scalar products (-a=scp), covariances (-a=cov) or correlations (-a=cor) between the original observed variables and the associated PC time series.

The eigenvectors are packed in a fourdimensional variable whose first, second and third dimensions are exactly the same as those associated with the input NetCDF variable netcdf_variable, even if you restrict the geographical domain with the -x=, -y= and -z= arguments. Outside the selected domain, the output NetCDF variable is filled with missing values. If this is a problem, you can use comp_norm_4d to restrict the geographical domain in the input dataset before using comp_eof_4d.

netcdf_variable_pc(ntime,number_of_eofs) : the PC time series sorted in descending order of the eigenvalues (i.e., the squares of the singular values of the input matrix of observed variables).

The PC time series are always standardized to unit variance.

netcdf_variable_sing(number_of_eofs) : the singular values of the input data matrix in decreasing order, up to a constant scaling factor (equal to the square root of 1/ntime) and taking into account the effect of the metric used in the EOF analysis. More precisely, these singular values are the square roots of the eigenvalues of the sums of squares and cross-products (-a=scp), covariance (-a=cov) or correlation (-a=cor) matrix between the observed variables.

These eigenvalues (i.e., the squares of the singular values) are equal to the (raw) variance described by the PC time series.

netcdf_variable_var(number_of_eofs) : the proportions of variance explained by each PC time series (given between 0. and 1. with 1. corresponding to 100%).
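
As an illustration only, the relations between these four output variables can be sketched with NumPy on a small synthetic matrix. This is not the comp_eof_4d implementation: the matrix X, the sizes and all variable names below are hypothetical, the data are simply centered (as with -a=cov), and no metric (-d=) or missing-value handling is applied.

```python
import numpy as np

rng = np.random.default_rng(0)
ntime, nv, n_eofs = 60, 8, 3  # hypothetical sizes

# Hypothetical ntime by nv matrix of observed variables, centered in time.
X = rng.standard_normal((ntime, nv)) @ rng.standard_normal((nv, nv))
X -= X.mean(axis=0)

# SVD of the data matrix: X = U @ diag(s) @ Vt, with s in decreasing order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Singular values scaled by sqrt(1/ntime); their squares are the
# eigenvalues of the covariance matrix X'X / ntime.
sing = s[:n_eofs] / np.sqrt(ntime)
eigval = sing**2

# Standardized PC time series: left singular vectors rescaled to unit variance.
pc = U[:, :n_eofs] * np.sqrt(ntime)

# Scaled eigenvectors: each row holds the covariances between the
# observed variables and the corresponding PC time series.
eof = Vt[:n_eofs] * sing[:, None]

# Proportion of the total variance explained by each PC (between 0 and 1).
var = eigval / (s**2 / ntime).sum()
```

In this sketch, `eof` can be checked against `pc.T @ X / ntime`, and `eigval` against the leading eigenvalues of `X.T @ X / ntime`, which is the sense in which the SVD avoids forming the covariance matrix explicitly.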

Furthermore, if one of the -nb=, -bl=, -bp= or -bm= arguments is specified when calling comp_eof_4d, the following NetCDF variables are computed and also stored in the output NetCDF file:

netcdf_variable_var_boot_lower_bound(number_of_eofs) : lower (1 - critical_probability)*100% bootstrap confidence bounds for explained variances. By default, critical_probability is set to 0.05, such that 95% confidence intervals are computed, but the width of the confidence intervals can be changed with the -pro=critical_probability argument. Similarly, by default, biased confidence intervals are computed, but unbiased confidence intervals are estimated if the -nobias= option is used.

netcdf_variable_var_boot_upper_bound(number_of_eofs) : upper (1 - critical_probability)*100% bootstrap confidence bounds for explained variances. By default, critical_probability is set to 0.05, such that 95% confidence intervals are computed, but the width of the confidence intervals can be changed with the -pro=critical_probability argument. Similarly, by default, biased confidence intervals are computed, but unbiased confidence intervals are estimated if the -nobias= option is used.

netcdf_variable_var_boot_mean(number_of_eofs) : the mean of bootstrapped explained variances.

netcdf_variable_var_boot_var(number_of_eofs) : the variance of bootstrapped explained variances.

netcdf_variable_var_boot_std(number_of_eofs) : the standard-deviation of bootstrapped explained variances.

netcdf_variable_var_boot_skew(number_of_eofs) : the skewness of bootstrapped explained variances.

netcdf_variable_var_boot_kurt(number_of_eofs) : the kurtosis of bootstrapped explained variances.

netcdf_variable_var_boot_min(number_of_eofs) : the minimum of bootstrapped explained variances.

netcdf_variable_var_boot_max(number_of_eofs) : the maximum of bootstrapped explained variances.

netcdf_variable_cos_vec_boot_mean(number_of_eofs) : the mean of the cosines of the angles between pairwise observed and bootstrapped EOFs.

netcdf_variable_cos_vec_boot_var(number_of_eofs) : the variance of the cosines of the angles between pairwise observed and bootstrapped EOFs.

netcdf_variable_cos_vec_boot_std(number_of_eofs) : the standard-deviation of the cosines of the angles between pairwise observed and bootstrapped EOFs.

netcdf_variable_cos_vec_boot_skew(number_of_eofs) : the skewness of the cosines of the angles between pairwise observed and bootstrapped EOFs.

netcdf_variable_cos_vec_boot_kurt(number_of_eofs) : the kurtosis of the cosines of the angles between pairwise observed and bootstrapped EOFs.

netcdf_variable_cos_vec_boot_min(number_of_eofs) : the minimum of the cosines of the angles between pairwise observed and bootstrapped EOFs.

netcdf_variable_cos_vec_boot_max(number_of_eofs) : the maximum of the cosines of the angles between pairwise observed and bootstrapped EOFs.

netcdf_variable_spect_dist_subspace_boot_mean(number_of_eofs) : the mean of the spectral distances between pairwise observed and bootstrapped EOF subspaces.

netcdf_variable_spect_dist_subspace_boot_var(number_of_eofs) : the variance of the spectral distances between pairwise observed and bootstrapped EOF subspaces.

netcdf_variable_spect_dist_subspace_boot_std(number_of_eofs) : the standard-deviation of the spectral distances between pairwise observed and bootstrapped EOF subspaces.

netcdf_variable_spect_dist_subspace_boot_skew(number_of_eofs) : the skewness of the spectral distances between pairwise observed and bootstrapped EOF subspaces.

netcdf_variable_spect_dist_subspace_boot_kurt(number_of_eofs) : the kurtosis of the spectral distances between pairwise observed and bootstrapped EOF subspaces.

netcdf_variable_spect_dist_subspace_boot_min(number_of_eofs) : the minimum of the spectral distances between pairwise observed and bootstrapped EOF subspaces.

netcdf_variable_spect_dist_subspace_boot_max(number_of_eofs) : the maximum of the spectral distances between pairwise observed and bootstrapped EOF subspaces.

netcdf_variable_frob_dist_subspace_boot_mean(number_of_eofs) : the mean of the (scaled) Frobenius distances between pairwise observed and bootstrapped EOF subspaces.

netcdf_variable_frob_dist_subspace_boot_var(number_of_eofs) : the variance of the (scaled) Frobenius distances between pairwise observed and bootstrapped EOF subspaces.

netcdf_variable_frob_dist_subspace_boot_std(number_of_eofs) : the standard-deviation of the (scaled) Frobenius distances between pairwise observed and bootstrapped EOF subspaces.

netcdf_variable_frob_dist_subspace_boot_skew(number_of_eofs) : the skewness of the (scaled) Frobenius distances between pairwise observed and bootstrapped EOF subspaces.

netcdf_variable_frob_dist_subspace_boot_kurt(number_of_eofs) : the kurtosis of the (scaled) Frobenius distances between pairwise observed and bootstrapped EOF subspaces.

netcdf_variable_frob_dist_subspace_boot_min(number_of_eofs) : the minimum of the (scaled) Frobenius distances between pairwise observed and bootstrapped EOF subspaces.

netcdf_variable_frob_dist_subspace_boot_max(number_of_eofs) : the maximum of the (scaled) Frobenius distances between pairwise observed and bootstrapped EOF subspaces.
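
The kind of quantities summarized by these bootstrap variables can be sketched as follows. This is a rough illustration under several assumptions that are not taken from comp_eof_4d itself: a simple bootstrap that resamples time steps independently, percentile confidence bounds, subspace j taken as the span of the first j EOFs, the spectral distance taken as the spectral norm of the difference between the two orthogonal projectors, and the Frobenius distance scaled by sqrt(2j) so that it lies between 0 and 1. The helper eof_basis and all variable names are hypothetical.

```python
import numpy as np

def eof_basis(X, k):
    """Return the k leading unit-norm EOFs (as rows) and explained variances."""
    Xc = X - X.mean(axis=0)
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:k], (s**2 / (s**2).sum())[:k]

rng = np.random.default_rng(1)
ntime, nv, k, nboot, crit = 80, 6, 2, 200, 0.05  # hypothetical settings

# Hypothetical data whose columns have decreasing variance.
X = rng.standard_normal((ntime, nv)) * np.array([5.0, 3.0, 1.0, 0.5, 0.5, 0.5])
V_obs, var_obs = eof_basis(X, k)

var_boot = np.empty((nboot, k))
cos_boot = np.empty((nboot, k))
spect = np.empty((nboot, k))
frob = np.empty((nboot, k))
for b in range(nboot):
    # Assumption: resample the time steps with replacement.
    idx = rng.integers(0, ntime, size=ntime)
    V_b, var_boot[b] = eof_basis(X[idx], k)
    # Cosine of the angle between each observed EOF and its bootstrap
    # counterpart (absolute value, since the sign of an EOF is arbitrary).
    cos_boot[b] = np.abs(np.sum(V_obs * V_b, axis=1))
    for j in range(1, k + 1):
        # Difference between the projectors onto the first j EOFs.
        D = V_obs[:j].T @ V_obs[:j] - V_b[:j].T @ V_b[:j]
        spect[b, j - 1] = np.linalg.norm(D, 2)
        frob[b, j - 1] = np.linalg.norm(D, "fro") / np.sqrt(2 * j)

# Percentile bounds (assumption) and a few summary statistics.
lower = np.quantile(var_boot, crit / 2, axis=0)
upper = np.quantile(var_boot, 1 - crit / 2, axis=0)
var_mean, var_std = var_boot.mean(axis=0), var_boot.std(axis=0)
cos_mean, cos_min = cos_boot.mean(axis=0), cos_boot.min(axis=0)
spect_mean, frob_mean = spect.mean(axis=0), frob.mean(axis=0)
```

Mean cosines close to 1 and subspace distances close to 0 indicate that the EOFs are stable under resampling. Skewness and kurtosis of the same arrays are omitted from this sketch.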

Examples¶

To compute an EOF analysis from the NetCDF file z.1970_2002.apm.nc, which includes a NetCDF variable z, and store the results in a NetCDF file named eof_era40_1m_z850_1979_2001.nc, use the following command (the analysis is done on the centered data and only level 21, which corresponds to 850 hPa, is considered in the analysis) :

$ comp_eof_4d \
  -f=z.1970_2002.apm.nc  \
  -v=z \
  -z=21,21 \
  -m=mask_era40_z.nc  \
  -g=n \
  -c=clim_era40_1m_z_1979_2001.nc  \
  -d=dist2 \
  -a=cov \
  -n=10 \
  -o=eof_era40_1m_z850_1979_2001.nc

To compute an EOF analysis from the NetCDF file ST7_1m_0101_20012_grid_T_votemper.nc, which includes a NetCDF variable votemper from a numerical simulation with the ORCA R2 configuration of the NEMO ocean model, and store the results in a NetCDF file named eof_votemper.nc, use the following command (the analysis is done on the standardized data and all the depths/levels are included in the analysis) :

$ comp_eof_4d \
  -f=ST7_1m_0101_20012_grid_T_votemper.nc \
  -v=votemper \
  -c=clim_ST7_1m_0101_20012_grid_T_votemper.nc \
  -a=cor \
  -d=dist3 \
  -m=meshmask.orca2.nc \
  -g=t
