comp_eof_3d¶
Authors¶
Pascal Terray (LOCEAN/IPSL)
Latest revision¶
29/05/2024
Purpose¶
Compute an Empirical Orthogonal Function (EOF) analysis, also known as Principal Component Analysis (PCA) from a tridimensional variable extracted from a NetCDF dataset.
The procedure first transforms the input tridimensional NetCDF variable as a ntime by nv rectangular matrix, X, of observed variables (e.g., the selected cells of the 2-D grid-mesh associated with the tridimensional NetCDF variable) and then performs an EOF analysis or PCA of this matrix [Jolliffe] [Jackson] .
The eigenvalues, eigenvectors and standardized Principal Components (PC) time series of the EOF analysis (or PCA) are computed by a full or partial Singular Value Decomposition (SVD) of the rectangular matrix of the observed variables [Jolliffe] [Bjornsson_Venegas] [Hannachi] [vonStorch_Zwiers]. These algorithms find directly square roots of eigenvalues (e.g., singular values of the data matrix X) and associated eigenvectors of the sums of squares and cross-products, covariance or correlation matrix between the observed variables (e.g., right singular vectors of X) without actually computing this symmetric semi-positive-definite matrix. The standardized PC time series, which are the left singular vectors of X are also directly obtained from the (partial) SVD of X.
Note that very fast randomized partial SVD algorithms [Halko_etal] [Li_etal] [Martinsson] can also be used, which will be highly efficient on huge datasets. See description of the -alg= argument for more details on the algorithms available for performing the EOF analysis.
Optionally, the eigenvalues, eigenvectors and associated PC time series can be estimated with the help of a metric such that the results are weighted by the surface (or volume) associated with each cell in the grid-mesh associated with the input tridimensional NetCDF variable so that equal areas (or volumes) carry equal weights in the results of the EOF analysis (see the -d= argument description for more details).
An output NetCDF dataset containing singular values (e.g., squared roots of the eigenvalues, which are equal to the standard-deviations of the corresponding PC time series), percentages of variance explained by each PC time series, eigenvectors scaled by the corresponding singular values and standardized PC time series is created. The scaled eigenvectors are repacked as a tridimensional variable in the output NetCDF dataset.
You should use EOF analysis if you are interested in summarizing data and/or detecting linear relationships between the observed variables. EOF analysis can also be used to reduce the number of variables or the noise in a dataset before a regression, cluster or Maximum Covariance Analysis (MCA). More specifically, the first k principal component time series and eigenvectors give a least-squares solution to the model
X = AB + E
where
- X is the ntime by nv matrix of observed variables
- A is the ntime by k matrix of the first k principal component time series
- B is the k by nv matrix of the first k eigenvectors (stored rowwise)
- E is an ntime by nv matrix of residuals
and you want to minimize the squared Frobenius norm of E (e.g., the sum of all the squared elements of E).
Refer to comp_invert_eof_3d, if you want to compute such approximation of your dataset and to comp_ortho_rot_eof_3d, comp_filt_rot_eof_3d or comp_loess_rot_eof_3d if you want to perform an orthogonal rotation of all or selected of the computed standardized PC time series in order to simplify their interpretations [Jolliffe] [Jackson] [Wills_etal] .
Refer to comp_svd_3d, comp_reg_3d and comp_reg_4d, for more details on MCA and regression procedures available in NCSTAT, respectively.
If your data contains missing values use comp_eof_miss_3d instead of comp_eof_3d to estimate approximate eigenvectors, principal components and, in addition, the missing values in your dataset.
Finally, if the NetCDF variable is fourdimensional use comp_eof_4d instead of comp_eof_3d.
This procedure is parallelized if OpenMP is used. Moreover, as noted above, this procedure may use (randomized) partial SVD algorithms, which are highly efficient on huge datasets if you are interested only in the few leading terms of the SVD of the data matrix X.
Further Details¶
Usage¶
$ comp_eof_3d \
-f=input_netcdf_file \
-v=netcdf_variable \
-m=input_mesh_mask_netcdf_file \
-g=grid_type (optional : n, t, u, v, w, f) \
-r=resolution (optional : r2, r4) \
-b=nlon_orca, nlat_orca (optional) \
-x=lon1,lon2 (optional) \
-y=lat1,lat2 (optional) \
-t=time1,time2 (optional) \
-a=type_of_analysis (optional : scp, cov, cor) \
-c=input_climatology_netcdf_file (optional) \
-d=type_of_distance (optional : dist2, ident) \
-alg=algorithm (optional : svd, inviter, deflate, rsvd) \
-n=number_of_eofs (optional) \
-o=output_eof_netcdf_file (optional) \
-mi=missing_value (optional) \
-ortho (optional) \
-explvar (optional) \
-double (optional) \
-bigfile (optional) \
-hdf5 (optional) \
-tlimited (optional)
By default¶
- -g=
- the grid_type is set to
n
which means that the 2-D grid-mesh associated with the input NetCDF variable is assumed to be regular or Gaussian- -r=
- if the input netcdf_variable is from the NEMO model (e.g., if -g= argument is not set to
n
) the resolution is assumed to ber2
- -b=
- if -g= is not set to
n
, the dimensions of the 2-D grid-mesh, nlon_orca and nlat_orca, are determined from the -r= argument. However, you may override this choice by default with the -b= argument- -x=
- the whole longitude domain associated with the netcdf_variable
- -y=
- the whole latitude domain associated with the netcdf_variable
- -t=
- the whole time period associated with the netcdf_variable
- -a=
- the type_of_analysis is set to
scp
. This means that the eigenvectors and eigenvalues are computed from the sums of squares and cross-products matrix between the observed variables- -c=
- an input_climatology_netcdf_file is not needed if the type_of_analysis is set to
scp
- -d=
- the type_of_distance is set to
dist2
. This means that distances and scalar products in the EOF analysis are computed with the diagonal metric associated with the 2-D grid-mesh associated with the input NetCDF variable- -alg=
- the algorithm option is set to
inviter
. This means that the EOF model is computed by a partial SVD analysis of the matrix of the observed variables using an inverse iteration algorithm- -n=
- number_of_eofs is set to
10
and a 10-component EOF model is stored in the output NetCDF file output_eof_netcdf_file- -o=
- the output_eof_netcdf_file is named
eof_
netcdf_variable.nc
- -mi=
- the missing_value attribute in the output NetCDF file is set to
1.e+20
- -ortho
- the computed EOFs and associated PC time series are not automatically reorthogonalized if a (partial) SVD is computed by the deflation or inverse iteration methods, e.g., if -alg=
inviter
or -alg=deflate
- -explvar
- the -n= option specifies the number of eofs to be computed and stored in the output NetCDF file. If -explvar is activated, the -n= option specifies the minimum value of explained variance by the EOF model for selecting the order (e.g., the number of components) of this EOF model
- -double
- the results of the EOF analysis are stored as single-precision floating point numbers in the output NetCDF file. If -double is activated, the results are stored as double-precision floating point numbers
- -bigfile
- a NetCDF classical format file is created. If -bigfile is activated, the output NetCDF file is a 64-bit offset format file
- -hdf5
- a NetCDF classical format file is created. If -hdf5 is activated, the output NetCDF file is a NetCDF-4/HDF5 format file
- -tlimited
- the time dimension is defined as unlimited in the output NetCDF file. However, if -tlimited is activated, the time dimension is defined as limited in the output NetCDF file
Remarks¶
The -v=netcdf_variable argument specifies the NetCDF variable for which an EOF analysis must be computed and the -f=input_netcdf_file argument specifies that this NetCDF variable must be extracted from the NetCDF file, input_netcdf_file.
The argument -m=input_mesh_mask_netcdf_file specifies the land-sea mask to apply to the netcdf_variable for transforming this tridimensional NetCDF variable as a rectangular matrix before computing the EOF analysis. The scale factors associated with the 2-D grid-mesh of this NetCDF variable (needed if -d=
dist2
is specified when calling the procedure) are also read from the input_mesh_mask_netcdf_file.If the -x=lon1,lon2 and -y=lat1,lat2 arguments are missing, the geographical domain used in the EOF analysis is determined from the attributes of the input mesh mask NetCDF variable named grid_typemask (e.g., lon1_Eastern_limit, lon2_Western_limit, lat1_Southern_limit and lat2_Northern_limit ) which is read from the input NetCDF file input_mesh_mask_netcdf_file. If these attributes are missing, the whole geographical domain associated with the netcdf_variable is used in the EOF analysis.
The longitude or latitude range must be a vector of two integers specifying the first and last selected indices along each dimension. The indices are relative to
1
. Negative values are allowed for lon1. In this case the longitude domain is fromnlon
+lon1+1
to lon2 wherenlon
is the number of longitude points in the grid associated with the NetCDF variable and it is assumed that the grid is periodic.Refer to comp_mask_3d for transforming geographical coordinates as indices or generating an appropriate mesh-mask before using comp_eof_3d.
If the -t=time1,time2 argument is missing the whole time period associated with the netcdf_variable is used to estimate eigenvectors and principal component time series.
The selected time period is a vector of two integers specifying the first and last time observations. The indices are relative to
1
. Note that the output NetCDF file will haventime
= time2 - time1 +1
time observations.It is assumed that the data has no missing values excepted those associated with a constant land-sea mask. If your dataset has missing values, use comp_eof_miss_3d instead of comp_eof_3d.
If -g= is set to
t
,u
,v
,w
orf
it is assumed that the NetCDF variable is from an experiment with the NEMO (ORCA configuration and R2, R4 or R05 resolutions). In this case, the duplicate points from the ORCA grid are removed before the EOF analysis, as far as possible, and, in particular, if the 2-D grid-mesh of the input NetCDF variable covers the whole globe. On output, the duplicate points are restored when writing the EOFs (e.g., the eigenvectors), if the geographical domain of the input NetCDF variable is the whole globe.If -g= is set to
n
, it is assumed that the 2-D grid-mesh is regular or Gaussian and as such has no duplicate points.The -g= argument is also used to determine the name of the NetCDF variables which contain the 2-D mesh-mask and the scale factors in the input_mesh_mask_netcdf_file (e.g., these variables are named grid_type
mask
,e1
grid_type ande2
grid_type, respectively). This input_mesh_mask_netcdf_file may be created by comp_clim_3d if the 2-D grid-mesh is regular or gaussian.If -g= is set to
t
,u
,v
,w
orf
(e.g., if the NetCDF variable is from an experiment with the NEMO model), the -r= argument gives the resolution used. If:
- -r=
r2
, the NetCDF variable is from an experiment with the ORCA R2 configuration- -r=
r4
, the NetCDF variable is from an experiment with the ORCA R4 configuration.If the NetCDF variable is from an experiment with the NEMO model, but the resolution is not
r2
orr4
, the dimensions of the ORCA grid must be specified explicitly with the -b= argument.The -a= argument specifies if the observed variables are centered or standardized with an input climatology (specified with the -c= argument) before the EOF analysis. If:
- -a=
scp
, the EOF analysis is done on the raw data- -a=
cov
, the EOF analysis is done on the anomalies- -a=
cor
, the EOF analysis is done on the standardized anomalies.The input_climatology_netcdf_file specified with the -c= argument is needed only if -a=
cov
or -a=cor
.If -a=
cov
or -a=cor
, the selected time period must agree with the climatology. This means that the first selected time observation (time1 if the -t= argument is present) must correspond to the first day, month, season of the climatology specified with the -c= argument.The geographical shapes of the netcdf_variable (in the input_netcdf_file), the mask (in the input_mesh_mask_netcdf_file), the scale factors (in the input_mesh_mask_netcdf_file), and the climatology (in the input_climatology_netcdf_file) must agree.
The -d=distance argument specifies the metric and scalar product used in the EOF analysis. If:
- -d=
dist2
, the EOF analysis is done with the diagonal distance associated with the horizontal 2-D grid-mesh (e.g., each grid point is weighted accordingly to the surface associated with it)- -d=
ident
, the EOF analysis is done with the identity metric, e.g., the Euclidean distance and the usual scalar product is used in the EOF analysis.By default, the -d= argument is set to
dist2
.The -alg=algorithm argument determines how eigenvalues, eigenvectors and PC time series are computed. If:
- -alg=
svd
, a full SVD of the data matrix is computed, even if you ask only for the leading eigenvectors- -alg=
inviter
, a partial SVD of the data matrix is computed by inverse iteration- -alg=
deflate
, a partial SVD of the data matrix is computed by a deflation technique- -alg=
rsvd
, a partial SVD of the data matrix is computed by a randomized algorithm.All algorithms are parallelized if OpenMP is used. The default is -alg=
inviter
since computing a partial SVD is generally much faster than computing a full SVD. But, -alg=deflate
is generally as fast as -alg=inviter
. Note that -alg=rsvd
is generally much faster than all other options, but the computed eigenvectors and PC time series may be less accurate depending on the shape of the distribution of the singular values of the input matrix.The -n=number_of_eofs argument specifies the number of components of the EOF model which must be stored (and also computed if -alg=
inviter
, -alg=deflate
or -alg=rsvd
is specified) in the output NetCDF file specified by the -o= argument. See also the description of the -explvar argument below.If the -explvar argument is activated, the number specified with the -n= argument is not the number of requested EOFs, but the minimum of variance that these EOFs must describe. The number of EOFs is then determined by reference with the minimum of explained variance required by the -n= argument. Express this explained variance in percentage (e.g., with an integer from
0
to100
).If -explvar is specified, the -n= argument must be less or equal to
100
. Note also that, if a randomized partial SVD algorithm is used (e.g., if -alg=rsvd
is specified), the number of computed EOFs is limited to200
even if these200
EOFs describe less variance than the minimum requested with the -n= argument.If the -ortho argument is used, the computed EOFs and associated PC time series are always reorthogonalized if a (partial) SVD is computed by the deflation or inverse iteration methods. By default, they are only partially orthogonalized if the computed singular values are not well separated. Note that this argument has no effect if -alg=
svd
or -alg=rsvd
.The -mi=missing_value argument specifies the missing value indicator associated with the NetCDF variables in the output_netcdf_file. If the -mi= argument is not specified missing_value is set to
1.e+20
.The -double argument specifies that the results are stored as double-precision floating point numbers in the output NetCDF file.
By default, the results are stored as single-precision floating point numbers in the output NetCDF file.
The -bigfile argument is allowed only if the NCSTAT software has been compiled with the _USE_NETCDF36 or _USE_NETCDF4 CPP macros (e.g.,
-D_USE_NETCDF36
or-D_USE_NETCDF4
) and linked to the NetCDF 3.6 library or higher. If this argument is specified, the output_netcdf_file will be a 64-bit offset format file instead of a NetCDF classic format file. However, this argument is recognized in the procedure only if the NCSTAT software has been built with the _USE_NETCDF36 or _USE_NETCDF4 CPP macros.The -hdf5 argument is allowed only if the NCSTAT software has been compiled with the _USE_NETCDF4 CPP macro (e.g.,
-D_USE_NETCDF4
) and linked to the NetCDF 4 library or higher. If this argument is specified, the output_netcdf_file will be a NetCDF-4/HDF5 format file instead of a NetCDF classic format file. However, this argument is recognized in the procedure only if the NCSTAT software has been built with the _USE_NETCDF4 CPP macro.Duplicate parameters are allowed, but this is always the last occurrence of a parameter which will be used for the computations. Moreover, the number of specified parameters must not be greater than the total number of allowed parameters.
For more details on PCA and EOF analysis in the climate literature and randomized SVD algorithms, see
- “Principal Component Analysis”, by Jolliffe, I.T., Springer-Verlag, New York, USA, 2nd Ed, 2002. ISBN: 978-0-387-22440-4. https://www.springer.com/gp/book/9780387954424
- “A user’s guide to principal components”, by Jackson, J.E., John Wiley and Sons, New York, USA, 592 pp., 2003. ISBN: 978-0-471-47134-9
- “A manual for EOF and SVD analyses of climate data”, by Bjornsson, H., and Venegas, S.A., McGill University, CCGCR Report No. 97-1, Montréal, Québec, 52pp, 1997. https://www.jsg.utexas.edu/fu/files/EOFSVD.pdf
- “A primer for EOF analysis of climate data”, by Hannachi, A., Reading University, Reading, UK, 33pp, 2004. http://www.met.reading.ac.uk/~han/Monitor/eofprimer.pdf
- “Statistical Analysis in Climate Research”, by von Storch, H., and Zwiers, F.W., Cambridge University press, Cambridge, UK, Chapter 13, 484 pp., 2002. ISBN: 9780521012300
- “Finding structure with randomness: probabilistic algorithms for constructing approximate matrix decompositions”, By Halko, N., Martinsson, P.G., and Tropp, J.A., SIAM Rev., 53, 217-288, 2011. https://epubs.siam.org/doi/abs/10.1137/090771806
- “Algorithm 971: An implementation of a randomized algorithm for principal component analysis”, by Li, H.,Linderman, G.C., Szlam, A., Stanton, K.P., Kluger, Y., and Tygert, M., ACM Trans. Math. Softw. 43, 3, Article 28 (January 2017). https://pubmed.ncbi.nlm.nih.gov/28983138
- “Randomized methods for matrix computations”, by Martinsson, P.G., arXiv.1607.01649, 2019. https://arxiv.org/abs/1607.01649
- “Disentangling Global Warming, Multidecadal Variability, and El Nino in Pacific Temperatures”, by Wills, R.C., Schneider, T., Wallace, J.M., Battisti, D.S., and Hartmann, D.L., Geophysical Research Letters, Vol. 45, 2487-2496, 2018. https://doi.org/10.1002/2017GL076327
Outputs¶
comp_eof_3d creates an output NetCDF file that contains the principal component time series, the eigenvectors and the (square roots of) eigenvalues of the EOF analysis. The number of principal components, eigenvectors and eigenvalues stored in the output NetCDF dataset is determined by the -n=number_of_eofs argument. The output NetCDF dataset contains the following NetCDF variables (in the description below, nlat and nlon are the lengths of the spatial dimensions of the input NetCDF variable) :
netcdf_variable_eof
(number_of_eofs,nlat,nlon)
: the selected eigenvectors of the sums of squares and cross-products (-a=scp
), covariance (-a=cov
) or correlation (-a=cor
) matrix between the observed variables. The eigenvectors are sorted by descending order of the associated eigenvalues. The eigenvectors are scaled such that they give the scalar products (-a=scp
), covariances (-a=cov
) or correlations (-a=cor
) between the original observed variables and the associated principal component time series.The eigenvectors are packed in a tridimensional variable whose first and second dimensions are exactly the same as those associated with the input NetCDF variable netcdf_variable even if you restrict the geographical domain with the -x= and -y= arguments. However, outside the selected domain, the output NetCDF variable is filled with missing values. If this is a problem, you can use comp_norm_3d for restricting the geographical domain in the input dataset before using comp_eof_3d.
netcdf_variable_pc
(ntime,number_of_eofs)
: the principal component time series sorted by descending order of the eigenvalues (e.g., the square of the singular values of the input matrix of observed variables).The principal component time series are always standardized to unit variance.
netcdf_variable_sing
(number_of_eofs)
: the singular values of the input data matrix in decreasing order, up to a constant scaling factor (equals to the square root of1/ntime
). More precisely, these singular values are the square roots of the eigenvalues of the sums of squares and cross-products (-a=scp
) or covariance (-a=cov
) or correlation (-a=cor
) matrix between the observed variables.These eigenvalues (e.g., the squares of the singular values) are equal to the raw variance described by the principal component time series.
netcdf_variable_var
(number_of_eofs)
: the proportion of variance explained by each principal component time series.
Examples¶
For computing an EOF analysis from the NetCDF file
HadISST1_2m_197902_200501_sst.nc
, which includes a NetCDF variablesst
, and store the results in a NetCDF file namedeof_HadISST1_2m_197902_200501_sst_oi.nc
, use the following command (the analysis is done on the raw data without removing the annual cycle) :$ comp_eof_3d \ -f=HadISST1_2m_197902_200501_sst.nc \ -v=sst \ -m=mask_HadISST1_sst.nc \ -g=n \ -a=scp \ -n=20 \ -o=eof_HadISST1_2m_197902_200501_sst_oi.ncFor computing an EOF analysis from the NetCDF file
ST7_1m_0101_20012_grid_T_sosstsst.nc
, which includes a NetCDF variablesosstsst
from a numerical simulation with the ORCA R2 model, and store the results in a NetCDF file namedeof_sosstsst.nc
, use the following command (the analysis is done on the data after removing a climatology) :$ comp_eof_3d \ -f=ST7_1m_0101_20012_grid_T_sosstsst.nc \ -v=sosstsst \ -c=clim_ST7_1m_0101_20012_grid_T_sosstsst.nc \ -a=cov \ -m=meshmask.orca2.nc \ -g=t