comp_eof_3d¶
Authors¶
Pascal Terray (LOCEAN/IPSL)
Latest revision¶
15/03/2021
Purpose¶
Compute an Empirical Orthogonal Function (EOF) analysis, also known as Principal Component Analysis (PCA) from a tridimensional variable extracted from a NetCDF dataset. The procedure first transforms the input tridimensional NetCDF variable as a ntime by nv rectangular matrix, X, of observed variables (e.g. the selected cells of the 2-D grid-mesh associated with the tridimensional NetCDF variable) and then performs an EOF analysis of this rectangular matrix.
The eigenvalues, eigenvectors and principal components time series of the EOF analysis are computed by a full or partial Singular Value Decomposition (SVD) of the rectangular matrix of the observed variables [Bjornsson_Venegas] [Hannachi] [vonStorch_Zwiers]. Both algorithms find square roots of eigenvalues (e.g. singular values of the data matrix X) and associated eigenvectors of the sums of squares and cross-products, covariance or correlation matrix between the observed variables without actually computing this symmetric matrix.
An output NetCDF dataset containing singular values, eigenvectors and standardized principal component time series is created. The eigenvectors are repacked as a tridimensional variable in the output NetCDF dataset.
You should use EOF analysis if you are interested in summarizing data and/or detecting linear relationships between the observed variables. EOF analysis can also be used to reduce the number of variables or the noise in a dataset before a regression, cluster or Maximum Covariance Analysis (MCA). More specifically, the first k principal component time series and eigenvectors give a least-squares solution to the model
X = AB + E
where
- X is the ntime by nv matrix of observed variables
- A is the ntime by k matrix of the first k principal component time series
- B is the k by nv matrix of the first k eigenvectors (stored rowwise)
- E is an ntime by nv matrix of residuals
and you want to minimize the squared Frobenius norm of E (e.g. the sum of all the squared elements of E).
Refer to comp_invert_eof_3d, if you want to compute such approximation of your dataset.
Refer to comp_svd_3d, comp_reg_3d and comp_reg_4d, for more details on MCA and regression procedures available in NCSTAT, respectively.
If your data contains missing values use comp_eof_miss_3d instead of comp_eof_3d to estimate approximate eigenvectors, principal components and, in addition, the missing values in your dataset.
Finally, if the NetCDF variable is fourdimensional use comp_eof_4d instead of comp_eof_3d.
This procedure is parallelized if OpenMP is used. Moreover, this procedure may use partial SVD algorithms which are highly efficient on huge datasets if you are interested only in the few leading terms of the SVD of the data matrix X.
Further Details¶
Usage¶
$ comp_eof_3d \
-f=input_netcdf_file \
-v=netcdf_variable \
-m=input_mesh_mask_netcdf_file \
-g=grid_type (optional : n, t, u, v, w, f) \
-r=resolution (optional : r2, r4) \
-b=nlon_orca, nlat_orca (optional) \
-x=lon1,lon2 (optional) \
-y=lat1,lat2 (optional) \
-t=time1,time2 (optional) \
-a=type_of_analysis (optional : scp, cov, cor) \
-c=input_climatology_netcdf_file (optional) \
-d=type_of_distance (optional : dist2, ident) \
-alg=algorithm (optional : svd, inviter, deflate) \
-n=number_of_eofs (optional) \
-o=output_eof_netcdf_file (optional) \
-mi=missing_value (optional) \
-explvar (optional) \
-double (optional) \
-bigfile (optional) \
-hdf5 (optional) \
-tlimited (optional)
By default¶
- -g=
- the grid_type is set to
n
which means that the 2-D grid-mesh associated with the input NetCDF variable is assumed to be regular or Gaussian- -r=
- if the input netcdf_variable is from the NEMO or ORCA model (e.g. if -g= argument is not set to
n
) the resolution is assumed to ber2
- -b=
- if -g= is not set to
n
, the dimensions of the 2-D grid-mesh, nlon_orca and nlat_orca, are determined from the -r= argument. However, you may override this choice by default with the -b= argument- -x=
- the whole longitude domain associated with the netcdf_variable
- -y=
- the whole latitude domain associated with the netcdf_variable
- -t=
- the whole time period associated with the netcdf_variable
- -a=
- the type_of_analysis is set to
scp
. This means that the eigenvectors and eigenvalues are computed from the sums of squares and cross-products matrix between the observed variables- -c=
- an input_climatology_netcdf_file is not needed if the type_of_analysis is set to
scp
- -d=
- the type_of_distance is set to
dist2
. This means that distances and scalar products in the EOF analysis are computed with the diagonal metric associated with the 2-D grid-mesh associated with the input NetCDF variable- -alg=
- the algorithm option is set to
inviter
. This means that the EOF model is computed by a partial SVD analysis of the matrix of the observed variables using an inverse iteration algorithm- -n=
- number_of_eofs is set to
10
and a 10-component EOF model is stored in the output NetCDF file output_eof_netcdf_file- -o=
- the output_eof_netcdf_file is named
eof_
netcdf_variable.nc
- -mi=
- the missing_value attribute in the output NetCDF file is set to
1.e+20
- -explvar
- the -n= option specifies the number of eofs to be computed and stored in the output NetCDF file. If -explvar is activated, the -n= option specifies the minimum value of explained variance by the EOF model for selecting the order (e.g. the number of components) of this EOF model
- -double
- the results of the EOF analysis are stored as single-precision floating point numbers in the output NetCDF file. If -double is activated, the results are stored as double-precision floating point numbers
- -bigfile
- a NetCDF classical format file is created. If -bigfile is activated, the output NetCDF file is a 64-bit offset format file
- -hdf5
- a NetCDF classical format file is created. If -hdf5 is activated, the output NetCDF file is a NetCDF-4/HDF5 format file
- -tlimited
- the time dimension is defined as unlimited in the output NetCDF file. However, if -tlimited is activated, the time dimension is defined as limited in the output NetCDF file
Remarks¶
The -v=netcdf_variable argument specifies the NetCDF variable for which an EOF analysis must be computed and the -f=input_netcdf_file argument specifies that this NetCDF variable must be extracted from the NetCDF file, input_netcdf_file.
The argument -m=input_mesh_mask_netcdf_file specifies the land-sea mask to apply to the netcdf_variable for transforming this tridimensional NetCDF variable as a rectangular matrix before computing the EOF analysis. The scale factors associated with the 2-D grid-mesh of this NetCDF variable (needed if -d=
dist2
is specified when calling the procedure) are also read from the input_mesh_mask_netcdf_file.If the -x=lon1,lon2 and -y=lat1,lat2 arguments are missing, the geographical domain used in the EOF analysis is determined from the attributes of the input mesh mask NetCDF variable named grid_typemask (e.g. lon1_Eastern_limit, lon2_Western_limit, lat1_Southern_limit and lat2_Northern_limit ) which is read from the input NetCDF file input_mesh_mask_netcdf_file. If these attributes are missing, the whole geographical domain associated with the netcdf_variable is used in the EOF analysis.
The longitude or latitude range must be a vector of two integers specifying the first and last selected indices along each dimension. The indices are relative to
1
. Negative values are allowed for lon1. In this case the longitude domain is fromnlon
+lon1+1
to lon2 wherenlon
is the number of longitude points in the grid associated with the NetCDF variable and it is assumed that the grid is periodic.Refer to comp_mask_3d for transforming geographical coordinates as indices or generating an appropriate mesh-mask before using comp_eof_3d.
If the -t=time1,time2 argument is missing the whole time period associated with the netcdf_variable is used to estimate eigenvectors and principal component time series.
The selected time period is a vector of two integers specifying the first and last time observations. The indices are relative to
1
. Note that the output NetCDF file will haventime
= time2 - time1 +1
time observations.It is assumed that the data has no missing values excepted those associated with a constant land-sea mask. If your dataset has missing values, use comp_eof_miss_3d instead of comp_eof_3d.
If -g= is set to
t
,u
,v
,w
orf
it is assumed that the NetCDF variable is from an experiment with the NEMO or ORCA model. In this case, the duplicate points from the ORCA grid are removed before the EOF analysis, as far as possible, and, in particular, if the 2-D grid-mesh of the input NetCDF variable covers the whole globe. On output, the duplicate points are restored when writing the EOFs (e.g. the eigenvectors), if the geographical domain of the input NetCDF variable is the whole globe.If -g= is set to
n
, it is assumed that the 2-D grid-mesh is regular or Gaussian and as such has no duplicate points.The -g= argument is also used to determine the name of the NetCDF variables which contain the 2-D mesh-mask and the scale factors in the input_mesh_mask_netcdf_file (e.g. these variables are named grid_type
mask
,e1
grid_type ande2
grid_type, respectively). This input_mesh_mask_netcdf_file may be created by comp_clim_3d if the 2-D grid-mesh is regular or gaussian.If -g= is set to
t
,u
,v
,w
orf
(e.g. if the NetCDF variable is from an experiment with the NEMO or ORCA model), the -r= argument gives the resolution used. If:
- -r=
r2
, the NetCDF variable is from an experiment with the ORCA R2 model- -r=
r4
, the NetCDF variable is from an experiment with the ORCA R4 model.If the NetCDF variable is from an experiment with the NEMO or ORCA model, but the resolution is not
r2
orr4
, the dimensions of the ORCA grid must be specified explicitly with the -b= argument.The -a= argument specifies if the observed variables are centered or standardized with an input climatology (specified with the -c= argument) before the EOF analysis. If:
- -a=
scp
, the EOF analysis is done on the raw data- -a=
cov
, the EOF analysis is done on the anomalies- -a=
cor
, the EOF analysis is done on the standardized anomalies.The input_climatology_netcdf_file specified with the -c= argument is needed only if -a=
cov
or -a=cor
.If -a=
cov
or -a=cor
, the selected time period must agree with the climatology. This means that the first selected time observation (time1 if the -t= argument is present) must correspond to the first day, month, season of the climatology specified with the -c= argument.The geographical shapes of the netcdf_variable (in the input_netcdf_file), the mask (in the input_mesh_mask_netcdf_file), the scale factors (in the input_mesh_mask_netcdf_file), and the climatology (in the input_climatology_netcdf_file) must agree.
The -n=number_of_eofs argument specifies the number of components of the EOF model which must be stored (and also computed if -alg=
inviter
or -alg=deflate
is specified) in the output NetCDF file specified by the -o= argument. See also the -explvar argument.If the -explvar argument is activated, the number specified in -n= option is not the number of required EOFs, but the minimum of explained variance that the EOFs must describe. Number of EOFs is then calculated with reference with the minimum of explained variance required by the -n= argument. Express the explained variance in percentage (0-100). If -explvar is specified, the -n= argument must be less or equal to
100
.The -d= argument specifies the metric and scalar product used in the EOF analysis. If:
- -d=
dist2
, the EOF analysis is done with the diagonal distance associated with the horizontal 2-D grid-mesh (e.g. each grid point is weighted accordingly to the surface associated with it)- -d=
ident
, the EOF analysis is done with the identity metric : the Euclidean distance and the usual scalar product is used in the EOF analysis.The -alg= argument determines how eigenvectors and principal components are computed. If:
- -alg=
svd
, a full SVD of the data matrix is computed, even if you ask only for the leading eigenvectors- -alg=
inviter
, a partial SVD of the data matrix is computed by inverse iteration- -alg=
deflate
, a partial SVD of the data matrix is computed by a deflation technique.All algorithms are parallelized if OpenMP is used. The default is -alg=
inviter
since computing a partial SVD is generally much faster than computing a full SVD. But, -alg=deflate
is generally as fast as -alg=inviter
.The -bigfile argument is allowed only if the NCSTAT software has been compiled with the _USE_NETCDF36 or _USE_NETCDF4 CPP macros (e.g.
-D_USE_NETCDF36
or-D_USE_NETCDF4
) and linked to the NetCDF 3.6 library or higher. If this argument is specified, the output_netcdf_file will be a 64-bit offset format file instead of a NetCDF classic format file. However, this argument is recognized in the procedure only if the NCSTAT software has been built with the _USE_NETCDF36 or _USE_NETCDF4 CPP macros.The -hdf5 argument is allowed only if the NCSTAT software has been compiled with the _USE_NETCDF4 CPP macro (e.g.
-D_USE_NETCDF4
) and linked to the NetCDF 4 library or higher. If this argument is specified, the output_netcdf_file will be a NetCDF-4/HDF5 format file instead of a NetCDF classic format file. However, this argument is recognized in the procedure only if the NCSTAT software has been built with the _USE_NETCDF4 CPP macro.The -mi=missing_value argument specifies the missing value indicator associated with the NetCDF variables in the output_netcdf_file. If the -mi= argument is not specified missing_value is set to
1.e+20
.The -double argument specifies that the results are stored as double-precision floating point numbers in the output NetCDF file.
By default, the results are stored as single-precision floating point numbers in the output NetCDF file.
Duplicate parameters are allowed, but this is always the last occurrence of a parameter which will be used for the computations. Moreover, the number of specified parameters must not be greater than the total number of allowed parameters.
For more details on EOF analysis in the climate literature, see
- “A manual for EOF and SVD analyses of climate data”, by Bjornsson, H., and Venegas, S.A., McGill University, CCGCR Report No. 97-1, Montréal, Québec, 52pp, 1997. https://www.jsg.utexas.edu/fu/files/EOFSVD.pdf
- “A primer for EOF analysis of climate data”, by Hannachi, A., Reading University, Reading, UK, 33pp, 2004. http://www.met.reading.ac.uk/~han/Monitor/eofprimer.pdf
- “Statistical Analysis in Climate Research”, by von Storch, H., and Zwiers, F.W., Cambridge University press, Cambridge, UK, Chapter 13, 484 pp., 2002. ISBN: 9780521012300
Outputs¶
comp_eof_3d creates an output NetCDF file that contains the principal component time series, the eigenvectors and the eigenvalues of the EOF analysis. The number of principal components, eigenvectors and eigenvalues stored in the output NetCDF dataset is determined by the -n=number_of_eofs argument. The output NetCDF dataset contains the following NetCDF variables (in the description below, nlat and nlon are the length of the dimensions of the input NetCDF variable) :
netcdf_variable_eof
(number_of_eofs,nlat,nlon)
: the selected eigenvectors of the sums of squares and cross-products (-a=scp
), covariance (-a=cov
) or correlation (-a=cor
) matrix between the observed variables. The eigenvectors are sorted by descending order of the associated eigenvalues. The eigenvectors are scaled such that they give the scalar products (-a=scp
), covariances (-a=cov
) or correlations (-a=cor
) between the original observed variables and the associated principal component time series.The eigenvectors are packed in a tridimensional variable whose first and second dimensions are exactly the same as those associated with the input NetCDF variable netcdf_variable even if you restrict the geographical domain with the -x= and -y= arguments. However, outside the selected domain, the output NetCDF variable is filled with missing values. If this is a problem, you can use comp_norm_3d for restricting the geographical domain in the input dataset before using comp_eof_3d.
netcdf_variable_pc
(ntime,number_of_eofs)
: the principal component time series sorted by descending order of the eigenvalues (e.g. the square of the singular values of the input matrix of observed variables).The principal component time series are always standardized to unit variance.
netcdf_variable_sing
(number_of_eofs)
: the singular values of the input data matrix in decreasing order. Up to a constant scaling factor (equal to the square root of1/ntime
), these singular values are the square roots of the eigenvalues of the sums of squares and cross-products (-a=scp
) or covariance (-a=cov
) or correlation (-a=cor
) matrix between the observed variables.These eigenvalues are equal to the variance described by the principal component time series.
netcdf_variable_var
(number_of_eofs)
: the proportion of variance explained by each principal component time series.
Examples¶
For computing an EOF analysis from the NetCDF file
HadISST1_2m_197902_200501_sst.nc
, which includes a NetCDF variablesst
, and store the results in a NetCDF file namedeof_HadISST1_2m_197902_200501_sst_oi.nc
, use the following command (the analysis is done on the raw data without removing the annual cycle) :$ comp_eof_3d \ -f=HadISST1_2m_197902_200501_sst.nc \ -v=sst \ -m=mask_HadISST1_sst.nc \ -g=n \ -a=scp \ -n=20 \ -o=eof_HadISST1_2m_197902_200501_sst_oi.ncFor computing an EOF analysis from the NetCDF file
ST7_1m_0101_20012_grid_T_sosstsst.nc
, which includes a NetCDF variablesosstsst
from a numerical simulation with the ORCA R2 model, and store the results in a NetCDF file namedeof_sosstsst.nc
, use the following command (the analysis is done on the data after removing a climatology) :$ comp_eof_3d \ -f=ST7_1m_0101_20012_grid_T_sosstsst.nc \ -v=sosstsst \ -c=clim_ST7_1m_0101_20012_grid_T_sosstsst.nc \ -a=cov \ -m=meshmask.orca2.nc \ -g=t