comp_eof_miss_3d

Authors

Pascal Terray (LOCEAN/IPSL)

Latest revision

06/05/2021

Purpose

Compute an Empirical Orthogonal Function (EOF) analysis , also known as Principal Component Analysis (PCA) from a tridimensional variable with missing values extracted from a NetCDF dataset.

The procedure first transforms the input tridimensional NetCDF variable as a ntime by nv rectangular matrix, X, of observed variables (e.g. the selected cells of the 2-D grid-mesh associated with the tridimensional NetCDF variable) and then performs an EOF analysis of this rectangular matrix. Since missing values are present, there are irregularly distributed gaps in the matrix X and the procedure used here is different than the one used in comp_eof_3d.

A first basic assumption of the estimation procedures used in comp_eof_miss_3d is that each row and column of the matrix, X, of the observed variables have at least one non-missing entry.

Two alternative strategies are available in comp_eof_miss_3d for estimating the first k eigenvectors and principal component time series of an EOF model with gappy data :

  1. If the argument -alg= is set to var, comp_eof_miss_3d first computes an estimate of the matrix product (transpose(X).X)/ntime (as long as every pair of observed variables has at least one nonmissing entry). The elements of this matrix estimate are calculated from all valid observations for every pair of observed variables. If fewer than one valid observation is present for some pair of variables, the procedure prints an error message and stops. In a second step, the first k eigenvectors, B (B is a k by nv matrix with eigenvectors stored rowwise) and eigenvalues of this matrix estimate are computed by inverse iteration. These eigenvectors are preliminary estimates of the first k eigenvectors of the matrix (transpose(X).X)/ntime. Note that because the matrix product estimate (derived from the data matrix X with missing values) is not necessarily positive definite some of these eigenvalues may be negative. In a third step, comp_eof_miss_3d obtains preliminary least square estimates, A (A is a ntime by k matrix), of the first k principal component scores by regressing the observations onto the eigenvectors, B, using only the valid entries in each observation. Note that these preliminary estimates of the first k principal component scores are not necessarily jointly uncorrelated if missing values are present.
  2. If the argument -alg= is set to obs, comp_eof_miss_3d first computes an estimate of the matrix product (X.transpose(X))/ nv (as long as every pair of observations has at least one nonmissing entry). The elements of this matrix estimate are calculated from all valid observed variables for every pair of observations. If fewer than one valid variable is present for some pair of observations, the procedure prints an error message and stops. In a second step, the first k eigenvectors, A (A is a ntime by k matrix with eigenvectors stored columnwise) and eigenvalues of this matrix estimate are computed by inverse iteration. These eigenvectors are preliminary estimates of the first k (standardized) principal component scores. Note that because the matrix product estimate derived from the data matrix X with missing values is not necessarily positive definite some of the eigenvalues may be negative. In a third step, comp_eof_miss_3d obtains preliminary least square estimates, B (B is a k by nv matrix), of the first k eigenvectors of the EOF model by regressing the variables onto the preliminary estimates of the principal component scores, B, using only the valid entries in each variable. Note that these preliminary estimates of the first k eigenvectors are not necessarily orthogonal if missing values are present.

After these alternative preliminary steps, it is an easy task to obtain suitable orthonormalization of the A and B matrices of the estimated k-component model similar to the traditional ones in the restricted k EOF model by computing the Singular Value Decomposition (SVD) of the AB matrix product. Note that there is no need to compute the AB product in this final computation since the SVD of this matrix product can be easily deduced from the two smallest SVD of A and B, respectively. Let

AB = USV

where

  • U is a ntime by k matrix with orthonormal columns (the left singular vectors stored columnwise)
  • S is a square k by k matrix with nonnegative elements on its principal diagonal and zeros elsewhere (the diagonal elements of S are the singular values of AB)
  • V is a k by nv matrix with orthonormal rows (the right singular vectors stored rowwise)

Note that this SVD has no more than k terms with a singular value distinct from zero since AB is of rank inferior or equal to k. With these notations, the first k unstandardized “principal components” of the k-EOF model are just A=US and the first k “eigenvectors” of this model may be defined as B=V (up to a constant scaling factor).

With these final estimates of the EOF model with gappy data, observed that the k eigenvectors stored rowwise in B are orthogonal and that the principal component time series stored columnwise in A are jointly uncorrelated as for the traditional EOF model without missing values in the data.

More sophisticated methods are available in the literature for estimating PCA/EOF models with missing values [Terraya] [Terrayb] [Schneider] and some of them will be implemented in future NCSTAT releases.

An output NetCDF dataset containing singular values, eigenvectors B and standardized principal component time series U is created. The eigenvectors are repacked as a tridimensional variable in the output NetCDF dataset.

Optionally, comp_eof_miss_3d may also compute estimates of the missing values in the data matrix, X, of observed variables with the help of the estimated k-EOF model AB and stored the resulting matrix in an output NetCDF file. This filled matrix is also repacked as a tridimensional variable in a second output NetCDF dataset which may be used for further analyses in NCSTAT procedures.

This procedure is parallelized if OpenMP is used.

Further Details

Usage

$ comp_eof_miss_3d \
  -f=input_netcdf_file \
  -v=netcdf_variable \
  -m=input_mesh_mask_netcdf_file \
  -g=grid_type                      (optional : n, t, u, v, w, f) \
  -r=resolution                     (optional : r2, r4) \
  -b=nlon_orca, nlat_orca           (optional) \
  -x=lon1,lon2                      (optional) \
  -y=lat1,lat2                      (optional) \
  -t=time1,time2                    (optional) \
  -a=type_of_analysis               (optional : scp, cov, cor) \
  -c=input_climatology_netcdf_file  (optional) \
  -d=type_of_distance               (optional : dist2, ident) \
  -alg=algorithm                    (optional : var, obs) \
  -n=number_of_eofs                 (optional) \
  -o=output_eof_netcdf_file         (optional) \
  -fo=output_netcdf_file            (optional) \
  -mi=missing_value                 (optional) \
  -use_eps=tolerance                (optional) \
  -comp_min_norm                    (optional) \
  -double                           (optional) \
  -bigfile                          (optional) \
  -hdf5                             (optional) \
  -tlimited                         (optional)

By default

-g=
the grid_type is set to n which means that the 2-D grid-mesh is assumed to be regular or Gaussian
-r=
if the input netcdf_variable is from the NEMO or ORCA model (e.g. if -n= argument is not set to n) the resolution is assumed to be r2
-b=
if -n= is not set to n, the dimensions of the grid-mesh, nlon_orca and nlat_orca, are determined from the -r= argument. However, you may override this choice by default with the -b= argument
-x=
the whole longitude domain associated with the netcdf_variable
-y=
the whole latitude domain associated with the netcdf_variable
-t=
the whole time period associated with the netcdf_variable
-a=
the type_of_analysis is set to scp. This means that the eigenvectors and eigenvalues are computed from the sums of squares and cross-products matrix between the observed variables
-c=
an input_climatology_netcdf_file is not needed if the type_of_analysis is set to scp
-d=
the type_of_distance is set to dist2. This means that distances and scalar products in the EOF analysis are computed with the diagonal metric associated with the 2-D grid-mesh
-alg=
the algorithm option is set to var. This means that eigenvectors are estimated in the space domain first in the preliminary steps of the procedure
-n=
number_of_eofs is set to 1 and an one-component EOF model is computed and stored in the output NetCDF file output_eof_netcdf_file
-o=
the output_eof_netcdf_file is named eof_netcdf_variable.nc
-fo=
by default, an output_netcdf_file with a copy of the input netcdf_variable in which missing values replaced by estimates from the selected EOF model is not created
-mi=
the missing_value attribute in the output NetCDF file is set to 1.e+20
-use_eps=
the numerical rank is determined when solving each regression problem
-comp_min_norm
a minimal norm solution is not computed when solving deficient linear least squares problems
-double
the results of the EOF analysis are stored as single-precision floating point numbers in the output NetCDF file. If -double is activated, the results are stored as double-precision floating point numbers
-bigfile
NetCDF classical format file are created. If -bigfile is activated, the output NetCDF files are 64-bit offset format files
-hdf5
NetCDF classical format file are created. If -hdf5 is activated, the output NetCDF files are NetCDF-4/HDF5 format files
-tlimited
the time dimension is defined as unlimited in the output NetCDF file. However, if -tlimited is activated, the time dimension is defined as limited in the output NetCDF file

Remarks

  1. The -v=netcdf_variable argument specifies the NetCDF variable for which an EOF analysis must be computed and the -f=input_netcdf_file argument specifies that this NetCDF variable must be extracted from the NetCDF file, input_netcdf_file.

  2. The argument -m=input_mesh_mask_netcdf_file specifies the land-sea mask to apply to the netcdf_variable for transforming this tridimensional NetCDF variable as a rectangular matrix before computing the EOF analysis. The scale factors associated with the 2-D grid-mesh of this NetCDF variable (needed if -d=dist2 is specified when calling the procedure) are also read from the input_mesh_mask_netcdf_file.

  3. If the -x=lon1,lon2 and -y=lat1,lat2 arguments are missing, the geographical domain used in the EOF analysis is determined from the attributes of the input mesh mask NetCDF variable named grid_typemask (e.g. lon1_Eastern_limit, lon2_Western_limit, lat1_Southern_limit and lat2_Northern_limit ) which is read from the input NetCDF file input_mesh_mask_netcdf_file. If these attributes are missing, the whole geographical domain associated with the netcdf_variable is used in the EOF analysis.

    The longitude or latitude range must be a vector of two integers specifying the first and last selected indices along each dimension. The indices are relative to 1. Negative values are allowed for lon1. In this case the longitude domain is from nlon+lon1+1 to lon2 where nlon is the number of longitude points in the grid associated with the NetCDF variable.

    Refer to comp_mask_3d for transforming geographical coordinates as indices or generating an appropriate mesh-mask before using comp_eof_miss_3d.

  4. If the -t=time1,time2 argument is missing the whole time period associated with the netcdf_variable is used to estimate eigenvectors and principal component time series.

    The selected time period is a vector of two integers specifying the first and last time observations. The indices are relative to 1. Note that the output NetCDF files will have ntime = time2 - time1 + 1 observations.

  5. It is assumed that the specified input netcdf_variable have a scalar missing or _FillValue attribute and that missing values in the data are identified by the values of this missing or _FillValue attribute.

  6. If -g= is set to t, u, v, w or f it is assumed that the NetCDF variable is from an experiment with the NEMO or ORCA model. In this case, the duplicate points from the ORCA grid are removed before the EOF analysis, as far as possible, and, in particular, if the 2-D grid-mesh of the input NetCDF variable covers the whole globe. On output, the duplicate points are restored when writing the EOFs (e.g. the eigenvectors), if the geographical domain of the input NetCDF variable is the whole globe.

    If -g= is set to n, it is assumed that the 2-D grid-mesh is regular or Gaussian and as such has no duplicate points.

    The -g= argument is also used to determine the name of the NetCDF variables which contain the 2-D mesh-mask and the scale factors in the input_mesh_mask_netcdf_file (e.g. these variables are named grid_typemask, e1grid_type and e2grid_type, respectively). This input_mesh_mask_netcdf_file may be created by comp_clim_miss_3d if the 2-D grid-mesh is regular or gaussian.

  7. If -g= is set to t, u, v, w or f (e.g. if the NetCDF variable is from an experiment with the NEMO or ORCA model), the -r= argument gives the resolution used. If:

    • -r=r2, the NetCDF variable is from an experiment with the ORCA R2 model
    • -r=r4, the NetCDF variable is from an experiment with the ORCA R4 model.
  8. If the NetCDF variable is from an experiment with the NEMO or ORCA model, but the resolution is not R2 or R4, the dimensions of the ORCA grid must be specified explicitly with the -b= argument.

  9. The -a= argument specifies if the observed variables are centered or standardized with an input climatology (specified with the -c= argument) before the EOF analysis. If:

    • -a=scp, the EOF analysis is done on the raw data
    • -a=cov, the EOF analysis is done on the anomalies
    • -a=cor, the EOF analysis is done on the standardized anomalies.
  10. The input_climatology_netcdf_file specified with the -c= argument is needed only if -a=cov or -a=cor.

  11. If -a=cov or -a=cor, the selected time period must agree with the climatology. This means that the first selected time observation (time1 if the -t= argument is present) must correspond to the first day, month, season of the climatology specified with the -c= argument.

  12. The geographical shapes of the netcdf_variable (in the input_netcdf_file), the mask (in the input_mesh_mask_netcdf_file), the scale factors (in the input_mesh_mask_netcdf_file), and the climatology (in the input_climatology_netcdf_file) must agree.

  13. The -n=number_of_eofs argument specifies the order of the EOF model which must be estimated from the gappy data and stored in the output NetCDF file specified by the -o= argument.

  14. The -d= argument specifies the metric and scalar product used in the EOF analysis. If:

    • -d=dist2, the EOF analysis is done with the diagonal distance associated with the horizontal 2-D grid-mesh (e.g. each grid point is weighted accordingly to the surface associated with it)
    • -d=ident, the EOF analysis is done with the identity metric : the Euclidean distance and the usual scalar product is used in the EOF analysis.

    By default, the -d= argument is set to dist2.

  15. The -alg= argument specifies how the preliminary estimates of the eigenvectors and associated principal components are calculated. If:

    • -alg=var, eigenvectors are estimated in the space domain first and time coefficients are obtained through a regression analysis in a second step
    • -alg=obs, eigenvectors are estimated in the time domain first and space coefficients are obtained through a regression analysis in a second step.

    In both cases, the final results are normalized with a partial SVD analysis as described above. Both algorithms are also parallelized if OpenMP is used.

    The default alorithm is -alg=var.

  16. -use_eps=tolerance is a real value used to determine the effective rank of the coefficient matrix for each regression problem solved in comp_eof_miss_3d. Tolerance must be set to the relative precision of the elements of the input data matrix. If each element is correct to, say, 5 digits then -use_eps=0.00001 should be used. Tolerance must not be greater or equal to 1 or less or equal than 0, otherwise the numerical rank is determined. If tolerance is absent, the numerical rank is determined for each regression problem solved in comp_eof_miss_3d.

  17. The -comp_min_norm argument specifies that a minimal norm solution must be computed for solving deficient linear least squares problems. By default, a minimal norm solution is not computed when solving deficient linear least squares problems in the regression steps of the procedure

  18. The -fo=output_netcdf_file argument specifies that the missing values must be estimated from the computed EOF model (and climatology) and that the full data matrix repacked as a tridimensional variable must be stored in the NetCDF file output_netcdf_file on output of the procedure.

  19. The -bigfile argument is allowed only if the NCSTAT software has been compiled with the _USE_NETCDF36 or _USE_NETCDF4 CPP macros (e.g. -D_USE_NETCDF36 or -D_USE_NETCDF4) and linked to the NetCDF 3.6 library or higher. If this argument is specified, the output_netcdf_file and output_netcdf_eof_file will be 64-bit offset format files instead of NetCDF classic format files. However, this argument is recognized in the procedure only if the NCSTAT software has been built with the _USE_NETCDF36 or _USE_NETCDF4 CPP macros.

  20. The -hdf5 argument is allowed only if the NCSTAT software has been compiled with the _USE_NETCDF4 CPP macro (e.g. -D_USE_NETCDF4) and linked to the NetCDF 4 library or higher. If this argument is specified, the output_netcdf_file and output_netcdf_eof_file will be NetCDF-4/HDF5 format files instead of NetCDF classic format files. However, this argument is recognized in the procedure only if the NCSTAT software has been built with the _USE_NETCDF4 CPP macro.

  21. The -mi=missing_value argument specifies the missing value indicator associated with the netcdf_variables in the output NetCDF files. If the -mi= argument is not specified missing_value is set to 1.e+20.

  22. The -double argument specifies that the results are stored as double-precision floating point numbers in the output NetCDF file.

    By default, the results are stored as single-precision floating point numbers in the output NetCDF file.

  23. Duplicate parameters are allowed, but this is always the last occurrence of a parameter which will be used for the computations. Moreover, the number of specified parameters must not be greater than the total number of allowed parameters.

  24. For more details on EOF analysis with missing values in the climate literature, see

    • “Space/Time structure of monsoons interannual variability”, by Terray, P., Journal of Climate, Vol. 8, 2595-2619, 1995. doi: 10.1175/1520-0442(1995)008<2595:STSOMI>2.0.CO;2
    • “Detecting Climatic Signals from Ship’s Datasets”, by Terray, P., Proceedings of the International Workshop on Digitization and Preparation of Historical Surface Marine Data and Metadata, 15-17 September. 1997, Toledo, Spain; 83-88 p.; H.F. Diaz and S.D. Woodruff, Eds., WMO/TD-No.957, MMROA Report No. 43, 1999. http://icoads.noaa.gov/mmroa43_toledo.pdf
    • “Application of Weighted Empirical Orthogonal Function Analysis to ship’s datasets”, by Terray, P., Compte-Rendu de la IVème journée Statistique IPSL (Classification et Analyse spatiale). NAI n°23. pp. 11-28. 2002. ISSN 1626-8334. https://www.lmd.polytechnique.fr/nai/nai_23.pdf
    • “Analysis of incomplete climate data: Estimation of mean values and covariance matrices and imputation of missing values”, by Schneider, T., Journal of Climate, Vol. 14, 853-871, 2001. doi: 10.1175/1520-0442(2001)014<0853:AOICDE>2.0.CO;2
    • “Statistical Analysis in Climate Research”, by von Storch, H., and Zwiers, F.W., Cambridge University press, Cambridge, UK, Chapter 13, 484 pp., 2002. ISBN: 9780521012300

Outputs

comp_eof_miss_3d creates an output NetCDF file that contains the principal component time series, the eigenvectors and the eigenvalues of the number_of_eofs-EOF model estimated from a dataset with missing values. The number of principal components, eigenvectors and eigenvalues computed and stored in the output NetCDF dataset is determined by the -n=number_of_eofs argument. The output NetCDF dataset contains the following NetCDF variables (in the description below, nlat and nlon are the length of the dimensions of the input NetCDF variable):

  1. netcdf_variable_eof(number_of_eofs,nlat,nlon) : the approximations of the selected eigenvectors of the sums of squares and cross-products (-a=scp), covariance (-a=cov) or correlation (-a=cor) matrix between the observed variables. The eigenvectors are sorted by descending order of the associated eigenvalues. The eigenvectors are scaled such that they give estimates of the scalar products (-a=scp), covariances (-a=cov) or correlations (-a=cor) between the original observed variables and the associated principal component time series. The eigenvectors are packed in a tridimensional variable whose first and second dimensions are exactly the same as those associated with the input NetCDF variable netcdf_variable even if you restrict the geographical domain with the -x= and -y= arguments. However, outside the selected domain, the output NetCDF variable is filled with missing values. If this is a problem, you can use comp_norm_miss_3d for restricting the geographical domain in the input dataset before using comp_eof_miss_3d.

  2. netcdf_variable_pc(ntime,number_of_eofs) : the principal component time series sorted by descending order of the eigenvalues .

    The principal component time series are always standardized to unit variance.

  3. netcdf_variable_sing(number_of_eofs) : the singular values of the AB matrix product in decreasing order. Up to a constant scaling factor (equal to the square root of 1/ntime), these singular values are estimates of the square roots of the eigenvalues of the sums of squares and cross-products (-a=scp) or covariance (-a=cov) or correlation (-a=cor) matrix between the observed variables.

    These eigenvalues are estimates to the variance described by the principal component time series.

  4. netcdf_variable_var(number_of_eofs) : estimates of the proportion of variance explained by each principal component time series computed as the ratio between the estimates of variance described by the principal component time series and the sum of the variances of the observed variables.

Optionally, comp_eof_miss_3d can also create an output NetCDF file that contains a copy of the input NetCDF variable with missing values filled with the help of the computed number_of_eofs-EOF model adjusted to the gappy data. You must use the -fo=output_netcdf_file argument in order to create this second output datat set. This optional output NetCDF dataset contains the following NetCDF variable (in the description below, nlat and nlon are the length of the dimensions of the input NetCDF variable):

  1. netcdf_variable(ntime,nlat,nlon) : a copy of the input NetCDF variable with missing values filled with estimates computed with the help of the selected EOF model.

Examples

  1. For estimating a 1-EOF model from the NetCDF file hadcrut2v.nc, which includes a NetCDF variable temanom, and store the results in a NetCDF file named eof1_hadcrut2v.nc, use the following command (the analysis is done on the raw data) :

    $ comp_eof_miss_3d \
      -f=hadcrut2v.nc  \
      -v=temanom \
      -m=mesh_mask_hadcrut2v.nc \
      -g=n \
      -a=scp \
      -n=1 \
      -o=eof1_hadcrut2v.nc
    
  2. For estimating a 3-EOF model from the NetCDF file hadcrut2v.nc, which includes a NetCDF variable temanom, and store the results in a NetCDF file named eof3_hadcrut2v.nc and, in addition, reconstructing a filled dataset with the help of this 3-EOF model approximation of your gappy data, use the following command :

    $ comp_eof_miss_3d \
      -f=hadcrut2v.nc  \
      -v=temanom \
      -m=mesh_mask_hadcrut2v.nc \
      -g=n \
      -a=scp \
      -n=3 \
      -o=eof3_hadcrut2v.nc  \
      -fo=hadcrut2v_eof3.nc