comp_cor_miss_3d

Authors

Pascal Terray (LOCEAN/IPSL)

Latest revision

17/05/2024

Purpose

Compute correlation and regression coefficients between an index time series and a tridimensional variable extracted from a NetCDF dataset, and perform statistical tests on these correlation coefficients. Missing values are allowed both in the input index time series and in the input NetCDF tridimensional variable. However, if your data do not contain missing values other than those associated with a constant land-sea mask, use comp_cor_3d instead of comp_cor_miss_3d to estimate correlation and regression coefficients from your dataset.

As in comp_cor_3d, the procedure first transforms the input tridimensional NetCDF variable into an ntime by nv rectangular matrix of observed variables stored columnwise (e.g., the selected cells of the 2-D grid-mesh associated with the tridimensional NetCDF variable) and then computes measures of association between each of these variables, say X, and the input index time series, say Y. However, since missing values are present, the number of observations used to compute the means and standard-deviations for each variable, and the correlation coefficients between each pair of variables X and Y, may vary. This is an important difference from the statistics obtained with comp_cor_3d.

By default, comp_cor_miss_3d computes the sample correlation and regression coefficients, the associated critical probabilities for testing the nullity of the correlation coefficients and the z transforms of the correlation coefficients between the index time series and each point in the time series of the 2-D grid-mesh associated with the input tridimensional NetCDF variable. The intercept coefficients of the regression line between X and Y are also computed if the optional argument -intercept is specified when calling comp_cor_miss_3d. Moreover, all these statistics may be computed by taking into account the periodicity of the input tridimensional NetCDF variable if you suspect that the time series are cyclostationary (by using the -p=periodicity argument when calling the procedure). All the results are finally stored in an output NetCDF dataset, after repacking the statistics on the original 2-D grid of the input tridimensional NetCDF variable.
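
As a rough illustration of what the -p=periodicity argument implies, the sketch below splits a series into one sub-series per phase of the cycle (for monthly data and a periodicity of 12, one sub-series per calendar month), so that the statistics can be computed separately for each phase. The grouping rule is an assumption inferred from the fact that the output file has periodicity time observations, and the function name split_by_phase is hypothetical; neither is taken from the NCSTAT source.

```python
def split_by_phase(series, periodicity):
    """Group a time series into `periodicity` sub-series, one per phase.

    Assumption for this sketch: observation t (0-based) belongs to
    phase t % periodicity, and the series length is a whole multiple
    of the periodicity.
    """
    if len(series) % periodicity != 0:
        raise ValueError("series length must be a whole multiple of the periodicity")
    return [series[phase::periodicity] for phase in range(periodicity)]


# Two years of monthly observations, labelled 1..24 for readability.
monthly = list(range(1, 25))
groups = split_by_phase(monthly, 12)
print(len(groups))   # number of phases
print(groups[0])     # observations falling in the first phase
```

Each sub-series would then be correlated with the index time series on its own.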

Refer to comp_cor_3d for a basic definition of all these statistics, which is not repeated here. Refer to [vonStorch_Zwiers] for a general introduction to correlation/regression coefficients and the z transform of the correlation coefficients, and to their use in climate analysis.

Due to the presence of missing values, two different methods are, however, available to estimate the correlation and regression coefficients (and the Fisher z transform) in comp_cor_miss_3d.

In the first method (used when the argument -alg= is set to miss1; this is the default), comp_cor_miss_3d first estimates the means and standard-deviations for the index time series and for each point in the time series of the 2-D grid-mesh associated with the input NetCDF variable by using all the available observations for each time series. Since missing values are present, the number of observations used to compute the means and standard-deviations may then vary from one point to another in the 2-D grid-mesh associated with the input NetCDF variable (as well as for the index time series). In a second step, comp_cor_miss_3d estimates the correlation and regression coefficients using the previously computed univariate statistics and all valid pairs of observations for each pair of variables. From this definition, it follows that the correlation coefficients computed with this method may be greater than 1 or less than -1 in some cases when the number of missing values is very large. However, in such cases, the procedure adjusts the value of the correlation coefficients accordingly.

In the second method (used when the argument -alg= is set to miss2), comp_cor_miss_3d computes both the univariate and bivariate statistics from all valid pairs of observations for each pair of variables separately. From this definition, it follows that the estimated correlation coefficient cannot be less than -1 or greater than 1. However, the univariate statistics may be based on far fewer observations than in the first method (i.e., when -alg=miss1).
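
The difference between the two methods can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the NCSTAT implementation: missing values are represented by None, standard deviations use the divisor n, the cross-products in the first method are averaged over the number of valid pairs, and the "adjustment" of out-of-range coefficients is assumed to be a simple clamp to [-1, 1]. The function names cor_miss1 and cor_miss2 are hypothetical.

```python
import math


def _mean(values):
    return sum(values) / len(values)


def _std(values, mean):
    # Standard deviation with divisor n (an assumption of this sketch).
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))


def _valid_pairs(x, y):
    return [(a, b) for a, b in zip(x, y) if a is not None and b is not None]


def cor_miss1(x, y):
    """Univariate statistics from all valid data, cross-products from valid pairs."""
    xv = [a for a in x if a is not None]
    yv = [b for b in y if b is not None]
    mx, my = _mean(xv), _mean(yv)
    sx, sy = _std(xv, mx), _std(yv, my)
    pairs = _valid_pairs(x, y)
    cov = sum((a - mx) * (b - my) for a, b in pairs) / len(pairs)
    r = cov / (sx * sy)
    # Assumed form of the adjustment: clamp the coefficient to [-1, 1].
    return max(-1.0, min(1.0, r))


def cor_miss2(x, y):
    """Univariate and bivariate statistics from the valid pairs only."""
    pairs = _valid_pairs(x, y)
    xv = [a for a, _ in pairs]
    yv = [b for _, b in pairs]
    mx, my = _mean(xv), _mean(yv)
    sx, sy = _std(xv, mx), _std(yv, my)
    cov = sum((a - mx) * (b - my) for a, b in pairs) / len(pairs)
    return cov / (sx * sy)


x = [1.0, 2.0, None, 4.0, 5.0, 6.0]
y = [2.0, None, 3.0, 8.0, 10.0, 12.5]
print(cor_miss1(x, y), cor_miss2(x, y))
```

When no value is missing, both functions use the same observations everywhere and return the same coefficient.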

Finally, note that only one method is available for computing the critical probabilities associated with the correlation coefficients and testing the significance of the correlations, since the permutation and bootstrap methods cannot easily be implemented if missing values are present in the time series. This is in contrast to the variety of approaches available in comp_cor_3d.
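
For reference, the textbook quantities behind a Student t test of a correlation coefficient and the Fisher z transform can be written as follows. The formulas are the standard ones; the assumption that comp_cor_miss_3d applies them with the number of valid pairs as nobs is a guess made for this sketch, and the function names are hypothetical.

```python
import math


def fisher_z(r):
    """Fisher z transform of a correlation coefficient (requires |r| < 1)."""
    return math.atanh(r)


def student_t(r, nobs):
    """t statistic for testing a zero correlation, with nobs - 2 degrees of freedom.

    Requires nobs >= 3 and |r| < 1.
    """
    if nobs < 3:
        raise ValueError("at least 3 observations are needed")
    return r * math.sqrt((nobs - 2) / (1.0 - r * r))


print(fisher_z(0.5))
print(student_t(0.5, 27))
```

The two-sided critical probability would then follow from the Student t distribution with nobs - 2 degrees of freedom, which is not implemented in this sketch.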

This procedure is parallelized if OpenMP is used and the NCSTAT software has been built with the _PARALLEL_READ CPP macro. Moreover, this procedure computes the correlation and regression coefficients with only one pass through the data and an out-of-core strategy which is highly efficient on huge datasets.

Further Details

Usage

$ comp_cor_miss_3d \
  -f=input_netcdf_file \
  -v=netcdf_variable \
  -vi=index_netcdf_variable \
  -fi=input_index_netcdf_file              (optional) \
  -m=input_mesh_mask_netcdf_file           (optional) \
  -g=grid_type                             (optional : n, t, u, v, w, f) \
  -x=lon1,lon2                             (optional) \
  -y=lat1,lat2                             (optional) \
  -t=time1,time2                           (optional) \
  -p=periodicity                           (optional) \
  -a=type_of_analysis                      (optional : student) \
  -rg=type_of_regression                   (optional : reg1, reg2) \
  -o=output_netcdf_file                    (optional) \
  -ti=itime1,itime2                        (optional) \
  -pi=iperiodicity,istep                   (optional) \
  -ni=indice_for_2d_index_netcdf_variable  (optional) \
  -mi=missing_value                        (optional) \
  -alg=algorithm                           (optional : miss1, miss2) \
  -regstd                                  (optional) \
  -intercept                               (optional) \
  -double                                  (optional) \
  -bigfile                                 (optional) \
  -hdf5                                    (optional) \
  -tlimited                                (optional)

By default

-fi=
the same as the -f= argument
-m=
an input_mesh_mask_netcdf_file is not used
-g=
the grid_type is set to n which means that the 2-D grid-mesh associated with the input netcdf_variable is assumed to be regular or Gaussian
-x=
the whole longitude domain associated with the netcdf_variable
-y=
the whole latitude domain associated with the netcdf_variable
-t=
the whole time period associated with the netcdf_variable
-p=
the periodicity is set to 1
-ti=
the whole time period associated with the index_netcdf_variable
-pi=
this parameter is not used
-ni=
if the index_netcdf_variable is bidimensional, the first time series is used
-a=
type_of_analysis is set to student
-rg=
type_of_regression is set to reg1
-mi=
the missing_value is set to 1.e+20 in the output_netcdf_file
-alg=
the method used to compute univariate and bivariate statistics is miss1
-o=
output_netcdf_file name is set to cor_netcdf_variable.index_netcdf_variable.nc
-regstd
the regression coefficients are computed in units of the input NetCDF variables. If -regstd is activated, the regression coefficients are computed in units of the netcdf_variable per standard deviation of the index_netcdf_variable
-intercept
the intercept coefficients of the regressions are not computed. If -intercept is activated, the intercept coefficients of the regressions are computed and stored in the output_netcdf_file
-double
the results are stored as single-precision floating point numbers in the output NetCDF file. If -double is activated, the results are stored as double-precision floating point numbers
-bigfile
a NetCDF classical format file is created. If -bigfile is activated, the output NetCDF file is a 64-bit offset format file
-hdf5
a NetCDF classical format file is created. If -hdf5 is activated, the output NetCDF file is a NetCDF-4/HDF5 format file
-tlimited
the time dimension is defined as unlimited in the output NetCDF file. However, if -tlimited is activated, the time dimension is defined as limited in the output NetCDF file

Remarks

  1. The -v=netcdf_variable argument specifies the NetCDF variable for which a correlation analysis must be computed and the -f=input_netcdf_file argument specifies that this NetCDF variable must be extracted from the NetCDF file, input_netcdf_file.

  2. The optional argument -m=input_mesh_mask_netcdf_file specifies the land-sea mask to apply to netcdf_variable for transforming this tridimensional NetCDF variable into a rectangular matrix of observed variables before computing the correlation analysis. By default, it is assumed that each cell in the 2-D grid-mesh associated with the input tridimensional NetCDF variable is a valid time series (i.e., a time series with valid data for at least some observations).

    The geographical shapes of the netcdf_variable (in the input_netcdf_file) and the mask (in the input_mesh_mask_netcdf_file) must agree if an input_mesh_mask_netcdf_file is used.

    Refer to comp_clim_miss_3d or comp_mask_3d for creating a valid input_mesh_mask_netcdf_file NetCDF file for regular or Gaussian grids before using comp_cor_miss_3d.

  3. If -g= is set to t, u, v, w or f, it is assumed that the NetCDF variable is from an experiment with the NEMO model (ORCA configuration and R2, R4 or R05 resolutions). This argument is also used to determine the name of the mesh_mask_variable if an input_mesh_mask_netcdf_file is used.

  4. If the -x=lon1,lon2 and -y=lat1,lat2 arguments are missing, the whole geographical domain associated with the netcdf_variable is used.

    The longitude or latitude range must be a vector of two integers specifying the first and last selected indices along each dimension. The indices are relative to 1. Negative values are allowed for lon1. In this case, the longitude domain runs from nlon+lon1+1 to lon2, where nlon is the number of longitude points in the grid associated with the NetCDF variable, and it is assumed that the grid is periodic.

    Refer to comp_mask_3d for transforming geographical coordinates into indices or generating an appropriate mesh-mask before using comp_cor_miss_3d.

  5. If the -t=time1,time2 argument is missing, data in the whole time period associated with the netcdf_variable is taken into account. The selected time period is a vector of two integers specifying the first and last time observations. The indices are relative to 1.

    The length of the selected time period (i.e., time2 - time1 + 1) must be a whole multiple of the periodicity if the -p= argument is specified.

  6. The -p=periodicity argument gives the periodicity of the input data for the netcdf_variable. For example, with monthly data -p=12 should be specified, with yearly data -p=1 may be used, etc.

    Note that the output NetCDF file will have periodicity time observations.

  7. The -vi=index_netcdf_variable argument specifies the time series used for the correlation analysis, and the -fi= argument specifies the NetCDF dataset which contains the index_netcdf_variable. If this dataset is the same as the NetCDF dataset specified by the -f= argument, it is not necessary to specify the -fi= argument.

  8. The -ni= argument specifies the index (i.e., an integer) for selecting the time series if the index_netcdf_variable specified in the -vi= argument is a 2D NetCDF variable. By default, the first time series is used, which is equivalent to setting indice_for_2d_index_netcdf_variable to 1.

  9. If the -ti=itime1,itime2 argument is missing, data in the whole time period associated with the index_netcdf_variable is taken into account. The selected time period is a vector of two integers specifying the first and last time observations. The indices are relative to 1.

  10. The -pi= argument gives the periodicity and selects the time step for the index_netcdf_variable. For example, to compute correlations with the January monthly time series extracted from an index_netcdf_variable which is assumed to be sampled every month, -pi=12,1 should be specified; with yearly data, -pi=1,1 may be used, etc.

  11. The selected time periods for the netcdf_variable and index_netcdf_variable must agree. This means that the following equality must hold:

    (time2 - time1 + 1)/periodicity = ceiling((itime2 - itime1 - istep + 2)/iperiodicity),

    otherwise, an error message will be issued and the program will stop.

  12. The -alg= argument selects the method for computing the correlation coefficients:

    • If -alg=miss1, the means and standard-deviations of the netcdf_variable and index_netcdf_variable are computed from all valid data. The correlation coefficients are based on these univariate statistics and on all valid pairs of observations.
    • If -alg=miss2, the univariate and bivariate statistics are computed from all valid pairs of observations for each pair of variables separately.
  13. The -a= argument selects the method for computing critical probabilities associated with the correlation coefficients.

    If -a=student, a classical Student-Fisher t test is used.

    No other test options are included in this version of NCSTAT to test correlation coefficients from data with missing values, but this optional parameter is still present for later use.

  14. The -rg= argument selects the method for computing the regression coefficients:

    • If -rg=reg1, the coefficients of the regression equation for predicting the netcdf_variable by the index_netcdf_variable are computed. This is the default.
    • If -rg=reg2, the coefficients of the regression equation for predicting the index_netcdf_variable by the netcdf_variable are computed.
  15. The -intercept argument specifies that the intercept coefficients of the regression equation must be computed and stored in the output NetCDF file. By default, the intercept coefficients are not computed.

  16. The -regstd argument specifies that the regression coefficients must be expressed in units of the input NetCDF variable netcdf_variable per standard deviation of the index_netcdf_variable. By default, the regression coefficients are expressed in units of the input NetCDF variables.

  17. The -mi=missing_value argument specifies the missing value indicator associated with the NetCDF variables in the output_netcdf_file. If the -mi= argument is not specified, missing_value is set to 1.e+20.

  18. The -double argument specifies that the results are stored as double-precision floating point numbers in the output NetCDF file. By default, the results are stored as single-precision floating point numbers in the output NetCDF file.

  19. The -bigfile argument is recognized only if the NCSTAT software has been compiled with the _USE_NETCDF36 or _USE_NETCDF4 CPP macros (e.g., -D_USE_NETCDF36 or -D_USE_NETCDF4) and linked to the NetCDF 3.6 library or higher. If this argument is specified, the output_netcdf_file will be a 64-bit offset format file instead of a NetCDF classic format file.

  20. The -hdf5 argument is recognized only if the NCSTAT software has been compiled with the _USE_NETCDF4 CPP macro (e.g., -D_USE_NETCDF4) and linked to the NetCDF 4 library or higher. If this argument is specified, the output_netcdf_file will be a NetCDF-4/HDF5 format file instead of a NetCDF classic format file.

  21. It is assumed that the specified netcdf_variable and index_netcdf_variable each have a scalar missing or _FillValue attribute and that missing values in the data are identified by the value of this attribute.

  22. Duplicate parameters are allowed, but it is always the last occurrence of a parameter that is used for the computations. Moreover, the number of specified parameters must not be greater than the total number of allowed parameters.

  23. For more details on correlation and regression analysis in the climate literature, see

    • “Statistical Analysis in Climate Research”, by von Storch, H., and Zwiers, F.W., Cambridge University Press, Cambridge, UK, Chapter 8, 484 pp., 2002. ISBN: 9780521012300
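
The index arithmetic described in remarks 4 and 11 can be checked with a short Python sketch before launching a long computation. The helper names are hypothetical, the wrap-around reading of a negative lon1 (indices nlon+lon1+1 to nlon, followed by 1 to lon2) is an interpretation of remark 4, and the defaults iperiodicity=1 and istep=1 used when -pi= is absent are an assumption.

```python
import math


def longitude_indices(lon1, lon2, nlon):
    """Return the 1-based longitude indices selected by -x=lon1,lon2."""
    if lon1 < 0:
        # Periodic grid assumed: start at nlon+lon1+1 and wrap around to lon2.
        start = nlon + lon1 + 1
        return list(range(start, nlon + 1)) + list(range(1, lon2 + 1))
    return list(range(lon1, lon2 + 1))


def time_periods_agree(time1, time2, periodicity,
                       itime1, itime2, iperiodicity=1, istep=1):
    """Check the equality of remark 11 between the two selected time periods."""
    lhs = (time2 - time1 + 1) / periodicity
    rhs = math.ceil((itime2 - itime1 - istep + 2) / iperiodicity)
    return lhs == rhs


# 100 years of monthly data against a 100-value yearly index.
print(time_periods_agree(1, 1200, 12, 1, 100))
# Same data against the January values of a monthly index (-pi=12,1).
print(time_periods_agree(1, 1200, 12, 1, 1200, iperiodicity=12, istep=1))
print(longitude_indices(-2, 3, 10))
```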

Outputs

comp_cor_miss_3d creates an output NetCDF file that contains the correlation and regression statistics and the critical probabilities associated with these coefficients, optionally taking into account the periodicity of the data as determined by the -p=periodicity argument. If -rg=reg1, the output NetCDF dataset contains periodicity time observations and the following NetCDF variables (in the description below, nlat and nlon are the lengths of the spatial dimensions of the input NetCDF variable):

  1. netcdf_variable_index_netcdf_variable_cor(periodicity,nlat,nlon) : the Pearson correlation coefficients between each point in the time series of the 2-D grid-mesh associated with the input NetCDF variable and the index_netcdf_variable time series.

  2. netcdf_variable_index_netcdf_variable_prob(periodicity,nlat,nlon) : the critical probabilities associated with two-sided tests of the correlation coefficients (i.e., the absolute value of the correlation is tested). These critical probabilities are computed under the null hypothesis that the corresponding correlation coefficients in the parent population are zero.

    The -a=type_of_analysis argument determines how these critical probabilities are computed.

  3. netcdf_variable_index_netcdf_variable_z(periodicity,nlat,nlon) : the Fisher z transforms of the correlation coefficients for each point in the time series of the 2-D grid-mesh associated with the input NetCDF variable and the index_netcdf_variable time series.

  4. netcdf_variable_index_netcdf_variable_reg(periodicity,nlat,nlon) : the regression coefficients for predicting each point in the time series of the 2-D grid-mesh associated with the input NetCDF variable by the index_netcdf_variable time series.

    By default, the regression coefficients are expressed in units of the input NetCDF variable netcdf_variable per unit of the index_netcdf_variable time series. However, if the -regstd argument is specified, the regression coefficients are expressed in units of the input NetCDF variable netcdf_variable per standard deviation of the index_netcdf_variable time series. Finally, if -rg=reg2 is specified, the roles of the input NetCDF variables netcdf_variable and index_netcdf_variable are interchanged and the fitted regression models are for predicting the index_netcdf_variable by each time series of the 2-D grid-mesh associated with the input NetCDF variable netcdf_variable.

  5. netcdf_variable_index_netcdf_variable_int(periodicity,nlat,nlon) : the intercept coefficients in the regression models for predicting each point in the time series of the 2-D grid-mesh associated with the input NetCDF variable by the index_netcdf_variable time series.

    This variable is stored only if the -intercept argument has been specified when calling comp_cor_miss_3d. Finally, if -rg=reg2 is specified, the roles of the input NetCDF variables netcdf_variable and index_netcdf_variable are interchanged and the fitted regression models are for predicting the index_netcdf_variable by each time series of the 2-D grid-mesh associated with the input NetCDF variable netcdf_variable.

  6. netcdf_variable_index_netcdf_variable_nobs(periodicity,nlat,nlon) : the number of observations used to compute the correlation and regression coefficients for each point in the time series of the 2-D grid-mesh associated with the input NetCDF variable.
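
The effect of the -rg=, -regstd and -intercept options on the regression outputs can be illustrated with an ordinary least-squares fit on complete data. This is a sketch under stated assumptions rather than the NCSTAT code: missing values are ignored here, the function name regression is hypothetical, and applying -regstd to the standard deviation of whichever series acts as predictor is an assumption for the reg2 case.

```python
import math


def regression(x, index, rg="reg1", regstd=False):
    """Return (slope, intercept) of the least-squares line.

    rg="reg1": predict x (one grid-point series) from the index.
    rg="reg2": predict the index from x.
    """
    response, predictor = (x, index) if rg == "reg1" else (index, x)
    n = len(response)
    mean_r = sum(response) / n
    mean_p = sum(predictor) / n
    cov = sum((r - mean_r) * (p - mean_p) for r, p in zip(response, predictor)) / n
    var_p = sum((p - mean_p) ** 2 for p in predictor) / n
    slope = cov / var_p
    intercept = mean_r - slope * mean_p
    if regstd:
        # Express the slope per standard deviation of the predictor.
        slope *= math.sqrt(var_p)
    return slope, intercept


x = [3.0, 5.0, 7.0, 9.0]
index = [1.0, 2.0, 3.0, 4.0]
print(regression(x, index))                # reg1: x predicted by the index
print(regression(x, index, rg="reg2"))     # reg2: index predicted by x
print(regression(x, index, regstd=True))   # slope per standard deviation
```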

All these statistics are packed in tridimensional variables whose first and second dimensions are exactly the same as those associated with the input NetCDF variable netcdf_variable even if you restrict the geographical domain with the -x= and -y= arguments. However, outside the selected domain, the output NetCDF variables are filled with missing values.
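
The repacking step can be pictured as follows: a full nlat by nlon field is first filled with the missing value, and the statistics computed on the selected sub-domain are then copied into place. The function name pack_on_full_grid and the list-of-lists representation are hypothetical choices made for this sketch; the default missing value of 1.e+20 is the one documented for the -mi= argument.

```python
def pack_on_full_grid(sub_values, nlat, nlon, lat1, lon1, missing_value=1.0e20):
    """Place a sub-domain 2-D field on an nlat x nlon grid.

    lat1 and lon1 are the 1-based indices of the first selected row and
    column. Cells outside the sub-domain keep the missing value.
    """
    full = [[missing_value] * nlon for _ in range(nlat)]
    for j, row in enumerate(sub_values):
        for i, value in enumerate(row):
            full[lat1 - 1 + j][lon1 - 1 + i] = value
    return full


# A 1 x 2 block of correlations placed at row 2, columns 2-3 of a 3 x 4 grid.
for row in pack_on_full_grid([[0.4, -0.1]], nlat=3, nlon=4, lat1=2, lon1=2):
    print(row)
```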

If -rg=reg2, the naming convention for the variables is reversed: the index_netcdf_variable is listed first and the netcdf_variable appears after it. For example, the name of the NetCDF variable storing the correlation coefficients will be index_netcdf_variable_netcdf_variable_cor instead of netcdf_variable_index_netcdf_variable_cor.
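
When post-processing the results in a script, the output variable names can be rebuilt from the naming rules above. The helpers below are hypothetical conveniences, not part of NCSTAT; the suffixes and the default file name pattern are the ones listed in this document, and whether the default file name also changes under -rg=reg2 is not stated, so it is left unchanged here.

```python
def output_names(netcdf_variable, index_netcdf_variable, rg="reg1", intercept=False):
    """Return the expected names of the output NetCDF variables."""
    if rg == "reg1":
        first, second = netcdf_variable, index_netcdf_variable
    else:
        # With -rg=reg2 the index variable is listed first.
        first, second = index_netcdf_variable, netcdf_variable
    suffixes = ["cor", "prob", "z", "reg"]
    if intercept:
        suffixes.append("int")
    suffixes.append("nobs")
    return [f"{first}_{second}_{suffix}" for suffix in suffixes]


def default_output_file(netcdf_variable, index_netcdf_variable):
    """Default output file name when -o= is not specified."""
    return f"cor_{netcdf_variable}.{index_netcdf_variable}.nc"


print(output_names("sst", "nino34"))
print(output_names("sst", "nino34", rg="reg2", intercept=True))
print(default_output_file("sst", "nino34"))
```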

Examples

  1. To compute monthly lead correlations between a tridimensional NetCDF variable sst in the NetCDF file HadISST2_sst.nc and a December-January Nino34 SST index in the NetCDF file HadISST2_sst_nino34_dj.nc, and store the results in a NetCDF file named cor_HadISST2_1m_sst_nino34_dj.nc, use the following command (note that cyclostationarity is assumed for the sst variable since -p=12 is specified):

    $ comp_cor_miss_3d \
      -f=HadISST2_sst.nc \
      -v=sst \
      -m=mesh_mask_HadISST2.nc \
      -p=12 \
      -fi=HadISST2_sst_nino34_dj.nc \
      -vi=sst \
      -o=cor_HadISST2_1m_sst_nino34_dj.nc
    