comp_reg_3d

Authors

Pascal Terray (LOCEAN/IPSL)

Latest revision

28/05/2024

Purpose

Estimate polynomial trends and regression models from time series extracted from a tridimensional variable in a NetCDF dataset and, optionally, remove the linear terms from the data by using linear least squares estimation and store the residuals and/or the predictions in a NetCDF dataset.

Optionally, regression diagnostics and statistical tests that assess the quality of the fitted regression model may also be computed and stored [Rusta] [Rustb].

Finally, if the NetCDF variable is four-dimensional, use comp_reg_4d instead of comp_reg_3d; if it is unidimensional, use comp_reg_1d.

This procedure is parallelized if OpenMP is used.

Further Details

Usage

$ comp_reg_3d \
  -f=input_netcdf_file \
  -v=netcdf_variable \
  -m=input_mesh_mask_netcdf_file           (optional) \
  -g=grid_type                             (optional : n, t, u, v, w, f) \
  -x=lon1,lon2                             (optional) \
  -y=lat1,lat2                             (optional) \
  -t=time1,time2                           (optional) \
  -p=periodicity                           (optional) \
  -a=type_of_analysis                      (optional : reg, residual, predict, all) \
  -o=output_netcdf_file                    (optional) \
  -to=time_origin                          (optional) \
  -dg=polynomial_degree                    (optional) \
  -fi=input_index_netcdf_file              (optional) \
  -vi=index_netcdf_variable                (optional) \
  -ti=itime1,itime2                        (optional) \
  -pi=iperiodicity,istep                   (optional) \
  -ni=index_for_2d_index_netcdf_variable   (optional) \
  -sm=smoothing_factor                     (optional) \
  -dv=dv_time1,dv_time2                    (optional) \
  -mi=missing_value                        (optional) \
  -use_eps=tol                             (optional) \
  -comp_min_norm                           (optional) \
  -add_mean                                (optional) \
  -rsquare                                 (optional) \
  -adjrsquare                              (optional) \
  -stderr                                  (optional) \
  -ftest                                   (optional) \
  -ttest                                   (optional) \
  -double                                  (optional) \
  -bigfile                                 (optional) \
  -hdf5                                    (optional) \
  -tlimited                                (optional)

By default

-m=
an input_mesh_mask_netcdf_file is not used
-g=
the grid_type is set to n which means that the 2-D grid-mesh associated with the input netcdf_variable is assumed to be regular or Gaussian
-x=
the whole longitude domain associated with the netcdf_variable
-y=
the whole latitude domain associated with the netcdf_variable
-t=
the whole time period associated with the netcdf_variable
-p=
the periodicity is set to 1
-a=
type_of_analysis is set to reg
-o=
output_netcdf_file is set to reg_netcdf_variable.index_netcdf_variable.nc if the -vi= argument is used and to reg_netcdf_variable.poly_trend.nc otherwise
-to=
If a polynomial trend is estimated, the origin for the time index is set to 0
-dg=
the polynomial_degree is set to 1. This means that a linear trend is estimated if the -vi= argument is not used
-fi=
a polynomial trend is estimated if the -vi= argument is not used, otherwise the -fi= argument may be used to specify the NetCDF dataset from which the index_netcdf_variable is extracted
-vi=
a polynomial trend is estimated if the -vi= argument is not used
-ti=
the whole time period associated with the index_netcdf_variable if the -vi=index_netcdf_variable is specified
-pi=
this parameter is not used
-ni=
if the index_netcdf_variable is bidimensional, the first time series is used
-sm=
no smoothing is applied to the index_netcdf_variable if the -vi= argument is used
-dv=
a dummy variable is not used
-mi=
the missing_value is set to 1.e+20 for the NetCDF variables in the output_netcdf_file
-use_eps=
no tolerance is used for solving the linear least squares problem associated with the specified regression model. If -use_eps= is used, the specified tolerance tol is used for solving the linear least squares problem
-comp_min_norm
the minimal norm solution is not computed. If -comp_min_norm is activated, the minimal norm solution of the linear least squares problem is computed
-add_mean
the means are not added to the residuals. If -add_mean is activated, the means are added to the residuals
-rsquare
the coefficient of determination is not computed. If -rsquare is activated, the coefficient of determination is computed
-adjrsquare
the adjusted coefficient of determination is not computed. If -adjrsquare is activated, the adjusted coefficient of determination is computed
-stderr
the standard errors of the estimates are not computed. If -stderr is activated, the standard errors of the estimates are computed
-ftest
an F-test for the regression model is not computed. If -ftest is activated, an F-test is computed
-ttest
Student-tests for the regression coefficients are not computed. If -ttest is activated, the Student-tests are computed
-double
the results are stored as single-precision floating point numbers in the output NetCDF file. If -double is activated, the results are stored as double-precision floating point numbers
-bigfile
a NetCDF classical format file is created. If -bigfile is activated, the output NetCDF file is a 64-bit offset format file
-hdf5
a NetCDF classical format file is created. If -hdf5 is activated, the output NetCDF file is a NetCDF-4/HDF5 format file
-tlimited
the time dimension is defined as unlimited in the output NetCDF file. However, if -tlimited is activated, the time dimension is defined as limited in the output NetCDF file

Remarks

  1. The -v=netcdf_variable argument specifies the NetCDF variable for which a regression analysis must be computed and the -f=input_netcdf_file argument specifies that this NetCDF variable must be extracted from the NetCDF file, input_netcdf_file.

  2. The optional argument -m=input_mesh_mask_netcdf_file specifies the land-sea mask to apply to netcdf_variable in order to transform this tridimensional NetCDF variable into a rectangular matrix of observed variables before estimating the regression model. By default, it is assumed that each cell in the 2-D grid-mesh associated with the input tridimensional NetCDF variable is a valid time series (i.e., missing values are not present).

    The geographical shapes of the netcdf_variable (in the input_netcdf_file) and the mask (in the input_mesh_mask_netcdf_file) must agree if an input_mesh_mask_netcdf_file is used.

    Refer to comp_clim_3d or comp_mask_3d for creating a valid input_mesh_mask_netcdf_file for regular or Gaussian grids before using comp_reg_3d.

  3. If -g= is set to t, u, v, w or f, it is assumed that the input NetCDF variable is from an experiment with the NEMO model (ORCA configuration and R2, R4 or R05 resolutions). This argument is also used to determine the name of the mesh_mask variable if an input_mesh_mask_netcdf_file is used.

  4. If the -x=lon1,lon2 and -y=lat1,lat2 arguments are missing, the whole geographical domain associated with the netcdf_variable is used.

    The longitude or latitude range must be a vector of two integers specifying the first and last selected indices along each dimension. The indices are relative to 1. Negative values are allowed for lon1. In this case the longitude domain is from nlon + lon1 + 1 to lon2 where nlon is the number of longitude points in the grid associated with the NetCDF variable and it is assumed that the grid is periodic.

    Refer to comp_mask_3d for transforming geographical coordinates into indices or generating an appropriate mesh-mask before using comp_reg_3d.

  5. If the -t=time1,time2 argument is missing, data in the whole time period associated with the netcdf_variable is taken into account. The selected time period is a vector of two integers specifying the first and last time observations. The indices are relative to 1.

    The length of the selected time period (i.e., time2 - time1 + 1) must be a whole multiple of the periodicity if the -p= argument is specified.

  6. The -p=periodicity argument gives the periodicity of the input data for the netcdf_variable. For example, with monthly data -p=12 should be specified, with yearly data -p=1 may be used, etc.

    If the -p=periodicity argument is specified, the regression models are computed by taking into account the periodicity of the data. This means that periodicity separate regression models (one for each phase of the cycle) are estimated for each cell in the 2-D grid-mesh associated with the input tridimensional NetCDF variable.

  7. The -a= argument specifies the statistics which must be computed and whether the residuals and/or predictions from the trend or regression models are stored in the output NetCDF file. If:

    • -a=reg, the default, the residuals or predictions are not computed and only the regression and intercept coefficients are computed and stored
    • -a=residual, the residuals are computed and stored, in addition to the regression and intercept coefficients
    • -a=predict, the predictions are computed and stored, in addition to the regression and intercept coefficients
    • -a=all, the residuals and predictions are computed and stored, in addition to the regression and intercept coefficients.

  8. The -to=time_origin argument specifies the origin for the time index if a polynomial regression model is used (i.e., when the -vi= argument is not used).

    By default, the origin for the time index is set to 0. This shift of the zero point for the time index makes the estimate for the intercept(s) in the polynomial model equal to the model prediction(s) for the first observation at the beginning of the record.

  9. The -dg=polynomial_degree argument specifies the degree of the polynomial if a polynomial trend regression model is used (i.e., when the -vi= argument is not used).

    If -dg=1 a linear trend is used, if -dg=2 a quadratic trend is used, etc.

  10. The -vi=index_netcdf_variable argument specifies a predictor time series for the regression model (i.e., an independent variable).

    If the -vi=index_netcdf_variable argument is present, the -fi= argument specifies the NetCDF dataset which contains the index_netcdf_variable. The -fi= argument may be omitted only if this dataset is the same as the NetCDF dataset specified by the -f= argument.

    If the -vi=index_netcdf_variable argument is not specified, a regression model with a polynomial trend is assumed and the -dg= argument specifies the degree of the polynomial: if -dg=1 a linear trend is used, if -dg=2 a quadratic trend is used, etc.

  11. If the -ti=itime1,itime2 argument is missing, data in the whole time period associated with the index_netcdf_variable is taken into account. The selected time period is a vector of two integers specifying the first and last time observations. The indices are relative to 1. If the -vi= argument is not present, this argument is not used.

  12. The selected time periods for the netcdf_variable and index_netcdf_variable must agree. This means that the following equality must hold

    (time2 - time1 + 1)/periodicity = ceiling((itime2 - itime1 - istep + 2)/iperiodicity),

    otherwise, an error message will be issued and the program will stop.

  13. The -pi= argument gives the periodicity and selects the time step for the index_netcdf_variable. For example, to compute regression models with the January monthly time series extracted from the index_netcdf_variable, which is assumed to be sampled every month, -pi=12,1 should be specified, with yearly data -pi=1,1 may be used, etc.

    If the -vi= argument is not present, this argument is not used.

  14. The -ni= argument specifies the index (an integer) for selecting the time series if the index_netcdf_variable specified in the -vi= argument is a 2D NetCDF variable.

  15. The -sm=smoothing_factor argument means that the time series associated with the index_netcdf_variable (i.e., the -vi= argument) is smoothed with a moving average of approximately 2*smoothing_factor + 1 terms before estimating the regression parameters for predicting the netcdf_variable (i.e., the -v= argument) from the index_netcdf_variable. smoothing_factor must be a strictly positive integer.

    If the -vi= argument is not present, this argument is not used.

  16. If the -dv=dv_time1,dv_time2 argument is specified, a dummy variable is also included in the regression model. The dummy variable is an absence/presence variable (i.e., with values 0 or 1). The dv_time1 and dv_time2 integers specify the first and last time observations of the selected time period in which the dummy variable is set to 1.

    These time indices are counted from the start of the (selected) time period (i.e., time1 in the -t=time1,time2 argument or 1 if this argument is missing) and must take into account the periodicity of the data if the -p= argument is specified.

  17. The -mi=missing_value argument specifies the missing value indicator associated with the NetCDF variables in the output_netcdf_file. If the -mi= argument is not specified, missing_value is set to 1.e+20.

  18. The -use_eps=tol argument is used to determine the effective rank of the linear least squares problem. tol must be set to the relative precision of the elements in the NetCDF variables specified by the -v= and -vi= arguments. If each element is correct to, say, 5 digits, then tol=0.00001 should be used. tol must be greater than or equal to 0 and less than 1, otherwise an error message is printed and the program stops. If the -use_eps= argument is not used, the numerical rank is determined.

  19. If -comp_min_norm is specified, a complete orthogonal factorization of the coefficient matrix and the minimum 2-norm solutions are computed.

  20. If -add_mean is specified, the means are added to the residuals of the regression model. This option has an effect only if -a=residual or -a=all (i.e., if the residuals are computed and stored in the output_netcdf_file).

    By default, the means of the residuals in the output NetCDF file are zero.

  21. If -rsquare is specified, the coefficient of determination of the specified model is computed.

  22. If -adjrsquare is specified, the adjusted coefficient of determination of the specified model is computed.

  23. If -stderr is specified, the standard errors of the estimated regression coefficients are computed (unless the specified model is not of full rank).

  24. If -ftest is specified, an F-test is performed to test the null hypothesis that all the regression coefficients are zero (except the intercept).

  25. If -ttest is specified, Student-tests are performed to test the null hypothesis that the regression coefficients are zero (independently of the other regression coefficients).

  26. The -double argument specifies that the results are stored as double-precision floating point numbers in the output NetCDF file.

    By default, the results are stored as single-precision floating point numbers in the output NetCDF file.

  27. The -bigfile argument is allowed only if the NCSTAT software has been compiled with the _USE_NETCDF36 or _USE_NETCDF4 CPP macros (e.g., -D_USE_NETCDF36 or -D_USE_NETCDF4) and linked to the NetCDF 3.6 library or higher.

    If this argument is specified, the output_netcdf_file will be a 64-bit offset format file instead of a NetCDF classic format file. However, this argument is recognized in the procedure only if the NCSTAT software has been built with the _USE_NETCDF36 or _USE_NETCDF4 CPP macros.

  28. The -hdf5 argument is allowed only if the NCSTAT software has been compiled with the _USE_NETCDF4 CPP macro (e.g., -D_USE_NETCDF4) and linked to the NetCDF 4 library or higher.

    If this argument is specified, the output_netcdf_file will be a NetCDF-4/HDF5 format file instead of a NetCDF classic format file. However, this argument is recognized in the procedure only if the NCSTAT software has been built with the _USE_NETCDF4 CPP macro.

  29. The -tlimited argument specifies that the time dimension must be defined as limited in the output NetCDF file.

    By default, this time dimension is defined as unlimited in the output NetCDF file.

  30. Duplicate parameters are allowed, but only the last occurrence of a parameter is used for the computations. Moreover, the number of specified parameters must not be greater than the total number of allowed parameters.

  31. It is assumed that the data has no missing values except those associated with a constant land-sea mask if the -m= argument is used.

  32. For more details on regression analysis in the climate literature see

    • “Fitting nature's basic functions Part I: polynomials and linear least squares”, by Rust, B.W., Computing in Science and Engineering, Vol. 3, no 5, 84-89, 2001. doi: 10.1109/MCISE.2001.947111
    • “Fitting nature's basic functions Part II: estimating uncertainties and testing hypotheses”, by Rust, B.W., Computing in Science and Engineering, Vol. 3, no 6, 60-64, 2001. doi: 10.1109/5992.963429
    • “Fitting nature's basic functions Part III: exponentials, sinusoids, and nonlinear least squares”, by Rust, B.W., Computing in Science and Engineering, Vol. 4, no 4, 72-77, 2002. doi: 10.1109/MCISE.2002.1014982
    • “Statistical Analysis in Climate Research”, by von Storch, H., and Zwiers, F.W., Cambridge University press, Cambridge, UK, Chapter 8, 484 pp., 2002. ISBN: 9780521012300
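
The index arithmetic described in Remarks 4, 5 and 12 can be checked by hand before running comp_reg_3d. The following Python sketch is only an illustration of those rules as they are worded above, not code taken from NCSTAT: the function names selected_longitudes and periods_agree are hypothetical, and the handling of lon1=0 (rejected here) is an assumption, since the remarks do not describe that case.

```python
def selected_longitudes(lon1, lon2, nlon):
    """Return the 1-based longitude indices selected by -x=lon1,lon2 (Remark 4)."""
    if lon1 >= 1:
        return list(range(lon1, lon2 + 1))
    if lon1 < 0:
        # Periodic grid: start at nlon + lon1 + 1, wrap around, stop at lon2.
        start = nlon + lon1 + 1
        return list(range(start, nlon + 1)) + list(range(1, lon2 + 1))
    # Assumption: lon1 = 0 is not described in the remarks, so it is rejected.
    raise ValueError("lon1 must be a non-zero integer")


def periods_agree(time1, time2, periodicity, itime1, itime2, iperiodicity, istep):
    """Check the conditions of Remarks 5 and 12 on the selected time periods."""
    ntime = time2 - time1 + 1
    if ntime % periodicity != 0:
        return False  # Remark 5: length must be a whole multiple of periodicity
    left = ntime // periodicity
    # Integer ceiling of (itime2 - itime1 - istep + 2) / iperiodicity.
    right = -(-(itime2 - itime1 - istep + 2) // iperiodicity)
    return left == right


# Hypothetical case: 138 bimonthly observations with -p=6 and -pi=6,1.
print(selected_longitudes(-2, 3, 10))
print(periods_agree(1, 138, 6, 1, 138, 6, 1))
```

In the hypothetical case shown, both sides of the equality in Remark 12 evaluate to 23, so the check passes; shortening the index period to 1,132 would give 22 on the right-hand side and the check would fail.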

Outputs

comp_reg_3d creates an output NetCDF file that contains the regression statistics and statistical tests associated with these coefficients, taking into account, where applicable, the periodicity of the data as determined by the -p=periodicity argument.

If the -vi=index_netcdf_variable is specified, the output NetCDF data set will have periodicity time observations and may contain the following NetCDF variables depending on the arguments used in calling the procedure (in the description below, nlat and nlon are the spatial dimensions of the input NetCDF variable) :

  1. netcdf_variable_index_netcdf_variable_reg0(periodicity,nlat,nlon) : the intercept coefficients in the regression models for predicting each grid-point in the time series of the 2-D grid-mesh associated with the input NetCDF variable netcdf_variable by the index_netcdf_variable time series.

  2. netcdf_variable_index_netcdf_variable_reg1(periodicity,nlat,nlon) : the regression coefficients between each grid-point in the time series of the 2-D grid-mesh associated with the input NetCDF variable netcdf_variable and the index_netcdf_variable time series.

    By default, the regression coefficients are expressed in units of the input NetCDF variable netcdf_variable by unit of the index_netcdf_variable time series.

  3. netcdf_variable_index_netcdf_variable_stderr0(periodicity,nlat,nlon) : the standard-errors of the intercept coefficients in the regression models for predicting each grid-point in the time series of the 2-D grid-mesh associated with the input NetCDF variable netcdf_variable by the index_netcdf_variable time series.

  4. netcdf_variable_index_netcdf_variable_stderr1(periodicity,nlat,nlon) : the standard-errors of the regression coefficients between each grid-point in the time series of the 2-D grid-mesh associated with the input NetCDF variable netcdf_variable and the index_netcdf_variable time series.

  5. netcdf_variable_index_netcdf_variable_tprob0(periodicity,nlat,nlon) : the Student t-test probabilities associated with the intercept coefficients in the regression models for predicting each grid-point in the time series of the 2-D grid-mesh associated with the input NetCDF variable netcdf_variable by the index_netcdf_variable time series.

  6. netcdf_variable_index_netcdf_variable_tprob1(periodicity,nlat,nlon) : the Student t-test probabilities associated with the regression coefficients between each grid-point in the time series of the 2-D grid-mesh associated with the input NetCDF variable netcdf_variable and the index_netcdf_variable time series.

  7. netcdf_variable_index_netcdf_variable_dv(periodicity,nlat,nlon) : the regression coefficients between each grid-point in the time series of the 2-D grid-mesh associated with the input NetCDF variable netcdf_variable and the dummy variable time series if a dummy variable is included in the regression models.

    This variable is stored only if the -dv= argument has been specified when calling comp_reg_3d.

  8. netcdf_variable_index_netcdf_variable_dv_stderr(periodicity,nlat,nlon) : the standard-errors of the regression coefficients between each grid-point in the time series of the 2-D grid-mesh associated with the input NetCDF variable netcdf_variable and the dummy variable time series if a dummy variable is included in the regression models.

    This variable is stored only if the -dv= argument has been specified when calling comp_reg_3d.

  9. netcdf_variable_index_netcdf_variable_dv_tprob(periodicity,nlat,nlon) : the Student t-test probabilities associated with the regression coefficients between each grid-point in the time series of the 2-D grid-mesh associated with the input NetCDF variable netcdf_variable and the dummy variable time series if a dummy variable is included in the regression models.

    This variable is stored only if the -dv= argument has been specified when calling comp_reg_3d.

  10. netcdf_variable_index_netcdf_variable_r2(periodicity,nlat,nlon) : the r-square statistics associated with the regression models for predicting each grid-point in the time series of the 2-D grid-mesh associated with the input NetCDF variable netcdf_variable by the index_netcdf_variable time series.

  11. netcdf_variable_index_netcdf_variable_adjr2(periodicity,nlat,nlon) : the adjusted r-square statistics associated with the regression models for predicting each grid-point in the time series of the 2-D grid-mesh associated with the input NetCDF variable netcdf_variable by the index_netcdf_variable time series.

  12. netcdf_variable_index_netcdf_variable_fprob(periodicity,nlat,nlon) : the F-test probabilities associated with the regression models for predicting each grid-point in the time series of the 2-D grid-mesh associated with the input NetCDF variable netcdf_variable by the index_netcdf_variable time series.

  13. netcdf_variable_index_netcdf_variable_predict(ntime,nlat,nlon) : the predictions associated with the regression models for predicting each grid-point in the time series of the 2-D grid-mesh associated with the input NetCDF variable netcdf_variable by the index_netcdf_variable time series.

  14. netcdf_variable_index_netcdf_variable_resid(ntime,nlat,nlon) : the residuals associated with the regression models for predicting each grid-point in the time series of the 2-D grid-mesh associated with the input NetCDF variable netcdf_variable by the index_netcdf_variable time series.

All these statistics are packed in tridimensional variables whose first and second dimensions are exactly the same as those associated with the input NetCDF variable netcdf_variable, even if you restrict the geographical domain with the -x= and -y= arguments. However, outside the selected domain, the output NetCDF variables are filled with missing values.

If the -vi=index_netcdf_variable is not specified when calling the procedure, similar statistics will be produced and each component of the selected polynomial trend regression model will have regression coefficients, standard-errors and Student t-test probabilities associated with it. As an illustration, the intercept and the regression coefficients (e.g., the slope) in a linear trend regression model will be stored in NetCDF variables netcdf_variable_poly_trend_reg0 and netcdf_variable_poly_trend_reg1, respectively. For a quadratic trend regression model, in addition to these variables, the regression coefficients associated with the quadratic component will be stored in a NetCDF variable netcdf_variable_poly_trend_reg2. The same naming conventions are used for the standard-errors and Student t-test probabilities associated with each component of the selected polynomial trend regression model.
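
To make the naming convention for the polynomial trend case concrete, the short Python sketch below builds the expected variable names for a given degree. It is an illustration only: the function name poly_trend_variable_names is hypothetical, and the names generated for the standard-errors and t-test probabilities (suffixes stderrK and tprobK) are inferred by analogy with the index case listed above, so they should be checked against an actual output file.

```python
def poly_trend_variable_names(netcdf_variable, degree, stderr=False, ttest=False):
    """List the output variable names expected for a polynomial trend model."""
    prefix = f"{netcdf_variable}_poly_trend"
    kinds = ["reg"]
    if stderr:
        kinds.append("stderr")  # assumed suffix, by analogy with the index case
    if ttest:
        kinds.append("tprob")   # assumed suffix, by analogy with the index case
    # One variable per polynomial component: 0 (intercept) up to the degree.
    return [f"{prefix}_{kind}{k}" for kind in kinds for k in range(degree + 1)]


# Quadratic trend on a variable named mslp, with -stderr requested.
for name in poly_trend_variable_names("mslp", 2, stderr=True):
    print(name)
```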

Examples

  1. To linearly detrend bimonthly data from a tridimensional NetCDF variable mslp in the NetCDF file mslp.seas.mean.nc and store the results in a NetCDF file named reg_mslp_seas_ncep2.nc, use the following command (note that cyclostationarity is assumed for the mslp variable since -p=6 is specified):

    $ comp_reg_3d \
      -f=mslp.seas.mean.nc \
      -v=mslp \
      -m=mesh_mask_mslp_ncep2.nc \
      -p=6 \
      -a=residual \
      -o=reg_mslp_seas_ncep2.nc
    
  2. To compute bimonthly lag regressions and residuals from a tridimensional NetCDF variable mslp in the NetCDF file mslp.seas.mean.nc and a February-March Nino34 SST index in the NetCDF file sst_nino34_7901_seas.nc, and store the results in a NetCDF file named reg_mslp_seas_ncep2_nino34_23.nc, use the following command (note that cyclostationarity is assumed both for the mslp variable, since -p=6 is specified, and for the index variable, since -pi=6,1 is specified):

    $ comp_reg_3d \
      -f=mslp.seas.mean.nc \
      -v=mslp \
      -m=mesh_mask_mslp_ncep2.nc \
      -p=6 \
      -fi=sst_nino34_7901_seas.nc \
      -vi=sst \
      -pi=6,1 \
      -a=residual \
      -o=reg_mslp_seas_ncep2_nino34_23.nc
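
As a rough picture of what the first example computes at a single grid point, the Python sketch below fits a separate linear trend to each phase of the cycle and returns the residuals. It is a minimal illustration under stated assumptions, not the NCSTAT implementation: the function name detrend_by_season is hypothetical, the time index is assumed to be 0, 1, 2, ... counted in cycles (which matches the default time origin of 0), and a closed-form simple regression is used in place of the general least squares solver.

```python
def detrend_by_season(series, periodicity=1, add_mean=False):
    """Fit a linear trend to each phase of the cycle and return the residuals.

    Returns (coefs, residuals), where coefs[phase] = (intercept, slope).
    """
    n = len(series)
    if n % periodicity != 0:
        raise ValueError("series length must be a whole multiple of periodicity")
    if n // periodicity < 2:
        raise ValueError("at least two cycles are needed to fit a linear trend")

    coefs = []
    residuals = [0.0] * n
    for phase in range(periodicity):
        y = series[phase::periodicity]   # observations for this phase only
        m = len(y)
        t = list(range(m))               # assumed time index, origin at 0
        t_mean = sum(t) / m
        y_mean = sum(y) / m
        sxx = sum((ti - t_mean) ** 2 for ti in t)
        sxy = sum((ti - t_mean) * (yi - y_mean) for ti, yi in zip(t, y))
        slope = sxy / sxx
        intercept = y_mean - slope * t_mean
        coefs.append((intercept, slope))
        for k, (ti, yi) in enumerate(zip(t, y)):
            resid = yi - (intercept + slope * ti)
            if add_mean:
                resid += y_mean          # counterpart of the -add_mean option
            residuals[phase + k * periodicity] = resid
    return coefs, residuals


# Toy series with two phases per cycle (the examples above use six).
series = [1, 10, 3, 9, 5, 8, 7, 7]
coefs, residuals = detrend_by_season(series, periodicity=2)
print(coefs)
print(residuals)
```

With the origin at 0, each intercept equals the fitted value at the first observation of its phase, and the residuals of each phase have zero mean unless add_mean is set.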
    