comp_reg_1d

Authors

Pascal Terray (LOCEAN/IPSL)

Latest revision

28/05/2024

Purpose

Estimate polynomial trends and regression models from a time series extracted from a unidimensional variable in a NetCDF dataset and, optionally, remove the linear terms from the data by using linear least squares estimation and store the residuals and/or the predictions in a NetCDF dataset.
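As a rough illustration of the linear-trend case described above, the sketch below fits y[t] = b0 + b1 * t by ordinary least squares and derives the predictions and residuals. It is not the NCSTAT implementation: the function names are hypothetical, and it assumes that the first observation is assigned the time index time_origin.

```python
def fit_linear_trend(y, time_origin=0):
    """Ordinary least-squares fit of y[t] = b0 + b1 * t.

    Assumption: the first observation has time index `time_origin`
    and consecutive observations are one time unit apart.
    """
    n = len(y)
    t = [time_origin + i for i in range(n)]
    t_mean = sum(t) / n
    y_mean = sum(y) / n
    sxx = sum((ti - t_mean) ** 2 for ti in t)
    sxy = sum((ti - t_mean) * (yi - y_mean) for ti, yi in zip(t, y))
    slope = sxy / sxx
    intercept = y_mean - slope * t_mean
    return intercept, slope


def predictions_and_residuals(y, intercept, slope, time_origin=0):
    """Return (predictions, residuals) for the fitted linear trend."""
    pred = [intercept + slope * (time_origin + i) for i in range(len(y))]
    resid = [yi - pi for yi, pi in zip(y, pred)]
    return pred, resid


# Small synthetic series, for illustration only.
series = [2.1, 2.4, 3.2, 3.4, 4.1, 4.4]
b0, b1 = fit_linear_trend(series)
pred, resid = predictions_and_residuals(series, b0, b1)
```

With an intercept in the model, the least-squares residuals sum to zero, which matches the remark below that the residuals have zero mean unless -add_mean is specified.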

Optionally, regression diagnostics and statistical tests that assess the quality of the fitted regression model may also be computed and stored [Rusta] [Rustb].
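To make the diagnostics mentioned above concrete, here is a minimal sketch of the usual textbook formulas for a regression with one predictor and an intercept. The function name and dictionary keys are hypothetical, and only the test statistics are computed: the associated probabilities require the Student and F distributions, which are not available in the Python standard library.

```python
import math


def simple_regression_diagnostics(x, y):
    """Textbook diagnostics for the model y = b0 + b1 * x (one predictor)."""
    n = len(x)
    k = 1  # number of predictors, excluding the intercept
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean

    # Residual and total sums of squares.
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - y_mean) ** 2 for yi in y)

    r2 = 1.0 - sse / sst
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)
    sigma2 = sse / (n - k - 1)  # unbiased estimate of the error variance
    se_slope = math.sqrt(sigma2 / sxx)
    se_intercept = math.sqrt(sigma2 * (1.0 / n + x_mean ** 2 / sxx))
    return {
        "intercept": intercept,
        "slope": slope,
        "r2": r2,
        "adj_r2": adj_r2,
        "se_intercept": se_intercept,
        "se_slope": se_slope,
        "t_slope": slope / se_slope,
        "f_stat": ((sst - sse) / k) / sigma2,
    }


# Small synthetic data set, for illustration only.
diag = simple_regression_diagnostics([0, 1, 2, 3, 4], [1.0, 2.9, 5.2, 7.1, 8.8])
```

With a single predictor, the F statistic equals the square of the t statistic of the slope, which is a convenient sanity check on any implementation.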

Finally, if the NetCDF variable is tridimensional, use comp_reg_3d instead of comp_reg_1d; if it is four-dimensional, use comp_reg_4d.

Further Details

Usage

$ comp_reg_1d \
  -f=input_netcdf_file \
  -v=netcdf_variable \
  -t=time1,time2                           (optional) \
  -p=periodicity                           (optional) \
  -a=type_of_analysis                      (optional : reg, residual, predict, all) \
  -o=output_netcdf_file                    (optional) \
  -to=time_origin                          (optional) \
  -dg=polynomial_degree                    (optional) \
  -fi=input_index_netcdf_file              (optional) \
  -vi=index_netcdf_variable                (optional) \
  -ti=itime1,itime2                        (optional) \
  -pi=iperiodicity,istep                   (optional) \
  -ni=index_for_2d_index_netcdf_variable   (optional) \
  -sm=smoothing_factor                     (optional) \
  -dv=dv_time1,dv_time2                    (optional) \
  -mi=missing_value                        (optional) \
  -use_eps=tol                             (optional) \
  -comp_min_norm                           (optional) \
  -add_mean                                (optional) \
  -rsquare                                 (optional) \
  -adjrsquare                              (optional) \
  -stderr                                  (optional) \
  -ftest                                   (optional) \
  -ttest                                   (optional) \
  -double                                  (optional) \
  -hdf5                                    (optional) \
  -tlimited                                (optional)

By default

-t=
the whole time period associated with the netcdf_variable
-p=
the periodicity is set to 1
-a=
type_of_analysis is set to reg
-o=
output_netcdf_file is set to reg_netcdf_variable.index_netcdf_variable.nc if the -vi= argument is used and to reg_netcdf_variable.poly_trend.nc otherwise
-to=
If a polynomial trend is estimated, the origin for the time index is set to 0
-dg=
the polynomial_degree is set to 1. This means that a linear trend is estimated if the -vi= argument is not used
-fi=
a polynomial trend is estimated if the -vi= argument is not used, otherwise the -fi= argument may be used to specify the NetCDF dataset from which the index_netcdf_variable is extracted
-vi=
a polynomial trend is estimated if the -vi= argument is not used
-ti=
the whole time period associated with the index_netcdf_variable if the -vi=index_netcdf_variable is specified
-pi=
this parameter is not used
-ni=
if the index_netcdf_variable is bidimensional, the first time series is used
-sm=
no smoothing is applied to the index_netcdf_variable if the -vi= argument is used
-dv=
a dummy variable is not used
-mi=
the missing_value is set to 1.e+20 for the NetCDF variables in the output_netcdf_file
-use_eps=
no tolerance is used for solving the linear least squares problem associated with the specified regression model. If -use_eps= is used, the specified tolerance tol is used for solving the linear least squares problem
-comp_min_norm
the minimal norm solution is not computed. If -comp_min_norm is activated, the minimal norm solution of the linear least squares problem is computed
-add_mean
the means are not added to the residuals. If -add_mean is activated, the means are added to the residuals
-rsquare
the coefficient of determination is not computed. If -rsquare is activated, the coefficient of determination is computed
-adjrsquare
the adjusted coefficient of determination is not computed. If -adjrsquare is activated, the adjusted coefficient of determination is computed
-stderr
the standard errors of the estimates are not computed. If -stderr is activated, the standard errors of the estimates are computed
-ftest
an F-test for the regression model is not computed. If -ftest is activated, an F-test is computed
-ttest
Student-tests for the regression coefficients are not computed. If -ttest is activated, the Student-tests are computed
-double
the results are stored as single-precision floating point numbers in the output NetCDF file. If -double is activated, the results are stored as double-precision floating point numbers
-hdf5
a NetCDF classical format file is created. If -hdf5 is activated, the output NetCDF file is a NetCDF-4/HDF5 format file
-tlimited
the time dimension is defined as unlimited in the output NetCDF file. However, if -tlimited is activated, the time dimension is defined as limited in the output NetCDF file

Remarks

  1. The -v=netcdf_variable argument specifies the NetCDF variable for which a regression analysis must be computed and the -f=input_netcdf_file argument specifies that this NetCDF variable must be extracted from the NetCDF file, input_netcdf_file.

  2. If the -t=time1,time2 argument is missing, data in the whole time period associated with the netcdf_variable is taken into account. The selected time period is a vector of two integers specifying the first and last time observations. The indices are relative to 1.

    The length of the selected time period (i.e., time2 - time1 + 1) must be a whole multiple of the periodicity if the -p= argument is specified.

  3. The -p=periodicity argument gives the periodicity of the input data for the netcdf_variable. For example, with monthly data -p=12 should be specified, with yearly data -p=1 may be used, etc. If the -p=periodicity argument is specified, the regression models are computed by taking into account the periodicity of the data. This means that periodicity separate regression models are estimated for the input unidimensional NetCDF variable.

  4. The -a= argument specifies which statistics must be computed and whether the residuals and/or predictions from the trend or regression models are stored in the output NetCDF file. If:

    • -a=reg, the default, the residuals or predictions are not computed and only the regression and intercept coefficients are computed and stored
    • -a=residual, the residuals are computed and stored, in addition to the regression and intercept coefficients
    • -a=predict, the predictions are computed and stored, in addition to the regression and intercept coefficients
    • -a=all, the residuals and predictions are computed and stored, in addition to the regression and intercept coefficients.
  5. The -to=time_origin argument specifies the origin for the time index if a polynomial regression model is used (i.e., when the -vi= argument is not used).

    By default, the origin for the time index is set to 0. This shift of the zero point for the time index makes the estimate for the intercept(s) in the polynomial model equal to the model prediction(s) for the first observation at the beginning of the record.

  6. The -dg=polynomial_degree argument specifies the degree of the polynomial if a polynomial trend regression model is used (i.e., when the -vi= argument is not used).

    If -dg=1, a linear trend is used; if -dg=2, a quadratic trend is used; etc.

  7. The -vi=index_netcdf_variable argument specifies a predictor time series for the regression model (i.e., an independent variable).

    If the -vi=index_netcdf_variable argument is present, the -fi= argument specifies the NetCDF dataset that contains the index_netcdf_variable. The -fi= argument may be omitted if this dataset is the same as the one specified by the -f= argument.

    If the -vi=index_netcdf_variable argument is not specified, a regression model with a polynomial trend is assumed and the -dg= argument specifies the degree of the polynomial: if -dg=1, a linear trend is used; if -dg=2, a quadratic trend is used; etc.

  8. If the -ti=itime1,itime2 argument is missing, data in the whole time period associated with the index_netcdf_variable is taken into account. The selected time period is a vector of two integers specifying the first and last time observations. The indices are relative to 1. If the -vi= argument is not present, this argument is not used.

  9. The selected time periods for the netcdf_variable and index_netcdf_variable must agree. This means that the following equality must hold:

    (time2 - time1 + 1)/periodicity = ceiling((itime2 - itime1 - istep + 2)/iperiodicity),

    otherwise, an error message will be issued and the program will stop.

  10. The -pi= argument gives the periodicity and selects the time step for the index_netcdf_variable. For example, to compute regression models with the January monthly time series extracted from the index_netcdf_variable, which is assumed to be sampled every month, -pi=12,1 should be specified, with yearly data -pi=1,1 may be used, etc.

    If the -vi= argument is not present, this argument is not used.

  11. The -ni= argument specifies the index (i.e., an integer) for selecting the time series if the index_netcdf_variable specified in the -vi= argument is a 2D NetCDF variable.

  12. The -sm=smoothing_factor argument means that the time series associated with the index_netcdf_variable (i.e., the -vi= argument) must be smoothed with a moving average of approximately 2*smoothing_factor + 1 terms before estimating the regression parameters for predicting the netcdf_variable (i.e., the -v= argument) from the index_netcdf_variable. smoothing_factor must be a strictly positive integer.

    If the -vi= argument is not present, this argument is not used and has no effect.

  13. If the -dv=dv_time1,dv_time2 argument is specified, a dummy variable is also included in the regression model. The dummy variable is a presence/absence variable (i.e., with values 0 or 1), and the dv_time1 and dv_time2 integers specify the first and last time observations of the selected time period in which the dummy variable is set to 1.

    These time indices are counted from the start of the selected time period (i.e., time1 in the -t=time1,time2 argument or 1 if this argument is missing) and must take into account the periodicity of the data if the -p= argument is specified.

  14. The -mi=missing_value argument specifies the missing value indicator associated with the NetCDF variables in the output_netcdf_file.

    If the -mi= argument is not specified, missing_value is set to 1.e+20.

  15. The -use_eps=tol argument is used to determine the effective rank of the linear least squares problem. tol must be set to the relative precision of the elements in the NetCDF variables specified by the -v= and -vi= arguments.

    If each element is correct to, say, 5 digits, then tol=0.00001 should be used. tol must be greater than or equal to 0 and less than 1; otherwise, an error message is printed and the program stops. If the -use_eps= argument is not used, the numerical rank is determined.

  16. If -comp_min_norm is specified, a complete orthogonal factorization of the coefficient matrix and the minimum 2-norm solutions are computed.

  17. If -add_mean is specified, the means are added to the residuals of the regression model. This option has an effect only if -a=residual or -a=all (i.e., if the residuals are computed and stored in the output_netcdf_file).

    By default, the means of the residuals in the output NetCDF file are zero.

  18. If -rsquare is specified, the coefficient of determination of the specified model is computed.

  19. If -adjrsquare is specified, the adjusted coefficient of determination of the specified model is computed.

  20. If -stderr is specified, the standard errors of the estimated regression coefficients are computed (unless the specified model is not of full rank).

  21. If -ftest is specified, an F-test is performed to test the null hypothesis that all the regression coefficients (except the intercept) are zero.

  22. If -ttest is specified, Student-tests are performed to test the null hypothesis that the regression coefficients are zero (independently of the other regression coefficients).

  23. The -double argument specifies that the results are stored as double-precision floating point numbers in the output NetCDF file.

    By default, the results are stored as single-precision floating point numbers in the output NetCDF file.

  24. The -hdf5 argument is allowed only if the NCSTAT software has been compiled with the _USE_NETCDF4 CPP macro (i.e., -D_USE_NETCDF4) and linked to version 4 or higher of the NetCDF library.

    If this argument is specified, the output_netcdf_file will be a NetCDF-4/HDF5 format file instead of a NetCDF classic format file.

  25. The -tlimited argument specifies that the time dimension must be defined as limited in the output NetCDF file.

    By default, this time dimension is defined as unlimited in the output NetCDF file.

  26. Duplicate parameters are allowed, but only the last occurrence of a parameter is used for the computations. Moreover, the number of specified parameters must not be greater than the total number of allowed parameters.

  27. It is assumed that the data has no missing values.

  28. For more details on regression analysis in the climate literature, see

    • “Fitting nature's basic functions Part I: polynomials and linear least squares”, by Rust, B.W., Computing in Science and Engineering, Vol. 3, no 5, 84-89, 2001. doi: 10.1109/MCISE.2001.947111
    • “Fitting nature's basic functions Part II: estimating uncertainties and testing hypotheses”, by Rust, B.W., Computing in Science and Engineering, Vol. 3, no 6, 60-64, 2001. doi: 10.1109/5992.963429
    • “Fitting nature's basic functions Part III: exponentials, sinusoids, and nonlinear least squares”, by Rust, B.W., Computing in Science and Engineering, Vol. 4, no 4, 72-77, 2002. doi: 10.1109/MCISE.2002.1014982
    • “Statistical Analysis in Climate Research”, by von Storch, H., and Zwiers, F.W., Cambridge University press, Cambridge, UK, Chapter 8, 484 pp., 2002. ISBN: 9780521012300
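The consistency conditions stated in remarks 2 and 9 can be checked before calling the program. The following sketch simply evaluates both sides of the equality given in remark 9; the function names and the default values of the keyword arguments are hypothetical and are not part of NCSTAT.

```python
import math


def n_obs_variable(time1, time2, periodicity=1):
    """Left-hand side of the equality: (time2 - time1 + 1) / periodicity.

    Remark 2 requires the length of the period to be a whole multiple
    of the periodicity, so a ValueError is raised otherwise.
    """
    length = time2 - time1 + 1
    if length % periodicity != 0:
        raise ValueError("time period is not a whole multiple of the periodicity")
    return length // periodicity


def n_obs_index(itime1, itime2, iperiodicity=1, istep=1):
    """Right-hand side: ceiling((itime2 - itime1 - istep + 2) / iperiodicity)."""
    return math.ceil((itime2 - itime1 - istep + 2) / iperiodicity)


def periods_agree(time1, time2, periodicity, itime1, itime2, iperiodicity, istep):
    """True if the two selected time periods satisfy the equality of remark 9."""
    return n_obs_variable(time1, time2, periodicity) == n_obs_index(
        itime1, itime2, iperiodicity, istep
    )


# 10 yearly observations (-t=1,10 -p=1) against the January values of a
# monthly index covering 120 months (-ti=1,120 -pi=12,1).
ok = periods_agree(1, 10, 1, 1, 120, 12, 1)

# Only 9 yearly observations against the same index selection.
not_ok = periods_agree(1, 9, 1, 1, 120, 12, 1)
```

In the first case both sides evaluate to 10, so the selections agree; in the second case the left-hand side is 9 and comp_reg_1d would be expected to stop with an error message.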

Outputs

comp_reg_1d creates an output NetCDF file that contains the regression coefficients and the statistics and statistical tests associated with these coefficients, taking into account the periodicity of the data, if any, as determined by the -p=periodicity argument.

If the -vi=index_netcdf_variable argument is specified, the output NetCDF dataset will have periodicity time observations and may contain the following NetCDF variables, depending on the arguments used when calling the procedure:

  1. netcdf_variable_index_netcdf_variable_reg0(periodicity) : the intercept coefficient in the regression model for predicting the time series of the input NetCDF variable netcdf_variable by the index_netcdf_variable time series.

  2. netcdf_variable_index_netcdf_variable_reg1(periodicity) : the regression coefficient in the regression model for predicting the time series of the input NetCDF variable netcdf_variable by the index_netcdf_variable time series.

    By default, the regression coefficient is expressed in units of the input NetCDF variable netcdf_variable by unit of the index_netcdf_variable time series.

  3. netcdf_variable_index_netcdf_variable_stderr0(periodicity) : the standard-error of the intercept coefficient in the regression model for predicting the time series of the input NetCDF variable netcdf_variable by the index_netcdf_variable time series.

  4. netcdf_variable_index_netcdf_variable_stderr1(periodicity) : the standard-error of the regression coefficient in the regression model for predicting the time series of the input NetCDF variable netcdf_variable by the index_netcdf_variable time series.

  5. netcdf_variable_index_netcdf_variable_tprob0(periodicity) : the Student t-test probability associated with the intercept coefficient in the regression model for predicting the time series of the input NetCDF variable netcdf_variable by the index_netcdf_variable time series.

  6. netcdf_variable_index_netcdf_variable_tprob1(periodicity) : the Student t-test probability associated with the regression coefficient in the regression model for predicting the time series of the input NetCDF variable netcdf_variable by the index_netcdf_variable time series.

  7. netcdf_variable_index_netcdf_variable_dv(periodicity) : the regression coefficient associated with the dummy variable time series if a dummy variable is included in the regression model.

    This variable is stored only if the -dv= argument has been specified when calling comp_reg_1d.

  8. netcdf_variable_index_netcdf_variable_dv_stderr(periodicity) : the standard-error of the regression coefficient associated with the dummy variable time series if a dummy variable is included in the regression model.

    This variable is stored only if the -dv= argument has been specified when calling comp_reg_1d.

  9. netcdf_variable_index_netcdf_variable_dv_tprob(periodicity) : the Student t-test probability associated with the dummy variable time series if a dummy variable is included in the regression model.

    This variable is stored only if the -dv= argument has been specified when calling comp_reg_1d.

  10. netcdf_variable_index_netcdf_variable_r2(periodicity) : the r-square statistic associated with the regression model for predicting the time series associated with the input NetCDF variable netcdf_variable by the index_netcdf_variable time series.

  11. netcdf_variable_index_netcdf_variable_adjr2(periodicity) : the adjusted r-square statistic associated with the regression model for predicting the time series associated with the input NetCDF variable netcdf_variable by the index_netcdf_variable time series.

  12. netcdf_variable_index_netcdf_variable_fprob(periodicity) : the F-test probability associated with the regression model for predicting the time series associated with the input NetCDF variable netcdf_variable by the index_netcdf_variable time series.

  13. netcdf_variable_index_netcdf_variable_predict(ntime) : the predictions associated with the regression model for predicting the time series associated with the input NetCDF variable netcdf_variable by the index_netcdf_variable time series.

  14. netcdf_variable_index_netcdf_variable_resid(ntime) : the residuals associated with the regression model for predicting the time series associated with the input NetCDF variable netcdf_variable by the index_netcdf_variable time series.

If the -vi=index_netcdf_variable argument is not specified when calling the procedure, similar statistics will be produced and each component of the selected polynomial trend regression model will have regression coefficients, standard-errors and Student t-test probabilities associated with it. As an illustration, the intercept and the regression coefficient (i.e., the slope) in a linear trend regression model will be stored in NetCDF variables netcdf_variable_poly_trend_reg0 and netcdf_variable_poly_trend_reg1, respectively. For a quadratic trend regression model, in addition to these variables, the regression coefficients associated with the quadratic component will be stored in a NetCDF variable netcdf_variable_poly_trend_reg2. The same naming conventions are used for the standard-errors and Student t-test probabilities associated with each component of the selected polynomial trend regression model.
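When post-processing the output file in a script, it can help to generate the expected variable names from the naming conventions described above. The sketch below covers only the coefficient-related variables (reg, stderr, tprob) and the default output file name; both helper functions are hypothetical, and which variables are actually present depends on the options passed to comp_reg_1d.

```python
def coefficient_variable_names(netcdf_variable, index_netcdf_variable=None,
                               polynomial_degree=1):
    """Names of the coefficient, standard-error and t-test variables.

    With an index variable there are two coefficients (intercept and
    slope); with a polynomial trend there are polynomial_degree + 1.
    """
    if index_netcdf_variable is not None:
        tag = index_netcdf_variable
        n_coef = 2
    else:
        tag = "poly_trend"
        n_coef = polynomial_degree + 1
    prefix = f"{netcdf_variable}_{tag}"
    names = []
    for stat in ("reg", "stderr", "tprob"):
        names.extend(f"{prefix}_{stat}{i}" for i in range(n_coef))
    return names


def default_output_file(netcdf_variable, index_netcdf_variable=None):
    """Default output file name when the -o= argument is not given."""
    tag = index_netcdf_variable if index_netcdf_variable is not None else "poly_trend"
    return f"reg_{netcdf_variable}.{tag}.nc"


# Quadratic trend on the variable used in the Examples section.
quadratic_names = coefficient_variable_names("sosstsst", polynomial_degree=2)
quadratic_file = default_output_file("sosstsst")
```

For the quadratic case, the generated list contains sosstsst_poly_trend_reg0, sosstsst_poly_trend_reg1 and sosstsst_poly_trend_reg2, followed by the corresponding stderr and tprob names.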

Examples

  1. To remove a polynomial trend from the unidimensional NetCDF variable sosstsst in the NetCDF file ST7_1m_sst_nino34.nc and store the results in a NetCDF file named ST7_1m_sst_nino34_detrended.nc, use the following command (quadratic detrending is used since -dg=2 is specified, and cyclostationarity is assumed for the sosstsst variable since -p=12 is also specified):

    $ comp_reg_1d \
      -f=ST7_1m_sst_nino34.nc \
      -v=sosstsst \
      -dg=2 \
      -p=12 \
      -a=residual \
      -o=ST7_1m_sst_nino34_detrended.nc
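To give an idea of what the command above computes, the following sketch detrends a synthetic monthly series with a separate quadratic fit for each calendar month. It is only an illustration under stated assumptions: the least-squares problem is solved through the normal equations rather than the orthogonal factorization mentioned in the remarks, the time index of each monthly sub-series is assumed to run 0, 1, 2, ..., and all function names are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def polyfit_normal_equations(t, y, degree):
    """Least-squares polynomial coefficients, lowest degree first."""
    size = degree + 1
    a = [[sum(ti ** (i + j) for ti in t) for j in range(size)] for i in range(size)]
    b = [sum((ti ** i) * yi for ti, yi in zip(t, y)) for i in range(size)]
    return solve_linear_system(a, b)


def detrend_periodic(y, periodicity, degree):
    """Residuals after fitting one polynomial trend per phase of the cycle."""
    if len(y) % periodicity != 0:
        raise ValueError("series length must be a whole multiple of the periodicity")
    resid = [0.0] * len(y)
    for phase in range(periodicity):
        sub = y[phase::periodicity]  # e.g. all the January values
        t = list(range(len(sub)))
        coef = polyfit_normal_equations(t, sub, degree)
        for k, yk in enumerate(sub):
            trend = sum(c * k ** p for p, c in enumerate(coef))
            resid[phase + k * periodicity] = yk - trend
    return resid


# Synthetic 10-year monthly series: seasonal cycle plus a quadratic drift.
n_years = 10
monthly = [float(m % 12) + 0.01 * (m // 12) ** 2 for m in range(12 * n_years)]
residuals = detrend_periodic(monthly, periodicity=12, degree=2)
```

Because each calendar month of the synthetic series follows an exact quadratic in the year number, the residuals are zero up to rounding error, which makes the sketch easy to verify.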
    