comp_reg_4d
Authors
Pascal Terray (LOCEAN/IPSL)
Latest revision
28/05/2024
Purpose
Estimate polynomial trends and regression models from time series extracted from a four-dimensional variable in a NetCDF dataset and, optionally, remove the linear terms from the data by using linear least squares estimation and store the residuals and/or the predictions in a NetCDF dataset.
Optionally, regression diagnostics and statistical tests to diagnose the quality of the fitted regression model may also be computed and stored [Rusta] [Rustb].
Finally, if the NetCDF variable is tridimensional, use comp_reg_3d instead of comp_reg_4d, and if the NetCDF variable is unidimensional, use comp_reg_1d instead of comp_reg_4d.
This procedure is parallelized if OpenMP is used.
Further Details
Usage
$ comp_reg_4d \
-f=input_netcdf_file \
-v=netcdf_variable \
-m=input_mesh_mask_netcdf_file (optional) \
-g=grid_type (optional : n, t, u, v, w, f) \
-x=lon1,lon2 (optional) \
-y=lat1,lat2 (optional) \
-z=level1,level2 (optional) \
-t=time1,time2 (optional) \
-p=periodicity (optional) \
-a=type_of_analysis (optional : reg, residual, predict, all) \
-o=output_netcdf_file (optional) \
-to=time_origin (optional) \
-dg=polynomial_degree (optional) \
-fi=input_index_netcdf_file (optional) \
-vi=index_netcdf_variable (optional) \
-ti=itime1,itime2 (optional) \
-pi=iperiodicity,istep (optional) \
-ni=index_for_2d_index_netcdf_variable (optional) \
-sm=smoothing_factor (optional) \
-dv=dv_time1,dv_time2 (optional) \
-mi=missing_value (optional) \
-use_eps=tol (optional) \
-comp_min_norm (optional) \
-add_mean (optional) \
-rsquare (optional) \
-adjrsquare (optional) \
-stderr (optional) \
-ftest (optional) \
-ttest (optional) \
-double (optional) \
-bigfile (optional) \
-hdf5 (optional) \
-tlimited (optional)
By default
- -m=
- an input_mesh_mask_netcdf_file is not used
- -g=
- the grid_type is set to n, which means that the 2-D grid-mesh associated with the input netcdf_variable is assumed to be regular or Gaussian
- -x=
- the whole longitude domain associated with the netcdf_variable
- -y=
- the whole latitude domain associated with the netcdf_variable
- -z=
- the whole vertical resolution associated with the netcdf_variable
- -t=
- the whole time period associated with the netcdf_variable
- -p=
- the periodicity is set to 1
- -a=
- the type_of_analysis is set to reg
- -o=
- the output_netcdf_file is set to reg_netcdf_variable.index_netcdf_variable.nc if the -vi= argument is used and to reg_netcdf_variable.poly_trend.nc otherwise
- -to=
- if a polynomial trend is estimated, the origin for the time scale is set to 0
- -dg=
- the polynomial_degree is set to 1. This means that a linear trend is estimated if the -vi= argument is not used
- -fi=
- a polynomial trend is estimated if the -vi= argument is not used; otherwise the -fi= argument may be used to specify the NetCDF dataset from which the index_netcdf_variable is extracted
- -vi=
- a polynomial trend is estimated if the -vi= argument is not used
- -ti=
- the whole time period associated with the index_netcdf_variable if the -vi=index_netcdf_variable is specified.
- -pi=
- this parameter is not used
- -ni=
- if the index_netcdf_variable is bidimensional, the first time series is used
- -sm=
- no smoothing is applied to the index_netcdf_variable if the -vi= argument is used
- -dv=
- a dummy variable is not used
- -mi=
- the missing_value is set to 1.e+20 for the NetCDF variables in the output_netcdf_file
- -use_eps=
- no tolerance is used for solving the linear least squares problem associated with the specified regression model. If -use_eps= is used, the specified tolerance tol is used for solving the linear least squares problem
- -comp_min_norm
- the minimal norm solution is not computed. If -comp_min_norm is activated, the minimal norm solution of the linear least squares problem is computed
- -add_mean
- the means are not added to the residuals. If -add_mean is activated, the means are added to the residuals
- -rsquare
- the coefficient of determination is not computed. If -rsquare is activated, the coefficient of determination is computed
- -adjrsquare
- the adjusted coefficient of determination is not computed. If -adjrsquare is activated, the adjusted coefficient of determination is computed
- -stderr
- the standard errors of the estimates are not computed. If -stderr is activated, the standard errors of the estimates are computed
- -ftest
- an F-test is not computed. If -ftest is activated, an F-test is computed
- -ttest
- Student-tests are not computed. If -ttest is activated, the Student-tests are computed
- -double
- the results are stored as single-precision floating point numbers in the output NetCDF file. If -double is activated, the results are stored as double-precision floating point numbers
- -bigfile
- a NetCDF classical format file is created. If -bigfile is activated, the output NetCDF file is a 64-bit offset format file
- -hdf5
- a NetCDF classical format file is created. If -hdf5 is activated, the output NetCDF file is a NetCDF-4/HDF5 format file
- -tlimited
- the time dimension is defined as unlimited in the output NetCDF file. However, if -tlimited is activated, the time dimension is defined as limited in the output NetCDF file
Remarks
The -v=netcdf_variable argument specifies the NetCDF variable for which a regression analysis must be computed and the -f=input_netcdf_file argument specifies that this NetCDF variable must be extracted from the NetCDF file, input_netcdf_file.
The optional argument -m=input_mesh_mask_netcdf_file specifies the land-sea mask to apply to netcdf_variable for transforming this four-dimensional NetCDF variable into a rectangular matrix of observed variables before computing the regression analysis. By default, it is assumed that each cell in the 3-D grid-mesh associated with the input four-dimensional NetCDF variable is a valid time series (i.e., missing values are not present).
The geographical shapes of the netcdf_variable (in the input_netcdf_file) and the mask (in the input_mesh_mask_netcdf_file) must agree if an input_mesh_mask_netcdf_file is used.
Refer to comp_clim_4d or comp_mask_4d for creating a valid input_mesh_mask_netcdf_file NetCDF file for regular or Gaussian grids before using comp_reg_4d.
If -g= is set to t, u, v, w or f, it is assumed that the input NetCDF variable is from an experiment with the NEMO model (ORCA configuration and R2, R4 or R05 resolutions). This argument is also used to determine the name of the mesh_mask variable if an input_mesh_mask_netcdf_file is used.
If the -x=lon1,lon2, -y=lat1,lat2 and -z=level1,level2 arguments are missing, the whole geographical domain and vertical resolution associated with the netcdf_variable are used.
The longitude or latitude range must be a vector of two integers specifying the first and last selected indices along each dimension. The indices are relative to 1. Negative values are allowed for lon1. In this case the longitude domain is from nlon + lon1 + 1 to lon2, where nlon is the number of longitude points in the grid associated with the NetCDF variable, and it is assumed that the grid is periodic.
Refer to comp_mask_4d for transforming geographical coordinates as indices or generating an appropriate mesh-mask before using comp_reg_4d.
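The wrap-around rule for a negative lon1 can be sketched in a few lines of Python. This is only an illustration of the index arithmetic described above, not code taken from comp_reg_4d: the function name select_lon_indices is hypothetical, and the assumption that the selection continues from index nlon back to index 1 follows from the statement that the grid is periodic.

```python
def select_lon_indices(lon1, lon2, nlon):
    """Return the 1-based longitude indices selected by -x=lon1,lon2.

    Assumption: with a negative lon1 the selection starts at
    nlon + lon1 + 1, runs up to nlon, then wraps to 1 and stops at lon2.
    """
    if lon1 >= 1:
        # Ordinary case: a contiguous range of 1-based indices.
        return list(range(lon1, lon2 + 1))
    if lon1 == 0:
        # The description does not cover lon1 = 0, so this sketch rejects it.
        raise ValueError("lon1 must be a positive or negative integer")
    start = nlon + lon1 + 1
    return list(range(start, nlon + 1)) + list(range(1, lon2 + 1))


# Hypothetical grid with 144 longitude points and -x=-10,20.
indices = select_lon_indices(-10, 20, nlon=144)
print(indices[0], indices[9], indices[10], indices[-1])
```

With these hypothetical values the selection covers indices 135 to 144 followed by indices 1 to 20, which is a window of 30 points straddling the first longitude of the grid.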
If the -t=time1,time2 argument is missing, data in the whole time period associated with the netcdf_variable is taken into account. The selected time period is a vector of two integers specifying the first and last time observations. The indices are relative to 1.
The length of the selected time period (i.e., time2 - time1 + 1) must be a whole multiple of the periodicity if the -p= argument is specified.
The -p=periodicity argument gives the periodicity of the input data for the netcdf_variable. For example, with monthly data -p=12 should be specified, with yearly data -p=1 may be used, etc.
If the -p=periodicity argument is specified, the regression models are computed by taking into account the periodicity of the data. This means that periodicity regression models are estimated for each cell in the 3-D grid-mesh associated with the input four-dimensional NetCDF variable.
The -a= argument specifies the statistics which must be computed and whether the residuals or predictions from the trend or regression models are stored in the output NetCDF file. If:
- -a=reg, the default, the residuals and predictions are not computed and only the regression and intercept coefficients are computed and stored
- -a=residual, the residuals are computed and stored, in addition to the regression and intercept coefficients
- -a=predict, the predictions are computed and stored, in addition to the regression and intercept coefficients
- -a=all, the residuals and predictions are computed and stored, in addition to the regression and intercept coefficients.
The -to=time_origin argument specifies the origin for the time scale if a polynomial regression model is used (i.e., when the -vi= argument is not used).
By default, the origin for the time scale is set to 0. This shift of the zero point for the time scale makes the estimate for the intercept(s) in the polynomial model equal to the model prediction(s) for the first observation at the beginning of the record.
The -dg=polynomial_degree argument specifies the degree of the polynomial if a polynomial trend regression model is used. If -dg=1 a linear trend is used, if -dg=2 a quadratic trend is used, etc.
The -vi=index_netcdf_variable argument specifies a time series for the regression model (i.e., an independent variable).
If the -vi=index_netcdf_variable argument is present, the -fi= argument specifies the NetCDF dataset which contains the index_netcdf_variable. The -fi= argument may be omitted only if this dataset is the same as the NetCDF dataset specified by the -f= argument.
If the -vi=index_netcdf_variable argument is not specified, a regression model with a polynomial trend is assumed and the -dg= argument specifies the degree of the polynomial: if -dg=1 a linear trend is used, if -dg=2 a quadratic trend is used, etc.
If the -ti=itime1,itime2 argument is missing, data in the whole time period associated with the index_netcdf_variable is taken into account. The selected time period is a vector of two integers specifying the first and last time observations. The indices are relative to 1. If the -vi= argument is not present, this argument is not used.
The selected time periods for the netcdf_variable and index_netcdf_variable must agree. This means that the following equality must be verified:
(time2 - time1 + 1)/periodicity = ceiling((itime2 - itime1 - istep + 2)/iperiodicity)
Otherwise, an error message is issued and the program stops.
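The agreement condition between the two time selections is easy to check before launching a long computation. The Python sketch below evaluates both sides of the equality stated above. The function name periods_agree is hypothetical and the example values are invented for illustration.

```python
def periods_agree(time1, time2, periodicity, itime1, itime2, iperiodicity, istep):
    """Check (time2 - time1 + 1)/periodicity
    == ceiling((itime2 - itime1 - istep + 2)/iperiodicity)."""
    n_field = (time2 - time1 + 1) / periodicity
    numerator = itime2 - itime1 - istep + 2
    # Integer ceiling of numerator / iperiodicity without floating point.
    n_index = -(-numerator // iperiodicity)
    return n_field == n_index


# Monthly field (-t=1,120 -p=12) and monthly index (-ti=1,120 -pi=12,1).
print(periods_agree(1, 120, 12, 1, 120, 12, 1))
# Yearly field (-t=1,10 -p=1) and July values of a monthly index (-pi=12,7).
print(periods_agree(1, 10, 1, 1, 120, 12, 7))
# Index selection one year too short (-ti=1,108).
print(periods_agree(1, 120, 12, 1, 108, 12, 1))
```

The first two calls describe selections that both yield 10 observations per regression model, so the check passes. The third yields 10 on one side and 9 on the other, which is the situation in which the program stops with an error message.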
The -pi= argument gives the periodicity and selects the time step for the index_netcdf_variable. For example, to compute regression models with the January monthly time series extracted from the index_netcdf_variable, which is assumed to be sampled every month, -pi=12,1 should be specified, with yearly data -pi=1,1 may be used, etc.
If the -vi= argument is not present, this argument is not used.
The -ni= argument specifies the index (e.g., an integer) for selecting the time series if the index_netcdf_variable specified in the -vi= argument is a 2D NetCDF variable.
The -sm=smoothing_factor argument means that the time series associated with the index_netcdf_variable (i.e., the -vi= argument) must be smoothed with a moving average of approximately 2*smoothing_factor + 1 terms before estimating the regression parameters for predicting the netcdf_variable (i.e., the -v= argument) from the index_netcdf_variable. smoothing_factor must be a strictly positive integer.
If the -vi= argument is not present, this argument is not used.
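As an illustration of what a moving average of 2*smoothing_factor + 1 terms does to the index, here is a minimal Python sketch of a centered moving average. It is not the smoother implemented in comp_reg_4d. In particular, the treatment of the ends of the series (a window that shrinks near the edges) is an assumption made for this sketch, and the function name moving_average is hypothetical.

```python
def moving_average(series, smoothing_factor):
    """Centered moving average with 2*smoothing_factor + 1 terms.

    Assumption: near the ends of the series the window is truncated to
    the available observations instead of being padded.
    """
    if smoothing_factor < 1:
        raise ValueError("smoothing_factor must be a strictly positive integer")
    n = len(series)
    smoothed = []
    for i in range(n):
        lo = max(0, i - smoothing_factor)
        hi = min(n, i + smoothing_factor + 1)
        window = series[lo:hi]
        smoothed.append(sum(window) / len(window))
    return smoothed


# Hypothetical index with five observations, smoothed with 3 terms.
print(moving_average([1.0, 2.0, 3.0, 4.0, 5.0], 1))
```

In the interior of the series each value is replaced by the mean of itself and its smoothing_factor neighbours on each side, so larger values of smoothing_factor retain only the slower variations of the index.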
If the -dv=dv_time1,dv_time2 argument is specified, a dummy variable is also included in the regression model. The dummy variable is an absence/presence variable (i.e., with values 0 or 1) and the time observations where the dummy variable is equal to 1 are specified by the dv_time1 and dv_time2 integers. The dv_time1 and dv_time2 integers specify the first and last time observations of the selected time period in which the dummy variable is set to 1.
These time indices are counted from the start of the (selected) time period (i.e., time1 in the -t=time1,time2 argument or 1 if this argument is missing) and must take into account the periodicity of the data if the -p= argument is specified.
The -mi=missing_value argument specifies the missing value indicator associated with the NetCDF variables in the output_netcdf_file. If the -mi= argument is not specified, missing_value is set to 1.e+20.
The -use_eps=tol argument is used to determine the effective rank of the linear least squares problem. tol must be set to the relative precision of the elements in the NetCDF variables specified by the -v= and -vi= arguments. If each element is correct to, say, 5 digits then tol=0.00001 should be used. tol must not be greater than or equal to 1 or less than 0, otherwise an error message is printed and the program stops. If the -use_eps= argument is not used, the numerical rank is determined.
If -comp_min_norm is specified, a complete orthogonal factorization of the coefficient matrix and the minimum 2-norm solutions are computed.
If -add_mean is specified, the means are added to the residuals of the regression model. This option has an effect only if -a=residual or -a=all (i.e., if the residuals are computed and stored in the output_netcdf_file).
By default, the means of the residuals in the output NetCDF file are zero.
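The effect of -add_mean can be illustrated on a single time series. The Python sketch below fits a straight line by ordinary least squares, computes the residuals (whose mean is zero because the model includes an intercept) and then adds a mean back. It assumes that "the means" refers to the time means of the original series. The function name fit_line and the data values are invented for illustration.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical predictor (time or index) and observed series.
x = [0.0, 1.0, 2.0, 3.0]
y = [10.0, 12.5, 13.5, 16.0]

intercept, slope = fit_line(x, y)
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# Default behaviour: residuals with zero mean.
mean_residual = sum(residuals) / len(residuals)

# Sketch of -add_mean (assumption: the mean of the original series is added).
mean_y = sum(y) / len(y)
residuals_with_mean = [r + mean_y for r in residuals]

print(intercept, slope)
print(mean_residual)
print(sum(residuals_with_mean) / len(residuals_with_mean))
```

Adding the mean back keeps the detrended series at the same overall level as the input data, which can be convenient when the residuals are passed on to procedures that expect values in the original range.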
If -rsquare is specified, the coefficient of determination of the specified model is computed.
If -adjrsquare is specified, the adjusted coefficient of determination of the specified model is computed.
If -stderr is specified, the standard errors of the estimated regression coefficients are computed (unless the specified model is not of full rank).
If -ftest is specified, an F-test is performed to test the null hypothesis that all the regression coefficients are zero (except the intercept).
If -ttest is specified, Student-tests are performed to test the null hypothesis that the regression coefficients are zero (independently of the other regression coefficients).
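For readers who want to see how these diagnostics relate to one another, the following Python sketch computes them for a simple regression with one predictor, using the textbook formulas. It is a sketch under stated assumptions and not the code used by comp_reg_4d: the function name regression_diagnostics is hypothetical, and the sketch stops at the F and t statistics because converting them to probabilities requires distribution functions that are not in the Python standard library.

```python
import math


def regression_diagnostics(x, y):
    """Textbook diagnostics for y = intercept + slope * x (one predictor)."""
    n = len(x)
    p = 1                      # number of predictors, intercept excluded
    dof = n - p - 1            # residual degrees of freedom
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - mean_y) ** 2 for yi in y)
    r2 = 1.0 - sse / sst
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / dof
    residual_variance = sse / dof
    stderr_slope = math.sqrt(residual_variance / sxx)
    stderr_intercept = math.sqrt(residual_variance * (1.0 / n + mean_x ** 2 / sxx))
    return {
        "r2": r2,
        "adj_r2": adj_r2,
        "stderr_intercept": stderr_intercept,
        "stderr_slope": stderr_slope,
        "f_stat": ((sst - sse) / p) / residual_variance,
        "t_slope": slope / stderr_slope,
    }


# Hypothetical data, invented for illustration.
diag = regression_diagnostics([0.0, 1.0, 2.0, 3.0], [10.0, 12.5, 13.5, 16.0])
for name, value in diag.items():
    print(name, round(value, 4))
```

With a single predictor the F statistic equals the square of the t statistic of the slope, so -ftest and -ttest carry the same information about that coefficient. The two tests differ only when several terms are present, for instance with a polynomial of degree 2 or with a dummy variable.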
The -double argument specifies that the results are stored as double-precision floating point numbers in the output NetCDF file.
By default, the results are stored as single-precision floating point numbers in the output NetCDF file.
The -bigfile argument is allowed only if the NCSTAT software has been compiled with the _USE_NETCDF36 or _USE_NETCDF4 CPP macros (e.g., -D_USE_NETCDF36 or -D_USE_NETCDF4) and linked to the NetCDF 3.6 library or higher.
If this argument is specified, the output_netcdf_file will be a 64-bit offset format file instead of a NetCDF classic format file. However, this argument is recognized in the procedure only if the NCSTAT software has been built with the _USE_NETCDF36 or _USE_NETCDF4 CPP macros.
The -hdf5 argument is allowed only if the NCSTAT software has been compiled with the _USE_NETCDF4 CPP macro (e.g., -D_USE_NETCDF4) and linked to the NetCDF 4 library or higher.
If this argument is specified, the output_netcdf_file will be a NetCDF-4/HDF5 format file instead of a NetCDF classic format file. However, this argument is recognized in the procedure only if the NCSTAT software has been built with the _USE_NETCDF4 CPP macro.
The -tlimited argument specifies that the time dimension must be defined as limited in the output NetCDF file.
By default, this time dimension is defined as unlimited in the output NetCDF file.
Duplicate parameters are allowed, but it is always the last occurrence of a parameter which is used for the computations. Moreover, the number of specified parameters must not be greater than the total number of allowed parameters.
It is assumed that the data has no missing values except those associated with a constant land-sea mask if the -m= argument is used.
For more details on regression analysis in the climate literature, see
- “Fitting nature's basic functions Part I: polynomials and linear least squares”, by Rust, B.W., Computing in Science and Engineering, Vol. 3, no 5, 84-89, 2001. doi: 10.1109/MCISE.2001.947111
- “Fitting nature's basic functions Part II: estimating uncertainties and testing hypotheses”, by Rust, B.W., Computing in Science and Engineering, Vol. 3, no 6, 60-64, 2001. doi: 10.1109/5992.963429
- “Fitting nature's basic functions Part III: exponentials, sinusoids, and nonlinear least squares”, by Rust, B.W., Computing in Science and Engineering, Vol. 4, no 4, 72-77, 2002. doi: 10.1109/MCISE.2002.1014982
- “Statistical Analysis in Climate Research”, by von Storch, H., and Zwiers, F.W., Cambridge University Press, Cambridge, UK, Chapter 8, 484 pp., 2002. ISBN: 9780521012300
Outputs
comp_reg_4d creates an output NetCDF file that contains the regression statistics and statistical tests associated with these coefficients, taking into account, where applicable, the periodicity of the data as determined by the -p=periodicity argument.
If the -vi=index_netcdf_variable argument is specified, the output NetCDF dataset will have periodicity time observations and may contain the following NetCDF variables, depending on the arguments used in calling the procedure (in the description below, nlev, nlat and nlon are the lengths of the geographical dimensions of the input NetCDF variable):
- netcdf_variable_index_netcdf_variable_reg0(periodicity,nlev,nlat,nlon) : the intercept coefficients in the regression models for predicting each grid-point in the time series of the 3-D grid-mesh associated with the input NetCDF variable netcdf_variable by the index_netcdf_variable time series.
- netcdf_variable_index_netcdf_variable_reg1(periodicity,nlev,nlat,nlon) : the regression coefficients between each grid-point in the time series of the 3-D grid-mesh associated with the input NetCDF variable netcdf_variable and the index_netcdf_variable time series. By default, the regression coefficients are expressed in units of the input NetCDF variable netcdf_variable by unit of the index_netcdf_variable time series.
- netcdf_variable_index_netcdf_variable_stderr0(periodicity,nlev,nlat,nlon) : the standard-errors of the intercept coefficients in the regression models for predicting each grid-point in the time series of the 3-D grid-mesh associated with the input NetCDF variable netcdf_variable by the index_netcdf_variable time series.
- netcdf_variable_index_netcdf_variable_stderr1(periodicity,nlev,nlat,nlon) : the standard-errors of the regression coefficients between each grid-point in the time series of the 3-D grid-mesh associated with the input NetCDF variable netcdf_variable and the index_netcdf_variable time series.
- netcdf_variable_index_netcdf_variable_tprob0(periodicity,nlev,nlat,nlon) : the Student t-test probabilities associated with the intercept coefficients in the regression models for predicting each grid-point in the time series of the 3-D grid-mesh associated with the input NetCDF variable netcdf_variable by the index_netcdf_variable time series.
- netcdf_variable_index_netcdf_variable_tprob1(periodicity,nlev,nlat,nlon) : the Student t-test probabilities associated with the regression coefficients between each grid-point in the time series of the 3-D grid-mesh associated with the input NetCDF variable netcdf_variable and the index_netcdf_variable time series.
- netcdf_variable_index_netcdf_variable_dv(periodicity,nlev,nlat,nlon) : the regression coefficients between each grid-point in the time series of the 3-D grid-mesh associated with the input NetCDF variable netcdf_variable and the dummy variable time series if a dummy variable is included in the regression models. This variable is stored only if the -dv= argument has been specified when calling comp_reg_4d.
- netcdf_variable_index_netcdf_variable_dv_stderr(periodicity,nlev,nlat,nlon) : the standard-errors of the regression coefficients between each grid-point in the time series of the 3-D grid-mesh associated with the input NetCDF variable netcdf_variable and the dummy variable time series if a dummy variable is included in the regression models. This variable is stored only if the -dv= argument has been specified when calling comp_reg_4d.
- netcdf_variable_index_netcdf_variable_dv_tprob(periodicity,nlev,nlat,nlon) : the Student t-test probabilities associated with the regression coefficients between each grid-point in the time series of the 3-D grid-mesh associated with the input NetCDF variable netcdf_variable and the dummy variable time series if a dummy variable is included in the regression models. This variable is stored only if the -dv= argument has been specified when calling comp_reg_4d.
- netcdf_variable_index_netcdf_variable_r2(periodicity,nlev,nlat,nlon) : the r-square statistics associated with the regression models for predicting each grid-point in the time series of the 3-D grid-mesh associated with the input NetCDF variable netcdf_variable by the index_netcdf_variable time series.
- netcdf_variable_index_netcdf_variable_adjr2(periodicity,nlev,nlat,nlon) : the adjusted r-square statistics associated with the regression models for predicting each grid-point in the time series of the 3-D grid-mesh associated with the input NetCDF variable netcdf_variable by the index_netcdf_variable time series.
- netcdf_variable_index_netcdf_variable_fprob(periodicity,nlev,nlat,nlon) : the F-test probabilities associated with the regression models for predicting each grid-point in the time series of the 3-D grid-mesh associated with the input NetCDF variable netcdf_variable by the index_netcdf_variable time series.
- netcdf_variable_index_netcdf_variable_predict(ntime,nlev,nlat,nlon) : the predictions associated with the regression models for predicting each grid-point in the time series of the 3-D grid-mesh associated with the input NetCDF variable netcdf_variable by the index_netcdf_variable time series.
- netcdf_variable_index_netcdf_variable_resid(ntime,nlev,nlat,nlon) : the residuals associated with the regression models for predicting each grid-point in the time series of the 3-D grid-mesh associated with the input NetCDF variable netcdf_variable by the index_netcdf_variable time series.
All these statistics are packed in four-dimensional variables whose first, second and third dimensions are exactly the same as those associated with the input NetCDF variable netcdf_variable, even if you restrict the geographical domain with the -x=, -y= and -z= arguments. However, outside the selected domain, the output NetCDF variables are filled with missing values.
If the -vi=index_netcdf_variable argument is not specified when calling the procedure, similar statistics will be produced and each component of the selected polynomial trend regression model will have regression coefficients, standard-errors and Student t-test probabilities associated with it. As an illustration, the intercept and the regression coefficient (i.e., the slope) in a linear trend regression model will be stored in NetCDF variables netcdf_variable_poly_trend_reg0 and netcdf_variable_poly_trend_reg1, respectively. For a quadratic trend regression model, in addition to these variables, the regression coefficients associated with the quadratic component will be stored in a NetCDF variable netcdf_variable_poly_trend_reg2. The same naming conventions are used for the standard-errors and Student t-test probabilities associated with each component of the selected polynomial trend regression model.
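To make the correspondence between a polynomial trend and the reg0, reg1 and reg2 variables concrete, the Python sketch below fits a polynomial in time to one series and labels the coefficients with the naming convention described above. Several points are assumptions of this sketch and not statements about comp_reg_4d: the time scale is taken as time_origin, time_origin + 1, and so on; the least squares problem is solved through the normal equations for brevity, which is less robust numerically than an orthogonal factorization; and the function name fit_poly_trend and the data are invented for illustration.

```python
def fit_poly_trend(y, degree, time_origin=0):
    """Least squares polynomial trend of a single time series.

    Assumption: observation i (counted from 0) is located at time
    time_origin + i, so that with time_origin = 0 the intercept is the
    model prediction for the first observation.
    """
    t = [time_origin + i for i in range(len(y))]
    m = degree + 1
    # Normal equations A c = b, with A[j][k] = sum(t^(j+k)), b[j] = sum(t^j * y).
    a = [[float(sum(ti ** (j + k) for ti in t)) for k in range(m)] for j in range(m)]
    b = [float(sum(ti ** j * yi for ti, yi in zip(t, y))) for j in range(m)]
    # Gaussian elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, m):
            factor = a[row][col] / a[col][col]
            for k in range(col, m):
                a[row][k] -= factor * a[col][k]
            b[row] -= factor * b[col]
    # Back substitution.
    coeffs = [0.0] * m
    for row in range(m - 1, -1, -1):
        s = sum(a[row][k] * coeffs[k] for k in range(row + 1, m))
        coeffs[row] = (b[row] - s) / a[row][row]
    return coeffs


# Hypothetical series that follows an exact quadratic trend in time.
series = [2.0 + 0.5 * t + 0.25 * t * t for t in range(8)]
coeffs = fit_poly_trend(series, degree=2)

# Label the coefficients as in the output file (variable name uwind assumed).
named = {f"uwind_poly_trend_reg{k}": c for k, c in enumerate(coeffs)}
for name, value in named.items():
    print(name, round(value, 6))
```

Because the series is an exact quadratic, the sketch recovers the three coefficients used to build it, and the value stored under reg0 coincides with the first observation, which is the property of the default time origin noted in the Remarks.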
Examples
For quadratic detrending of bimonthly data from a four-dimensional NetCDF variable uwind in the NetCDF file uwind.seas.mean.nc and storing the results in a NetCDF file named reg_uwind_seas_ncep2.nc, use the following command (note that cyclostationarity is assumed for the uwind variable since -p=6 is specified and that computations are done only for levels 1 and 3) :
$ comp_reg_4d \
-f=uwind.seas.mean.nc \
-v=uwind \
-z=1,3 \
-p=6 \
-dg=2 \
-m=mesh_mask_wind_ncep2.nc \
-a=residual \
-o=reg_uwind_seas_ncep2.nc