comp_composite_3d¶
Authors¶
Pascal Terray (LOCEAN/IPSL)
Latest revision¶
12/09/2018
Purpose¶
Compute a composite analysis from a tridimensional variable extracted from a NetCDF dataset and perform statistical tests on the differences between the composite mean and the overall mean (e.g. the mean of the parent finite population) for each cell of the 2-D grid-mesh associated with the tridimensional NetCDF variable. The composite statistics and tests may be computed by taking into account the periodicity of the data. These statistics are stored in an output NetCDF dataset.
The use of Student t values and tests in composite analyses, as traditionally practiced in the climate literature, is not valid [Brown_Hall]. To obtain more meaningful results, comp_composite_3d uses either a resampling scheme or probabilities based on the finite population of the time observations and a normal approximation to estimate the statistical significance of the composite means [Terray_etal].
For more details on the statistical test used in comp_composite_3d to assess the significance of the composite means, consult the references cited below [Terray_etal] [Noreen].
If your data contains missing values, use comp_composite_miss_3d instead of comp_composite_3d to estimate the statistics of your composite analysis from your gappy dataset.
Finally, if the NetCDF variable is four-dimensional, use comp_composite_4d instead of comp_composite_3d.
This procedure is parallelized if OpenMP is used and the NCSTAT software has been built with the _PARALLEL_READ CPP macro.
Further Details¶
Usage¶
$ comp_composite_3d \
-f=input_netcdf_file \
-v=netcdf_variable \
-c=input_climatology_netcdf_file \
-a=year1,year2,year3, ... , yearn \
-m=input_mesh_mask_netcdf_file (optional) \
-g=grid_type (optional : n, t, u, v, w, f) \
-x=lon1,lon2 (optional) \
-y=lat1,lat2 (optional) \
-t=time1,time2 (optional) \
-s=type_of_statistics (optional : comp, diff) \
-alg=algorithm (optional : normal, simul) \
-nb=number_of_shuffles (optional) \
-o=output_composite_netcdf_file (optional) \
-mi=missing_value (optional) \
-nostd (optional) \
-double (optional) \
-bigfile (optional) \
-hdf5 (optional) \
-tlimited (optional)
By default¶
- -m=
- an input_mesh_mask_netcdf_file is not used
- -g=
- the grid_type is set to n, which means that the 2-D grid-mesh associated with the input netcdf_variable is assumed to be regular or Gaussian
- -x=
- the whole longitude domain associated with the netcdf_variable
- -y=
- the whole latitude domain associated with the netcdf_variable
- -t=
- the whole time period associated with the netcdf_variable
- -s=
- the type_of_statistics is set to comp
- -alg=
- the algorithm is set to normal
- -nb=
- number_of_shuffles is set to 99
- -o=
- the output_composite_netcdf_file is named composite_netcdf_variable.nc
- -mi=
- the missing_value attribute for the NetCDF variables in the output NetCDF file is set to 1.e+20
- -nostd
- the composite fields of the input NetCDF variable are standardized. If -nostd is activated, the composite fields are not standardized
- -double
- the composite statistics are stored as single-precision floating point numbers in the output NetCDF file. If -double is activated, the results are stored as double-precision floating point numbers
- -bigfile
- a NetCDF classical format file is created. If -bigfile is activated, the output NetCDF file is a 64-bit offset format file
- -hdf5
- a NetCDF classical format file is created. If -hdf5 is activated, the output NetCDF file is a NetCDF-4/HDF5 format file
- -tlimited
- the time dimension is defined as unlimited in the output NetCDF file. However, if -tlimited is activated, the time dimension is defined as limited in the output NetCDF file
Remarks¶
The -v=netcdf_variable argument specifies the NetCDF variable for which a composite analysis must be computed and the -f=input_netcdf_file argument specifies that this NetCDF variable must be extracted from the NetCDF file input_netcdf_file.
The input_climatology_netcdf_file specified with the -c= argument must contain the means and standard deviations for the parent population. The periodicity for the composite analysis is also deduced from the number of observations in the input_climatology_netcdf_file. This input_climatology_netcdf_file must have been created by comp_clim_3d applied to the same netcdf_variable with an identical -t=time1,time2 argument as used in comp_composite_3d in order to obtain correct statistics in the output NetCDF file.
The geographical shapes of the netcdf_variable (in the input_netcdf_file) and the climatology (in the input_climatology_netcdf_file) must agree.
The -a= argument lists the indices of the years (or seasons, months or days, depending on the sampling of the observations in the input_netcdf_file) that must be included in the composite analysis (e.g. for computing the composite means). The indices of the years are counted from the start of the (selected) time period (e.g. time1 in the -t=time1,time2 argument, or 1 if this argument is missing). The list may be specified in different formats:
- -a=n1,n2,…nn selects years n1, n2, … and nn in the input_netcdf_file
- -a=n1:n2 selects years n1 to n2 in the input_netcdf_file
The two forms of the -a= argument may be combined and repeated any number of times, but duplicate years are not allowed.
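The two forms of the -a= list can be illustrated with a short Python sketch. This is not NCSTAT code: the function name parse_year_list is hypothetical, and it assumes that both forms may be mixed inside a single comma-separated value, which is only one possible reading of "combined and repeated".

```python
def parse_year_list(spec: str) -> list[int]:
    """Expand a value such as '10,11,20:22' into a list of year indices.

    Hypothetical helper: it mimics the two documented forms of -a=
    (explicit indices and n1:n2 ranges) and rejects duplicate years.
    """
    years: list[int] = []
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        if ":" in item:
            # Range form n1:n2, both bounds included.
            first, last = (int(part) for part in item.split(":"))
            if first > last:
                raise ValueError(f"empty range: {item}")
            years.extend(range(first, last + 1))
        else:
            # Single index form.
            years.append(int(item))
    if len(set(years)) != len(years):
        raise ValueError("duplicate years are not allowed")
    return years


print(parse_year_list("10,11,20:22"))
```

A value such as 1,2,2 or 1:3,2 raises an error in this sketch, mirroring the rule that duplicate years are not allowed.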
Remember also that the length of a year in the input_netcdf_file is determined by the periodicity of the observations as deduced from the number of time observations in the input_climatology_netcdf_file, when specifying the indices of the years in the -a= argument. This periodicity will also determine how many composite means and statistics will be computed and stored in the output_composite_netcdf_file by comp_composite_3d.
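As an illustration of how the periodicity fixes the length of a "year", the sketch below maps a year index to the time observations it would cover. The function year_to_time_indices is hypothetical, and the mapping itself is an assumption derived from the description above (years counted from time1, each year spanning periodicity consecutive observations); it is not taken from the NCSTAT source.

```python
def year_to_time_indices(year: int, periodicity: int, time1: int = 1) -> list[int]:
    """Return the 1-based time indices assumed to belong to a given year.

    Assumption: year 1 starts at time1 and each year spans `periodicity`
    consecutive time observations.
    """
    start = time1 + (year - 1) * periodicity
    return list(range(start, start + periodicity))


# Monthly data (periodicity 12): year 10 covers observations 109 to 120.
print(year_to_time_indices(10, 12))
```

With a periodicity of 1 (e.g. one observation per year), each year index maps to a single time observation and only one composite field is produced.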
The optional argument -m=input_mesh_mask_netcdf_file specifies the land-sea mask to apply to netcdf_variable for transforming this tridimensional NetCDF variable into a rectangular matrix of observed variables before computing the composite analysis. By default, it is assumed that each cell in the 2-D grid-mesh associated with the input tridimensional NetCDF variable is a valid time series (i.e. missing values are not present).
The geographical shapes of the netcdf_variable (in the input_netcdf_file), the climatology (in the input_climatology_netcdf_file) and the mask (if an input_mesh_mask_netcdf_file is used) must agree.
Refer to comp_clim_3d or comp_mask_3d for creating a valid input_mesh_mask_netcdf_file NetCDF file for regular or gaussian grids before using comp_composite_3d.
If -g= is set to t, u, v, w or f, it is assumed that the input NetCDF variable is from an experiment with the ORCA model (R2, R4 or R05 resolutions). This argument is also used to determine the name of the mesh_mask variable if an input_mesh_mask_netcdf_file is used.
If the -x=lon1,lon2 and -y=lat1,lat2 arguments are missing, the whole geographical domain associated with the netcdf_variable is used in the composite analysis.
The longitude or latitude range must be a vector of two integers specifying the first and last selected indices along each dimension. The indices are relative to 1. Negative values are allowed for lon1. In this case the longitude domain is from nlon+lon1+1 to lon2, where nlon is the number of longitude points in the grid associated with the NetCDF variable, and it is assumed that the grid is periodic.
Refer to comp_mask_3d for transforming geographical coordinates as indices or generating an appropriate mesh-mask before using comp_composite_3d.
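The handling of a negative lon1 can be pictured with the following Python sketch. The function longitude_indices is hypothetical, and the wrap-around across the last grid column is an assumption based on the statement that the grid is periodic; the actual NCSTAT implementation may differ.

```python
def longitude_indices(lon1: int, lon2: int, nlon: int) -> list[int]:
    """Return the 1-based longitude indices selected by -x=lon1,lon2.

    Assumption: for a negative lon1 the domain starts at nlon+lon1+1,
    runs to the last column nlon, then wraps around to columns 1..lon2.
    """
    if lon1 >= 1:
        return list(range(lon1, lon2 + 1))
    if lon1 == 0:
        raise ValueError("indices are relative to 1")
    first = nlon + lon1 + 1
    return list(range(first, nlon + 1)) + list(range(1, lon2 + 1))


# On a 180-column periodic grid, -x=-10,20 would select columns 171..180, then 1..20.
cols = longitude_indices(-10, 20, 180)
print(cols[0], cols[-1], len(cols))
```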
If the -t=time1,time2 argument is missing, data in the whole time period associated with the netcdf_variable is taken into account. The selected time period is a vector of two integers specifying the first and last time observations. The indices are relative to 1.
The length of the selected time period (i.e. time2 - time1 + 1) must be a whole multiple of the periodicity as deduced from the number of time observations in the input_climatology_netcdf_file if -alg=simul (see remark below).
It is also assumed that the selected time period matches exactly the time period used to compute the climatology in the input_climatology_netcdf_file. Moreover, the first selected time observation (time1 if the -t= argument is present) must correspond to the first day, month or season of the climatology specified with the -c= argument.
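The constraint on the length of the selected time period can be checked before running the procedure, for instance with a small Python sketch such as the one below. The function check_time_period is hypothetical and only restates the rule given above.

```python
def check_time_period(time1: int, time2: int, periodicity: int) -> int:
    """Return the number of whole years in the period time1..time2.

    Raises ValueError if time2 - time1 + 1 is not a whole multiple of
    the periodicity, as required when -alg=simul is used.
    """
    nobs = time2 - time1 + 1
    if nobs <= 0:
        raise ValueError("time2 must not be smaller than time1")
    if nobs % periodicity != 0:
        raise ValueError(
            f"{nobs} observations is not a whole multiple of {periodicity}"
        )
    return nobs // periodicity


# 240 monthly observations with a periodicity of 12 give 20 whole years.
print(check_time_period(1, 240, 12))
```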
The -s=type_of_statistics argument specifies how the composite variable netcdf_variable_composite stored in the output_composite_netcdf_file is computed:
- -s=comp means that the composites are computed as the standardized differences between the mean of the composite years and the overall mean (i.e. the mean of the parent finite population) for each grid-point.
- -s=diff means that the composites are computed as the standardized differences between the mean of the composite years and the mean of the other years in the parent finite population for each grid-point.
The standard deviations from the parent population are used for the standardization in both cases and are read from the input_climatology_netcdf_file. Finally, note that this argument does not affect the other variables stored in the output_composite_netcdf_file.
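The difference between the two -s= options can be sketched for a single grid-point and a single phase of the cycle as follows. This is an illustration only: the function composite_value and its arguments are hypothetical, positions are 0-based Python indices rather than the 1-based indices of -a=, and the population mean and standard deviation are passed in as plain numbers standing for the values read from the climatology file.

```python
from statistics import fmean


def composite_value(values, selected, clim_mean, clim_std,
                    kind="comp", standardize=True):
    """Composite for one grid-point, sketched under stated assumptions.

    values      : one value per year for this grid-point
    selected    : 0-based positions of the composite years
    clim_mean   : mean of the parent population (from the climatology)
    clim_std    : standard deviation of the parent population
    kind        : 'comp' (composite minus overall mean) or
                  'diff' (composite minus mean of the other years)
    standardize : False plays the role of the -nostd option
    """
    chosen = set(selected)
    comp_mean = fmean([v for i, v in enumerate(values) if i in chosen])
    if kind == "comp":
        delta = comp_mean - clim_mean
    elif kind == "diff":
        delta = comp_mean - fmean(
            [v for i, v in enumerate(values) if i not in chosen]
        )
    else:
        raise ValueError("kind must be 'comp' or 'diff'")
    return delta / clim_std if standardize else delta


series = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
# The value 2.0 used for clim_std is arbitrary and only serves the example.
print(composite_value(series, [4, 5], clim_mean=3.5, clim_std=2.0, kind="comp"))
print(composite_value(series, [4, 5], clim_mean=3.5, clim_std=2.0, kind="diff"))
```

For the same composite years, the diff form yields a larger departure than the comp form, because the composite years no longer contribute to the reference mean.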
The -nostd argument specifies that the composite variable netcdf_variable_composite in the output_composite_netcdf_file must not be standardized.
By default, the composites are standardized in the output_composite_netcdf_file.
The -alg= argument selects the method for computing critical probabilities associated with the composite means and U statistics (see the second reference cited below for the definition of this U statistic):
- If -alg=normal, a normal approximation is used to compute the critical probabilities associated with the composite means and U statistics.
- If -alg=simul, the critical probabilities are estimated by a resampling method. If -alg=simul the selected time period (e.g. time2 - time1 + 1) must be a whole multiple of the periodicity.
The -nb=number_of_shuffles argument specifies the number of shuffles for the resampling procedure if -alg=simul.
The -mi=missing_value argument specifies the missing value indicator associated with the NetCDF variables in the output_netcdf_file. If the -mi= argument is not specified the missing_value attribute is set to 1.e+20 for the NetCDF variables in the output_netcdf_file.
The -double argument specifies that the results are stored as double-precision floating point numbers in the output NetCDF file. By default, the results are stored as single-precision floating point numbers in the output NetCDF file.
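To give an idea of what the two -alg= options involve, the Python sketch below estimates a critical probability for the composite mean of one grid-point in both ways. It is a simplified illustration under several assumptions and does not reproduce the NCSTAT code or the U statistic of [Terray_etal]: the test is taken to be two-sided, the normal approximation uses the textbook variance of a mean drawn without replacement from a finite population, and the resampling estimate uses the (count + 1) / (nb + 1) convention. All function names are hypothetical and positions are 0-based.

```python
import math
import random
from statistics import fmean, pvariance


def normal_pvalue(values, selected):
    """Two-sided p-value from a finite-population normal approximation."""
    big_n, n = len(values), len(selected)
    comp_mean = fmean([values[i] for i in selected])
    # Variance of the mean of n values drawn without replacement from N.
    var_mean = pvariance(values) / n * (big_n - n) / (big_n - 1)
    z = (comp_mean - fmean(values)) / math.sqrt(var_mean)
    return math.erfc(abs(z) / math.sqrt(2.0))


def shuffle_pvalue(values, selected, nb=99, seed=0):
    """Two-sided p-value estimated from nb random draws of n years."""
    rng = random.Random(seed)
    n = len(selected)
    pop_mean = fmean(values)
    observed = abs(fmean([values[i] for i in selected]) - pop_mean)
    count = 0
    for _ in range(nb):
        # Draw n surrogate "composite years" without replacement.
        draw = rng.sample(values, n)
        if abs(fmean(draw) - pop_mean) >= observed:
            count += 1
    return (count + 1) / (nb + 1)


series = [float(v) for v in range(20)]
print(normal_pvalue(series, [18, 19]))
print(shuffle_pvalue(series, [18, 19], nb=99))
```

With this convention the resampling estimate can never be smaller than 1 / (nb + 1), which is one reason to raise the number of shuffles when small probabilities matter.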
The -bigfile argument is allowed only if the NCSTAT software has been compiled with the _USE_NETCDF36 or _USE_NETCDF4 CPP macros (e.g. -D_USE_NETCDF36 or -D_USE_NETCDF4) and linked to the NetCDF 3.6 library or higher. If this argument is specified, the output_netcdf_file will be a 64-bit offset format file instead of a NetCDF classic format file. However, this argument is recognized in the procedure only if the NCSTAT software has been built with the _USE_NETCDF36 or _USE_NETCDF4 CPP macros.
The -hdf5 argument is allowed only if the NCSTAT software has been compiled with the _USE_NETCDF4 CPP macro (e.g. -D_USE_NETCDF4) and linked to the NetCDF 4 library or higher. If this argument is specified, the output_netcdf_file will be a NetCDF-4/HDF5 format file instead of a NetCDF classic format file. However, this argument is recognized in the procedure only if the NCSTAT software has been built with the _USE_NETCDF4 CPP macro.
Duplicate parameters are allowed, but only the last occurrence of a parameter is used for the computations. Moreover, the number of specified parameters must not be greater than the total number of allowed parameters.
It is assumed that the data has no missing values. If the data does contain missing values, use comp_composite_miss_3d instead of comp_composite_3d.
For more details on composite analysis and statistical testing in composite analysis, see
- “The Use of t values in Composite Analyses”, by Brown, T.J., and Hall, B.L., Journal of Climate, vol. 12, 2941-2944, 1999. doi: 10.1175/1520-0442(1999)012<2941:TUOTVI>2.0.CO;2
- “Sea Surface Temperature associations with the Late Indian Summer Monsoon”, by Terray, P., Delecluse, P., Labattu, S., and Terray, L., Climate Dynamics, vol. 21, 593-618, 2003. doi: 10.1007/s00382-003-0354-0
- “Computer-intensive methods for testing hypotheses: an introduction”, by Noreen, E.W., Wiley and Sons, New York, USA, 1989. ISBN: 978-0-471-61136-3
Outputs¶
comp_composite_3d creates an output NetCDF file that contains the composite statistics and critical probabilities associated with the composite means, taking into account, where applicable, the periodicity of the data as determined by the -c=input_climatology_netcdf_file argument. The output NetCDF dataset contains the following NetCDF variables (in the description below, nlat and nlon are the lengths of the spatial dimensions of the input NetCDF variable and periodicity is the number of time observations in the input_climatology_netcdf_file):
netcdf_variable_compmean (periodicity,nlat,nlon) : the mean fields of the time observations selected with the help of the -a= argument, taking into account the periodicity of the data as determined by the number of observations in the input_climatology_netcdf_file (specified with the help of the -c= argument).
netcdf_variable_compstd (periodicity,nlat,nlon) : the standard-deviation fields of the time observations selected with the help of the -a= argument, taking into account the periodicity of the data as determined by the -c=input_climatology_netcdf_file argument.
netcdf_variable_qstd (periodicity,nlat,nlon) : the ratio between the standard deviations fields of the selected time observations to the standard deviations fields of all the time observations, taking into account the periodicity of the data as determined by the -c=input_climatology_netcdf_file argument. The standard deviations fields for all the observations are extracted from the input_climatology_netcdf_file specified in the -c= argument.
netcdf_variable_composite (periodicity,nlat,nlon) : the composite fields of the time observations selected with the help of the -a= argument, taking into account the periodicity of the data as determined by the number of observations in the input_climatology_netcdf_file (specified with the help of the -c= argument). How these composite fields are calculated is determined by the -s= and -nostd arguments.
netcdf_variable_u (periodicity,nlat,nlon) : the U statistics for the time observations selected with the help of the -a= argument, taking into account the periodicity of the data as determined by the number of observations in the input_climatology_netcdf_file (specified with the help of the -c= argument). See the second publication cited above for more details on the definition of the U statistic.
netcdf_variable_prob (periodicity,nlat,nlon) : the critical probabilities associated with the composite means and U statistics. These critical probabilities are computed under the null hypothesis that the selected time observations (with the help of the -a= argument) come from the same population as the other observations in the input_netcdf_file. Small probabilities indicate a large departure from the null hypothesis. The -alg=algorithm argument determines how these critical probabilities are computed.
netcdf_variable_compnobs (periodicity) : the number of observations used to compute the composite means and standard-deviations stored in the NetCDF variables netcdf_variable_compmean and netcdf_variable_compstd defined above.
All these statistics are packed in tridimensional variables whose first and second dimensions are exactly the same as those associated with the input NetCDF variable netcdf_variable even if you restrict the geographical domain with the -x= and -y= arguments. However, outside the selected domain, the output NetCDF variables are filled with missing values.
Example¶
For computing a composite analysis from a tridimensional NetCDF variable sosstsst in the NetCDF file ST7_1m_00101_20012_grid_T_sosstsst.nc, assessing the significance of the results with a Monte Carlo simulation with 999 shuffles and, finally, storing the results in a NetCDF file named composite_ST7_1m_sosstsst_grid_T.nc, use the following command (note that the critical probabilities associated with the U statistics are estimated with the help of a resampling method using 999 surrogate time series since -alg=simul and -nb=999 are specified):
$ comp_composite_3d \
-f=ST7_1m_00101_20012_sosstsst_grid_T.nc \
-v=sosstsst \
-c=clim_sosstsst_grid_T.nc \
-a=10,11,20,55,143 \
-g=t \
-m=mesh_mask_ST7_grid_T.nc \
-alg=simul \
-nb=999 \
-o=composite_ST7_1m_sosstsst_grid_T.nc