comp_composite_3d¶
Authors¶
Pascal Terray (LOCEAN/IPSL)
Latest revision¶
29/05/2024
Purpose¶
Compute a composite analysis from a tridimensional variable extracted from a NetCDF dataset and perform statistical tests on the differences between the composite mean and the overall mean (i.e., the mean of the parent finite population) for each cell of the 2-D grid-mesh associated with the tridimensional NetCDF variable. The composite statistics and their associated critical probabilities may be computed by taking into account the periodicity of the data. These statistics are stored in an output NetCDF dataset.
Composite analysis is a classical statistical tool in climatology. Its purpose is to highlight the space-time evolution of a time series or a gridded dataset (for instance surface temperature) according to the variations of a given index time series (for instance an ENSO index).
For the sake of simplicity, in the following, we take “years” as our time samples, but the approach would also be valid with days, months, etc.
The first step of the method consists in defining groups of years according to the values of the index time series. The second step involves the description of each group of years with the help of the gridded dataset (or other time series). Usually, this description is obtained by computing composite means for each group of years. As the years used to estimate the composite means are restricted to those years belonging to each group, the resulting maps may be useful to describe the space-time variability associated with each group of years identified by the index time series.
While it is easy to compute composite means, assessing the statistical significance of these composite maps is a more difficult task. This is often done with the help of a classical Student’s two sample t-test, where one sample consists of the years belonging to one group and the second sample of the other years in the observed time series. In the usual context of statistical inference, this procedure is used to test the hypothesis of equal population means on the basis of two random samples independently drawn from two normal populations with a common variance, but possibly different means [vonStorch_Zwiers]. The assumptions of random selection and Gaussian distribution are essential for the validity of the test. As noted by [Brown_Hall], the use of the Student’s two sample t-test for statistical inference in composite analyses, as is traditionally used in the climate literature, is not valid since composite means are computed from groups of years which are not randomly selected, but rather by the value of the index time series. The assumption of Gaussian distribution is also difficult to verify and inappropriate here since the data distribution in each composite group is unknown even if the original data distribution is assumed to be Gaussian.
Therefore, we need an alternative procedure for statistical significance testing of the composite maps in order to overcome the drawbacks of the Student’s t-test.
To obtain more meaningful results, comp_composite_3d uses a resampling scheme [Noreen] or, alternatively, critical probabilities based on the theory of sampling in a finite population (i.e., the observed time series) and a normal approximation, to estimate the statistical significance of the composite means [Terray_etal].
The following approach is suggested to assess the chance of occurrence of the composite maps.
Let x(1), x(2), … , x(n) be the raw time series of one grid point X in the dataset, observed during the n years included in the whole finite population used in the composite analysis, and let
- MEAN = sum( X(:) ) / n
- VAR = sum( [X(:)-MEAN]**2 ) / n
- STD = sqrt(VAR)
be the mean, variance and standard-deviation estimated from the n data values. Suppose now that the n years are classified by means of an index time series into k groups and let C(1), C(2), … , C(k) be the k groups of years. In common applications k is less than 3. Finally, let n(1), n(2), … , n(k) be the number of years in each group and M(1), M(2), … , M(k) be the composite means computed from the years belonging to each group.
Suppose now that we want to assess if the n(j) values for the grid point X observed during the years in class C(j) are significantly different from the values observed on all the years.
This problem can be treated as the statistical testing of a null hypothesis. The null hypothesis H0, as usually stated, is that these n(j) values were allocated to class C(j) at random and without replacement among the finite population of the n years included in the composite analysis. In order to test this hypothesis, we may compare the actual mean M(j) with the expected value of the mean, assuming the null hypothesis H0 is true. More precisely, it may be shown with the help of the theory of random sampling without replacement in a finite population, that MEAN is the expected value of the mean for the years in class C(j) and that
STD-ERROR(j) = sqrt( [ n - n(j) ] / [ n(j) . ( n - 1 ) ] ) . STD
is the standard-error of the mean for the years in class C(j) if the null hypothesis H0 is true. In order to assess the validity of this null hypothesis, we may then compute the following sample criterion for each grid-point X in the gridded dataset:
U(j) = [ M(j) - MEAN ] / STD-ERROR(j)
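As a rough illustration of the formulas above, the following Python sketch computes MEAN, STD, STD-ERROR(j) and U(j) for a single grid point. The function name u_statistic, the toy series and the chosen group of years are hypothetical and are not part of comp_composite_3d; years are indexed from 1, as in the -a= argument.

```python
import math

def u_statistic(x, group_years):
    """Return (M_j, MEAN, STD_ERROR_j, U_j) for one grid point.

    x           : the n values of the finite population (one per year)
    group_years : 1-based indices of the years in class C(j)
    """
    n = len(x)
    nj = len(group_years)
    mean = sum(x) / n
    # Finite-population variance: the divisor is n, as in the text above
    var = sum((v - mean) ** 2 for v in x) / n
    std = math.sqrt(var)
    # Composite mean of the years in class C(j)
    m_j = sum(x[y - 1] for y in group_years) / nj
    # Standard error of the mean under sampling without replacement
    std_error = math.sqrt((n - nj) / (nj * (n - 1))) * std
    return m_j, mean, std_error, (m_j - mean) / std_error

# Toy example (made-up numbers): 10 years, composite of years 2, 5 and 9
x = [0.1, 1.2, -0.4, 0.3, 1.5, -0.8, 0.0, -0.2, 1.1, -0.6]
m_j, mean, std_error, u = u_statistic(x, [2, 5, 9])
print(f"M(j)={m_j:.3f} MEAN={mean:.3f} STD-ERROR(j)={std_error:.3f} U(j)={u:.3f}")
```

In the real procedure the same computation would be repeated for every grid point and every step of the period; the sketch only covers one time series.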
In statistical terms, large absolute values of this U statistic indicate a strong departure from the null hypothesis. For example, U(j) = 8 means that the composite mean M(j) in class C(j) deviates from the finite population mean MEAN by 8 standard-errors under the null hypothesis of a random selection of years in class C(j) from the finite population of the n years.
Finally, the U statistic may be used to compute a critical probability, that is to say, the probability that, under the null hypothesis of a random selection of the years in group C(j) from the finite population, the U statistic takes values more discordant than the observed sample criterion U(j). More precisely, if we assume a two-sided test, this critical probability is defined by
PROB(j) = P( abs( U ) > abs( U(j) ) ) if the null hypothesis H0 is true
For this purpose, we need to find the null distribution of the U statistic. This null distribution may be estimated by means of simulation, as in bootstrapping techniques and (approximate) randomization tests [Noreen]. Alternatively, it may be approximated analytically: by the use of an extension of the central limit theorem, it is possible to demonstrate that U is approximately distributed as a Gaussian variable with mean zero and unit standard-deviation if H0 is true.
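The two ways of obtaining the critical probability can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the simulated probability uses the (hits + 1) / (nb + 1) convention common in randomization tests, and nothing here is taken from the actual comp_composite_3d source, whose exact estimator is not described in this document.

```python
import math
import random

def u_stat(x, idx):
    """U statistic for the 0-based positions `idx` of the series `x`."""
    n, nj = len(x), len(idx)
    mean = sum(x) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in x) / n)
    m_j = sum(x[i] for i in idx) / nj
    return (m_j - mean) / (math.sqrt((n - nj) / (nj * (n - 1))) * std)

def prob_normal(u):
    """Two-sided P(|U| > |u|) when U is treated as standard normal."""
    return math.erfc(abs(u) / math.sqrt(2.0))

def prob_simul(x, idx, nb=99, seed=0):
    """Two-sided critical probability estimated from `nb` random shuffles."""
    rng = random.Random(seed)
    u_obs = abs(u_stat(x, idx))
    n, nj = len(x), len(idx)
    # Draw nj years at random, without replacement, nb times
    hits = sum(
        abs(u_stat(x, rng.sample(range(n), nj))) >= u_obs for _ in range(nb)
    )
    # Assumption: the observed selection counts as one of the nb + 1 outcomes
    return (hits + 1) / (nb + 1)

# Toy example (made-up numbers)
x = [0.1, 1.2, -0.4, 0.3, 1.5, -0.8, 0.0, -0.2, 1.1, -0.6]
group = [1, 4, 8]  # 0-based positions of the composite years
u = u_stat(x, group)
print(f"U = {u:.3f}")
print(f"PROB (normal) = {prob_normal(u):.4f}")
print(f"PROB (simul)  = {prob_simul(x, group, nb=999):.4f}")
```

With this convention the smallest probability that a simulation can return is 1 / (nb + 1), which is one reason to increase the number of shuffles when small probabilities matter.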
In other words, the U statistic and its associated critical probability PROB estimated from the theory of random sampling in a finite population may be used to determine whether the null hypothesis H0 is to be retained or rejected as in a classical statistical test.
The procedure outlined above may be applied to each grid-point in the gridded dataset and to each group of years defined in the composite analysis in order to assess the statistical significance of the composite maps. Finally, it should be noted that if there are only two groups in the composite analysis, this test procedure may also be used to test the difference between the means in the two groups. In other words, testing the significance of the mean in one group is the same as testing the mean in the other group, or the difference between the two group means in this case.
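The two-group equivalence stated above follows from M(2) - MEAN = -[ n(1) / n(2) ] . [ M(1) - MEAN ] and STD-ERROR(2) = [ n(1) / n(2) ] . STD-ERROR(1), so that U(2) = -U(1). The short Python check below illustrates this on a made-up series; the helper name u_stat and the data are hypothetical.

```python
import math

def u_stat(x, idx):
    """U statistic for the 0-based positions `idx` of the series `x`."""
    n, nj = len(x), len(idx)
    mean = sum(x) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in x) / n)
    m_j = sum(x[i] for i in idx) / nj
    return (m_j - mean) / (math.sqrt((n - nj) / (nj * (n - 1))) * std)

# Made-up series split into two complementary groups of years
x = [0.1, 1.2, -0.4, 0.3, 1.5, -0.8, 0.0, -0.2, 1.1, -0.6]
group1 = [1, 4, 8]
group2 = [i for i in range(len(x)) if i not in group1]

u1 = u_stat(x, group1)
u2 = u_stat(x, group2)
print(f"U(1) = {u1:.6f}, U(2) = {u2:.6f}")
# Same magnitude, opposite sign, hence the same two-sided probability
print(math.isclose(u1, -u2))
```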
For more details and examples of use of the statistical test described above, and used in comp_composite_3d to assess the statistical significance of the composite means, consult the references cited below [Terray_etal] [Noreen].
If your data contains missing values, use comp_composite_miss_3d instead of comp_composite_3d to estimate the statistics of your composite analysis from your gappy dataset.
Finally, if the NetCDF variable is four-dimensional, use comp_composite_4d instead of comp_composite_3d.
This procedure is parallelized if OpenMP is used and the NCSTAT software has been built with the _PARALLEL_READ CPP macro.
Further Details¶
Usage¶
$ comp_composite_3d \
-f=input_netcdf_file \
-v=netcdf_variable \
-c=input_climatology_netcdf_file \
-a=year1,year2,year3, ... , yearn \
-m=input_mesh_mask_netcdf_file (optional) \
-g=grid_type (optional : n, t, u, v, w, f) \
-x=lon1,lon2 (optional) \
-y=lat1,lat2 (optional) \
-t=time1,time2 (optional) \
-s=type_of_statistics (optional : comp, diff) \
-alg=algorithm (optional : normal, simul) \
-nb=number_of_shuffles (optional) \
-o=output_composite_netcdf_file (optional) \
-mi=missing_value (optional) \
-nostd (optional) \
-double (optional) \
-bigfile (optional) \
-hdf5 (optional) \
-tlimited (optional)
By default¶
- -m=
- an input_mesh_mask_netcdf_file is not used
- -g=
- the grid_type is set to n, which means that the 2-D grid-mesh associated with the input netcdf_variable is assumed to be regular or Gaussian
- -x=
- the whole longitude domain associated with the netcdf_variable
- -y=
- the whole latitude domain associated with the netcdf_variable
- -t=
- the whole time period associated with the netcdf_variable
- -s=
- the type_of_statistics is set to comp
- -alg=
- the algorithm is set to normal
- -nb=
- number_of_shuffles is set to 99
- -o=
- the output_composite_netcdf_file is named composite_netcdf_variable.nc
- -mi=
- the missing_value attribute for the NetCDF variables in the output NetCDF file is set to 1.e+20
- -nostd
- the composite fields of the input NetCDF variable are standardized. If -nostd is activated, the composite fields are not standardized
- -double
- the composite statistics are stored as single-precision floating point numbers in the output NetCDF file. If -double is activated, the results are stored as double-precision floating point numbers
- -bigfile
- a NetCDF classical format file is created. If -bigfile is activated, the output NetCDF file is a 64-bit offset format file
- -hdf5
- a NetCDF classical format file is created. If -hdf5 is activated, the output NetCDF file is a NetCDF-4/HDF5 format file
- -tlimited
- the time dimension is defined as unlimited in the output NetCDF file. However, if -tlimited is activated, the time dimension is defined as limited in the output NetCDF file
Remarks¶
The -v=netcdf_variable argument specifies the NetCDF variable for which a composite analysis must be computed and the -f=input_netcdf_file argument specifies that this NetCDF variable must be extracted from the NetCDF file input_netcdf_file.
The input_climatology_netcdf_file specified with the -c= argument must contain the means and standard deviations for the finite parent population. The periodicity for the composite analysis is also deduced from the number of observations and periodicity in the input_climatology_netcdf_file. This input_climatology_netcdf_file must have been created by comp_clim_3d applied to the same netcdf_variable with an identical -t=time1,time2 argument as used in comp_composite_3d in order to obtain correct statistics in the output NetCDF file output_composite_netcdf_file.
The geographical shapes of the netcdf_variable (in the input_netcdf_file) and the climatology (in the input_climatology_netcdf_file) must agree.
The -a= argument lists the indices of the years (or seasons, months or days, depending on the sampling of the observations in the input_netcdf_file) that must be included in the composite analysis (i.e., used for computing the composite means). The indices of the years are counted from the start of the (selected) time period (i.e., time1 in the -t=time1,time2 argument or 1 if this argument is missing). The list may be specified in different formats:
- -a=n1,n2,…nn selects years n1, n2, … and nn in the input_netcdf_file
- -a=n1:n2 selects years n1 to n2 in the input_netcdf_file
The two forms of the -a= argument may be combined and repeated any number of times, but duplicate years are not allowed.
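To make the two list formats concrete, here is a small Python sketch that expands such a list into explicit year indices and rejects duplicates. It assumes that the two forms are mixed inside one comma-separated list (for instance 1:3,10,20:22); the function name parse_year_list and the handling of whitespace and of reversed ranges are illustrative choices, not a description of how comp_composite_3d parses its arguments.

```python
def parse_year_list(spec):
    """Expand a list such as '1:3,10,20:22' into a list of year indices."""
    years = []
    for item in spec.split(","):
        item = item.strip()
        if ":" in item:
            # Range form n1:n2 (both ends included)
            first, last = (int(v) for v in item.split(":"))
            if first > last:
                raise ValueError(f"empty range: {item}")
            years.extend(range(first, last + 1))
        else:
            # Single year index
            years.append(int(item))
    if len(set(years)) != len(years):
        raise ValueError("duplicate years are not allowed")
    return years

print(parse_year_list("10,11,20,55,143"))
print(parse_year_list("1:3,10,20:22"))
```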
Remember also, when specifying the indices of the years in the -a= argument, that the “length” of a year in the input_netcdf_file is determined by the periodicity of the observations as deduced from the number of time observations and periodicity in the input_climatology_netcdf_file.
This periodicity will also determine how many composite means M(j) and statistics U(j) will be computed and stored in the output_composite_netcdf_file by comp_composite_3d.
The optional argument -m=input_mesh_mask_netcdf_file specifies the land-sea mask to apply to netcdf_variable for transforming this tridimensional NetCDF variable into a rectangular matrix of observed variables before computing the composite analysis. By default, it is assumed that each cell in the 2-D grid-mesh associated with the input tridimensional NetCDF variable is a valid time series (i.e., missing values are not present).
The geographical shapes of the netcdf_variable (in the input_netcdf_file), the climatology (in the input_climatology_netcdf_file) and the mask (if an input_mesh_mask_netcdf_file is used) must agree.
Refer to comp_clim_3d or comp_mask_3d for creating a valid input_mesh_mask_netcdf_file NetCDF file for regular or Gaussian grids before using comp_composite_3d.
If -g= is set to t, u, v, w or f, it is assumed that the input NetCDF variable is from an experiment with the NEMO model (ORCA configuration and R2, R4 or R05 resolutions). This argument is also used to determine the name of the mesh_mask variable if an input_mesh_mask_netcdf_file is used.
If the -x=lon1,lon2 and -y=lat1,lat2 arguments are missing, the whole geographical domain associated with the netcdf_variable is used in the composite analysis.
The longitude or latitude range must be a vector of two integers specifying the first and last selected indices along each dimension. The indices are relative to 1. Negative values are allowed for lon1. In this case the longitude domain is from nlon+lon1+1 to lon2, where nlon is the number of longitude points in the grid associated with the NetCDF variable, and it is assumed that the grid is periodic.
Refer to comp_mask_3d for transforming geographical coordinates into indices or generating an appropriate mesh-mask before using comp_composite_3d.
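The rule for a negative lon1 can be pictured with the following Python sketch, which lists the 1-based longitude indices covered by a -x=lon1,lon2 selection. It assumes that the selection wraps around the periodic boundary (from nlon+lon1+1 up to nlon, then from 1 to lon2); the function name longitude_indices and the rejection of lon1 = 0 are hypothetical choices made for the illustration.

```python
def longitude_indices(lon1, lon2, nlon):
    """1-based longitude indices selected by -x=lon1,lon2 (illustrative)."""
    if lon1 == 0:
        raise ValueError("indices are relative to 1")
    if lon1 >= 1:
        # Ordinary case: a contiguous block of indices
        return list(range(lon1, lon2 + 1))
    # Negative lon1: start at nlon+lon1+1 and wrap around the periodic grid
    start = nlon + lon1 + 1
    return list(range(start, nlon + 1)) + list(range(1, lon2 + 1))

# Grid with 360 longitude points: -x=-10,20 selects 351..360 then 1..20
idx = longitude_indices(-10, 20, 360)
print(idx[0], idx[9], idx[10], idx[-1], len(idx))
```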
If the -t=time1,time2 argument is missing, data in the whole time period associated with the netcdf_variable is taken into account. The selected time period is a vector of two integers specifying the first and last time observations. The indices are relative to 1.
The length of the selected time period (i.e., time2 - time1 + 1) must be a whole multiple of the periodicity as deduced from the number of time observations in the input_climatology_netcdf_file if -alg=simul (see remark below).
It is also assumed that the selected time period matches exactly the time period used to compute the climatology in the input_climatology_netcdf_file. Moreover, the first selected time observation (time1 if the -t= argument is present) must correspond to the first day, month or season of the climatology specified with the -c= argument.
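The relationship between the selected time period, the periodicity and the year indices given with -a= can be sketched as below. The sketch assumes that year number k covers the periodicity consecutive observations starting at time1 + (k - 1) * periodicity, which is consistent with the remarks above but is not spelled out in this document; the function names are hypothetical.

```python
def number_of_years(time1, time2, periodicity):
    """Number of whole years in the selected period (error if not whole)."""
    ntime = time2 - time1 + 1
    if ntime % periodicity != 0:
        raise ValueError("time2 - time1 + 1 must be a multiple of periodicity")
    return ntime // periodicity

def year_time_indices(year, periodicity, time1=1):
    """1-based time indices in the input file covered by year `year`."""
    start = time1 + (year - 1) * periodicity
    return list(range(start, start + periodicity))

# Monthly data (periodicity 12) over 240 time steps: 20 years
print(number_of_years(1, 240, 12))
# Year 10 then covers time steps 109 to 120
print(year_time_indices(10, 12))
```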
The -s=type_of_statistics argument specifies how the composite variable netcdf_variable_composite stored in the output_composite_netcdf_file is computed:
- -s=comp means that the composites are computed as the standardized differences between the mean of the composite years and the overall mean (i.e., the mean of the parent finite population) for each grid-point.
- -s=diff means that the composites are computed as the standardized differences between the mean of the composite years and the mean of the other years in the parent finite population for each grid-point (this option is useful when only two groups are considered in the composite analysis).
The standard deviations from the finite parent population are used for the standardization in both cases and are read from the input_climatology_netcdf_file.
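The difference between the two types of statistics can be sketched in Python for a single grid point as follows. The function name composite_value and its keyword arguments are hypothetical; the standard deviation is computed here with a divisor of n, as in the Purpose section, whereas comp_composite_3d reads it from the input_climatology_netcdf_file.

```python
import math

def composite_value(x, group_years, kind="comp", standardize=True):
    """Composite for one grid point; `group_years` holds 1-based indices."""
    n = len(x)
    chosen = set(group_years)
    selected = [v for i, v in enumerate(x, start=1) if i in chosen]
    others = [v for i, v in enumerate(x, start=1) if i not in chosen]
    mean = sum(x) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in x) / n)
    m_sel = sum(selected) / len(selected)
    if kind == "comp":
        # Composite mean minus the overall mean
        value = m_sel - mean
    elif kind == "diff":
        # Composite mean minus the mean of the other years
        value = m_sel - sum(others) / len(others)
    else:
        raise ValueError("kind must be 'comp' or 'diff'")
    # standardize=False mimics the effect of the -nostd option
    return value / std if standardize else value

x = [1.0, 2.0, 3.0, 4.0]
print(composite_value(x, [4], kind="comp", standardize=False))
print(composite_value(x, [4], kind="diff", standardize=False))
```

For a group of n(j) years out of n, the diff value equals the comp value multiplied by n / [ n - n(j) ], so the two maps differ only by a constant factor.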
Finally, note that this argument does not affect the other variables stored in the output_composite_netcdf_file.
The -nostd argument specifies that the composite variable netcdf_variable_composite in the output_composite_netcdf_file must not be standardized.
By default, the composites are standardized in the output_composite_netcdf_file.
The -alg= argument selects the method for computing the critical probabilities associated with the composite mean M(j) and the U(j) statistic:
- If -alg=normal, a normal approximation is used to compute the critical probabilities associated with the composite mean M(j) and the U(j) statistic.
- If -alg=simul, the critical probabilities are estimated by a resampling method. If -alg=simul, the length of the selected time period (i.e., time2 - time1 + 1) must be a whole multiple of the periodicity.
The -nb=number_of_shuffles argument specifies the number of shuffles for the resampling procedure used to estimate critical probabilities if -alg=simul.
The -mi=missing_value argument specifies the missing value indicator associated with the NetCDF variables in the output_composite_netcdf_file.
If the -mi= argument is not specified, the missing_value attribute is set to 1.e+20 for the NetCDF variables in the output_composite_netcdf_file.
The -double argument specifies that the results are stored as double-precision floating point numbers in the output NetCDF file.
By default, the results are stored as single-precision floating point numbers in the output NetCDF file.
The -bigfile argument is allowed only if the NCSTAT software has been compiled with the _USE_NETCDF36 or _USE_NETCDF4 CPP macros (e.g., -D_USE_NETCDF36 or -D_USE_NETCDF4) and linked to the NetCDF 3.6 library or higher.
If this argument is specified, the output_composite_netcdf_file will be a 64-bit offset format file instead of a NetCDF classic format file. However, this argument is recognized in the procedure only if the NCSTAT software has been built with the _USE_NETCDF36 or _USE_NETCDF4 CPP macros.
The -hdf5 argument is allowed only if the NCSTAT software has been compiled with the _USE_NETCDF4 CPP macro (e.g., -D_USE_NETCDF4) and linked to the NetCDF 4 library or higher.
If this argument is specified, the output_composite_netcdf_file will be a NetCDF-4/HDF5 format file instead of a NetCDF classic format file. However, this argument is recognized in the procedure only if the NCSTAT software has been built with the _USE_NETCDF4 CPP macro.
Duplicate parameters are allowed, but it is always the last occurrence of a parameter that is used for the computations. Moreover, the number of specified parameters must not be greater than the total number of allowed parameters.
It is assumed that the data has no missing values. If this is not the case, use comp_composite_miss_3d instead of comp_composite_3d.
For more details on composite analysis and statistical testing in composite analysis, see
- “The Use of t values in Composite Analyses”, by Brown, T.J., and Hall, B.L., Journal of Climate, vol. 12, 2941-2944, 1999. doi: 10.1175/1520-0442(1999)012<2941:TUOTVI>2.0.CO;2
- “Sea Surface Temperature associations with the Late Indian Summer Monsoon”, by Terray, P., Delecluse, P., Labattu, S., and Terray, L., Climate Dynamics, vol. 21, 593-618, 2003. doi: 10.1007/s00382-003-0354-0
- “Computer-intensive methods for testing hypotheses: an introduction”, by Noreen, E.W., Wiley and Sons, New York, USA, 1989. ISBN: 978-0-471-61136-3
- “Statistical Analysis in Climate Research”, by von Storch, H., and Zwiers, F.W., Cambridge University Press, Cambridge, UK, Chapter 2, 484 pp., 2002. ISBN: 9780521012300
Outputs¶
comp_composite_3d creates an output NetCDF file that contains the composite statistics and critical probabilities associated with the composite means M(j), taking into account, where applicable, the periodicity of the data as determined by the -c=input_climatology_netcdf_file argument. The output NetCDF dataset contains the following NetCDF variables (in the description below, nlat and nlon are the lengths of the spatial dimensions of the input NetCDF variable and periodicity is the number of time observations in the input_climatology_netcdf_file):
- netcdf_variable_compmean(periodicity,nlat,nlon) : the mean fields M(j) of the time observations selected with the help of the -a= argument, taking into account the periodicity of the data as determined by the number of observations in the input_climatology_netcdf_file (specified with the help of the -c= argument).
- netcdf_variable_compstd(periodicity,nlat,nlon) : the standard-deviation fields of the time observations selected with the help of the -a= argument, taking into account the periodicity of the data as determined by the -c=input_climatology_netcdf_file argument. This corresponds to the standard-deviation of the years belonging to class C(j).
- netcdf_variable_qstd(periodicity,nlat,nlon) : the ratio of the standard-deviation fields of the selected time observations to the standard-deviation fields of all the time observations, taking into account the periodicity of the data as determined by the -c=input_climatology_netcdf_file argument. The standard-deviation fields for all the observations are extracted from the input_climatology_netcdf_file specified in the -c= argument.
- netcdf_variable_composite(periodicity,nlat,nlon) : the composite fields of the time observations selected with the help of the -a= argument, taking into account the periodicity of the data as determined by the number of observations in the input_climatology_netcdf_file (specified with the help of the -c= argument). How these composite fields are calculated is determined by the -s= and -nostd arguments.
- netcdf_variable_u(periodicity,nlat,nlon) : the U(j) statistics for the time observations selected with the help of the -a= argument, taking into account the periodicity of the data as determined by the number of observations in the input_climatology_netcdf_file (specified with the help of the -c= argument). See above and the second publication cited above for more details on the definition of the U(j) statistic.
- netcdf_variable_prob(periodicity,nlat,nlon) : the critical probabilities associated with the composite means M(j) and the U(j) statistics. These critical probabilities are computed under the null hypothesis that the selected time observations (with the help of the -a= argument) come from the same finite population as the other observations in the input_netcdf_file. Small probabilities indicate a large departure from the null hypothesis H0. The -alg=algorithm argument determines how these critical probabilities are computed.
- netcdf_variable_compnobs(periodicity) : the number of observations used to compute the composite means and standard-deviations stored in the NetCDF variables netcdf_variable_compmean and netcdf_variable_compstd defined above.
All these statistics are packed in tridimensional variables whose first and second dimensions are exactly the same as those associated with the input NetCDF variable netcdf_variable, even if you restrict the geographical domain with the -x= and -y= arguments. However, outside the selected domain, the output NetCDF variables are filled with missing values.
Example¶
For computing a composite analysis from a tridimensional NetCDF variable sosstsst in the NetCDF file ST7_1m_00101_20012_grid_T_sosstsst.nc, assessing the significance of the results with a Monte Carlo simulation with 999 shuffles and, finally, storing the results in a NetCDF file named composite_ST7_1m_sosstsst_grid_T.nc, use the following command (note that the critical probabilities associated with the U statistics are estimated with the help of a resampling method using 999 surrogate time series since -alg=simul and -nb=999 are specified):
$ comp_composite_3d \
-f=ST7_1m_00101_20012_sosstsst_grid_T.nc \
-v=sosstsst \
-c=clim_sosstsst_grid_T.nc \
-a=10,11,20,55,143 \
-g=t \
-m=mesh_mask_ST7_grid_T.nc \
-alg=simul \
-nb=999 \
-o=composite_ST7_1m_sosstsst_grid_T.nc