comp_composite_miss_3d

Authors

Pascal Terray (LOCEAN/IPSL)

Latest revision

27/05/2024

Purpose

Compute a composite analysis from a tridimensional variable with missing values extracted from a NetCDF dataset and perform statistical tests on the differences between the composite mean and the overall mean (e.g., the mean of the parent finite population) for each cell of the 2-D grid-mesh associated with the tridimensional NetCDF variable. The composite statistics and their associated critical probabilities may be computed by taking into account the periodicity of the data. These statistics are stored in an output NetCDF dataset.

The classical Student’s two-sample t-test [vonStorch_Zwiers], as traditionally used in the climate literature, is not valid for statistical inference in composite analyses [Brown_Hall]. Thus, to obtain more meaningful results, comp_composite_miss_3d uses the theory of random sampling without replacement in the finite population of the time observations and a normal approximation to assess the statistical significance of the composite means [Terray_etal]. Note, however, that resampling techniques cannot be used in comp_composite_miss_3d due to the presence of missing values in the data. This is in contrast with comp_composite_3d, in which the user can choose between a resampling scheme and the normal approximation method to estimate the statistical significance of the composite means.

For a more comprehensive description of the composite method and of the statistical test used in comp_composite_miss_3d to assess the statistical significance of the composite means, consult the documentation of comp_composite_3d and the references cited below [Brown_Hall] [Terray_etal].

If your data does not contain missing values, use comp_composite_3d instead of comp_composite_miss_3d to estimate the statistics of the composite analysis from your dataset.
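
As a rough illustration of the normal approximation mentioned above, the sketch below evaluates a standardized statistic for a single grid cell and a single step of the cycle, using the textbook variance of a mean of n values drawn without replacement from a finite population of N values. The function name, its arguments, the divisor convention for the population standard deviation and the two-sided form of the probability are all assumptions of this sketch; the exact definition of the statistic used by comp_composite_miss_3d is given in the documentation of comp_composite_3d.

```python
import math


def composite_u_test(pop_mean, pop_std, pop_n, comp_mean, comp_n):
    """Return (u, prob) for one grid cell and one step of the cycle.

    Hypothetical helper, not part of NCSTAT. It assumes that pop_std is
    the standard deviation of the parent population computed with
    divisor pop_n (an assumption of this sketch).
    """
    if not 0 < comp_n < pop_n or pop_std <= 0.0:
        return None, None  # the statistic is undefined for this cell
    # Variance of the mean of comp_n values drawn without replacement
    # from a finite population of pop_n values.
    var_mean = (pop_std ** 2 / comp_n) * (pop_n - comp_n) / (pop_n - 1)
    u = (comp_mean - pop_mean) / math.sqrt(var_mean)
    # Two-sided tail probability under the standard normal distribution.
    prob = math.erfc(abs(u) / math.sqrt(2.0))
    return u, prob
```

With missing values, both the composite size and the population size may differ from one grid cell to another, which is why they are passed per cell here.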

Further Details

Usage

$ comp_composite_miss_3d \
  -f=input_netcdf_file \
  -v=netcdf_variable \
  -c=input_climatology_netcdf_file \
  -a=year1,year2,year3, ... , yearn \
  -m=input_mesh_mask_netcdf_file    (optional) \
  -g=grid_type                      (optional : n, t, u, v, w, f) \
  -x=lon1,lon2                      (optional) \
  -y=lat1,lat2                      (optional) \
  -t=time1,time2                    (optional) \
  -s=type_of_statistics             (optional : comp, diff) \
  -o=output_composite_netcdf_file   (optional) \
  -mi=missing_value                 (optional) \
  -nostd                            (optional) \
  -double                           (optional) \
  -bigfile                          (optional) \
  -hdf5                             (optional) \
  -tlimited                         (optional)

By default

-m=
an input_mesh_mask_netcdf_file is not used
-g=
the grid_type is set to n which means that the 2-D grid-mesh associated with the input netcdf_variable is assumed to be regular or Gaussian
-x=
the whole longitude domain associated with the netcdf_variable
-y=
the whole latitude domain associated with the netcdf_variable
-t=
the whole time period associated with the netcdf_variable
-s=
the type_of_statistics is set to comp
-o=
the output_composite_netcdf_file is named composite_netcdf_variable.nc
-mi=
the missing_value attribute in the output NetCDF file is set to 1.e+20
-nostd
the composite fields of the input NetCDF variable are standardized. If -nostd is activated, the composite fields are not standardized
-double
the composite statistics are stored as single-precision floating point numbers in the output NetCDF file. If -double is activated, the results are stored as double-precision floating point numbers
-bigfile
a NetCDF classical format file is created. If -bigfile is activated, the output NetCDF file is a 64-bit offset format file
-hdf5
a NetCDF classical format file is created. If -hdf5 is activated, the output NetCDF file is a NetCDF-4/HDF5 format file
-tlimited
the time dimension is defined as unlimited in the output NetCDF file. However, if -tlimited is activated, the time dimension is defined as limited in the output NetCDF file

Remarks

  1. The -v=netcdf_variable argument specifies the NetCDF variable for which a composite analysis must be computed and the -f=input_netcdf_file argument specifies that this NetCDF variable must be extracted from the NetCDF file, input_netcdf_file.

  2. The input_climatology_netcdf_file specified with the -c= argument must contain the means and standard deviations for the parent population. The periodicity for the composite analysis is also deduced from the number of observations in the input_climatology_netcdf_file. This input_climatology_netcdf_file must have been created by comp_clim_miss_3d applied to the same netcdf_variable with an identical -t=time1,time2 argument.

    The geographical shapes of the netcdf_variable (in the input_netcdf_file) and the climatology (in the input_climatology_netcdf_file) must agree.

  3. The -a= argument lists the indices of the years (or seasons, months or days depending on the sampling of the observations in the input_netcdf_file), which must be included in the composite analysis (e.g., for computing the composite means). The indices of the years are counted from the start of the (selected) time period (e.g., time1 in the -t=time1,time2 argument or 1 if this argument is missing).

    The list may be specified in different formats:

    • -a=n1,n2,…nn selects years n1, n2, … and nn in the input_netcdf_file
    • -a=n1:n2 selects years n1 to n2 in the input_netcdf_file

    The two forms of the -a= argument may be combined and repeated any number of times, but duplicate years are not allowed.

    Note, however, that the number of years actually used to compute the composite means may vary from one cell to another in the 2-D grid-mesh associated with the input tridimensional NetCDF variable because of the presence of missing values.

    Remember also, when specifying the indices of the years in the -a= argument, that the “length” of a year in the input_netcdf_file is determined by the periodicity of the observations as deduced from the number of time observations and periodicity in the input_climatology_netcdf_file.

    This periodicity will also determine how many composite means M(j) and statistics U(j) will be computed and stored in the output_composite_netcdf_file by comp_composite_miss_3d.

    See the documentation of comp_composite_3d for more details on the definition of the U(j) statistic.

  4. The optional argument -m=input_mesh_mask_netcdf_file specifies the land-sea mask to apply to netcdf_variable for transforming this tridimensional NetCDF variable into a rectangular matrix of observed variables before computing the composite analysis. By default, it is assumed that each cell in the 2-D grid-mesh associated with the input tridimensional NetCDF variable is a valid time series (e.g., some non-missing values are not present in each time series).

    The geographical shapes of the netcdf_variable (in the input_netcdf_file), the climatology (in the input_climatology_netcdf_file) and the mask (if an input_mesh_mask_netcdf_file is used) must agree.

    Refer to comp_clim_miss_3d or comp_mask_3d for creating a valid input_mesh_mask_netcdf_file for regular or Gaussian grids before using comp_composite_miss_3d.

  5. If -g= is set to t, u, v, w or f, it is assumed that the input NetCDF variable is from an experiment with the NEMO model (ORCA configuration and R2, R4 or R05 resolutions). This argument is also used to determine the name of the mesh_mask variable if an input_mesh_mask_netcdf_file is used.

  6. If the -x=lon1,lon2 and -y=lat1,lat2 arguments are missing, the whole geographical domain associated with the netcdf_variable is used in the composite analysis.

    The longitude or latitude range must be a vector of two integers specifying the first and last selected indices along each dimension. The indices are relative to 1. Negative values are allowed for lon1. In this case the longitude domain is from nlon+lon1+1 to lon2 where nlon is the number of longitude points in the grid associated with the NetCDF variable and it is assumed that the grid is periodic.

    Refer to comp_mask_3d for transforming geographical coordinates into indices or generating an appropriate mesh-mask before using comp_composite_miss_3d.

  7. If the -t=time1,time2 argument is missing, data in the whole time period associated with the netcdf_variable is taken into account. The selected time period is a vector of two integers specifying the first and last time observations. The indices are relative to 1.

    It is also assumed that the selected time period matches exactly the time period used to compute the climatology in the input_climatology_netcdf_file. Moreover, the first selected time observation (time1 if the -t= argument is present) must correspond to the first day, month or season of the climatology specified with the -c= argument.

  8. The -s=type_of_statistics argument specifies how the composite variable netcdf_variable_composite stored in the output_composite_netcdf_file is computed:

    • -s=comp means that the composites are computed as the standardized differences between the mean of the composite years and the overall mean (e.g., the mean of the parent finite population) for each grid-point.
    • -s=diff means that the composites are computed as the standardized differences between the mean of the composite years and the mean of the other years in the parent finite population for each grid-point.

    The standard deviations from the parent population are used for the standardization in both cases and are read from the input_climatology_netcdf_file. Finally, note that this argument does not affect the other variables stored in the output_composite_netcdf_file.

  9. The -nostd argument specifies that the composite variable netcdf_variable_composite in the output_composite_netcdf_file must not be standardized.

    By default, the composites are standardized in the output_composite_netcdf_file.

  10. It is assumed that the specified netcdf_variable has a scalar missing_value or _FillValue attribute and that missing values in the data are identified by the value of this missing_value or _FillValue attribute.

  11. The -mi=missing_value argument specifies the missing value indicator associated with the NetCDF variables in the output_composite_netcdf_file.

    If the -mi= argument is not specified, the missing_value attribute is set to 1.e+20 for the NetCDF variables in the output_composite_netcdf_file.

  12. The -double argument specifies that the results are stored as double-precision floating point numbers in the output NetCDF file.

    By default, the results are stored as single-precision floating point numbers in the output NetCDF file.

  13. The -bigfile argument is allowed only if the NCSTAT software has been compiled with the _USE_NETCDF36 or _USE_NETCDF4 CPP macros (e.g., -D_USE_NETCDF36 or -D_USE_NETCDF4) and linked to the NetCDF 3.6 library or higher.

    If this argument is specified, the output_composite_netcdf_file will be a 64-bit offset format file instead of a NetCDF classic format file. However, this argument is recognized in the procedure only if the NCSTAT software has been built with the _USE_NETCDF36 or _USE_NETCDF4 CPP macros.

  14. The -hdf5 argument is allowed only if the NCSTAT software has been compiled with the _USE_NETCDF4 CPP macro (e.g., -D_USE_NETCDF4) and linked to the NetCDF 4 library or higher.

    If this argument is specified, the output_composite_netcdf_file will be a NetCDF-4/HDF5 format file instead of a NetCDF classic format file. However, this argument is recognized in the procedure only if the NCSTAT software has been built with the _USE_NETCDF4 CPP macro.

  15. Duplicate parameters are allowed, but only the last occurrence of a parameter is used for the computations. Moreover, the number of specified parameters must not be greater than the total number of allowed parameters.

  16. For more details on composite analysis and statistical testing in composite analysis, see

    • “The Use of t values in Composite Analyses” by Brown, T.J., and Hall, B.L., Journal of Climate, vol. 12, 2941-2944, 1999. doi: 10.1175/1520-0442(1999)012<2941:TUOTVI>2.0.CO;2
    • “Sea Surface Temperature associations with the Late Indian Summer Monsoon”, by Terray, P., Delecluse P., Labattu S., Terray L., Climate Dynamics, vol. 21, 593-618, 2003. doi: 10.1007/s00382-003-0354-0
    • “Computer-intensive methods for testing hypotheses: an introduction”, by Noreen, E.W., Wiley and Sons, New York, USA, 1989. ISBN: 978-0-471-61136-3
    • “Statistical Analysis in Climate Research”, by von Storch, H., and Zwiers, F.W., Cambridge University Press, Cambridge, UK, Chapter 2, 484 pp., 2002. ISBN: 9780521012300
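
The indexing conventions described in remarks 3 and 7 can be sketched as follows. This is only an illustration: the helper names are hypothetical, the assumption that single indices and n1:n2 ranges may be mixed within one comma-separated value is not confirmed by the text above, and the mapping from year indices to time observations simply assumes that each "year" spans periodicity consecutive observations starting at time1.

```python
def expand_year_spec(spec):
    """Expand a value such as '1:3,10' into a list of year indices.

    Hypothetical helper: each comma-separated token is taken to be
    either a single index n or a range n1:n2 (both bounds included).
    """
    years = []
    for token in spec.split(","):
        token = token.strip()
        if ":" in token:
            first, last = (int(part) for part in token.split(":"))
            if first > last:
                raise ValueError("empty range: " + token)
            years.extend(range(first, last + 1))
        else:
            years.append(int(token))
    if len(set(years)) != len(years):
        raise ValueError("duplicate years are not allowed")
    return years


def time_indices(years, periodicity, time1=1):
    """Map year indices to 1-based time observations, one list per step j.

    Assumes that year k covers the observations
    time1 + (k - 1) * periodicity ... time1 + k * periodicity - 1.
    """
    return [
        [time1 + (year - 1) * periodicity + j for year in years]
        for j in range(periodicity)
    ]
```

Under these assumptions, with monthly data and a periodicity of 12, years 1 and 3 contribute time observations 1 and 25 to the first composite mean, 2 and 26 to the second, and so on.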

Outputs

comp_composite_miss_3d creates an output NetCDF file that contains the composite statistics and the critical probabilities associated with the composite means, taking into account the missing values and, if applicable, the periodicity of the data as determined by the -c=input_climatology_netcdf_file argument. The output NetCDF dataset contains the following NetCDF variables (in the description below, nlat and nlon are the lengths of the spatial dimensions of the input NetCDF variable and periodicity is the number of time observations in the input_climatology_netcdf_file) :

  1. netcdf_variable_compmean(periodicity,nlat,nlon) : the mean fields M(j) of the time observations selected with the help of the -a= argument, taking into account the missing values and the periodicity of the data as determined by the number of observations in the input_climatology_netcdf_file (specified with the help of the -c= argument).

  2. netcdf_variable_compstd(periodicity,nlat,nlon) : the standard-deviation fields of the time observations selected with the help of the -a= argument, taking into account the missing values and the periodicity of the data as determined by the -c=input_climatology_netcdf_file argument. This corresponds to the standard-deviation of the years belonging to class C(j).

  3. netcdf_variable_qstd(periodicity,nlat,nlon) : the ratio of the standard-deviation fields of the selected time observations to the standard-deviation fields of all the time observations, taking into account the periodicity of the data as determined by the -c=input_climatology_netcdf_file argument. The standard-deviation fields for all the observations are extracted from the input_climatology_netcdf_file specified in the -c= argument.

  4. netcdf_variable_composite(periodicity,nlat,nlon) : the composite fields of the time observations selected with the help of the -a= argument, taking into account the missing values and the periodicity of the data as determined by the number of observations in the input_climatology_netcdf_file (specified with the help of the -c= argument).

    How these composite fields are calculated is determined by the -s= and -nostd arguments.

  5. netcdf_variable_u(periodicity,nlat,nlon) : the U(j) statistics for the time observations selected with the help of the -a= argument, taking into account the missing values and the periodicity of the data as determined by the number of observations in the input_climatology_netcdf_file (specified with the help of the -c= argument).

    See the documentation of comp_composite_3d for more details on the definition of the U(j) statistic.

  6. netcdf_variable_prob(periodicity,nlat,nlon) : the critical probabilities associated with the U statistics.

    These critical probabilities are computed under the null hypothesis that the selected time observations (with the help of the -a= argument) come from the same population as the other observations in the input_netcdf_file. Small probabilities indicate a large departure from the null hypothesis H0.

    These critical probabilities are computed with a Gaussian approximation; see the documentation of comp_composite_3d for more details.

  7. netcdf_variable_compnobs(periodicity,nlat,nlon) : the number of observations used to compute the composite means and standard-deviations for each grid cell in the 2-D grid-mesh associated with the input tridimensional NetCDF variable.

Note that some of these statistics and the associated critical probabilities may be missing for some grid cells in the 2-D grid-mesh associated with the input tridimensional NetCDF variable, depending on the pattern of missing values in this input NetCDF variable.

All these statistics are packed in tridimensional variables whose first and second dimensions are exactly the same as those associated with the input NetCDF variable netcdf_variable even if you restrict the geographical domain with the -x= and -y= arguments. However, outside the selected domain, the output NetCDF variables are filled with missing values.
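
To make the effect of missing values and of the -s= and -nostd arguments more concrete, here is a minimal sketch for one grid cell and one step of the cycle. It is not the NCSTAT implementation: the function name, its arguments and the returned dictionary are hypothetical, the parent-population statistics are passed in directly instead of being read from the climatology file, and missing data are assumed to be flagged by a single scalar value.

```python
def cell_composite(series, years, pop_mean, pop_std, pop_nobs,
                   stat="comp", standardize=True, missing=1.0e20):
    """Composite statistics for one grid cell and one step of the cycle.

    series   : one value per year for this step (`missing` marks gaps)
    years    : 1-based indices of the years selected for the composite
    pop_*    : mean, standard deviation and number of non-missing
               values of the parent population for this cell
    """
    selected = [series[y - 1] for y in years]
    values = [v for v in selected if v != missing]
    nobs = len(values)  # may be smaller than len(years)
    result = {"compnobs": nobs, "compmean": missing, "composite": missing}
    if nobs == 0:
        return result  # nothing to average for this cell
    compmean = sum(values) / nobs
    result["compmean"] = compmean
    if stat == "comp":
        # Departure from the overall mean of the parent population.
        composite = compmean - pop_mean
    elif stat == "diff":
        if nobs >= pop_nobs:
            return result  # no "other years" left in the population
        # Mean of the years that are not in the composite.
        other_mean = (pop_nobs * pop_mean - nobs * compmean) / (pop_nobs - nobs)
        composite = compmean - other_mean
    else:
        raise ValueError("stat must be 'comp' or 'diff'")
    if standardize:
        if pop_std <= 0.0:
            return result  # cannot standardize a constant series
        composite /= pop_std
    result["composite"] = composite
    return result
```

In this sketch a cell whose selected years are all missing keeps the missing-value flag in every field except the count, which mirrors the note above that some statistics may be missing for some grid cells.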

Example

  1. For computing a composite analysis from a tridimensional NetCDF variable sst in the NetCDF file Hadisst_1m_195001_201512_sst.nc, assessing the significance of the results with a Gaussian approximation and, finally, storing the results in a NetCDF file named composite_Hadisst_1m_195001_201512_sst.nc, use the following command :

    $ comp_composite_miss_3d \
      -f=Hadisst_1m_195001_201512_sst.nc \
      -v=sst \
      -c=clim_Hadisst_1m_195001_201512_sst.nc \
      -a=10,11,20,55,143 \
      -m=mesh_mask_Hadisst.nc \
      -o=composite_Hadisst_1m_195001_201512_sst.nc
    