comp_composite_4d

Authors

Pascal Terray (LOCEAN/IPSL)

Latest revision

12/09/2018

Purpose

Compute a composite analysis from a four-dimensional variable extracted from a NetCDF dataset and perform statistical tests on the differences between the composite mean and the overall mean (i.e. the mean of the parent finite population) for each cell of the 3-D grid-mesh associated with the four-dimensional NetCDF variable. The composite statistics and tests may be computed by taking into account the periodicity of the data. These statistics are stored in an output NetCDF dataset.

The use of Student t values and tests in composite analyses, as is traditional in the climate literature, is not valid [Brown_Hall]. To obtain more meaningful results, comp_composite_4d estimates the statistical significance of the composite means either with a resampling scheme or with probabilities based on the finite population of the time observations and a normal approximation [Terray_etal].

For more details on the statistical test used in comp_composite_4d to assess the significance of the composite means, consult the references cited below [Terray_etal] [Noreen].
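
As a rough illustration of the two approaches mentioned above, the Python sketch below computes a critical probability for a single grid cell, once with a finite-population normal approximation and once by resampling. It is not the NCSTAT implementation. The function names, the use of the composite-mean departure as the test statistic, the two-sided test, the divisor N for the population variance and the (hits + 1) / (shuffles + 1) estimator are all assumptions made for the example, and periodicity is ignored (one value per year).

```python
import math
import random
import statistics


def normal_pvalue(population, composite_idx):
    """Two-sided probability from a finite-population normal approximation.

    Requires 0 < len(composite_idx) < len(population).
    """
    n_pop = len(population)
    n_comp = len(composite_idx)
    pop_mean = statistics.fmean(population)
    pop_var = statistics.pvariance(population)  # assumption: divisor N
    comp_mean = statistics.fmean([population[i] for i in composite_idx])
    # Variance of the mean of n values drawn without replacement
    # from a finite population of N values.
    var_mean = pop_var / n_comp * (n_pop - n_comp) / (n_pop - 1)
    z = (comp_mean - pop_mean) / math.sqrt(var_mean)
    return math.erfc(abs(z) / math.sqrt(2.0))


def resampling_pvalue(population, composite_idx, n_shuffles=99, seed=0):
    """Probability estimated by drawing random composites of the same size."""
    rng = random.Random(seed)
    n_comp = len(composite_idx)
    pop_mean = statistics.fmean(population)
    comp_mean = statistics.fmean([population[i] for i in composite_idx])
    observed = abs(comp_mean - pop_mean)
    hits = 0
    for _ in range(n_shuffles):
        surrogate = rng.sample(population, n_comp)  # without replacement
        if abs(statistics.fmean(surrogate) - pop_mean) >= observed:
            hits += 1
    # The observed composite counts as one realisation of the null case.
    return (hits + 1) / (n_shuffles + 1)


if __name__ == "__main__":
    # Hypothetical data: 30 years, the last 5 form the composite.
    population = [0.0] * 25 + [10.0] * 5
    composite = list(range(25, 30))
    print(normal_pvalue(population, composite))
    print(resampling_pvalue(population, composite, n_shuffles=99))
```

Under the estimator assumed here, 99 shuffles cannot return a probability below 1/100, which is one reason to raise the number of shuffles (for instance to 999) when small probabilities matter.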

If the NetCDF variable is three-dimensional, use comp_composite_3d instead of comp_composite_4d.

This procedure is parallelized if OpenMP is used and the NCSTAT software has been built with the _PARALLEL_READ CPP macro.

Further Details

Usage

$ comp_composite_4d \
  -f=input_netcdf_file \
  -v=netcdf_variable \
  -c=input_climatology_netcdf_file \
  -a=year1,year2,year3,...,yearn \
  -m=input_mesh_mask_netcdf_file    (optional) \
  -g=grid_type                      (optional : n, t, u, v, w, f) \
  -x=lon1,lon2                      (optional) \
  -y=lat1,lat2                      (optional) \
  -z=level1,level2                  (optional) \
  -t=time1,time2                    (optional) \
  -s=type_of_statistics             (optional : comp, diff) \
  -alg=algorithm                    (optional : normal, simul) \
  -o=output_composite_netcdf_file   (optional) \
  -nb=number_of_shuffles            (optional) \
  -mi=missing_value                 (optional) \
  -nostd                            (optional) \
  -double                           (optional) \
  -bigfile                          (optional) \
  -hdf5                             (optional) \
  -tlimited                         (optional)

By default

-m=
an input_mesh_mask_netcdf_file is not used
-g=
the grid_type is set to n which means that the 3-D grid-mesh associated with the input netcdf_variable is assumed to be regular or Gaussian
-x=
the whole longitude domain associated with the netcdf_variable
-y=
the whole latitude domain associated with the netcdf_variable
-z=
the whole vertical resolution associated with the netcdf_variable
-t=
the whole time period associated with the netcdf_variable
-s=
the type_of_statistics is set to comp
-alg=
the algorithm is set to normal
-nb=
number_of_shuffles is set to 99
-o=
the output_composite_netcdf_file is named composite_netcdf_variable.nc
-mi=
the missing_value attribute in the output NetCDF file is set to 1.e+20
-nostd
the composite fields of the input NetCDF variable are standardized. If -nostd is activated, the composite fields are not standardized
-double
the composite statistics are stored as single-precision floating point numbers in the output NetCDF file. If -double is activated, the results are stored as double-precision floating point numbers
-bigfile
a NetCDF classic format file is created. If -bigfile is activated, the output NetCDF file is a 64-bit offset format file
-hdf5
a NetCDF classic format file is created. If -hdf5 is activated, the output NetCDF file is a NetCDF-4/HDF5 format file
-tlimited
the time dimension is defined as unlimited in the output NetCDF file. However, if -tlimited is activated, the time dimension is defined as limited in the output NetCDF file

Remarks

  1. The -v=netcdf_variable argument specifies the NetCDF variable for which a composite analysis must be computed and the -f=input_netcdf_file argument specifies that this NetCDF variable must be extracted from the NetCDF file input_netcdf_file.

  2. The input_climatology_netcdf_file specified with the -c= argument must contain the means and standard deviations for the parent population. The periodicity for the composite analysis is also deduced from the number of observations in the input_climatology_netcdf_file. This input_climatology_netcdf_file must have been created by comp_clim_4d applied to the same netcdf_variable with an identical -t=time1,time2 argument as used in comp_composite_4d in order to obtain correct statistics in the output NetCDF file.

    The geographical shapes of the netcdf_variable (in the input_netcdf_file) and the climatology (in the input_climatology_netcdf_file) must agree.

  3. The -a= argument lists the indices of the years (or seasons, months or days, depending on the sampling of the observations in the input_netcdf_file) that must be included in the composite analysis (e.g. for computing the composite means). The indices of the years are counted from the start of the (selected) time period (i.e. from time1 in the -t=time1,time2 argument, or from 1 if this argument is missing).

    The list may be specified in different formats:

    • -a=n1,n2,…,nn selects years n1, n2, … and nn in the input_netcdf_file
    • -a=n1:n2 selects years n1 to n2 in the input_netcdf_file

    The two forms of the -a= argument may be combined and repeated any number of times, but duplicate years are not allowed.

    When specifying the indices of the years in the -a= argument, remember that the length of a year in the input_netcdf_file is determined by the periodicity of the observations, as deduced from the number of time observations in the input_climatology_netcdf_file. This periodicity also determines how many composite means and statistics are computed and stored in the output_composite_netcdf_file by comp_composite_4d.

  4. The optional argument -m=input_mesh_mask_netcdf_file specifies the land-sea mask to apply to netcdf_variable for transforming this four-dimensional NetCDF variable into a rectangular matrix of observed variables before computing the composite analysis. By default, it is assumed that each cell in the 3-D grid-mesh associated with the input four-dimensional NetCDF variable is a valid time series (i.e. missing values are not present).

    The geographical shapes of the netcdf_variable (in the input_netcdf_file), the climatology (in the input_climatology_netcdf_file) and the mask (if an input_mesh_mask_netcdf_file is used) must agree.

    Refer to comp_clim_4d or comp_mask_4d for creating a valid input_mesh_mask_netcdf_file for regular or Gaussian grids before using comp_composite_4d.

  5. If -g= is set to t, u, v, w or f, it is assumed that the input NetCDF variable is from an experiment with the ORCA model (R2, R4 or R05 resolutions). This argument is also used to determine the name of the mesh_mask variable if an input_mesh_mask_netcdf_file is used.

  6. If the -x=lon1,lon2, -y=lat1,lat2 and -z=level1,level2 arguments are missing, the whole geographical domain and vertical resolution associated with the netcdf_variable are used in the composite analysis.

    The longitude, latitude or level range must be a vector of two integers specifying the first and last selected indices along each dimension. The indices are relative to 1. Negative values are allowed for lon1. In this case the longitude domain is from nlon+lon1+1 to lon2 where nlon is the number of longitude points in the grid associated with the NetCDF variable and it is assumed that the grid is periodic.

    Refer to comp_mask_4d for transforming geographical coordinates into indices or generating an appropriate mesh-mask before using comp_composite_4d.

  7. If the -t=time1,time2 argument is missing, data in the whole time period associated with the netcdf_variable is taken into account. The selected time period is a vector of two integers specifying the first and last time observations. The indices are relative to 1.

    If -alg=simul, the length of the selected time period (i.e. time2 - time1 + 1) must be a whole multiple of the periodicity as deduced from the number of time observations in the input_climatology_netcdf_file (see remark 10 below).

    It is also assumed that the selected time period matches exactly the time period used to compute the climatology in the input_climatology_netcdf_file. Moreover, the first selected time observation (time1 if the -t= argument is present) must correspond to the first day, month or season of the climatology specified with the -c= argument.

  8. The -s=type_of_statistics argument specifies how the composite variable netcdf_variable_composite stored in the output_composite_netcdf_file is computed:

    • -s=comp means that the composites are computed as the standardized differences between the mean of the composite years and the overall mean (i.e. the mean of the parent finite population) for each grid-point.
    • -s=diff means that the composites are computed as the standardized differences between the mean of the composite years and the mean of the other years in the parent finite population for each grid-point.

    The standard deviations from the parent population are used for the standardization in both cases and are read from the input_climatology_netcdf_file. Finally, note that this argument does not affect the other variables stored in the output_composite_netcdf_file.

  9. The -nostd argument specifies that the composite variable netcdf_variable_composite in the output_composite_netcdf_file must not be standardized.

    By default, the composites are standardized in the output_composite_netcdf_file.

  10. The -alg= argument selects the method for computing critical probabilities associated with the composite means and U statistics (see the second reference cited below for the definition of this statistic):

    • If -alg=normal, a normal approximation is used to compute the critical probabilities associated with the composite means and U statistics.
    • If -alg=simul, the critical probabilities are estimated by a resampling method. In this case, the length of the selected time period (i.e. time2 - time1 + 1) must be a whole multiple of the periodicity.

  11. The -nb=number_of_shuffles argument specifies the number of shuffles for the resampling procedure if -alg=simul.

  12. The -mi=missing_value argument specifies the missing value indicator associated with the NetCDF variables in the output_composite_netcdf_file. If the -mi= argument is not specified, the missing_value attribute is set to 1.e+20 for the NetCDF variables in the output_composite_netcdf_file.

  13. The -double argument specifies that the results are stored as double-precision floating point numbers in the output NetCDF file. By default, the results are stored as single-precision floating point numbers in the output NetCDF file.

  14. The -bigfile argument is allowed only if the NCSTAT software has been compiled with the _USE_NETCDF36 or _USE_NETCDF4 CPP macros (e.g. -D_USE_NETCDF36 or -D_USE_NETCDF4) and linked to the NetCDF 3.6 library or higher. If this argument is specified, the output_composite_netcdf_file will be a 64-bit offset format file instead of a NetCDF classic format file.

  15. The -hdf5 argument is allowed only if the NCSTAT software has been compiled with the _USE_NETCDF4 CPP macro (e.g. -D_USE_NETCDF4) and linked to the NetCDF 4 library or higher. If this argument is specified, the output_composite_netcdf_file will be a NetCDF-4/HDF5 format file instead of a NetCDF classic format file.

  16. Duplicate parameters are allowed, but only the last occurrence of a parameter is used for the computations. Moreover, the number of specified parameters must not be greater than the total number of allowed parameters.

  17. It is assumed that the data has no missing values.

  18. For more details on composite analysis and statistical testing in composite analysis, see

    • “The Use of t values in Composite Analyses” by Brown, T.J., and Hall, B.L., Journal of Climate, vol. 12, 2941-2944, 1999. doi: 10.1175/1520-0442(1999)012<2941:TUOTVI>2.0.CO;2
    • “Sea Surface Temperature associations with the Late Indian Summer Monsoon”, by Terray, P., Delecluse P., Labattu S., Terray L., Climate Dynamics, vol. 21, 593-618, 2003. doi: 10.1007/s00382-003-0354-0
    • “Computer-intensive methods for testing hypotheses: an introduction”, by Noreen, E.W., Wiley and Sons, New York, USA, 1989. ISBN: 978-0-471-61136-3
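
The index conventions described in remarks 3, 6 and 7 can be sketched in Python as follows. This is only one reading of the text above, not code taken from NCSTAT: the helper names are hypothetical, the two -a= forms are assumed to be combined as comma-separated items of a single list, and composite year k is assumed to cover the k-th consecutive block of periodicity time observations counted from time1.

```python
def parse_year_list(spec):
    """Expand an -a= style list such as '1:3,7' into 1-based year indices."""
    years = []
    for item in spec.split(","):
        item = item.strip()
        if ":" in item:
            first, last = (int(part) for part in item.split(":"))
            if last < first:
                raise ValueError(f"empty range: {item}")
            chunk = range(first, last + 1)
        else:
            chunk = [int(item)]
        for year in chunk:
            if year in years:  # duplicate years are not allowed
                raise ValueError(f"duplicate year index: {year}")
            years.append(year)
    return years


def year_time_indices(year, periodicity, time1=1):
    """1-based time indices covered by one composite year (assumed layout)."""
    start = time1 + (year - 1) * periodicity
    return list(range(start, start + periodicity))


def is_whole_number_of_years(time1, time2, periodicity):
    """Check required with -alg=simul: the period is a multiple of the periodicity."""
    return (time2 - time1 + 1) % periodicity == 0


def longitude_indices(lon1, lon2, nlon):
    """1-based longitude indices for -x=lon1,lon2; a negative lon1 wraps around."""
    if lon1 < 0:
        start = nlon + lon1 + 1
        return list(range(start, nlon + 1)) + list(range(1, lon2 + 1))
    return list(range(lon1, lon2 + 1))


if __name__ == "__main__":
    print(parse_year_list("10,11,20,55,100"))
    print(year_time_indices(10, periodicity=12))
    print(longitude_indices(-10, 20, nlon=180))
```

With monthly data (a periodicity of 12) and no -t= argument, year 10 would, under these assumptions, correspond to time observations 109 to 120.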

Outputs

comp_composite_4d creates an output NetCDF file that contains the composite statistics and the critical probabilities associated with the composite means, taking into account, where applicable, the periodicity of the data as determined by the -c=input_climatology_netcdf_file argument. The output NetCDF dataset contains the following NetCDF variables. In the descriptions below, nlev, nlat and nlon are the lengths of the vertical and spatial dimensions of the input NetCDF variable, and periodicity is the number of time observations in the input_climatology_netcdf_file:

  1. netcdf_variable_compmean(periodicity,nlev,nlat,nlon) : the mean fields of the time observations selected with the help of the -a= argument, taking into account the periodicity of the data as determined by the number of observations in the input_climatology_netcdf_file (specified with the help of the -c= argument).

  2. netcdf_variable_compstd(periodicity,nlev,nlat,nlon) : the standard-deviation fields of the time observations selected with the help of the -a= argument, taking into account the periodicity of the data as determined by the -c=input_climatology_netcdf_file argument.

  3. netcdf_variable_qstd(periodicity,nlev,nlat,nlon) : the ratio of the standard-deviation fields of the selected time observations to the standard-deviation fields of all the time observations, taking into account the periodicity of the data as determined by the -c=input_climatology_netcdf_file argument. The standard-deviation fields for all the observations are extracted from the input_climatology_netcdf_file specified in the -c= argument.

  4. netcdf_variable_composite(periodicity,nlev,nlat,nlon) : the composite fields of the time observations selected with the help of the -a= argument, taking into account the periodicity of the data as determined by the number of observations in the input_climatology_netcdf_file (specified with the help of the -c= argument).

    How these composite fields are calculated is determined by the -s= and -nostd arguments.

  5. netcdf_variable_u(periodicity,nlev,nlat,nlon) : the U statistics for the time observations selected with the help of the -a= argument, taking into account the periodicity of the data as determined by the number of observations in the input_climatology_netcdf_file (specified with the help of the -c= argument).

    See the second publication cited above for more details on the definition of the U statistic.

  6. netcdf_variable_prob(periodicity,nlev,nlat,nlon) : the critical probabilities associated with the U statistics.

    These critical probabilities are computed under the null hypothesis that the selected time observations (with the help of the -a= argument) come from the same population as the other observations in the input_netcdf_file. Small probabilities indicate a large departure from the null hypothesis.

    The -alg=algorithm argument determines how these critical probabilities are computed.

  7. netcdf_variable_compnobs(periodicity) : the number of observations used to compute the composite means.

All these statistics, except netcdf_variable_compnobs, are packed in four-dimensional variables whose first, second and third dimensions are exactly the same as those associated with the input NetCDF variable netcdf_variable, even if you restrict the geographical domain with the -x=, -y= and -z= arguments. However, outside the selected domain, the output NetCDF variables are filled with missing values.
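
To make the effect of the -s= and -nostd arguments on netcdf_variable_composite more concrete, here is a minimal Python sketch for one grid cell and one step of the periodic cycle. It is an illustration under stated assumptions rather than the NCSTAT code: the function names are hypothetical, the non-standardized composite is assumed to be the plain difference of means, and the composite standard deviation used for the qstd ratio is assumed to use the divisor n - 1.

```python
import statistics


def composite_value(series, composite_years, pop_mean, pop_std,
                    stat="comp", standardize=True):
    """Composite for one grid cell.

    series          : one value per year for this cell and cycle step
    composite_years : 1-based indices of the selected years
    pop_mean/pop_std: climatological mean and standard deviation
    """
    chosen = set(composite_years)
    selected = [series[year - 1] for year in composite_years]
    comp_mean = statistics.fmean(selected)
    if stat == "comp":
        # Departure from the mean of the parent population.
        reference = pop_mean
    elif stat == "diff":
        # Departure from the mean of the years left out of the composite.
        others = [v for year, v in enumerate(series, start=1)
                  if year not in chosen]
        reference = statistics.fmean(others)
    else:
        raise ValueError(f"unknown type of statistics: {stat}")
    difference = comp_mean - reference
    # Assumption: -nostd simply skips the division by pop_std.
    return difference / pop_std if standardize else difference


def std_ratio(series, composite_years, pop_std):
    """Ratio of the composite standard deviation to the population one."""
    selected = [series[year - 1] for year in composite_years]
    comp_std = statistics.stdev(selected)  # assumption: divisor n - 1
    return comp_std / pop_std


if __name__ == "__main__":
    series = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]  # hypothetical yearly values
    pop_mean = statistics.fmean(series)
    pop_std = statistics.pstdev(series)
    print(composite_value(series, [5, 6], pop_mean, pop_std, stat="comp"))
    print(composite_value(series, [5, 6], pop_mean, pop_std, stat="diff"))
    print(std_ratio(series, [5, 6], pop_std))
```

In both cases the sketch standardizes with the parent-population standard deviation, as stated in remark 8.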

Example

  1. To compute a composite analysis from the four-dimensional NetCDF variable votemper in the NetCDF file ST7_1m_00101_20012_votemper_grid_T.nc, assess the significance of the results with a Monte Carlo simulation using 999 shuffles and store the results in a NetCDF file named composite_ST7_1m_votemper_grid_T.nc, use the following command (the critical probabilities associated with the U statistics are estimated by a resampling method using 999 surrogate time series, since -alg=simul and -nb=999 are specified):

    $ comp_composite_4d \
      -f=ST7_1m_00101_20012_T_votemper_grid_T.nc \
      -v=votemper \
      -c=clim_votemper_grid_T.nc \
      -a=10,11,20,55,100 \
      -g=t \
      -m=mesh_mask_ST7_votemper_grid_T.nc \
      -alg=simul \
      -nb=999 \
      -o=composite_ST7_1m_votemper_grid_T.nc
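
When several composites have to be produced from a script, the same command line can be assembled programmatically. The Python sketch below rebuilds the command shown above from a dictionary of options. The build_command helper is hypothetical, only arguments documented on this page are used, and the final call that would launch the program is left commented out because it requires an installed NCSTAT and the input files.

```python
import shlex


def build_command(options, flags=()):
    """Return the argument vector for comp_composite_4d.

    options : mapping of argument name to value, e.g. {"alg": "simul"}
    flags   : names of switches without a value, e.g. ("double",)
    """
    argv = ["comp_composite_4d"]
    argv += [f"-{name}={value}" for name, value in options.items()]
    argv += [f"-{flag}" for flag in flags]
    return argv


options = {
    "f": "ST7_1m_00101_20012_T_votemper_grid_T.nc",
    "v": "votemper",
    "c": "clim_votemper_grid_T.nc",
    "a": "10,11,20,55,100",
    "g": "t",
    "m": "mesh_mask_ST7_votemper_grid_T.nc",
    "alg": "simul",
    "nb": 999,
    "o": "composite_ST7_1m_votemper_grid_T.nc",
}
argv = build_command(options)
print(shlex.join(argv))

# To actually run the analysis (NCSTAT must be installed and on the PATH):
# import subprocess
# subprocess.run(argv, check=True)
```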