comp_stat_3d

Authors

Pascal Terray (LOCEAN/IPSL)

Latest revision

29/05/2024

Purpose

Compute univariate statistics from a tridimensional variable extracted from a NetCDF dataset and, optionally, the associated mesh-mask and scale factors of the 2-D grid-mesh associated with the input NetCDF variable.

Mean, variance, standard-deviation, skewness, kurtosis, minimum and maximum are computed for each point in the multi-channel time series of the 2-D grid-mesh associated with the input NetCDF variable. These univariate statistics may be computed by taking into account the periodicity of the data. These statistics are stored in an output NetCDF dataset.

The mean is a simple, but informative, measure, of the central tendency of a variable [vonStorch_Zwiers]. The standard-deviation and variance are conventional measures of variation of a variable [vonStorch_Zwiers]. If X(:) is a vector of ntime observations for one grid-point in the multi-channel time series of the 2-D grid-mesh, the MEAN, VAR (e.g., variance) and STD (e.g., standard-deviation) statistics for this grid-point are estimated by:

  • MEAN = sum( X(:) )/ntime
  • VAR = sum( [X(:)-MEAN]**2 )/(ntime-1)
  • STD = sqrt(VAR)

Note that the divisor used in calculating variance and standard-deviation is the number of degrees of freedom (e.g., the number of observations minus 1), this is in contrast with the formulae used in comp_clim_3d.

Skewness measures the deviation of the distribution of a variable from symmetry [vonStorch_Zwiers] [Burgers_Stephenson] [White] [Masson_etal]. For a symmetrical distribution, the skewness coefficient is always equal to 0, but the converse is not true. Skewness is 0 for a normal distribution. For unimodal distributions shifted to the right (left), the skewness coefficient is positive (negative).

If the argument -nobias is specified, the skewness coefficient is estimated as

SKEWNESS = (ntime.M3)/[(ntime-1).(ntime-2).STD3]

, otherwise the following (biased) classical formulae is used

SKEWNESS = M3/(ntime.STD3)

where

  • M3 is equal to the sum of the deviations of the observations from the mean raised to the third power (e.g., sum( [X(:)-MEAN]**3) where X(:) is the vector of observations)
  • STD3 is the standard deviation raised to the third power

In order to interpret correctly the skewness of a variable, note that the Standard Error (SE) of the skewness coefficient calculated from a sample drawn from a Gaussian distribution is given by

SE[SKEWNESS] = sqrt( [6.ntime.(ntime-1)]/[(ntime-2).(ntime+1).(ntime+3)] )

SE[SKEWNESS] is not very different from the quantity sqrt( [6/ntime] ) when the number of observations is sufficiently high.

Moreover, the quantity SKEWNESS/SE[SKEWNESS] follows asymptotically a normal (e.g., Gaussian) distribution with mean 0 and variance equal to 1 when the sample X(:) were independent Gaussian observations. With a sample of independent Gaussian observations, a value twice the standard error is thus associated with a 5% significance level. However, in climate analysis the observations are in general autocorrelated.

Kurtosis measures the flatness or peakedness of the distribution of a variable [vonStorch_Zwiers] [Burgers_Stephenson] [White] . As computed by comp_stat_3d, the kurtosis coefficient is always greater or equal to -2 and is equal to 0 for a normal distribution. In most cases, if the kurtosis is greater (lower) than 0 then the distribution is more peaked (flatter) than the normal distribution with the same mean and standard-deviation.

If the argument -nobias is specified, the kurtosis coefficient is estimated as

KURTOSIS = A - 3.[(ntime-1).(ntime-1)]/[(ntime-2).(ntime-3)]

, otherwise the following (biased) classical formulae is used

KURTOSIS = M4/(ntime.STD4) - 3

where

  • M4 is equal to the sum of the deviations of the observations from the mean raised to the fourth power (e.g., sum( [X(:)-MEAN]**4) where X(:) is the vector of observations)
  • STD4 is the standard deviation raised to the fourth power
  • A is equal to [ntime.(ntime+1).M4]/[(ntime-1).(ntime-2).(ntime-3).STD4]

In order to interpret correctly the kurtosis of a variable, note that the SE of the kurtosis coefficient calculated from a sample drawn from a Gaussian distribution is given by

SE[KURTOSIS] = sqrt( B/C )

where

  • B is equal to 24.ntime.(ntime-1).(ntime-1)
  • C is equal to (ntime-3).(ntime-2).(ntime+3).(ntime+5)

and the quantity KURTOSIS/SE[KURTOSIS] follows also asymptotically a normal distribution with mean 0 and variance equal to 1 when the sample X(:) were independent Gaussian observations.

Extreme departures from the mean will cause very high values of kurtosis. Consequently, the kurtosis coefficient can be used to detect outliers. High values of kurtosis can also be a result of one or two extreme observations in a sample of observations.

The standard errors of the skewness and kurtosis coefficients are calculated and stored in the output NetCDF file if the argument -stderror is specified.

Moreover, the two-tailed significance levels of the statistics SKEWNESS/SE[SKEWNESS] and KURTOSIS/SE[KURTOSIS] (under the hypothesis that the sample X(:) were independent Gaussian observations) are calculated and stored in the output NetCDF file if the argument -prob is specified.

Thus, if you are interested in how well the distribution of the time series in the 2-D grid-mesh associated with the input NetCDF variable can be approximated by the normal distribution, the skewness and kurtosis coefficients, their standard errors and significance levels, can give you some useful information on this issue.

If your data contains missing values, use comp_stat_miss_3d instead of comp_stat_3d to estimate univariate statistics from your gappy dataset.

If the NetCDF variable is fourdimensional use comp_stat_4d instead of comp_stat_3d.

This procedure is parallelized if OpenMP is used and the NCSTAT software has been built with the _PARALLEL_READ CPP macro. However, in this version of comp_stat_3d, parellelism is on the number of periods as determined by the -p=periodicity argument. This means that the number of processors or threads must not be greater than the periodicity parameter.

Moreover, this procedure computes all the univariate statistics with only one pass through the data and an out-of-core strategy which is highly efficient on huge datasets.

Optionally, a mesh-mask NetCDF dataset may also be created. This dataset will contain a presence-absence 2-D mask and scale factor variables, which may be used to compute the surface of each cell in the 2-D grid-mesh associated with the input NetCDF variable. This mesh-mask NetCDF dataset will be used by other NCSTAT procedures such as comp_serie_3d, comp_eof_3d, etc. When computing the scale factors, it is assumed that the 2-D grid-mesh associated with the input NetCDF variable is regular or that this grid is a regular gaussian grid (the grid of the input NetCDF variable is assumed to be gaussian if this NetCDF variable has a grid_type character attribute with the string “gaussian” as a value).

Further Details

Usage

$ comp_stat_3d \
  -f=input_netcdf_file \
  -v=netcdf_variable \
  -p=periodicity                    (optional) \
  -x=lon1,lon2                      (optional) \
  -y=lat1,lat2                      (optional) \
  -t=time1,time2                    (optional) \
  -o=output_statistics_netcdf_file  (optional) \
  -m=output_mesh_mask_netcdf_file   (optional) \
  -yl=latl1,latl2                   (optional) \
  -mi=missing_value                 (optional) \
  -nobias                           (optional) \
  -stderror                         (optional) \
  -prob                             (optional) \
  -double                           (optional) \
  -bigfile                          (optional) \
  -hdf5                             (optional) \
  -tlimited                         (optional)

By default

-p=
the periodicity is equal to 1
-x=
the whole longitude domain associated with the netcdf_variable
-y=
the whole latitude domain associated with the netcdf_variable
-t=
the whole time period associated with the netcdf_variable
-o=
the output_statistics_netcdf_file is named stat_netcdf_variable.nc
-m=
the output_mesh_mask_netcdf_file is not created
-yl=
it is assumed that the grid is gaussian and the domain is the whole globe or that the grid is regular when computing the scale factors
-mi=
the missing_value for the statistics variables in the output NetCDF file is set to 1.e+20
-nobias
biased estimates of kurtosis and skewness coefficients are computed. However, if -nobias is activated, unbiased estimates of kurtosis and skewness are computed
-stderror
the standard errors of the kurtosis and skewness coefficients are not computed. However, if -stderror is activated, these standard errors are computed
-prob
the significance levels of the kurtosis and skewness coefficients are not computed. However, if -prob is activated, these significance levels are computed
-double
the statistic variables are stored as single-precision floating point numbers in the output NetCDF file. If -double is activated, the statistic variables are stored as double-precision floating point numbers
-bigfile
a NetCDF classical format file is created. If -bigfile is activated, the output NetCDF file is a 64-bit offset format file
-hdf5
a NetCDF classical format file is created. If -hdf5 is activated, the output NetCDF file is a NetCDF-4/HDF5 format file
-tlimited
the time dimension is defined as unlimited in the output NetCDF file. However, if -tlimited is activated, the time dimension is defined as limited in the output NetCDF file

Remarks

  1. The -v=netcdf_variable argument specifies the NetCDF variable for which univariate statistics must be computed and the -f=input_netcdf_file argument specifies that this NetCDF variable must be extracted from the NetCDF file, input_netcdf_file.

  2. The -p=periodicity argument gives the periodicity of the input data. For example, with monthly data -p=12 should be specified, with yearly data -p=1 may be used, etc. By default, the periodicity is set to 1. Note that the output NetCDF file will have periodicity time observations.

  3. If the -x=lon1,lon2 and -y=lat1,lat2 arguments are missing, statistics are computed for all the points of the 2-D grid-mesh associated with the netcdf_variable.

    The longitude or latitude range must be a vector of two integers specifying the first and last selected indices along each dimension. The indices are relative to 1. Negative values for lon1 are not allowed.

    Refer to comp_mask_3d for transforming geographical coordinates as indices before using comp_stat_3d.

  4. If the -t=time1,time2 argument is missing, the whole time period associated with the netcdf_variable is used to compute the statistics.

    The selected time period is a vector of two integers specifying the first and last time observations. The indices are relative to 1.

  5. It is assumed that the data has no missing values (excepted missing values associated with a constant land-sea mask as indicated by a missing_value or _FillValue attribute).

    If it is the case, use comp_stat_miss_3d instead of comp_stat_3d.

  6. If the -m=output_mesh_mask_netcdf_file argument is present and the -yl= argument is missing, it is assumed when computing the scale factors that the 2-D grid-mesh associated with the input NetCDF variable is gaussian and covers the whole globe or that this 2-D grid-mesh is regular.

    If the grid is a regular gaussian grid, but the domain does not cover the whole globe, the -yl= argument must be used to specify the latitude boundaries of the domain, otherwise the first and last columns (elements) of the first two scale factors are wrong.

    The -yl= argument specifies the latitude limits of the domain in degrees (latl1 and latl2 must be real numbers) and the ordering of these two values must be coherent with the ordering of the latitudinal coordinate variable in the input_netcdf_file file.

  7. If the -m=output_mesh_mask_netcdf_file argument is present and if some scale factors can not be computed, these scale factors are set to 1.

  8. The -mi=missing_value argument specifies the missing value indicator for the variance (VAR), standard-deviation (STD), skewness (SKEW) and kurtosis (KURT) variables in the output_statistics_netcdf_file.

    If the -mi= argument is not specified and the NetCDF variable has a missing_value or _FillValue attribute, the missing_value is set to 1.e+20. This argument is not used if the NetCDF variable specified in the -v= argument has no missing_value or _FillValue attribute (excepted for the skewness and kurtosis statistics).

  9. If -nobias is specified, unbiased estimates of skewness and kurtosis are computed.

    If the -nobias argument is absent, the biased standard estimates are computed.

  10. If -stderror is specified, the standard errors of skewness and kurtosis are computed.

    If the -stderror argument is absent, the standard errors are not computed.

  11. If -prob is specified, the significance levels of skewness and kurtosis are computed. Moreover, the -prob argument implies also the -stderror argument, even if this argument is not activated.

    If the -prob argument is absent, the significance levels are not computed.

  12. At least 4 observations by period, as determined from the -p=periodicity argument, are required, otherwise the program will stop.

  13. The -double argument specifies that the VAR, STD, SKEW and KURT variables must be stored as double-precision floating point numbers instead of single-precision floating point numbers in the output_statistics_netcdf_file.

  14. The -bigfile argument is allowed only if the NCSTAT software has been compiled with the _USE_NETCDF36 or _USE_NETCDF4 CPP macros (e.g., -D_USE_NETCDF36 or -D_USE_NETCDF4) and linked to the NetCDF 3.6 library or higher.

    If this argument is specified, the output_netcdf_file will be a 64-bit offset format file instead of a NetCDF classic format file. However, this argument is recognized in the procedure only if the NCSTAT software has been built with the _USE_NETCDF36 or _USE_NETCDF4 CPP macros.

  15. The -hdf5 argument is allowed only if the NCSTAT software has been compiled with the _USE_NETCDF4 CPP macro (e.g., -D_USE_NETCDF4) and linked to the NetCDF 4 library or higher.

    If this argument is specified, the output_netcdf_file will be a NetCDF-4/HDF5 format file instead of a NetCDF classic format file. However, this argument is recognized in the procedure only if the NCSTAT software has been built with the _USE_NETCDF4 CPP macro.

  16. Duplicate parameters are allowed, but this is always the last occurrence of a parameter which will be used for the computations. Moreover, the number of specified parameters must not be greater than the total number of allowed parameters.

  17. For more details and examples on the use of skewness and kurtosis coefficients in the climate literature see

    • “Skewness, Kurtosis and Extreme Values of Northern Hemisphere Geopotential Heights”, by White, G., Monthly Weather Review, Vol. 108, 1446-1455, 1980. doi: 10.1175/1520-0493(1980)108<1446:SKAEVO>2.0.CO;2
    • “The Normality of El Nino”, by Burgers, G., and Stephenson, D.B , Geophysical Research Letters, 26, 1027-1030, 1999. doi: 10.1029/1999GL900161
    • “Impact of intra-daily SST variability on ENSO characteristics in a coupled model” by Masson, S., et al., Climate Dynamics, Vol. 39, 681-707, 2012. doi: 10.1007/s00382-011-1247-2
    • “Statistical Analysis in Climate Research”, by von Storch, H., and Zwiers, F.W., Cambridge University press, Cambridge, UK, Chapter 2, 484 pp., 2002. ISBN: 9780521012300

Outputs

comp_stat_3d creates an output NetCDF file that contains the univariate statistics and number of observations of the input NetCDF variable, taking into account eventually the periodicity of the data as determined by the -p=periodicity argument. The output NetCDF dataset contains the following NetCDF variables (in the description below, nlat and nlon are the lengths of the spatial dimensions of the input NetCDF variable) :

  1. netcdf_variable_mean(periodicity,nlat,nlon) : the means for each point in the time series of the 2-D grid-mesh associated with the input NetCDF variable.

  2. netcdf_variable_var(periodicity,nlat,nlon) : the variances for each point in the time series of the 2-D grid-mesh associated with the input NetCDF variable.

  3. netcdf_variable_std(periodicity,nlat,nlon) : the standard-deviations for each point in the time series of the 2-D grid-mesh associated with the input NetCDF variable.

  4. netcdf_variable_skew(periodicity,nlat,nlon) : the skewness for each point in the time series of the 2-D grid-mesh associated with the input NetCDF variable.

  5. netcdf_variable_kurt(periodicity,nlat,nlon) : the kurtosis for each point in the time series of the 2-D grid-mesh associated with the input NetCDF variable.

  6. netcdf_variable_min(periodicity,nlat,nlon) : the minima for each point in the time series of the 2-D grid-mesh associated with the input NetCDF variable.

  7. netcdf_variable_max(periodicity,nlat,nlon) : the maxima for each point in the time series of the 2-D grid-mesh associated with the input NetCDF variable.

  8. netcdf_variable_skew_prob(periodicity,nlat,nlon) : the significance levels of the skewness coefficients.

    This variable is stored only if the -prob argument has been specified when calling comp_stat_3d.

  9. netcdf_variable_kurt_prob(periodicity,nlat,nlon) : the significance levels of the kurtosis coefficients.

    This variable is stored only if the -prob argument has been specified when calling comp_stat_3d.

  10. netcdf_variable_nobs(periodicity) : the number of observations used to compute the statistics.

  11. netcdf_variable_skew_se(periodicity) : the standard-errors of the skewness coefficients.

    This variable is stored only if the -stderror or -prob arguments have been specified when calling comp_stat_3d.

  12. netcdf_variable_kurt_se(periodicity) : the standard-errors of the kurtosis coefficients.

    This variable is stored only if the -stderror or -prob arguments have been specified when calling comp_stat_3d.

All the statistics and associated probabilities, excepted the number of observations and the standard-errors of the skewness and kurtosis coefficients are packed in tridimensional variables whose first and second dimensions are exactly the same as those associated with the input NetCDF variable netcdf_variable even if you restrict the geographical domain with the -x= and -y= arguments. However, outside the selected domain, these output NetCDF variables are filled with missing values.

Optionally, comp_stat_3d can also create an output mesh-mask NetCDF file that contains the following NetCDF variables :

  1. netcdf_variable_nmask(nlat,nlon) : a presence-absence or land-sea 2-D mask associated with the input NetCDF variable.
  2. netcdf_variable_e1n(nlat,nlon) : the first scale factor associated with the 2-D grid-mesh of the input NetCDF variable.
  3. netcdf_variable_e2n(nlat,nlon) : the second scale factor associated with the 2-D grid-mesh of the input NetCDF variable.

Multiplying the two scale factors together gives the surface of each cell in the 2-D grid-mesh associated with the input NetCDF variable.

Examples

  1. For computing monthly statistics from the NetCDF file ST7_1m_00101_20012_grid_T_sosstsst.nc, which includes a NetCDF variable sosstsst, and store the results in a NetCDF file named stat_ST7_1m_00101_20012_grid_T_sosstsst.nc, use the following command :

    $ comp_stat_3d \
      -f=ST7_1m_00101_20012_grid_T_sosstsst.nc \
      -v=sosstsst \
      -p=12 \
      -o=stat_ST7_1m_00101_20012_grid_T_sosstsst.nc
    
  2. For computing monthly unbiased univariate statistics from the NetCDF file sst.mnmean.nc, which includes a NetCDF variable sst`,` and store the results in a NetCDF file named ``stat_sst.nc and, in addition, generate an associated mesh_mask_netcdf_file named mesh_mask_sst.nc, use the following command :

    $ comp_stat_3d \
      -f=sst.mnmean.nc \
      -v=sst \
      -p=12 \
      -nobias \
      -m=mesh_mask_sst.nc
    
Flag Counter