Synopsis
Compute statistics for images and columns in tables.
Syntax
dmstat infile [centroid] [median] [sigma] [clip] [nsigma] [maxiter] [out_columns] [out_min] [out_min_loc] [out_max] [out_max_loc] [out_mean] [out_median] [out_sigma] [out_sum] [out_good] [out_null] [out_cnvrgd] [out_cntrd_log] [out_cntrd_phys] [out_sigma_cntrd] [verbose]
Description
The dmstat tool calculates statistics of both columns in tables and two-dimensional images. The calculated values are printed on the screen and also stored in the parameter file, which allows easy access to the results from scripts.
Tables
When the input file is a table, the default option is to compute the mean, standard deviation, total, minimum, maximum, and the number of good and null rows for each of the columns in the table. Additional statistics include the median and the ability to perform an iterative, sigma-clipping algorithm to filter the data.
unix% dmstat "evt2.fits[cols x]" x[pixel] min: 2479.8139648 @: 102340 max: 6018.9042969 @: 126933 mean: 4059.8933467 sigma: 869.49332738 sum: 765618747.21 good: 188581 null: 0
Here we use dmstat to calculate a number of statistics on the "x" column in the table "evt2.fits". If we had not used a data model (DM) filter to select the x column (see "ahelp dm") then the statistics for each column in the table would have been calculated and printed to the screen. The values can also be recovered from the parameter file:
unix% pget dmstat out_mean 4059.8933467 unix% set min = `pget dmstat out_min`
The screen output lists the minimum, maximum, mean, and standard deviation (here called 'sigma') values of the column. The "sum" field refers to the sum of the values in the column, and the "good" and "null" values refer to the number of valid and non-valid rows respectively (the handling of Null/NaN values is discussed below). The sum of the good and null columns gives the number of rows in the table (which equates to the number of events in an event file). Also listed, on the min and max lines after the "@" symbol, are the row numbers which correspond to the location of that value (or the first occurrence if there is more than one match).
The mean, sigma, and sum values are calculated using:
sum = sumof( x_i )
mean = sum / good
sigma = sqrt( sumof( (x_i - mean)^2 ) / good )
where x_i refers to the value of a single row. Note the use of good rather than (good-1) in the calculation of sigma.
Images
Two calculation modes are available for images: the centroid of the distribution (the default) or the mean level. Minimum and maximum values, the total signal, and the number of good and null values are also reported.
The output from the two modes is shown below for a 2D image:
unix% dmstat img.fits centroid=yes EVENTS_IMAGE(x, y) min: 0 @: ( 3722 3038 ) max: 19 @: ( 4205 3755 ) cntrd[log] : ( 591.44825223 648.85355266 ) cntrd[phys]: ( 4312.4482522 3685.8535527 ) sigma_cntrd: ( 49.11625588 52.963861712 ) good: 1523984 null: 0 unix% pget dmstat out_cntrd_phys 4312.4482522,3685.8535527
and
unix% dmstat img.fits centroid=no EVENTS_IMAGE min: 0 @: ( 3722 3038 ) max: 19 @: ( 4205 3755 ) mean: 0.0057441547943 sigma: 0.084794915133 sum: 8754 good: 1523984 null: 0 unix% pget dmstat out_sum 8754
In both modes, the value and location - in physical coordinates - of the minimum and maximum pixel values is reported. Since there may be multiple pixels with either the minimum or maximum value, only the location of the first such pixel is returned. The good and null items refer to the number of pixels that are valid and invalid (i.e. pixels whose value equals the Null/NaN value) respectively.
For the centroid=yes case, the location of the centroid of the pixel distribution is reported in both logical and physical coordinate systems. For a set of N valid pixels with logical coordinates (x_i,y_j) and values f(x_i,y_j), the centroid (c_x,c_y) is calculated in the logical coordinate system via
psum = sumof( f(x_i,y_j) )
c_x = sumof( x_i * f(x_i,y_j) ) / psum
c_y = sumof( y_j * f(x_i,y_j) ) / psum
where the sums are over both i and j, and ignore those pixels equal to the Null/Nan value. The position in the physical coordinate system is then found by transforming (c_x,c_y). The "sigma centroid" values (sc_x,sc_y) are calculated using
sc_x = sqrt( (sumof ( abs(x_i) * (f(x_i,y_j)-mean)^2 )) / ngood)
sc_y = sqrt( (sumof ( abs(y_i) * (f(x_i,y_j)-mean)^2 )) / ngood)
As the centroid is calculated using all the valid pixels in the input image, the reported value will be affected by the presence of sources in the image other than the object of interest. It is therefore suggested that the image be filtered (see "ahelp dmfiltering") to restrict the area to just the interesting source. Note that the centroid calculation is not well-defined if the image contains negative pixels.
For the centroid=no case, the mean, sigma, and sum values are calculated as:
sum = sumof( f(x_i,y_j) )
mean = sum / good
sigma = sqrt( sumof( (f(x_i,y_j) - mean)^2 ) / good )
Note the use of good rather than (good-1) in the calculation of sigma.
Sigma-clipping algorithm
An iterative, sigma-clipping algorithm can be used to reduce the effect of outliers on the computed statistics. Options exist to choose which statistics to compute if the execution time is a factor. If clip=yes with a table, then the algorithm works on each selected columns individually; for images, the setting is currently ignored if centroid=yes.
The sigma-clipping algorithm calculates the mean and standard deviation of the data, then removes all points which are more than nsigma times the standard deviation away from the mean. The procedure is repeated until either no more points are rejected (cnvrgd is set to Y) or the maximum number of iterations (as specified by the maxiter parameter) is reached (cnvrgd is set to N). The null value reports the total number of rejected elements, i.e. both those rejected by the clipping algorithm and those whose values are Null (or NaN).
For example:
unix% dmstat "img.fits[sky=circle(4237,3694,40)]" centroid=no EVENTS_IMAGE min: 0 @: ( 4237 3654 ) max: 4 @: ( 4242 3690 ) mean: 0.13990049751 sigma: 0.38752947575 sum: 703 good: 5025 null: 1536
and
unix% dmstat "img.fits[sky=circle(4237,3694,40)]" centroid=no clip=yes EVENTS_IMAGE min: 0 @: ( 4237 3654 ) max: 1 @: ( 4245 3655 ) mean: 0.11377849506 sigma: 0.31754204307 sum: 564 good: 4957 null: 1604 cnvrgd: Y unix% pget dmstat out_cnvrgd Y
Since a Y is reported for cnvrgd, the number of rejected points (1604 - 1536 = 68) converged before the maximum number of iterations was reached. Note that the maximum pixel - and hence its location - has also changed.
Examples
Example 1
dmstat "intable.fits[cols energy]" set mean = `pget dmstat out_mean`
Computes the standard set of statistics for the ENERGY column of intable.fits. The second line sets the variable "mean" to the mean value calculated for the column (using csh/tcsh syntax).
Example 2
dmstat "intable.fits[cols time,pha,pi]" median+
Computes the statistics for the TIME, PHA, and PI columns in the file intable.fits. The median value of each column will also be calculated and reported.
Example 3
dmstat lc.txt"[cols counts]"
Run dmstat on the "counts" column of a DTF text format file; see "ahelp dmascii" for more information on working with ASCII files.
Example 4
dmstat "intable.fits[sky=circle(4096,4096,20),pha=:100][cols time]"
You can use the on-the-fly filtering capabilities of the Data Model (see "ahelp dm") to determine exactly what section of the data you want dmstat to consider. This example applies two filters to table:
- a spatial filter on the sky column
- restricts rows to those with a pha value less than 100
and then calculates the statistics of the time column of the remaining rows.
Example 5
dmstat intable.fits
Computes the statistics for all columns in the first interesting block of intable.fits.
Example 6
dmstat intable.fits median=no sigma=no
Computes the statistics, except for the median and standard deviation, for all columns in the first interesting block of intable.fits.
Example 7
dmstat inimage.fits
Computes the statistics for an image, reporting the position of the centroid of the distribution.
Example 8
dmstat inimage.fits centroid=no
Computes the mean level in the image with associated statistics.
Example 9
dmstat "inimage.fits[exclude sky=circle(500,500,20)]" centroid=no
Computes the statistics for an image after applying a spatial filter which excludes all pixels that lie within a circle of radius 20 with center (500,500) in sky coordinates. Those pixels excluded by the spatial filter are ignored in the calculation, only contributing to the number of excluded pixels listed in the "null" column.
See "ahelp dmfiltering" for more information on filtering images.
Example 10
dmstat inimage.fits centroid=no clip=yes nsigma=2.5 maxiter=10
Computes the mean level in an image, with associated statistics, after removing all points which are more than 2.5 standard deviations away from the mean. This is repeated until either no more values are rejected or 10 iterations have been performed. If centroid=yes, then no clipping of the data would have occurred.
Example 11
dmstat "inimage.fits[sky=region(src.reg)]" centroid=no clip=yes nsigma=2.5 maxiter=10
As with the previous example, but this time the calculation is restricted to only those pixels that lie within the region defined by the file "src.reg".
Example 12
dmstat intable.fits sigma- > /dev/null set cols = `pget dmstat out_columns` set means = `pget dmstat out_mean` @ n = 1 @ nmax = `stk_count $cols echo+` while ( $n <= $nmax ) set col = `stk_read_num $cols $n echo+` set mean = `stk_read_num $means $n echo+` printf "%-10s mean= %g\n" $col $mean @ n++ end
Here we calculate the statistics for all the columns in intable.fits and then use the stack tools ("ahelp stack" and "ahelp tools /stk_/") to find the name and mean value for all columns in the file. This example uses csh/tcsh syntax.
Example 13
dmstat "*.fits" centroid=no pget dmstat out_mean
Here we run dmstat on all files in the current directory that end in ".fits". Although the statistics for each entry in the stack will be written out to the screen, only those for the last file processed will be written to the parameter file.
Parameters
name | type | ftype | def | min | reqd | stacks |
---|---|---|---|---|---|---|
infile | file | input | yes | yes | ||
centroid | boolean | yes | ||||
median | boolean | no | ||||
sigma | boolean | yes | ||||
clip | boolean | no | ||||
nsigma | real | 3 | ||||
maxiter | integer | 20 | 1 | |||
out_columns | string | |||||
out_min | string | |||||
out_min_loc | string | |||||
out_max | string | |||||
out_max_loc | string | |||||
out_mean | string | |||||
out_median | string | |||||
out_sigma | string | |||||
out_sum | string | |||||
out_good | string | |||||
out_null | string | |||||
out_cnvrgd | string | |||||
out_cntrd_log | string | |||||
out_cntrd_phys | string | |||||
out_sigma_cntrd | string | |||||
verbose | integer | 1 | 0 |
Detailed Parameter Descriptions
Parameter=infile (file required filetype=input stacks=yes)
Input stack of files
The event file(s) or two-dimenstional image(s) on which statistics will be computed.
Parameter=centroid (boolean default=yes)
Calculate centroid of image?
This option is only used if the input is a two-dimensional image. If "yes", calculate the centroid of the distribution; otherwise, calculate the mean value.
Parameter=median (boolean default=no)
Calculate median value?
If set, calculate the median value of the distribution. Setting to "no" will speed up the execution time of the program. The median is not calculated if the file is an image and centroid=yes.
Parameter=sigma (boolean default=yes)
Calculate the population standard deviation?
If set, calculate the standard deviation of the distribution. Setting to "no" will speed up the execution time of the program.
Parameter=clip (boolean default=no)
Calculate stats using sigma clipping?
Set to "yes" to use an iterative sigma-clipping algorithm; see also the parameters nsigma and maxiter. This parameter is ignored if the input file is an image and the centroid parameter is set to "yes."
Parameter=nsigma (real default=3)
Number of sigma to clip
If the clip parameter is set to "yes", values are removed if they lie more than nsigma standard deviations away from the mean.
Parameter=maxiter (integer default=20 min=1)
Maximum number of iterations
If the clip parameter is set to "yes", repeat the clipping algorithm until either no more points are removed - when the cnvrgd value will be reported as Y - or maxiter iterations has occurred, in which case the cnvrgd value will be reported as N.
Parameter=out_columns (string)
The names of the columns or image block.
A comma-separated list of column names when infile is a table or an image when centroid=yes, otherwise the name of the block containing the image (centroid=no). This parameter can be used to identify which element of the other out_* parameters belongs to which column.
Vector columns are listed as the individual components, and array columns are written out as a list of "name[index]" values, where index starts from 0. So, if infile="evt2.fits[cols time,sky,phas]" for a Chandra event file, then out_columns will be set to "time,x,y,phas[0],phas[1],phas[2],phas[3],phas[4],phas[5],phas[6],phas[7],phas[8]".
This is an output parameter.
Parameter=out_min (string)
Minimum values.
The minimum values, stored as a comma-separated string.
This is an output parameter.
Parameter=out_min_loc (string)
Location of the minimum values.
The location of the minimum values, stored as a comma-separated string. For columns, the row number is used (starting from 1). For images, the physical coordinate system is used, when available; otherwise, the logical coordinate system is used.
This is an output parameter.
Parameter=out_max (string)
Maximum values.
The maximum values, stored as a comma-separated string.
This is an output parameter.
Parameter=out_max_loc (string)
The location of the maximum values.
The location of the minimum values, stored as a comma-separated string. For columns, the row number is used (starting from 1). For images, the physical coordinate system is used, when available; otherwise, the logical coordinate system is used.
This is an output parameter.
Parameter=out_mean (string)
The mean values.
The mean values stored as a comma-separated string. This parameter is set to "" when infile is an image and centroid=yes.
This is an output parameter.
Parameter=out_median (string)
The median values.
The median values stored as a comma-separated string. This parameter is set to "" when the median value has not been calculated.
This is an output parameter.
Parameter=out_sigma (string)
The standard deviation.
The standard deviation values stored as a comma-separated string. This parameter is set to "" when the standard deviation is not calculated.
This is an output parameter.
Parameter=out_sum (string)
The sum of the values.
The sum of all the values in a column, or an image, stored in a comma-separated list. This parameter is set to "" when infile is an image and centroid=yes.
This is an output parameter.
Parameter=out_good (string)
The number of good values.
The number of "good" values in the column or image, stored as a comma-separated list.
This is an output parameter.
Parameter=out_null (string)
The number of ignored values.
The number of ignored rows or pixels, stored as a comma-separated list.
This is an output parameter.
Parameter=out_cnvrgd (string)
Did the sigma clipping converge?
A Y value (yes) indicates that the sigma clipping converged, otherwise N (no) is used for each column or image. This parameter is set to "" when the sigma-clipping algorithm is not used.
This is an output parameter.
Parameter=out_cntrd_log (string)
The centroid of the image, in logical coordinates.
The centroid of the image in logical coordinates, stored as a comma-separated list. If the input is not an image or centroid=no, then this parameter is set to "".
This is an output parameter.
Parameter=out_cntrd_phys (string)
The centroid of the image, in physical coordinates.
The centroid of the image in physical coordinates, stored as a comma-separated list. If the input is not an image or centroid=no, then this parameter is set to "".
This is an output parameter.
Parameter=out_sigma_cntrd (string)
The standard deviation of the centroid, in logical coordinates.
The "standard deviation" of the image centroid in logical coordinates, stored as a comma-separated list. If the input is not an image or centroid=no, then this parameter is set to "".
This is an output parameter.
Parameter=verbose (integer default=1 min=0)
Level of verbosity
If the verbose value is set to one (1), the default, the tool will use the standard output to the screen. If the verbose value is set to zero (0) the output is not shown, but the calculation values are still stored in the par file.
Using Virtual Columns
If the table contains a column with a World Coordinate System attached to it - for instance, the SKY column of a Chandra event list - then dmstat can calculate statistics in both the original and transformed coordinate frames. All you have to do is include the names of the columns you require - e.g. "evt2.fits[cols ra]" or "evt2.fits[cols eqpos]" - as part of the infile parameter. For example:
unix% dmstat "evt2.fits[cols eqpos]" EQPOS(RA, DEC)[deg] min: ( 258.57348385 67.020579406 ) @: ( 31812 29626 ) max: ( 259.82637267 67.421050984 ) @: ( 206 7825 ) mean: ( 259.26622974 67.217694774 ) sigma: ( 0.30743608783 0.094642855434 ) sum: ( 48892684.871 12675980.098 ) good: ( 188581 188581 ) null: ( 0 0 )
Handling Invalid Data
Tables and images can include elements which do not contain valid data; for instance, the result of a division by 0 or an event in a grating observation which does not have a grating order. Such items are generally flagged by a "special" number (see the "null" option in "ahelp dmopt") and dmstat will ignore them during the calculation. As part of its output, dmstat reports the number of valid (good) and invalid (null) rows/pixels it processed.
Pixels in images can also be ignored if they lie outside the region of interest - i.e. a spatial filter which has been applied to the image. dmstat uses the subspace information of the image (see "ahelp subspace") to find out which pixels to exclude from the calculation.
Accessing Statistics from the Parameter File
dmstat writes the calculated values to the parameter file (using the "out_*" parameters). This means that you do not need to parse the screen output to access these values. The following csh/tcsh expression will set the variable m to the mean value calculated by dmstat.
unix% set m = `pget dmstat out_mean`
Not all the fields are written to the parameter file; the output depends on what options are selected and the type of the input file. For example, out_median is only filled in if the median option is set to true and either the input file is not an image or the centroid option is set to false. The contents of some of the fields also depends on the options chosen: e.g. when analyzing an image, the out_columns parameter is set to the block name when the centroid option is set to no, but it is set to a comma-separated list of the names of the components of the physical coordinate system when centroid is set to "yes".
Using multiple columns
When given a scalar column, such as "evt2.fits[cols x]", the output values will also be scalar values. For more complicated situations, such as "evt2.fits[cols sky]" and "evt2.fits[cols time,energy]", the output values will be written out as a comma-separated list. As these values can be treaded as stacks (see "ahelp stack"), the stack tools and module ("ahelp /stk_/") can be used to access individual elements of the list. The contents of the out_columns parameter can be used to map between position and column. So, after
unix% dmstat "evt2.fits[cols time,sky]" sigma- time[s] min: 68010810.424 @: 1 max: 68010862.403 @: 1973 mean: 68010837.09 sum: 1.3602167418e+11 good: 2000 null: 0 sky(x, y)[pixel] min: ( 2495.1342773 2870.4226074 ) @: ( 206 940 ) max: ( 5968.3662109 5709.3413086 ) @: ( 1610 1384 ) mean: ( 3969.3936443 4225.1935254 ) sum: ( 7938787.2886 8450387.0508 ) good: ( 2000 2000 ) null: ( 0 0 )
we can say
unix% set cols = `pget dmstat out_columns` unix% echo $cols time,x,y unix% set means = `pget dmstat out_mean` unix% echo $means 68010837.09,3969.3936443,4225.1935254 unix% stk_where $cols x echo+ 2 unix% stk_read_num $means 2 echo+ 3969.3936443
Note that although only two columns - namely "time" and "sky" - were given to dmstat, the output contains three values, since the "sky" column has been broken up into its constituent columns ("x" and "y"). Columns that are unsupported by dmstat - i.e. bit arrays and character strings - are not written out to the parameter file.
For images, the number of values written out to each parameter depends on the dimensionality of the image (although dmstat currently only accepts two-dimensional images).
Setting the infile parameter to a stack
When the infile parameter is a stack - such as "img1.fits,img2.fits" or "@file.lis" - only the statistics for the last item in the stack are written to the parameter file. The filename can be found from the parameter file using something like the following:
unix% set files = `pget dmstat infile` unix% set n = `stk_count "$files" echo+` unix% set file = `stk_read_num "$files" $n echo+`
Bugs
Caveats
- dmstat reads the null value defined for a column as an integer whether or not the image actually contains integer values.
-
For images (and probably for tables, as well), dmstat reads the null value defined for a column as an integer whether or not the image actually contains integer values. So as dmstat calculates its output, null=0 behaves exactly like null=0.9 and null=1.1 behaves like null=1.0.
This means that if there are pixels inside of the region whose values actually are 0.0, they are added up with the null pixels outside of the region, in the case that if null=0.0 or null=0.9; similarly for null=1.0. This inflates the null count returned by dmstat.
-
Workaround:
Using null=NaN should give good results. Technically, NaN is the "only" value that should be used for floating point images (or columns). Using any other null value with a floating point image is advised against since that value might already be represented in the image.
See Also
- concept
- subspace
- dm
- dm, dmascii, dmfiltering, dmopt
- tools::composite
- srcextent, srcflux
- tools::coordinates
- dmcoords
- tools::core
- dmappend, dmcopy, dmextract, dmlist
- tools::image
- centroid_map, dmfilth, dmimg2jpg, dmimgadapt, dmimgblob, dmimgcalc, dmimgdist, dmimgfilt, dmimghist, dmimgpick, dmimgpm, dmimgproject, dmimgreproject, dmimgthresh, dmmaskbin, dmmaskfill, dmnautilus, dmradar, dmregrid, dmregrid2, energy_hue_map, evalpos, hexgrid, map2reg, merge_too_small, mkregmap, pathfinder, vtbin
- tools::region
- dmcontour, dmellipse, dmimghull, dmimglasso
- tools::response
- mean_energy_map, pileup_map
- tools::statistics
- aplimits, aprates, imgmoment, lim_sens, statmap