The concept of summary functions is to take one or more pdqr-functions and return a summary value (which need not be a number). The method argument is used to choose a function-specific computation algorithm.

Note that some summary functions can accumulate pdqr approximation error (summ_moment(), for example). For better precision, increase the number of intervals for the piecewise-linear density using either the n argument for density() in new_*() or the n_grid argument in as_*().

We will use the following distributions throughout this vignette:

my_beta <- as_d(dbeta, shape1 = 2, shape2 = 5)
my_norm <- as_d(dnorm, mean = 0.5)
my_beta_mix <- form_mix(list(my_beta, my_beta + 1))

Although both of these are continuous, discrete distributions are also fully supported.

Basic numerical summary

Spread
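Spread of a distribution is computed with summ_spread() and its wrappers. A minimal sketch, assuming the summ_spread() interface and the summ_sd()/summ_var()/summ_mad() wrappers as documented in 'pdqr' (outputs omitted):

```r
library(pdqr)

my_beta <- as_d(dbeta, shape1 = 2, shape2 = 5)

# Generic interface with `method` argument
summ_spread(my_beta, method = "sd")
summ_spread(my_beta, method = "iqr")

# Wrappers for common spread measures
summ_sd(my_beta)
summ_var(my_beta)
summ_mad(my_beta)
```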

Moments

summ_moment() has extra arguments for controlling the nature of the moment (which can be combined):

summ_moment(my_beta, order = 3)
#> [1] 0.0476182
summ_moment(my_beta, order = 3, central = TRUE)
#> [1] 0.002429287
summ_moment(my_beta, order = 3, standard = TRUE)
#> [1] 11.68727
summ_moment(my_beta, order = 3, absolute = TRUE)
#> [1] 0.0476182

There are wrappers for the most common moments: skewness and kurtosis:
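A brief sketch, assuming the wrapper names summ_skewness() and summ_kurtosis() from the 'pdqr' documentation (outputs omitted):

```r
library(pdqr)

my_beta <- as_d(dbeta, shape1 = 2, shape2 = 5)

# Standardized third central moment
summ_skewness(my_beta)

# Standardized fourth central moment (see package docs for exact definition)
summ_kurtosis(my_beta)
```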

Quantiles

summ_quantile(f, probs) is essentially a stricter version of as_q(f)(probs):
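For example (outputs omitted):

```r
library(pdqr)

my_beta <- as_d(dbeta, shape1 = 2, shape2 = 5)

# Quantiles at several probabilities
summ_quantile(my_beta, probs = c(0.25, 0.5, 0.75))

# Essentially the same as converting to a q-function first
as_q(my_beta)(c(0.25, 0.5, 0.75))
```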

Entropy

summ_entropy() computes differential entropy (which can be negative) for “continuous” type pdqr-functions, and information entropy for “discrete”:
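A sketch of both cases (the discrete distribution here is an arbitrary illustration; outputs omitted):

```r
library(pdqr)

my_beta <- as_d(dbeta, shape1 = 2, shape2 = 5)

# Differential entropy for "continuous" type (can be negative)
summ_entropy(my_beta)

# Information entropy for "discrete" type
my_dis <- new_d(data.frame(x = 1:4, prob = c(0.4, 0.3, 0.2, 0.1)), type = "discrete")
summ_entropy(my_dis)
```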

summ_entropy2() computes an entropy-based summary of the relation between a pair of distributions. There are two methods: the default “relative” (for relative entropy, i.e. Kullback-Leibler divergence) and “cross” (for cross-entropy). It handles different supports by using a clip value (default exp(-20)) instead of 0 during log() computation. The order of inputs matters: summ_entropy2() uses the support of the first pdqr-function as the integration/summation reference.

summ_entropy2(my_beta, my_norm)
#> [1] 1.439193
summ_entropy2(my_norm, my_beta)
#> [1] 11.61849
summ_entropy2(my_norm, my_beta, clip = exp(-10))
#> [1] 5.289639
summ_entropy2(my_beta, my_norm, method = "cross")
#> [1] 0.9546508

Regions

Distributions can be summarized with regions: unions of closed intervals. A region is represented as a data frame with rows representing intervals and two columns, “left” and “right”, containing the left and right interval edges respectively.
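For instance, a region consisting of the two intervals [0, 0.5] and [1, 1.5] is just a data frame built with base R:

```r
# A region of two intervals: [0, 0.5] and [1, 1.5]
my_region <- data.frame(left = c(0, 1), right = c(0.5, 1.5))
my_region
```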

Single interval

summ_interval() summarizes an input pdqr-function with a single interval based on the desired coverage level supplied in the level argument. It has three methods:

  • Default “minwidth”: interval with total probability of level that has minimum width.
  • “percentile”: 0.5*(1-level) and 1 - 0.5*(1-level) quantiles.
  • “sigma”: interval centered at the mean of the distribution. Left and right edges are distant from the center by the standard deviation multiplied by level’s critical value (computed from the normal distribution). Corresponds to the classical confidence interval of a sample under the assumption of normality.

summ_interval(my_beta, level = 0.9, method = "minwidth")
#>         left     right
#> 1 0.03015543 0.5252921
summ_interval(my_beta, level = 0.9, method = "percentile")
#>         left     right
#> 1 0.06284986 0.5818016
summ_interval(my_beta, level = 0.9, method = "sigma")
#>         left    right
#> 1 0.02300124 0.548426

Highest density region

summ_hdr() computes the highest density region (HDR) of a distribution: the set of intervals with the lowest total width among all sets with total probability not less than an input level. For a unimodal distribution it is essentially the same as summ_interval() with the “minwidth” method.
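For a multimodal distribution, the HDR can consist of several intervals, which is where it differs from a single-interval summary. A sketch using the bimodal mixture defined earlier (outputs omitted):

```r
library(pdqr)

my_beta <- as_d(dbeta, shape1 = 2, shape2 = 5)
my_beta_mix <- form_mix(list(my_beta, my_beta + 1))

# For a bimodal mixture, HDR typically consists of two intervals
summ_hdr(my_beta_mix, level = 0.9)

# Compare with a single "minwidth" interval
summ_interval(my_beta_mix, level = 0.9, method = "minwidth")
```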

Separation and classification

Separation

Function summ_separation() computes a threshold that optimally separates the distributions represented by a pair of input pdqr-functions. In other words, summ_separation() solves a binary classification problem with a one-dimensional linear classifier: values not greater than some threshold are classified as one class, and values greater than the threshold as the other. The order of input functions doesn’t matter.

summ_separation(my_beta, my_norm, method = "KS")
#> [1] 0.6175981
summ_separation(my_beta, my_norm, method = "F1")
#> [1] 0.007545242

Classification metrics

Functions summ_classmetric() and summ_classmetric_df() compute metric(s) of a classification setup similar to the one used in summ_separation(). Here the classifier threshold should be supplied, and the order of inputs matters. Classification is assumed to be done as follows: any x value not greater than the threshold is classified as “negative”; any greater value as “positive”. Classification metrics are computed based on two pdqr-functions: f, which represents the distribution of values that should be classified as “negative” (“true negative”), and g, the same for “positive” (“true positive”).
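A sketch of usage (the metric codes "F1" and "Acc" follow the 'pdqr' documentation for summ_classmetric(); treat them as assumptions and consult the docs for the full list; outputs omitted):

```r
library(pdqr)

my_beta <- as_d(dbeta, shape1 = 2, shape2 = 5)
my_norm <- as_d(dnorm, mean = 0.5)

# `my_beta` describes "negative" values, `my_norm` describes "positive" ones
thres <- summ_separation(my_beta, my_norm, method = "F1")

# Single metric at a supplied threshold
summ_classmetric(my_beta, my_norm, threshold = thres, method = "F1")

# Several metrics at once, returned as a data frame
summ_classmetric_df(my_beta, my_norm, threshold = thres, method = c("F1", "Acc"))
```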

With summ_roc() and summ_rocauc() one can compute a data frame of ROC curve points and the ROC AUC value respectively. There is also a roc_plot() function for predefined plotting of the ROC curve.
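For example (outputs omitted):

```r
library(pdqr)

my_beta <- as_d(dbeta, shape1 = 2, shape2 = 5)
my_norm <- as_d(dnorm, mean = 0.5)

# Data frame of ROC curve points
roc <- summ_roc(my_beta, my_norm)
head(roc)

# Area under the ROC curve
summ_rocauc(my_beta, my_norm)

# Predefined plot of the ROC curve
roc_plot(roc)
```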

Ordering

‘pdqr’ has functions that can order a set of distributions: summ_order(), summ_sort(), and summ_rank(), which are analogues of order(), sort(), and rank() respectively. They take a list of pdqr-functions as input, establish their ordering based on the specified method, and return the desired output.

There are two sets of methods:

  • Method “compare” uses the following ordering relation: pdqr-function f is greater than g if and only if P(f >= g) > 0.5, or in ‘pdqr’ code summ_prob_true(f >= g) > 0.5. This method orders input based on this relation and the order() function. Notes:
    • This relation doesn’t strictly define an ordering because it is not transitive. This is resolved by first pre-ordering the input list based on method “mean” and then calling order().
    • Because comparing two pdqr-functions can be time consuming, this method becomes rather slow as the number of distributions grows. To increase computation speed (sacrificing a little approximation precision), use fewer intervals in the piecewise-linear density approximation for “continuous” pdqr-functions.
  • Methods “mean”, “median”, and “mode” are based on summ_center(): ordering of distributions is defined as ordering of corresponding measures of distribution’s center.
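A sketch of all three functions on a small list of distributions (outputs omitted):

```r
library(pdqr)

my_beta <- as_d(dbeta, shape1 = 2, shape2 = 5)
my_norm <- as_d(dnorm, mean = 0.5)

f_list <- list(beta = my_beta, norm = my_norm, beta_shifted = my_beta + 1)

# Ordering permutation, as in base order()
summ_order(f_list, method = "compare")

# Sorted list of pdqr-functions, and their ranks
summ_sort(f_list, method = "mean")
summ_rank(f_list, method = "median")
```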

Other

Functions summ_prob_true() and summ_prob_false() should be used to extract probabilities from boolean pdqr-functions: outputs of comparison operators (like >=, ==, etc.):

summ_prob_true(my_beta >= my_norm)
#> [1] 0.416062
summ_prob_false(my_beta >= 2*my_norm)
#> [1] 0.6391

summ_pval() computes p-value(s) of observed statistic(s) based on the distribution. You can compute left, right, or two-sided p-values with methods “left”, “right”, and “both” respectively. By default, multiple input values are adjusted for multiple comparisons (using stats::p.adjust()):
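For example (the observed values here are arbitrary illustrations; outputs omitted):

```r
library(pdqr)

my_norm <- as_d(dnorm, mean = 0.5)

# Right one-sided p-value of a single observed statistic
summ_pval(my_norm, obs = 2, method = "right")

# Two-sided p-values of several statistics, adjusted for
# multiple comparisons by default
summ_pval(my_norm, obs = c(0, 1, 2), method = "both")
```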