This function computes a distance between two distributions represented by pdqr-functions. Here "distance" is used in a broad sense: a single non-negative number representing how much two distributions differ from one another. Bigger values indicate a bigger difference. A zero value means that the input distributions are equivalent based on the method used (except method "avgdist", which almost always returns a positive value). The notion of "distance" is useful for doing statistical inference about similarity of two groups of numbers.
summ_distance(f, g, method = "KS")
Argument | Description |
---|---|
f | A pdqr-function of any type and class. |
g | A pdqr-function of any type and class. |
method | Method for computing distance. Should be one of "KS", "totvar", "compare", "wass", "cramer", "align", "avgdist", "entropy". |
A single non-negative number representing the distance between a pair of distributions. For methods "KS", "totvar", and "compare" it is not bigger than 1. For method "avgdist" it is almost always bigger than 0.
Methods can be separated into three categories: probability based, metric based, and entropy based.
Probability based methods return a number between 0 and 1 which is computed in a way that is mostly based on probability:
Method "KS" (short for Kolmogorov-Smirnov) computes the supremum of
absolute difference between p-functions corresponding to f
and g
(|F - G|
). Here "supremum" is meant to describe the fact that if input functions
have different types, there can be no point at which "KS"
distance is achieved. Instead, there might be a sequence of points from left
to right with |F - G|
values tending to the result (see Examples).
Method "totvar" (short for "total variation") computes a biggest absolute
difference of probabilities for any subset of real line. In other words,
there is a set of points for "discrete" type and intervals for "continuous",
total probability of which under f
and g
differs the most. Note that
if f
and g
have different types, output is always 1. The set of interest
consists from all "x" values of "discrete" pdqr-function: probability under
"discrete" distribution is 1 and under "continuous" is 0.
Method "compare" represents a value computed based on probabilities of
one distribution being bigger than the other (see pdqr methods for "Ops" group generic family for more details on comparing
pdqr-functions). It is computed as
2*max(P(F > G), P(F < G)) + 0.5*P(F = G) - 1
(here P(F > G)
is basically
summ_prob_true(f > g)
). This is maximum of two values (P(F > G) + 0.5*P(F = G)
and P(F < G) + 0.5*P(F = G)
), normalized to return values from 0
to 1. Other way to look at this measure is that it computes (before
normalization) two ROC AUC values with method "expected"
for two possible ordering (f, g
, and g, f
) and takes their maximum.
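Here is a minimal sketch of reconstructing this value from comparison probabilities (distributions are chosen only for illustration; P(F = G) is recovered as the complement of the other two probabilities):
f <- as_d(dnorm)
g <- as_d(dnorm, mean = 1)
pr_gt <- summ_prob_true(f > g)
pr_lt <- summ_prob_true(f < g)
pr_eq <- 1 - pr_gt - pr_lt  # zero for a pair of "continuous" inputs
# Manual reconstruction of the "compare" distance
2 * max(pr_gt, pr_lt) + pr_eq - 1
summ_distance(f, g, method = "compare")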
Metric based methods compute "how far" two distributions are apart on the real line:
Method "wass" (short for "Wasserstein") computes a 1-Wasserstein
distance: "minimum cost of 'moving' one density into another", or "average
path density point should go while transforming from one into another". It is
computed as integral of |F - G|
(absolute difference between p-functions).
If any of f
and g
has "continuous" type, stats::integrate()
is used, so
relatively small numerical errors can happen.
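A sketch of the underlying computation (integration bounds are arbitrary values wide enough to cover both supports of these illustrative inputs):
f <- as_d(dnorm)
g <- as_d(dnorm, mean = 1)
p_f <- as_p(f)
p_g <- as_p(g)
# 1-Wasserstein distance as the integral of |F - G|
stats::integrate(function(x) abs(p_f(x) - p_g(x)), lower = -5, upper = 6)$value
summ_distance(f, g, method = "wass")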
Method "cramer" computes Cramer distance: integral of (F - G)^2
. This
somewhat relates to "wass" method as variance relates to first central absolute moment. Relatively small numerical errors
can happen.
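The same kind of sketch as for "wass", with the squared difference under the integral (bounds again chosen for these illustrative inputs):
f <- as_d(dnorm)
g <- as_d(dnorm, mean = 1)
p_f <- as_p(f)
p_g <- as_p(g)
# Cramer distance as the integral of (F - G)^2
stats::integrate(function(x) (p_f(x) - p_g(x))^2, lower = -5, upper = 6)$value
summ_distance(f, g, method = "cramer")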
Method "align" computes an absolute value of shift d
(possibly
negative) that should be added to f
to achieve both P(f+d >= g) >= 0.5
and P(f+d <= g) >= 0.5
(in other words, align f+d
and g
) as close as
reasonably possible. Solution is found numerically with stats::uniroot()
,
so relatively small numerical errors can happen. Also note that this
method is somewhat slow (compared to all others). To increase speed, use less
elements in "x_tbl" metadata. For example, with
form_retype()
or smaller n_grid
argument in as_*() functions.
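A sketch of what "align" produces for a pure location shift (an illustrative pair in which the aligning shift should be close to the difference in centers):
f <- as_d(dnorm)
g <- as_d(dnorm, mean = 1)
# Aligning shift (should be close to 1 here, up to numerical error)
d <- summ_distance(f, g, method = "align")
d
# After shifting, both probabilities should be close to 0.5
summ_prob_true((f + d) >= g)
summ_prob_true((f + d) <= g)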
Method "avgdist" computes average distance between sample values from
inputs. Basically, it is a deterministically computed approximation of
expected value of absolute difference between random variables, or in 'pdqr'
code: summ_mean(abs(f - g))
(but computed without randomness). Computation
is done by approximating possibly present continuous pdqr-functions with
discrete ones (see description of "pdqr.approx_discrete_n_grid" option for more information) and then computing output value
directly based on two discrete pdqr-functions. Note that this method
almost never returns zero, even for identical inputs (except the case of
discrete pdqr-functions with identical one value).
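A sketch that illustrates both the approximated quantity and the "almost never zero" property (the second computation is a random Monte Carlo cross-check, not part of the method itself):
f <- as_d(dnorm)
# Even identical inputs give a positive value: it approximates E|X - Y|
# for independent X and Y with the same distribution
summ_distance(f, f, method = "avgdist")
# Random cross-check of E|X - Y| with simulated samples
set.seed(101)
mean(abs(as_r(f)(1e4) - as_r(f)(1e4)))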
Entropy based methods compute output based on entropy characteristics:
Method "entropy" computes sum of two Kullback-Leibler divergences:
KL(f, g) + KL(g, f)
, which are outputs of summ_entropy2()
with method
"relative". Notes:
If f
and g
don't have the same support, distance can be very high.
Error is thrown if f
and g
have different types (the same as in
summ_entropy2()
).
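A minimal sketch of the relation to summ_entropy2() (illustrative inputs; since their supports are not identical, the resulting value can be relatively high):
f <- as_d(dnorm)
g <- as_d(dnorm, mean = 1)
# Sum of both Kullback-Leibler divergences should match the "entropy" output
summ_entropy2(f, g, method = "relative") + summ_entropy2(g, f, method = "relative")
summ_distance(f, g, method = "entropy")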
summ_separation() for computation of an optimal threshold separating a pair of distributions.
Other summary functions: summ_center(), summ_classmetric(), summ_entropy(), summ_hdr(), summ_interval(), summ_moment(), summ_order(), summ_prob_true(), summ_pval(), summ_quantile(), summ_roc(), summ_separation(), summ_spread()
d_unif <- as_d(dunif, max = 2)
d_norm <- as_d(dnorm, mean = 1)
vapply(
  c(
    "KS", "totvar", "compare",
    "wass", "cramer", "align", "avgdist",
    "entropy"
  ),
  function(meth) {
    summ_distance(d_unif, d_norm, method = meth)
  },
  numeric(1)
)
#> KS totvar compare wass cramer align avgdist entropy
#> 0.1586546 0.3173092 0.0000000 0.2978750 0.0271365 0.0000000 0.9246546 5.7929318
# "Supremum" quality of "KS" distance
d_dis <- new_d(2, "discrete")
## Distance is 1, which is the limit of |F - G| at points tending to 2 from
## the left
summ_distance(d_dis, d_unif, method = "KS")
#> [1] 1