Convert to pdqr-function

Convert some function to be a proper pdqr-function of specific class, i.e. a function describing distribution with finite support and finite values of probability/density.

as_p(f, ...)

# S3 method for default
as_p(f, support = NULL, ..., n_grid = 10001)

# S3 method for pdqr
as_p(f, ...)

as_d(f, ...)

# S3 method for default
as_d(f, support = NULL, ..., n_grid = 10001)

# S3 method for pdqr
as_d(f, ...)

as_q(f, ...)

# S3 method for default
as_q(f, support = NULL, ..., n_grid = 10001)

# S3 method for pdqr
as_q(f, ...)

as_r(f, ...)

# S3 method for default
as_r(f, support = NULL, ..., n_grid = 10001,
  n_sample = 10000, args_new = list())

# S3 method for pdqr
as_r(f, ...)

Arguments

f	Appropriate function to be converted (see Details).
...	Extra arguments to `f`.
support	Numeric vector with two increasing elements describing desired support of output. If it is `NULL` or any of its values is `NA`, detection is done using specific algorithms (see Details).
n_grid	Number of grid points at which `f` will be evaluated (see Details). Bigger values lead to better approximation precision, but worse memory usage and evaluation speed (direct and in `summ_*()` functions).
n_sample	Number of points to sample from `f` inside `as_r()`.
args_new	List of extra arguments for `new_d()` to control `density()` inside `as_r()`.

Value

A pdqr-function of corresponding class.

Details

The general purpose of as_*() functions is to create a proper pdqr-function of the desired class from an input that is not yet one. The sequence of steps taken to achieve that goal is described below.

If f is already a pdqr-function, as_*() functions properly update it to have specific class. They take input's "x_tbl" metadata and type to use with corresponding new_*() function. For example, as_p(f) in case of pdqr-function f is essentially the same as new_p(x = meta_x_tbl(f), type = meta_type(f)).

If f is a function describing "honored" distribution, it is detected and output is created in predefined way taking into account extra arguments in .... For more details see "Honored distributions" section.

If f is some other unknown function, as_*() functions use heuristics for approximating input distribution with a "proper" pdqr-function. Outputs of as_*() can only be pdqr-functions of type "continuous" (because of issues with support detection). It is assumed that f returns values appropriate for the desired output class of as_*() function and output type "continuous". For example, input for as_p() should return values of some continuous cumulative distribution function (monotonically non-decreasing values from 0 to 1). To manually create a function of type "discrete", supply data frame input describing it to the appropriate new_*() function.

General algorithm of how as_*() functions work for unknown function is as follows:

Detect support. See "Support detection" section for more details.
Create data frame input for new_*(). The exact process differs:
- In as_p() an equidistant grid of n_grid points is created inside detected support. After that, input's values at the grid are taken as reference points of cumulative distribution function and used to approximate density at that same grid. This method has proved to work more reliably in case density goes to infinity. That grid and density values are used as "x" and "y" columns of data frame input for new_p().
- In as_d() the "x" column of the data frame is the same equidistant grid as in as_p(). The "y" column is taken as input's values at this grid after possibly imputing infinite values. This imputation is done by taking the maximum of the left and right linear extrapolations on the mentioned grid.
- In as_q(), at first inverse of input f function is computed on [0; 1] interval. It is done by approximating it with piecewise-linear function on [0; 1] equidistant grid with n_grid points, imputing infinity values (which ensures finite support), and computing inverse of approximation. This inverse of f is used to create data frame input with as_p().
- In as_r() at first d-function with new_d() is created based on the same sample used for support detection and extra arguments supplied as list in args_new argument. In other words, density estimation is done based on sample, generated from input f. After that, its values are used to create data frame with as_d().
Use appropriate new_*() function with data frame from previous step and type = "continuous". This step implies that all tails outside detected support are trimmed and data frame is normalized to represent proper piecewise-linear density.
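The grid-evaluation and normalization steps for an unknown density can be illustrated in base R. This is only a minimal sketch under stated assumptions, not the package's implementation: the helper name `approx_density_tbl()` is hypothetical, the support is assumed to be already known and finite, and no imputation of infinite values is attempted.

```r
# Hypothetical helper (not part of pdqr): evaluate a density-like function
# on an equidistant grid and normalize it so that the piecewise-linear
# curve through the points integrates to 1.
approx_density_tbl <- function(f, support, n_grid = 10001) {
  x <- seq(support[1], support[2], length.out = n_grid)
  y <- f(x)
  # Trapezoidal rule is exact for a piecewise-linear function
  total <- sum(diff(x) * (y[-1] + y[-length(y)]) / 2)
  data.frame(x = x, y = y / total)
}

# Same quadratic density as in the Examples section
tbl <- approx_density_tbl(function(x) 0.75 * (1 - x^2), support = c(-1, 1))

# Integral of the normalized piecewise-linear density
integral <- sum(diff(tbl$x) * (tbl$y[-1] + tbl$y[-nrow(tbl)]) / 2)
```

A data frame of this shape, with "x" and "y" columns, is the kind of input that the final step passes to new_*() with type = "continuous".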

Honored distributions

For efficient workflow, some commonly used distributions are recognized as special ("honored"). Those receive different treatment in as_*() functions.

Basically, there is a manually selected list of "honored" distributions together with enough information to detect them. Currently that list has all common univariate distributions from 'stats' package, i.e. all except multinomial and "less common distributions of test statistics".

"Honored" distribution is recognized only if f is one of p*(), d*(), q*(), or r*() function describing honored distribution and is supplied as variable with original name. For example, as_d(dunif) will be treated as "honored" distribution but as_d(function(x) {dunif(x)}) will not.
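One way such name-based recognition could work is to inspect the expression supplied for f rather than its value. The sketch below is an assumption made for illustration only: `is_honored_name()` and its short list of names are hypothetical and do not reproduce the package's actual detection code.

```r
# Hypothetical sketch: recognize a function by the name it was supplied
# under, not by what it computes.
is_honored_name <- function(f) {
  # Text of the expression passed as `f` (may span several strings)
  f_name <- deparse(substitute(f))
  honored <- c("dunif", "dnorm", "dpois")  # illustrative subset only
  length(f_name) == 1 && f_name %in% honored
}

my_dunif <- dunif

res_original <- is_honored_name(dunif)                    # original name
res_wrapped <- is_honored_name(function(x) {dunif(x)})    # anonymous wrapper
res_renamed <- is_honored_name(my_dunif)                  # same function, new name
```

Under this scheme only the first call is recognized, which mirrors the rule that the function has to be supplied as a variable with its original name.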

After it is recognized that input f represents "honored" distribution, its support is computed based on predefined rules. Those take into account special features of distribution (like infinite support or infinite density values) and supplied extra arguments in .... Usually output support "loses" only around 1e-6 probability on each infinite tail.

After that, for "discrete" type output new_d() is used for appropriate data frame input and for "continuous" - as_d() with appropriate d*() function and support. D-function is then converted to desired class with as_*().

Support detection

In case input is a function without any extra information, as_*() functions must know which finite support its output should have. User can supply desired support directly with support argument, which can also be NULL (meaning automatic detection of both edges) or contain NA to detect only the corresponding edges.

Support is detected in order to preserve as much information as practically reasonable. Exact methods differ:

In as_p() support is detected as values at which input function is equal to 1e-6 (left edge detection) and 1 - 1e-6 (right edge), which means "losing" 1e-6 probability on each tail. Note that those values are searched inside [-10^100; 10^100] interval.
In as_d(), at first an attempt at finding one point of non-zero density is made by probing 10000 points spread across wide range of real line (approximately from -1e7 to 1e7). If input's value at all of them is zero, error is thrown. After finding such point, cumulative distribution function is made by integrating input with integrate() using found point as reference (without this there will be poor accuracy of integrate()). Created CDF function is used to find 1e-6 and 1 - 1e-6 quantiles as in as_p(), which serve as detected support.
In as_q() quantiles for 0 and 1 are probed for being infinite. If they are, 1e-6 and 1 - 1e-6 quantiles are used respectively instead of infinite values to form detected support.
In as_r() sample of size n_sample is generated and detected support is its range stretched by mean difference of sorted points (to account for possible tails at which points were not generated). Note that this means that original input f "demonstrates its randomness" only once inside as_r(), with output then used for approximation of "original randomness".
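Two of these rules can be sketched in base R. Both helpers are hypothetical and rest on assumptions: the as_p()-style search uses `uniroot()` on a deliberately narrow interval instead of [-10^100; 10^100], and the as_r()-style stretch assumes the range is widened by the mean gap on each side.

```r
# Hypothetical sketch of edge detection for a CDF: find the points where
# it equals tail_prob and 1 - tail_prob.
detect_p_support <- function(p_fun, tail_prob = 1e-6, interval = c(-50, 50)) {
  left <- uniroot(
    function(x) p_fun(x) - tail_prob, interval = interval, tol = 1e-9
  )$root
  right <- uniroot(
    function(x) p_fun(x) - (1 - tail_prob), interval = interval, tol = 1e-9
  )$root
  c(left, right)
}

p_support <- detect_p_support(pnorm)

# Hypothetical sketch of sample-based detection: stretch the sample range
# by the mean difference of sorted points (assumed here to apply per side).
stretch_range <- function(smpl) {
  mean_gap <- mean(diff(sort(smpl)))
  range(smpl) + c(-1, 1) * mean_gap
}

set.seed(101)
smpl <- runif(10000)
r_support <- stretch_range(smpl)
```

For the standard normal CDF the detected edges should agree with `qnorm(c(1e-6, 1 - 1e-6))`, and the stretched range is always slightly wider than the observed sample range.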

Examples

# Convert existing "proper" pdqr-function
set.seed(101)
x <- rnorm(10)
my_d <- new_d(x, "continuous")

my_p <- as_p(my_d)

# Convert "honored" function to be a proper pdqr-function. To use this
# option, supply originally named function.
p_unif <- as_p(punif)
r_beta <- as_r(rbeta, shape1 = 2, shape2 = 2)
d_pois <- as_d(dpois, lambda = 5)

## `pdqr_approx_error()` computes pdqr approximation error
summary(pdqr_approx_error(as_d(dnorm), dnorm))
#>       grid            error               abserror        
#>  Min.   :-4.753   Min.   :-7.979e-07   Min.   :9.900e-12  
#>  1st Qu.:-2.377   1st Qu.:-4.000e-07   1st Qu.:1.975e-09  
#>  Median : 0.000   Median :-5.552e-08   Median :5.552e-08  
#>  Mean   : 0.000   Mean   :-2.104e-07   Mean   :2.104e-07  
#>  3rd Qu.: 2.377   3rd Qu.:-1.975e-09   3rd Qu.:4.000e-07  
#>  Max.   : 4.753   Max.   :-9.900e-12   Max.   :7.979e-07  

## This will work as if input is unknown function because of unsupported
## variable name
my_runif <- function(n) {
  runif(n)
}
r_unif_2 <- as_r(my_runif)
plot(as_d(r_unif_2))

# Convert some other function to be a "proper" pdqr-function
my_d_quadr <- as_d(function(x) {
  0.75 * (1 - x^2)
}, support = c(-1, 1))

# Support detection
unknown <- function(x) {
  dnorm(x, mean = 1)
}
## Completely automatic support detection
as_d(unknown)
#> Density function of continuous type
#> Support: ~[-37.36926, 39.36951] (10000 intervals)
## Semi-automatic support detection
as_d(unknown, support = c(-4, NA))
#> Density function of continuous type
#> Support: ~[-4, 39.36951] (10000 intervals)
as_d(unknown, support = c(NA, 5))
#> Density function of continuous type
#> Support: ~[-37.36926, 5] (10000 intervals)

## If support is very small and very distant from zero, it probably won't
## get detected in `as_d()` (throwing a relevant error)
if (FALSE) {
as_d(function(x) {
  dnorm(x, mean = 10000, sd = 0.1)
})
}

# Using different level of granularity
as_d(unknown, n_grid = 1001)
#> Density function of continuous type
#> Support: ~[-37.36926, 39.36951] (1000 intervals)