vignettes/design-and-format.Rmd
The main idea of the `ruler` package is to create a format for validation results (along with a functional API) that works naturally with tidyverse tools.
This vignette describes the design behind `ruler`’s validation result format. This should help to understand the foundations of the `ruler` validation workflow.

The preferred local data structure in the tidyverse is the tibble: “A modern re-imagining of the data frame”, on which its implementation is based. That is why `ruler` uses data frames as the preferred format for data to be validated. However, the initial goal is to use tibbles as much as possible when creating the validation result format.
Basically, a data frame is a list of variables of the same length. It is easier to think of it as a two-dimensional structure whose columns can be of different types.
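The "list of variables" view can be checked directly in base R. This is a small sketch; the data frame `df` and its column names are made up for illustration.

```r
# A toy data frame with three columns of different types (illustrative names)
df <- data.frame(
  id = 1:3,
  name = c("a", "b", "c"),
  flag = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# A data frame is a list ...
is.list(df)

# ... whose elements (variables) all have the same length
lengths(df)

# The two-dimensional view: rows and columns
dim(df)

# Columns may differ in type
vapply(df, function(x) class(x)[1], character(1))
```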
In abstract form, validation of a data frame can be put as asking whether a certain subset of the data frame (a data unit) obeys a certain rule. The result of validation is a logical value representing the answer.
Following `dplyr`’s grammar of data manipulation, a data frame can be represented in terms of the following data units, each validated with its own `dplyr` functions:

- Data as a whole: `summarise()` without grouping.
- Group as a whole: `summarise()` with grouping.
- Column as a whole: scoped variants of `summarise()` without grouping: `summarise_all()`, `summarise_if()` and `summarise_at()`.
- Row as a whole: `transmute()`.
- Cell: scoped variants of `transmute()`: `transmute_all()`, `transmute_if()` and `transmute_at()`.

In `ruler`, data, group, column, row and cell are the five basic data units. They can all be described by a combination of two variables:
Validation of data units can be done with the `dplyr` functions described above. Applying them to a data unit can answer multiple questions. That is why, by design, rules (functions that answer one specific question about one type of data unit) are combined into rule packs (functions that answer multiple questions about one type of data unit).
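The `dplyr` functions mentioned for the five data units can be sketched with plain `dplyr` calls on the built-in `mtcars` data. This is only an illustration: the rule names (`nrow_low`, `enough_rows`, `hp_gt_mpg` and so on) and the questions asked are assumptions, not part of `ruler`. Each named logical column plays the role of one rule, and each call plays the role of a rule pack.

```r
library(dplyr)

# Data as a whole: summarise() without grouping, two rules in one pack
data_res <- mtcars %>%
  summarise(nrow_low = n() > 10, nrow_high = n() < 100)

# Group as a whole: summarise() with grouping, one answer per group
group_res <- mtcars %>%
  group_by(cyl) %>%
  summarise(enough_rows = n() > 5)

# Column as a whole: scoped summarise(), one answer per column
col_res <- mtcars %>%
  summarise_if(is.numeric, function(x) !anyNA(x))

# Row as a whole: transmute(), one answer per row
row_res <- mtcars %>%
  transmute(hp_gt_mpg = hp > mpg)

# Cell: scoped transmute(), one answer per cell
cell_res <- mtcars %>%
  transmute_all(function(x) x >= 0)
```

In every case the output is a data frame of logical values, which matches the idea that a validation result is a logical answer about a data unit.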
Applying a rule pack to data involves several points:

- `ruler` has an option to remove obeyers (data units that obey a rule) from the results during validation.

In `ruler`, exposing data to rules means applying rule packs to data, collecting the results in a common format and attaching them to the data as an `exposure` attribute. This way the actual exposure can be done in multiple steps and can also be part of a general data preparation pipeline.
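A minimal sketch of exposing data in two steps inside one pipeline, assuming the `data_packs()`, `row_packs()`, `expose()` and `get_exposure()` functions exported by `ruler`. The pack names (`dims`, `engine`) and the rules themselves are made up for illustration.

```r
library(dplyr)
library(ruler)

# Rule pack about the data as a whole (illustrative pack and rule names)
my_data_packs <- data_packs(
  dims = . %>% summarise(
    nrow_low = nrow(.) > 10,
    ncol_high = ncol(.) < 5
  )
)

# Rule pack about rows as a whole (illustrative pack and rule names)
my_row_packs <- row_packs(
  engine = . %>% transmute(hp_gt_mpg = hp > mpg)
)

# Exposure done in two steps within a single pipeline
exposed <- mtcars %>%
  expose(my_data_packs) %>%
  expose(my_row_packs)

# The collected results travel with the data as an attribute
exposure <- get_exposure(exposed)
```

Because the results are stored as an attribute, `exposed` is still the original data and can be passed on to later steps of the pipeline.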
An exposure is a format designed to hold uniform information about the validation of different data units. For reproducibility, it also saves information about the applied packs. Basically, an exposure is a list with two elements, each a `tibble` with its own structure: one holds the validation results and the other holds information about the applied packs.
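The two elements of an exposure can be inspected after exposing some data. The sketch below assumes the accessors `get_exposure()`, `get_packs_info()` and `get_report()` exported by `ruler`; the pack name `dims` and its single rule are illustrative only.

```r
library(dplyr)
library(ruler)

# Expose built-in data to one illustrative data rule pack
exposed <- mtcars %>%
  expose(
    data_packs(dims = . %>% summarise(ncol_high = ncol(.) < 5))
  )

# The exposure itself: a list with two elements
exposure <- get_exposure(exposed)
length(exposure)
names(exposure)

# Information about the applied packs
packs_info <- get_packs_info(exposed)
packs_info

# Validation results in a uniform format
report <- get_report(exposed)
report
```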