vignettes/manipulation.Rmd
This vignette describes the comperes functionality for manipulating (summarising and transforming) competition results (hereafter, results).
We will need the following packages:
library(comperes)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(rlang)
Example results in long format:
cr_long <- tibble(
game = c("a1", "a1", "a1", "a2", "a2", "b1", "b1", "b2"),
player = c(1, NA, NA, 1, 2, 2, 1, 2),
score = 1:8,
season = c(rep("A", 5), rep("B", 3))
) %>%
as_longcr()
The functions discussed in this vignette leverage dplyr's grammar of data manipulation. Basic knowledge of it is enough to use them. Familiarity with rlang's quotation mechanism is also helpful.
An item summary is understood as a set of summary measurements (of arbitrary nature) of an item (one or more columns) present in the data. To compute them, comperes offers the summarise_*() family of functions, in which summary functions are provided as in dplyr::summarise(). Basically, they are wrappers around a grouped summarise with forced ungrouping, conversion to a tibble, and optional adding of a prefix to the summary names. Note that if one of the item columns is a factor with implicit NAs (present in the vector but not in its levels), there will be a warning suggesting to add NA to the levels. This is due to the behaviour of group_by() in dplyr after version 0.8.0.
A couple of examples:
cr_long %>% summarise_player(mean_score = mean(score))
#> # A tibble: 3 x 2
#> player mean_score
#> <dbl> <dbl>
#> 1 1 4
#> 2 2 6.33
#> 3 NA 2.5
cr_long %>% summarise_game(min_score = min(score), max_score = max(score))
#> # A tibble: 4 x 3
#> game min_score max_score
#> <chr> <int> <int>
#> 1 a1 1 3
#> 2 a2 4 5
#> 3 b1 6 7
#> 4 b2 8 8
cr_long %>% summarise_item("season", sd_score = sd(score))
#> # A tibble: 2 x 2
#> season sd_score
#> <chr> <dbl>
#> 1 A 1.58
#> 2 B 1
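The "grouped summarise with forced ungrouping" described above can be sketched with plain dplyr. This is only an illustration of the idea, not the package's actual implementation; the object names `results` and `player_summary` are made up for this example.

```r
library(dplyr)

# Same example data as cr_long, but as a plain tibble (no longcr class)
results <- tibble(
  game = c("a1", "a1", "a1", "a2", "a2", "b1", "b1", "b2"),
  player = c(1, NA, NA, 1, 2, 2, 1, 2),
  score = 1:8,
  season = c(rep("A", 5), rep("B", 3))
)

# Group by the item, summarise, then make sure the result is ungrouped
player_summary <- results %>%
  group_by(player) %>%
  summarise(mean_score = mean(score)) %>%
  ungroup()

player_summary
```

The rows should correspond to the summarise_player() output shown earlier, with missing player identifiers forming their own group.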
For convenient transformation of results there is the join_*_summary() family of functions, which compute the respective summaries and join them to the original data:
cr_long %>%
join_item_summary("season", season_mean_score = mean(score)) %>%
mutate(score = score - season_mean_score)
#> # A longcr object:
#> # A tibble: 8 x 5
#> game player score season season_mean_score
#> <chr> <dbl> <dbl> <chr> <dbl>
#> 1 a1 1 -2 A 3
#> 2 a1 NA -1 A 3
#> 3 a1 NA 0 A 3
#> 4 a2 1 1 A 3
#> 5 a2 2 2 A 3
#> 6 b1 2 -1 B 7
#> 7 b1 1 0 B 7
#> 8 b2 2 1 B 7
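Conceptually, "compute a summary and join it back" is a two-step operation. A minimal sketch with plain dplyr, assuming a plain tibble `results` in place of a longcr object (the intermediate names are hypothetical):

```r
library(dplyr)

results <- tibble(
  game = c("a1", "a1", "a1", "a2", "a2", "b1", "b1", "b2"),
  player = c(1, NA, NA, 1, 2, 2, 1, 2),
  score = 1:8,
  season = c(rep("A", 5), rep("B", 3))
)

# Step 1: one row per season with its mean score
season_summary <- results %>%
  group_by(season) %>%
  summarise(season_mean_score = mean(score))

# Step 2: attach the summary to every row and centre the scores
centered <- results %>%
  left_join(season_summary, by = "season") %>%
  mutate(score = score - season_mean_score)

centered
```

Unlike the comperes function, this sketch does not preserve the longcr class.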
For common summary functions, comperes has a list summary_funs with 8 quoted expressions to be used with rlang's unquoting mechanism:
# Use .prefix to add prefix to summary columns
cr_long %>%
join_player_summary(!!!summary_funs[1:2], .prefix = "player_") %>%
join_item_summary("season", !!!summary_funs[1:2], .prefix = "season_")
#> # A longcr object:
#> # A tibble: 8 x 8
#> game player score season player_min_score player_max_score season_min_score
#> <chr> <dbl> <int> <chr> <int> <int> <int>
#> 1 a1 1 1 A 1 7 1
#> 2 a1 NA 2 A 2 3 1
#> 3 a1 NA 3 A 2 3 1
#> 4 a2 1 4 A 1 7 1
#> 5 a2 2 5 A 5 8 1
#> 6 b1 2 6 B 5 8 6
#> 7 b1 1 7 B 1 7 6
#> 8 b2 2 8 B 5 8 6
#> # … with 1 more variable: season_max_score <int>
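For readers new to rlang, the splicing operator `!!!` inserts each element of a list of quoted expressions as a separate named argument. The sketch below builds a small hypothetical list `my_funs` (it is not part of comperes) and splices it into a plain dplyr summarise:

```r
library(dplyr)
library(rlang)

results <- tibble(
  game = c("a1", "a1", "a1", "a2", "a2", "b1", "b1", "b2"),
  player = c(1, NA, NA, 1, 2, 2, 1, 2),
  score = 1:8
)

# A named list of quoted (unevaluated) expressions
my_funs <- exprs(
  min_score = min(score),
  max_score = max(score)
)

# `!!!my_funs` is equivalent to typing both named arguments by hand
player_range <- results %>%
  group_by(player) %>%
  summarise(!!!my_funs)

player_range
```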
A Head-to-Head value is a summary statistic of direct confrontation between two players. It is assumed that this value can be computed based only on the players' matchups, that is, data on the actual participation of an ordered pair of players in one game.
To compute matchups, comperes has get_matchups(), which returns a widecr object with all matchups actually present in the results (including matchups of players with themselves). Note that missing values in the player column are treated as separate players. This allows working with games in which the identifiers of multiple players are unknown. However, when computing Head-to-Head values they are treated as a single player. Example:
get_matchups(cr_long)
#> # A widecr object:
#> # A tibble: 18 x 5
#> game player1 score1 player2 score2
#> <chr> <dbl> <int> <dbl> <int>
#> 1 a1 1 1 1 1
#> 2 a1 1 1 NA 2
#> 3 a1 1 1 NA 3
#> 4 a1 NA 2 1 1
#> 5 a1 NA 2 NA 2
#> 6 a1 NA 2 NA 3
#> 7 a1 NA 3 1 1
#> 8 a1 NA 3 NA 2
#> 9 a1 NA 3 NA 3
#> 10 a2 1 4 1 4
#> 11 a2 1 4 2 5
#> 12 a2 2 5 1 4
#> 13 a2 2 5 2 5
#> 14 b1 2 6 2 6
#> 15 b1 2 6 1 7
#> 16 b1 1 7 2 6
#> 17 b1 1 7 1 7
#> 18 b2 2 8 2 8
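The idea of matchups (every ordered pair of participants within a game, including a player paired with itself) can be sketched as a self-join on the game column. This base R version is an illustration only; it returns a plain data frame rather than a widecr object, and its row order may differ from get_matchups().

```r
results <- data.frame(
  game = c("a1", "a1", "a1", "a2", "a2", "b1", "b1", "b2"),
  player = c(1, NA, NA, 1, 2, 2, 1, 2),
  score = 1:8
)

# Joining the results to themselves by game pairs every row of a game
# with every row of the same game; suffixes mark the two sides
matchups <- merge(results, results, by = "game", suffixes = c("1", "2"))

# 3^2 + 2^2 + 2^2 + 1^2 = 18 ordered pairs
nrow(matchups)
```

Because the join key is only the game, the two rows with missing player identifiers in game a1 stay distinct, which mirrors the "separate players" treatment described above.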
Head-to-Head values can be stored in two ways:
In long format: a tibble with columns player1 and player2, which identify an ordered pair of players, and columns corresponding to Head-to-Head values. Computation is done with h2h_long(), which returns an object of class h2h_long. Head-to-Head functions are specified as in dplyr's grammar, applied to the matchups of the results:
cr_long %>%
h2h_long(
abs_diff = mean(abs(score1 - score2)),
num_wins = sum(score1 > score2)
)
#> # A long format of Head-to-Head values:
#> # A tibble: 9 x 4
#> player1 player2 abs_diff num_wins
#> <dbl> <dbl> <dbl> <int>
#> 1 1 1 0 0
#> 2 1 2 1 1
#> 3 1 NA 1.5 0
#> 4 2 1 1 1
#> 5 2 2 0 0
#> 6 2 NA NA NA
#> 7 NA 1 1.5 2
#> 8 NA 2 NA NA
#> 9 NA NA 0.5 1
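In matrix format: a matrix whose rows and columns identify the ordered pair of players and whose entries are Head-to-Head values. Computation is done with h2h_mat(), which returns an object of class h2h_mat. Head-to-Head functions are specified as in h2h_long():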
cr_long %>% h2h_mat(sum_score = sum(score1 + score2))
#> # A matrix format of Head-to-Head values:
#> 1 2 <NA>
#> 1 24 22 7
#> 2 22 38 NA
#> <NA> 7 NA 20
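Since Head-to-Head functions operate on matchups, the long format can be approximated by grouping matchups by the ordered pair of players and summarising. The sketch below is an assumption-laden illustration in dplyr and base R, not the package's code; in particular, it only produces pairs that actually met, so it has no rows for pairs such as player 2 against the unknown player.

```r
library(dplyr)

results <- data.frame(
  game = c("a1", "a1", "a1", "a2", "a2", "b1", "b1", "b2"),
  player = c(1, NA, NA, 1, 2, 2, 1, 2),
  score = 1:8
)

# All ordered pairs of participants within each game
matchups <- merge(results, results, by = "game", suffixes = c("1", "2"))

# Grouping puts all missing identifiers into one NA group,
# i.e. they act as a single player here
h2h <- matchups %>%
  group_by(player1, player2) %>%
  summarise(
    abs_diff = mean(abs(score1 - score2)),
    num_wins = sum(score1 > score2),
    .groups = "drop"
  )

h2h
```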
comperes also offers a list h2h_funs of 9 common Head-to-Head functions as quoted expressions to be used with rlang's unquoting mechanism:
cr_long %>% h2h_long(!!!h2h_funs)
#> # A long format of Head-to-Head values:
#> # A tibble: 9 x 11
#> player1 player2 mean_score_diff mean_score_diff… mean_score sum_score_diff
#> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
#> 1 1 1 0 0 4 0
#> 2 1 2 0 0 5.5 0
#> 3 1 NA -1.5 0 1 -3
#> 4 2 1 0 0 5.5 0
#> 5 2 2 0 0 6.33 0
#> 6 2 NA NA NA NA NA
#> 7 NA 1 1.5 1.5 2.5 3
#> 8 NA 2 NA NA NA NA
#> 9 NA NA 0 0 2.5 0
#> # … with 5 more variables: sum_score_diff_pos <dbl>, sum_score <int>,
#> # num_wins <dbl>, num_wins2 <dbl>, num <int>
To compute Head-to-Head values for only a subset of players, or to include values for players that are not in the results, use a factor player column. Notes:
Use the fill argument to replace NAs in certain columns after computing Head-to-Head values.
As with summarise_item(), there will be a warning in case of implicit NAs in factor columns.
cr_long_fac <- cr_long %>%
mutate(player = factor(player, levels = c(1, 2, 3)))
cr_long_fac %>%
h2h_long(abs_diff = mean(abs(score1 - score2)),
fill = list(abs_diff = -100))
#> # A long format of Head-to-Head values:
#> # A tibble: 9 x 3
#> player1 player2 abs_diff
#> <fct> <fct> <dbl>
#> 1 1 1 0
#> 2 1 2 1
#> 3 1 3 -100
#> 4 2 1 1
#> 5 2 2 0
#> 6 2 3 -100
#> 7 3 1 -100
#> 8 3 2 -100
#> 9 3 3 -100
cr_long_fac %>%
h2h_mat(mean(abs(score1 - score2)),
fill = -100)
#> # A matrix format of Head-to-Head values:
#> 1 2 3
#> 1 0 1 -100
#> 2 1 0 -100
#> 3 -100 -100 -100
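The "implicit NA" wording in the notes above refers to a factor that contains missing values which are not listed among its levels. A small base R sketch of the distinction, using the same player values as in cr_long_fac (the variable names are illustrative):

```r
# Factor with levels 1, 2, 3; the NAs are present in the data
# but NA is not one of the levels (an "implicit" NA)
player <- factor(c(1, NA, NA, 1, 2, 2, 1, 2), levels = c(1, 2, 3))

levels(player)
anyNA(player)

# addNA() makes NA an explicit level
player_explicit <- addNA(player)
levels(player_explicit)
```

Level 3 never occurs in the data, which is why every pair involving player 3 in the outputs above holds the fill value.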
To convert between long and matrix formats of Head-to-Head values, comperes has to_h2h_long() and to_h2h_mat(), which convert from matrix to long and from long to matrix, respectively. Note that the output of to_h2h_long() has player1 and player2 columns as characters. Examples:
cr_long %>% h2h_mat(mean(score1)) %>% to_h2h_long()
#> # A long format of Head-to-Head values:
#> # A tibble: 9 x 3
#> player1 player2 h2h_value
#> <chr> <chr> <dbl>
#> 1 1 1 4
#> 2 1 2 5.5
#> 3 1 <NA> 1
#> 4 2 1 5.5
#> 5 2 2 6.33
#> 6 2 <NA> NA
#> 7 <NA> 1 2.5
#> 8 <NA> 2 NA
#> 9 <NA> <NA> 2.5
cr_long %>%
h2h_long(mean_score1 = mean(score1), mean_score2 = mean(score2)) %>%
to_h2h_mat()
#> Using mean_score1 as value.
#> # A matrix format of Head-to-Head values:
#> 1 2 <NA>
#> 1 4.0 5.500000 1.0
#> 2 5.5 6.333333 NA
#> <NA> 2.5 NA 2.5
All of this functionality is powered by the functions long_to_mat() and mat_to_long(), which are also useful outside of comperes. They convert general pair-value data between long and matrix formats:
pair_value_long <- tibble(
key_1 = c(1, 1, 2),
key_2 = c(2, 3, 3),
val = 1:3
)
pair_value_mat <- pair_value_long %>%
long_to_mat(row_key = "key_1", col_key = "key_2", value = "val")
pair_value_mat
#> 2 3
#> 1 1 2
#> 2 NA 3
pair_value_mat %>%
mat_to_long(
row_key = "key_1", col_key = "key_2", value = "val",
drop = TRUE
)
#> # A tibble: 3 x 3
#> key_1 key_2 val
#> <chr> <chr> <int>
#> 1 1 2 1
#> 2 1 3 2
#> 3 2 3 3
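To make the conversion itself concrete, here is a base R sketch of the same round trip using matrix indexing by dimension names. It is not how long_to_mat() and mat_to_long() are implemented (that is not shown in this vignette); it only illustrates why keys come back as characters: they pass through the matrix row and column names.

```r
pair_value_long <- data.frame(
  key_1 = c(1, 1, 2),
  key_2 = c(2, 3, 3),
  val = 1:3
)

# Long to matrix: keys become row and column names
row_names <- sort(unique(as.character(pair_value_long$key_1)))
col_names <- sort(unique(as.character(pair_value_long$key_2)))
mat <- matrix(
  NA_integer_,
  nrow = length(row_names), ncol = length(col_names),
  dimnames = list(row_names, col_names)
)
idx <- cbind(
  as.character(pair_value_long$key_1),
  as.character(pair_value_long$key_2)
)
mat[idx] <- pair_value_long$val

# Matrix to long: one row per cell, then drop cells without a value
back_to_long <- data.frame(
  key_1 = rownames(mat)[row(mat)],
  key_2 = colnames(mat)[col(mat)],
  val = as.vector(mat),
  stringsAsFactors = FALSE
)
back_to_long <- back_to_long[!is.na(back_to_long$val), ]

mat
back_to_long
```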
For some ranking algorithms it is crucial that games are played between exactly two players. comperes has the function to_pairgames() for this. It removes games with one player. Games with three or more players are split by to_pairgames() into separate games between unordered pairs of different players. Note that game identifiers are changed to integers, but the order of the initial games is preserved. Example:
to_pairgames(cr_long)
#> # A widecr object:
#> # A tibble: 5 x 5
#> game player1 score1 player2 score2
#> <int> <dbl> <int> <dbl> <int>
#> 1 1 1 1 NA 2
#> 2 2 1 1 NA 3
#> 3 3 NA 2 NA 3
#> 4 4 1 4 2 5
#> 5 5 2 6 1 7
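The splitting of a multi-player game into unordered pairs can be sketched with `combn()` from base R. The example below handles only the three-player game a1 and uses made-up object names; it is an illustration of the idea rather than the package's implementation, and games with a single player would have to be filtered out beforehand because `combn()` cannot form pairs from one element.

```r
# Participants of game "a1": one known player and two unknown ones
game_a1 <- data.frame(player = c(1, NA, NA), score = 1:3)

# Each column holds the row indices of one unordered pair
pair_idx <- combn(nrow(game_a1), 2)

# One new two-player game per pair, with integer game identifiers
pairgames <- data.frame(
  game = seq_len(ncol(pair_idx)),
  player1 = game_a1$player[pair_idx[1, ]],
  score1 = game_a1$score[pair_idx[1, ]],
  player2 = game_a1$player[pair_idx[2, ]],
  score2 = game_a1$score[pair_idx[2, ]]
)

pairgames
```

The three rows should correspond to the first three games of the to_pairgames() output above.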