Why use sumvar?

Simple one-line commands to explore variables for R users
Pipe-friendly tidyverse integration
Tabular summaries that can be stored as tibbles and used for downstream analysis

When I first moved from Stata to R about 5 years ago, the main thing I missed was the simplicity of the “sum” and “tab” commands for exploring data efficiently. Most template code in introductory R books and tutorials (e.g. https://r4ds.hadley.nz/data-tidy.html) typically takes 3-5 lines to replicate these commands in R. I couldn’t find a package that explored data quite as simply and efficiently.

sumvar is fast and easy to use, and brings these variable summary functions to R.
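For comparison, here is a minimal sketch of the kind of hand-written summary that the Stata “sum” command replaces. It uses only dplyr; the `toy` data and the `manual_summary` name are made up for illustration, and the column names simply mirror those in the sumvar output shown later.

```r
library(dplyr)

set.seed(1)
# Hypothetical example data: 48 simulated ages plus two missing values
toy <- tibble(age = c(rnorm(48, mean = 50, sd = 10), NA, NA))

# One summarise() call with a line per statistic
manual_summary <- toy %>%
  summarise(
    n      = n(),
    n_miss = sum(is.na(age)),
    median = median(age, na.rm = TRUE),
    p25    = unname(quantile(age, 0.25, na.rm = TRUE)),
    p75    = unname(quantile(age, 0.75, na.rm = TRUE)),
    mean   = mean(age, na.rm = TRUE),
    sd     = sd(age, na.rm = TRUE)
  )
manual_summary
```

Every statistic needs its own line and its own `na.rm = TRUE`, which is the repetition a one-line command avoids.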

Continuous Data

We call dist_sum() to explore a continuous variable.

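The tibble output shows the number of rows in the data, the number of missing values, the median, the interquartile range (25th and 75th centiles), the mean, the standard deviation, the 95% confidence interval for the mean using the Wald method (normal approximation), and the minimum and maximum values. When a grouping variable is supplied, the output has one row per group and two further columns, p_ttest and p_wilcox.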
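The Wald interval mentioned above is the mean plus or minus a normal quantile times the standard error. A minimal base R sketch follows; `wald_ci()` is a hypothetical helper written for this illustration, not a sumvar function, and sumvar's internal calculation may differ in detail.

```r
# Sketch: Wald (normal-approximation) confidence interval for a mean.
# wald_ci() is a hypothetical helper, not part of sumvar.
wald_ci <- function(x, level = 0.95) {
  x  <- x[!is.na(x)]                 # drop missing values
  se <- sd(x) / sqrt(length(x))      # standard error of the mean
  z  <- qnorm(1 - (1 - level) / 2)   # about 1.96 for a 95% interval
  c(ci_lower = mean(x) - z * se, ci_upper = mean(x) + z * se)
}

set.seed(123)
x  <- rnorm(100, mean = 50, sd = 20)
ci <- wald_ci(x)
ci
```

Because it relies on the normal approximation, the interval is symmetric around the mean and is most trustworthy for larger samples.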

dist_sum() shows a density plot and histogram for a single variable, or a grouped density plot when there is a grouping variable.

You can save the output from dist_sum() as a tibble and use the estimates for downstream analysis, e.g. sum_df <- df %>% dist_sum(age, sex)

# Load packages (dplyr provides the %>% pipe used below)
library(sumvar)
library(dplyr)

# Example data
set.seed(123)
df <- tibble::tibble(
  age = rnorm(100, mean = 50, sd = 20),
  sex = sample(c("male", "female"), 100, replace = TRUE)) %>%
  dplyr::mutate(age = dplyr::if_else(sex == "male", age + 10, age))

# Call dist_sum
df %>% dist_sum(age)
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#> # A tibble: 1 × 11
#>       n n_miss median   p25   p75  mean    sd ci_lower ci_upper   min   max
#>   <int>  <int>  <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl>    <dbl> <dbl> <dbl>
#> 1   100      0   55.6  44.0  68.1  56.9  18.2     53.3     60.5  13.8  101.

# Call dist_sum with a grouping variable
df %>% dist_sum(age, sex)

#> # A tibble: 2 × 14
#>   sex        n n_miss median   p25   p75  mean    sd   min   max ci_lower
#>   <chr>  <int>  <int>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl>
#> 1 female    49      0   52.5  41.1  65.6  54.7  17.6  16.3  93.7     49.8
#> 2 male      51      0   57.2  46.8  71.3  59.0  18.6  13.8 101.      53.9
#> # ℹ 3 more variables: ci_upper <dbl>, p_ttest <dbl>, p_wilcox <dbl>
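The grouped output ends with p_ttest and p_wilcox columns. As a rough sketch of how such between-group p-values can be obtained in base R, the block below rebuilds the example data without tibble and runs `t.test()` and `wilcox.test()` with their defaults. Whether sumvar uses these exact defaults (for instance the Welch rather than the equal-variance t-test) is an assumption, so the values need not match the package output exactly.

```r
# Rebuild the example data in base R
set.seed(123)
age <- rnorm(100, mean = 50, sd = 20)
sex <- sample(c("male", "female"), 100, replace = TRUE)
age <- ifelse(sex == "male", age + 10, age)
dat <- data.frame(age = age, sex = sex)

# Two-group comparisons using base R defaults
p_t <- t.test(age ~ sex, data = dat)$p.value       # Welch two-sample t-test
p_w <- wilcox.test(age ~ sex, data = dat)$p.value  # Wilcoxon rank-sum test
c(p_ttest = p_t, p_wilcox = p_w)
```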

Dates

To explore the distribution of dates, call dist_date(), which works like dist_sum(). It also accepts a second, grouping variable. With a single date variable, a histogram is shown; when a grouping variable is supplied, a density plot is shown.

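# Example data: 100 dates scattered around 2022-01-01
df3 <- tibble::tibble(
  dates = as.Date("2022-01-01") + rnorm(n=100, sd=50, mean=0),
  group = sample(c("A", "B"), 100, TRUE)) %>%
  # dt shifts group A by 10 days; the calls below summarise the unshifted dates column
  dplyr::mutate(dt = dplyr::case_when(group == "A" ~ dates + 10, TRUE ~ dates))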

df3 %>% dist_date(dates)

#> # A tibble: 1 × 7
#>       n n_miss min        p25        median     p75        max       
#>   <int>  <int> <date>     <date>     <date>     <date>     <date>    
#> 1   100      0 2021-10-25 2021-11-26 2021-12-22 2022-01-28 2022-06-12

# Call dist_date with a grouping variable
df3 %>% dist_date(dates, group)

#> # A tibble: 2 × 8
#>   group     n n_miss min        p25        median     p75        max       
#>   <chr> <int>  <int> <date>     <date>     <date>     <date>     <date>    
#> 1 A        43      0 2021-10-25 2021-11-25 2021-12-17 2022-01-16 2022-06-12
#> 2 B        57      0 2021-10-27 2021-12-01 2022-01-03 2022-02-07 2022-04-20
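A similar five-number date summary can be sketched in base R by taking quantiles of the underlying day counts and converting back to dates. The data here are simulated separately from the example above, and the quantile definition sumvar uses is not stated in this article, so treat this as an illustration of the idea only.

```r
set.seed(42)
# Hypothetical data: 100 whole-day dates around 2022-01-01
dates <- as.Date("2022-01-01") + round(rnorm(100, mean = 0, sd = 50))

probs <- c(min = 0, p25 = 0.25, median = 0.5, p75 = 0.75, max = 1)

# Quantiles are computed on the numeric day counts, then converted back to Date
date_summary <- as.Date(
  quantile(as.numeric(dates), probs = probs, names = FALSE),
  origin = "1970-01-01"
)
names(date_summary) <- names(probs)
date_summary
```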

Categorical Data

tab1() produces a tibble showing the distribution of a categorical variable and illustrates it with a horizontal bar chart.

df2 <- tibble::tibble(
  group = sample(LETTERS[1:3], 200, TRUE)
)

df2 %>% tab1(group)

#> # A tibble: 4 × 3
#>   Category Frequency Percent
#>   <chr>        <int> <chr>  
#> 1 C               71 35.5   
#> 2 A               66 33.0   
#> 3 B               63 31.5   
#> 4 Total          200 100.0
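For readers curious what goes into a table like this, here is a base R sketch that builds the same layout: categories sorted by frequency, a Total row, and percentages formatted to one decimal place. The `freq_tab` name is made up for this example, and this is not how tab1() is implemented.

```r
set.seed(1)
group <- sample(LETTERS[1:3], 200, replace = TRUE)

# Count each category, most frequent first
counts <- sort(table(group), decreasing = TRUE)

freq_tab <- data.frame(
  Category  = c(names(counts), "Total"),
  Frequency = c(as.integer(counts), length(group)),
  Percent   = sprintf("%.1f", c(100 * as.integer(counts) / length(group), 100))
)
freq_tab
```

Note that `table()` drops missing values by default, so a variable containing NA would need `useNA = "ifany"` to count them.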

Check for duplicate and missing data

To explore the proportion of duplicate values and missing values in a variable, pass it to dup().

example_data <- dplyr::tibble(id = 1:200, age = round(rnorm(200, mean = 30, sd = 50), digits=0))
example_data$age[sample(1:200, size = 15)] <- NA  # Replace 15 values with missing.

example_data %>% dup(age)

#> # A tibble: 1 × 7
#>   Variable     n n_unique n_duplicate percent_duplicate n_missing
#>   <chr>    <int>    <int>       <int>             <dbl>     <int>
#> 1 age        200      119          66              35.7        15
#> # ℹ 1 more variable: percent_missing <dbl>
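In the output above, n_unique and n_duplicate add up to the number of non-missing values (119 + 66 = 185), and percent_duplicate equals 66 / 185, about 35.7%. The base R sketch below reproduces that arithmetic. `dup_stats()` is a hypothetical helper, the definitions are inferred from the printed output rather than from the sumvar source, and using all rows as the denominator for percent_missing is an assumption.

```r
# dup_stats() is a hypothetical helper, not part of sumvar.
dup_stats <- function(x) {
  present     <- x[!is.na(x)]            # non-missing values only
  n_unique    <- length(unique(present)) # distinct values
  n_duplicate <- sum(duplicated(present))# repeats after the first occurrence
  data.frame(
    n                 = length(x),
    n_unique          = n_unique,
    n_duplicate       = n_duplicate,
    percent_duplicate = round(100 * n_duplicate / length(present), 1),
    n_missing         = sum(is.na(x)),
    percent_missing   = round(100 * mean(is.na(x)), 1)
  )
}

res <- dup_stats(c(1, 2, 2, 3, 3, 3, NA, NA))
res
```

In this small vector, three of the six non-missing values are repeats, so percent_duplicate is 50, and two of the eight values are missing, so percent_missing is 25.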

If you pass the whole data frame to dup(), it summarises duplicates and missingness for every variable. dup() illustrates the result with a stacked bar chart.

example_data <- dplyr::tibble(age = round(rnorm(200, mean = 30, sd = 50), digits=0),
                              sex = sample(c("Male", "Female"), 200, TRUE),
                              favourite_colour = sample(c("Red", "Blue", "Purple"), 200, TRUE))
example_data$age[sample(1:200, size = 15)] <- NA  # Replace 15 values with missing.
example_data$sex[sample(1:200, size = 32)] <- NA  # Replace 32 values with missing.

dup(example_data)

#> # A tibble: 3 × 7
#>   Variable             n n_unique n_duplicate percent_duplicate n_missing
#>   <chr>            <int>    <int>       <int>             <dbl>     <int>
#> 1 age                200      117          68              36.8        15
#> 2 sex                200        2         166              98.8        32
#> 3 favourite_colour   200        3         197              98.5         0
#> # ℹ 1 more variable: percent_missing <dbl>
