19 Wrangle

Author

Jarad Niemi

In this chapter, we will discuss the basics of wrangling an individual data set.

library("tidyverse")

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.2     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

19.1 Pipe

Pipe operators allow for a data pipeline to be constructed within R code that is relatively easy to understand due to steps being conducted sequentially from the top to the bottom.

For example, can you guess what the following code does

# Example data pipeline
ToothGrowth |>
  group_by(supp, dose) |>
  summarize(
    n    = n(),
    mean = mean(len),
    sd   = sd(len),
    
    .groups = "drop"
  ) |>
  mutate(
    mean = round(mean, 2),
    sd   = round(sd, 2)
  ) |>
  arrange(mean)

This code is made much easier to read (than equivalent un-piped R code) due to 1) functions being written so that the first argument is always a data.frame and 2) the pipe operator being used to send the results of the previous operation to the next operation.

The pipe operator is a relatively new feature of R. The base R version |> was introduced in May 2021 (R version 4.1.0), the magittr verions %>% was introduced in Dec 2013, and the very first version was introduced in a stackoverflow post in Jan 2012.

The idea behind the pipe operator is pretty simple, it simply passes the contents on its left hand side as the first argument to the function on its right hand side.

# Calculate mean
c(1, 2, 3, 4) |> mean()

[1] 2.5

As the pipe only replaces the first argument, we can also use additional arguments of the following function.

# Calculate mean, ignore missing values
c(1, 2, NA, 4) |> mean(na.rm = TRUE)

[1] 2.333333

Piping is especially useful when combining a series of operations where the pipe is used after each operation.

# Calculate mean, ignore missing values
c(1, 2, NA, 4) |> 
  na.omit() |>
  mean()

[1] 2.333333

We will the pipe operator used extensively in data pipelines.

19.2 Rename

We can use the rename function to rename an existing variable in the data set. If we want to use names that are not valid R object names, we need to enclose the name using backticks.

# Initial names
ToothGrowth |> 
  names()

[1] "len"  "supp" "dose"

# Renamed
ToothGrowth |>
  rename(
    `Dose (mg/day)` = dose,
    Supplement      = supp,
    `Length (mm)`   = len
  ) |>
  names()

[1] "Length (mm)"   "Supplement"    "Dose (mg/day)"

# We didn't actually change the object so the names in the ToothGrowth data set has not changed.
ToothGrowth |>
  names()

[1] "len"  "supp" "dose"

# Save the renamed data set
d <- ToothGrowth |>
  rename(
    `Dose (mg/day)` = dose,
    Supplement      = supp,
    `Length (mm)`   = len
  ) 

# Saved data set has the new names
d |> 
  names()

[1] "Length (mm)"   "Supplement"    "Dose (mg/day)"

19.3 Mutate

The mutate() function allows you to create new variables in a data.frame. Typically this involves changing units, performing calculations with other variables, and refactoring factors.

19.3.1 Numeric

# Mutate numeric
ToothGrowth |>
  mutate(
     `Length (cm)` = len / 10 # `len` is in the data.frame
  ) |>
  summary()                   # `len` is in mm while `Length (cm)` is in cm

      len        supp         dose        Length (cm)   
 Min.   : 4.20   OJ:30   Min.   :0.500   Min.   :0.420  
 1st Qu.:13.07   VC:30   1st Qu.:0.500   1st Qu.:1.308  
 Median :19.25           Median :1.000   Median :1.925  
 Mean   :18.81           Mean   :1.167   Mean   :1.881  
 3rd Qu.:25.27           3rd Qu.:2.000   3rd Qu.:2.527  
 Max.   :33.90           Max.   :2.000   Max.   :3.390

# Overwrite existing variable
ToothGrowth |>
  mutate(
     len = len / 10 # `len` is in the data.frame
  ) |>
  summary()                   # `len` is in cm

      len        supp         dose      
 Min.   :0.420   OJ:30   Min.   :0.500  
 1st Qu.:1.308   VC:30   1st Qu.:0.500  
 Median :1.925           Median :1.000  
 Mean   :1.881           Mean   :1.167  
 3rd Qu.:2.527           3rd Qu.:2.000  
 Max.   :3.390           Max.   :2.000

# Use object outside data.frame
mm_per_cm <- 10
ToothGrowth |>
  mutate(
     len = len / mm_per_cm 
  ) |>
  summary()

      len        supp         dose      
 Min.   :0.420   OJ:30   Min.   :0.500  
 1st Qu.:1.308   VC:30   1st Qu.:0.500  
 Median :1.925           Median :1.000  
 Mean   :1.881           Mean   :1.167  
 3rd Qu.:2.527           3rd Qu.:2.000  
 Max.   :3.390           Max.   :2.000

# Use updated variable
ToothGrowth |>
  mutate(
    # Scale len between 0 and 1
    len = len - min(len),
    len = len / max(len)  # reused `len` here
  ) |>
  summary() # len now ranges between 0 and 1

      len         supp         dose      
 Min.   :0.0000   OJ:30   Min.   :0.500  
 1st Qu.:0.2988   VC:30   1st Qu.:0.500  
 Median :0.5067           Median :1.000  
 Mean   :0.4920           Mean   :1.167  
 3rd Qu.:0.7096           3rd Qu.:2.000  
 Max.   :1.0000           Max.   :2.000

19.3.2 Categorical

We commonly need to convert between factor and character representations of categorical variables.

# Convert between character and factor
d <- ToothGrowth |>
  select(supp) |>                 # only keep `supp` column
  mutate(
    supp_ch = as.character(supp), # convert to character
    supp_fa = as.factor(supp_ch)  # convert to factor
  ) 

d |> summary() # summary() only informative for factor

 supp      supp_ch          supp_fa
 OJ:30   Length:60          OJ:30  
 VC:30   Class :character   VC:30  
         Mode  :character

# table() is always informative
d$supp_ch |> table()


OJ VC 
30 30

d$supp_fa |> table()


OJ VC 
30 30

At this point, the main distinction between character variables and factor variables is that you can change the order of factor variables while character variables will always be in alphabetical order.

# Change order of factor variable
# Cannot change order of character variables
d |>
  mutate(
    supp_fa = factor(supp_fa, 
                     levels = c(
                       "VC",     # put VC first
                       "OJ"))    # then OJ
  ) |>
  summary()

 supp      supp_ch          supp_fa
 OJ:30   Length:60          VC:30  
 VC:30   Class :character   OJ:30  
         Mode  :character

# Recode character or factor levels
d |> 
  mutate(
    supp_ch = fct_recode(supp_ch, 
      `Ascorbic Acid` = "VC",
      `Orange Juice`  = "OJ"
    ),
    supp_fa = fct_recode(supp_fa,
      `Ascorbic Acid` = "VC",
      `Orange Juice`  = "OJ"
    )
  ) |>
  summary()

 supp             supp_ch            supp_fa  
 OJ:30   Orange Juice :30   Orange Juice :30  
 VC:30   Ascorbic Acid:30   Ascorbic Acid:30

19.3.3 Both

Here we will show you how to utilize the mutate() function to perform a large number of calculations.

# Diamonds
d <- diamonds |>
  # Rather than precalculating depth 
  # we will calculate depth in a script
  select(-depth) |>
  mutate(
    # Calculate depth
    depth = 2 * z / (x+y), # see ?diamonds for formula
    depth = 100 * depth,   # make depth a percent
    
    # Calculate $/weight
    price_per_carat = price / carat,
    
    # Reorder cut
    cut = factor(cut, 
                 levels = c(
                   "Ideal",
                   "Premium",
                   "Very Good",
                   "Good",
                   "Fair"
                 ))
  )

# View calculated variables
d |>
  select(price_per_carat, depth, cut) |>
  summary()

 price_per_carat     depth               cut       
 Min.   : 1051   Min.   :  0.00   Ideal    :21551  
 1st Qu.: 2478   1st Qu.: 61.04   Premium  :13791  
 Median : 3495   Median : 61.84   Very Good:12082  
 Mean   : 4008   Mean   : 61.74   Good     : 4906  
 3rd Qu.: 4950   3rd Qu.: 62.53   Fair     : 1610  
 Max.   :17829   Max.   :619.28                    
                 NA's   :7

# View observations with NA depth
d |>
  filter(is.na(depth)) |> # see filter() below
  select(depth, x, y, z)  # NaN stands for `Not a number`

# A tibble: 7 × 4
  depth     x     y     z
  <dbl> <dbl> <dbl> <dbl>
1   NaN     0     0     0
2   NaN     0     0     0
3   NaN     0     0     0
4   NaN     0     0     0
5   NaN     0     0     0
6   NaN     0     0     0
7   NaN     0     0     0

19.4 Arrange

You can arrange a data set by a collection of variables. For factor variables, the order is according to the factor level which is alphabetically by default.

# No rearrangment
ToothGrowth |>
  head()

   len supp dose
1  4.2   VC  0.5
2 11.5   VC  0.5
3  7.3   VC  0.5
4  5.8   VC  0.5
5  6.4   VC  0.5
6 10.0   VC  0.5

# Arrange by supp
levels(ToothGrowth$supp) # OJ is first

[1] "OJ" "VC"

ToothGrowth |>
  arrange(supp) |>
  head()

   len supp dose
1 15.2   OJ  0.5
2 21.5   OJ  0.5
3 17.6   OJ  0.5
4  9.7   OJ  0.5
5 14.5   OJ  0.5
6 10.0   OJ  0.5

By default, the arrangement is ascending, but you can use desc() to arrange descending.

# Descending dose
ToothGrowth |> 
  arrange(desc(dose)) |>
  head()

   len supp dose
1 23.6   VC    2
2 18.5   VC    2
3 33.9   VC    2
4 25.5   VC    2
5 26.4   VC    2
6 32.5   VC    2

# Dose, then descending supp
ToothGrowth |>
  arrange(dose, desc(supp)) |>
  tail()

    len supp dose
55 24.8   OJ    2
56 30.9   OJ    2
57 26.4   OJ    2
58 27.3   OJ    2
59 29.4   OJ    2
60 23.0   OJ    2

19.5 Select

The select() function allows you to keep or remove certain variables.

# Keep variables by name
ToothGrowth |>
  select(len, supp) |>
  names()

[1] "len"  "supp"

# Remove variables by name
ToothGrowth |>
  select(-len, -supp) |> 
  names()

[1] "dose"

# Keep variables in a range
diamonds |>
  select(price:cut) |>
  names()

[1] "price"   "table"   "depth"   "clarity" "color"   "cut"

There are a variety of helper functions to use with the select() function. Take a look at ?select for more details.

# Select helper functions
diamonds |>
  select(starts_with("c")) |>
  names()

[1] "carat"   "cut"     "color"   "clarity"

diamonds |>
  select(x:last_col()) |>
  names()

[1] "x" "y" "z"

diamonds |>
  select(contains("y")) |>
  names()

[1] "clarity" "y"

You can also use select() to reorder columns.

# Reorder rows
ToothGrowth |> 
  select(len, dose, supp) |>
  head()

   len dose supp
1  4.2  0.5   VC
2 11.5  0.5   VC
3  7.3  0.5   VC
4  5.8  0.5   VC
5  6.4  0.5   VC
6 10.0  0.5   VC

names(diamonds)

 [1] "carat"   "cut"     "color"   "clarity" "depth"   "table"   "price"  
 [8] "x"       "y"       "z"

diamonds |>
  select(carat, price,    # carat first then price
         everything()) |> # all other columns (in order)
  names()

 [1] "carat"   "price"   "cut"     "color"   "clarity" "depth"   "table"  
 [8] "x"       "y"       "z"

19.5.1 Pull

Data pipelines work the best when functions return a data.frame as the other functions in this chapter do. If you want to investigate a single variable, you can use the pull() function. This is equivalent to the $ access of a column, but can be included in a dplyr pipeline.

# Pull a variable
ToothGrowth |>
  pull(len) |>
  summary()

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   4.20   13.07   19.25   18.81   25.27   33.90

diamonds |>
  pull(cut) |>
  summary()

     Fair      Good Very Good   Premium     Ideal 
     1610      4906     12082     13791     21551

19.6 Filter

The filter() function allows you to keep observations (rows) by some criteria.

# Filter by numeric variable
ToothGrowth |> filter(len <  6)

  len supp dose
1 4.2   VC  0.5
2 5.8   VC  0.5
3 5.2   VC  0.5

ToothGrowth |> filter(len >= 30)

   len supp dose
1 33.9   VC    2
2 32.5   VC    2
3 30.9   OJ    2

ToothGrowth |> 
  filter(dose == 2) |>
  summary()

      len        supp         dose  
 Min.   :18.50   OJ:10   Min.   :2  
 1st Qu.:23.52   VC:10   1st Qu.:2  
 Median :25.95           Median :2  
 Mean   :26.10           Mean   :2  
 3rd Qu.:27.82           3rd Qu.:2  
 Max.   :33.90           Max.   :2

ToothGrowth |> 
  filter(dose != 2) |> 
  summary()

      len         supp         dose     
 Min.   : 4.200   OJ:20   Min.   :0.50  
 1st Qu.: 9.925   VC:20   1st Qu.:0.50  
 Median :15.200           Median :0.75  
 Mean   :15.170           Mean   :0.75  
 3rd Qu.:19.775           3rd Qu.:1.00  
 Max.   :27.300           Max.   :1.00

You can also filter by character (or factor) variables.

# Filter character
ToothGrowth |>
  filter(supp == "OJ") |>
  summary()

      len        supp         dose      
 Min.   : 8.20   OJ:30   Min.   :0.500  
 1st Qu.:15.53   VC: 0   1st Qu.:0.500  
 Median :22.70           Median :1.000  
 Mean   :20.66           Mean   :1.167  
 3rd Qu.:25.73           3rd Qu.:2.000  
 Max.   :30.90           Max.   :2.000

ToothGrowth |>
  filter(supp != "VC") |>
  summary()

      len        supp         dose      
 Min.   : 8.20   OJ:30   Min.   :0.500  
 1st Qu.:15.53   VC: 0   1st Qu.:0.500  
 Median :22.70           Median :1.000  
 Mean   :20.66           Mean   :1.167  
 3rd Qu.:25.73           3rd Qu.:2.000  
 Max.   :30.90           Max.   :2.000

diamonds |>
  filter(cut %in% c("Premium", "Ideal")) |>
  pull(cut) |>
  summary()

     Fair      Good Very Good   Premium     Ideal 
        0         0         0     13791     21551

You can also filter using multiple variables.

# Filter on multiple variables
ToothGrowth |>
  filter(supp == "OJ", dose == 0.5) |>
  summary()

      len        supp         dose    
 Min.   : 8.20   OJ:10   Min.   :0.5  
 1st Qu.: 9.70   VC: 0   1st Qu.:0.5  
 Median :12.25           Median :0.5  
 Mean   :13.23           Mean   :0.5  
 3rd Qu.:16.18           3rd Qu.:0.5  
 Max.   :21.50           Max.   :0.5

diamonds |>
  filter(
    cut %in% c("Premium", "Ideal"),
    carat <= .75,
    color == "D",
    !(clarity %in% c("VS1", "VS2")) # not VS1 or VS2
  ) |>
  select(cut, carat, color, clarity) |>
  summary()

        cut           carat        color       clarity   
 Fair     :   0   Min.   :0.2300   D:1798   SI1    :920  
 Good     :   0   1st Qu.:0.3300   E:   0   SI2    :382  
 Very Good:   0   Median :0.4100   F:   0   VVS2   :298  
 Premium  : 634   Mean   :0.4516   G:   0   VVS1   :166  
 Ideal    :1164   3rd Qu.:0.5400   H:   0   IF     : 26  
                  Max.   :0.7500   I:   0   I1     :  6  
                                   J:   0   (Other):  0

19.7 Slice

The slice functions allow you to subset the data in a variety of ways.

# Top of data.frame
ToothGrowth |> 
  slice_head()

  len supp dose
1 4.2   VC  0.5

# Bottom of data.frame
ToothGrowth |> 
  slice_tail()

  len supp dose
1  23   OJ    2

# Random rows
ToothGrowth |>
  slice_sample(n = 5) # number of rows

   len supp dose
1 23.3   VC    2
2 22.5   VC    1
3 23.3   OJ    1
4 17.3   VC    1
5 25.8   OJ    1

ToothGrowth |>
  slice_sample(prop = 2/60) # proportion of rows

   len supp dose
1 23.3   OJ  1.0
2  7.3   VC  0.5

# Filter 
ToothGrowth |>
  slice_min(
    len, # variable to order data by
    prop = 2 / 60
  )

  len supp dose
1 4.2   VC  0.5
2 5.2   VC  0.5

19.8 Summarize

Previously we have seen the summary() function which can be used to provide default summaries for numeric and factor variables. The summarize() function can be used to calculate user-determined values. One commonly used function is n() which counts the number of observations in the data.frame.

ToothGrowth |>
  summarize(
    n        = n(),       # number of observations
    mean_len = mean(len),
    sdlen    = sd(len)
  )

   n mean_len    sdlen
1 60 18.81333 7.649315

19.8.1 Group

Especially when summarizing the data.frame, we can use the group_by() function to allow the summarization to happen within each combination of the group by variables.

# Summarize by group
ToothGrowth |>
  group_by(supp, dose) |>
  summarize(
    n        = n(),
    mean_len = mean(len),
    sdlen    = sd(len),
    
    .groups = "drop"      # removes grouping
  )

# A tibble: 6 × 5
  supp   dose     n mean_len sdlen
  <fct> <dbl> <int>    <dbl> <dbl>
1 OJ      0.5    10    13.2   4.46
2 OJ      1      10    22.7   3.91
3 OJ      2      10    26.1   2.66
4 VC      0.5    10     7.98  2.75
5 VC      1      10    16.8   2.52
6 VC      2      10    26.1   4.80

The summarize() function requires that each argument returns a single value. Most of the time this is what you want, but sometimes you want more flexibility. If you try to use summarize() you will receive a warning.

# Summarize returns 2 rows per group
p <- c(.25,.75) # Q1 and Q3
ToothGrowth |>
  group_by(supp, dose) |>
  summarize(
    prob = p,
    qs   = quantile(len, 
                    prob = p)
  )

Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
dplyr 1.1.0.
ℹ Please use `reframe()` instead.
ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
  always returns an ungrouped data frame and adjust accordingly.

`summarise()` has grouped output by 'supp', 'dose'. You can override using the
`.groups` argument.

# A tibble: 12 × 4
# Groups:   supp, dose [6]
   supp   dose  prob    qs
   <fct> <dbl> <dbl> <dbl>
 1 OJ      0.5  0.25  9.7 
 2 OJ      0.5  0.75 16.2 
 3 OJ      1    0.25 20.3 
 4 OJ      1    0.75 25.6 
 5 OJ      2    0.25 24.6 
 6 OJ      2    0.75 27.1 
 7 VC      0.5  0.25  5.95
 8 VC      0.5  0.75 10.9 
 9 VC      1    0.25 15.3 
10 VC      1    0.75 17.3 
11 VC      2    0.25 23.4 
12 VC      2    0.75 28.8

19.8.2 Reframe

An alternative function that does not have a requirement on the number of rows returned per group is reframe().

# Use reframe instead
ToothGrowth |>
  group_by(supp, dose) |>
  reframe(
    prob = p,
    qs   = quantile(len, prob = p)
  )

# A tibble: 12 × 4
   supp   dose  prob    qs
   <fct> <dbl> <dbl> <dbl>
 1 OJ      0.5  0.25  9.7 
 2 OJ      0.5  0.75 16.2 
 3 OJ      1    0.25 20.3 
 4 OJ      1    0.75 25.6 
 5 OJ      2    0.25 24.6 
 6 OJ      2    0.75 27.1 
 7 VC      0.5  0.25  5.95
 8 VC      0.5  0.75 10.9 
 9 VC      1    0.25 15.3 
10 VC      1    0.75 17.3 
11 VC      2    0.25 23.4 
12 VC      2    0.75 28.8

19.9 Summary

We introduced a variety of functions to help with data wrangling for a single data frame.