14  Bivariate plots

Author

Jarad Niemi

R Code Button

Bivariate plots provide visualization of two variables. If we consider categorical and numeric as two different variable types, we have three different types of bivariate plots depending on the combination of variable types: categorical-categorical, numeric-categorical, and numeric-numeric.

If both variables are categorical, we generally are not plotting them, but instead may construct a contingency table. If one variable is numeric and the other is categorical, we can utilize the univariate plots discussed previously, but create one of those plots for each value of the categorical variable. If both variables are numeric, then we typically construct scatterplots.

library("tidyverse")
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.2     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
theme_set(theme_bw()) # sets the default theme 

14.1 Categorical-categorical

We can utilize multiple bar charts to visualize two categorical variables. By default, the bar charts will be stacked on top of on another.

# Bar chart
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = color))

Alternatively, we can unstack them using dodge.

# Bar chart with dodge
ggplot(data = diamonds) +
  geom_bar(
    mapping = aes(
      x    = cut, 
      fill = color),
    position = "dodge") 

Or we can use facets

# Barchart with facets
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = color)) +
  facet_wrap(~cut,
             scales = "free_y") # so you can see 'Fair'

14.2 Categorical-numeric

When one variable is continuous and the other is categorical, we can utilize the univariate plots previously discussed including boxplots, histograms, density plots, violin plots, and scatterplots.

14.2.1 Boxplot

Here is an example of utilizing a boxplot with a categorical variable.

# Boxplots
ggplot(data = diamonds) + 
  geom_boxplot(mapping = aes(
    x = carat, 
    y = cut))

If we want to flip the boxes, we can reverse x and y.

# Boxplots x and y flipped
ggplot(data = diamonds) + 
  geom_boxplot(mapping = aes(
    x = cut, 
    y = carat))

Alternatively, we can use the coord_flip() function.

# Boxplots flipped using coord_flip()
ggplot(data = diamonds) + 
  geom_boxplot(mapping = aes(
    x = carat, 
    y = cut)) +
  coord_flip()

14.2.2 Violin

Violin plots can be constructed similarly.

# Violin plots
ggplot(data = diamonds) + 
  geom_violin(mapping = aes(
    x = carat, 
    y = cut))

14.2.3 Density

For density plots, we can put each

# Density plots
ggplot(data = diamonds) + 
  geom_density(mapping = aes(
    x        = carat, 
    color    = cut,
    linetype = cut))

or we could use facets

# Density plots with facets
ggplot(data = diamonds) + 
  geom_density(mapping = aes(
    x = carat)) +
  facet_wrap(~cut)

14.2.4 Histogram

For histograms, the default is to create a stacked histogram (similar to the stack bar chart) we saw previously. This provides the univariate histogram but colored according to the number of observations in a second categorical variable.

# Stacked histograms
ggplot(data = diamonds) + 
  geom_histogram(
    mapping = aes(     
      x       = carat, 
      fill    = cut),
    binwidth = 0.1)   

If we instead want overlapping histograms, we can use stat='identity'. Since the histograms will be overlapping, we may want to provide transparency to the histograms using alpha.

# Overlapped histograms
ggplot(data = diamonds) + 
  geom_histogram(
    mapping = aes(     
      x       = carat, 
      fill    = cut),
    binwidth = 0.1,
    position = 'identity', # overlap the histograms
    color = NA,            # remove histogram border color
    alpha = 0.5)           # make histograms semi-transparent

Overlapped histograms can be a bit difficult to

# Faceted histograms
ggplot(data = diamonds) + 
  geom_histogram(
    mapping = aes(
      x = carat),
    binwidth = 0.1) +
  facet_wrap(~cut) 

# Faceted histogram with free y-axes
ggplot(data = diamonds) + 
  geom_histogram(
    mapping = aes(
      x = carat),
    binwidth = 0.1) +
  facet_wrap(~cut, 
             scales = "free_y") # separate y-axes for each facet

If you want more control, you can determine how many columns to use.

# Faceted histogram with 5 columns
ggplot(data = diamonds) + 
  geom_histogram(
    mapping = aes(
      x = carat),
    binwidth = 0.1) +
  facet_wrap(~cut, 
             ncol = 5,          # 5 columns so all histograms on the same row
             scales = "free_y") 

14.2.5 Scatterplots

We can also use scatterplots, especially when there is not a lot of data

# Scatterplots
d <- diamonds |> 
  sample_n(100)  # Sample 100 random data points

ggplot(data = d) + 
  geom_point(mapping = aes(
    x = cut, y = carat))

Since the points line up directly above the categorical variable, it is helpful to utilize jitter to enable the viewer to see individual points.

# Scatterplots with jitter
ggplot(data = d) + 
  geom_jitter(mapping = aes(
    x = cut, 
    y = carat))

You can adjust the amount of jitter using width and height arguments.

# Scatterplots with less than default jitter
ggplot(data = d) + 
  geom_jitter(
    mapping = aes(
      x = cut, 
      y = carat),
    width = 0.1,
    height = 0)  # eliminate jitter on the y-axis variable

14.2.6 Pointrange plots

If you only need to visualizing mean and a measure of uncertainty, you can utilize a pointrange plot. In order to utilize the pointrange plot, you need to summarize your data, i.e. transform it.

# Transform (summarize) data for pointrange plot
diamond_summaries <- diamonds |>
  sample_n(1000) |>     # randomly sample 1,000 observations so we can see the error bars
  group_by(cut) |>
  summarize(
    n = n(),            # n() counts the number of observations in that group
    mean = mean(carat),
    sd   = sd(carat),
    
    # Construct endpoints of a 95% confidence interval
    lb = mean - pt(.975, n-1) * sd / sqrt(n),
    ub = mean + pt(.975, n-1) * sd / sqrt(n),
  ) 

Now we are ready to construct the plot

# Pointrange plot
ggplot(data = diamond_summaries) + 
  geom_pointrange(mapping = aes(
    x = cut,
    y = mean,
    ymin = lb,
    ymax = ub)) +
  labs(
    title = "Mean with 95% confidence interval for each cut",
    y = "carat") 

14.2.7 Bar charts

Many fields utilize bar charts to visualize mean effects.

# Bar chart for means
ggplot(diamond_summaries) +
  geom_col(mapping = aes(   # notice geom_col rather than geom_bar
    x = cut,
    y = mean
  ))

An equivalent alternative is to use geom_bar() with stat = "identity".

# Equivalent bar chart, but using geom_bar
ggplot(diamond_summaries) +
  geom_bar(
    mapping = aes(   
      x = cut,
      y = mean),
    stat = "identity") # avoid the default counting by geom_bar

In these fields, it is common to include some measure of uncertainty. We can add standard errors using geom_linerange().

# Linerange plot
ggplot(diamond_summaries,
       
       # Put mapping in ggplot since both geoms will use it
       mapping = aes(
         x = cut,
         y = mean,
         ymin = lb,
         ymax = ub
       )) +
  
  geom_col() +
  geom_linerange(color = "gray") +
  
  labs(
    x = "Cut",
    y = "Carat"
  ) +
  
  # remove vertical grid so we can see errorbars
  scale_x_discrete(breaks = NULL)  

14.3 Numeric-numeric

When both variables are continuous, there are a variety of plots that can be used to visualize the data. When there are not too many points, we can utilize scatterplots

14.3.1 Scatterplot

# Scatterplot
ggplot(diamonds |> sample_n(100)) +
  geom_point(mapping = aes(
    x = depth,
    y = carat
  ))

But when there a lot of observations, these scatterplots become unreadable.

ggplot(diamonds) +
  geom_point(mapping = aes(
    x = depth,
    y = carat
  ))

Also, when saved as a pdf (or any other vector-based graphics) the files can be extremely large.

14.3.2 Two-dimensional histogram

An alternative is to use a two-dimensional histogram.

# Two-dimensional histogram
ggplot(diamonds) +
  geom_bin2d(
    mapping = aes(
      x = depth,
      y = carat),
    bins = 70)

Instead of rectangular bins, I prefer hexagonal bins.

# install.packages("hexbin")

# Two-dimensional histogram using hexagons
ggplot(diamonds) +
  geom_hex(
    mapping = aes(
      x = depth,
      y = carat),
    bins = 70)

14.3.3 Time series plot

A time series plot is a plot with two continuous variables where one of the variables is time. Time is always put on the x-axis.

We can use a scatterplot for a time series plot

# Time series scatterplot
ggplot(economics) +
  geom_point(mapping = aes(
    x = date,
    y = unemploy
  ))

or we can use a lineplot

# Time series lineplot
ggplot(economics) +
  geom_line(mapping = aes(
    x = date,
    y = unemploy
  ))

14.4 Summary

We discussed a variety of plots for two variables including bar charts when there are two categorical variables, multiple types of plots when you have a mix of a categorical and a numeric variable, and scatterplots and 2-dimensional histograms when you have two numeric variables.

There many more types of plots available and I encourage you to take a look at the R Graph Gallery.