13  Univariate plots

Author

Jarad Niemi

There are a variety of different types of plots that can be constructed using ggplot2. We will not introduce all of these plots, but instead focus on the most commonly used plot types. If you are looking for a plot that is not listed here, check out the R graph gallery which provides visualizations of the plots as well as the R code to produce those plots.

The focus in these slides is on the ggplot2 syntax used to create these plot types rather than on the plots themselves. In particular, we will be modifying the data, mappings, and geom elements presented in the syntax slides. Additional customization will be discussed later.

When determining the plot type to use, you need to determine how many variables you are attempting to visualize and whether those variables are numeric or categorical. Thus, the following sections are organized by the number and type of variables.

library("tidyverse")
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.2     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
theme_set(theme_bw())

13.1 Numeric

When we have one numeric variable, we are typically interested in understanding its distribution in terms of the central tendency, spread, skewness, and possible outliers. To do this, we can use boxplots, histograms, density plots, and violin plots.

13.1.1 Boxplot

A box plot, also known as a box-and-whisker plot, is a graphical representation used to display the distribution of a dataset along with its key statistical properties. It provides a concise summary of the data’s central tendency, spread, skewness, and potential outliers. Here’s how to interpret the various components of a box plot:

  • Box: The box in the middle of the plot represents the interquartile range (IQR), which is the range between the first quartile (Q1) and the third quartile (Q3) of the data. The length of the box indicates the spread of the middle 50% of the data. In a vertical box plot, the bottom and top edges of the box represent Q1 and Q3, respectively; in a horizontal box plot, such as those produced by mapping the variable to x, they are the left and right edges.
  • Median (Q2): A line inside the box (horizontal in a vertical box plot, vertical in a horizontal one) represents the median, which is the middle value of the dataset when it is ordered. Half of the data points are below this value and half are above it. It’s also a measure of the data’s central tendency.
  • Whiskers: The whiskers extend from the edges of the box to indicate the range of the data within a certain limit. The upper whisker typically extends to the maximum data point within 1.5 times the IQR from Q3, and the lower whisker extends to the minimum data point within 1.5 times the IQR from Q1. Data points beyond the whiskers are considered potential outliers and are plotted individually.
  • Potential outliers: Data points that fall beyond the whiskers are potential outliers. Outliers can suggest unusual or extreme observations that might warrant further investigation.
  • Notch: Some box plots have a notch around the median. The notched area gives a rough indication of the uncertainty around the median’s value. If the notches of two box plots don’t overlap, it suggests that the medians of the two groups are significantly different.

Box plots are particularly useful for comparing distributions between different groups or datasets. They provide a clear visual summary of the data’s distribution and allow you to quickly identify patterns and variations within the data.

ggplot(data = diamonds) + 
  geom_boxplot(mapping = aes(x = carat))
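
The box, whisker, and outlier rules described above can be reproduced by hand. The following is a minimal sketch on a small, hypothetical vector `x` (not from the diamonds data), assuming the default multiplier of 1.5 and the quartiles returned by `quantile()`. The notch half-width of 1.58 × IQR / sqrt(n) is the formula documented for ggplot2's boxplot statistic.

```r
# Hypothetical toy data with one extreme value
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)

q   <- quantile(x, probs = c(0.25, 0.5, 0.75))  # Q1, median, Q3
iqr <- unname(q[3] - q[1])                      # length of the box

# Fences: 1.5 x IQR beyond the edges of the box
lower_fence <- unname(q[1]) - 1.5 * iqr
upper_fence <- unname(q[3]) + 1.5 * iqr

# Whiskers end at the most extreme data points inside the fences
lower_whisker <- min(x[x >= lower_fence])
upper_whisker <- max(x[x <= upper_fence])

# Points outside the fences are plotted individually as potential outliers
outliers <- x[x < lower_fence | x > upper_fence]

# Notch limits around the median: median +/- 1.58 * IQR / sqrt(n)
notch_half_width <- 1.58 * iqr / sqrt(length(x))
notch <- unname(q[2]) + c(-1, 1) * notch_half_width

c(lower_whisker, upper_whisker)  # 1 9
outliers                         # 30
```

Replacing `x` with `diamonds$carat` gives the values behind the boxplot above.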

Customizing a boxplot

ggplot(data = diamonds) + 
  geom_boxplot(mapping = aes(x = carat),
               notch         = TRUE,     # there is so much data, the notch is tiny
               coef          = 2,        # 1.5 x IQR is default (and expected)
               color         = "blue",
               fill          = "yellow",
               outlier.color = "red",
               outlier.fill  = "purple",
               outlier.shape = 23,
               outlier.size  = 3,
               outlier.alpha = 0.5,
               linetype      = 5,
               linewidth     = 1.5) 

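Unfortunately, boxplots are very good at hiding details of the distribution, such as multiple modes or gaps in the data, because they reduce the data to a handful of summary statistics. For this reason, they are not recommended when the shape of the distribution matters.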
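
To illustrate how a boxplot can hide structure, here is a small sketch using simulated data, which is an assumption for illustration and not part of the diamonds data. The data are a mixture of two well-separated groups. The boxplot draws a single box, while a histogram of the same values shows both groups.

```r
library(ggplot2)

# Simulated (hypothetical) data: two well-separated normal groups
set.seed(1)
bimodal <- data.frame(value = c(rnorm(500, mean = 0), rnorm(500, mean = 6)))

# One box summarizes all 1000 values; the two groups are not visible
p_box <- ggplot(data = bimodal) +
  geom_boxplot(mapping = aes(x = value))

# A histogram of the same data shows both modes
p_hist <- ggplot(data = bimodal) +
  geom_histogram(mapping = aes(x = value), binwidth = 0.5)

p_box
p_hist
```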

13.1.2 Histogram

A histogram is a graphical representation used to visualize the distribution of a numeric variable. It provides insight into how data is spread across different intervals, or bins, along a numerical scale. Here’s how to interpret the various aspects of a histogram:

  • Bins: The x-axis of the histogram represents the range of values present in the dataset. This range is divided into equal-width bins. Each bin represents a specific range of values. The y-axis represents the frequency or count of data points that fall within each bin.
  • Bar Heights: The height of each bar in the histogram corresponds to the frequency of data points within the associated bin. The taller the bar, the more data points fall within that bin’s range.

ggplot(data = diamonds) + 
  geom_histogram(mapping = aes(x = carat))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
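
To see the bins and counts behind a histogram, one option is to extract the computed data with `layer_data()`. This is a sketch only: the `binwidth` of 0.5 and `boundary` of 0 are arbitrary choices made here so the bin edges are easy to read.

```r
library(ggplot2)

p <- ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = carat),
                 binwidth = 0.5,  # each bin spans 0.5 carat
                 boundary = 0)    # bin edges fall on 0, 0.5, 1, ...

# One row per bin: its left edge, right edge, and number of observations
bins <- layer_data(p)[, c("xmin", "xmax", "count")]
bins

# Every observation lands in exactly one bin
sum(bins$count) == nrow(diamonds)
```

Setting `binwidth` (or `bins`) explicitly also silences the `stat_bin()` message shown above.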

Customizing a histogram.

ggplot(data = diamonds) + 
  geom_histogram(mapping = aes(x = carat, 
                               y = after_stat(density)), # the y-axis is now density, not count
                 binwidth = 0.1,
                 fill = "yellow",
                 color = "blue",
                 linetype = 4,
                 linewidth = 1)

ggplot(data = diamonds) + 
  geom_histogram(mapping = aes(y = carat), # rotate histogram
                 bins = 1000,              # number of bins
                 fill = "red",             # bins are so small, fill is hard to see
                 color = "black")

13.1.3 Density

A density plot, often referred to as a kernel density plot, is a graphical representation used to visualize the distribution of a numeric variable. Unlike a histogram, which uses discrete bins, a density plot uses a smooth, continuous curve to represent the distribution. The curve represents the estimated probability density of the data at different points along the x-axis, so the total area under the curve is 1.

ggplot(data = diamonds) + 
  geom_density(mapping = aes(x = carat))
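
As a rough check on what the curve represents, the sketch below computes a kernel density estimate with base R's `density()` and approximates the area under it. It assumes the defaults (Gaussian kernel, `"nrd0"` bandwidth), which are also the defaults for `geom_density()`. The `adjust` argument scales the bandwidth; the value 2 used here is an arbitrary choice.

```r
library(ggplot2)

# Kernel density estimate of carat evaluated on a grid of x values
kde <- density(diamonds$carat)

# Trapezoid-rule approximation of the area under the estimated curve
area <- sum(diff(kde$x) * (head(kde$y, -1) + tail(kde$y, -1)) / 2)
area  # close to 1

# adjust > 1 widens the bandwidth and gives a smoother curve
p_smooth <- ggplot(data = diamonds) +
  geom_density(mapping = aes(x = carat), adjust = 2)
p_smooth
```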

Density combined with a histogram.

ggplot(data = diamonds, 
       # mapping moved to ggplot since both <GEOM>s use the same mapping
       mapping = aes(x = carat)) + 
  geom_histogram(aes(y = after_stat(density)), # must use after_stat(density) to match the density scale
                 alpha = 0.5,
                 fill = "gray") + 
  geom_density()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
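
The reason the histogram needs the density scale here is that a density curve has area 1, while counts sum to the number of observations. The sketch below checks that, on the density scale, the bar areas of the histogram also sum to 1. The use of `bins = 30` simply makes the default explicit.

```r
library(ggplot2)

p <- ggplot(data = diamonds, mapping = aes(x = carat)) +
  geom_histogram(mapping = aes(y = after_stat(density)), bins = 30)

bars <- layer_data(p)

# Area of each bar = height (density) x width; the areas sum to 1
bar_area <- sum(bars$density * (bars$xmax - bars$xmin))
bar_area

# On the count scale the bar heights would instead sum to the sample size
sum(bars$count) == nrow(diamonds)
```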

ggplot(data = diamonds) + 
  geom_density(mapping = aes(x = carat),
               color = "#C8102E",
               fill = "#F1BE48",
               alpha = 0.5,
               linetype = 2,
               linewidth = 3)

13.1.4 Violin

A violin plot is a type of data visualization that combines aspects of a box plot and a kernel density plot. It is used to display the distribution of a numeric variable.

ggplot(data = diamonds) + 
  geom_violin(mapping = aes(x = carat, y = 0)) # y is required

ggplot(data = diamonds) + 
  geom_violin(mapping = aes(x = carat, y = 0), # y is required
              color = "#C8102E",
              fill = "#F1BE48",
              alpha = 0.5,
              linetype = 1,
              linewidth = 2) 
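
Since a violin plot combines a boxplot with a density, one way to see the connection is to draw a narrow boxplot on top of the violin. This is a sketch under one assumption: it maps y to the dummy string `""` instead of `0`, so that both geoms treat y as a single discrete group.

```r
library(ggplot2)

p <- ggplot(data = diamonds,
            mapping = aes(x = carat, y = "")) + # dummy discrete y (assumption)
  geom_violin(fill = "gray", alpha = 0.5) +     # mirrored density estimate
  geom_boxplot(width = 0.1)                     # narrow boxplot drawn on top
p
```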

13.2 Categorical

Categorical data are those that are not numeric: each observation falls into one of a set of categories, such as the cut of a diamond.

13.2.1 Bar chart

A bar chart is a common type of data visualization used to represent a categorical variable. It displays the count of data points within different categories using rectangular bars.

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut)) 

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(y = cut),
           color = "#C8102E",      # outside of bar
           fill = "#F1BE48",       # inside of bar
           alpha = 0.5,            # transparency
           linetype = 3,           # dotted line
           linewidth = 1.2)        # thicker line
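
The bar heights drawn by `geom_bar()` are simply the number of rows in each category. As a sketch, the counts can be computed directly with `table()` and, if the data are already summarized, plotted with `geom_col()`, which uses the supplied values as bar heights. The data frame name `counts` is introduced here for illustration.

```r
library(ggplot2)

# Number of diamonds in each cut category
counts <- as.data.frame(table(cut = diamonds$cut))
counts

# geom_col() plots precomputed heights instead of counting rows
p_col <- ggplot(data = counts) +
  geom_col(mapping = aes(x = cut, y = Freq))
p_col

# The counts computed by geom_bar() match the table
p_bar <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))
all(layer_data(p_bar)$count == counts$Freq)
```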

13.3 Summary

We introduced boxplots, histograms, density plots, and violin plots as methods to understand the central tendency, spread, skewness, and outliers for a numeric variable. We introduced a bar chart to provide a visual representation for a categorical variable in terms of the number of observations in each category.