Normal analyses are (typically) relevant for independent, continuous observations. These observations may need to be transformed to satisfy the assumption of normality, most commonly by taking a logarithm of the data.
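As a quick illustration with simulated (hypothetical) data: right-skewed observations often look much closer to normal after taking logs.

```r
set.seed(1)
y <- rlnorm(100, meanlog = 0, sdlog = 1) # simulated right-skewed (lognormal) data

# The raw data are skewed; the logged data are (exactly, in this simulation) normal
hist(log(y), main = "Histogram of log(y)") # roughly bell-shaped
```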
library("tidyverse"); theme_set(theme_bw())
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.2 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.0
✔ ggplot2 3.4.2 ✔ tibble 3.2.1
✔ lubridate 1.9.2 ✔ tidyr 1.3.0
✔ purrr 1.0.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library("Sleuth3")
set.seed(20230319)
24.1 One group
The single group normal analysis begins with the assumptions \[
Y_i \stackrel{ind}{\sim} N(\mu,\sigma^2)
\] for \(i=1,\ldots,n\).
An alternative way to write this model is \[
Y_i = \mu + \epsilon_i, \qquad \epsilon_i \stackrel{ind}{\sim} N(0,\sigma^2)
\]
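The output below comes from an intercept-only regression fit with lm(). A minimal sketch, assuming (as the output suggests) that creativity holds the intrinsic-motivation group of the Sleuth3 creativity study (case0101); the filtering step is my reconstruction and is not shown in the original:

```r
library("Sleuth3")
library("dplyr")

# Assumption: creativity is the intrinsic-motivation subset of case0101
creativity <- case0101 |> filter(Treatment == "Intrinsic")

m <- lm(Score ~ 1, data = creativity) # intercept-only model: Y_i = mu + e_i
s <- summary(m)
s
```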
Call:
lm(formula = Score ~ 1, data = creativity)
Residuals:
Min 1Q Median 3Q Max
-7.8833 -2.4583 0.5167 2.4167 9.8167
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 19.8833 0.9062 21.94 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.44 on 23 degrees of freedom
You can extract many useful summaries from this analysis using built-in functions and inspecting the model and summary objects. Let’s start with the summary statistics (which you can compare to the summary statistics above).
# Degrees of freedom
m$df.residual # n-1
[1] 23
# Estimated mean
coef(m)
(Intercept)
19.88333
m$coefficients
(Intercept)
19.88333
# Estimated standard deviation
s$sigma
[1] 4.439513
We can also obtain inferential statistics including confidence intervals and p-values.
# Confidence intervals for the mean
confint(m) # default is 95%
2.5 % 97.5 %
(Intercept) 18.00869 21.75798
confint(m, level = 0.9)
5 % 95 %
(Intercept) 18.3302 21.43646
# P-value to compare mean to 0
s$coefficients[4]
[1] 6.360338e-17
The built-in functions, e.g. summary(), coef(), and confint(), require you to remember their names. Fortunately, these functions are used with many statistical models and thus, with time, will become easier to remember.
For the objects accessed with $, you can use the names() function to remind yourself of the available components.
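For example, with a toy model (just to show the component names):

```r
set.seed(1)
m <- lm(rnorm(10) ~ 1) # toy intercept-only model
s <- summary(m)

names(m) # includes "coefficients", "residuals", "df.residual", ...
names(s) # includes "sigma", "coefficients", ...
```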
As you look at these summary statistics, you can start to evaluate model assumptions. From these plots and summary statistics, it appears the two groups have approximately equal variance while the means are likely different.
24.2.2 Test and confidence intervals
Typically we are interested in the hypothesis \[
H_0: \mu_1 = \mu_2
\qquad \mbox{or, equivalently,} \qquad
H_0: \mu_1 - \mu_2 = 0
\] and a confidence interval for the difference \(\mu_1-\mu_2\).
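The interval and p-value below come from a regression with a group indicator. A sketch, assuming the two-group creativity data are Sleuth3::case0101 (with Score and Treatment columns):

```r
library("Sleuth3")

# Regression with a treatment indicator; the second coefficient
# estimates the difference in means mu_1 - mu_2
m <- lm(Score ~ Treatment, data = case0101)
s <- summary(m)
coef(m)
```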
confint(m)[2,] # 95% confidence interval for the difference
2.5 % 97.5 %
1.291432 6.996973
# P-value to compare difference to 0
s$coefficients[2,4]
[1] 0.005366476
24.3 Paired data
A paired analysis can be used when the data have a natural grouping structure into a pair of observations. Within each pair of observations, one observation has one level of the explanatory variable while the other observation has the other level of the explanatory variable.
In the following example, the data are naturally paired because each pair of observations comes from a set of identical twins: one twin has schizophrenia while the other does not. These data compare the volume of the left hippocampus between the twins to understand its relationship to schizophrenia.
24.3.1 Summary statistics
We could compare summary statistics separately for all individuals.
summary(case0202)
Unaffected Affected
Min. :1.250 Min. :1.02
1st Qu.:1.600 1st Qu.:1.31
Median :1.770 Median :1.59
Mean :1.759 Mean :1.56
3rd Qu.:1.935 3rd Qu.:1.78
Max. :2.080 Max. :2.02
The power in these data arises from the use of pairing. In order to make use of this pairing we should compute the difference (or ratio) for each pair.
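A sketch of computing the within-pair differences directly from the wide-format data (column names taken from the summary above):

```r
library("Sleuth3")

# Within-pair differences in left-hippocampus volume
diffs <- case0202$Unaffected - case0202$Affected
summary(diffs)
```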
For analysis and plotting, it is useful to wrangle the data into a longer format.
# Prepare data for plotting
schizophrenia <- case0202 |>
  mutate(pair = LETTERS[1:n()]) |> # Create a variable for the pair
  pivot_longer(-pair, names_to = "diagnosis", values_to = "volume")
In this plot, each pair is a line with the affected twin on the left and the unaffected twin on the right.
ggplot(schizophrenia, aes(x = diagnosis, y = volume, group = pair)) +
  geom_line()
It is important to connect these points between each twin because we are looking for a pattern (lines generally all going up or lines generally all going down) amongst the pairs.
24.3.2 Paired t-test
When the data are paired, we can perform a paired t-test.
In this analysis, we do not care about significance of any particular pair. Instead, we include pair in the analysis to perform a paired analysis. We will focus on the coefficient for the diagnosis.
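The estimates below can be reproduced by including pair as a factor alongside diagnosis. A sketch (my reconstruction of the model-fitting step, which is not shown above):

```r
library("Sleuth3")
library("tidyverse")

schizophrenia <- case0202 |>
  mutate(pair = LETTERS[1:n()]) |>
  pivot_longer(-pair, names_to = "diagnosis", values_to = "volume")

# Paired analysis: diagnosis effect with fixed pair effects
m <- lm(volume ~ diagnosis + pair, data = schizophrenia)
s <- summary(m)
```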
# Degrees of freedom
m$df.residual # number of pairs - 1
[1] 14
# Estimated difference
coef(m)[2]
diagnosisUnaffected
0.1986667
m$coefficients[2]
diagnosisUnaffected
0.1986667
# Estimated standard deviation
s$sigma
[1] 0.168499
We can also extract inferential statistics.
# Confidence interval for the difference
confint(m)[2,]
2.5 % 97.5 %
0.0667041 0.3306292
# P-value for test of difference being 0
s$coefficients[2,4]
[1] 0.006061544
This analysis is equivalent to taking the difference in each pair and performing a one-sample t-test.
# One-sample t-test on the difference
m <- lm(Unaffected - Affected ~ 1, data = case0202)
m
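Base R's t.test() with paired = TRUE gives the same estimate, interval, and p-value; a sketch:

```r
library("Sleuth3")

# Paired t-test: equivalent to the one-sample t-test on the differences
t.test(case0202$Unaffected, case0202$Affected, paired = TRUE)
```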
# Jitterplot
ggplot(case0501, aes(x = Diet, y = Lifetime)) +
  geom_jitter(width = 0.1)
# Boxplot
ggplot(case0501, aes(x = Diet, y = Lifetime)) +
  geom_boxplot()
We can see from these boxplots that the data appear a bit left-skewed, but that the variability in each group is pretty similar.
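Group-level summary statistics complement these plots; a sketch using dplyr:

```r
library("Sleuth3")
library("dplyr")

# Sample size, mean, and standard deviation for each diet group
case0501 |>
  group_by(Diet) |>
  summarize(n = n(), mean = mean(Lifetime), sd = sd(Lifetime))
```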
24.4.2 ANOVA
An ANOVA F-test considers the following null hypothesis \[
H_0: \mu_g = \mu \quad \mbox{ for all } g
\]
# ANOVA
m <- lm(Lifetime ~ Diet, data = case0501)
anova(m)
Analysis of Variance Table
Response: Lifetime
Df Sum Sq Mean Sq F value Pr(>F)
Diet 5 12734 2546.8 57.104 < 2.2e-16 ***
Residuals 343 15297 44.6
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
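If you need the F-test p-value programmatically, it can be pulled from the table returned by anova(); a sketch:

```r
library("Sleuth3")

m <- lm(Lifetime ~ Diet, data = case0501)
a <- anova(m)
a$`Pr(>F)`[1] # p-value for the overall F-test
```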
24.5 Summary
We introduced a number of analyses for normal (continuous) data depending on how many groups we have and the structure of those groups. In the future, we will introduce linear regression models that can be used for continuous data. We will see that these models allow us to perform all of the analyses above and analyze data with a more complicated structure.