24  Normal analyses (old)

Author

Jarad Niemi


Normal analyses are (typically) relevant for independent, continuous observations. These observations may need to be transformed to satisfy the assumption of normality, most commonly by taking a logarithm of the data.
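As a small, hypothetical illustration (the values here are simulated, not from any data set used in this chapter), right-skewed data often look much more symmetric after taking logs:

# Hypothetical illustration: log-transforming right-skewed data
y     <- rlnorm(100, meanlog = 2, sdlog = 0.5) # simulated, right-skewed
log_y <- log(y)                                # roughly symmetric, bell-shaped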

library("tidyverse"); theme_set(theme_bw())
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.2     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library("Sleuth3")
set.seed(20230319)

24.1 One group

The single group normal analysis begins with the assumptions \[ Y_i \stackrel{ind}{\sim} N(\mu,\sigma^2) \] for \(i=1,\ldots,n\).

An alternative way to write this model is \[ Y_i = \mu + \epsilon_i, \qquad \epsilon_i \stackrel{ind}{\sim} N(0,\sigma^2) \]

which has the following assumptions:

  • errors are independent,
  • errors are normally distributed,
  • errors have a constant variance,

and

  • observations have a common mean.
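To make these assumptions concrete, here is a minimal simulation sketch; the values of n, mu, and sigma are made up:

# Simulating data that satisfy the one-group model (values are assumed)
n     <- 24
mu    <- 20    # common mean
sigma <- 4.5   # constant error standard deviation
y     <- mu + rnorm(n, mean = 0, sd = sigma) # independent normal errors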

24.1.1 Summary statistics

# One-sample data
creativity <- case0101 |>
  filter(Treatment == "Intrinsic")

A first step in any analysis is to look at summary statistics.

# One-sample summary
creativity |>
  summarize(
    n    = n(),
    mean = mean(Score),
    sd   = sd(Score)
  )
   n     mean       sd
1 24 19.88333 4.439513

Plotting the data will also help provide insight.

# One-sample histogram
ggplot(creativity, aes(x = Score)) + 
  geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# One-sample boxplot
ggplot(creativity, aes(x = Score)) + 
  geom_boxplot()

24.1.2 Test and confidence interval

In a one-sample t-test, we are typically interested in the hypothesis \[ H_0: \mu = \mu_0 \] for some value \(\mu_0\).
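The lm() approach below reports a test of \(H_0: \mu = 0\). For a general \(\mu_0\), one option is base R's t.test(), which takes the null value directly; the value 15 here is an arbitrary illustration:

# One-sample t-test against an arbitrary null value
t.test(creativity$Score, mu = 15)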

# One-sample t-test
m <- lm(Score ~ 1, 
        data = creativity)
m

Call:
lm(formula = Score ~ 1, data = creativity)

Coefficients:
(Intercept)  
      19.88  
# Regression model summary
s <- summary(m)
s

Call:
lm(formula = Score ~ 1, data = creativity)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.8833 -2.4583  0.5167  2.4167  9.8167 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  19.8833     0.9062   21.94   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.44 on 23 degrees of freedom

You can extract many useful summaries from this analysis using built-in functions and inspecting the model and summary objects. Let’s start with the summary statistics (which you can compare to the summary statistics above).

# Degrees of freedom
m$df.residual # n-1
[1] 23
# Estimated mean
coef(m)
(Intercept) 
   19.88333 
m$coefficients
(Intercept) 
   19.88333 
# Estimated standard deviation
s$sigma
[1] 4.439513

We can also obtain inferential statistics including confidence intervals and p-values.

# Confidence intervals for the mean
confint(m) # default is 95%
               2.5 %   97.5 %
(Intercept) 18.00869 21.75798
confint(m, level = 0.9)
                5 %     95 %
(Intercept) 18.3302 21.43646
# P-value to compare mean to 0
s$coefficients[1, 4]
[1] 6.360338e-17
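To see where these numbers come from, the interval and p-value can be reconstructed by hand from the estimate, its standard error, and the t distribution; a sketch:

# Reconstructing the 95% confidence interval and p-value by hand
est <- coef(m)[1]            # estimated mean
se  <- s$coefficients[1, 2]  # standard error of the mean
est + c(-1, 1) * qt(0.975, df = m$df.residual) * se  # 95% CI
2 * pt(-abs(s$coefficients[1, 3]), df = m$df.residual) # two-sided p-value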

The built-in functions, e.g. summary(), coef(), and confint(), require you to remember their names. Fortunately, these functions are commonly used across many statistical models and thus, with time, will become easier to remember.

For the objects accessed with $, you can use the names() function to remind yourself what is available.

# R object names
names(m)
 [1] "coefficients"  "residuals"     "effects"       "rank"         
 [5] "fitted.values" "assign"        "qr"            "df.residual"  
 [9] "call"          "terms"         "model"        
names(s)
 [1] "call"          "terms"         "residuals"     "coefficients" 
 [5] "aliased"       "sigma"         "df"            "adj.r.squared"
 [9] "r.squared"     "cov.unscaled" 

24.2 Two groups

When you have two groups and they are independent, a common assumption is

\[ Y_{gi} \stackrel{ind}{\sim} N(\mu_g, \sigma^2) \] for \(i = 1, \ldots, n_g\) and \(g=1,2\). Alternatively

\[ Y_{gi} = \mu_g + \epsilon_{gi}, \qquad \epsilon_{gi} \stackrel{ind}{\sim} N(0,\sigma^2) \] which assumes

  • errors are independent,
  • errors are normally distributed,
  • errors have constant variance,

and

  • each group has its own mean.

24.2.1 Summary statistics

# Two-groups summary
case0101 |>
  group_by(Treatment) |>
  summarize(
    n    = n(),
    mean = mean(Score),
    sd   = sd(Score),
    .groups = "drop"
  )
# A tibble: 2 × 4
  Treatment     n  mean    sd
  <fct>     <int> <dbl> <dbl>
1 Extrinsic    23  15.7  5.25
2 Intrinsic    24  19.9  4.44
# Two-groups histogram
ggplot(case0101, 
       aes(x = Score)) +
  geom_histogram(bins = 30) +
  facet_wrap( ~ Treatment, 
              ncol = 1)

# Two-groups boxplot
ggplot(case0101, 
       aes(x = Treatment,
           y = Score)) +
  geom_boxplot() 

As you look at these summary statistics and plots, you can start to evaluate model assumptions. Here, the two groups appear to have approximately equal variances while their means likely differ.

24.2.2 Test and confidence intervals

Typically we are interested in the hypothesis \[ H_0: \mu_1 = \mu_2 \qquad \mbox{or, equivalently,} \qquad H_0: \mu_1 - \mu_2 = 0 \] and a confidence interval for the difference \(\mu_1-\mu_2\).
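For comparison, the same equal-variance analysis can be run with base R's t.test(); its estimated difference should match the Treatment coefficient from lm() below, up to sign:

# Equal-variance two-sample t-test (for comparison with lm())
t.test(Score ~ Treatment, data = case0101, var.equal = TRUE)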

# Two-sample t-test
m <- lm(Score ~ Treatment,
        data = case0101)
m

Call:
lm(formula = Score ~ Treatment, data = case0101)

Coefficients:
       (Intercept)  TreatmentIntrinsic  
            15.739               4.144  
s <- summary(m)
s

Call:
lm(formula = Score ~ Treatment, data = case0101)

Residuals:
    Min      1Q  Median      3Q     Max 
-10.739  -2.983   1.061   2.961   9.817 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)          15.739      1.012  15.550  < 2e-16 ***
TreatmentIntrinsic    4.144      1.416   2.926  0.00537 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.854 on 45 degrees of freedom
Multiple R-squared:  0.1598,    Adjusted R-squared:  0.1412 
F-statistic: 8.561 on 1 and 45 DF,  p-value: 0.005366

Similar to the one group analysis, we can obtain summary statistics from the model output.

# Degrees of freedom
m$df.residual # n-2
[1] 45
# Estimated coefficients
coef(m)
       (Intercept) TreatmentIntrinsic 
         15.739130           4.144203 
m$coefficients
       (Intercept) TreatmentIntrinsic 
         15.739130           4.144203 
coef(m)[1] # mean of reference group (Extrinsic)
(Intercept) 
   15.73913 
coef(m)[2] # difference in group means
TreatmentIntrinsic 
          4.144203 
# Estimated standard deviation
s$sigma
[1] 4.854066
# R-squared (coefficient of determination)
s$r.squared
[1] 0.1598325

Then we can move on to the inferential statistics.

# Confidence intervals for the coefficients
confint(m) 
                       2.5 %    97.5 %
(Intercept)        13.700570 17.777691
TreatmentIntrinsic  1.291432  6.996973
confint(m)[2,] # 95% confidence interval for the difference
   2.5 %   97.5 % 
1.291432 6.996973 
# P-value to compare difference to 0
s$coefficients[2,4]
[1] 0.005366476
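Since all of these inferences rely on the assumptions listed at the start of this section, it can also help to look at residuals. A minimal sketch of two common diagnostics (one of many reasonable choices): residuals against fitted values for constant variance, and a normal quantile-quantile plot for normality.

# Residual diagnostics (a sketch)
d <- data.frame(residual = residuals(m), fitted = fitted(m))
ggplot(d, aes(x = fitted, y = residual)) + geom_jitter(width = 0.1)
ggplot(d, aes(sample = residual)) + geom_qq() + geom_qq_line()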

24.3 Paired data

A paired analysis can be used when the data naturally group into pairs of observations. Within each pair, one observation has one level of the explanatory variable while the other observation has the other level.

In the following example, the data are naturally paired because each pair of observations comes from a set of identical twins, one of whom has schizophrenia while the other does not. These data compare the volume of the left hippocampus to understand its relationship to schizophrenia.

24.3.1 Summary statistics

We could compare summary statistics separately for all individuals.

# Column summaries for each twin
summary(case0202)
   Unaffected       Affected   
 Min.   :1.250   Min.   :1.02  
 1st Qu.:1.600   1st Qu.:1.31  
 Median :1.770   Median :1.59  
 Mean   :1.759   Mean   :1.56  
 3rd Qu.:1.935   3rd Qu.:1.78  
 Max.   :2.080   Max.   :2.02  

The power in these data arises from the use of pairing. In order to make use of this pairing, we should compute the difference (or ratio) for each pair.

# Summary of within-pair differences
case0202 |>
  mutate(diff = Unaffected - Affected) |>
  summarize(
    n    = n(),
    mean = mean(diff),
    sd   = sd(diff)
  )
   n      mean        sd
1 15 0.1986667 0.2382935
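If the ratio within each pair is more natural than the difference, taking logs turns the ratio into a difference of logs, which connects back to the log transformation mentioned at the start of this chapter. A sketch (an alternative, not used in the rest of this section):

# Summary of within-pair log ratios
case0202 |>
  mutate(log_ratio = log(Unaffected / Affected)) |>
  summarize(
    n    = n(),
    mean = mean(log_ratio),
    sd   = sd(log_ratio)
  )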

For analysis and plotting, it is useful to wrangle the data into a longer format.

# Prepare data for plotting
schizophrenia <- case0202 |>
  mutate(pair = LETTERS[1:n()]) |>    # Create a variable for the pair
  pivot_longer(-pair, 
               names_to = "diagnosis", 
               values_to = "volume")

In this plot, each pair is a line with the affected twin on the left and the unaffected twin on the right.

# Paired line plot
ggplot(schizophrenia, 
       aes(x     = diagnosis, 
           y     = volume, 
           group = pair)) + 
  geom_line()

It is important to connect these points between each twin because we are looking for a pattern (lines generally all going up or lines generally all going down) amongst the pairs.

24.3.2 Paired t-test

When the data are paired, we can perform a paired t-test.

# Paired t-test
m <- lm(volume ~ diagnosis + pair,
        data = schizophrenia)
m

Call:
lm(formula = volume ~ diagnosis + pair, data = schizophrenia)

Coefficients:
        (Intercept)  diagnosisUnaffected                pairB  
          1.506e+00            1.987e-01           -7.000e-02  
              pairC                pairD                pairE  
         -9.000e-02           -1.200e-01            3.900e-01  
              pairF                pairG                pairH  
         -1.450e-01            1.250e-01            1.150e-01  
              pairI                pairJ                pairK  
         -7.500e-02            2.800e-01           -4.700e-01  
              pairL                pairM                pairN  
          3.000e-02            4.250e-01            1.640e-15  
              pairO  
          4.200e-01  
s <- summary(m)
s

Call:
lm(formula = volume ~ diagnosis + pair, data = schizophrenia)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.23567 -0.07558  0.00000  0.07558  0.23567 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)          1.506e+00  1.231e-01  12.236 7.28e-09 ***
diagnosisUnaffected  1.987e-01  6.153e-02   3.229  0.00606 ** 
pairB               -7.000e-02  1.685e-01  -0.415  0.68412    
pairC               -9.000e-02  1.685e-01  -0.534  0.60163    
pairD               -1.200e-01  1.685e-01  -0.712  0.48806    
pairE                3.900e-01  1.685e-01   2.315  0.03633 *  
pairF               -1.450e-01  1.685e-01  -0.861  0.40399    
pairG                1.250e-01  1.685e-01   0.742  0.47044    
pairH                1.150e-01  1.685e-01   0.682  0.50606    
pairI               -7.500e-02  1.685e-01  -0.445  0.66305    
pairJ                2.800e-01  1.685e-01   1.662  0.11879    
pairK               -4.700e-01  1.685e-01  -2.789  0.01448 *  
pairL                3.000e-02  1.685e-01   0.178  0.86124    
pairM                4.250e-01  1.685e-01   2.522  0.02439 *  
pairN                1.640e-15  1.685e-01   0.000  1.00000    
pairO                4.200e-01  1.685e-01   2.493  0.02583 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1685 on 14 degrees of freedom
Multiple R-squared:  0.8336,    Adjusted R-squared:  0.6554 
F-statistic: 4.677 on 15 and 14 DF,  p-value: 0.003132

In this analysis, we do not care about the significance of any particular pair; we include pair only to account for the pairing. We focus on the coefficient for diagnosis.

# Degrees of freedom
m$df.residual # number of pairs - 1
[1] 14
# Estimated difference
coef(m)[2]
diagnosisUnaffected 
          0.1986667 
m$coefficients[2]
diagnosisUnaffected 
          0.1986667 
# Estimated standard deviation
s$sigma
[1] 0.168499

We can also extract inferential statistics.

# Confidence interval for the difference
confint(m)[2,]
    2.5 %    97.5 % 
0.0667041 0.3306292 
# P-value for test of difference being 0
s$coefficients[2,4]
[1] 0.006061544

This analysis is equivalent to taking the difference in each pair and performing a one-sample t-test.

# One-sample t-test on the difference
m <- lm(Unaffected - Affected ~ 1,
        data = case0202)
m

Call:
lm(formula = Unaffected - Affected ~ 1, data = case0202)

Coefficients:
(Intercept)  
     0.1987  
summary(m)

Call:
lm(formula = Unaffected - Affected ~ 1, data = case0202)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.38867 -0.14367 -0.08867  0.11633  0.47133 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  0.19867    0.06153   3.229  0.00606 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2383 on 14 degrees of freedom
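For comparison, base R's paired t-test should reproduce the same estimate, t statistic, and p-value:

# Paired t-test via base R (for comparison)
t.test(case0202$Unaffected, case0202$Affected, paired = TRUE)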

24.4 More than two groups

When you have more than two independent groups, a common assumption is

\[ Y_{gi} \stackrel{ind}{\sim} N(\mu_g, \sigma^2) \] for \(i = 1, \ldots, n_g\) and \(g=1,2,\ldots,G\). Alternatively

\[ Y_{gi} = \mu_g + \epsilon_{gi}, \qquad \epsilon_{gi} \stackrel{ind}{\sim} N(0,\sigma^2) \] which assumes

  • errors are independent,
  • errors are normally distributed,
  • errors have constant variance,

and

  • each group has its own mean.
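As with the earlier models, a minimal simulation sketch may make these assumptions concrete; all values here are made up:

# Simulating data that satisfy the multi-group model (values are assumed)
G     <- 3
n_g   <- 10
mu_g  <- c(30, 40, 45)  # group means
sigma <- 6              # constant error standard deviation
d <- data.frame(
  group = rep(LETTERS[1:G], each = n_g),
  y     = rep(mu_g, each = n_g) + rnorm(G * n_g, mean = 0, sd = sigma)
)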

24.4.1 Summary statistics

# More-than-two-groups summary
case0501 |>
  group_by(Diet) |>
  summarize(
    n = n(),
    mean = mean(Lifetime),
    sd = sd(Lifetime),
    .groups = "drop"
  )
# A tibble: 6 × 4
  Diet      n  mean    sd
  <fct> <int> <dbl> <dbl>
1 N/N85    57  32.7  5.13
2 N/R40    60  45.1  6.70
3 N/R50    71  42.3  7.77
4 NP       49  27.4  6.13
5 R/R50    56  42.9  6.68
6 lopro    56  39.7  6.99
# Jitterplot
ggplot(case0501, 
       aes(x = Diet, 
           y = Lifetime)) +
  geom_jitter(width = 0.1)

# Boxplot
ggplot(case0501, 
       aes(x = Diet, 
           y = Lifetime)) +
  geom_boxplot()

We can see from these boxplots that the data appear a bit left-skewed, but that the variability in each group is pretty similar.

24.4.2 ANOVA

An ANOVA F-test considers the following null hypothesis \[ H_0: \mu_g = \mu \quad \mbox{ for all } g \] against the alternative that at least two of the group means differ.

# ANOVA
m <- lm(Lifetime ~ Diet, 
        data = case0501)
anova(m)
Analysis of Variance Table

Response: Lifetime
           Df Sum Sq Mean Sq F value    Pr(>F)    
Diet        5  12734  2546.8  57.104 < 2.2e-16 ***
Residuals 343  15297    44.6                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
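For comparison, base R's oneway.test() with var.equal = TRUE should reproduce the same F statistic and p-value:

# One-way ANOVA via base R (for comparison)
oneway.test(Lifetime ~ Diet, data = case0501, var.equal = TRUE)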

24.5 Summary

We introduced a number of analyses for normal (continuous) data depending on how many groups we have and the structure of those groups. In the future, we will introduce linear regression models that can be used for continuous data. We will see that these models allow us to perform all of the analyses above and analyze data with a more complicated structure.