# Comparing Means of Two Groups in R

## T-test in R

The t-test is used to compare two means.

This chapter describes the different types of t-test, including:

• one-sample t-tests,
• independent samples t-tests: Student’s t-test and Welch’s t-test
• paired samples t-test.

You will learn how to:

• Compute the different t-tests in R. The pipe-friendly function t_test() [rstatix package] will be used.
• Check t-test assumptions
• Calculate and report t-test effect size using Cohen’s d. The d statistic redefines the difference in means as the number of standard deviations that separates those means. T-test conventional effect sizes, proposed by Cohen, are: 0.2 (small effect), 0.5 (moderate effect) and 0.8 (large effect) (Cohen 1998).

Contents:

#### Related Book

Practical Statistics in R II - Comparing Groups: Numerical Variables

## Prerequisites

Make sure you have installed the following R packages:

• tidyverse for data manipulation and visualization
• ggpubr for creating easily publication ready plots
• rstatix provides pipe-friendly R functions for easy statistical analyses.
• datarium: contains required data sets for this chapter.

library(tidyverse)
library(ggpubr)
library(rstatix)

## One-Sample t-test

The one-sample t-test, also known as the single-parameter t test or single-sample t-test, is used to compare the mean of one sample to a known standard (or theoretical / hypothetical) mean.

Generally, the theoretical mean comes from:

• a previous experiment. For example, comparing whether the mean weight of mice differs from 200 mg, a value determined in a previous study.
• or from an experiment where you have control and treatment conditions. If you express your data as “percent of control”, you can test whether the average value of treatment condition differs significantly from 100.

### Demo data

Demo dataset: mice [in datarium package]. Contains the weight of 10 mice:

# Load and inspect the data
data(mice, package = "datarium")
head(mice, 3)
## # A tibble: 3 x 2
##   name  weight
##   <chr>  <dbl>
## 1 M_1     18.9
## 2 M_2     19.5
## 3 M_3     23.1

### Summary statistics

Compute some summary statistics: count (number of subjects), mean and sd (standard deviation)

mice %>% get_summary_stats(weight, type = "mean_sd")
## # A tibble: 1 x 4
##   variable     n  mean    sd
##   <chr>    <dbl> <dbl> <dbl>
## 1 weight      10  20.1  1.90

### Visualization

Create a boxplot to visualize the distribution of mice weights. Add also jittered points to show individual observations. The big dot represents the mean point.

bxp <- ggboxplot(
mice\$weight, width = 0.5, add = c("mean", "jitter"),
ylab = "Weight (g)", xlab = FALSE
)
bxp

### Assumptions and preliminary tests

The one-sample t-test assumes the following characteristics about the data:

• No significant outliers in the data
• Normality. the data should be approximately normally distributed

In this section, we’ll perform some preliminary tests to check whether these assumptions are met.

#### Identify outliers

Outliers can be easily identified using boxplot methods, implemented in the R function identify_outliers() [rstatix package].

mice %>% identify_outliers(weight)
## # A tibble: 0 x 4
## # … with 4 variables: name <chr>, weight <dbl>, is.outlier <lgl>, is.extreme <lgl>

There were no extreme outliers.

Note that, in the situation where you have extreme outliers, this can be due to: 1) data entry errors, measurement errors or unusual values.

In this case, you could consider running the non parametric Wilcoxon test.

#### Check normality assumption

The normality assumption can be checked by computing the Shapiro-Wilk test. If the data is normally distributed, the p-value should be greater than 0.05.

mice %>% shapiro_test(weight)
## # A tibble: 1 x 3
##   variable statistic     p
##   <chr>        <dbl> <dbl>
## 1 weight       0.923 0.382

From the output, the p-value is greater than the significance level 0.05 indicating that the distribution of the data are not significantly different from the normal distribution. In other words, we can assume the normality.

You can also create a QQ plot of the weight data. QQ plot draws the correlation between a given data and the normal distribution.

ggqqplot(mice, x = "weight")

All the points fall approximately along the (45-degree) reference line, for each group. So we can assume normality of the data.

Note that, if your sample size is greater than 50, the normal QQ plot is preferred because at larger sample sizes the Shapiro-Wilk test becomes very sensitive even to a minor deviation from normality.

If the data are not normally distributed, it’s recommended to use a non-parametric test such as the one-sample Wilcoxon signed-rank test. This test is similar to the one-sample t-test, but focuses on the median rather than the mean.

### Computation

We want to know, whether the average weight of the mice differs from the 25g (two-tailed test)?

stat.test <- mice %>% t_test(weight ~ 1, mu = 25)
stat.test
## # A tibble: 1 x 7
##   .y.    group1 group2         n statistic    df       p
## * <chr>  <chr>  <chr>      <int>     <dbl> <dbl>   <dbl>
## 1 weight 1      null model    10     -8.10     9 0.00002

The results above show the following components:

• .y.: the outcome variable used in the test.
• group1,group2: generally, the compared groups in the pairwise tests. Here, we have null model (one-sample test).
• statistic: test statistic (t-value) used to compute the p-value.
• df: degrees of freedom.
• p: p-value.

You can obtain a detailed result by specifying the option detailed = TRUE in the function t_test().

### Effect size

To calculate an effect size, called Cohen's d, for the one-sample t-test you need to divide the mean difference by the standard deviation of the difference, as shown below. Note that, here: sd(x-mu) = sd(x).

Cohen’s d formula:

d = abs(mean(x) - mu)/sd(x), where:

• x is a numeric vector containing the data.
• mu is the mean against which the mean of x is compared (default value is mu = 0).

Calculation:

mice %>% cohens_d(weight ~ 1, mu = 25)
## # A tibble: 1 x 6
##   .y.    group1 group2     effsize     n magnitude
## * <chr>  <chr>  <chr>        <dbl> <int> <ord>
## 1 weight 1      null model    10.6    10 large

Recall that, t-test conventional effect sizes, proposed by Cohen J. (1998), are: 0.2 (small effect), 0.5 (moderate effect) and 0.8 (large effect) (Cohen 1998). As the effect size, d, is 2.56 you can conclude that there is a large effect.

### Report

We could report the result as follow:

A one-sample t-test was computed to determine whether the recruited mice average weight was different to the population normal mean weight (25g).

The mice weight value were normally distributed, as assessed by Shapiro-Wilk’s test (p > 0.05) and there were no extreme outliers in the data, as assessed by boxplot method.

The measured mice mean weight (20.14 +/- 1.94) was statistically significantly lower than the population normal mean weight 25 (t(9) = -8.1, p < 0.0001, d = 2.56); where t(9) is shorthand notation for a t-statistic that has 9 degrees of freedom.

Create a box plot with p-value:

bxp + labs(
subtitle = get_test_label(stat.test, detailed = TRUE)
)

Create a density plot with p-value:

• Red line corresponds to the observed mean
• Blue line corresponds to the theoretical mean
ggdensity(mice, x = "weight", rug = TRUE, fill = "lightgray") +
scale_x_continuous(limits = c(15, 27)) +
stat_central_tendency(type = "mean", color = "red", linetype = "dashed") +
geom_vline(xintercept = 25, color = "blue", linetype = "dashed") +
labs(subtitle = get_test_label(stat.test,  detailed = TRUE))

## Independent samples t-test

The independent samples t-test (or unpaired samples t-test) is used to compare the mean of two independent groups.

For example, you might want to compare the average weights of individuals grouped by gender: male and female groups, which are two unrelated/independent groups.

The independent samples t-test comes in two different forms:

• the standard Student’s t-test, which assumes that the variance of the two groups are equal.
• the Welch’s t-test, which is less restrictive compared to the original Student’s test. This is the test where you do not assume that the variance is the same in the two groups, which results in the fractional degrees of freedom.

By default, R computes the Weltch t-test, which is the safer one. The two methods give very similar results unless both the group sizes and the standard deviations are very different.

### Demo data

Demo dataset: genderweight [in datarium package] containing the weight of 40 individuals (20 women and 20 men).

Load the data and show some random rows by groups:

# Load the data
data("genderweight", package = "datarium")
# Show a sample of the data by group
set.seed(123)
genderweight %>% sample_n_by(group, size = 2)
## # A tibble: 4 x 3
##   id    group weight
##   <fct> <fct>  <dbl>
## 1 6     F       65.0
## 2 15    F       65.9
## 3 29    M       88.9
## 4 37    M       77.0

### Summary statistics

Compute some summary statistics by groups: mean and sd (standard deviation)

genderweight %>%
group_by(group) %>%
get_summary_stats(weight, type = "mean_sd")
## # A tibble: 2 x 5
##   group variable     n  mean    sd
##   <fct> <chr>    <dbl> <dbl> <dbl>
## 1 F     weight      20  63.5  2.03
## 2 M     weight      20  85.8  4.35

### Visualization

Visualize the data using box plots. Plot weight by groups.

bxp <- ggboxplot(
genderweight, x = "group", y = "weight",
ylab = "Weight", xlab = "Groups", add = "jitter"
)
bxp

### Assumptions and preliminary tests

The two-samples independent t-test assume the following characteristics about the data:

• Independence of the observations. Each subject should belong to only one group. There is no relationship between the observations in each group.
• No significant outliers in the two groups
• Normality. the data for each group should be approximately normally distributed.
• Homogeneity of variances. the variance of the outcome variable should be equal in each group.

In this section, we’ll perform some preliminary tests to check whether these assumptions are met.

#### Identify outliers by groups

genderweight %>%
group_by(group) %>%
identify_outliers(weight)
## # A tibble: 2 x 5
##   group id    weight is.outlier is.extreme
##   <fct> <fct>  <dbl> <lgl>      <lgl>
## 1 F     20      68.8 TRUE       FALSE
## 2 M     31      95.1 TRUE       FALSE

There were no extreme outliers.

#### Check normality by groups

# Compute Shapiro wilk test by goups
data(genderweight, package = "datarium")
genderweight %>%
group_by(group) %>%
shapiro_test(weight)
## # A tibble: 2 x 4
##   group variable statistic     p
##   <fct> <chr>        <dbl> <dbl>
## 1 F     weight       0.938 0.224
## 2 M     weight       0.986 0.989
# Draw a qq plot by group
ggqqplot(genderweight, x = "weight", facet.by = "group")

From the output above, we can conclude that the data of the two groups are normally distributed.

#### Check the equality of variances

This can be done using the Levene’s test. If the variances of groups are equal, the p-value should be greater than 0.05.

genderweight %>% levene_test(weight ~ group)
## # A tibble: 1 x 4
##     df1   df2 statistic      p
##   <int> <int>     <dbl>  <dbl>
## 1     1    38      6.12 0.0180

The p-value of the Levene’s test is significant, suggesting that there is a significant difference between the variances of the two groups. Therefore, we’ll use the Weltch t-test, which doesn’t assume the equality of the two variances.

### Computation

We want to know, whether the average weights are different between groups.

Recall that, by default, R computes the Weltch t-test, which is the safer one:

stat.test <- genderweight %>%
t_test(weight ~ group) %>%
stat.test
## # A tibble: 1 x 9
##   .y.    group1 group2    n1    n2 statistic    df        p p.signif
##   <chr>  <chr>  <chr>  <int> <int>     <dbl> <dbl>    <dbl> <chr>
## 1 weight F      M         20    20     -20.8  26.9 4.30e-18 ****

If you want to assume the equality of variances (Student t-test), specify the option var.equal = TRUE:

stat.test2 <- genderweight %>%
t_test(weight ~ group, var.equal = TRUE) %>%
stat.test2

The output is similar to the result of one-sample test. Recall that, more details can be obtained by specifying the option detailed = TRUE in the function t_test(). The p-value of the comparison is significant (p < 0.0001).

### Effect size

#### Cohen’s d for Student t-test

This effect size is calculated by dividing the mean difference between the groups by the pooled standard deviation.

Cohen’s d formula:

d = (mean1 - mean2)/pooled.sd, where:

• pooled.sd is the common standard deviation of the two groups. pooled.sd = sqrt([var1*(n1-1) + var2*(n2-1)]/[n1 + n2 -2]);
• var1 and var2 are the variances (squared standard deviation) of group1 and 2, respectively.
• n1 and n2 are the sample counts for group 1 and 2, respectively.
• mean1 and mean2 are the means of each group, respectively.

Calculation:

genderweight %>%  cohens_d(weight ~ group, var.equal = TRUE)
## # A tibble: 1 x 7
##   .y.    group1 group2 effsize    n1    n2 magnitude
## * <chr>  <chr>  <chr>    <dbl> <int> <int> <ord>
## 1 weight F      M         6.57    20    20 large

There is a large effect size, d = 6.57.

#### Cohen’s d for Welch t-test

The Welch test is a variant of t-test used when the equality of variance can’t be assumed. The effect size can be computed by dividing the mean difference between the groups by the “averaged” standard deviation.

Cohen’s d formula:

d = (mean1 - mean2)/sqrt((var1 + var2)/2), where:

• mean1 and mean2 are the means of each group, respectively
• var1 and var2 are the variance of the two groups.

Calculation:

genderweight %>% cohens_d(weight ~ group, var.equal = FALSE)
## # A tibble: 1 x 7
##   .y.    group1 group2 effsize    n1    n2 magnitude
## * <chr>  <chr>  <chr>    <dbl> <int> <int> <ord>
## 1 weight F      M         6.57    20    20 large

Note that, when group sizes are equal and group variances are homogeneous, Cohen’s d for the standard Student and Welch t-tests are identical.

### Report

We could report the result as follow:

The mean weight in female group was 63.5 (SD = 2.03), whereas the mean in male group was 85.8 (SD = 4.3). A Welch two-samples t-test showed that the difference was statistically significant, t(26.9) = -20.8, p < 0.0001, d = 6.57; where, t(26.9) is shorthand notation for a Welch t-statistic that has 26.9 degrees of freedom.

stat.test <- stat.test %>% add_xy_position(x = "group")
bxp +
stat_pvalue_manual(stat.test, tip.length = 0) +
labs(subtitle = get_test_label(stat.test, detailed = TRUE))

## Paired samples t-test

The paired sample t-test is used to compare the means of two related groups of samples. Put into another words, it’s used in a situation where you have two pairs of values measured for the same samples.

For example, you might want to compare the average weight of 20 mice before and after treatment. The data contain 20 sets of values before treatment and 20 sets of values after treatment from measuring twice the weight of the same mice. In such situations, paired t-test can be used to compare the mean weights before and after treatment.

### Demo data

Here, we’ll use a demo dataset mice2 [datarium package], which contains the weight of 10 mice before and after the treatment.

# Wide format
data("mice2", package = "datarium")
head(mice2, 3)
##   id before after
## 1  1    187   430
## 2  2    194   404
## 3  3    232   406
# Transform into long data:
# gather the before and after values in the same column
mice2.long <- mice2 %>%
gather(key = "group", value = "weight", before, after)
head(mice2.long, 3)
##   id  group weight
## 1  1 before    187
## 2  2 before    194
## 3  3 before    232

### Summary statistics

Compute some summary statistics (mean and sd) by groups:

mice2.long %>%
group_by(group) %>%
get_summary_stats(weight, type = "mean_sd")
## # A tibble: 2 x 5
##   group  variable     n  mean    sd
##   <chr>  <chr>    <dbl> <dbl> <dbl>
## 1 after  weight      10  400.  30.1
## 2 before weight      10  201.  20.0

### Visualization

bxp <- ggpaired(mice2.long, x = "group", y = "weight",
order = c("before", "after"),
ylab = "Weight", xlab = "Groups")
bxp

### Assumptions and preliminary tests

The paired samples t-test assume the following characteristics about the data:

• the two groups are paired. In our example, this is the case since the data have been collected from measuring twice the weight of the same mice.
• No significant outliers in the difference between the two related groups
• Normality. the difference of pairs follow a normal distribution.

In this section, we’ll perform some preliminary tests to check whether these assumptions are met.

First, start by computing the difference between groups:

mice2 <- mice2 %>% mutate(differences = before - after)
head(mice2, 3)
##   id before after differences
## 1  1    187   430        -242
## 2  2    194   404        -210
## 3  3    232   406        -174

#### Identify outliers

mice2 %>% identify_outliers(differences)
## [1] id          before      after       differences is.outlier  is.extreme
## <0 rows> (or 0-length row.names)

There were no extreme outliers.

#### Check normality assumption

# Shapiro-Wilk normality test for the differences
mice2 %>% shapiro_test(differences) 
## # A tibble: 1 x 3
##   variable    statistic     p
##   <chr>           <dbl> <dbl>
## 1 differences     0.968 0.867
# QQ plot for the difference
ggqqplot(mice2, "differences")

From the output above, it can be assumed that the differences are normally distributed.

### Computation

We want to know, if there is any significant difference in the mean weights after treatment?

stat.test <- mice2.long  %>%
t_test(weight ~ group, paired = TRUE) %>%
stat.test
## # A tibble: 1 x 9
##   .y.    group1 group2    n1    n2 statistic    df             p p.signif
##   <chr>  <chr>  <chr>  <int> <int>     <dbl> <dbl>         <dbl> <chr>
## 1 weight after  before    10    10      25.5     9 0.00000000104 ****

The output is similar to that of a one-sample t-test. Again, more details can be obtained by specifying the option detailed = TRUE in the function t_test().

### Effect size

The effect size for a paired-samples t-test can be calculated by dividing the mean difference by the standard deviation of the difference, as shown below.

Cohen’s formula:

d = mean(D)/sd(D), where D is the differences of the paired samples values.

Calculation:

mice2.long  %>% cohens_d(weight ~ group, paired = TRUE)
## # A tibble: 1 x 7
##   .y.    group1 group2 effsize    n1    n2 magnitude
## * <chr>  <chr>  <chr>    <dbl> <int> <int> <ord>
## 1 weight after  before    8.08    10    10 large

There is a large effect size, Cohen’s d = 8.07.

### Report

We could report the result as follow: The average weight of mice was significantly increased after treatment, t(9) = 25.5, p < 0.0001, d = 8.07.

stat.test <- stat.test %>% add_xy_position(x = "group")
bxp +
stat_pvalue_manual(stat.test, tip.length = 0) +
labs(subtitle = get_test_label(stat.test, detailed= TRUE))

## Summary

This chapter describes how to compare two means in R using t-test. Quick start R codes, to compute the different t-tests, are:

# One-sample t-test
mice %>% t_test(weight ~ 1, mu = 25)
# Independent samples t-test
genderweight %>% t_test(weight ~ group)
# Paired sample t-test
mice2.long %>% t_test(weight ~ group, paired = TRUE)

Note that, to compute one-sided t-tests, you can specify the option alternative, which possible values can be “greater”, “less” or “two.sided”.

We also explain the assumptions made by the t-test and provide practical examples of R codes to check whether the test assumptions are met.

The t-test assumptions can be summarized as follow:

• One-sample t-test:
• No significant outliers in the data
• the data should be normally distributed.
• Independent sample t-test:
• No significant outliers in the groups
• the two groups of samples (A and B), being compared, should be normally distributed.
• the variances of the two groups should not be significantly different. This assumption is made only by the original Student’s t-test. It is relaxed in the Welch’s t-test.
• Paired sample t-test:
• No significant outliers in the differences between groups
• the difference of pairs should follow a normal distribution.

Assessing normality. With large enough samples size (n > 30) the violation of the normality assumption should not cause major problems (according to the central limit theorem). This implies that we can ignore the distribution of the data and use parametric tests. However, to be consistent, the Shapiro-Wilk test can be used to ascertain whether data show or not a serious deviation from normality (See Chapter @ref(normality-test-in-r)).

Assessing equality of variances. Homogeneity of variances can be checked using the Levene’s test. Note that, by default, the t_test() function does not assume equal variances; instead of the standard Student’s t-test, it uses the Welch t-test by default, which is the considered the safer one. To use Student’s t-test, set var.equal = TRUE. The two methods give very similar results unless both the group sizes and the standard deviations are very different.

In the situations where the assumptions are violated, non-parametric tests, such as Wilcoxon test, are recommended.

## References

Cohen, J. 1998. Statistical Power Analysis for the Behavioral Sciences. 2nd ed. Hillsdale, NJ: Lawrence Erlbaum Associates.

Version: Français

### Comment ( 1 )

• Roro

Paired sample t-test requires the test for normality of the difference, but the difference of two normal distributions is also a normal distribution. Can’t we treat it the same as the independent test, or is the opposite not true always?