This article describes the independent t-test assumptions and provides examples of R code to check whether the assumptions are met before calculating the t-test. This also referred as the two sample t test assumptions.
The independent samples t-test comes in two different forms:
- the standard Student’s t-test, which assumes that the variance of the two groups are equal.
- the Welch’s t-test, which is less restrictive compared to the original Student’s test. This is the test where you do not assume that the variance is the same in the two groups, which results in the fractional degrees of freedom.
The two methods give very similar results unless both the group sizes and the standard deviations are very different.
The two-samples independent t-test assume the following characteristics about the data:
- Independence of the observations. Each subject should belong to only one group. There is no relationship between the observations in each group.
- No significant outliers in the two groups
- Normality. the data for each group should be approximately normally distributed.
- Homogeneity of variances. the variance of the outcome variable should be equal in each group. Recall that, the Welch t-test does not make this assumptions.
In this section, we’ll perform some preliminary tests to check whether these assumptions are met.
Check independent t-test assumptions in R
Make sure you have installed the following R packages:
tidyversefor data manipulation and visualization
ggpubrfor creating easily publication ready plots
rstatixprovides pipe-friendly R functions for easy statistical analyses.
datarium: contains required data sets for this chapter.
Start by loading the following required packages:
library(tidyverse) library(ggpubr) library(rstatix)
genderweight [in datarium package] containing the weight of 40 individuals (20 women and 20 men).
Load the data and show some random rows by groups:
# Load the data data("genderweight", package = "datarium") # Show a sample of the data by group set.seed(123) genderweight %>% sample_n_by(group, size = 2)
## # A tibble: 4 x 3 ## id group weight ## <fct> <fct> <dbl> ## 1 6 F 65.0 ## 2 15 F 65.9 ## 3 29 M 88.9 ## 4 37 M 77.0
Outliers can be easily identified using boxplot methods, implemented in the R function
identify_outliers() [rstatix package].
genderweight %>% group_by(group) %>% identify_outliers(weight)
## # A tibble: 2 x 5 ## group id weight is.outlier is.extreme ## <fct> <fct> <dbl> <lgl> <lgl> ## 1 F 20 68.8 TRUE FALSE ## 2 M 31 95.1 TRUE FALSE
There were no extreme outliers.
Note that, in the situation where you have extreme outliers, this can be due to: 1) data entry errors, measurement errors or unusual values.
Yo can include the outlier in the analysis anyway if you do not believe the result will be substantially affected. This can be evaluated by comparing the result of the t-test with and without the outlier.
It’s also possible to keep the outliers in the data and perform Wilcoxon test or robust t-test using the WRS2 package.
Check normality by groups
The normality assumption can be checked by computing the Shapiro-Wilk test for each group. If the data is normally distributed, the p-value should be greater than 0.05.
genderweight %>% group_by(group) %>% shapiro_test(weight)
## # A tibble: 2 x 4 ## group variable statistic p ## <fct> <chr> <dbl> <dbl> ## 1 F weight 0.938 0.224 ## 2 M weight 0.986 0.989
From the output, the two p-values are greater than the significance level 0.05 indicating that the distribution of the data are not significantly different from the normal distribution. In other words, we can assume the normality.
You can also create QQ plots for each group. QQ plot draws the correlation between a given data and the normal distribution.
ggqqplot(genderweight, x = "weight", facet.by = "group")
All the points fall approximately along the (45-degree) reference line, for each group. So we can assume normality of the data.
Note that, if your sample size is greater than 50, the normal QQ plot is preferred because at larger sample sizes the Shapiro-Wilk test becomes very sensitive even to a minor deviation from normality.
Note that, in the situation where the data are not normally distributed, it’s recommended to use the non parametric two-samples Wilcoxon test.
Check the equality of variances
This can be done using the Levene’s test. If the variances of groups are equal, the p-value should be greater than 0.05.
genderweight %>% levene_test(weight ~ group)
## # A tibble: 1 x 4 ## df1 df2 statistic p ## <int> <int> <dbl> <dbl> ## 1 1 38 6.12 0.0180
The p-value of the Levene’s test is significant, suggesting that there is a significant difference between the variances of the two groups. Therefore, we’ll use the Welch t-test, which doesn’t assume the equality of the two variances.
Recommended for you
This section contains best data science and self-development resources to help you on your path.
Coursera - Online Courses and Specialization
- Course: Machine Learning: Master the Fundamentals by Standford
- Specialization: Data Science by Johns Hopkins University
- Specialization: Python for Everybody by University of Michigan
- Courses: Build Skills for a Top Job in any Industry by Coursera
- Specialization: Master Machine Learning Fundamentals by University of Washington
- Specialization: Statistics with R by Duke University
- Specialization: Software Development in R by Johns Hopkins University
- Specialization: Genomic Data Science by Johns Hopkins University
Popular Courses Launched in 2020
- Google IT Automation with Python by Google
- AI for Medicine by deeplearning.ai
- Epidemiology in Public Health Practice by Johns Hopkins University
- AWS Fundamentals by Amazon Web Services
- The Science of Well-Being by Yale University
- Google IT Support Professional by Google
- Python for Everybody by University of Michigan
- IBM Data Science Professional Certificate by IBM
- Business Foundations by University of Pennsylvania
- Introduction to Psychology by Yale University
- Excel Skills for Business by Macquarie University
- Psychological First Aid by Johns Hopkins University
- Graphic Design by Cal Arts
Books - Data Science
- Practical Guide to Cluster Analysis in R by A. Kassambara (Datanovia)
- Practical Guide To Principal Component Methods in R by A. Kassambara (Datanovia)
- Machine Learning Essentials: Practical Guide in R by A. Kassambara (Datanovia)
- R Graphics Essentials for Great Data Visualization by A. Kassambara (Datanovia)
- GGPlot2 Essentials for Great Data Visualization in R by A. Kassambara (Datanovia)
- Network Analysis and Visualization in R by A. Kassambara (Datanovia)
- Practical Statistics in R for Comparing Groups: Numerical Variables by A. Kassambara (Datanovia)
- Inter-Rater Reliability Essentials: Practical Guide in R by A. Kassambara (Datanovia)
- R for Data Science: Import, Tidy, Transform, Visualize, and Model Data by Hadley Wickham & Garrett Grolemund
- Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems by Aurelien Géron
- Practical Statistics for Data Scientists: 50 Essential Concepts by Peter Bruce & Andrew Bruce
- Hands-On Programming with R: Write Your Own Functions And Simulations by Garrett Grolemund & Hadley Wickham
- An Introduction to Statistical Learning: with Applications in R by Gareth James et al.
- Deep Learning with R by François Chollet & J.J. Allaire
- Deep Learning with Python by François Chollet