R is a free and powerful statistical software for analyzing and visualizing data. If you want to learn easily the essential of R programming, visit our series of tutorials available on STHDA: http://www.sthda.com/english/wiki/r-basics-quick-and-easy.
In this chapter, you will learn:
- a very brief introduction to R, for installing R/RStudio as well as importing your data into R and installing required libraries.
- introduction to categorical data structure
- basics of creating contingency tables
- Install R and RStudio
- Install and load required R packages
- Data format
- Import your data in R
- Demo data sets
- Data manipulation
- Working with Categorical data
- Close your R/RStudio session
Install R and RStudio
R and RStudio can be installed on Windows, MAC OSX and Linux platforms. RStudio is an integrated development environment for R that makes using R easier. It includes a console, code editor and tools for plotting.
- R can be downloaded and installed from the Comprehensive R Archive Network (CRAN) webpage (http://cran.r-project.org/)
- After installing R software, install also the RStudio software available at: http://www.rstudio.com/products/RStudio/.
- Launch RStudio and start use R inside R studio.
R can be also accessed online without any installation. You can find an example at https://rdrr.io/snippets/. This site include thousands add-on packages.
Install and load required R packages
An R package is a collection of functionalities that extends the capabilities of base R. For example, to use the R code provided in this book, you should install the following R packages:
tidyversepackages, which are a collection of R packages that share the same programming philosophy. These packages include:
readr: for importing data into R
dplyr: for data manipulation
ggplot2: for data visualization.
datarium: contains demo data for statistical analyses.
psychpackages: for inter-rater reliability measures. which makes it easy, for beginner, to create publication ready plots
- Install the tidyverse package. Installing tidyverse will install automatically readr, dplyr, ggplot2 and more. Type the following code in the R console:
- Install datarium, irr, vcd and psych
install.packages("datarium") install.packages("irr") install.packages("vcd") install.packages("psych")
- Load required packages. After installation, you must first load the package for using the functions in the package. The function
library()is used for this task. An alternative function is
require(). For example, to load the
vcdpackage, type this:
Now, we can use R functions, such as
Kappa() [in the vcd package] for computing Cohen’s Kappa and weighted kappa.
If you want a help about a given function, say
Kappa(), type this in R console:
Your data should be in rectangular format, where columns are variables and rows are observations (individuals or samples).
- Column names should be compatible with R naming conventions. Avoid column with blank space and special characters. Good column names:
long.jump. Bad column name:
- Avoid beginning column names with a number. Use letter instead. Good column names:
x100m. Bad column name:
- Replace missing values by
NA(for not available)
For example, your data should look like this:
Import your data in R
First, save your data into txt or csv file formats and import it as follow (you will be asked to choose the file):
library("readr") # Reads tab delimited files (.txt tab) my_data <- read_tsv(file.choose()) # Reads comma (,) delimited files (.csv) my_data <- read_csv(file.choose()) # Reads semicolon(;) separated files(.csv) my_data <- read_csv2(file.choose())
Read more about how to import data into R at this link: http://www.sthda.com/english/wiki/importing-data-into-r
Demo data sets
R comes with several demo data sets for playing with R functions. The most used R demo data sets include: USArrests, iris and mtcars. To load a demo data set, use the function data() as follow. The function
head() is used to inspect the data.
data("iris") # Loading head(iris, n = 3) # Print the first n = 3 rows
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species ## 1 5.1 3.5 1.4 0.2 setosa ## 2 4.9 3.0 1.4 0.2 setosa ## 3 4.7 3.2 1.3 0.2 setosa
To learn more about iris data sets, type this:
After typing the above R code, you will see the description of
iris data set: this iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.
After importing your data in R, you can easily manipulate it using the
dplyr package, which can be installed using the R code:
After loading dplyr, you can use the following R functions:
filter(): Pick rows (observations/samples) based on their values.
distinct(): Remove duplicate rows.
arrange(): Reorder the rows.
select(): Select columns (variables) by their names.
rename(): Rename columns.
mutate(): Add/create new variables.
summarise(): Compute statistical summaries (e.g., computing the mean or the sum)
group_by(): Operate on subsets of the data set.
Note that, dplyr package allows to use the forward-pipe chaining operator (%>%) for combining multiple operations. For example, x %>% f is equivalent to f(x). Using the pipe (%>%), the output of each operation is passed to the next operation. This makes R programming easy.
Read more about Data Manipulation at this link: https://www.datanovia.com/en/courses/data-manipulation-in-r/
Working with Categorical data
Categorical variables are variables whose values comprise a set of groups. Examples of categorical variables include: gender (male, female), passenger’s classes (1st, 2nd, 3rd class), smokers (yes, no), eye color (brown, blue, Hazel, Green) etc.
There are different types of categorical variables depending on the number of categories they contain:
- Binary variables or dichotomous variables contain only two groups
- Polytomous variables contain three or more groups
Categorical variables containing ordered categories, such as passenger’s classes (1st < 2nd < 3rd), are called ordinal variables. Categorical variable containing unordered categories, such as eye color (brown, blue, Hazel, Green), are called nominal variable.
Categorical data can be available into different forms, including:
- Case form, in which each row corresponds to a case (or individual)
- Frequency form, in which the data are tabulated. The table cells contain the frequencies of categories (cross-classified or not).
Example of case form:
## Hair Eye ## 1 Black Brown ## 2 Black Brown ## 3 Black Brown ## 4 Black Brown
Example of frequency form (1/2) : Cross-tabulation
## Eye ## Hair Brown Blue Hazel Green ## Black 32 11 10 3 ## Brown 53 50 25 15 ## Red 10 10 7 7 ## Blond 3 30 5 8
Example of frequency form (2/2) : Data frame
## Hair Eye Freq ## 1 Black Brown 32 ## 2 Brown Brown 53 ## 3 Red Brown 10 ## 4 Blond Brown 3
Create contingency table
This section describes how to create contingency table in R. You will learn how to:
- Create one-, two- and three-way tables containing the frequency counts
- Compute proportions and, add rows and columns totals
- Transform a multi-way contingency table into a beautiful two-way flat format
- Subset a contingency table
data("titanic.raw", package = "datarium") head(titanic.raw)
## Class Sex Age Survived ## 1 3rd Male Child No ## 2 3rd Male Child No ## 3 3rd Male Child No ## 4 3rd Male Child No ## 5 3rd Male Child No ## 6 3rd Male Child No
Key R functions
xtabs(): Create contingency tables containing frequencies.
xtabsallows to use formulas.
prop.table(): Compute proportions over rows, columns or overall totals
margin.table(): Compute rows, columns or overall totals (i.e, sums)
Create contingency tables
Using xtabs(): formula interface
# One way table xtabs(~Class, data = titanic.raw)
## Class ## 1st 2nd 3rd Crew ## 325 285 706 885
# Two way table xtabs(~Class + Survived, data = titanic.raw)
## Survived ## Class No Yes ## 1st 122 203 ## 2nd 167 118 ## 3rd 528 178 ## Crew 673 212
# Three way table xtabs(~Class + Survived + Sex, data = titanic.raw)
## , , Sex = Male ## ## Survived ## Class No Yes ## 1st 118 62 ## 2nd 154 25 ## 3rd 422 88 ## Crew 670 192 ## ## , , Sex = Female ## ## Survived ## Class No Yes ## 1st 4 141 ## 2nd 13 93 ## 3rd 106 90 ## Crew 3 20
# One way table ----------- # use this: table(titanic.raw$Class) # Or this table(titanic.raw[, "Class"]) # or this: with(titanic.raw, table(Class)) # Two way table ----------- with(titanic.raw, table(Class, Survived)) # Three way table ----------- with(titanic.raw, table(Class, Survived, Sex))
Compute margins: row and column totals
# Two way table xtab <- xtabs(~Class + Survived, data = titanic.raw) # Compute row totals margin.table(xtab, 1)
## Class ## 1st 2nd 3rd Crew ## 325 285 706 885
# Compute column totals margin.table(xtab, 2)
## Survived ## No Yes ## 1490 711
You can also use the functions
Add rows and columns sums to the table
## Survived ## Class No Yes Sum ## 1st 122 203 325 ## 2nd 167 118 285 ## 3rd 528 178 706 ## Crew 673 212 885 ## Sum 1490 711 2201
# Frequencies relative to row total prop.table(xtab, 1)
## Survived ## Class No Yes ## 1st 0.375 0.625 ## 2nd 0.586 0.414 ## 3rd 0.748 0.252 ## Crew 0.760 0.240
# Frequencies relative to column total prop.table(xtab, 2)
## Survived ## Class No Yes ## 1st 0.0819 0.2855 ## 2nd 0.1121 0.1660 ## 3rd 0.3544 0.2504 ## Crew 0.4517 0.2982
# Frequencies relative to the table grand total xtab/sum(xtab)
## Survived ## Class No Yes ## 1st 0.0554 0.0922 ## 2nd 0.0759 0.0536 ## 3rd 0.2399 0.0809 ## Crew 0.3058 0.0963
Close your R/RStudio session
Each time you close R/RStudio, you will be asked whether you want to save the data from your R session. If you decide to save, the data will be available in future R sessions.
This chapter provides a quick introduction to R and a brief description of how to work with categorical data in R.
Recommended for you
This section contains best data science and self-development resources to help you on your path.
Coursera - Online Courses and Specialization
- Course: Machine Learning: Master the Fundamentals by Stanford
- Specialization: Data Science by Johns Hopkins University
- Specialization: Python for Everybody by University of Michigan
- Courses: Build Skills for a Top Job in any Industry by Coursera
- Specialization: Master Machine Learning Fundamentals by University of Washington
- Specialization: Statistics with R by Duke University
- Specialization: Software Development in R by Johns Hopkins University
- Specialization: Genomic Data Science by Johns Hopkins University
Popular Courses Launched in 2020
- Google IT Automation with Python by Google
- AI for Medicine by deeplearning.ai
- Epidemiology in Public Health Practice by Johns Hopkins University
- AWS Fundamentals by Amazon Web Services
- The Science of Well-Being by Yale University
- Google IT Support Professional by Google
- Python for Everybody by University of Michigan
- IBM Data Science Professional Certificate by IBM
- Business Foundations by University of Pennsylvania
- Introduction to Psychology by Yale University
- Excel Skills for Business by Macquarie University
- Psychological First Aid by Johns Hopkins University
- Graphic Design by Cal Arts
Amazing Selling Machine
- Free Training - How to Build a 7-Figure Amazon FBA Business You Can Run 100% From Home and Build Your Dream Life! by ASM
Books - Data Science
- Practical Guide to Cluster Analysis in R by A. Kassambara (Datanovia)
- Practical Guide To Principal Component Methods in R by A. Kassambara (Datanovia)
- Machine Learning Essentials: Practical Guide in R by A. Kassambara (Datanovia)
- R Graphics Essentials for Great Data Visualization by A. Kassambara (Datanovia)
- GGPlot2 Essentials for Great Data Visualization in R by A. Kassambara (Datanovia)
- Network Analysis and Visualization in R by A. Kassambara (Datanovia)
- Practical Statistics in R for Comparing Groups: Numerical Variables by A. Kassambara (Datanovia)
- Inter-Rater Reliability Essentials: Practical Guide in R by A. Kassambara (Datanovia)
- R for Data Science: Import, Tidy, Transform, Visualize, and Model Data by Hadley Wickham & Garrett Grolemund
- Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems by Aurelien Géron
- Practical Statistics for Data Scientists: 50 Essential Concepts by Peter Bruce & Andrew Bruce
- Hands-On Programming with R: Write Your Own Functions And Simulations by Garrett Grolemund & Hadley Wickham
- An Introduction to Statistical Learning: with Applications in R by Gareth James et al.
- Deep Learning with R by François Chollet & J.J. Allaire
- Deep Learning with Python by François Chollet