Chapter 5 Common Wrangling I
For this week, we will once again be working with the titanic_train
data set from the titanic
package. This data set provides information on the fate of passengers on the fatal maiden voyage of the ocean liner “Titanic” summarized according to economic status (class), sex, age, and survival. As a reminder, here are some of the important columns:
Survived
: binary variable equal to 1 if the passenger survivedPclass
: the passenger’s className
: the passenger’s nameSex
: the sex of the passengerAge
: the age of the passenger
Now let’s load in the data and necessary packages.
## install the package if you do not have it
# install.packages("titanic")
library(titanic)
library(tidyverse)
library(janitor)
## saving our data to the name titanic as a tibble
<- tibble(titanic_train) titanic
We will be focusing on getting comfortable with the following functions in the tidyverse
package:
select
distinct
mutate
group_by
summarize
We will also learn how to use the following functions in the janitor
package:
clean_names
We will be using these functions in conjunction with the pipe (typed as %>%
) operator. By going through these exercises, you will see proper ways to utilize functions and learn to write code in a readable and reproducible way.
5.1 The Pipe
A pipe (typed as %>%
) is a specific operator that comes from the magrittr
package, but is automatically loaded in with the tidyverse
. Essentially, it makes reading code easier, typing code faster, and finding complicated results very easy. A pipe essentially is saying “and now, do this” to a tibble. Piping is best used to chain together multiple functions to subset your data set into smaller pieces that you find more interesting. We will see this pipe in action in the following sections. Mastering the pipe is essential to quick and efficient cleaning, and you can find some incredible results with using pipes and a few simple functions. To demonstrate how a pipe works, I’ll use a rather bland motivating example: suppose we want to glance at our titanic
data using the head
function. Before the pipe existed, we would need to type the following:
## without a pipe
head(titanic)
However, with a pipe, we can type the following:
## using a pipe
%>%
titanic head()
Each of these give us the same result. The way the pipe works, is that it automatically fills the “data” argument of a function with the tibble you are piping from (in this case, the titanic
tibble). While this seems extraneous in this example, the benefit of the pipe is that you can combine multiple functions together in a readable way. We will see a demonstration of this as we get further into this Guided Exercise.
5.2 The clean_names
function
The clean_names
function allows us to put all of our column names in our tibble in a standardized format. In particular, the function makes certain that the column names are all lowercase and blank spaces are replaced with an underscore. Keeping your variable names standardized is an important practice that will make data wrangling much easier as your data sets get bigger and you begin collaborating with others.
Let’s take a look at our data set without cleaning the names.
## viewing a snapshot of our data set
%>%
titanic head()
## # A tibble: 6 × 12
## PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
## <int> <int> <int> <chr> <chr> <dbl> <int> <int> <chr> <dbl> <chr> <chr>
## 1 1 0 3 Brau… male 22 1 0 A/5 2… 7.25 "" S
## 2 2 1 1 Cumi… fema… 38 1 0 PC 17… 71.3 "C85" C
## 3 3 1 3 Heik… fema… 26 0 0 STON/… 7.92 "" S
## 4 4 1 1 Futr… fema… 35 1 0 113803 53.1 "C12… S
## 5 5 0 3 Alle… male 35 0 0 373450 8.05 "" S
## 6 6 0 3 Mora… male NA 0 0 330877 8.46 "" Q
Notice here we used the pipe. The pipe told us to take the titanic
data set, and perform the head
function to it. In terms of piping language we could specify what happened:
- Use titanic data set
- And now take the
head
of the titanic data set
Another important aspect of this output to notice is that our columns begin with capital letters (e.g. Class
). As stated, we can use the clean_names
function from the janitor package to standardize the column name format to all lower case and underscores. Observe:
## Using the clean_names function on the titanic data
%>%
titanic clean_names()
The column names now all have our desired standardized format. Now let’s try to use our previous head
function:
## Using the head function again
%>%
titanic head()
## # A tibble: 6 × 12
## PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
## <int> <int> <int> <chr> <chr> <dbl> <int> <int> <chr> <dbl> <chr> <chr>
## 1 1 0 3 Brau… male 22 1 0 A/5 2… 7.25 "" S
## 2 2 1 1 Cumi… fema… 38 1 0 PC 17… 71.3 "C85" C
## 3 3 1 3 Heik… fema… 26 0 0 STON/… 7.92 "" S
## 4 4 1 1 Futr… fema… 35 1 0 113803 53.1 "C12… S
## 5 5 0 3 Alle… male 35 0 0 373450 8.05 "" S
## 6 6 0 3 Mora… male NA 0 0 330877 8.46 "" Q
The columns have gone back to their normal messy ways! This is because we failed to save the changes we made to our titanic data set. The pipe operator will perform functions on your tibble, but it will not save the changes unless you explicitly tell R to do so.
Let’s use the clean_names
function, and save the tibble with the cleaned column names.
## saving the titanic tibble with cleaned names
<- titanic %>%
titanic clean_names()
## viewing the cleaned names data set as it has now been saved
%>%
titanic head()
## # A tibble: 6 × 12
## passenger_id survived pclass name sex age sib_sp parch ticket fare cabin
## <int> <int> <int> <chr> <chr> <dbl> <int> <int> <chr> <dbl> <chr>
## 1 1 0 3 Braund, Mr.… male 22 1 0 A/5 2… 7.25 ""
## 2 2 1 1 Cumings, Mr… fema… 38 1 0 PC 17… 71.3 "C85"
## 3 3 1 3 Heikkinen, … fema… 26 0 0 STON/… 7.92 ""
## 4 4 1 1 Futrelle, M… fema… 35 1 0 113803 53.1 "C12…
## 5 5 0 3 Allen, Mr. … male 35 0 0 373450 8.05 ""
## 6 6 0 3 Moran, Mr. … male NA 0 0 330877 8.46 ""
## # ℹ 1 more variable: embarked <chr>
For the rest of this guided exercise, we will be working with this tibble.
5.3 The select
function
The select
function is a way to subset your data. It selects whichever variables you are particularly concerned with. For instance, suppose we were only interested in the age
and survived
columns of the titanic
tibble. We could use the select
function to observe only these columns.
## selecting only the age and survived columns and then previewing with head
%>%
titanic select(age, survived) %>%
head()
## # A tibble: 6 × 2
## age survived
## <dbl> <int>
## 1 22 0
## 2 38 1
## 3 26 1
## 4 35 1
## 5 35 0
## 6 NA 0
There are a few things that should be noted here. First, we did not save this sub-selection of variables as a new tibble, so this is just a temporary sub-selection. Second, we performed two pipes with one pipe on each line until the ending function. This code could be read as:
- Use the
titanic
tibble - And then
select
theage
andsurvived
columns - And then use the
head
function to view
Of course, if we wanted to save our sub-selection, we could easily do this by assigning it to a new tibble.
## assigning the subselection to a new tibble
<- titanic %>%
titanic_age_survived select(age, survived)
## did not use head here because do not want only the first 5 rows to be saved
Generally, the select
function is a great way to subset your data to focus on only the columns you are particularly concerned with.
5.4 The filter
function
The filter
function is one of the most powerful and frequently used functions when combined with a pipe. The filter function filters your data set based on some criteria you choose. For example, suppose we want to only look at children in this data set. We can filter out all of the passengers in the data set that have an age less than 18. Observe:
%>%
titanic filter(age < 18) %>%
head()
## # A tibble: 6 × 12
## passenger_id survived pclass name sex age sib_sp parch ticket fare cabin
## <int> <int> <int> <chr> <chr> <dbl> <int> <int> <chr> <dbl> <chr>
## 1 8 0 3 "Palsson, M… male 2 3 1 349909 21.1 ""
## 2 10 1 2 "Nasser, Mr… fema… 14 1 0 237736 30.1 ""
## 3 11 1 3 "Sandstrom,… fema… 4 1 1 PP 95… 16.7 "G6"
## 4 15 0 3 "Vestrom, M… fema… 14 0 0 350406 7.85 ""
## 5 17 0 3 "Rice, Mast… male 2 4 1 382652 29.1 ""
## 6 23 1 3 "McGowan, M… fema… 15 0 0 330923 8.03 ""
## # ℹ 1 more variable: embarked <chr>
Just to further our understanding, let’s once again write out what this code is doing:
- Use the titanic tibble
- And then filter out only rows that have
age
equal to “Child” - And then give the heading of the tibble
5.5 The drop_na
function
The drop_na
function allows you to drop any missing values from specified columns. Notice that in the titanic
tibble there are NA
values (i.e. missing values) in the age
column. Suppose we wish to drop these from the tibble so that they do not interfere with our analysis. We can easily do this with the following code:
## dropping all missing values from age column
## notice the tibble now only has 714 rows
%>%
titanic drop_na(age) %>%
head()
## # A tibble: 6 × 12
## passenger_id survived pclass name sex age sib_sp parch ticket fare cabin
## <int> <int> <int> <chr> <chr> <dbl> <int> <int> <chr> <dbl> <chr>
## 1 1 0 3 Braund, Mr.… male 22 1 0 A/5 2… 7.25 ""
## 2 2 1 1 Cumings, Mr… fema… 38 1 0 PC 17… 71.3 "C85"
## 3 3 1 3 Heikkinen, … fema… 26 0 0 STON/… 7.92 ""
## 4 4 1 1 Futrelle, M… fema… 35 1 0 113803 53.1 "C12…
## 5 5 0 3 Allen, Mr. … male 35 0 0 373450 8.05 ""
## 6 7 0 1 McCarthy, M… male 54 0 0 17463 51.9 "E46"
## # ℹ 1 more variable: embarked <chr>
Note that the documentation states that if no columns are specified in the drop_na
function, then all NA
values from every column will be dropped.
## drops all NA values from every column
## notice that here, the number of rows is the same as above
## this means age contains all of the rows with missing data
%>%
titanic drop_na()
5.6 The distinct
function
The distinct
function allows you to see the unique values within a specified column. For instance, suppose we wanted to know all of the unique values that are within the pclass
column of the titanic
tibble. We could use the distinct
function to do this.
## using distinct to find unique values in a column
%>%
titanic distinct(pclass)
## # A tibble: 3 × 1
## pclass
## <int>
## 1 3
## 2 1
## 3 2
We can see from the distinct
function that the pclass
column has three unique values: 1, 2, and 3 which correspond to the passengers’ class. The distinct
function can be a great way to take a look at your data and figure out what kind of values reside within specific columns.
5.7 The mutate
function.
The mutate
function creates a new column in your tibble based on some computation statement. To motivate this, suppose we wanted to create a column in the titanic
tibble that is named adjusted_fare
which takes the fare
column and multiplies it by the rate of inflation to get the ticket fare in today’s prices. Using the mutate
function, we could accomplish this:
## assigning a variable the inflation rate
<- 27.14
inflation_rate
## creating a new variable called adjusted_fare which will be the fare in today's dollars
%>%
titanic mutate(adjusted_fare = fare * inflation_rate) %>%
head()
## # A tibble: 6 × 13
## passenger_id survived pclass name sex age sib_sp parch ticket fare cabin
## <int> <int> <int> <chr> <chr> <dbl> <int> <int> <chr> <dbl> <chr>
## 1 1 0 3 Braund, Mr.… male 22 1 0 A/5 2… 7.25 ""
## 2 2 1 1 Cumings, Mr… fema… 38 1 0 PC 17… 71.3 "C85"
## 3 3 1 3 Heikkinen, … fema… 26 0 0 STON/… 7.92 ""
## 4 4 1 1 Futrelle, M… fema… 35 1 0 113803 53.1 "C12…
## 5 5 0 3 Allen, Mr. … male 35 0 0 373450 8.05 ""
## 6 6 0 3 Moran, Mr. … male NA 0 0 330877 8.46 ""
## # ℹ 2 more variables: embarked <chr>, adjusted_fare <dbl>
There is actually quite a bit going on here, so it’s worth noting the syntax of the mutate
function.
## mutate function syntax
%>%
titanic mutate(your_variable_name = some expression)
Recall that the tibble will not save with this new variable that you created unless you tell it to do so.
Where mutate
becomes very powerful is using in conjunction with the ifelse
function. The ifelse
function is a function that takes a conditional statement, and if it is TRUE, assigns a value, and if it is FALSE, assigns a different value. For instance, suppose we want to create a binary variable equal to 1 if a person is under the age of 18 and 0 if are not. Hence, we can use the ifelse
function to create our desired binary variable. The syntax for the ifelse
function is as follows:
## ifelse syntax
ifelse(a condition, value if condition is TRUE, value if condition is FALSE)
This will be more clear once you see it in action. Let’s actually create the desired binary variable:
%>%
titanic mutate(child = ifelse(age < 18, 1, 0)) %>%
head()
## # A tibble: 6 × 13
## passenger_id survived pclass name sex age sib_sp parch ticket fare cabin
## <int> <int> <int> <chr> <chr> <dbl> <int> <int> <chr> <dbl> <chr>
## 1 1 0 3 Braund, Mr.… male 22 1 0 A/5 2… 7.25 ""
## 2 2 1 1 Cumings, Mr… fema… 38 1 0 PC 17… 71.3 "C85"
## 3 3 1 3 Heikkinen, … fema… 26 0 0 STON/… 7.92 ""
## 4 4 1 1 Futrelle, M… fema… 35 1 0 113803 53.1 "C12…
## 5 5 0 3 Allen, Mr. … male 35 0 0 373450 8.05 ""
## 6 6 0 3 Moran, Mr. … male NA 0 0 330877 8.46 ""
## # ℹ 2 more variables: embarked <chr>, child <dbl>
Since we cannot see from the preview that we actually created a binary variable, let’s use the distinct
function as a check.
%>%
titanic mutate(child = ifelse(age < 18, 1, 0)) %>%
distinct(child)
## # A tibble: 3 × 1
## child
## <dbl>
## 1 0
## 2 NA
## 3 1
REMEMBER the child
variable DID NOT save unless you specifically tell R to do so. We will save this variable as we will use it later.
## saving the new variable
<- titanic %>%
titanic mutate(child = ifelse(age < 18, 1, 0))
## observing the first 5 rows
%>%
titanic head()
## # A tibble: 6 × 13
## passenger_id survived pclass name sex age sib_sp parch ticket fare cabin
## <int> <int> <int> <chr> <chr> <dbl> <int> <int> <chr> <dbl> <chr>
## 1 1 0 3 Braund, Mr.… male 22 1 0 A/5 2… 7.25 ""
## 2 2 1 1 Cumings, Mr… fema… 38 1 0 PC 17… 71.3 "C85"
## 3 3 1 3 Heikkinen, … fema… 26 0 0 STON/… 7.92 ""
## 4 4 1 1 Futrelle, M… fema… 35 1 0 113803 53.1 "C12…
## 5 5 0 3 Allen, Mr. … male 35 0 0 373450 8.05 ""
## 6 6 0 3 Moran, Mr. … male NA 0 0 330877 8.46 ""
## # ℹ 2 more variables: embarked <chr>, child <dbl>
5.8 The summarize
function
The summarize
function allows us to create statistics over columns quickly and efficiently. As a demonstration, we will be focusing on our survived
column. Let’s suppose that we wanted to know the average of survived
. Since this is a binary variable, this would be equivalent to the proportion of people who survived the titanic.
Now let’s make a new column called proportion_survived
which is equal to the mean of survived
.
## finding the average of the survived column
%>%
titanic summarize(proportion_survived = mean(survived, na.rm = T))
## # A tibble: 1 × 1
## proportion_survived
## <dbl>
## 1 0.384
Take a closer look at the summarize
function. You can think of the summarize
function as similar to the mutate
function as it creates a new variable equal to some summary statistic that you tell it to do. The basic syntax for the summarize
function is as follows:
summarize(your_variable_name = somefunction)
As another example, we could find the standard deviation of the survived
column using the summarize function.
##finding the standard deviation of the survived column
%>%
titanic summarize(sd_survived = sd(survived, na.rm = T))
## # A tibble: 1 × 1
## sd_survived
## <dbl>
## 1 0.487
5.9 The group_by
and summarize
functions
The group_by
and summarize
functions work together to make computing statistics within-groups easy. Suppose you wanted to know the average rate of survival by class type on the titanic. In other words, you suspect that the survival rate differs by people of different class To do this, you want to take the average of each group. The group_by
function will group classes together and then the summarize
function will be able to do summary statistics on each group individually.
## finding the survival rate among classes
%>%
titanic group_by(pclass) %>%
summarize(survival_rate = mean(survived, na.rm = T))
## # A tibble: 3 × 2
## pclass survival_rate
## <int> <dbl>
## 1 1 0.630
## 2 2 0.473
## 3 3 0.242
From here you can see that the survival rate greatly varied across different classes. It appears that survival was much more prevalent for higher class people. An interesting result! Let’s also review what is happening here in the language of pipes:
- Take the
titanic
data - And now group by class
- And now summarize the survival rate by creating a column equal to the mean of the
survived
column within each class.
You may be curious what happens when you do a group_by
without a summarize
. The truth is, nothing happens! R will create a grouping, but it means nothing unless you actually perform some sort of meaningful statistic on each group.
## using a group_by without a summarize or following function does nothing
%>%
titanic group_by(sex)
## # A tibble: 891 × 13
## # Groups: sex [2]
## passenger_id survived pclass name sex age sib_sp parch ticket fare cabin
## <int> <int> <int> <chr> <chr> <dbl> <int> <int> <chr> <dbl> <chr>
## 1 1 0 3 Braund, Mr… male 22 1 0 A/5 2… 7.25 ""
## 2 2 1 1 Cumings, M… fema… 38 1 0 PC 17… 71.3 "C85"
## 3 3 1 3 Heikkinen,… fema… 26 0 0 STON/… 7.92 ""
## 4 4 1 1 Futrelle, … fema… 35 1 0 113803 53.1 "C12…
## 5 5 0 3 Allen, Mr.… male 35 0 0 373450 8.05 ""
## 6 6 0 3 Moran, Mr.… male NA 0 0 330877 8.46 ""
## 7 7 0 1 McCarthy, … male 54 0 0 17463 51.9 "E46"
## 8 8 0 3 Palsson, M… male 2 3 1 349909 21.1 ""
## 9 9 1 3 Johnson, M… fema… 27 0 2 347742 11.1 ""
## 10 10 1 2 Nasser, Mr… fema… 14 1 0 237736 30.1 ""
## # ℹ 881 more rows
## # ℹ 2 more variables: embarked <chr>, child <dbl>
Since it is imperative to understand the group_by
followed by the summarize
function, try out a couple of exercises.
5.9.1 Exercise
- Did women have a higher rate of survival than males? Find the answer to this question using the
group_by
andsummarize
functions.
5.9.2 Exercise
- Is there a difference in survival rates between women and men who were in a higher class? Using the
group_by
andsummarize
functions, find the answer to this question. Hint: put two arguments in thegroup_by
function.
## `summarise()` has grouped output by 'sex'. You can override using the `.groups`
## argument.
5.10 Selected Solutions
- (Exercise 1.7.1) Variance = 0.24
- (Exercise 1.8.1) Women had higher survival rates at 0.74. Males were at 0.19.
- (Exercise 1.8.2) Females in first, second, and third class had survivals rate of 0.97, 0.92, and 0.5 respectively. On the other hand, males in first, second, and third class had survival rates of 0.37, 0.16, and 0.14 respectively.