Chapter 10 Strings
For this week, we will once again be importing the horror movies data set from the TidyTuesday Github. This data set contains information from IMDB on various horror movies. Some important columns we will be focusing on this week:
- title: the title of the movie
- release_date: movie release date in day-month-year format
Let’s load in the data:
## load the tidyverse (install the package first if you do not have it)
library(tidyverse)
## Load in the data
horror_movies <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-10-22/horror_movies.csv")
## Rows: 3328 Columns: 12
## ── Column specification ───────────────────────────────────────────────────────────────
## Delimiter: ","
## chr (11): title, genres, release_date, release_country, movie_rating, movie_run_tim...
## dbl (1): review_rating
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
If you are having trouble viewing the link, you can get the data by visiting the TidyTuesday Github Repository and following the instructions on loading in the data.
Our goal this week is to get comfortable with three things:
- Simple pattern matching with regular expressions
- Using the extract function
- Using the separate function
Together, these three tools let you perform powerful data wrangling that is highly valued in the workforce.
10.1 Regular Expressions
Regular expressions are a fantastic tool that will allow you to match patterns in strings. To demonstrate their value, we will start off by making a vector of strings and use str_view to show how we can match patterns using regular expressions.
## creating a vector of strings I would like to analyze
x = c("Econ 145", "Hello!", "Best CLASS")
Let’s see if we can match some basic patterns. Suppose I want to find the “o” in each component of the vector. I can do the following:
## finding all of the o's in the vector x using str_view
str_view(string = x, pattern = "o")
## [1] │ Ec<o>n 145
## [2] │ Hell<o>!
Notice that we are matching (see the highlighted text) on the pattern “o” in every element of the vector that contains one. There was no “o” to match on in the last element, “Best CLASS”, so it does not appear in the output.
We can take this even further by matching on a selection of numbers. Suppose we want to match the numbers “145”. We can simply pass “145” as our pattern.
## matching on the pattern 145
str_view(string = x, pattern = "145")
## [1] │ Econ <145>
However, these are all trivial examples. Regular expressions allow us to match on patterns, not just simple expressions. For example, suppose we have a vector which has multiple names of courses in the economics department:
## creating a new vector with three string elements
econ_classes = c("Econ 145", "Econ 10A", "Econ 140A")
Now suppose I want to pattern match only the course numbers. I can do this by using a regular expression.
- \d: the regular expression pattern that matches any digit
Let’s see this in action.
## matching incorrectly on any digit
str_view(string = econ_classes, pattern = "\d")
The first thing you should notice when you press Enter is that an error message occurs: Error: '\d' is an unrecognized escape in character string starting ""\d" (the exact wording can vary with your version of R). We get this error message because \ is an escape character in the R programming language. Therefore, we need to escape the escape character by using \\d to match on a digit. This can often be annoyingly difficult to remember, but the error message should guide you to remember this detail.
## matching correctly on any digit
str_view(string = econ_classes, pattern = "\\d")
## [1] │ Econ <1><4><5>
## [2] │ Econ <1><0>A
## [3] │ Econ <1><4><0>A
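To see why the double backslash is needed, it can help to look at what R actually stores when we type it. The following is a minimal sketch using only base R; the name pattern is chosen here purely for illustration.

```r
## the string typed as "\\d" holds two characters: a backslash and the letter d
pattern <- "\\d"

## counting the characters R actually stores
nchar(pattern)

## printing the string the way the regular expression engine will see it
writeLines(pattern)
```

In other words, the first backslash is consumed by R when it reads the string, and what is left over is the \d that the regular expression needs.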
We managed to match on a number in every single element of the vector without actually specifying we wanted a “1”. This is absolutely incredible.
It is important to note that \\d matches exactly one digit at a time: in the output above every digit is highlighted, but each one is its own separate match. If we want a single match to span more digits, we need to add in more \\ds. For instance, if we wanted to match on 2 digits in a row, our pattern argument would be equal to \\d\\d.
## matching on any two digits in each component
str_view(string = econ_classes, pattern = "\\d\\d")
## [1] │ Econ <14>5
## [2] │ Econ <10>A
## [3] │ Econ <14>0A
Let’s try to match on all of the course numbers. Notice that some course numbers have 2 digits, while others have 3. Let’s see what happens when we try to match on 3 digits.
## matching on exactly three digits in each component
str_view(string = econ_classes, pattern = "\\d\\d\\d")
## [1] │ Econ <145>
## [3] │ Econ <140>A
We can see that we matched on exactly 3 digits, and failed to match on the element that only contained 2 digits. However, as you may have expected, we have ways to work around this using what we call repetition patterns. You can specify the number of matches you would like to make using the following:
- {n}: matches exactly n repetitions of your pattern
- {n,}: matches n or more
- {,m}: matches at most m (equivalently, {0,m})
- {n,m}: matches between n and m
Hence, we can match on all the course digits by specifying how many times we would like to match the \d pattern.
## matches any digit 2 or more times
str_view(string = econ_classes, pattern = "\\d{2,}")
## [1] │ Econ <145>
## [2] │ Econ <10>A
## [3] │ Econ <140>A
Success! We managed to match the \d pattern two or more times and have matched on the course digits. The way we would read this pattern is left to right: “match on any digit, and do this two times or more”.
10.2 Regular Expressions Continued
Let’s try matching on other types of expressions.
## creating a new vector for us to match patterns on
vector <- c("aaabbbccc", "dddAAAccc", "eib iii")
Here are a couple more useful patterns:
- [abc]: matches exactly one character that is a, b, or c. If multiple appear, the one that comes first is matched first.
- [^abc]: matches exactly one character that is anything except a, b, or c.
- \s: matches any whitespace (e.g. space, tab, newline).
We will match on the letters a, d, or e, which include the beginning letter of each of the elements in our vector.
## matching on the letters a, d, or e
str_view(string = vector, pattern = "[ade]")
## [1] │ <a><a><a>bbbccc
## [2] │ <d><d><d>AAAccc
## [3] │ <e>ib iii
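The other two patterns in the list can be tried in the same way. The following is a minimal sketch, assuming the stringr functions str_detect and str_extract; the names has_space and first_other are made up for this illustration.

```r
library(stringr)

## re-creating the vector from above
vector <- c("aaabbbccc", "dddAAAccc", "eib iii")

## TRUE for each element that contains any whitespace
has_space <- str_detect(vector, "\\s")
has_space

## the first character in each element that is not a, b, or c
## (NA when every character is an a, b, or c)
first_other <- str_extract(vector, "[^abc]")
first_other
```

Notice that matching is case sensitive: the capital “A”s in the second element are not the same as a lowercase “a”.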
10.3 Using extract
We will now focus on matching regular expressions within the context of tidying data. Let’s specifically focus on the title and release_date columns.
## selecting only the title and release date columns in the horror movies data set
horror_movies <- horror_movies %>%
  select(title, release_date)
## showing the head of the data set
horror_movies %>% head()
## # A tibble: 6 × 2
## title release_date
## <chr> <chr>
## 1 Gut (2012) 26-Oct-12
## 2 The Haunting of Mia Moss (2017) 13-Jan-17
## 3 Sleepwalking (2017) 21-Oct-17
## 4 Treasure Chest of Horrors II (2013) 23-Apr-13
## 5 Infidus (2015) 10-Apr-15
## 6 In Extremis (2017) 2017
Notice that each movie title appears to be followed by a whitespace, and then parentheses with the movie release year inside. For the purposes of explaining extract, we will focus on extracting the years from the title column.
The extract function extracts information you want from a column and puts it into a column of your choice. The specific arguments it has are col, into, and regex. These correspond, respectively, to the column you want to extract an expression from, the name of the column that will receive the extracted expression, and the regular expression pattern you want to extract.
## extracting the year from the title column
horror_movies %>%
  extract(col = title, into = "year", regex = "(\\d\\d\\d\\d)")
## # A tibble: 3,328 × 2
## year release_date
## <chr> <chr>
## 1 2012 26-Oct-12
## 2 2017 13-Jan-17
## 3 2017 21-Oct-17
## 4 2013 23-Apr-13
## 5 2015 10-Apr-15
## 6 2017 2017
## 7 2013 3-Jun-14
## 8 2015 25-Apr-15
## 9 2015 28-May-17
## 10 2016 7-Oct-16
## # ℹ 3,318 more rows
It is important to note that we need to put parentheses around the pattern that we want to match in our regex argument for the match to work. Observe what happens when we do not:
## code does not work because there are no parentheses around the regex
horror_movies %>%
  extract(col = title, into = "year", regex = "\\d\\d\\d\\d")
This code fails with an error because the parentheses define a group of patterns that you want to match, and extract needs a group to know what to pull out. Everything inside your parentheses is a group. For the purposes of this class, we will only be extracting 1 group, as extracting more than one group gets very complicated.
## the correct code: works because we put parentheses around the regex
horror_movies %>%
  extract(col = title, into = "year", regex = "(\\d\\d\\d\\d)")
## # A tibble: 3,328 × 2
## year release_date
## <chr> <chr>
## 1 2012 26-Oct-12
## 2 2017 13-Jan-17
## 3 2017 21-Oct-17
## 4 2013 23-Apr-13
## 5 2015 10-Apr-15
## 6 2017 2017
## 7 2013 3-Jun-14
## 8 2015 25-Apr-15
## 9 2015 28-May-17
## 10 2016 7-Oct-16
## # ℹ 3,318 more rows
Notice that when we ran the previous correct code, our title column disappeared. This is because, by default, the extract function deletes the original column. To avoid this, we can set the remove argument to FALSE.
## keeping the original title column by setting remove = FALSE
horror_movies %>%
  extract(col = title, into = "year", regex = "(\\d\\d\\d\\d)", remove = FALSE)
## # A tibble: 3,328 × 3
## title year release_date
## <chr> <chr> <chr>
## 1 Gut (2012) 2012 26-Oct-12
## 2 The Haunting of Mia Moss (2017) 2017 13-Jan-17
## 3 Sleepwalking (2017) 2017 21-Oct-17
## 4 Treasure Chest of Horrors II (2013) 2013 23-Apr-13
## 5 Infidus (2015) 2015 10-Apr-15
## 6 In Extremis (2017) 2017 2017
## 7 Ghostlight (2013) 2013 3-Jun-14
## 8 Parasyte: Part 2 (2015) 2015 25-Apr-15
## 9 Stranger in the House (2015) 2015 28-May-17
## 10 Tutak Tutak Tutiya (2016) 2016 7-Oct-16
## # ℹ 3,318 more rows
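One thing to keep in mind is that the pattern (\\d\\d\\d\\d) grabs the first four digits in a row that it finds, which may not be the year if a title itself contains a number. The sketch below illustrates this on a tiny made-up tibble (toy_movies, and the second title in it, are hypothetical and not from the horror movies data), and shows one possible way to be stricter by requiring literal parentheses around the group.

```r
library(tidyverse)

## a tiny made-up tibble; the second title is hypothetical
toy_movies <- tibble(title = c("Gut (2012)", "Room 1408 Revisited (2019)"))

## the plain four-digit pattern grabs the first four digits it finds
loose <- toy_movies %>%
  extract(col = title, into = "year", regex = "(\\d\\d\\d\\d)")
loose

## requiring escaped, literal parentheses around the group targets the release year
strict <- toy_movies %>%
  extract(col = title, into = "year", regex = "\\((\\d{4})\\)")
strict
```

The escaped parentheses \\( and \\) match the actual characters in the title, while the unescaped pair in the middle still defines the single group that extract pulls out.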
10.4 Using separate
The separate function will split a column into multiple columns based on a regular expression. You can think of separate as a way to split with a more complicated delimiter (a delimiter is a character that separates values). For starters, we will once again look at the title column in the horror movies data. Recall that the title column is organized as the movie title, followed by a blank space and then the year of release in parentheses. We will take advantage of this organization and separate our title column into two columns: title, which will have the title of the movie, and year, which will have the release year. While this will take multiple steps to get into a perfectly cleaned data set, we can utilize our piping procedures to make it relatively straightforward.
First, let’s separate the column based on the first parenthesis that we see.
## separating on the first parenthesis in the title column
horror_movies %>%
  separate(col = title, into = c("title","year"), sep = "\\(")
## Warning: Expected 2 pieces. Additional pieces discarded in 7 rows [691, 928, 1289, 2086, 2130,
## 2545, 3262].
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [932].
## # A tibble: 3,328 × 3
## title year release_date
## <chr> <chr> <chr>
## 1 "Gut " 2012) 26-Oct-12
## 2 "The Haunting of Mia Moss " 2017) 13-Jan-17
## 3 "Sleepwalking " 2017) 21-Oct-17
## 4 "Treasure Chest of Horrors II " 2013) 23-Apr-13
## 5 "Infidus " 2015) 10-Apr-15
## 6 "In Extremis " 2017) 2017
## 7 "Ghostlight " 2013) 3-Jun-14
## 8 "Parasyte: Part 2 " 2015) 25-Apr-15
## 9 "Stranger in the House " 2015) 28-May-17
## 10 "Tutak Tutak Tutiya " 2016) 7-Oct-16
## # ℹ 3,318 more rows
Notice how we managed to separate the two columns, but there are now more problems we need to deal with:
- What do the warning messages mean?
- We need to get rid of the extra parenthesis in the year column.
- We need (well, we don’t NEED) to get rid of the blank space at the end of the title column.
First and foremost, we will take a second to review what a warning message is.
Warning message: These are messages that tell you that your R code was able to execute, but in the process, R did something that you did not explicitly tell it to do. It is extremely important that you Google warning messages, as your data could be manipulated by R in ways you really did not want. Let’s look at our two warning messages in depth:
Warning messages:
1: Expected 2 pieces. Additional pieces discarded in 7 rows [691, 928, 1289, 2086, 2130, 2545, 3262].
2: Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [932].
The first warning message is telling us that R expected 2 pieces, and then discarded additional pieces in 7 rows. While this does not make much sense at first sight, it gives us the rows on which it performed this process. The best practice is to investigate these rows to get a better idea of what is happening.
## investigating a couple of the rows that caused the warning message
horror_movies$title[691]
## [1] "ErOddity(s) 2 (2015)"
horror_movies$title[928]
## [1] "Truth or Double Dare (TODD) (2017)"
horror_movies$title[1289]
## [1] "Hi-8 (Horror Independent 8) (2013)"
By now you should see a problem: each of these warning rows has a parenthesis in the title itself! This means that R is separating based on our parenthesis, but since there are multiple parentheses, it is separating multiple times. Therefore, it is discarding the extra separated piece in each row. Generally, you want to make sure R is not throwing away any important information. You could check this yourself by saving the result as a new tibble and investigating the rows. Hence, we will create a new tibble for demonstration purposes.
## performing the same task but saving it in a new tibble
investigate_horror_movies <- horror_movies %>%
  separate(col = title, into = c("title","year"), sep = "\\(")
## Warning: Expected 2 pieces. Additional pieces discarded in 7 rows [691, 928, 1289, 2086, 2130,
## 2545, 3262].
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [932].
## investigating the 691st row of the saved tibble
investigate_horror_movies[691,]
## # A tibble: 1 × 3
## title year release_date
## <chr> <chr> <chr>
## 1 ErOddity "s) 2 " 24-Nov-15
## investigating the 928th row of the saved tibble
investigate_horror_movies[928,]
## # A tibble: 1 × 3
## title year release_date
## <chr> <chr> <chr>
## 1 "Truth or Double Dare " "TODD) " 30-Oct-17
Clearly, this isn’t giving us our desired result for the 7 rows that were causing this issue. While we will ignore this for the purpose of this exercise, you want to be very aware of what is happening to your data.
Next, let’s investigate the second warning message: Expected 2 pieces. Missing pieces filled with NA in 1 rows [932]. It seems as though R filled a missing piece of information in our row 932. Let’s take a closer look:
## investigating what happened in row 932
## checking the title column first in our original data
horror_movies$title[932]
## [1] "American Exorcist"
## look what happened to our data in row 932
investigate_horror_movies[932,]
## # A tibble: 1 × 3
## title year release_date
## <chr> <chr> <chr>
## 1 American Exorcist <NA> 2017
Clearly we missed something: not all the movies have the same format of MOVIE TITLE (YEAR). It appears that the movie “American Exorcist” in row 932 did not have a year attached to it. Therefore, when we tried to separate by parenthesis, R found nothing to split on and filled the year with NA. In our case, this wouldn’t have affected our analysis since there was no year attached to the title, but we would want to find another way to get the year from another column. However, as before, we will ignore this warning message since fixing it will take a lot of time and effort. It is important to realize that you must read and investigate warning messages since R constantly performs operations on your data that may not be exactly what you intend.
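Although we ignore these rows in this chapter, here is one possible sketch of how they could be handled, assuming that a real release year always appears as the very last thing in the title. It uses the stringr functions str_extract and str_remove together with $ (which anchors the match to the end of the string) and a lookahead (?=...), neither of which is covered above. The names titles, year, and clean_title are chosen for illustration only.

```r
library(stringr)

## three titles taken from the rows inspected above
titles <- c("ErOddity(s) 2 (2015)",
            "Truth or Double Dare (TODD) (2017)",
            "American Exorcist")

## year: four digits followed by a closing parenthesis at the very end of the string
year <- str_extract(titles, "\\d{4}(?=\\)$)")
year

## title: drop a trailing " (YEAR)" if there is one, otherwise leave the title alone
clean_title <- str_remove(titles, " \\(\\d{4}\\)$")
clean_title
```

Because the pattern only looks at the end of each string, parentheses that belong to the title itself are left untouched, and a title with no year simply gets NA.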
Moving on, let’s start solving our second task: getting rid of the extra parenthesis in the year column. Recall that we can use the extract function here to keep only the four digits, which drops the final parenthesis. We can do this all in one step using a pipe.
## separating on the first parenthesis in the title column and then extracting the four digits of the year
horror_movies %>%
  separate(col = title, into = c("title","year"), sep = "\\(") %>%
  extract(col = year, into = "year", regex = "(\\d\\d\\d\\d)")
## Warning: Expected 2 pieces. Additional pieces discarded in 7 rows [691, 928, 1289, 2086, 2130,
## 2545, 3262].
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [932].
## # A tibble: 3,328 × 3
## title year release_date
## <chr> <chr> <chr>
## 1 "Gut " 2012 26-Oct-12
## 2 "The Haunting of Mia Moss " 2017 13-Jan-17
## 3 "Sleepwalking " 2017 21-Oct-17
## 4 "Treasure Chest of Horrors II " 2013 23-Apr-13
## 5 "Infidus " 2015 10-Apr-15
## 6 "In Extremis " 2017 2017
## 7 "Ghostlight " 2013 3-Jun-14
## 8 "Parasyte: Part 2 " 2015 25-Apr-15
## 9 "Stranger in the House " 2015 28-May-17
## 10 "Tutak Tutak Tutiya " 2016 7-Oct-16
## # ℹ 3,318 more rows
Finally, let’s handle the last task and get rid of the whitespace that is at the end of our title column. Luckily, the stringr package has a function called stringr::str_trim that trims whitespace from the start and end of each string. Now we can perform the entire cleaning of the title column in one piping procedure:
## separating on the first parenthesis in the title column and then extracting the four digits of the year
## and then removing the whitespace that is left over
horror_movies %>%
  separate(col = title, into = c("title","year"), sep = "\\(") %>%
  extract(col = year, into = "year", regex = "(\\d\\d\\d\\d)") %>%
  mutate(title = str_trim(title))
## Warning: Expected 2 pieces. Additional pieces discarded in 7 rows [691, 928, 1289, 2086, 2130,
## 2545, 3262].
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [932].
## # A tibble: 3,328 × 3
## title year release_date
## <chr> <chr> <chr>
## 1 Gut 2012 26-Oct-12
## 2 The Haunting of Mia Moss 2017 13-Jan-17
## 3 Sleepwalking 2017 21-Oct-17
## 4 Treasure Chest of Horrors II 2013 23-Apr-13
## 5 Infidus 2015 10-Apr-15
## 6 In Extremis 2017 2017
## 7 Ghostlight 2013 3-Jun-14
## 8 Parasyte: Part 2 2015 25-Apr-15
## 9 Stranger in the House 2015 28-May-17
## 10 Tutak Tutak Tutiya 2016 7-Oct-16
## # ℹ 3,318 more rows
10.5 The str_replace function
Many times, we want to simply replace simple patterns in columns of data sets. The stringr::str_replace function is a simple way to take expressions within variables and substitute them with a different result. Let’s create a vector of values to demonstrate.
## creating a vector to demonstrate str_replace
money <- c("$100", "$150", "$200")
A common use of str_replace is to get rid of dollar signs in columns that have monetary values. Omitting the dollar sign is essential if you want to compute any type of summary statistic.
The three arguments in the stringr::str_replace function that we are concerned with are the string, pattern, and replacement arguments. The string argument is the vector you want to perform the stringr::str_replace function on. The pattern argument is the regular expression we want to find in our vector, and the replacement argument is what we would like to replace the match with.
## attempting (unsuccessfully) to replace the $ with nothing
str_replace(money, "$", "")
## [1] "$100" "$150" "$200"
Notice that nothing changed: on its own, the dollar sign is a special character in regular expressions (it marks the end of the string), so we once again have to use the escape characters.
## successfully replacing the $ with nothing
str_replace(money, "\\$", "")
## [1] "100" "150" "200"
## saving this new result
money <- str_replace(money, "\\$", "")
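To make the point about summary statistics concrete, here is a minimal sketch that strips the dollar signs and then converts the result to numbers with base R's as.numeric. It re-creates the original money vector so it can be run on its own; money_numeric is a name chosen for illustration.

```r
library(stringr)

## re-creating the original vector with dollar signs
money <- c("$100", "$150", "$200")

## strip the dollar sign, then convert the remaining text to numbers
money_numeric <- as.numeric(str_replace(money, "\\$", ""))

## summary statistics now work
mean(money_numeric)
```

Without the conversion step the values would still be text, even after the dollar signs are gone.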
10.6 The str_replace_all function
ALERT: the function str_replace only replaces one instance of the pattern per element, and it is always the first one. To demonstrate this, let’s consider the following example:
## observe that elements 2 and 3 have multiple $ signs
money_2 <- c("$100", "$200 $200", "$300 $300 $300")
If we were to perform a simple str_replace on the money_2 vector, only the first dollar sign in each element would be replaced. Observe this in the following code, and note that piping into the str_replace function is equivalent to str_replace(money_2, "\\$", ""):
## running str_replace on money_2 - piping
money_2 %>%
  str_replace("\\$", "")
## [1] "100" "200 $200" "300 $300 $300"
Only the first dollar sign in each element has been replaced!
Now suppose you would like to get rid of all dollar signs. This can be accomplished with the str_replace_all
function:
money_2 %>%
  str_replace_all("\\$", "")
## [1] "100" "200 200" "300 300 300"