Chapter 6 Common Wrangling II

For this chapter, we will be importing a dataset from TidyTuesday Github. Tidy Tuesday is a weekly social data project in R where users explore a new dataset each week and share their findings on Twitter with #TidyTuesday. In particular, we will be focusing on a horror movies data set from IMDB. IMDB is the world’s most popular and authoritative source for movie, TV and celebrity content, designed to help fans explore the world of movies and shows and decide what to watch. This data set shows us information on horror movies that are on IMDB’s website. Here are some important variables we will be working with:

  • review_rating- the IMDB users average rating of the movie.
  • release_country - the country the movie was released in.
  • movie_rating - the movie’s Motion Picture Association film rating system score (e.g. G, PG, PG-13)

We will be focusing on getting comfortable with the following functions:

  • count
  • slice
  • pull
  • is.na
  • arrange
  • filter in conjunction with logicals
  • adorn_totals from the janitor package (janitor::adorn_totals)

Let’s begin by importing in the data. To do this, we will be importing it using the read_csv function. Copy and paste the following link: https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-10-22/horror_movies.csv and put it into a tibble called horror_movies using the read_csv function as shown below.

## install the package if you do not have it
library(tidyverse)

## loading in the data
horror_movies <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-10-22/horror_movies.csv")
## Rows: 3328 Columns: 12
## ── Column specification ───────────────────────────────────────────────────────────────
## Delimiter: ","
## chr (11): title, genres, release_date, release_country, movie_rating, movie_run_tim...
## dbl  (1): review_rating
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Notice how easy it was to read data off of a website and into R using the read_csv function: all it took was copying a pasting a link.

6.1 The count function

The count function takes all of the unique elements in a column, and counts how many times each element occurs. For instance, say we want to find the amount of times each movie rating occurs in our data set. We can do this using the count function.

## finding out how many of each rating of horror movie there are
horror_movies %>% 
  count(movie_rating)
## # A tibble: 12 × 2
##    movie_rating     n
##    <chr>        <int>
##  1 E                1
##  2 NC-17            1
##  3 NOT RATED      699
##  4 PG               5
##  5 PG-13           79
##  6 R              416
##  7 TV-14           33
##  8 TV-MA          107
##  9 TV-PG            1
## 10 UNRATED        108
## 11 X                1
## 12 <NA>          1877

The count function also has a helpful argument called sort. By default, sort is set to FALSE. This means that the count function will not order your results in descending order by the number of times they occur. If you want to view your elements in descending order, you can set the sort argument to TRUE.

## counting in descending order by setting sort = T
horror_movies %>% 
  count(movie_rating, sort = T)
## # A tibble: 12 × 2
##    movie_rating     n
##    <chr>        <int>
##  1 <NA>          1877
##  2 NOT RATED      699
##  3 R              416
##  4 UNRATED        108
##  5 TV-MA          107
##  6 PG-13           79
##  7 TV-14           33
##  8 PG               5
##  9 E                1
## 10 NC-17            1
## 11 TV-PG            1
## 12 X                1

The sort argument is particularly useful for spotting things like a large amount of NAs, or getting an idea of how a column is distributed.

6.2 The slice and pull functions

Oftentimes, we want to extract a particular entry in a column and save it into a vector. For example, suppose you are asked to find the second most frequent occurrence of the movie_rating column and then save this category in a vector named second_most_frequent_movie_rating. The bad solution which you should never do unless absolutely necessary would be as follows:

## do not do this unless absolutely necessary
second_most_frequent_movie_rating <- "NOT RATED"

This solution is poor because this data could easily be updated or changed. If this happened, “NOT RATED” may no longer be the second most frequent occurrence of the movie_rating column! Hence, we want to find a way to do this programmatically. This is where slice and pull will assist us. The slice function simply slices the data by a certain amount of rows that you type in. Observe:

## gets the first row of the counted up tibble
horror_movies %>% 
  count(movie_rating, sort = T) %>% 
  slice(2) ## the 2 says to only take out the second row
## # A tibble: 1 × 2
##   movie_rating     n
##   <chr>        <int>
## 1 NOT RATED      699

In the above code, only the second row is sliced from the tibble (hence, the 2). However, we can also slice more rows than just one:

## gets the first 3 rows 
## 1:3 means "one to three" in R language
horror_movies %>% 
  count(movie_rating, sort = T) %>% 
  slice(1:3)
## # A tibble: 3 × 2
##   movie_rating     n
##   <chr>        <int>
## 1 <NA>          1877
## 2 NOT RATED      699
## 3 R              416

In the above code, the first three rows are sliced from the tibble by using the 1:3 code which is translated as “one to three” in R.

Notice that while the slice function is useful, it needs help from pull to get our desired result. The pull function will essentially allow us to extract the values from a column into a vector. Observe:

horror_movies %>% 
  count(movie_rating, sort = T) %>% 
  slice(2) %>%  ## the 2 says to only slice the second row 
  pull(movie_rating) ## telling R to pull the elements from the movie_rating column
## [1] "NOT RATED"

Here, the pull function simply extracts the values from movie_rating and makes it into a vector. We can then save this vector with our desired name that will update programmatically as follows:

## saving the second most frequent movie rating into a vector named
## second_most_frequent_movie_rating
## this is a programmatic and excellent solution
second_most_frequent_movie_rating <- horror_movies %>% 
  count(movie_rating, sort = T) %>% 
  slice(2) %>%  ## the 2 says to only slice the second row 
  pull(movie_rating) ## telling R to pull the elements from the movie_rating column

second_most_frequent_movie_rating
## [1] "NOT RATED"

We can similarly do it with a larger tibble and get the top three most frequent movie ratings into a vector:

## getting the top three most frequent movie ratings
top_three_frequent_movie_ratings <- horror_movies %>% 
  count(movie_rating, sort = T) %>% 
  slice(1:3) %>%  ## the 1:3 means row 1 to row 3 
  pull(movie_rating) ## telling R to pull the elements from the movie_rating column

top_three_frequent_movie_ratings
## [1] NA          "NOT RATED" "R"

6.2.1 Exercise

  • What is the most frequent review rating? Save this as a vector named most_frequent_review_rating.

6.3 Using the is.na function

The is.na function used to find the NAs withing a particular column. It takes one argument: the column you specify . The is.na function works particularly well with the filter function. Suppose we want to see how many NAs are in the movie_rating column. We can do this by count, and a filter in conjunction with the is.na function.

## Looking at only the NAs
horror_movies %>% 
  count(movie_rating, sort = T) %>% 
  filter(is.na(movie_rating))
## # A tibble: 1 × 2
##   movie_rating     n
##   <chr>        <int>
## 1 <NA>          1877

While this is useful, it might be even more useful if we filter out the NAs. Observe:

## Filtering out the NAs
horror_movies %>% 
  count(review_rating, sort = T) %>% 
  filter(!is.na(review_rating))
## # A tibble: 87 × 2
##    review_rating     n
##            <dbl> <int>
##  1           4.7    95
##  2           5.2    89
##  3           5.7    89
##  4           4.8    83
##  5           5      83
##  6           4.6    82
##  7           5.6    81
##  8           4.3    80
##  9           5.1    78
## 10           5.3    78
## # ℹ 77 more rows

What exactly happened here? The ! logical is the “not” or “negating” logical. If we were to type filter(is.na(review_rating)) we are telling R to filter all the elements inside of review_rating that are NA. However, if we type filter(!is.na(review_rating)) we are telling R to NOT filter all the elements inside of review_rating taht are NA. Hence, we are actually filtering out the NAs in this line of code.

However, notice that you can actually do this using the drop_na function as well:

## Filtering out the NAs
## same result as using filter(!is.na(review_rating))
horror_movies %>% 
  count(review_rating, sort = T) %>% 
  drop_na(review_rating)
## # A tibble: 87 × 2
##    review_rating     n
##            <dbl> <int>
##  1           4.7    95
##  2           5.2    89
##  3           5.7    89
##  4           4.8    83
##  5           5      83
##  6           4.6    82
##  7           5.6    81
##  8           4.3    80
##  9           5.1    78
## 10           5.3    78
## # ℹ 77 more rows

6.3.1 Exercise

  • Make a new tibble called \(\color{magenta}{\text{horrror\_movies\_NA}}\) that filters out all the NAs in the entire data set.

6.4 The arrange function

The arrange function is a simple function that simply sorts columns into ascending or descending order. For instance, suppose we sort our entire data set by which movies had the highest review_rating. We could do this by using the arrange function:

## using the arrange function to sort the review rating from lowest to highest
horror_movies %>% 
  arrange(review_rating) %>% 
  head(10)
## # A tibble: 10 × 12
##    title  genres release_date release_country movie_rating review_rating movie_run_time
##    <chr>  <chr>  <chr>        <chr>           <chr>                <dbl> <chr>         
##  1 Count… Horror 8-Sep-17     USA             <NA>                   1   <NA>          
##  2 Una C… Comed… 21-Sep-17    Peru            <NA>                   1   91 min        
##  3 A Rai… Horror 4-Jun-16     Japan           <NA>                   1.2 <NA>          
##  4 Potat… Comed… 23-Jul-15    Germany         <NA>                   1.3 81 min        
##  5 Dead … Horror 31-Jul-13    USA             TV-MA                  1.4 75 min        
##  6 Amity… Horror 3-Jan-17     USA             <NA>                   1.5 77 min        
##  7 Shark… Horror 14-Aug-15    USA             <NA>                   1.5 71 min        
##  8 Attac… Actio… 20-Apr-12    India           <NA>                   1.5 75 min        
##  9 Inter… Actio… 9-May-16     USA             <NA>                   1.6 49 min        
## 10 Raide… Actio… 19-May-15    USA             NOT RATED              1.6 71 min        
## # ℹ 5 more variables: plot <chr>, cast <chr>, language <chr>, filming_locations <chr>,
## #   budget <chr>

By default, the arrange function sorts in ascending rather than descending order. If we want to change this, we can use the arrange function in conjunction with the desc function.

## using the arrange and desc functions to sort the review rating from highest to lowest
horror_movies %>% 
  arrange(desc(review_rating)) %>% 
  head(10)
## # A tibble: 10 × 12
##    title  genres release_date release_country movie_rating review_rating movie_run_time
##    <chr>  <chr>  <chr>        <chr>           <chr>                <dbl> <chr>         
##  1 Boneh… Horror 27-Oct-17    USA             <NA>                   9.8 <NA>          
##  2 The T… Actio… 13-Oct-17    USA             <NA>                   9.6 98 min        
##  3 The C… Comed… 26-Oct-17    Canada          <NA>                   9.6 97 min        
##  4 The S… Horror 1-Oct-17     UK              <NA>                   9.5 90 min        
##  5 Hotel… Actio… 29-Sep-17    UK              <NA>                   9.5 70 min        
##  6 Flesh… Horror 21-Oct-17    USA             <NA>                   9.5 77 min        
##  7 Bong … Horror 20-Oct-17    USA             <NA>                   9.4 <NA>          
##  8 The T… Horror 15-May-17    UK              <NA>                   9.4 72 min        
##  9 Take … Horror 1-Feb-15     USA             <NA>                   9.3 <NA>          
## 10 Johan… Actio… 1-Sep-16     USA             <NA>                   9.3 52 min        
## # ℹ 5 more variables: plot <chr>, cast <chr>, language <chr>, filming_locations <chr>,
## #   budget <chr>

6.5 The filter function with logicals

As you saw last week, the filter function is great for subsetting your data based on a certain criteria. However, the filter function becomes much more powerful when used with logical operators. The three most common logical operators we use are the following:

  • ! - the “not” logical operator
  • & - the “and” logical operator
  • | - the “or” logical operator

We already briefly specified the ! logical operator in the previous section, so let’s focus on the & and |. The & operator becomes useful when we want to filter based on more than one true criteria. For example, suppose we want to filter out the movies that received a 9.0 review rating or higher AND was released in Canada. We would need to evaluate whether two criteria are satisfied: the statement “movies that received a 9.0 movie rating or higher”and the statement “released only in Canada”. If both of these statements are TRUE, then they get displayed. If not, they are filtered out.

## filtering for only movies receiving a movie rating of 9.0 or higher 
## and in the country Canada
horror_movies %>% 
  filter(review_rating >= 9.0 & release_country == "Canada")
## # A tibble: 2 × 12
##   title   genres release_date release_country movie_rating review_rating movie_run_time
##   <chr>   <chr>  <chr>        <chr>           <chr>                <dbl> <chr>         
## 1 Tales … Horror 18-Oct-17    Canada          <NA>                   9   80 min        
## 2 The Ca… Comed… 26-Oct-17    Canada          <NA>                   9.6 97 min        
## # ℹ 5 more variables: plot <chr>, cast <chr>, language <chr>, filming_locations <chr>,
## #   budget <chr>

Notice that now we are looking at movies that have a 9.0 or higher movie rating, and were released in Canada. We can see that there are only 2 movies that match these criteria.

On the other hand, suppose we used the | logical operator instead. The | operator will evaluate whether “review rating is great than 9.0” is TRUE, or “release country is Canada” is TRUE. If either of these statements are TRUE or both of these are TRUE, then the data is displayed. If both of these are false, then they are filtered out. Observe:

horror_movies %>% 
  filter(review_rating >= 9.0 | release_country == "Canada") %>% 
  count(review_rating, release_country,  sort = T)
## # A tibble: 52 × 3
##    review_rating release_country     n
##            <dbl> <chr>           <int>
##  1          NA   Canada              7
##  2           9   USA                 5
##  3           4.4 Canada              4
##  4           4.5 Canada              4
##  5           2.4 Canada              3
##  6           3.3 Canada              3
##  7           4.9 Canada              3
##  8           9.1 USA                 3
##  9           3.4 Canada              2
## 10           4   Canada              2
## # ℹ 42 more rows

Notice that we we have review ratings that are less than 9.0, and also countries that are not Canada. This is because only one of our statements need to be TRUE (although, as stated, both can be TRUE as well).

6.5.1 Exercise

  • Use the filter function to filter the horror movies that were released only in the USA or were “NOT RATED”. Find which of these movies had the highest review rating.

6.5.2 Exercise

  • Count the number of PG-13 movies that are only in Japan and USA.

6.6 Selected Solutions

  • (Exercise 1.2.1) The most common review rating is NA if we take into consideration missing values, and 4.7 if we do not take missing values into consideration.

  • (Exercise 1.5.1) Of the movies that were released in USA or had a rating of “NOT RATED” the highest review rating was a 9.8 by the movie Bonehill Road (2017).

  • (Exercise 1.5.2) Of the movies that were released in USA or Japan, there were 67 movies that were rated “PG-13”.