Chapter 6 Data Cleaning II

For this chapter, we will be importing a dataset from TidyTuesday Github. Tidy Tuesday is a weekly social data project in R where users explore a new dataset each week and share their findings on Twitter with #TidyTuesday. In particular, we will be focusing on a horror movies data set from IMDB. IMDB is the world’s most popular and authoritative source for movie, TV and celebrity content, designed to help fans explore the world of movies and shows and decide what to watch. This data set shows us information on horror movies that are on IMDB’s website. Here are some important variables we will be working with:

  • review_rating- the IMDB users average rating of the movie.
  • release_country - the country the movie was released in.
  • movie_rating - the movie’s Motion Picture Association film rating system score (e.g. G, PG, PG-13)

We will be focusing on getting comfortable with the following functions:

  • count
  • is.na
  • arrange
  • filter in conjunction with logicals

Let’s begin by importing in the data. To do this, we will be importing it using the read_csv function. Copy and paste the following link: https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-10-22/horror_movies.csv and put it into a tibble called horror_movies using the read_csv function as shown below.

## install the package if you do not have it
library(tidyverse)

## loading in the data
horror_movies <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-10-22/horror_movies.csv")
## Rows: 3328 Columns: 12
## ── Column specification ──────────────────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr (11): title, genres, release_date, release_country, movie_rating, movie_run_time, plot, cast, ...
## dbl  (1): review_rating
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Notice how easy it was to read data off of a website and into R using the read_csv function: all it took was copying a pasting a link.

6.1 The count function

The count function takes all of the unique elements in a column, and counts how many times each element occurs. For instance, say we want to find the amount of times each movie rating occurs in our data set. We can do this using the count function.

## finding out how many of each rating of horror movie there are
horror_movies %>% 
  count(movie_rating)
## # A tibble: 12 × 2
##    movie_rating     n
##    <chr>        <int>
##  1 E                1
##  2 NC-17            1
##  3 NOT RATED      699
##  4 PG               5
##  5 PG-13           79
##  6 R              416
##  7 TV-14           33
##  8 TV-MA          107
##  9 TV-PG            1
## 10 UNRATED        108
## 11 X                1
## 12 <NA>          1877

The count function also has a helpful argument called sort. By default, sort is set to FALSE. This means that the count function will not order your results in descending order by the number of times they occur. If you want to view your elements in descending order, you can set the sort argument to TRUE.

## counting in descending order by setting sort = T
horror_movies %>% 
  count(movie_rating, sort = T)
## # A tibble: 12 × 2
##    movie_rating     n
##    <chr>        <int>
##  1 <NA>          1877
##  2 NOT RATED      699
##  3 R              416
##  4 UNRATED        108
##  5 TV-MA          107
##  6 PG-13           79
##  7 TV-14           33
##  8 PG               5
##  9 E                1
## 10 NC-17            1
## 11 TV-PG            1
## 12 X                1

The sort argument is particularly useful for spotting things like a large amount of NAs, or getting an idea of how a column is distributed.

6.1.1 Exercise

  • What is the most frequent review rating?

6.2 Using the is.na function

The is.na function used to find the NAs withing a particular column. It takes one argument: the column you specify . The is.na function works particularly well with the filter function. Suppose we want to see how many NAs are in the movie_rating column. We can do this by count, and a filter in conjunction with the is.na function.

## Looking at only the NAs
horror_movies %>% 
  count(movie_rating, sort = T) %>% 
  filter(is.na(movie_rating))
## # A tibble: 1 × 2
##   movie_rating     n
##   <chr>        <int>
## 1 <NA>          1877

While this is useful, it might be even more useful if we filter out the NAs. Observe:

## Filtering out the NAs
horror_movies %>% 
  count(review_rating, sort = T) %>% 
  filter(!is.na(review_rating))
## # A tibble: 87 × 2
##    review_rating     n
##            <dbl> <int>
##  1           4.7    95
##  2           5.2    89
##  3           5.7    89
##  4           4.8    83
##  5           5      83
##  6           4.6    82
##  7           5.6    81
##  8           4.3    80
##  9           5.1    78
## 10           5.3    78
## # … with 77 more rows

What exactly happened here? The ! logical is the “not” or “negating” logical. If we were to type filter(is.na(review_rating)) we are telling R to filter all the elements inside of review_rating that are NA. However, if we type filter(!is.na(review_rating)) we are telling R to NOT filter all the elements inside of review_rating taht are NA. Hence, we are actually filtering out the NAs in this line of code.

6.2.1 Exercise

  • Make a new tibble called \(\color{magenta}{\text{horrror\_movies\_NA}}\) that filters out all the NAs in the entire data set.

6.3 The arrange function

The arrange function is a simple function that simply sorts columns into ascending or descending order. For instance, suppose we sort our entire data set by which movies had the highest review_rating. We could do this by using the arrange function:

## using the arrange function to sort the review rating from lowest to highest
horror_movies %>% 
  arrange(review_rating) %>% 
  head(10)
## # A tibble: 10 × 12
##    title  genres  release_date release_country movie_rating review_rating movie_run_time plot   cast  
##    <chr>  <chr>   <chr>        <chr>           <chr>                <dbl> <chr>          <chr>  <chr> 
##  1 Count… Horror  8-Sep-17     USA             <NA>                   1   <NA>           Direc… Robin…
##  2 Una C… Comedy… 21-Sep-17    Peru            <NA>                   1   91 min         Direc… Emili…
##  3 A Rai… Horror  4-Jun-16     Japan           <NA>                   1.2 <NA>           Direc… Hono …
##  4 Potat… Comedy… 23-Jul-15    Germany         <NA>                   1.3 81 min         Direc… Joyce…
##  5 Dead … Horror  31-Jul-13    USA             TV-MA                  1.4 75 min         Direc… Phili…
##  6 Amity… Horror  3-Jan-17     USA             <NA>                   1.5 77 min         Direc… Marie…
##  7 Shark… Horror  14-Aug-15    USA             <NA>                   1.5 71 min         Direc… Angel…
##  8 Attac… Action… 20-Apr-12    India           <NA>                   1.5 75 min         Direc… Emanu…
##  9 Inter… Action… 9-May-16     USA             <NA>                   1.6 49 min         Direc… Danil…
## 10 Raide… Action… 19-May-15    USA             NOT RATED              1.6 71 min         Direc… Dan D…
## # … with 3 more variables: language <chr>, filming_locations <chr>, budget <chr>

By default, the arrange function sorts in ascending rather than descending order. If we want to change this, we can use the arrange function in conjunction with the desc function.

## using the arrange and desc functions to sort the review rating from highest to lowest
horror_movies %>% 
  arrange(desc(review_rating)) %>% 
  head(10)
## # A tibble: 10 × 12
##    title  genres  release_date release_country movie_rating review_rating movie_run_time plot   cast  
##    <chr>  <chr>   <chr>        <chr>           <chr>                <dbl> <chr>          <chr>  <chr> 
##  1 Boneh… Horror  27-Oct-17    USA             <NA>                   9.8 <NA>           "Dire… Andre…
##  2 The T… Action… 13-Oct-17    USA             <NA>                   9.6 98 min         "Dire… Victo…
##  3 The C… Comedy… 26-Oct-17    Canada          <NA>                   9.6 97 min         "Dire… Elise…
##  4 The S… Horror  1-Oct-17     UK              <NA>                   9.5 90 min         "Dire… Brink…
##  5 Hotel… Action… 29-Sep-17    UK              <NA>                   9.5 70 min         "Dire… Rayne…
##  6 Flesh… Horror  21-Oct-17    USA             <NA>                   9.5 77 min         "Dire… Man W…
##  7 Bong … Horror  20-Oct-17    USA             <NA>                   9.4 <NA>           "Dire… Tiffa…
##  8 The T… Horror  15-May-17    UK              <NA>                   9.4 72 min         "Dire… Rebec…
##  9 Take … Horror  1-Feb-15     USA             <NA>                   9.3 <NA>           "Dire… Tyler…
## 10 Johan… Action… 1-Sep-16     USA             <NA>                   9.3 52 min         "Dire… Curti…
## # … with 3 more variables: language <chr>, filming_locations <chr>, budget <chr>

6.4 The filter function with logicals

As you saw last week, the filter function is great for subsetting your data based on a certain criteria. However, the filter function becomes much more powerful when used with logical operators. The three most common logical operators we use are the following:

  • ! - the “not” logical operator
  • & - the “and” logical operator
  • | - the “or” logical operator

We already briefly specified the ! logical operator in the previous section, so let’s focus on the & and |. The & operator becomes useful when we want to filter based on more than one true criteria. For example, suppose we want to filter out the movies that received a 9.0 review rating or higher AND was released in Canada. We would need to evaluate whether two criteria are satisfied: the statement “movies that received a 9.0 movie rating or higher”and the statement “released only in Canada.” If both of these statements are TRUE, then they get displayed. If not, they are filtered out.

## filtering for only movies receiving a movie rating of 9.0 or higher 
## and in the country Canada
horror_movies %>% 
  filter(review_rating >= 9.0 & release_country == "Canada")
## # A tibble: 2 × 12
##   title  genres release_date release_country movie_rating review_rating movie_run_time plot    cast   
##   <chr>  <chr>  <chr>        <chr>           <chr>                <dbl> <chr>          <chr>   <chr>  
## 1 Tales… Horror 18-Oct-17    Canada          <NA>                   9   80 min         "Direc… Dave M…
## 2 The C… Comed… 26-Oct-17    Canada          <NA>                   9.6 97 min         "Direc… Elise …
## # … with 3 more variables: language <chr>, filming_locations <chr>, budget <chr>

Notice that now we are looking at movies that have a 9.0 or higher movie rating, and were released in Canada. We can see that there are only 2 movies that match these criteria.

On the other hand, suppose we used the | logical operator instead. The | operator will evaluate whether “review rating is great than 9.0” is TRUE, or “release country is Canada” is TRUE. If either of these statements are TRUE or both of these are TRUE, then the data is displayed. If both of these are false, then they are filtered out. Observe:

horror_movies %>% 
  filter(review_rating >= 9.0 | release_country == "Canada") %>% 
  count(review_rating, release_country,  sort = T)
## # A tibble: 52 × 3
##    review_rating release_country     n
##            <dbl> <chr>           <int>
##  1          NA   Canada              7
##  2           9   USA                 5
##  3           4.4 Canada              4
##  4           4.5 Canada              4
##  5           2.4 Canada              3
##  6           3.3 Canada              3
##  7           4.9 Canada              3
##  8           9.1 USA                 3
##  9           3.4 Canada              2
## 10           4   Canada              2
## # … with 42 more rows

Notice that we we have review ratings that are less than 9.0, and also countries that are not Canada. This is because only one of our statements need to be TRUE (although, as stated, both can be TRUE as well).

6.5 Exercise

  • Use the filter function to filter the horror movies that were released only in the USA or were “NOT RATED.” Find which of these movies had the highest review rating.

6.5.1 Exercise

  • Count the number of PG-13 movies that are only in Japan and USA.

6.6 Selected Solutions

  • (Exercise 1.1.1) The most common review rating is 252 for NA, and 95 if we consider NA to not be a review rating.

  • (Exercise 1.4.1) Of the movies that were released in USA or had a rating of “NOT RATED” the highest review rating was a 9.8 by the movie Bonehill Road (2017).

  • (Exercise 1.4.2) Of the movies that were released in USA or Japan, there were 67 movies that were rated “PG-13.”