Chapter 6 Data Cleaning II
For this chapter, we will be importing a dataset from TidyTuesday Github. Tidy Tuesday is a weekly social data project in R where users explore a new dataset each week and share their findings on Twitter with #TidyTuesday. In particular, we will be focusing on a horror movies data set from IMDB. IMDB is the world’s most popular and authoritative source for movie, TV and celebrity content, designed to help fans explore the world of movies and shows and decide what to watch. This data set shows us information on horror movies that are on IMDB’s website. Here are some important variables we will be working with:
review_rating
- the IMDB users average rating of the movie.release_country
- the country the movie was released in.movie_rating
- the movie’s Motion Picture Association film rating system score (e.g. G, PG, PG-13)
We will be focusing on getting comfortable with the following functions:
count
is.na
arrange
filter
in conjunction with logicals
Let’s begin by importing in the data. To do this, we will be importing it using the read_csv
function. Copy and paste the following link: https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-10-22/horror_movies.csv
and put it into a tibble called horror_movies
using the read_csv
function as shown below.
## install the package if you do not have it
library(tidyverse)
## loading in the data
<- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-10-22/horror_movies.csv") horror_movies
## Rows: 3328 Columns: 12
## ── Column specification ──────────────────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr (11): title, genres, release_date, release_country, movie_rating, movie_run_time, plot, cast, ...
## dbl (1): review_rating
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Notice how easy it was to read data off of a website and into R using the read_csv
function: all it took was copying a pasting a link.
6.1 The count
function
The count
function takes all of the unique elements in a column, and counts how many times each element occurs. For instance, say we want to find the amount of times each movie rating occurs in our data set. We can do this using the count
function.
## finding out how many of each rating of horror movie there are
%>%
horror_movies count(movie_rating)
## # A tibble: 12 × 2
## movie_rating n
## <chr> <int>
## 1 E 1
## 2 NC-17 1
## 3 NOT RATED 699
## 4 PG 5
## 5 PG-13 79
## 6 R 416
## 7 TV-14 33
## 8 TV-MA 107
## 9 TV-PG 1
## 10 UNRATED 108
## 11 X 1
## 12 <NA> 1877
The count
function also has a helpful argument called sort
. By default, sort
is set to FALSE
. This means that the count
function will not order your results in descending order by the number of times they occur. If you want to view your elements in descending order, you can set the sort
argument to TRUE
.
## counting in descending order by setting sort = T
%>%
horror_movies count(movie_rating, sort = T)
## # A tibble: 12 × 2
## movie_rating n
## <chr> <int>
## 1 <NA> 1877
## 2 NOT RATED 699
## 3 R 416
## 4 UNRATED 108
## 5 TV-MA 107
## 6 PG-13 79
## 7 TV-14 33
## 8 PG 5
## 9 E 1
## 10 NC-17 1
## 11 TV-PG 1
## 12 X 1
The sort
argument is particularly useful for spotting things like a large amount of NAs, or getting an idea of how a column is distributed.
6.1.1 Exercise
- What is the most frequent review rating?
6.2 Using the is.na
function
The is.na
function used to find the NAs withing a particular column. It takes one argument: the column you specify . The is.na
function works particularly well with the filter
function. Suppose we want to see how many NAs are in the movie_rating
column. We can do this by count
, and a filter
in conjunction with the is.na
function.
## Looking at only the NAs
%>%
horror_movies count(movie_rating, sort = T) %>%
filter(is.na(movie_rating))
## # A tibble: 1 × 2
## movie_rating n
## <chr> <int>
## 1 <NA> 1877
While this is useful, it might be even more useful if we filter out the NAs. Observe:
## Filtering out the NAs
%>%
horror_movies count(review_rating, sort = T) %>%
filter(!is.na(review_rating))
## # A tibble: 87 × 2
## review_rating n
## <dbl> <int>
## 1 4.7 95
## 2 5.2 89
## 3 5.7 89
## 4 4.8 83
## 5 5 83
## 6 4.6 82
## 7 5.6 81
## 8 4.3 80
## 9 5.1 78
## 10 5.3 78
## # … with 77 more rows
What exactly happened here? The !
logical is the “not” or “negating” logical. If we were to type filter(is.na(review_rating))
we are telling R to filter all the elements inside of review_rating
that are NA. However, if we type filter(!is.na(review_rating))
we are telling R to NOT filter all the elements inside of review_rating
taht are NA. Hence, we are actually filtering out the NAs in this line of code.
6.2.1 Exercise
- Make a new tibble called \(\color{magenta}{\text{horrror\_movies\_NA}}\) that filters out all the NAs in the entire data set.
6.3 The arrange
function
The arrange function is a simple function that simply sorts columns into ascending or descending order. For instance, suppose we sort our entire data set by which movies had the highest review_rating
. We could do this by using the arrange
function:
## using the arrange function to sort the review rating from lowest to highest
%>%
horror_movies arrange(review_rating) %>%
head(10)
## # A tibble: 10 × 12
## title genres release_date release_country movie_rating review_rating movie_run_time plot cast
## <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 Count… Horror 8-Sep-17 USA <NA> 1 <NA> Direc… Robin…
## 2 Una C… Comedy… 21-Sep-17 Peru <NA> 1 91 min Direc… Emili…
## 3 A Rai… Horror 4-Jun-16 Japan <NA> 1.2 <NA> Direc… Hono …
## 4 Potat… Comedy… 23-Jul-15 Germany <NA> 1.3 81 min Direc… Joyce…
## 5 Dead … Horror 31-Jul-13 USA TV-MA 1.4 75 min Direc… Phili…
## 6 Amity… Horror 3-Jan-17 USA <NA> 1.5 77 min Direc… Marie…
## 7 Shark… Horror 14-Aug-15 USA <NA> 1.5 71 min Direc… Angel…
## 8 Attac… Action… 20-Apr-12 India <NA> 1.5 75 min Direc… Emanu…
## 9 Inter… Action… 9-May-16 USA <NA> 1.6 49 min Direc… Danil…
## 10 Raide… Action… 19-May-15 USA NOT RATED 1.6 71 min Direc… Dan D…
## # … with 3 more variables: language <chr>, filming_locations <chr>, budget <chr>
By default, the arrange
function sorts in ascending rather than descending order. If we want to change this, we can use the arrange
function in conjunction with the desc
function.
## using the arrange and desc functions to sort the review rating from highest to lowest
%>%
horror_movies arrange(desc(review_rating)) %>%
head(10)
## # A tibble: 10 × 12
## title genres release_date release_country movie_rating review_rating movie_run_time plot cast
## <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 Boneh… Horror 27-Oct-17 USA <NA> 9.8 <NA> "Dire… Andre…
## 2 The T… Action… 13-Oct-17 USA <NA> 9.6 98 min "Dire… Victo…
## 3 The C… Comedy… 26-Oct-17 Canada <NA> 9.6 97 min "Dire… Elise…
## 4 The S… Horror 1-Oct-17 UK <NA> 9.5 90 min "Dire… Brink…
## 5 Hotel… Action… 29-Sep-17 UK <NA> 9.5 70 min "Dire… Rayne…
## 6 Flesh… Horror 21-Oct-17 USA <NA> 9.5 77 min "Dire… Man W…
## 7 Bong … Horror 20-Oct-17 USA <NA> 9.4 <NA> "Dire… Tiffa…
## 8 The T… Horror 15-May-17 UK <NA> 9.4 72 min "Dire… Rebec…
## 9 Take … Horror 1-Feb-15 USA <NA> 9.3 <NA> "Dire… Tyler…
## 10 Johan… Action… 1-Sep-16 USA <NA> 9.3 52 min "Dire… Curti…
## # … with 3 more variables: language <chr>, filming_locations <chr>, budget <chr>
6.4 The filter
function with logicals
As you saw last week, the filter
function is great for subsetting your data based on a certain criteria. However, the filter
function becomes much more powerful when used with logical operators. The three most common logical operators we use are the following:
!
- the “not” logical operator&
- the “and” logical operator|
- the “or” logical operator
We already briefly specified the !
logical operator in the previous section, so let’s focus on the &
and |
. The &
operator becomes useful when we want to filter
based on more than one true criteria. For example, suppose we want to filter out the movies that received a 9.0 review rating or higher AND was released in Canada. We would need to evaluate whether two criteria are satisfied: the statement “movies that received a 9.0 movie rating or higher”and the statement “released only in Canada.” If both of these statements are TRUE, then they get displayed. If not, they are filtered out.
## filtering for only movies receiving a movie rating of 9.0 or higher
## and in the country Canada
%>%
horror_movies filter(review_rating >= 9.0 & release_country == "Canada")
## # A tibble: 2 × 12
## title genres release_date release_country movie_rating review_rating movie_run_time plot cast
## <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 Tales… Horror 18-Oct-17 Canada <NA> 9 80 min "Direc… Dave M…
## 2 The C… Comed… 26-Oct-17 Canada <NA> 9.6 97 min "Direc… Elise …
## # … with 3 more variables: language <chr>, filming_locations <chr>, budget <chr>
Notice that now we are looking at movies that have a 9.0 or higher movie rating, and were released in Canada. We can see that there are only 2 movies that match these criteria.
On the other hand, suppose we used the |
logical operator instead. The |
operator will evaluate whether “review rating is great than 9.0” is TRUE, or “release country is Canada” is TRUE. If either of these statements are TRUE or both of these are TRUE, then the data is displayed. If both of these are false, then they are filtered out. Observe:
%>%
horror_movies filter(review_rating >= 9.0 | release_country == "Canada") %>%
count(review_rating, release_country, sort = T)
## # A tibble: 52 × 3
## review_rating release_country n
## <dbl> <chr> <int>
## 1 NA Canada 7
## 2 9 USA 5
## 3 4.4 Canada 4
## 4 4.5 Canada 4
## 5 2.4 Canada 3
## 6 3.3 Canada 3
## 7 4.9 Canada 3
## 8 9.1 USA 3
## 9 3.4 Canada 2
## 10 4 Canada 2
## # … with 42 more rows
Notice that we we have review ratings that are less than 9.0, and also countries that are not Canada. This is because only one of our statements need to be TRUE (although, as stated, both can be TRUE as well).
6.5 Exercise
- Use the
filter
function to filter the horror movies that were released only in the USA or were “NOT RATED.” Find which of these movies had the highest review rating.
6.5.1 Exercise
- Count the number of PG-13 movies that are only in Japan and USA.
6.6 Selected Solutions
(Exercise 1.1.1) The most common review rating is 252 for NA, and 95 if we consider NA to not be a review rating.
(Exercise 1.4.1) Of the movies that were released in USA or had a rating of “NOT RATED” the highest review rating was a 9.8 by the movie Bonehill Road (2017).
(Exercise 1.4.2) Of the movies that were released in USA or Japan, there were 67 movies that were rated “PG-13.”