Chapter 6 Common Wrangling II
For this chapter, we will be importing a dataset from TidyTuesday Github. Tidy Tuesday is a weekly social data project in R where users explore a new dataset each week and share their findings on Twitter with #TidyTuesday. In particular, we will be focusing on a horror movies data set from IMDB. IMDB is the world’s most popular and authoritative source for movie, TV and celebrity content, designed to help fans explore the world of movies and shows and decide what to watch. This data set shows us information on horror movies that are on IMDB’s website. Here are some important variables we will be working with:
review_rating
- the IMDB users average rating of the movie.release_country
- the country the movie was released in.movie_rating
- the movie’s Motion Picture Association film rating system score (e.g. G, PG, PG-13)
We will be focusing on getting comfortable with the following functions:
count
slice
pull
is.na
arrange
filter
in conjunction with logicalsadorn_totals
from the janitor package (janitor::adorn_totals
)
Let’s begin by importing in the data. To do this, we will be importing it using the read_csv
function. Copy and paste the following link: https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-10-22/horror_movies.csv
and put it into a tibble called horror_movies
using the read_csv
function as shown below.
## install the package if you do not have it
library(tidyverse)
## loading in the data
<- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-10-22/horror_movies.csv") horror_movies
## Rows: 3328 Columns: 12
## ── Column specification ───────────────────────────────────────────────────────────────
## Delimiter: ","
## chr (11): title, genres, release_date, release_country, movie_rating, movie_run_tim...
## dbl (1): review_rating
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Notice how easy it was to read data off of a website and into R using the read_csv
function: all it took was copying a pasting a link.
6.1 The count
function
The count
function takes all of the unique elements in a column, and counts how many times each element occurs. For instance, say we want to find the amount of times each movie rating occurs in our data set. We can do this using the count
function.
## finding out how many of each rating of horror movie there are
%>%
horror_movies count(movie_rating)
## # A tibble: 12 × 2
## movie_rating n
## <chr> <int>
## 1 E 1
## 2 NC-17 1
## 3 NOT RATED 699
## 4 PG 5
## 5 PG-13 79
## 6 R 416
## 7 TV-14 33
## 8 TV-MA 107
## 9 TV-PG 1
## 10 UNRATED 108
## 11 X 1
## 12 <NA> 1877
The count
function also has a helpful argument called sort
. By default, sort
is set to FALSE
. This means that the count
function will not order your results in descending order by the number of times they occur. If you want to view your elements in descending order, you can set the sort
argument to TRUE
.
## counting in descending order by setting sort = T
%>%
horror_movies count(movie_rating, sort = T)
## # A tibble: 12 × 2
## movie_rating n
## <chr> <int>
## 1 <NA> 1877
## 2 NOT RATED 699
## 3 R 416
## 4 UNRATED 108
## 5 TV-MA 107
## 6 PG-13 79
## 7 TV-14 33
## 8 PG 5
## 9 E 1
## 10 NC-17 1
## 11 TV-PG 1
## 12 X 1
The sort
argument is particularly useful for spotting things like a large amount of NAs, or getting an idea of how a column is distributed.
6.2 The slice
and pull
functions
Oftentimes, we want to extract a particular entry in a column and save it into a vector. For example, suppose you are asked to find the second most frequent occurrence of the movie_rating
column and then save this category in a vector named second_most_frequent_movie_rating
. The bad solution which you should never do unless absolutely necessary would be as follows:
## do not do this unless absolutely necessary
<- "NOT RATED" second_most_frequent_movie_rating
This solution is poor because this data could easily be updated or changed. If this happened, “NOT RATED” may no longer be the second most frequent occurrence of the movie_rating
column! Hence, we want to find a way to do this programmatically. This is where slice
and pull
will assist us. The slice
function simply slices the data by a certain amount of rows that you type in. Observe:
## gets the first row of the counted up tibble
%>%
horror_movies count(movie_rating, sort = T) %>%
slice(2) ## the 2 says to only take out the second row
## # A tibble: 1 × 2
## movie_rating n
## <chr> <int>
## 1 NOT RATED 699
In the above code, only the second row is sliced from the tibble (hence, the 2). However, we can also slice more rows than just one:
## gets the first 3 rows
## 1:3 means "one to three" in R language
%>%
horror_movies count(movie_rating, sort = T) %>%
slice(1:3)
## # A tibble: 3 × 2
## movie_rating n
## <chr> <int>
## 1 <NA> 1877
## 2 NOT RATED 699
## 3 R 416
In the above code, the first three rows are sliced from the tibble by using the 1:3
code which is translated as “one to three” in R.
Notice that while the slice
function is useful, it needs help from pull
to get our desired result. The pull
function will essentially allow us to extract the values from a column into a vector. Observe:
%>%
horror_movies count(movie_rating, sort = T) %>%
slice(2) %>% ## the 2 says to only slice the second row
pull(movie_rating) ## telling R to pull the elements from the movie_rating column
## [1] "NOT RATED"
Here, the pull
function simply extracts the values from movie_rating
and makes it into a vector. We can then save this vector with our desired name that will update programmatically as follows:
## saving the second most frequent movie rating into a vector named
## second_most_frequent_movie_rating
## this is a programmatic and excellent solution
<- horror_movies %>%
second_most_frequent_movie_rating count(movie_rating, sort = T) %>%
slice(2) %>% ## the 2 says to only slice the second row
pull(movie_rating) ## telling R to pull the elements from the movie_rating column
second_most_frequent_movie_rating
## [1] "NOT RATED"
We can similarly do it with a larger tibble and get the top three most frequent movie ratings into a vector:
## getting the top three most frequent movie ratings
<- horror_movies %>%
top_three_frequent_movie_ratings count(movie_rating, sort = T) %>%
slice(1:3) %>% ## the 1:3 means row 1 to row 3
pull(movie_rating) ## telling R to pull the elements from the movie_rating column
top_three_frequent_movie_ratings
## [1] NA "NOT RATED" "R"
6.3 Using the is.na
function
The is.na
function used to find the NAs withing a particular column. It takes one argument: the column you specify . The is.na
function works particularly well with the filter
function. Suppose we want to see how many NAs are in the movie_rating
column. We can do this by count
, and a filter
in conjunction with the is.na
function.
## Looking at only the NAs
%>%
horror_movies count(movie_rating, sort = T) %>%
filter(is.na(movie_rating))
## # A tibble: 1 × 2
## movie_rating n
## <chr> <int>
## 1 <NA> 1877
While this is useful, it might be even more useful if we filter out the NAs. Observe:
## Filtering out the NAs
%>%
horror_movies count(review_rating, sort = T) %>%
filter(!is.na(review_rating))
## # A tibble: 87 × 2
## review_rating n
## <dbl> <int>
## 1 4.7 95
## 2 5.2 89
## 3 5.7 89
## 4 4.8 83
## 5 5 83
## 6 4.6 82
## 7 5.6 81
## 8 4.3 80
## 9 5.1 78
## 10 5.3 78
## # ℹ 77 more rows
What exactly happened here? The !
logical is the “not” or “negating” logical. If we were to type filter(is.na(review_rating))
we are telling R to filter all the elements inside of review_rating
that are NA. However, if we type filter(!is.na(review_rating))
we are telling R to NOT filter all the elements inside of review_rating
taht are NA. Hence, we are actually filtering out the NAs in this line of code.
However, notice that you can actually do this using the drop_na
function as well:
## Filtering out the NAs
## same result as using filter(!is.na(review_rating))
%>%
horror_movies count(review_rating, sort = T) %>%
drop_na(review_rating)
## # A tibble: 87 × 2
## review_rating n
## <dbl> <int>
## 1 4.7 95
## 2 5.2 89
## 3 5.7 89
## 4 4.8 83
## 5 5 83
## 6 4.6 82
## 7 5.6 81
## 8 4.3 80
## 9 5.1 78
## 10 5.3 78
## # ℹ 77 more rows
6.4 The arrange
function
The arrange function is a simple function that simply sorts columns into ascending or descending order. For instance, suppose we sort our entire data set by which movies had the highest review_rating
. We could do this by using the arrange
function:
## using the arrange function to sort the review rating from lowest to highest
%>%
horror_movies arrange(review_rating) %>%
head(10)
## # A tibble: 10 × 12
## title genres release_date release_country movie_rating review_rating movie_run_time
## <chr> <chr> <chr> <chr> <chr> <dbl> <chr>
## 1 Count… Horror 8-Sep-17 USA <NA> 1 <NA>
## 2 Una C… Comed… 21-Sep-17 Peru <NA> 1 91 min
## 3 A Rai… Horror 4-Jun-16 Japan <NA> 1.2 <NA>
## 4 Potat… Comed… 23-Jul-15 Germany <NA> 1.3 81 min
## 5 Dead … Horror 31-Jul-13 USA TV-MA 1.4 75 min
## 6 Amity… Horror 3-Jan-17 USA <NA> 1.5 77 min
## 7 Shark… Horror 14-Aug-15 USA <NA> 1.5 71 min
## 8 Attac… Actio… 20-Apr-12 India <NA> 1.5 75 min
## 9 Inter… Actio… 9-May-16 USA <NA> 1.6 49 min
## 10 Raide… Actio… 19-May-15 USA NOT RATED 1.6 71 min
## # ℹ 5 more variables: plot <chr>, cast <chr>, language <chr>, filming_locations <chr>,
## # budget <chr>
By default, the arrange
function sorts in ascending rather than descending order. If we want to change this, we can use the arrange
function in conjunction with the desc
function.
## using the arrange and desc functions to sort the review rating from highest to lowest
%>%
horror_movies arrange(desc(review_rating)) %>%
head(10)
## # A tibble: 10 × 12
## title genres release_date release_country movie_rating review_rating movie_run_time
## <chr> <chr> <chr> <chr> <chr> <dbl> <chr>
## 1 Boneh… Horror 27-Oct-17 USA <NA> 9.8 <NA>
## 2 The T… Actio… 13-Oct-17 USA <NA> 9.6 98 min
## 3 The C… Comed… 26-Oct-17 Canada <NA> 9.6 97 min
## 4 The S… Horror 1-Oct-17 UK <NA> 9.5 90 min
## 5 Hotel… Actio… 29-Sep-17 UK <NA> 9.5 70 min
## 6 Flesh… Horror 21-Oct-17 USA <NA> 9.5 77 min
## 7 Bong … Horror 20-Oct-17 USA <NA> 9.4 <NA>
## 8 The T… Horror 15-May-17 UK <NA> 9.4 72 min
## 9 Take … Horror 1-Feb-15 USA <NA> 9.3 <NA>
## 10 Johan… Actio… 1-Sep-16 USA <NA> 9.3 52 min
## # ℹ 5 more variables: plot <chr>, cast <chr>, language <chr>, filming_locations <chr>,
## # budget <chr>
6.5 The filter
function with logicals
As you saw last week, the filter
function is great for subsetting your data based on a certain criteria. However, the filter
function becomes much more powerful when used with logical operators. The three most common logical operators we use are the following:
!
- the “not” logical operator&
- the “and” logical operator|
- the “or” logical operator
We already briefly specified the !
logical operator in the previous section, so let’s focus on the &
and |
. The &
operator becomes useful when we want to filter
based on more than one true criteria. For example, suppose we want to filter out the movies that received a 9.0 review rating or higher AND was released in Canada. We would need to evaluate whether two criteria are satisfied: the statement “movies that received a 9.0 movie rating or higher”and the statement “released only in Canada”. If both of these statements are TRUE, then they get displayed. If not, they are filtered out.
## filtering for only movies receiving a movie rating of 9.0 or higher
## and in the country Canada
%>%
horror_movies filter(review_rating >= 9.0 & release_country == "Canada")
## # A tibble: 2 × 12
## title genres release_date release_country movie_rating review_rating movie_run_time
## <chr> <chr> <chr> <chr> <chr> <dbl> <chr>
## 1 Tales … Horror 18-Oct-17 Canada <NA> 9 80 min
## 2 The Ca… Comed… 26-Oct-17 Canada <NA> 9.6 97 min
## # ℹ 5 more variables: plot <chr>, cast <chr>, language <chr>, filming_locations <chr>,
## # budget <chr>
Notice that now we are looking at movies that have a 9.0 or higher movie rating, and were released in Canada. We can see that there are only 2 movies that match these criteria.
On the other hand, suppose we used the |
logical operator instead. The |
operator will evaluate whether “review rating is great than 9.0” is TRUE, or “release country is Canada” is TRUE. If either of these statements are TRUE or both of these are TRUE, then the data is displayed. If both of these are false, then they are filtered out. Observe:
%>%
horror_movies filter(review_rating >= 9.0 | release_country == "Canada") %>%
count(review_rating, release_country, sort = T)
## # A tibble: 52 × 3
## review_rating release_country n
## <dbl> <chr> <int>
## 1 NA Canada 7
## 2 9 USA 5
## 3 4.4 Canada 4
## 4 4.5 Canada 4
## 5 2.4 Canada 3
## 6 3.3 Canada 3
## 7 4.9 Canada 3
## 8 9.1 USA 3
## 9 3.4 Canada 2
## 10 4 Canada 2
## # ℹ 42 more rows
Notice that we we have review ratings that are less than 9.0, and also countries that are not Canada. This is because only one of our statements need to be TRUE (although, as stated, both can be TRUE as well).
6.6 Selected Solutions
(Exercise 1.2.1) The most common review rating is NA if we take into consideration missing values, and 4.7 if we do not take missing values into consideration.
(Exercise 1.5.1) Of the movies that were released in USA or had a rating of “NOT RATED” the highest review rating was a 9.8 by the movie Bonehill Road (2017).
(Exercise 1.5.2) Of the movies that were released in USA or Japan, there were 67 movies that were rated “PG-13”.