Chapter 13 Relational Data
For this chapter, we will be focusing on how to find primary keys. A primary key is a compact way of saying “how to uniquely identify observations in your data set”. To practice finding primary keys, we will be using a variety of data sets that are already preloaded in R packages. The purpose of identifying primary keys is to make merging multiple data sets together easier. Merging data works best when you can merge on keys that uniquely identify your observations. This ensures that each observation is connected to their true data in the foreign data set (i.e. the data set that it will be merged to). A failure to merge your data correctly will result in inaccurate data that will not produce any meaningful result. This is what we want to avoid.
13.1 Finding Primary Keys
The most useful technique to finding primary keys involves a combination of your intuition and the count
and filter
functions. To start, we’ll load in the titanic_train
data from the titanic
package.
##loading in the necessary package
library(tidyverse)
library(titanic)
library(janitor)
## data we will be working with
<- clean_names(titanic_train) titanic
Let’s take a look at the first few rows of our data by using the head
function.
##getting a snapshot of our data
%>%
titanic head()
## passenger_id survived pclass name
## 1 1 0 3 Braund, Mr. Owen Harris
## 2 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Thayer)
## 3 3 1 3 Heikkinen, Miss. Laina
## 4 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel)
## 5 5 0 3 Allen, Mr. William Henry
## 6 6 0 3 Moran, Mr. James
## sex age sib_sp parch ticket fare cabin embarked
## 1 male 22 1 0 A/5 21171 7.2500 S
## 2 female 38 1 0 PC 17599 71.2833 C85 C
## 3 female 26 0 0 STON/O2. 3101282 7.9250 S
## 4 female 35 1 0 113803 53.1000 C123 S
## 5 male 35 0 0 373450 8.0500 S
## 6 male NA 0 0 330877 8.4583 Q
To uniquely identify an observation, we need a column (or combination of columns) that puts each passenger into one row. In other words, since we have individual-level data, we want to find a row that will uniquely identify each individual. Since this is our first example, this is relatively simple: passenger_id
uniquely identifies an individual, and so does name
. While this seems intuitive, it is important to actually check whether this is true. This is where the technique of filter
and count
must be applied.
We will be using the following algorithm to check whether a column (or combination of columns) uniquely identifies our observations:
- Start with one column that you believe could uniquely identify an observation
- Count all of the elements in a column
- Filter out any of them that occur more than once
- If you do not receive any observation that occur more than once, you are done, if not, you must either try another column, or try a combination of columns. Use your intuition.
This small algorithm ensures that we are uniquely identifying each observation. It clearly displays observations that occur more than once to inform us that this column (or columns) does or does not identify a primary key. To summarize, if we run our algorithm and receive any output of observations, we failed to find a primary key and need to rethink how we identify an observation.
Let’s try out this algorithm on our two columns that we intuitively thought could identify each individual.
## using algorithm on the passenger_id column
%>%
titanic count(passenger_id) %>%
filter(n >1)
## [1] passenger_id n
## <0 rows> (or 0-length row.names)
Success! The passenger_id
column does not contain multiple observations of identification. Therefore, it uniquely identifies each observation, and we could use this column to merge in data from other data sets.
## using algorithm on the name column
%>%
titanic count(name) %>%
filter(n >1)
## [1] name n
## <0 rows> (or 0-length row.names)
Another success! The name
column also does not contain multiple observations of any particular name. Therefore, it also identifies each individual, and we could also use this column to merge in data from other data sets.
On the other hand, let’s test the ticket
column as it intuitively seems like it could also uniquely identify an observation (there is usually one ticket per person).
## testing if ticket uniquely identifies
%>%
titanic count(ticket) %>%
filter(n >1) %>%
head()
## ticket n
## 1 110152 3
## 2 110413 3
## 3 110465 2
## 4 111361 2
## 5 113505 2
## 6 113572 2
Clearly, since we see that there are multiple duplicates of ticket numbers, we cannot uniquely identify our observations using the ticket
column.
Our titanic
dataset is simple in that we have two ways to uniquely identify an observation which only included one column, but this is not always true. In the next section, we will go through a harder example which will require a collection of columns.
13.2 Finding Primary Keys (harder)
In the titanic
example, finding the primary keys was pretty intuitive, and a little trivial. However, this is not always the case. Let’s take a look at a data set from the babynames
package. The babynames
data set provides data on names given at birth from the Social Security Administration. This includes all names with at least 5 uses (sorry Elon). For our purposes, we will subset the data set since the original data is almost 2 million observations.
## loading the necessary package
library(babynames)
## selecting only the first 20,000 observations and getting rid of one of the rows we don't need
<- babynames[1:20000,] %>%
babynames select(-n)
## getting a preview of the data
%>%
babynames head()
## # A tibble: 6 × 4
## year sex name prop
## <dbl> <chr> <chr> <dbl>
## 1 1880 F Mary 0.0724
## 2 1880 F Anna 0.0267
## 3 1880 F Emma 0.0205
## 4 1880 F Elizabeth 0.0199
## 5 1880 F Minnie 0.0179
## 6 1880 F Margaret 0.0162
As we can see, there are four columns:
year
- the year of the birthsex
- the sex of the childname
- the name of the childprop
- a weighting variable we do not care about
We want to uniquely identify an observation. This is where we need intuition as well as our algorithm. Would name
uniquely identify an observation? No. A name can appear in many different years. Observe our algorithm:
## checking if name uniquely identifies the data
%>%
babynames count(name) %>%
filter(n >1)
## # A tibble: 2,728 × 2
## name n
## <chr> <int>
## 1 Aaron 9
## 2 Ab 4
## 3 Abbie 10
## 4 Abbott 2
## 5 Abby 8
## 6 Abe 9
## 7 Abel 9
## 8 Abigail 9
## 9 Abner 9
## 10 Abraham 9
## # ℹ 2,718 more rows
Clearly, we see that there are multiple times a name appears. Hence, we must use a collection of columns to uniquely identify (recall our algorithm).
Our intuition should tell us that using name
and year
may allow us to uniquely identify our data since each name likely occurs one time within each year. As always, our algorithm can check this assumption:
## checking if name and year uniquely identifies the data
%>%
babynames count(name, year) %>%
filter(n>1)
## # A tibble: 1,245 × 3
## name year n
## <chr> <dbl> <int>
## 1 Abbie 1887 2
## 2 Ada 1885 2
## 3 Ada 1886 2
## 4 Ada 1887 2
## 5 Addie 1880 2
## 6 Addie 1881 2
## 7 Addie 1882 2
## 8 Addie 1883 2
## 9 Addie 1884 2
## 10 Addie 1886 2
## # ℹ 1,235 more rows
It seems we are still not getting unique observations as there are many names that appear twice. Why would this be? This is where you must think critically about your data. Notice that our algorithm has shown us that names occur a maximum of two times. This should hint at something: these names are being used for both males and females. Therefore, to uniquely identify an observation, we must use three columns: name
, year
, and sex
.
## checking if name, year, and sex uniquely identify an observation
%>%
babynames count(name, year, sex) %>%
filter(n > 1)
## # A tibble: 0 × 4
## # ℹ 4 variables: name <chr>, year <dbl>, sex <chr>, n <int>
Success! We managed to find a grouping of columns that brought each observation to one entry. Hence, if we were merging in data, we would want to use these three columns to connect the data.
13.2.1 Exercise
- (Taken from R for Data Science Chapter 13.3 Exercises) Identify the primary key in the following data set from the
fueleconomy
package:vehicles
. Remember, the primary key could be a single column or a collection of columns.
13.3 Selected Solutions
These are shown on the next page. Please do not look at these until you have tried the exercises. These are provided because these exercises are more difficult.
- (Exercise 1.2.1) The column
id
uniquely identifies the observations. - (Exercise 1.2.2) The columns (
playerID
,yearID
,stint
) uniquely identify each observation. The columns (playerID
,yearID
) are not a primary key because players can play on different teams within the same year - (Exercise 1.2.3) The primary key is (
lat
,long
,year
,month
). The primary key represents the location and time that the measurement was taken.