Chapter 13 Relational Data

For this chapter, we will be focusing on how to find primary keys. A primary key is a compact way of saying “how to uniquely identify observations in your data set”. To practice finding primary keys, we will be using a variety of data sets that are already preloaded in R packages. The purpose of identifying primary keys is to make merging multiple data sets together easier. Merging data works best when you can merge on keys that uniquely identify your observations. This ensures that each observation is connected to their true data in the foreign data set (i.e. the data set that it will be merged to). A failure to merge your data correctly will result in inaccurate data that will not produce any meaningful result. This is what we want to avoid.

13.1 Finding Primary Keys

The most useful technique to finding primary keys involves a combination of your intuition and the count and filter functions. To start, we’ll load in the titanic_train data from the titanic package.

##loading in the necessary package
library(tidyverse)
library(titanic)
library(janitor)
## data we will be working with  
titanic <- clean_names(titanic_train)

Let’s take a look at the first few rows of our data by using the head function.

##getting a snapshot of our data
titanic %>% 
  head()

##   passenger_id survived pclass                                                name
## 1            1        0      3                             Braund, Mr. Owen Harris
## 2            2        1      1 Cumings, Mrs. John Bradley (Florence Briggs Thayer)
## 3            3        1      3                              Heikkinen, Miss. Laina
## 4            4        1      1        Futrelle, Mrs. Jacques Heath (Lily May Peel)
## 5            5        0      3                            Allen, Mr. William Henry
## 6            6        0      3                                    Moran, Mr. James
##      sex age sib_sp parch           ticket    fare cabin embarked
## 1   male  22      1     0        A/5 21171  7.2500              S
## 2 female  38      1     0         PC 17599 71.2833   C85        C
## 3 female  26      0     0 STON/O2. 3101282  7.9250              S
## 4 female  35      1     0           113803 53.1000  C123        S
## 5   male  35      0     0           373450  8.0500              S
## 6   male  NA      0     0           330877  8.4583              Q

To uniquely identify an observation, we need a column (or combination of columns) that puts each passenger into one row. In other words, since we have individual-level data, we want to find a row that will uniquely identify each individual. Since this is our first example, this is relatively simple: passenger_id uniquely identifies an individual, and so does name. While this seems intuitive, it is important to actually check whether this is true. This is where the technique of filter and count must be applied.

We will be using the following algorithm to check whether a column (or combination of columns) uniquely identifies our observations:

Start with one column that you believe could uniquely identify an observation
Count all of the elements in a column
Filter out any of them that occur more than once
If you do not receive any observation that occur more than once, you are done, if not, you must either try another column, or try a combination of columns. Use your intuition.

This small algorithm ensures that we are uniquely identifying each observation. It clearly displays observations that occur more than once to inform us that this column (or columns) does or does not identify a primary key. To summarize, if we run our algorithm and receive any output of observations, we failed to find a primary key and need to rethink how we identify an observation.

Let’s try out this algorithm on our two columns that we intuitively thought could identify each individual.

## using algorithm on the passenger_id column
titanic %>% 
  count(passenger_id) %>% 
  filter(n >1)

## [1] passenger_id n           
## <0 rows> (or 0-length row.names)

Success! The passenger_id column does not contain multiple observations of identification. Therefore, it uniquely identifies each observation, and we could use this column to merge in data from other data sets.

## using algorithm on the name column
titanic %>% 
  count(name) %>% 
  filter(n >1)

## [1] name n   
## <0 rows> (or 0-length row.names)

Another success! The name column also does not contain multiple observations of any particular name. Therefore, it also identifies each individual, and we could also use this column to merge in data from other data sets.

On the other hand, let’s test the ticket column as it intuitively seems like it could also uniquely identify an observation (there is usually one ticket per person).

## testing if ticket uniquely identifies
titanic %>% 
  count(ticket) %>% 
  filter(n >1) %>% 
  head()

##   ticket n
## 1 110152 3
## 2 110413 3
## 3 110465 2
## 4 111361 2
## 5 113505 2
## 6 113572 2

Clearly, since we see that there are multiple duplicates of ticket numbers, we cannot uniquely identify our observations using the ticket column.

Our titanic dataset is simple in that we have two ways to uniquely identify an observation which only included one column, but this is not always true. In the next section, we will go through a harder example which will require a collection of columns.

13.2 Finding Primary Keys (harder)

In the titanic example, finding the primary keys was pretty intuitive, and a little trivial. However, this is not always the case. Let’s take a look at a data set from the babynames package. The babynames data set provides data on names given at birth from the Social Security Administration. This includes all names with at least 5 uses (sorry Elon). For our purposes, we will subset the data set since the original data is almost 2 million observations.

## loading the necessary package
library(babynames)
## selecting only the first 20,000 observations and getting rid of one of the rows we don't need
babynames <- babynames[1:20000,] %>% 
  select(-n)

## getting a preview of the data
babynames %>% 
  head()

## # A tibble: 6 × 4
##    year sex   name        prop
##   <dbl> <chr> <chr>      <dbl>
## 1  1880 F     Mary      0.0724
## 2  1880 F     Anna      0.0267
## 3  1880 F     Emma      0.0205
## 4  1880 F     Elizabeth 0.0199
## 5  1880 F     Minnie    0.0179
## 6  1880 F     Margaret  0.0162

As we can see, there are four columns:

year- the year of the birth
sex - the sex of the child
name - the name of the child
prop - a weighting variable we do not care about

We want to uniquely identify an observation. This is where we need intuition as well as our algorithm. Would name uniquely identify an observation? No. A name can appear in many different years. Observe our algorithm:

## checking if name uniquely identifies the data 
babynames %>% 
  count(name) %>% 
  filter(n >1)

## # A tibble: 2,728 × 2
##    name        n
##    <chr>   <int>
##  1 Aaron       9
##  2 Ab          4
##  3 Abbie      10
##  4 Abbott      2
##  5 Abby        8
##  6 Abe         9
##  7 Abel        9
##  8 Abigail     9
##  9 Abner       9
## 10 Abraham     9
## # ℹ 2,718 more rows

Clearly, we see that there are multiple times a name appears. Hence, we must use a collection of columns to uniquely identify (recall our algorithm).

Our intuition should tell us that using name and year may allow us to uniquely identify our data since each name likely occurs one time within each year. As always, our algorithm can check this assumption:

## checking if name and year uniquely identifies the data
babynames %>% 
  count(name, year) %>% 
  filter(n>1)

## # A tibble: 1,245 × 3
##    name   year     n
##    <chr> <dbl> <int>
##  1 Abbie  1887     2
##  2 Ada    1885     2
##  3 Ada    1886     2
##  4 Ada    1887     2
##  5 Addie  1880     2
##  6 Addie  1881     2
##  7 Addie  1882     2
##  8 Addie  1883     2
##  9 Addie  1884     2
## 10 Addie  1886     2
## # ℹ 1,235 more rows

It seems we are still not getting unique observations as there are many names that appear twice. Why would this be? This is where you must think critically about your data. Notice that our algorithm has shown us that names occur a maximum of two times. This should hint at something: these names are being used for both males and females. Therefore, to uniquely identify an observation, we must use three columns: name, year, and sex.

## checking if name, year, and sex uniquely identify an observation
babynames %>% 
  count(name, year, sex) %>% 
  filter(n > 1)

## # A tibble: 0 × 4
## # ℹ 4 variables: name <chr>, year <dbl>, sex <chr>, n <int>

Success! We managed to find a grouping of columns that brought each observation to one entry. Hence, if we were merging in data, we would want to use these three columns to connect the data.

13.2.1 Exercise

(Taken from R for Data Science Chapter 13.3 Exercises) Identify the primary key in the following data set from the fueleconomy package: vehicles. Remember, the primary key could be a single column or a collection of columns.

13.2.2 Exercise

(Taken from R for Data Science Chapter 13.3 Exercises) Identify the primary key in the following data set from the Lahman package: Batting. Remember, the primary key could be a single column or a collection of columns.

13.2.3 Exercise

(Taken from R for Data Science Chapter 13.3 Exercises) Identify the primary key in the following data set from the nasaweather package: atmos. Remember, the primary key could be a single column or a collection of columns.

13.3 Selected Solutions

These are shown on the next page. Please do not look at these until you have tried the exercises. These are provided because these exercises are more difficult.

(Exercise 1.2.1) The column id uniquely identifies the observations.
(Exercise 1.2.2) The columns (playerID, yearID, stint) uniquely identify each observation. The columns (playerID,yearID) are not a primary key because players can play on different teams within the same year
(Exercise 1.2.3) The primary key is (lat, long, year, month). The primary key represents the location and time that the measurement was taken.