Welcome!

About me

Patrick Li
Lecturer
Econometrics and Business Statistics
Monash University

TengMCing

[email protected]

  • I am a PhD candidate currently awaiting thesis review.
  • With over 20 years of programming experience, I have worked in languages such as C, Pascal, and Python, and have used R for the past 6 years.
  • My research focuses on data visualization, machine learning, visual inference, and computer vision models.
  • I have a strong interest in developing statistical software.
  • I have tutored business analytics units for 4 years, including the ETC3250/ETC5250 Introduction to Machine Learning, ETC5523 Communicating with Data, and ET5521 Exploratory Data Analysis.

Teaching Schedule

I will be teaching the second half of this unit. This includes

  • Relational data (week 7)
  • Web scraping (week 8)
  • Text analysis (week 9)
  • Cluster analysis (week 10)
  • Linear models (week 11)
  • Review and discussion (week 12)

Upcoming Assessment

Mid-semester Test (week 7)

Assignment 2 (week 9)

Final Exam (Nov 8th)

Expectation

Upon completion, you will

  • become an intermediate R user,
  • capable of tidying, exploring, and analyzing data, and
  • compiling it all into a reproducible report using RMarkdown or Quarto.

Code Examples

I will use webR with the quarto-webr extension to run code interactively within the slides during the lecture whenever possible, and I encourage you to explore and run the code yourself.

The tidyverse package is already loaded.

Recap

  • Advanced ggplot customization
  • Palettes
  • Color usage
  • Animated plots (gganimate)
  • Computer paths and RStudio projects

Relational Data

Outline

  1. What is relational data?
  2. How to combine different data sets for data analysis?
  3. Joins and Keys
  4. Different types of joins
  5. Stacking data frames
  6. Working with Excel files

Relational Data


Multiple tables of data are called relational data.


  • Data analysis rarely involves only a single table of data.

  • To answer questions we generally need to combine many tables of data.

  • It is the relations, not just the individual data sets, that are important. The existence of common variables (keys/value pairs) allow us to combine different data sets.

Flights that Departed NYC in 2013

The nycflights13 package contains data set of flights that departed NYC in 2013.

It has five tables:

  1. flights: on-time data for all flights that departed NYC (i.e. JFK, LGA or EWR) in 2013.
  2. airlines: airline names from their carrier codes.
  3. airports: records information about each airport (identified by the faa airport code).
  4. planes: data about each plane, identified by its tailnum.
  5. weather: information about the weather at each NYC airport for each hour.

Information Comes from Different Tables

flights airlines airports planes weather
year carrier faa tailnum origin
month name name year year
day lat type month
dep_time lon manufacturer day
sched_dep_time alt model hour
dep_delay tz engines temp
arr_time dst seats dewp
sched_arr_time tzone speed humid
arr_delay engine wind_dir
carrier wind_speed
flight wind_gust
tailnum precip
origin pressure
dest visib
air_time time_hour
distance
hour
minute
time_hour

Explore the Data Sets

Tip

  • Use Crtl/CMD + Enter/Return to run a single line of code or highlighted code.
  • Use Shift + Enter/Return to run the entire code block.

Concept Map of Tables and Joins from the Text

Keys

Keys are variable(s) used to connect records in one table to another.

Keys can uniquely identifies an observation.

There are two types of keys:

1. Primary Key

  • uniquely identifies an observation in its own table.
  • For example, planes$tailnum is a primary key because it uniquely identifies each plane in the planes table.

2. Foreign key

  • uniquely identifies an observation in another table.
  • For example, flights$tailnum is a foreign key because it helps to combine flights and planes data. It references the primary key of another table and establish a link between the existing table and another table.

Keys

In the nycflights13 data,

  • flights connects to planes by a single variable tailnum.
  • flights connects to airlines by a single variable carrier.
  • flights connects to airports by two variables, origin and dest.
  • flights connects to weather using multiple variables, origin, and year, month, day and hour.

Joins

Joins allow you to combine information from multiple tables.

There is always a decision on what variables and observations are copied to the new table.

We will explore six types of joins, grouped into two main categories:

Mutating joins

  • Inner join: inner_join()
  • Left join: left_join()
  • Right join: right_join()
  • Full join: full_join()

Filtering joins

  • Semi join: semi_join()
  • Anti join: anti_join()

Let’s discuss how joins work using some lovely animations provided by Garrick Aden-Buie.

Example Data

x <- tibble(id = c(1, 2, 3),
            x = c("x1", "x2", "x3"))
y <- tibble(id = c(1, 2, 4),
            y = c("y1", "y2", "y4"))

Mutating Joins

Inner Join

An inner join matches pairs of observations when their keys are equal.

The output of an inner join is a new data frame that contains

  • key
  • x values
  • y values

Important

The unmatched rows are not included in the result.

Inner Join

Use inner_join() to combine the full airline names with the flights data.

Note

  • Since every carrier code in flights matches one in airlines, all 336,766 observations from flights will be retained.

  • The full airline names will be added to the resulting table.

Outer joins

An outer join keeps observations that appear in at least one of the tables.

Three types of outer joins:

  • Left join: keeps all observations in x.
  • Right join: keeps all observations in y.
  • Full join: keeps all observations in x and y.

dplyr functions:

Left Join

All observations from the left table, but only the observations from the right table that match those in the left.

Left Join

Use left_join() to combine the information about each plane with the flights data.

Note

  • Out of the 336,776 tail numbers in flights, only 284,170 match those in planes.

  • left_join(flights, planes) keeps all observations from flights, the resulting table will have 336,776 rows.

  • Some additional variables from planes, such as manufacturer, will have NA values where there are no matches.

Right Join

Same as left join, but in reverse.

Tip

Technically, you can replace left_join(x, y) by right_join(y, x), but be aware that the order of the variables in the resulting table will differ.

Full Join

Union of the two tables, all observations from both, and missing values might get added.

Different Key Name when Joining Tables

What if the key names from the two tables are different?

Use a named vector to specify the key pair.

The primary key in x_diff_key is x_id.

The primary key in y_diff_key is y_id.

Note

The key name from the left table will be retained in the resulting table.

Filtering Joins

Semi Join

keeps all observations in x that have a match in y and returns just the columns from x.

Same as an inner join, but no additional columns are added.

Note

A semi join can be replaced by a filter() call, which is often recommended because it makes the code easier to read.

Anti Join

Drops all observations in x that have a match in y.

Note

An anti join can also be replaced by a filter() call, and it is often recommended.

Flights Joining to Airports and Airlines

Retrieve both the airport and airline information for each flight based on its departure location.

Flights Joining to Weather

Retrieve the weather information for each flight based on the departure location and time.

Note

time_hour is a datetime variable that consolidates the information from year, month, day, and hour.

In the weather data set, the combination of origin and time_hour can uniquely identify each row, as does the combination of origin, year, month, day, and hour.

Both combinations can be used for joins. We use the second combination as the primary key in this case.

Stacking Data Frames

Stacking Rows

bind_rows() will bind multiple data frames by row, making a longer result.

The output will contain all columns that appear in any of the inputs.

Columns are matched by name, and any missing columns will be filled with NA.

Stacking Columns

bind_cols() will bind multiple data frames by column, making a wider result.

Row sizes must be compatible when binding column.

Working with Excel Files

Example Data: NSW Crime Statistics

https://bocsar.nsw.gov.au/statistics-dashboards/open-datasets/local-area-datasets.html

Example Data: NSW Crime Statistics

https://bocsar.nsw.gov.au/content/dam/dcj/bocsar/documents/publications/lga/NewSouthWales.xlsx

Working with Excel Files

List sheets in an Excel file:

library(readxl)
excel_sheets("data/NewSouthWales.xlsx")
[1] "Summary of offences" "Premises Type"       "Victims"            
[4] "Offenders"           "Aboriginality"       "Alcohol Related"    
[7] "Month"               "Time"               

Read different Excel sheets:

read_excel("data/NewSouthWales.xlsx", sheet = 1) %>%
  glimpse()
Rows: 90
Columns: 15
$ `NSW Recorded Crime Statistics Apr 2019 - Mar 2024` <chr> NA, "Number of rec…
$ ...2                                                <chr> NA, NA, NA, NA, NA…
$ `Apr 2023 - Mar 2024`                               <chr> NA, NA, NA, NA, "A…
$ `Definitions and explanations`                      <chr> NA, NA, NA, NA, NA…
$ ...5                                                <chr> NA, NA, NA, NA, "A…
$ ...6                                                <chr> NA, NA, NA, NA, NA…
$ ...7                                                <chr> NA, NA, NA, NA, "A…
$ ...8                                                <chr> NA, NA, NA, NA, NA…
$ ...9                                                <chr> NA, NA, NA, NA, "A…
$ ...10                                               <chr> NA, NA, NA, NA, NA…
$ ...11                                               <chr> NA, NA, NA, NA, "A…
$ ...12                                               <chr> NA, NA, NA, NA, NA…
$ ...13                                               <chr> NA, NA, NA, NA, NA…
$ ...14                                               <chr> NA, NA, NA, NA, NA…
$ ...15                                               <chr> NA, NA, NA, NA, NA…

Working with Excel Files

Read specific cells from an Excel file:

read_excel("data/NewSouthWales.xlsx", 
           sheet = 1, 
           range = "E8:F20", 
           col_names = c("num_of_incidents", 
                         "rate_per_100000_population"))
# A tibble: 13 × 2
   num_of_incidents rate_per_100000_population
              <dbl>                      <dbl>
 1               67                      0.828
 2            32181                    397.   
 3            29375                    363.   
 4             7608                     94.0  
 5             8173                    101.   
 6             1972                     24.4  
 7            18466                    228.   
 8             6737                     83.2  
 9            11223                    139.   
10            26735                    330.   
11            18974                    234.   
12            36538                    451.   
13            52726                    651.   

Working with Excel Files

Other ways to specify the rectangle area (range) you want to read data from

  • anchored("C4", dim = c(3, 2)): “C4” is the anchor point at the upper left. The selected area spans 3 rows and 2 columns.
  • cell_limits(c(1, 3), c(2, 5)): Defines a rectangle from the first row, third column (upper left) to the second row, fifth column (lower right).
  • cell_rows(c(1, 3)): Specifies the area spanning rows 1 through 3.
  • cell_cols(c(1, 3)): Specifies the area spanning columns 1 through 3.

A value of NA in cell_limits() means the corresponding limit is unspecified.

  • cell_limits(c(NA, NA), c(NA, NA)) will select the entire sheet.
  • cell_limits(c(1, NA), c(2, 5)) will select from the first row, first column (upper left) to the second row, fifth column (lower right).
  • cell_limits(c(1, NA), c(2, NA)) will select row 1 and 2.

Reading Different Formats

The package haven enables R to read and write various data formats used by other statistical packages.

  • SAS: read_sas()
  • SPSS: read_spss()
  • Stata: read_stata()

Mid Semester Test

This is a timed quiz available from September 5 to September 8.

You will have 80 minutes to complete it.

Once you start, you must finish. There is no option to save and return to it later.

Make sure you have enough time before you begin, as you only have one attempt.