ETC1010/ETC5510: Introduction to Data Analysis

Welcome!

About me

Patrick Li
Lecturer
Econometrics and Business Statistics
Monash University

TengMCing

[email protected]

I am a PhD candidate currently awaiting thesis review.
With over 20 years of programming experience, I have worked in languages such as C, Pascal, and Python, and have used R for the past 6 years.
My research focuses on data visualization, machine learning, visual inference, and computer vision models.
I have a strong interest in developing statistical software.
I have tutored business analytics units for 4 years, including the ETC3250/ETC5250 Introduction to Machine Learning, ETC5523 Communicating with Data, and ET5521 Exploratory Data Analysis.

Teaching Schedule

I will be teaching the second half of this unit. This includes

Relational data (week 7)
Web scraping (week 8)
Text analysis (week 9)
Cluster analysis (week 10)
Linear models (week 11)
Review and discussion (week 12)

Upcoming Assessment

Mid-semester Test (week 7)

Assignment 2 (week 9)

Final Exam (Nov 8th)

Expectation

Upon completion, you will

become an intermediate R user,
capable of tidying, exploring, and analyzing data, and
compiling it all into a reproducible report using RMarkdown or Quarto.

Code Examples

I will use webR with the quarto-webr extension to run code interactively within the slides during the lecture whenever possible, and I encourage you to explore and run the code yourself.

The tidyverse package is already loaded.

Recap

Advanced ggplot customization
Palettes
Color usage
Animated plots (gganimate)
Computer paths and RStudio projects

Relational Data

Outline

What is relational data?
How to combine different data sets for data analysis?
Joins and Keys
Different types of joins
Stacking data frames
Working with Excel files

Relational Data

Multiple tables of data are called relational data.

Data analysis rarely involves only a single table of data.
To answer questions we generally need to combine many tables of data.
It is the relations, not just the individual data sets, that are important. The existence of common variables (keys/value pairs) allow us to combine different data sets.

Flights that Departed NYC in 2013

The nycflights13 package contains data set of flights that departed NYC in 2013.

It has five tables:

flights: on-time data for all flights that departed NYC (i.e. JFK, LGA or EWR) in 2013.
airlines: airline names from their carrier codes.
airports: records information about each airport (identified by the faa airport code).
planes: data about each plane, identified by its tailnum.
weather: information about the weather at each NYC airport for each hour.

Information Comes from Different Tables

flights	airlines	airports	planes	weather
year	carrier	faa	tailnum	origin
month	name	name	year	year
day		lat	type	month
dep_time		lon	manufacturer	day
sched_dep_time		alt	model	hour
dep_delay		tz	engines	temp
arr_time		dst	seats	dewp
sched_arr_time		tzone	speed	humid
arr_delay			engine	wind_dir
carrier				wind_speed
flight				wind_gust
tailnum				precip
origin				pressure
dest				visib
air_time				time_hour
distance
hour
minute
time_hour

Explore the Data Sets

Tip

Use Crtl/CMD + Enter/Return to run a single line of code or highlighted code.
Use Shift + Enter/Return to run the entire code block.

Concept Map of Tables and Joins from the Text

Keys

Keys are variable(s) used to connect records in one table to another.

Keys can uniquely identifies an observation.

There are two types of keys:

1. Primary Key

uniquely identifies an observation in its own table.
For example, planes$tailnum is a primary key because it uniquely identifies each plane in the planes table.

2. Foreign key

uniquely identifies an observation in another table.
For example, flights$tailnum is a foreign key because it helps to combine flights and planes data. It references the primary key of another table and establish a link between the existing table and another table.

Keys

In the nycflights13 data,

flights connects to planes by a single variable tailnum.
flights connects to airlines by a single variable carrier.
flights connects to airports by two variables, origin and dest.
flights connects to weather using multiple variables, origin, and year, month, day and hour.

Joins

Joins allow you to combine information from multiple tables.

There is always a decision on what variables and observations are copied to the new table.

We will explore six types of joins, grouped into two main categories:

Mutating joins

Inner join: inner_join()
Left join: left_join()
Right join: right_join()
Full join: full_join()

Filtering joins

Semi join: semi_join()
Anti join: anti_join()

Let’s discuss how joins work using some lovely animations provided by Garrick Aden-Buie.

Example Data

x <- tibble(id = c(1, 2, 3),
            x = c("x1", "x2", "x3"))
y <- tibble(id = c(1, 2, 4),
            y = c("y1", "y2", "y4"))

Mutating Joins

Inner Join

An inner join matches pairs of observations when their keys are equal.

The output of an inner join is a new data frame that contains

key
x values
y values

Important

The unmatched rows are not included in the result.

Animation
Code

Inner Join

Use inner_join() to combine the full airline names with the flights data.

Note

Since every carrier code in flights matches one in airlines, all 336,766 observations from flights will be retained.
The full airline names will be added to the resulting table.

Outer joins

An outer join keeps observations that appear in at least one of the tables.

Three types of outer joins:

Left join: keeps all observations in x.
Right join: keeps all observations in y.
Full join: keeps all observations in x and y.

dplyr functions:

Left Join

All observations from the left table, but only the observations from the right table that match those in the left.

Animation
Code

Animation
Code

Left Join

Use left_join() to combine the information about each plane with the flights data.

Note

Out of the 336,776 tail numbers in flights, only 284,170 match those in planes.
left_join(flights, planes) keeps all observations from flights, the resulting table will have 336,776 rows.
Some additional variables from planes, such as manufacturer, will have NA values where there are no matches.

Right Join

Same as left join, but in reverse.

Tip

Technically, you can replace left_join(x, y) by right_join(y, x), but be aware that the order of the variables in the resulting table will differ.

Animation
Code

Full Join

Union of the two tables, all observations from both, and missing values might get added.

Animation
Code

Different Key Name when Joining Tables

What if the key names from the two tables are different?

Use a named vector to specify the key pair.

The primary key in x_diff_key is x_id.

The primary key in y_diff_key is y_id.

Note

The key name from the left table will be retained in the resulting table.

Filtering Joins

Semi Join

keeps all observations in x that have a match in y and returns just the columns from x.

Same as an inner join, but no additional columns are added.

Note

A semi join can be replaced by a filter() call, which is often recommended because it makes the code easier to read.

Animation
Code

Anti Join

Drops all observations in x that have a match in y.

Note

An anti join can also be replaced by a filter() call, and it is often recommended.

Animation
Code

Flights Joining to Airports and Airlines

Retrieve both the airport and airline information for each flight based on its departure location.

Flights Joining to Weather

Retrieve the weather information for each flight based on the departure location and time.

Note

time_hour is a datetime variable that consolidates the information from year, month, day, and hour.

In the weather data set, the combination of origin and time_hour can uniquely identify each row, as does the combination of origin, year, month, day, and hour.

Both combinations can be used for joins. We use the second combination as the primary key in this case.

Stacking Data Frames

Stacking Rows

bind_rows() will bind multiple data frames by row, making a longer result.

The output will contain all columns that appear in any of the inputs.

Columns are matched by name, and any missing columns will be filled with NA.

Stacking Columns

bind_cols() will bind multiple data frames by column, making a wider result.

Row sizes must be compatible when binding column.

Working with Excel Files

Example Data: NSW Crime Statistics

https://bocsar.nsw.gov.au/statistics-dashboards/open-datasets/local-area-datasets.html

Example Data: NSW Crime Statistics

https://bocsar.nsw.gov.au/content/dam/dcj/bocsar/documents/publications/lga/NewSouthWales.xlsx

Working with Excel Files

List sheets in an Excel file:

library(readxl)
excel_sheets("data/NewSouthWales.xlsx")

[1] "Summary of offences" "Premises Type"       "Victims"            
[4] "Offenders"           "Aboriginality"       "Alcohol Related"    
[7] "Month"               "Time"

Read different Excel sheets:

read_excel("data/NewSouthWales.xlsx", sheet = 1) %>%
  glimpse()

Rows: 90
Columns: 15
$ `NSW Recorded Crime Statistics Apr 2019 - Mar 2024` <chr> NA, "Number of rec…
$ ...2                                                <chr> NA, NA, NA, NA, NA…
$ `Apr 2023 - Mar 2024`                               <chr> NA, NA, NA, NA, "A…
$ `Definitions and explanations`                      <chr> NA, NA, NA, NA, NA…
$ ...5                                                <chr> NA, NA, NA, NA, "A…
$ ...6                                                <chr> NA, NA, NA, NA, NA…
$ ...7                                                <chr> NA, NA, NA, NA, "A…
$ ...8                                                <chr> NA, NA, NA, NA, NA…
$ ...9                                                <chr> NA, NA, NA, NA, "A…
$ ...10                                               <chr> NA, NA, NA, NA, NA…
$ ...11                                               <chr> NA, NA, NA, NA, "A…
$ ...12                                               <chr> NA, NA, NA, NA, NA…
$ ...13                                               <chr> NA, NA, NA, NA, NA…
$ ...14                                               <chr> NA, NA, NA, NA, NA…
$ ...15                                               <chr> NA, NA, NA, NA, NA…

Working with Excel Files

Read specific cells from an Excel file:

read_excel("data/NewSouthWales.xlsx", 
           sheet = 1, 
           range = "E8:F20", 
           col_names = c("num_of_incidents", 
                         "rate_per_100000_population"))

# A tibble: 13 × 2
   num_of_incidents rate_per_100000_population
              <dbl>                      <dbl>
 1               67                      0.828
 2            32181                    397.   
 3            29375                    363.   
 4             7608                     94.0  
 5             8173                    101.   
 6             1972                     24.4  
 7            18466                    228.   
 8             6737                     83.2  
 9            11223                    139.   
10            26735                    330.   
11            18974                    234.   
12            36538                    451.   
13            52726                    651.

Working with Excel Files

Other ways to specify the rectangle area (range) you want to read data from

anchored("C4", dim = c(3, 2)): “C4” is the anchor point at the upper left. The selected area spans 3 rows and 2 columns.
cell_limits(c(1, 3), c(2, 5)): Defines a rectangle from the first row, third column (upper left) to the second row, fifth column (lower right).
cell_rows(c(1, 3)): Specifies the area spanning rows 1 through 3.
cell_cols(c(1, 3)): Specifies the area spanning columns 1 through 3.

A value of NA in cell_limits() means the corresponding limit is unspecified.

cell_limits(c(NA, NA), c(NA, NA)) will select the entire sheet.
cell_limits(c(1, NA), c(2, 5)) will select from the first row, first column (upper left) to the second row, fifth column (lower right).
cell_limits(c(1, NA), c(2, NA)) will select row 1 and 2.

Reading Different Formats

The package haven enables R to read and write various data formats used by other statistical packages.

SAS: read_sas()
SPSS: read_spss()
Stata: read_stata()

Mid Semester Test

This is a timed quiz available from September 5 to September 8.

You will have 80 minutes to complete it.

Once you start, you must finish. There is no option to save and return to it later.

Make sure you have enough time before you begin, as you only have one attempt.

ETC1010/ETC5510: Introduction to Data Analysis

R History Command Contents