Data wRangling

2016-02-23

Warmups - Problem 1

10 week sensory experiment, 12 individuals assessed taste of french fries on several scales (how potato-y, buttery, grassy, rancid, paint-y do they taste?), fried in one of 3 different oils, replicated twice. First few rows:

time	treatment	subject	rep	potato	buttery	grassy	rancid	painty
1	1	3	1	2.9	0.0	0.0	0.0	5.5
1	1	3	2	14.0	0.0	0.0	1.1	0.0
1	1	10	1	11.0	6.4	0.0	0.0	0.0
1	1	10	2	9.9	5.9	2.9	2.2	0.0

What do you want to know?

Warmups - Problem 2

What's in the column names of this data?

id	WI-6.R1	WI-6.R2	WI-6.R4	WM-6.R1	WM-6.R2	WI-12.R1	WI-12.R2	WI-12.R4	WM-12.R1	WM-12.R2	WM-12.R4
Gene 1	2.2	2.20	4.2	2.63	5.1	4.5	5.5	4.4	3.9	4.2	3.73
Gene 2	1.5	0.59	1.9	0.52	2.9	1.4	3.0	1.3	1.2	1.2	0.89
Gene 3	2.0	0.87	3.3	0.53	4.6	2.2	5.6	2.5	2.5	3.0	1.35

Warmups - Problem 3

How many ways can you write today's date?

What we are going to cover today

Reading different data formats
Tidying data
Split - apply - combine
Pipes
Joins
Working with dates
Splitting strings

Reading different data formats

images: library(EBImage)
sound: library(tuneR)
fixed width fields: read.fwf()
netCDF: library(ncdf)
hdf5: library(hdf5)
json: library(jsonlite)

XML/HTML

library(XML)
src <- "http://www.realclearpolitics.com/epolls/2012/president/us/republican_presidential_nomination-1452.html"
tables <- readHTMLTable(src)
polls <- tables[[1]]
head(polls)
#>                                       Poll        Date Sample MoE Romney 
#> 1                              RCP Average  4/9 - 4/17     --  --    52.8
#> 2       CBS News/NY TimesCBS News/NY Times 4/13 - 4/17 268 RV 6.0      54
#> 3 CNN/Opinion ResearchCNN/Opinion Research 4/13 - 4/15  473 A 4.5      57
#> 4                           PPP (D)PPP (D) 4/12 - 4/15 742 RV 3.6      54
#> 5                         FOX NewsFOX News  4/9 - 4/11 376 RV 5.0      46
#>   Santorum  Gingrich  Paul  Perry  Huntsman  Bachmann  Cain        Spread
#> 1                19.0  15.0                                  Romney +33.8
#> 2        --        20    12     --        --        --    --   Romney +34
#> 3        --        19    18     --        --        --    --   Romney +38
#> 4        --        24    14     --        --        --    --   Romney +30
#> 5        15        13    16     --        --        --    --   Romney +30

GIS

This code is a bit slow to run, but it draws all the electoral districts of Australia.

library(maptools)
xx <- readShapeSpatial("http://dicook.github.io/Monash-R/data/australia/region.shp")
object.size(as(xx, "SpatialPolygons"))
xxx <- thinnedSpatialPoly(as(xx, "SpatialPolygons"), 
  tolerance=0.5, minarea=0.001, topologyPreserve=TRUE)
object.size(as(xxx, "SpatialPolygons"))
qplot(long, lat, data=xx, group=group) + geom_path() + coord_map()

French fries - hot chips

time	treatment	subject	rep	potato	buttery	grassy	rancid	painty
1	1	3	1	2.9	0.0	0.0	0.0	5.5
1	1	3	2	14.0	0.0	0.0	1.1	0.0
1	1	10	1	11.0	6.4	0.0	0.0	0.0
1	1	10	2	9.9	5.9	2.9	2.2	0.0
1	1	15	1	1.2	0.1	0.0	1.1	5.1
1	1	15	2	8.8	3.0	3.6	1.5	2.3

What would we like to know?

Is the design complete?
Are replicates like each other?
How do the ratings on the different scales differ?
Are raters giving different scores on average?
Do ratings change over the weeks?

Each of these questions involves different summaries of the data.

What we have and what we want

Gathering

When gathering, you need to specify the keys (identifiers) and the values (measures).

Keys/Identifiers: - Identify a record (must be unique) - Example: Indices on an random variable - Fixed by design of experiment (known in advance) - May be single or composite (may have one or more variables)

Values/Measures: - Collected during the experiment (not known in advance) - Usually numeric quantities

Gathering the French Fries

library(tidyr)
ff_long <- gather(french_fries, key = variable, value = rating, potato:painty)
head(ff_long)
#>   time treatment subject rep variable rating
#> 1    1         1       3   1   potato    2.9
#> 2    1         1       3   2   potato   14.0
#> 3    1         1      10   1   potato   11.0
#> 4    1         1      10   2   potato    9.9
#> 5    1         1      15   1   potato    1.2
#> 6    1         1      15   2   potato    8.8

Long to Wide

In certain applications, we may wish to take a long dataset and convert it to a wide dataset (Perhaps displaying in a table).

This is called "spreading" the data.

Spread

We use the spread function from tidyr to do this:

french_fries_wide <- spread(ff_long, key = variable, value = rating)

head(french_fries_wide)
#>   time treatment subject rep potato buttery grassy rancid painty
#> 1    1         1       3   1    2.9     0.0    0.0    0.0    5.5
#> 2    1         1       3   2   14.0     0.0    0.0    1.1    0.0
#> 3    1         1      10   1   11.0     6.4    0.0    0.0    0.0
#> 4    1         1      10   2    9.9     5.9    2.9    2.2    0.0
#> 5    1         1      15   1    1.2     0.1    0.0    1.1    5.1
#> 6    1         1      15   2    8.8     3.0    3.6    1.5    2.3

Lets use gather and spread to answer some questions

Easiest question to start is whether the ratings are similar on the different scales, potato'y, buttery, grassy, rancid and painty.

We need to gather the data into long form, and make plots facetted by the scale.

Ratings on the different scales

library(ggplot2)
ff.m <- french_fries %>% 
  gather(type, rating, -subject, -time, -treatment, -rep)
head(ff.m)
#>   time treatment subject rep   type rating
#> 1    1         1       3   1 potato    2.9
#> 2    1         1       3   2 potato   14.0
#> 3    1         1      10   1 potato   11.0
#> 4    1         1      10   2 potato    9.9
#> 5    1         1      15   1 potato    1.2
#> 6    1         1      15   2 potato    8.8
qplot(rating, data=ff.m, binwidth=2) + 
  facet_wrap(~type, ncol=5)

Side-by-side boxplots

qplot(type, rating, data = ff.m, fill = type, geom = "boxplot")

Do the replicates look like each other?

We will start to tackle this by plotting the replicates against each other using a scatterplot.

We need to gather the data into long form, and then get the replicates spread into separate columns.

Check replicates

head(ff.m)
#>   time treatment subject rep   type rating
#> 1    1         1       3   1 potato    2.9
#> 2    1         1       3   2 potato   14.0
#> 3    1         1      10   1 potato   11.0
#> 4    1         1      10   2 potato    9.9
#> 5    1         1      15   1 potato    1.2
#> 6    1         1      15   2 potato    8.8
ff.s <- ff.m %>% spread(rep, rating)
head(ff.s)
#>   time treatment subject    type    1    2
#> 1    1         1       3  potato  2.9 14.0
#> 2    1         1       3 buttery  0.0  0.0
#> 3    1         1       3  grassy  0.0  0.0
#> 4    1         1       3  rancid  0.0  1.1
#> 5    1         1       3  painty  5.5  0.0
#> 6    1         1      10  potato 11.0  9.9

Check replicates

qplot(`1`, `2`, data=ff.s) + theme(aspect.ratio=1) + 
  xlab("Rep 1") + ylab("Rep 2")
qplot(`1`, `2`, data=ff.s) + theme(aspect.ratio=1) + 
  xlab("Rep 1") + ylab("Rep 2") + 
  scale_x_log10() + scale_y_log10()

Your turn

Make the scatterplots of reps against each other separately for scales, and treatment.

Legos = tidy data

(Courtesy of Hadley Wickham)

Play mobile = messy data

(Courtesy of Hadley Wickham)

Your turn

Read in the billboard top 100 music data, which contains N'Sync and Backstreet Boys songs that entered the billboard charts in the year 2000

billboard <- read.csv("http://dicook.github.io/Monash-R/data/billboard.csv")

What's in this data? What's X1-X76?

Your turn

Use tidyr to convert this data into a long format appropriate for plotting a time series (date on the x axis, chart position on the y axis)
Use ggplot2 to create this time series plot:

The Split-Apply-Combine Approach

(Diagram originally from Hadley Wickham)

Split-Apply-Combine in dplyr

library(dplyr)
french_fries_split <- group_by(ff_long, variable) # SPLIT
french_fries_apply <- summarise(french_fries_split, rating = mean(rating, na.rm = TRUE)) # APPLY + COMBINE
french_fries_apply
#> Source: local data frame [5 x 2]
#> 
#>   variable rating
#>     (fctr)  (dbl)
#> 1   potato   6.95
#> 2  buttery   1.82
#> 3   grassy   0.66
#> 4   rancid   3.85
#> 5   painty   2.52

The pipe operator

dplyr allows us to chain together these data analysis tasks using the %>% (pipe) operator
x %>% f(y) is shorthand for f(x, y)
Example:

french_fries %>%
    gather(key = variable, value = rating, potato:painty) %>%
    group_by(variable) %>%
    summarise(rating = mean(rating, na.rm = TRUE))
#> Source: local data frame [5 x 2]
#> 
#>   variable rating
#>     (fctr)  (dbl)
#> 1   potato   6.95
#> 2  buttery   1.82
#> 3   grassy   0.66
#> 4   rancid   3.85
#> 5   painty   2.52

dplyr verbs

There are five primary dplyr verbs, representing distinct data analysis tasks:

Filter: Remove the rows of a data frame, producing subsets
Arrange: Reorder the rows of a data frame
Select: Select particular columns of a data frame
Mutate: Add new columns that are functions of existing columns
Summarise: Create collapsed summaries of a data frame

Filter

french_fries %>%
    filter(subject == 3, time == 1)
#>   time treatment subject rep potato buttery grassy rancid painty
#> 1    1         1       3   1    2.9     0.0    0.0    0.0    5.5
#> 2    1         1       3   2   14.0     0.0    0.0    1.1    0.0
#> 3    1         2       3   1   13.9     0.0    0.0    3.9    0.0
#> 4    1         2       3   2   13.4     0.1    0.0    1.5    0.0
#> 5    1         3       3   1   14.1     0.0    0.0    1.1    0.0
#> 6    1         3       3   2    9.5     0.0    0.6    2.8    0.0

Arrange

french_fries %>%
    arrange(desc(rancid)) %>%
    head
#>   time treatment subject rep potato buttery grassy rancid painty
#> 1    9         2      51   1    7.3     2.3      0     15    0.1
#> 2   10         1      86   2    0.7     0.0      0     14   13.1
#> 3    5         2      63   1    4.4     0.0      0     14    0.6
#> 4    9         2      63   1    1.8     0.0      0     14   12.3
#> 5    5         2      19   2    5.5     4.7      0     13    4.6
#> 6    4         3      63   1    5.6     0.0      0     13    4.4

Select

french_fries %>%
    select(time, treatment, subject, rep, potato) %>%
    head
#>    time treatment subject rep potato
#> 61    1         1       3   1    2.9
#> 25    1         1       3   2   14.0
#> 62    1         1      10   1   11.0
#> 26    1         1      10   2    9.9
#> 63    1         1      15   1    1.2
#> 27    1         1      15   2    8.8

Summarise

french_fries %>%
    group_by(time, treatment) %>%
    summarise(mean_rancid = mean(rancid), sd_rancid = sd(rancid))
#> Source: local data frame [30 x 4]
#> Groups: time [?]
#> 
#>      time treatment mean_rancid sd_rancid
#>    (fctr)    (fctr)       (dbl)     (dbl)
#> 1       1         1         2.8       3.2
#> 2       1         2         1.7       2.7
#> 3       1         3         2.6       3.2
#> 4       2         1         3.9       4.4
#> 5       2         2         2.1       3.1
#> 6       2         3         2.5       3.4
#> 7       3         1         4.7       3.9
#> 8       3         2         2.9       3.8
#> 9       3         3         3.6       3.6
#> 10      4         1         2.1       2.4
#> ..    ...       ...         ...       ...

Let's use these tools to answer the rest of the french fries questions

If the data is complete it should be 12 x 10 x 3 x 2, that is, 6 records for each person. (Assuming that each person rated on all scales.)

To check this we want to tabulate the number of records for each subject, time and treatment. This means select appropriate columns, tabulate, count and spread it out to give a nice table.

Check completeness

french_fries %>% 
  select(subject, time, treatment) %>% 
  tbl_df() %>% 
  count(subject, time) %>%
  spread(time, n)
#> Source: local data frame [12 x 11]
#> 
#>    subject     1     2     3     4     5     6     7     8     9    10
#>     (fctr) (int) (int) (int) (int) (int) (int) (int) (int) (int) (int)
#> 1        3     6     6     6     6     6     6     6     6     6    NA
#> 2       10     6     6     6     6     6     6     6     6     6     6
#> 3       15     6     6     6     6     6     6     6     6     6     6
#> 4       16     6     6     6     6     6     6     6     6     6     6
#> 5       19     6     6     6     6     6     6     6     6     6     6
#> 6       31     6     6     6     6     6     6     6     6    NA     6
#> 7       51     6     6     6     6     6     6     6     6     6     6
#> 8       52     6     6     6     6     6     6     6     6     6     6
#> 9       63     6     6     6     6     6     6     6     6     6     6
#> 10      78     6     6     6     6     6     6     6     6     6     6
#> 11      79     6     6     6     6     6     6     6     6     6    NA
#> 12      86     6     6     6     6     6     6     6     6    NA     6

Check completeness with different scales, too

french_fries %>% 
  gather(type, rating, -subject, -time, -treatment, -rep) %>%
  select(subject, time, treatment, type) %>% 
  tbl_df() %>% 
  count(subject, time) %>%
  spread(time, n)
#> Source: local data frame [12 x 11]
#> 
#>    subject     1     2     3     4     5     6     7     8     9    10
#>     (fctr) (int) (int) (int) (int) (int) (int) (int) (int) (int) (int)
#> 1        3    30    30    30    30    30    30    30    30    30    NA
#> 2       10    30    30    30    30    30    30    30    30    30    30
#> 3       15    30    30    30    30    30    30    30    30    30    30
#> 4       16    30    30    30    30    30    30    30    30    30    30
#> 5       19    30    30    30    30    30    30    30    30    30    30
#> 6       31    30    30    30    30    30    30    30    30    NA    30
#> 7       51    30    30    30    30    30    30    30    30    30    30
#> 8       52    30    30    30    30    30    30    30    30    30    30
#> 9       63    30    30    30    30    30    30    30    30    30    30
#> 10      78    30    30    30    30    30    30    30    30    30    30
#> 11      79    30    30    30    30    30    30    30    30    30    NA
#> 12      86    30    30    30    30    30    30    30    30    NA    30

Change in ratings over weeks, relative to experimental design

qplot(time, rating, data=ff.m, colour=treatment) + 
  facet_grid(subject~type)

Add means over reps, and connect the dots

ff.m.av <- ff.m %>% 
  group_by(subject, time, type, treatment) %>%
  summarise(rating=mean(rating))
qplot(time, rating, data=ff.m, colour=treatment) + 
  facet_grid(subject~type) +
  geom_line(data=ff.m.av, aes(group=treatment))

String manipulation

When the experimental design is packed into column names, we need to unpack it.

genes <- read_csv("http://dicook.github.io/Monash-R/data/genes.csv")
genes
#> Source: local data frame [3 x 12]
#> 
#>       id WI-6.R1 WI-6.R2 WI-6.R4 WM-6.R1 WM-6.R2 WI-12.R1 WI-12.R2
#>    (chr)   (dbl)   (dbl)   (dbl)   (dbl)   (dbl)    (dbl)    (dbl)
#> 1 Gene 1     2.2    2.20     4.2    2.63     5.1      4.5      5.5
#> 2 Gene 2     1.5    0.59     1.9    0.52     2.9      1.4      3.0
#> 3 Gene 3     2.0    0.87     3.3    0.53     4.6      2.2      5.6
#> Variables not shown: WI-12.R4 (dbl), WM-12.R1 (dbl), WM-12.R2 (dbl),
#>   WM-12.R4 (dbl)

Gather column names into long form

gather(genes, variable, expr, -id)
#> Source: local data frame [33 x 3]
#> 
#>        id variable  expr
#>     (chr)   (fctr) (dbl)
#> 1  Gene 1  WI-6.R1  2.18
#> 2  Gene 2  WI-6.R1  1.46
#> 3  Gene 3  WI-6.R1  2.03
#> 4  Gene 1  WI-6.R2  2.20
#> 5  Gene 2  WI-6.R2  0.59
#> 6  Gene 3  WI-6.R2  0.87
#> 7  Gene 1  WI-6.R4  4.20
#> 8  Gene 2  WI-6.R4  1.86
#> 9  Gene 3  WI-6.R4  3.28
#> 10 Gene 1  WM-6.R1  2.63
#> ..    ...      ...   ...

Separate columns

genes %>%
  gather(variable, expr, -id) %>%
  separate(variable, c("trt", "leftover"), "-")
#> Source: local data frame [33 x 4]
#> 
#>        id   trt leftover  expr
#>     (chr) (chr)    (chr) (dbl)
#> 1  Gene 1    WI     6.R1  2.18
#> 2  Gene 2    WI     6.R1  1.46
#> 3  Gene 3    WI     6.R1  2.03
#> 4  Gene 1    WI     6.R2  2.20
#> 5  Gene 2    WI     6.R2  0.59
#> 6  Gene 3    WI     6.R2  0.87
#> 7  Gene 1    WI     6.R4  4.20
#> 8  Gene 2    WI     6.R4  1.86
#> 9  Gene 3    WI     6.R4  3.28
#> 10 Gene 1    WM     6.R1  2.63
#> ..    ...   ...      ...   ...

genes %>%
  gather(variable, expr, -id) %>%
  separate(variable, c("trt", "leftover"), "-") %>%
  separate(leftover, c("time", "rep"), "\\.")
#> Source: local data frame [33 x 5]
#> 
#>        id   trt  time   rep  expr
#>     (chr) (chr) (chr) (chr) (dbl)
#> 1  Gene 1    WI     6    R1  2.18
#> 2  Gene 2    WI     6    R1  1.46
#> 3  Gene 3    WI     6    R1  2.03
#> 4  Gene 1    WI     6    R2  2.20
#> 5  Gene 2    WI     6    R2  0.59
#> 6  Gene 3    WI     6    R2  0.87
#> 7  Gene 1    WI     6    R4  4.20
#> 8  Gene 2    WI     6    R4  1.86
#> 9  Gene 3    WI     6    R4  3.28
#> 10 Gene 1    WM     6    R1  2.63
#> ..    ...   ...   ...   ...   ...

gtidy <- genes %>%
  gather(variable, expr, -id) %>%
  separate(variable, c("trt", "leftover"), "-") %>%
  separate(leftover, c("time", "rep"), "\\.") %>%
  mutate(trt = sub("W", "", trt)) %>%
  mutate(rep = sub("R", "", rep))
gtidy
#> Source: local data frame [33 x 5]
#> 
#>        id   trt  time   rep  expr
#>     (chr) (chr) (chr) (chr) (dbl)
#> 1  Gene 1     I     6     1  2.18
#> 2  Gene 2     I     6     1  1.46
#> 3  Gene 3     I     6     1  2.03
#> 4  Gene 1     I     6     2  2.20
#> 5  Gene 2     I     6     2  0.59
#> 6  Gene 3     I     6     2  0.87
#> 7  Gene 1     I     6     4  4.20
#> 8  Gene 2     I     6     4  1.86
#> 9  Gene 3     I     6     4  3.28
#> 10 Gene 1     M     6     1  2.63
#> ..    ...   ...   ...   ...   ...

Your turn

Using the tidied dataset (gtidy), find the mean expression for each combination of id, trt, and time.
Use this tidied data to make this plot.

Flight database

SQLite database with flights that departed from New York City in the year 2013.

db <- nycflights13_sqlite()
db
#> src:  sqlite 3.8.6 [/var/folders/9d/njpjyby920z_l16gb_p5wx0c0000gn/T//RtmpbaDugV/nycflights13.sqlite]
#> tbls: airlines, airports, flights, planes, sqlite_stat1, weather

We have a number of tables here, but 'flights' is most important

tbl(db, "flights")
#> Source: sqlite 3.8.6 [/var/folders/9d/njpjyby920z_l16gb_p5wx0c0000gn/T//RtmpbaDugV/nycflights13.sqlite]
#> From: flights [336,776 x 16]
#> 
#>     year month   day dep_time dep_delay arr_time arr_delay carrier tailnum
#>    (int) (int) (int)    (int)     (dbl)    (int)     (dbl)   (chr)   (chr)
#> 1   2013     1     1      517         2      830        11      UA  N14228
#> 2   2013     1     1      533         4      850        20      UA  N24211
#> 3   2013     1     1      542         2      923        33      AA  N619AA
#> 4   2013     1     1      544        -1     1004       -18      B6  N804JB
#> 5   2013     1     1      554        -6      812       -25      DL  N668DN
#> 6   2013     1     1      554        -4      740        12      UA  N39463
#> 7   2013     1     1      555        -5      913        19      B6  N516JB
#> 8   2013     1     1      557        -3      709       -14      EV  N829AS
#> 9   2013     1     1      557        -3      838        -8      B6  N593JB
#> 10  2013     1     1      558        -2      753         8      AA  N3ALAA
#> ..   ...   ...   ...      ...       ...      ...       ...     ...     ...
#> Variables not shown: flight (int), origin (chr), dest (chr), air_time
#>   (dbl), distance (dbl), hour (dbl), minute (dbl)

Top carriers

top <- tbl(db, "flights") %>%
  count(carrier) %>%
  arrange(desc(n))
top
#> Source: sqlite 3.8.6 [/var/folders/9d/njpjyby920z_l16gb_p5wx0c0000gn/T//RtmpbaDugV/nycflights13.sqlite]
#> From: <derived table> [?? x 2]
#> Arrange: desc(n) 
#> 
#>    carrier     n
#>      (chr) (int)
#> 1       UA 58665
#> 2       B6 54635
#> 3       EV 54173
#> 4       DL 48110
#> 5       AA 32729
#> 6       MQ 26397
#> 7       US 20536
#> 8       9E 18460
#> 9       WN 12275
#> 10      VX  5162
#> ..     ...   ...

tbl(db, "airlines")
#> Source: sqlite 3.8.6 [/var/folders/9d/njpjyby920z_l16gb_p5wx0c0000gn/T//RtmpbaDugV/nycflights13.sqlite]
#> From: airlines [16 x 2]
#> 
#>    carrier                        name
#>      (chr)                       (chr)
#> 1       9E           Endeavor Air Inc.
#> 2       AA      American Airlines Inc.
#> 3       AS        Alaska Airlines Inc.
#> 4       B6             JetBlue Airways
#> 5       DL        Delta Air Lines Inc.
#> 6       EV    ExpressJet Airlines Inc.
#> 7       F9      Frontier Airlines Inc.
#> 8       FL AirTran Airways Corporation
#> 9       HA      Hawaiian Airlines Inc.
#> 10      MQ                   Envoy Air
#> 11      OO       SkyWest Airlines Inc.
#> 12      UA       United Air Lines Inc.
#> 13      US             US Airways Inc.
#> 14      VX              Virgin America
#> 15      WN      Southwest Airlines Co.
#> 16      YV          Mesa Airlines Inc.

a <- tbl(db, "airlines")
top <- left_join(top, a)
top
#> Source: sqlite 3.8.6 [/var/folders/9d/njpjyby920z_l16gb_p5wx0c0000gn/T//RtmpbaDugV/nycflights13.sqlite]
#> From: <derived table> [?? x 3]
#> 
#>    carrier     n                     name
#>      (chr) (int)                    (chr)
#> 1       UA 58665    United Air Lines Inc.
#> 2       B6 54635          JetBlue Airways
#> 3       EV 54173 ExpressJet Airlines Inc.
#> 4       DL 48110     Delta Air Lines Inc.
#> 5       AA 32729   American Airlines Inc.
#> 6       MQ 26397                Envoy Air
#> 7       US 20536          US Airways Inc.
#> 8       9E 18460        Endeavor Air Inc.
#> 9       WN 12275   Southwest Airlines Co.
#> 10      VX  5162           Virgin America
#> ..     ...   ...                      ...

dplyr does SQL for us

top$query
#> <Query> SELECT "carrier", "n", "name"
#> FROM (SELECT * FROM (SELECT "carrier", "n"
#> FROM (SELECT "carrier", COUNT() AS "n"
#> FROM "flights"
#> GROUP BY "carrier") AS "zzz1"
#> ORDER BY "n" DESC) AS "zzz2"
#> 
#> LEFT JOIN 
#> 
#> (SELECT "carrier", "name"
#> FROM "airlines") AS "zzz3"
#> 
#> USING ("carrier")) AS "zzz4"
#> <SQLiteConnection>

Ain't nobody got time for that!!

All the means!

tbl(db, "flights") %>%
  group_by(carrier) %>% 
  summarise_each(funs(mean)) %>%
  arrange(desc(dep_delay))
#> Source: sqlite 3.8.6 [/var/folders/9d/njpjyby920z_l16gb_p5wx0c0000gn/T//RtmpbaDugV/nycflights13.sqlite]
#> From: <derived table> [?? x 16]
#> Arrange: desc(dep_delay) 
#> 
#>    carrier  year month   day dep_time dep_delay arr_time arr_delay tailnum
#>      (chr) (dbl) (dbl) (dbl)    (dbl)     (dbl)    (dbl)     (dbl)   (dbl)
#> 1       F9  2013   6.6    16     1438        20     1672      21.9       0
#> 2       EV  2013   6.6    16     1369        20     1488      15.8       0
#> 3       YV  2013   6.9    16     1601        19     1761      15.6       0
#> 4       FL  2013   6.0    16     1387        19     1574      20.1       0
#> 5       WN  2013   6.6    16     1281        18     1443       9.6       0
#> 6       9E  2013   6.6    16     1487        17     1639       7.4       0
#> 7       B6  2013   6.5    16     1381        13     1406       9.5       0
#> 8       VX  2013   6.9    16     1280        13     1523       1.8       0
#> 9       OO  2013   8.8    16     1725        13     1913      11.9       0
#> 10      UA  2013   6.6    16     1327        12     1509       3.6       0
#> ..     ...   ...   ...   ...      ...       ...      ...       ...     ...
#> Variables not shown: flight (dbl), origin (dbl), dest (dbl), air_time
#>   (dbl), distance (dbl), hour (dbl), minute (dbl)

tbl(db, "flights") %>%
  group_by(carrier) %>%
  summarise(q = quantile(dep_delay, 0.75))
#> Source: sqlite 3.8.6 [/var/folders/9d/njpjyby920z_l16gb_p5wx0c0000gn/T//RtmpbaDugV/nycflights13.sqlite]
#> From: <derived table> [?? x 2]
#> Error in sqliteSendQuery(conn, statement): error in statement: no such function: QUANTILE

We can't expect all R functions to translate to SQL.
In order to use some R functions, we have no choice but to pull data into R (into memory).

flights <- collect(tbl(db, "flights"))

object.size(tbl(db, "flights"))
#> 6104 bytes
object.size(flights)
#> 35227168 bytes

flights %>%
  group_by(carrier) %>%
  summarise(q = quantile(arr_time, 0.75, na.rm = T))
#> Source: local data frame [16 x 2]
#> 
#>    carrier     q
#>      (chr) (dbl)
#> 1       9E  2046
#> 2       AA  1940
#> 3       AS  2128
#> 4       B6  1931
#> 5       DL  2017
#> 6       EV  1918
#> 7       F9  2010
#> 8       FL  2027
#> 9       HA  1514
#> 10      MQ  1912
#> 11      OO  1956
#> 12      UA  1944
#> 13      US  1811
#> 14      VX  2018
#> 15      WN  1904
#> 16      YV  1929

fortunes::fortune(192)
#> 
#> RAM is cheap and thinking hurts.
#>    -- Uwe Ligges (about memory requirements in R)
#>       R-help (June 2007)

Dates and Times

Dates are deceptively hard to work with in R.

Example: 02/05/2012. Is it February 5th, or May 2nd?

Other things are difficult too:

Time zones
POSIXct format in base R is challenging

The lubridate package helps tackle some of these issues.

Basic Lubridate Use

library(lubridate)

now()
#> [1] "2016-02-23 20:01:42 AEDT"
today()
#> [1] "2016-02-23"
now() + hours(4)
#> [1] "2016-02-24 00:01:42 AEDT"
today() - days(2)
#> [1] "2016-02-21"

Parsing Dates

ymd("2013-05-14")
#> [1] "2013-05-14 UTC"
mdy("05/14/2013")
#> [1] "2013-05-14 UTC"
dmy("14052013")
#> [1] "2013-05-14 UTC"
ymd_hms("2013:05:14 14:5:30", tz = "America/New_York")
#> [1] "2013-05-14 14:05:30 EDT"

Flight Dates

flights %>%
  mutate(date = paste(year, month, day, sep = "-")) %>%
  mutate(date2 = ymd(date)) %>%
  select(date, date2)
#> Source: local data frame [336,776 x 2]
#> 
#>        date      date2
#>       (chr)     (time)
#> 1  2013-1-1 2013-01-01
#> 2  2013-1-1 2013-01-01
#> 3  2013-1-1 2013-01-01
#> 4  2013-1-1 2013-01-01
#> 5  2013-1-1 2013-01-01
#> 6  2013-1-1 2013-01-01
#> 7  2013-1-1 2013-01-01
#> 8  2013-1-1 2013-01-01
#> 9  2013-1-1 2013-01-01
#> 10 2013-1-1 2013-01-01
#> ..      ...        ...

Your turn

Use the ymd_hms() function to make a date time that incorporates hours and minutes (bonus: why can't ymd_hms() parse every date time?)

Date times

flights <- flights %>%
  mutate(date = paste(year, month, day, sep = "-")) %>%
  mutate(time = paste(hour, minute, "0", sep = ":")) %>%
  mutate(dt = ymd_hms(paste(date, time))) 
#> Warning: 8255 failed to parse.

flights %>%
  filter(is.na(dt)) %>%
  select(hour, minute, dt)
#> Source: local data frame [8,255 x 3]
#> 
#>     hour minute     dt
#>    (dbl)  (dbl) (time)
#> 1     NA     NA   <NA>
#> 2     NA     NA   <NA>
#> 3     NA     NA   <NA>
#> 4     NA     NA   <NA>
#> 5     NA     NA   <NA>
#> 6     NA     NA   <NA>
#> 7     NA     NA   <NA>
#> 8     NA     NA   <NA>
#> 9     NA     NA   <NA>
#> 10    NA     NA   <NA>
#> ..   ...    ...    ...

Fitting models across groups

If your apply function returns anything but a single value, use do() instead of summarise()

models <- flights %>% 
  group_by(carrier) %>%
  do(m = lm(dep_delay ~ as.numeric(dt), data = .))
models
#> Source: local data frame [16 x 2]
#> Groups: <by row>
#> 
#>    carrier       m
#>      (chr)   (chr)
#> 1       9E <S3:lm>
#> 2       AA <S3:lm>
#> 3       AS <S3:lm>
#> 4       B6 <S3:lm>
#> 5       DL <S3:lm>
#> 6       EV <S3:lm>
#> 7       F9 <S3:lm>
#> 8       FL <S3:lm>
#> 9       HA <S3:lm>
#> 10      MQ <S3:lm>
#> 11      OO <S3:lm>
#> 12      UA <S3:lm>
#> 13      US <S3:lm>
#> 14      VX <S3:lm>
#> 15      WN <S3:lm>
#> 16      YV <S3:lm>

For each additional day, the departure delay for AirTran Airways increases by 0.05 minutes (3 seconds)

models %>%
  mutate(slope_min = coef(m)[["as.numeric(dt)"]]) %>%
  mutate(slope_day = slope_min * 60 * 60 * 24) %>%
  arrange(desc(slope_day)) %>%
  left_join(a, copy = TRUE)
#> Source: local data frame [16 x 5]
#> 
#>    carrier       m slope_min slope_day                        name
#>      (chr)   (chr)     (dbl)     (dbl)                       (chr)
#> 1       FL <S3:lm>   5.7e-07    0.0491 AirTran Airways Corporation
#> 2       WN <S3:lm>   1.7e-07    0.0147      Southwest Airlines Co.
#> 3       UA <S3:lm>   2.5e-08    0.0021       United Air Lines Inc.
#> 4       US <S3:lm>   4.6e-09    0.0004             US Airways Inc.
#> 5       DL <S3:lm>  -2.0e-08   -0.0017        Delta Air Lines Inc.
#> 6       AS <S3:lm>  -5.2e-08   -0.0045        Alaska Airlines Inc.
#> 7       MQ <S3:lm>  -6.8e-08   -0.0059                   Envoy Air
#> 8       AA <S3:lm>  -9.6e-08   -0.0083      American Airlines Inc.
#> 9       B6 <S3:lm>  -1.3e-07   -0.0114             JetBlue Airways
#> 10      9E <S3:lm>  -1.7e-07   -0.0150           Endeavor Air Inc.
#> 11      VX <S3:lm>  -1.9e-07   -0.0161              Virgin America
#> 12      YV <S3:lm>  -1.9e-07   -0.0162          Mesa Airlines Inc.
#> 13      EV <S3:lm>  -3.4e-07   -0.0297    ExpressJet Airlines Inc.
#> 14      F9 <S3:lm>  -3.6e-07   -0.0313      Frontier Airlines Inc.
#> 15      HA <S3:lm>  -1.2e-06   -0.1064      Hawaiian Airlines Inc.
#> 16      OO <S3:lm>  -3.6e-06   -0.3087       SkyWest Airlines Inc.

Your turn

Which carrier has the highest intercept (departure delay at time 0)?