2016-02-23

Warmups - Problem 1

A 10-week sensory experiment: 12 individuals assessed the taste of french fries on several scales (how potato-y, buttery, grassy, rancid, paint-y do they taste?). The fries were fried in one of 3 different oils, and the design was replicated twice. First few rows:

time treatment subject rep potato buttery grassy rancid painty
1 1 3 1 2.9 0.0 0.0 0.0 5.5
1 1 3 2 14.0 0.0 0.0 1.1 0.0
1 1 10 1 11.0 6.4 0.0 0.0 0.0
1 1 10 2 9.9 5.9 2.9 2.2 0.0

What do you want to know?

Warmups - Problem 2

What's in the column names of this data?

id WI-6.R1 WI-6.R2 WI-6.R4 WM-6.R1 WM-6.R2 WI-12.R1 WI-12.R2 WI-12.R4 WM-12.R1 WM-12.R2 WM-12.R4
Gene 1 2.2 2.20 4.2 2.63 5.1 4.5 5.5 4.4 3.9 4.2 3.73
Gene 2 1.5 0.59 1.9 0.52 2.9 1.4 3.0 1.3 1.2 1.2 0.89
Gene 3 2.0 0.87 3.3 0.53 4.6 2.2 5.6 2.5 2.5 3.0 1.35
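
One way to see what the column names carry is to split them on their separators. A minimal sketch in base R, assuming every name has three pieces separated by "-" and "."; the labels prefix, number and rep are placeholders chosen here, not names from the data:

```r
# Column names copied from the header above (without the id column).
nms <- c("WI-6.R1", "WI-6.R2", "WI-6.R4", "WM-6.R1", "WM-6.R2",
         "WI-12.R1", "WI-12.R2", "WI-12.R4", "WM-12.R1", "WM-12.R2", "WM-12.R4")

# Split each name on "-" or "." and stack the pieces into a matrix.
parts <- do.call(rbind, strsplit(nms, "[-.]"))

# Placeholder labels (assumptions), one per piece.
colnames(parts) <- c("prefix", "number", "rep")
pieces <- as.data.frame(parts, stringsAsFactors = FALSE)
pieces$number <- as.integer(pieces$number)

# Count how many columns exist for each prefix/number combination.
table(pieces$prefix, pieces$number)
```

The counts make it easy to spot that one combination has fewer columns than the others, which is hard to see when reading the header by eye.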

Warmups - Problem 3

How many ways can you write today's date?
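
As a starting point, here are a few numeric renderings of the date of this session using base R's format(). This is only a sketch: it sticks to numeric format codes because month and weekday names depend on the locale.

```r
# The date of this session, stored as a Date object.
d <- as.Date("2016-02-23")

format(d, "%Y-%m-%d")  # year-month-day (ISO 8601 style)
format(d, "%d/%m/%Y")  # day/month/year
format(d, "%m/%d/%y")  # month/day/two-digit year
format(d, "%Y%m%d")    # compact, no separators
format(d, "%j")        # day of the year
```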

What we are going to cover today

  • Reading different data formats
  • Tidying data
  • Split - apply - combine
  • Pipes
  • Joins
  • Working with dates
  • Splitting strings

Reading different data formats

  • images: library(EBImage)
  • sound: library(tuneR)
  • fixed width fields: read.fwf()
  • netCDF: library(ncdf4)
  • hdf5: library(rhdf5) (from Bioconductor)
  • json: library(jsonlite)
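
As an illustration of one entry in the list, here is a minimal sketch of read.fwf() from base R. The two records, the field widths and the column names are all made up for the example.

```r
# Two made-up fixed-width records: 4 characters of id, 2 of age, 3 of score.
tmp <- tempfile(fileext = ".txt")
writeLines(c("A001234.5",
             "B002 73.0"), tmp)

# widths gives the number of characters in each field, left to right.
dat <- read.fwf(tmp, widths = c(4, 2, 3),
                col.names = c("id", "age", "score"))
unlink(tmp)  # remove the temporary file

dat
```

The key design point is that nothing in the file marks where a field ends, so the widths have to come from documentation of the file layout.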

XML/HTML

library(XML)
src <- "http://www.realclearpolitics.com/epolls/2012/president/us/republican_presidential_nomination-1452.html"
tables <- readHTMLTable(src)
polls <- tables[[1]]
head(polls)
#>                                       Poll        Date Sample MoE Romney 
#> 1                              RCP Average  4/9 - 4/17     --  --    52.8
#> 2       CBS News/NY TimesCBS News/NY Times 4/13 - 4/17 268 RV 6.0      54
#> 3 CNN/Opinion ResearchCNN/Opinion Research 4/13 - 4/15  473 A 4.5      57
#> 4                           PPP (D)PPP (D) 4/12 - 4/15 742 RV 3.6      54
#> 5                         FOX NewsFOX News  4/9 - 4/11 376 RV 5.0      46
#>   Santorum  Gingrich  Paul  Perry  Huntsman  Bachmann  Cain        Spread
#> 1                19.0  15.0                                  Romney +33.8
#> 2        --        20    12     --        --        --    --   Romney +34
#> 3        --        19    18     --        --        --    --   Romney +38
#> 4        --        24    14     --        --        --    --   Romney +30
#> 5        15        13    16     --        --        --    --   Romney +30

See also scrapeR, rvest

GIS

This code is a bit slow to run, but it draws all the electoral districts of Australia.

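library(maptools)
library(ggplot2)
# If reading straight from the URL fails, download region.shp (plus its
# .shx and .dbf companions) and pass the local path instead.
xx <- readShapeSpatial("http://dicook.github.io/Monash-R/data/australia/region.shp")
object.size(as(xx, "SpatialPolygons"))
# Thin the polygons so they are smaller and faster to draw
xxx <- thinnedSpatialPoly(as(xx, "SpatialPolygons"), 
  tolerance=0.5, minarea=0.001, topologyPreserve=TRUE)
object.size(xxx)
# Convert the thinned polygons to a data frame and draw the outlines
qplot(long, lat, data=fortify(xxx), group=group, geom="path") + coord_map()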

French fries - hot chips

A 10-week sensory experiment: 12 individuals assessed the taste of french fries on several scales (how potato-y, buttery, grassy, rancid, paint-y do they taste?). The fries were fried in one of 3 different oils, and the design was replicated twice. First few rows:

time treatment subject rep potato buttery grassy rancid painty
1 1 3 1 2.9 0.0 0.0 0.0 5.5
1 1 3 2 14.0 0.0 0.0 1.1 0.0
1 1 10 1 11.0 6.4 0.0 0.0 0.0
1 1 10 2 9.9 5.9 2.9 2.2 0.0
1 1 15 1 1.2 0.1 0.0 1.1 5.1
1 1 15 2 8.8 3.0 3.6 1.5 2.3

What would we like to know?

  • Is the design complete?
  • Are replicates like each other?
  • How do the ratings on the different scales differ?
  • Are raters giving different scores on average?
  • Do ratings change over the weeks?

Each of these questions involves different summaries of the data.
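
For instance, asking whether the design is complete comes down to counting records per combination of identifiers. A minimal sketch with base R's table(), on a made-up design (not the french fries data) from which one record has been removed on purpose:

```r
# Made-up design: 2 times x 2 treatments x 3 subjects x 2 reps = 24 records.
design <- expand.grid(time = 1:2, treatment = 1:2,
                      subject = c(3, 10, 15), rep = 1:2)
design <- design[-5, ]  # drop one record to mimic a missed session

# Count records per subject/time cell.
# A complete design has 2 treatments x 2 reps = 4 records in every cell.
counts <- with(design, table(subject, time))
counts

# Cells with fewer than 4 records show where the design is incomplete.
which(counts < 4, arr.ind = TRUE)
```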

What we have and what we want

Gathering

  • When gathering, you need to specify the keys (identifiers) and the values (measures).

Keys/Identifiers:

  • Identify a record (must be unique)
  • Example: indices on a random variable
  • Fixed by the design of the experiment (known in advance)
  • May be single or composite (one or more variables)

Values/Measures:

  • Collected during the experiment (not known in advance)
  • Usually numeric quantities
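
The uniqueness requirement on keys can be checked directly. A small sketch in base R, assuming time, subject and rep form the composite key; the records are made up, and the last one repeats an earlier key on purpose:

```r
# Made-up records; the last row repeats the key (time 1, subject 10, rep 2).
recs <- data.frame(
  time    = c(1, 1, 1, 1, 1),
  subject = c(3, 3, 10, 10, 10),
  rep     = c(1, 2, 1, 2, 2),
  rating  = c(2.9, 14.0, 11.0, 9.9, 9.7)
)

key <- c("time", "subject", "rep")

# duplicated() on the key columns marks rows whose key has appeared before.
dup <- duplicated(recs[key])
recs[dup, ]

# The composite key is valid only if no row is flagged.
any(dup)
```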

Gathering the French Fries

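library(tidyr)
# french_fries is assumed to be loaded already; one possible source is
# data(french_fries, package = "reshape2")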
ff_long <- gather(french_fries, key = variable, value = rating, potato:painty)
head(ff_long)
#>   time treatment subject rep variable rating
#> 1    1         1       3   1   potato    2.9
#> 2    1         1       3   2   potato   14.0
#> 3    1         1      10   1   potato   11.0
#> 4    1         1      10   2   potato    9.9
#> 5    1         1      15   1   potato    1.2
#> 6    1         1      15   2   potato    8.8

Long to Wide

In certain applications, we may wish to convert a long dataset to a wide one (perhaps to display it as a table).

This is called "spreading" the data.

Spread

We use the spread function from tidyr to do this:

french_fries_wide <- spread(ff_long, key = variable, value = rating)

head(french_fries_wide)
#>   time treatment subject rep potato buttery grassy rancid painty
#> 1    1         1       3   1    2.9     0.0    0.0    0.0    5.5
#> 2    1         1       3   2   14.0     0.0    0.0    1.1    0.0
#> 3    1         1      10   1   11.0     6.4    0.0    0.0    0.0
#> 4    1         1      10   2    9.9     5.9    2.9    2.2    0.0
#> 5    1         1      15   1    1.2     0.1    0.0    1.1    5.1
#> 6    1         1      15   2    8.8     3.0    3.6    1.5    2.3
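
tidyr 1.0.0 and later describe gather() and spread() as superseded by pivot_longer() and pivot_wider(); the older functions still work. Here is a sketch of the same round trip with the newer functions, on a made-up two-row data frame rather than the real data:

```r
library(tidyr)

# Made-up wide data with the same kind of layout as the french fries table.
wide <- data.frame(subject = c(3, 10), rep = c(1, 1),
                   potato = c(2.9, 11.0), buttery = c(0.0, 6.4))

# Wide to long: counterpart of gather(wide, variable, rating, potato:buttery)
long <- pivot_longer(wide, cols = potato:buttery,
                     names_to = "variable", values_to = "rating")

# Long to wide: counterpart of spread(long, variable, rating)
back <- pivot_wider(long, names_from = variable, values_from = rating)

# Compare the round-trip result with the starting data.
all.equal(as.data.frame(back), wide)
```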

Let's use gather and spread to answer some questions

The easiest question to start with is whether the ratings are similar across the different scales: potato-y, buttery, grassy, rancid and painty.

We need to gather the data into long form, and make plots faceted by the scale.

Ratings on the different scales

library(ggplot2)
ff.m <- french_fries %>% 
  gather(type, rating, -subject, -time, -treatment, -rep)
head(ff.m)
#>   time treatment subject rep   type rating
#> 1    1         1       3   1 potato    2.9
#> 2    1         1       3   2 potato   14.0
#> 3    1         1      10   1 potato   11.0
#> 4    1         1      10   2 potato    9.9
#> 5    1         1      15   1 potato    1.2
#> 6    1         1      15   2 potato    8.8
qplot(rating, data=ff.m, binwidth=2) + 
  facet_wrap(~type, ncol=5) 

Side-by-side boxplots

qplot(type, rating, data = ff.m, fill = type, geom = "boxplot")

Do the replicates look like each other?

We will start to tackle this by plotting the replicates against each other using a scatterplot.

We need to gather the data into long form, and then get the replicates spread into separate columns.

Check replicates

head(ff.m)
#>   time treatment subject rep   type rating
#> 1    1         1       3   1 potato    2.9
#> 2    1         1       3   2 potato   14.0
#> 3    1         1      10   1 potato   11.0
#> 4    1         1      10   2 potato    9.9
#> 5    1         1      15   1 potato    1.2
#> 6    1         1      15   2 potato    8.8
ff.s <- ff.m %>% spread(rep, rating)
head(ff.s)
#>   time treatment subject    type    1    2
#> 1    1         1       3  potato  2.9 14.0
#> 2    1         1       3 buttery  0.0  0.0
#> 3    1         1       3  grassy  0.0  0.0
#> 4    1         1       3  rancid  0.0  1.1
#> 5    1         1       3  painty  5.5  0.0
#> 6    1         1      10  potato 11.0  9.9
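
After spreading, the replicate columns are named 1 and 2. These are not syntactic R names, so they need backticks (or string indexing) when referenced. A small sketch on made-up values; the data frame s and the names rep1 and rep2 are stand-ins chosen for this example:

```r
# Made-up stand-in for ff.s; check.names = FALSE keeps the names "1" and "2".
s <- data.frame(type = c("potato", "buttery"),
                `1` = c(2.9, 0.0), `2` = c(14.0, 0.0),
                check.names = FALSE)

# Backticks refer to the non-syntactic column names.
s$diff <- s$`2` - s$`1`

# String indexing works as well.
s[["1"]]

# Renaming gives column names that need no quoting afterwards.
names(s)[names(s) %in% c("1", "2")] <- c("rep1", "rep2")
names(s)
```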

Check replicates

qplot(`1`, `2`, data=ff.s) + theme(aspect.ratio=1) + 
  xlab("Rep 1") + ylab("Rep 2")
qplot(`1`, `2`, data=ff.s) + theme(aspect.ratio=1) + 
  xlab("Rep 1") + ylab("Rep 2") + 
  scale_x_log10() + scale_y_log10()

Your turn

Make scatterplots of the replicates against each other, separately for each scale and each treatment.