2016-02-23

With your neighbor, list some things R is and does

R is …

R does …

R is …

  • Free to use
  • Extensible
    • Over 7300 user contributed add-on packages currently on CRAN!
  • Powerful
    • With the right tools, get more work done, faster.
  • Flexible
    • Not a question of can, but how.
  • Frustrating
    • Flexibility comes at a cost (easy to shoot yourself in the foot!).

R does …

  • Graphics, statistics, machine learning, etc.
  • Data acquisition, munging, management
  • Literate programming (dynamic reports)
  • Web applications

(This roughly serves as our outline. Slides and material can be found at http://dicook.github.io/Monash-R/.)

RStudio is …

From Julie Lowndes:

If R were an airplane, RStudio would be the airport, providing many, many supporting services that make it easier for you, the pilot, to take off and go to awesome places. Sure, you can fly an airplane without an airport, but having those runways and supporting infrastructure is a game-changer.

The RStudio IDE

  1. Source editor
    • Docking station for multiple files
    • Useful shortcuts ("Knit")
    • Highlighting/Tab-completion
    • Code-checking (R, HTML, JS)
    • Debugging features
  2. Console window
    • Highlighting/Tab-completion
    • Search recent commands
  3. Other tabs/panes
    • Graphics
    • R documentation
    • Environment pane
    • File system navigation/access
    • Tools for package development, git, etc

Create a project

Create a project to contain all of the material covered in this set of tutorials:

  • File -> New Project -> New Directory -> Empty Project

Hello R Markdown!

  • File -> New File -> R Markdown -> OK -> Knit HTML

What is R Markdown?

R Markdown is an authoring format that enables easy creation of dynamic documents, presentations, and reports from R. It combines the core syntax of markdown (an easy-to-write plain text format) with embedded R code chunks that are run so their output can be included in the final document. R Markdown documents are fully reproducible (they can be automatically regenerated whenever underlying R code or data changes).

For your reference

  • Download, save, and open the document which generates these slides in RStudio (you can do that with this R code):
curl::curl_download(
  "https://raw.githubusercontent.com/dicook/Monash-R/gh-pages/1-Intro/index.Rmd",
  "index.Rmd"
)
file.edit("index.Rmd")
  • Download, save, and open the corresponding R script.
curl::curl_download(
  "https://raw.githubusercontent.com/dicook/Monash-R/gh-pages/1-Intro/index.R",
  "index.R"
)
file.edit("index.R")

Some Basics

  • Combine values into a vector with c()
c(0, 1)
#> [1] 0 1
  • Assign values to a name with <-
x <- c(0, 1)
  • Avoid for loops and use built-in vectorized functions
sum(x + 10)
#> [1] 21
  • Although arcane at times, R has rich support for documentation, see ?sum

  • Use [ to extract elements of a vector.
c("a", "b")[1]
#> [1] "a"
c("a", "b")[c(T, F)]
#> [1] "a"
  • Extract named elements with $, [[, and/or [
x <- list(
  a = 10,
  b = c(1, "2")
)
x$a
#> [1] 10
x[["a"]]
#> [1] 10
x["a"]
#> $a
#> [1] 10

Examining 'structure'

  • str() is probably my favorite R function. It shows you the "structure" of any R object (and everything in R is an object!!!)
str(x)
#> List of 2
#>  $ a: num 10
#>  $ b: chr [1:2] "1" "2"
  • Lists are heterogenous (elements can have different types)
  • Vectors are homogenous (elements have the same type)
    • That's why c(1, "2") ends up being a character string.
  • If you'd like to know more, see Hadley Wickham's online chapters on data structures and subsetting

Getting Help

  • Reading documentation only gets you so far. What about finding function(s) and/or package(s) to help solve a problem???

  • Google! (I usually prefix "CRAN" to my search; others might suggest http://www.rseek.org/)

  • Ask your question on a relevant StackExchange outlet such as http://stackoverflow.com/ or http://stats.stackexchange.com/

  • It's becoming more and more popular to bundle "vignettes" with a package (dplyr has awesome vignettes)

browseVignettes("dplyr")

Some Oddities

  • Yes, + is a function (which calls compiled C code)
`+`
#> function (e1, e2)  .Primitive("+")
  • What's that? You don't like addition? Me neither!
"+" <- function(x, y) "I forgot how to add"
1 + 2
#> [1] "I forgot how to add"
  • But seriously, don't "overload operators" unless you know what you're doing
rm("+")

Trust me

Obligatory economics example

data(economics, package = "ggplot2")
# data frames are essentially a list of vectors
str(economics)
#> Classes 'tbl_df', 'tbl' and 'data.frame':    574 obs. of  6 variables:
#>  $ date    : Date, format: "1967-07-01" "1967-08-01" ...
#>  $ pce     : num  507 510 516 513 518 ...
#>  $ pop     : int  198712 198911 199113 199311 199498 199657 199808 199920 200056 200208 ...
#>  $ psavert : num  12.5 12.5 11.7 12.5 12.5 12.1 11.7 12.2 11.6 12.2 ...
#>  $ uempmed : num  4.5 4.7 4.6 4.9 4.7 4.8 5.1 4.5 4.1 4.6 ...
#>  $ unemploy: int  2944 2945 2958 3143 3066 3018 2878 3001 2877 2709 ...

Your turn

Read the documentation for economics. Can you think of a interesting/informative function of these variable(s)?

Hello ggplot2

library(ggplot2)
p <- ggplot(economics, aes(date, unemploy / pop)) + 
  geom_line()
p

Hello Linear/Additive Models!

p
p + geom_smooth(method = "lm", se = F)
p + geom_smooth(method = "loess", se = F)
p + geom_smooth(method = "gam", formula = y ~ s(x, bs = "cr"), se = F)

How does geom_smooth() work?

m <- lm((unemploy / pop) ~ date, data = economics)
str(m)
#> List of 12
#>  $ coefficients : Named num [1:2] 2.60e-02 4.95e-07
#>   ..- attr(*, "names")= chr [1:2] "(Intercept)" "date"
#>  $ residuals    : Named num [1:574] -0.01076 -0.01078 -0.01075 -0.00985 -0.01026 ...
#>   ..- attr(*, "names")= chr [1:574] "1" "2" "3" "4" ...
#>  $ effects      : Named num [1:574] -0.71605 0.0598 -0.00965 -0.00875 -0.00917 ...
#>   ..- attr(*, "names")= chr [1:574] "(Intercept)" "date" "" "" ...
#>  $ rank         : int 2
#>  $ fitted.values: Named num [1:574] 0.0256 0.0256 0.0256 0.0256 0.0256 ...
#>   ..- attr(*, "names")= chr [1:574] "1" "2" "3" "4" ...
#>  $ assign       : int [1:2] 0 1
#>  $ qr           :List of 5
#>   ..$ qr   : num [1:574, 1:2] -23.9583 0.0417 0.0417 0.0417 0.0417 ...
#>   .. ..- attr(*, "dimnames")=List of 2
#>   .. .. ..$ : chr [1:574] "1" "2" "3" "4" ...
#>   .. .. ..$ : chr [1:2] "(Intercept)" "date"
#>   .. ..- attr(*, "assign")= int [1:2] 0 1
#>   ..$ qraux: num [1:2] 1.04 1.07
#>   ..$ pivot: int [1:2] 1 2
#>   ..$ tol  : num 1e-07
#>   ..$ rank : int 2
#>   ..- attr(*, "class")= chr "qr"
#>  $ df.residual  : int 572
#>  $ xlevels      : Named list()
#>  $ call         : language lm(formula = (unemploy/pop) ~ date, data = economics)
#>  $ terms        :Classes 'terms', 'formula' length 3 (unemploy/pop) ~ date
#>   .. ..- attr(*, "variables")= language list((unemploy/pop), date)
#>   .. ..- attr(*, "factors")= int [1:2, 1] 0 1
#>   .. .. ..- attr(*, "dimnames")=List of 2
#>   .. .. .. ..$ : chr [1:2] "(unemploy/pop)" "date"
#>   .. .. .. ..$ : chr "date"
#>   .. ..- attr(*, "term.labels")= chr "date"
#>   .. ..- attr(*, "order")= int 1
#>   .. ..- attr(*, "intercept")= int 1
#>   .. ..- attr(*, "response")= int 1
#>   .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
#>   .. ..- attr(*, "predvars")= language list((unemploy/pop), date)
#>   .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "other"
#>   .. .. ..- attr(*, "names")= chr [1:2] "(unemploy/pop)" "date"
#>  $ model        :'data.frame':   574 obs. of  2 variables:
#>   ..$ (unemploy/pop): num [1:574] 0.0148 0.0148 0.0149 0.0158 0.0154 ...
#>   ..$ date          : Date[1:574], format: "1967-07-01" ...
#>   ..- attr(*, "terms")=Classes 'terms', 'formula' length 3 (unemploy/pop) ~ date
#>   .. .. ..- attr(*, "variables")= language list((unemploy/pop), date)
#>   .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
#>   .. .. .. ..- attr(*, "dimnames")=List of 2
#>   .. .. .. .. ..$ : chr [1:2] "(unemploy/pop)" "date"
#>   .. .. .. .. ..$ : chr "date"
#>   .. .. ..- attr(*, "term.labels")= chr "date"
#>   .. .. ..- attr(*, "order")= int 1
#>   .. .. ..- attr(*, "intercept")= int 1
#>   .. .. ..- attr(*, "response")= int 1
#>   .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
#>   .. .. ..- attr(*, "predvars")= language list((unemploy/pop), date)
#>   .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "other"
#>   .. .. .. ..- attr(*, "names")= chr [1:2] "(unemploy/pop)" "date"
#>  - attr(*, "class")= chr "lm"

economics$yhat <- m$fitted.values
ggplot(economics) + 
  geom_line(aes(date, unemploy / pop)) +
  geom_line(aes(date, yhat), color = "blue")

Interactive and web-based!

library(plotly)
ggplotly()

Your turn

Make the loess smoother plot without geom_smooth() (hint: convert date to a numeric vector with as.numeric() and use loess()). For a solution, see here

Reshaping data

head(economics)
#>         date   pce    pop psavert uempmed unemploy
#> 1 1967-07-01 507.4 198712    12.5     4.5     2944
#> 2 1967-08-01 510.5 198911    12.5     4.7     2945
#> 3 1967-09-01 516.3 199113    11.7     4.6     2958
#> 4 1967-10-01 512.9 199311    12.5     4.9     3143
#> 5 1967-11-01 518.1 199498    12.5     4.7     3066
#> 6 1967-12-01 525.8 199657    12.1     4.8     3018
library(tidyr)
e <- gather(economics, variable, value, -date)
head(e)
#> Source: local data frame [6 x 3]
#> 
#>         date variable value
#>       (date)    (chr) (dbl)
#> 1 1967-07-01      pce 507.4
#> 2 1967-08-01      pce 510.5
#> 3 1967-09-01      pce 516.3
#> 4 1967-10-01      pce 512.9
#> 5 1967-11-01      pce 518.1
#> 6 1967-12-01      pce 525.8

ggplot(e, aes(date, value)) + geom_line() +
  facet_wrap(~ variable, scales = "free_y")

Split-apply-combine

Split-apply-combine in dplyr

library(dplyr)
# add a year column
e <- mutate(economics, year = format(date, "%Y"))
# split by year
e <- group_by(e, year)
# mean of each group
summarise(e, mpce = mean(pce))
#> Source: local data frame [49 x 2]
#> 
#>     year      mpce
#>    (chr)     (dbl)
#> 1   1967  515.1667
#> 2   1968  557.4583
#> 3   1969  604.4833
#> 4   1970  647.6917
#> 5   1971  701.0000
#> 6   1972  769.4333
#> 7   1973  851.1417
#> 8   1974  932.0000
#> 9   1975 1032.7667
#> 10  1976 1150.1667
#> ..   ...       ...

Focus on the transformation(s), not the object

economics %>%
  mutate(year = format(date, "%Y")) %>%
  group_by(year) %>%
  summarise(mpce = mean(pce))
#> Source: local data frame [49 x 2]
#> 
#>     year      mpce
#>    (chr)     (dbl)
#> 1   1967  515.1667
#> 2   1968  557.4583
#> 3   1969  604.4833
#> 4   1970  647.6917
#> 5   1971  701.0000
#> 6   1972  769.4333
#> 7   1973  851.1417
#> 8   1974  932.0000
#> 9   1975 1032.7667
#> 10  1976 1150.1667
#> ..   ...       ...

Your turn

  • Easy: Find the average annual personal savings rate (psavert).
  • Hard: Find the annual average for each variable. Can you create a useful visual of this information?

A new example: babynames

library(babynames)
head(babynames)
#> Source: local data frame [6 x 5]
#> 
#>    year   sex      name     n       prop
#>   (dbl) (chr)     (chr) (int)      (dbl)
#> 1  1880     F      Mary  7065 0.07238359
#> 2  1880     F      Anna  2604 0.02667896
#> 3  1880     F      Emma  2003 0.02052149
#> 4  1880     F Elizabeth  1939 0.01986579
#> 5  1880     F    Minnie  1746 0.01788843
#> 6  1880     F  Margaret  1578 0.01616720
dim(babynames)
#> [1] 1825433       5

Write/read csv (or other flat files)

library(readr)
bb_path <- tempfile(fileext = ".csv", tmpdir = ".")
write_csv(babynames, bb_path)
read_csv(bb_path)
#> Source: local data frame [1,825,433 x 5]
#> 
#>     year   sex      name     n       prop
#>    (int) (chr)     (chr) (int)      (dbl)
#> 1   1880     F      Mary  7065 0.07238359
#> 2   1880     F      Anna  2604 0.02667896
#> 3   1880     F      Emma  2003 0.02052149
#> 4   1880     F Elizabeth  1939 0.01986579
#> 5   1880     F    Minnie  1746 0.01788843
#> 6   1880     F  Margaret  1578 0.01616720
#> 7   1880     F       Ida  1472 0.01508119
#> 8   1880     F     Alice  1414 0.01448696
#> 9   1880     F    Bertha  1320 0.01352390
#> 10  1880     F     Sarah  1288 0.01319605
#> ..   ...   ...       ...   ...        ...

Get other formats into R

library(readxl)
read_excel("my-spreadsheet.xls", sheet = "data")
read_excel("my-spreadsheet.xls", sheet = 2)
library(haven)
# SAS files
read_sas("path/to/file")
# SPSS files
read_por("path/to/file")
read_sav("path/to/file")
# Stata files
read_dta("path/to/file")

dplyr loves databases

db <- src_sqlite("babynames.sqlite3", create = TRUE)
if (!db_has_table(db$con, "babynames")) {
  copy_to(db, babynames)
}
#> Source: sqlite 3.8.6 [babynames.sqlite3]
#> From: babynames [1,825,433 x 5]
#> 
#>     year   sex      name     n       prop
#>    (dbl) (chr)     (chr) (int)      (dbl)
#> 1   1880     F      Mary  7065 0.07238359
#> 2   1880     F      Anna  2604 0.02667896
#> 3   1880     F      Emma  2003 0.02052149
#> 4   1880     F Elizabeth  1939 0.01986579
#> 5   1880     F    Minnie  1746 0.01788843
#> 6   1880     F  Margaret  1578 0.01616720
#> 7   1880     F       Ida  1472 0.01508119
#> 8   1880     F     Alice  1414 0.01448696
#> 9   1880     F    Bertha  1320 0.01352390
#> 10  1880     F     Sarah  1288 0.01319605
#> ..   ...   ...       ...   ...        ...
db
#> src:  sqlite 3.8.6 [babynames.sqlite3]
#> tbls: babynames, sqlite_stat1
tbl(db, "babynames")
#> Source: sqlite 3.8.6 [babynames.sqlite3]
#> From: babynames [1,825,433 x 5]
#> 
#>     year   sex      name     n       prop
#>    (dbl) (chr)     (chr) (int)      (dbl)
#> 1   1880     F      Mary  7065 0.07238359
#> 2   1880     F      Anna  2604 0.02667896
#> 3   1880     F      Emma  2003 0.02052149
#> 4   1880     F Elizabeth  1939 0.01986579
#> 5   1880     F    Minnie  1746 0.01788843
#> 6   1880     F  Margaret  1578 0.01616720
#> 7   1880     F       Ida  1472 0.01508119
#> 8   1880     F     Alice  1414 0.01448696
#> 9   1880     F    Bertha  1320 0.01352390
#> 10  1880     F     Sarah  1288 0.01319605
#> ..   ...   ...       ...   ...        ...

Send computation to the data

h <- db %>% 
  tbl("babynames") %>%
  filter(name == "Hilary")
class(h)
#> [1] "tbl_sqlite" "tbl_sql"    "tbl"
h$query
#> <Query> SELECT "year", "sex", "name", "n", "prop"
#> FROM "babynames"
#> WHERE "name" = 'Hilary'
#> <SQLiteConnection>
# execute SQL query and bring into R
hc <- collect(h)
class(hc)
#> [1] "tbl_df"     "tbl"        "data.frame"
hc
#> Source: local data frame [190 x 5]
#> 
#>     year   sex   name     n         prop
#>    (dbl) (chr)  (chr) (int)        (dbl)
#> 1   1882     M Hilary     7 5.736153e-05
#> 2   1883     M Hilary     6 5.334282e-05
#> 3   1887     M Hilary     7 6.403571e-05
#> 4   1891     M Hilary     8 7.321314e-05
#> 5   1896     M Hilary     6 4.648496e-05
#> 6   1897     M Hilary     5 4.100276e-05
#> 7   1898     M Hilary     5 3.784811e-05
#> 8   1902     M Hilary     8 6.026411e-05
#> 9   1904     M Hilary     5 3.609900e-05
#> 10  1905     M Hilary     6 4.188628e-05
#> ..   ...   ...    ...   ...          ...

Interactive graphics

plot_ly(hc, x = year, y = prop, color = sex, colors = c("blue", "hotpink"))

Your turn

Use Rmarkdown to write a one page report, that does these things:

  • Loads the babynames package
  • Filters on your name
  • Finds the year when your name was most popular
  • Plots the number of babies born with this name by year and sex
  • Written paragraphs describing what you learn

popular <- babynames %>%
  group_by(name) %>%
  summarise(N = sum(n)) %>%
  arrange(desc(N))
popular
#> Source: local data frame [93,889 x 2]
#> 
#>       name       N
#>      (chr)   (int)
#> 1    James 5129096
#> 2     John 5106590
#> 3   Robert 4816785
#> 4  Michael 4330805
#> 5     Mary 4130441
#> 6  William 4071368
#> 7    David 3590557
#> 8   Joseph 2580687
#> 9  Richard 2564867
#> 10 Charles 2376700
#> ..     ...     ...

top <- top_n(popular, 1000)
topnames <- subset(babynames, name %in% top[["name"]])
topnames
#> Source: local data frame [178,358 x 5]
#> 
#>     year   sex      name     n       prop
#>    (dbl) (chr)     (chr) (int)      (dbl)
#> 1   1880     F      Mary  7065 0.07238359
#> 2   1880     F      Anna  2604 0.02667896
#> 3   1880     F      Emma  2003 0.02052149
#> 4   1880     F Elizabeth  1939 0.01986579
#> 5   1880     F    Minnie  1746 0.01788843
#> 6   1880     F  Margaret  1578 0.01616720
#> 7   1880     F       Ida  1472 0.01508119
#> 8   1880     F     Alice  1414 0.01448696
#> 9   1880     F    Bertha  1320 0.01352390
#> 10  1880     F     Sarah  1288 0.01319605
#> ..   ...   ...       ...   ...        ...

library(shiny)
library(ggplot2)
ui <- bootstrapPage(
  selectizeInput(
    inputId = 'name', 
    label = 'Enter a name', 
    choices = unique(topnames$name),
    selected = "James",
    multiple = TRUE
  ),
  plotOutput('plot')
)
server <- function(input, output) {
  output$plot <- renderPlot({ 
    dat <- subset(topnames, name %in% input$name)
    ggplot(dat, aes(year, prop, colour = sex)) + 
      geom_line() + facet_wrap(~ name)
  })
}
runApp(shinyApp(ui, server))

Thanks for coming!