2016-02-23
(This roughly serves as our outline. Slides and material can be found at http://dicook.github.io/Monash-R/.)
If R were an airplane, RStudio would be the airport, providing many, many supporting services that make it easier for you, the pilot, to take off and go to awesome places. Sure, you can fly an airplane without an airport, but having those runways and supporting infrastructure is a game-changer.
Create a project to contain all of the material covered in this set of tutorials:
R Markdown is an authoring format that enables easy creation of dynamic documents, presentations, and reports from R. It combines the core syntax of markdown (an easy-to-write plain text format) with embedded R code chunks that are run so their output can be included in the final document. R Markdown documents are fully reproducible (they can be automatically regenerated whenever underlying R code or data changes).
RStudio's cheatsheet gives a nice, concise overview of its capabilities.
RStudio's reference guide lists its options.
curl::curl_download( "https://raw.githubusercontent.com/dicook/Monash-R/gh-pages/1-Intro/index.Rmd", "index.Rmd" ) file.edit("index.Rmd")
curl::curl_download( "https://raw.githubusercontent.com/dicook/Monash-R/gh-pages/1-Intro/index.R", "index.R" ) file.edit("index.R")
c()
c(0, 1) #> [1] 0 1
<-
x <- c(0, 1)
for
loops and use built-in vectorized functionssum(x + 10) #> [1] 21
R
has rich support for documentation, see ?sum
[
to extract elements of a vector.c("a", "b")[1] #> [1] "a" c("a", "b")[c(T, F)] #> [1] "a"
$
, [[
, and/or [
x <- list( a = 10, b = c(1, "2") ) x$a #> [1] 10 x[["a"]] #> [1] 10 x["a"] #> $a #> [1] 10
str()
is probably my favorite R
function. It shows you the "structure" of any R object (and everything in R is an object!!!)str(x) #> List of 2 #> $ a: num 10 #> $ b: chr [1:2] "1" "2"
c(1, "2")
ends up being a character string.Reading documentation only gets you so far. What about finding function(s) and/or package(s) to help solve a problem???
Google! (I usually prefix "CRAN" to my search; others might suggest http://www.rseek.org/)
Ask your question on a relevant StackExchange outlet such as http://stackoverflow.com/ or http://stats.stackexchange.com/
It's becoming more and more popular to bundle "vignettes" with a package (dplyr has awesome vignettes)
browseVignettes("dplyr")
+
is a function (which calls compiled C code)`+` #> function (e1, e2) .Primitive("+")
"+" <- function(x, y) "I forgot how to add" 1 + 2 #> [1] "I forgot how to add"
rm("+")
data(economics, package = "ggplot2") # data frames are essentially a list of vectors str(economics) #> Classes 'tbl_df', 'tbl' and 'data.frame': 574 obs. of 6 variables: #> $ date : Date, format: "1967-07-01" "1967-08-01" ... #> $ pce : num 507 510 516 513 518 ... #> $ pop : int 198712 198911 199113 199311 199498 199657 199808 199920 200056 200208 ... #> $ psavert : num 12.5 12.5 11.7 12.5 12.5 12.1 11.7 12.2 11.6 12.2 ... #> $ uempmed : num 4.5 4.7 4.6 4.9 4.7 4.8 5.1 4.5 4.1 4.6 ... #> $ unemploy: int 2944 2945 2958 3143 3066 3018 2878 3001 2877 2709 ...
Read the documentation for economics
. Can you think of a interesting/informative function of these variable(s)?
library(ggplot2) p <- ggplot(economics, aes(date, unemploy / pop)) + geom_line() p
p p + geom_smooth(method = "lm", se = F) p + geom_smooth(method = "loess", se = F) p + geom_smooth(method = "gam", formula = y ~ s(x, bs = "cr"), se = F)
geom_smooth()
work?m <- lm((unemploy / pop) ~ date, data = economics) str(m) #> List of 12 #> $ coefficients : Named num [1:2] 2.60e-02 4.95e-07 #> ..- attr(*, "names")= chr [1:2] "(Intercept)" "date" #> $ residuals : Named num [1:574] -0.01076 -0.01078 -0.01075 -0.00985 -0.01026 ... #> ..- attr(*, "names")= chr [1:574] "1" "2" "3" "4" ... #> $ effects : Named num [1:574] -0.71605 0.0598 -0.00965 -0.00875 -0.00917 ... #> ..- attr(*, "names")= chr [1:574] "(Intercept)" "date" "" "" ... #> $ rank : int 2 #> $ fitted.values: Named num [1:574] 0.0256 0.0256 0.0256 0.0256 0.0256 ... #> ..- attr(*, "names")= chr [1:574] "1" "2" "3" "4" ... #> $ assign : int [1:2] 0 1 #> $ qr :List of 5 #> ..$ qr : num [1:574, 1:2] -23.9583 0.0417 0.0417 0.0417 0.0417 ... #> .. ..- attr(*, "dimnames")=List of 2 #> .. .. ..$ : chr [1:574] "1" "2" "3" "4" ... #> .. .. ..$ : chr [1:2] "(Intercept)" "date" #> .. ..- attr(*, "assign")= int [1:2] 0 1 #> ..$ qraux: num [1:2] 1.04 1.07 #> ..$ pivot: int [1:2] 1 2 #> ..$ tol : num 1e-07 #> ..$ rank : int 2 #> ..- attr(*, "class")= chr "qr" #> $ df.residual : int 572 #> $ xlevels : Named list() #> $ call : language lm(formula = (unemploy/pop) ~ date, data = economics) #> $ terms :Classes 'terms', 'formula' length 3 (unemploy/pop) ~ date #> .. ..- attr(*, "variables")= language list((unemploy/pop), date) #> .. ..- attr(*, "factors")= int [1:2, 1] 0 1 #> .. .. ..- attr(*, "dimnames")=List of 2 #> .. .. .. ..$ : chr [1:2] "(unemploy/pop)" "date" #> .. .. .. ..$ : chr "date" #> .. ..- attr(*, "term.labels")= chr "date" #> .. ..- attr(*, "order")= int 1 #> .. ..- attr(*, "intercept")= int 1 #> .. ..- attr(*, "response")= int 1 #> .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> #> .. ..- attr(*, "predvars")= language list((unemploy/pop), date) #> .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "other" #> .. .. ..- attr(*, "names")= chr [1:2] "(unemploy/pop)" "date" #> $ model :'data.frame': 574 obs. of 2 variables: #> ..$ (unemploy/pop): num [1:574] 0.0148 0.0148 0.0149 0.0158 0.0154 ... #> ..$ date : Date[1:574], format: "1967-07-01" ... #> ..- attr(*, "terms")=Classes 'terms', 'formula' length 3 (unemploy/pop) ~ date #> .. .. ..- attr(*, "variables")= language list((unemploy/pop), date) #> .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1 #> .. .. .. ..- attr(*, "dimnames")=List of 2 #> .. .. .. .. ..$ : chr [1:2] "(unemploy/pop)" "date" #> .. .. .. .. ..$ : chr "date" #> .. .. ..- attr(*, "term.labels")= chr "date" #> .. .. ..- attr(*, "order")= int 1 #> .. .. ..- attr(*, "intercept")= int 1 #> .. .. ..- attr(*, "response")= int 1 #> .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> #> .. .. ..- attr(*, "predvars")= language list((unemploy/pop), date) #> .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "other" #> .. .. .. ..- attr(*, "names")= chr [1:2] "(unemploy/pop)" "date" #> - attr(*, "class")= chr "lm"
economics$yhat <- m$fitted.values ggplot(economics) + geom_line(aes(date, unemploy / pop)) + geom_line(aes(date, yhat), color = "blue")
library(plotly) ggplotly()
Make the loess smoother plot without geom_smooth()
(hint: convert date
to a numeric vector with as.numeric()
and use loess()
). For a solution, see here
head(economics) #> date pce pop psavert uempmed unemploy #> 1 1967-07-01 507.4 198712 12.5 4.5 2944 #> 2 1967-08-01 510.5 198911 12.5 4.7 2945 #> 3 1967-09-01 516.3 199113 11.7 4.6 2958 #> 4 1967-10-01 512.9 199311 12.5 4.9 3143 #> 5 1967-11-01 518.1 199498 12.5 4.7 3066 #> 6 1967-12-01 525.8 199657 12.1 4.8 3018
library(tidyr) e <- gather(economics, variable, value, -date) head(e) #> Source: local data frame [6 x 3] #> #> date variable value #> (date) (chr) (dbl) #> 1 1967-07-01 pce 507.4 #> 2 1967-08-01 pce 510.5 #> 3 1967-09-01 pce 516.3 #> 4 1967-10-01 pce 512.9 #> 5 1967-11-01 pce 518.1 #> 6 1967-12-01 pce 525.8
ggplot(e, aes(date, value)) + geom_line() + facet_wrap(~ variable, scales = "free_y")
Image by Karthik Ram (adapted from Hadley Wickham, plyr paper)
library(dplyr) # add a year column e <- mutate(economics, year = format(date, "%Y")) # split by year e <- group_by(e, year) # mean of each group summarise(e, mpce = mean(pce)) #> Source: local data frame [49 x 2] #> #> year mpce #> (chr) (dbl) #> 1 1967 515.1667 #> 2 1968 557.4583 #> 3 1969 604.4833 #> 4 1970 647.6917 #> 5 1971 701.0000 #> 6 1972 769.4333 #> 7 1973 851.1417 #> 8 1974 932.0000 #> 9 1975 1032.7667 #> 10 1976 1150.1667 #> .. ... ...
economics %>% mutate(year = format(date, "%Y")) %>% group_by(year) %>% summarise(mpce = mean(pce)) #> Source: local data frame [49 x 2] #> #> year mpce #> (chr) (dbl) #> 1 1967 515.1667 #> 2 1968 557.4583 #> 3 1969 604.4833 #> 4 1970 647.6917 #> 5 1971 701.0000 #> 6 1972 769.4333 #> 7 1973 851.1417 #> 8 1974 932.0000 #> 9 1975 1032.7667 #> 10 1976 1150.1667 #> .. ... ...
psavert
).library(babynames) head(babynames) #> Source: local data frame [6 x 5] #> #> year sex name n prop #> (dbl) (chr) (chr) (int) (dbl) #> 1 1880 F Mary 7065 0.07238359 #> 2 1880 F Anna 2604 0.02667896 #> 3 1880 F Emma 2003 0.02052149 #> 4 1880 F Elizabeth 1939 0.01986579 #> 5 1880 F Minnie 1746 0.01788843 #> 6 1880 F Margaret 1578 0.01616720 dim(babynames) #> [1] 1825433 5
library(readr) bb_path <- tempfile(fileext = ".csv", tmpdir = ".") write_csv(babynames, bb_path) read_csv(bb_path) #> Source: local data frame [1,825,433 x 5] #> #> year sex name n prop #> (int) (chr) (chr) (int) (dbl) #> 1 1880 F Mary 7065 0.07238359 #> 2 1880 F Anna 2604 0.02667896 #> 3 1880 F Emma 2003 0.02052149 #> 4 1880 F Elizabeth 1939 0.01986579 #> 5 1880 F Minnie 1746 0.01788843 #> 6 1880 F Margaret 1578 0.01616720 #> 7 1880 F Ida 1472 0.01508119 #> 8 1880 F Alice 1414 0.01448696 #> 9 1880 F Bertha 1320 0.01352390 #> 10 1880 F Sarah 1288 0.01319605 #> .. ... ... ... ... ...
library(readxl) read_excel("my-spreadsheet.xls", sheet = "data") read_excel("my-spreadsheet.xls", sheet = 2)
library(haven) # SAS files read_sas("path/to/file") # SPSS files read_por("path/to/file") read_sav("path/to/file") # Stata files read_dta("path/to/file")
db <- src_sqlite("babynames.sqlite3", create = TRUE) if (!db_has_table(db$con, "babynames")) { copy_to(db, babynames) } #> Source: sqlite 3.8.6 [babynames.sqlite3] #> From: babynames [1,825,433 x 5] #> #> year sex name n prop #> (dbl) (chr) (chr) (int) (dbl) #> 1 1880 F Mary 7065 0.07238359 #> 2 1880 F Anna 2604 0.02667896 #> 3 1880 F Emma 2003 0.02052149 #> 4 1880 F Elizabeth 1939 0.01986579 #> 5 1880 F Minnie 1746 0.01788843 #> 6 1880 F Margaret 1578 0.01616720 #> 7 1880 F Ida 1472 0.01508119 #> 8 1880 F Alice 1414 0.01448696 #> 9 1880 F Bertha 1320 0.01352390 #> 10 1880 F Sarah 1288 0.01319605 #> .. ... ... ... ... ...
db #> src: sqlite 3.8.6 [babynames.sqlite3] #> tbls: babynames, sqlite_stat1 tbl(db, "babynames") #> Source: sqlite 3.8.6 [babynames.sqlite3] #> From: babynames [1,825,433 x 5] #> #> year sex name n prop #> (dbl) (chr) (chr) (int) (dbl) #> 1 1880 F Mary 7065 0.07238359 #> 2 1880 F Anna 2604 0.02667896 #> 3 1880 F Emma 2003 0.02052149 #> 4 1880 F Elizabeth 1939 0.01986579 #> 5 1880 F Minnie 1746 0.01788843 #> 6 1880 F Margaret 1578 0.01616720 #> 7 1880 F Ida 1472 0.01508119 #> 8 1880 F Alice 1414 0.01448696 #> 9 1880 F Bertha 1320 0.01352390 #> 10 1880 F Sarah 1288 0.01319605 #> .. ... ... ... ... ...
h <- db %>% tbl("babynames") %>% filter(name == "Hilary")
class(h) #> [1] "tbl_sqlite" "tbl_sql" "tbl" h$query #> <Query> SELECT "year", "sex", "name", "n", "prop" #> FROM "babynames" #> WHERE "name" = 'Hilary' #> <SQLiteConnection> # execute SQL query and bring into R hc <- collect(h) class(hc) #> [1] "tbl_df" "tbl" "data.frame" hc #> Source: local data frame [190 x 5] #> #> year sex name n prop #> (dbl) (chr) (chr) (int) (dbl) #> 1 1882 M Hilary 7 5.736153e-05 #> 2 1883 M Hilary 6 5.334282e-05 #> 3 1887 M Hilary 7 6.403571e-05 #> 4 1891 M Hilary 8 7.321314e-05 #> 5 1896 M Hilary 6 4.648496e-05 #> 6 1897 M Hilary 5 4.100276e-05 #> 7 1898 M Hilary 5 3.784811e-05 #> 8 1902 M Hilary 8 6.026411e-05 #> 9 1904 M Hilary 5 3.609900e-05 #> 10 1905 M Hilary 6 4.188628e-05 #> .. ... ... ... ... ...
plot_ly(hc, x = year, y = prop, color = sex, colors = c("blue", "hotpink"))
Use Rmarkdown to write a one page report, that does these things:
popular <- babynames %>% group_by(name) %>% summarise(N = sum(n)) %>% arrange(desc(N)) popular #> Source: local data frame [93,889 x 2] #> #> name N #> (chr) (int) #> 1 James 5129096 #> 2 John 5106590 #> 3 Robert 4816785 #> 4 Michael 4330805 #> 5 Mary 4130441 #> 6 William 4071368 #> 7 David 3590557 #> 8 Joseph 2580687 #> 9 Richard 2564867 #> 10 Charles 2376700 #> .. ... ...
top <- top_n(popular, 1000) topnames <- subset(babynames, name %in% top[["name"]]) topnames #> Source: local data frame [178,358 x 5] #> #> year sex name n prop #> (dbl) (chr) (chr) (int) (dbl) #> 1 1880 F Mary 7065 0.07238359 #> 2 1880 F Anna 2604 0.02667896 #> 3 1880 F Emma 2003 0.02052149 #> 4 1880 F Elizabeth 1939 0.01986579 #> 5 1880 F Minnie 1746 0.01788843 #> 6 1880 F Margaret 1578 0.01616720 #> 7 1880 F Ida 1472 0.01508119 #> 8 1880 F Alice 1414 0.01448696 #> 9 1880 F Bertha 1320 0.01352390 #> 10 1880 F Sarah 1288 0.01319605 #> .. ... ... ... ... ...
library(shiny) library(ggplot2) ui <- bootstrapPage( selectizeInput( inputId = 'name', label = 'Enter a name', choices = unique(topnames$name), selected = "James", multiple = TRUE ), plotOutput('plot') ) server <- function(input, output) { output$plot <- renderPlot({ dat <- subset(topnames, name %in% input$name) ggplot(dat, aes(year, prop, colour = sex)) + geom_line() + facet_wrap(~ name) }) } runApp(shinyApp(ui, server))