---
title: "2-Introduction to Plotting Data"
subtitle: "E/EBS Honours, Monash University"
author: "Carson Sievert (cpsievert1@gmail.com, @cpsievert); Di Cook (dicook@monash.edu, @visnut); Heike Hofmann (heike.hofmann@gmail.com, @heike_hh); Barret Schloerke (schloerke@gmail.com, @schloerke)"
date: "`r Sys.Date()`"
output:
  ioslides_presentation:
    transition: default
    widescreen: true
    css: styles.css
---

```{r, echo = FALSE}
knitr::opts_chunk$set(
  message = FALSE,
  warning = FALSE,
  collapse = TRUE,
  comment = "#>",
  fig.height = 4,
  fig.width = 8,
  fig.align = "center",
  cache = FALSE
)
```

## Your turn

![](rainbow-lorikeet.png)

- What is a (data) plot?
- What are the three most important data plots?

## Your turn

![](rainbow-lorikeet.png)

How would you describe this plot?

```{r echo=FALSE}
library(ggplot2)
library(tidyr)
library(dplyr)
library(lubridate)
library(scales)
library(readr)
rates <- read_csv("http://dicook.github.io/Monash-R/data/rates.csv")
rates.sub <- select(rates, date, AUD, NZD)
rates.sub.m <- gather(rates.sub, currency, rate, -date)
rates.sub.m$date <- as.POSIXct(rates.sub.m$date)
qplot(date, rate, data=rates.sub.m, geom="line", colour=currency) +
  scale_x_datetime(breaks = date_breaks("1 month"), labels = date_format("%b")) +
  scale_color_brewer(palette="Dark2") +
  theme_bw()
```

## Using the package ggplot2

Elements of a plot

- data
- aesthetics: mapping of variables to graphical elements
- geom: type of plot structure to use
- transformations: log scale, ...

Additional components

- layers: multiple geoms, multiple data sets, annotation
- facets: show subsets in different plots
- themes: modifying style

## Have you opened your project?

![](projectname.png)

## Data - Currency cross rates

Extracted from [http://openexchangerates.org](http://openexchangerates.org) using its JSON API and the R package jsonlite.

```{r}
library(readr)
rates <- read_csv("http://dicook.github.io/Monash-R/data/rates.csv")
rates[1:5,1:8]
```

If you'd like to collect exchange rates yourself, see [here](https://gist.github.com/cpsievert/e05da83fc4253e6d1986).

## Plotting points

```{r}
qplot(date, AUD, data=rates)
```

## Adding lines

```{r}
qplot(date, AUD, data=rates, geom="line")
```

## Points and lines

```{r}
qplot(date, AUD, data=rates, geom=c("line", "point"))
```

## Plot structure

- data: rates
- aesthetics: x=date, y=AUD
- geom: point, line

```{r}
ggplot(data=rates, aes(x=date, y=AUD)) + geom_point() + geom_line()
```

## Multiple currencies

```{r}
ggplot(data=rates, aes(x=date, y=AUD)) +
  geom_line() +
  geom_line(aes(y=NZD), colour="blue") +
  geom_line(aes(y=GBP), colour="red")
```

## Hmmm

- That code is clunky!
- Better to rearrange data, and then let ggplot2 handle the colors, legends, ...

## Better way

```{r}
rates.sub <- select(rates, date, AUD, NZD, GBP)
rates.sub.m <- gather(rates.sub, currency, rate, -date)
qplot(date, rate, data=rates.sub.m, geom="line", colour=currency)
```

```{r echo=FALSE}
qplot(date, rate, data=rates.sub.m, geom="line", colour=currency, linetype=currency)
qplot(date, rate, data=rates.sub.m, geom=c("line","point"), colour=currency, linetype=currency, shape=currency)
```

## Emphasis on magnitude of cross rate

- On this scale we see that the GBP has a much lower ratio with the USD than AUD and NZD. This is information that is easy to obtain by eyeballing the numbers.
- We are probably more interested in comparing the trends of the rates, which suggests standardising them. We will use the traditional definition of standardising: subtract the mean and divide by the standard deviation.

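
As a minimal sketch of what standardising means here, the snippet below uses a made-up vector `x` (hypothetical values, not taken from `rates`) and checks that subtracting the mean and dividing by the standard deviation gives the same result as `scale()`.

```r
# Hypothetical toy values standing in for one currency column
x <- c(1.28, 1.30, 1.27, 1.35, 1.40)

# Standardise by hand: subtract the mean, divide by the standard deviation
z_manual <- (x - mean(x)) / sd(x)

# scale() centres and scales by default; it returns a one-column matrix,
# so as.numeric() is used to get a plain vector back
z_scale <- as.numeric(scale(x))

# The two versions should agree up to floating point error
all.equal(z_manual, z_scale)
```

After standardising, each series has mean 0 and standard deviation 1, so the currencies can be compared on a common vertical scale.
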
## Scaling

```{r}
rates.sub <- mutate(rates.sub, AUD=scale(AUD), NZD=scale(NZD), GBP=scale(GBP))
rates.sub$date <- as.Date(rates.sub$date)
rates.sub.m <- gather(rates.sub, currency, rate, -date)
qplot(date, rate, data=rates.sub.m, geom="line", colour=currency)
```

## Your turn

![](rainbow-lorikeet.png)

Set the linetype/shape to be different for the different currencies.

## Your turn

![](rainbow-lorikeet.png)

Set the linetype/shape to be different for the different currencies.

```{r}
qplot(date, rate, data=rates.sub.m, geom="line", colour=currency, linetype=currency, lwd=I(2))
qplot(date, rate, data=rates.sub.m, geom=c("line", "point"), colour=currency, linetype=currency, shape=currency)
```

## Scatterplot

```{r}
qplot(AUD, NZD, data=rates.sub) + theme(aspect.ratio=1)
```

## Scatterplot, time connected?

```{r}
qplot(AUD, NZD, data=rates.sub, geom="line") + theme(aspect.ratio=1)
```

Problem: line only connects points from left to right along the x axis.

## Scatterplot, time connected!

```{r}
qplot(AUD, NZD, data=rates.sub, geom="path", colour=order(date)) + theme(aspect.ratio=1)
```

## Scatterplot, overlaid density

```{r}
qplot(AUD, NZD, data=rates.sub, geom=c("density2d", "point")) + theme(aspect.ratio=1)
```

## Scatterplot of lagged AUD

```{r, fig.show='hold', fig.align='default', fig.width=3, fig.height=3}
AUD <- rates[,c("date", "AUD")]
AUD.1 <- lag(AUD$AUD, 1)
AUD.2 <- lag(AUD$AUD, 2)
AUD.7 <- lag(AUD$AUD, 7)
qplot(AUD, AUD.1, data=rates.sub) + theme(aspect.ratio=1)
qplot(AUD, AUD.2, data=rates.sub) + theme(aspect.ratio=1)
qplot(AUD, AUD.7, data=rates.sub) + theme(aspect.ratio=1)
```

## Your turn

![](rainbow-lorikeet.png)

Look up `geom_rug` in the [ggplot2 cheat sheet](https://www.rstudio.com/wp-content/uploads/2015/08/ggplot2-cheatsheet.pdf) and add the marginal distributions to a scatterplot of AUD and NZD.

```{r echo=FALSE, fig.show='hide'}
qplot(AUD, NZD, data=rates.sub) + geom_rug() + theme(aspect.ratio=1)
```

## Histogram

```{r}
qplot(AUD, data=rates.sub, geom="histogram")
```

## Your turn

![](rainbow-lorikeet.png)

Only one variable was passed to the plot command, but two axes are shown in the histogram. What happened? What is plotted on the vertical axis?

## Density

```{r}
qplot(AUD, data=rates.sub, geom="density", fill=I("black"))
```

## Your turn

![](rainbow-lorikeet.png)

- What ways have we seen data variables mapped to graphical elements so far?
- What are the natural ways to map a categorical variable to a graphical element?

## Modifying labels

```{r}
qplot(date, rate, data=rates.sub.m, geom="line", colour=currency) +
  xlab("Date") +
  ylab("Standardized rates") +
  ggtitle("Cross rates 23/2/2015-11/11/2015")
```

## Equations in labels

```{r}
qplot(date, rate, data=rates.sub.m, geom="line", colour=currency) +
  xlab(expression(Date[i]^2~ mu ~ pi * sigma)) +
  ylab("Standardized rates") +
  ggtitle("Cross rates 23/2/2015-11/11/2015")
```

## Scales: modifying axes

```{r}
rates.sub.m$date <- as.POSIXct(rates.sub.m$date)
p <- qplot(date, rate, data = rates.sub.m, geom = "line", colour = currency) +
  scale_x_datetime(breaks = date_breaks("1 month"), labels = date_format("%b")) +
  scale_y_continuous("Standardized rates")
p
```

## Legend Position

```{r}
p + theme(legend.position = "bottom")
```

## Themes

```{r}
library(ggthemes)
p + theme_tufte()
```

---

```{r}
p + theme_economist()
```

## Color palettes

```{r}
p + scale_color_brewer("", palette = "Dark2")
```

## Color blind-proofing

```{r, fig.width=4.5, fig.show='hold', fig.align='default'}
library(dichromat)
clrs <- hue_pal()(3)
p + scale_color_manual("", values=clrs) + theme(legend.position = "none")
clrs <- dichromat(hue_pal()(3))
p + scale_color_manual("", values=clrs) + theme(legend.position = "none")
```

---

```{r, fig.width=4.5, fig.show='hold', fig.align='default'}
library(RColorBrewer)
clrs <- brewer.pal(3, "Dark2")
p + scale_color_manual("", values=clrs) + theme(legend.position = "none")
clrs <- dichromat(brewer.pal(3, "Dark2"))
p + scale_color_manual("", values=clrs) + theme(legend.position = "none")
```

## Color palettes

- Qualitative: categorical variables
- Sequential: low to high numeric values
- Diverging: negative to positive values

---

```{r, echo=FALSE, fig.height=7, fig.width=12}
display.brewer.all()
```

## Modeling

```{r}
p <- qplot(date, rate, data = rates.sub.m, geom = "line", colour = currency, linetype=I(2)) +
  scale_x_datetime(breaks = date_breaks("1 month"), labels = date_format("%b")) +
  scale_y_continuous("Standardized rates")
p + geom_smooth() + scale_color_brewer(palette="Dark2")
```

```{r, echo=FALSE, fig.show='hide'}
library(forecast)
AUD.ts <- as.ts(rates$AUD, start=1, end=259)
AUD.ar <- ar(AUD.ts, newdata=AUD.ts)
AUD$resid <- as.numeric(AUD.ar$resid)
qplot(date, resid, data=AUD, geom="line")
```

```{r, echo=FALSE, fig.show='hide'}
AUD$weekday <- wday(AUD$date)
AUD$week <- week(AUD$date)
AUD_s <- AUD %>% group_by(week) %>% mutate(AUD_s = (AUD-mean(AUD)))
qplot(weekday, AUD_s, data=AUD_s, geom="line", group=week)
```

## Your turn

![](rainbow-lorikeet.png)

How has the EUR changed relative to the USD over the course of this year? What about the JPY?

```{r echo=FALSE, fig.show='hide'}
qplot(date, EUR, data=rates, geom="line")
qplot(date, JPY, data=rates, geom="line")
```

## Maps

- group currencies first, and then combine with geography

```{r, echo=FALSE, fig.show='hide'}
rates.sub2 <- t(scale(rates[,-c(1,13,18,22,36,115,152)]))
rates.hc <- hclust(dist(rates.sub2), method="ward.D2")
plot(rates.hc, hang=-1)
rates.t <- data.frame(rates.sub2)
rates.t$cl <- cutree(rates.hc, 3)
rates.cl <- data.frame(currency=rownames(rates.t), cl=rates.t$cl)
rates.sub2 <- data.frame(date=rates$date, scale(rates[,-c(1,13,18,22,36,115,152)]))
rates.sub2.m <- gather(rates.sub2, currency, rate, -date)
rates.sub2.m <- merge(rates.sub2.m, rates.cl)
```

```{r}
qplot(date, rate, data=rates.sub2.m, group=currency, geom="line", alpha=I(0.5)) +
  facet_wrap(~cl, ncol=3)
```

## Maps

```{r, echo=FALSE}
countries <- read_csv("http://dicook.github.io/Monash-R/data/countries.csv")
rates.countries <- merge(countries, rates.cl)
```

```{r, fig.height=4, fig.width=9, fig.show='hold'}
library(maps)
world <- map_data("world")
ggplot(data=world) +
  geom_path(aes(x=long, y=lat, group=group, order=order)) +
  theme_solid()
ggplot(data=world) +
  geom_polygon(aes(x=long, y=lat, group=group, order=order), fill="grey70") +
  theme_solid()
```

## Maps

```{r, fig.align='center', fig.show='hold'}
rates.map <- merge(rates.countries, world, by.x="name", by.y="region")
rates.map <- rates.map[order(rates.map$order),]
rates.map$cl <- factor(rates.map$cl)
ggplot(data=rates.map) +
  geom_polygon(aes(x=long, y=lat, group=group, order=order, fill=cl)) +
  scale_fill_brewer(palette="Dark2") +
  theme_solid() +
  theme(legend.position="none")
```

## Combining plots

Map and line plots

```{r}
p1 <- ggplot(data=subset(rates.map, cl==3)) +
  geom_polygon(aes(x=long, y=lat, group=group, order=order), fill="#1B9E77") +
  geom_path(aes(x=long, y=lat, group=group, order=order), data=world, colour="grey90") +
  theme_solid() +
  theme(legend.position="none")
p2 <- ggplot(data=subset(rates.sub2.m, cl==3)) +
  geom_line(aes(x=date, y=rate, group=currency), alpha=0.2, colour="#1B9E77") +
  theme(legend.position="none")
p3 <- ggplot(aes(x=long, y=lat, group=group, order=order), data=subset(rates.map, cl==2)) +
  geom_polygon(fill="#D95F02") +
  geom_path(data=world, colour="grey90") +
  theme_solid() +
  theme(legend.position="none")
p4 <- ggplot(data=subset(rates.sub2.m, cl==2)) +
  geom_line(aes(x=date, y=rate, group=currency), alpha=0.2, colour="#D95F02") +
  theme(legend.position="none")
```

## Now for the plots

```{r}
library(gridExtra)
grid.arrange(p1, p2, p3, p4, ncol=2)
```

## Maps - get the map

!! Requires an internet connection

```{r}
library(ggmap)
melb <- get_map(location=c(144.9631, -37.8136))
ggmap(melb) + theme_solid()
```

## Maps - Add points

```{r}
poll_loc <- read_csv("http://dicook.github.io/Monash-R/data/polling-places.csv")
ggmap(melb) +
  geom_point(data=poll_loc, aes(x=Long, y=Lat)) +
  theme_solid()
```

## Your turn

![](rainbow-lorikeet.png)

Work out how to get a **watercolor** map as the background.

```{r, fig.show='hide'}
melb <- get_map(location=c(144.9631, -37.8136), maptype="watercolor")
ggmap(melb) + theme_solid()
```

## More examples

Plots

- Barcharts
- Boxplots

Data

- subset of OECD PISA 2012 survey
- subset of 2006 NRC ratings of US graduate programs

## Categorical variables - barchart

```{r}
internet <- read_csv("http://dicook.github.io/Monash-R/data/internet.csv")
qplot(`Social networks`, data=internet, geom="bar", binwidth=0.5) +
  facet_grid(Gender~name)
```

## Categorical variables - stacked barchart

```{r}
ggplot(data=internet) +
  geom_bar(aes(x=`Social networks`, fill=Gender)) +
  facet_wrap(~name, ncol=5) +
  theme(legend.position="bottom")
```

## Categorical variables - dodged bars

```{r}
ggplot(data=internet) +
  geom_bar(aes(x=`Social networks`, fill=Gender), position="dodge") +
  facet_wrap(~name, ncol=5) +
  theme(legend.position="bottom")
```

## Quantitative and categorical - boxplots

```{r}
grad <- read_csv("http://dicook.github.io/Monash-R/data/graduate-programs.csv")
qplot(subject, AvGREs, data=grad, geom="boxplot")
```

## Your turn

![](rainbow-lorikeet.png)

How do the four programs compare in terms of average number of publications? Number of students?

## Perceptual principles

- Hierarchy of mappings: (first) position along an axis - (last) color (Cleveland, 1984; Heer and Bostock, 2009)
- Pre-attentive: Some elements are noticed before you even realise it.
- Color: (pre-attentive) palettes - qualitative, sequential, diverging.
- Proximity: Place elements for primary comparison close together.
- Change blindness: When focus is interrupted, differences may not be noticed.

## Pre-attentive

Can you find the odd one out?

```{r echo=FALSE}
df <- data.frame(x=runif(100), y=runif(100), cl=sample(c(rep("A", 1), rep("B", 99))))
qplot(x, y, data=df, shape=cl) +
  theme_bw() +
  theme(legend.position="none", aspect.ratio=1)
```

---

Is it easier now?

```{r echo=FALSE}
qplot(x, y, data=df, colour=cl) +
  theme_bw() +
  theme(legend.position="none", aspect.ratio=1)
```

## Your turn - Proximity

What's wrong with this plot? Can you answer: Is the proportion of girls who use social networks every day (4) higher than that of boys in Australia? And is this different in Germany?

```{r, echo=FALSE}
qplot(`Social networks`, data=internet, geom="bar", binwidth=0.5) +
  facet_grid(Gender~name)
```

## Proximity

```{r echo=FALSE}
internet.m.tb <- internet[,c(1,3,8)] %>%
  group_by(name, Gender, `Social networks`) %>%
  tally(sort=TRUE)
internet.m.tb <- subset(internet.m.tb, !is.na(`Social networks`))
internet.m.tb.n <- summarise(group_by(internet.m.tb, name, Gender), tot=sum(n))
internet.m.tb <- merge(internet.m.tb, internet.m.tb.n)
internet.m.tb.p <- summarise(group_by(internet.m.tb, name, Gender, `Social networks`), p=n/tot)
```

```{r, echo=FALSE}
qplot(`Social networks`, p, data=internet.m.tb.p, geom="line", color=Gender) +
  facet_wrap(~name, ncol=5) +
  theme(legend.position="bottom")
```

Is this easier? But what about answering this: Are German girls more likely to report using social networks once or twice per month (1) than Japanese girls?

---

```{r, echo=FALSE}
qplot(`Social networks`, p, data=internet.m.tb.p, geom="line", color=name) +
  facet_wrap(~Gender, ncol=2) +
  theme(legend.position="bottom")
```

Are German girls more likely to report using social networks once or twice per month (1) than Japanese girls? Easier?

## Your turn

![](rainbow-lorikeet.png)

Using rmarkdown, generate a document with several plots to answer these questions:

- How has the JPY changed relative to the USD over the course of this year?
- How do the four programs compare in terms of average number of publications? Number of students?
- Do boys tend to download music more than girls?

## Resources

- [Cheat sheet](https://www.rstudio.com/wp-content/uploads/2015/08/ggplot2-cheatsheet.pdf)
- [ggplot2: Elegant Graphics for Data Analysis, Hadley Wickham](http://ggplot2.org/book/), [web site](http://ggplot2.org)
- [R Graphics Cookbook, Winston Chang](http://www.cookbook-r.com/Graphs/)
- [Q/A site: http://stackoverflow.com](http://stackoverflow.com)
- [Dynamic Documents with R and knitr, Yihui Xie](http://yihui.name/knitr/)
- [Naomi Robbins, Creating More Effective Graphs](http://www.nbr-graphs.com)
- [Antony Unwin, Graphical Data Analysis with R](https://www.crcpress.com/Graphical-Data-Analysis-with-R/Unwin/9781498715232)

## Share and share alike

This work is licensed under the Creative Commons Attribution-Noncommercial 3.0 United States License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc/3.0/us/ or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.