---
title: "Web scraping and graphics"
subtitle: "E/EBS Honours, Monash University"
author: "Carson Sievert (cpsievert1@gmail.com, @cpsievert); Di Cook (dicook@monash.edu, @visnut); Heike Hofmann (heike.hofmann@gmail.com, @heike_hh); Barret Schloerke (schloerke@gmail.com, @schloerke)"
date: "`r Sys.Date()`"
output:
  ioslides_presentation:
    transition: default
    widescreen: true
    css: styles.css
---

```{r, echo = FALSE}
knitr::opts_chunk$set(
  message = FALSE,
  warning = FALSE,
  collapse = TRUE,
  comment = "#>",
  fig.height = 4,
  fig.width = 8,
  fig.align = "center",
  cache = FALSE
)
```

## The Data Analysis Workflow

* Adapted from [R for data science](http://r4ds.had.co.nz/intro.html) by Garrett Grolemund and Hadley Wickham.

## A web of data

- In 2008, [an estimated](http://yz.mit.edu/papers/webtables-vldb08.pdf) __154 million HTML tables__ (out of the 14.1 billion) contain 'high quality relational data'!!!
- Hard to quantify how much more exists outside of HTML tables, but there is [an estimate](https://cs.uwaterloo.ca/~x4chu/SIGMOD2015_1.pdf) of __at least 30 million lists__ with 'high quality relational data'.
- A growing number of websites/companies [provide programmatic access](http://www.programmableweb.com/category/all/apis?order=field_popularity) to their data/services via web APIs (that data typically comes in XML/JSON format).

## Before scraping, do some googling!

- Chances are, someone else built a tool to help you.
- I wrote [pitchRx](http://cran.r-project.org/web/packages/pitchRx/) which downloads, parses, cleans, and transforms XML data for a specific baseball data resource. Just give it start/end dates.
- [ropensci](https://ropensci.org/) has a [ton of R packages](https://ropensci.org/packages/) providing easy-to-use interfaces to open data.
- The [Web Technologies and Services CRAN Task View](http://cran.r-project.org/web/views/WebTechnologies.html) is a great overview of various tools for working with data that lives on the web in R.

## A web of _messy_ data!

- In statistical modeling, we typically assume data is [tidy](http://vita.had.co.nz/papers/tidy-data.pdf).
- That is, data appears in a tabular form where
    * 1 row == 1 observation
    * 1 column == 1 variable (observational attribute)
- Parsing HTML/XML/JSON is easy, but putting it into a tidy form is typically _not easy_.
- Knowing a bit about modern tools & web technologies makes it _much_ easier.

## Motivating Example

## Inspecting elements

## Hover to find desired elements
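
Once the inspector reveals an element's class or id, that attribute can serve as a CSS selector. A minimal sketch on a made-up HTML snippet (the `stats` class and `summary` id are hypothetical, chosen only for illustration):

```r
library(rvest)

# A made-up page: the class "stats" and the id "summary" are hypothetical.
page <- read_html('
  <html><body>
    <p class="stats">Mean: 3</p>
    <p class="stats">Max: 9</p>
    <p id="summary">Two statistics reported</p>
  </body></html>')

# ".stats" selects every element whose class is "stats"
by_class <- page %>% html_nodes(".stats") %>% html_text()

# "#summary" selects the single element whose id is "summary"
by_id <- page %>% html_node("#summary") %>% html_text()

by_class
by_id
```

Classes may match many elements (hence `html_nodes()`), while an id should match at most one (hence `html_node()`).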

## Wikitable

```{r}
library(rvest)
# read_html() replaces the deprecated html()
src <- read_html("http://en.wikipedia.org/wiki/Table_(information)")
node <- html_node(src, css = ".wikitable")
```

- `".wikitable"` is a CSS selector which says: "grab nodes (aka elements) with a class of wikitable".
- `html_table()` converts a single `<table>` node to a data frame.

```{r}
html_table(node)
```

## Pipeable!

```{r}
read_html("http://en.wikipedia.org/wiki/Table_(information)") %>%
  html_node(".wikitable") %>%
  html_table()
```

- Much easier to read/understand!

## Your Turn 1

Navigate [this page](http://www.wunderground.com/history/airport/KVAY/2015/2/17/DailyHistory.html?req_city=Cherry+Hill&req_state=NJ&req_statename=New+Jersey&reqdb.zip=08002&reqdb.magic=1&reqdb.wmo=99999&MR=1) and try the following:

__Easy__: Grab the table at the bottom of the page (hint: instead of grabbing a node by class with `html_node(".class")`, you can grab by id with `html_node("#id")`)

__Medium__: Grab the actual mean, max, and min temperature.

__Hard__: Grab the weather history graph and write the figure to disk (`download.file()` may be helpful here).

[See here](https://gist.github.com/cpsievert/57be009120bb5298affa) for a solution (thanks Hadley Wickham for the example)

# What about non-`<table>` data?

## (selectorgadget + rvest) to the rescue!

- [Selectorgadget](http://selectorgadget.com/) is a [Chrome browser extension](https://chrome.google.com/webstore/detail/selectorgadget/mhjhnkcfbdhnjickkkdbjoemdmbfginb?hl=en) for quickly extracting desired parts of an HTML page.
- With some user feedback, the gadget finds the [CSS selector](http://www.w3.org/TR/2011/REC-css3-selectors-20110929/) that returns the highlighted page elements.
- Let's try it out on [this page](http://www.sec.gov/litigation/suspensions.shtml)

## Extracting links to download reports

```{r}
domain <- "http://www.sec.gov"
susp <- paste0(domain, "/litigation/suspensions.shtml")
hrefs <- read_html(susp) %>%
  html_nodes("p+ table a") %>%
  html_attr(name = "href")
tail(hrefs)
```

```{r, eval = FALSE}
# download all the pdfs!
hrefs <- hrefs[!is.na(hrefs)]
pdfs <- paste0(domain, hrefs)
mapply(download.file, pdfs, basename(pdfs))
```

## Your Turn 2

Navigate to Wikipedia's [list of data structures](http://en.wikipedia.org/wiki/List_of_data_structures) and use SelectorGadget + rvest to do the following:

1. Obtain a list of Primitive types
2. Obtain a list of the different Array types

[See here](https://gist.github.com/cpsievert/c1b851ff5e1bd846de46) for a solution.

# Scraping _dynamic_ web pages

## A simple example
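
The problem with dynamic pages can be reproduced offline. A minimal sketch, assuming a hypothetical page whose table is created by JavaScript in the browser rather than shipped in the HTML source:

```r
library(rvest)

# Hypothetical page source: the table exists only after a browser runs the script.
page_source <- '
  <html><body>
    <script>
      var tbl = document.createElement("table");
      document.body.appendChild(tbl);
    </script>
  </body></html>'

doc <- read_html(page_source)

# rvest parses the HTML as delivered and never executes JavaScript,
# so the script is present but the table it would create is not.
n_scripts <- length(html_nodes(doc, "script"))
n_tables  <- length(html_nodes(doc, "table"))

c(scripts = n_scripts, tables = n_tables)
```

A browser would run the script and show a table; a static parser only ever sees the script itself.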

---

```{r, error = TRUE}
read_html("http://bl.ocks.org/cpsievert/raw/2a9fb8f504cd56e9e8e3/") %>%
  html_node("table")
```

* Huh, no `<table>`?

---

```{r}
read_html("http://bl.ocks.org/cpsievert/raw/2a9fb8f504cd56e9e8e3/") %>%
  html_node("body") %>%
  as.character() %>%
  cat()
```

## Browser <-> Web Server

---

[rdom](https://github.com/cpsievert/rdom) can construct the DOM:

```{r, eval = FALSE}
library(rdom)
rdom("http://bl.ocks.org/cpsievert/raw/2a9fb8f504cd56e9e8e3/") %>%
  html_node("table") %>%
  html_table()
```

```
    X1   X2   X3
1 Cell Cell Cell
2 Cell Cell Cell
3 Cell Cell Cell
```

You can give `rdom()` CSS Selectors directly to avoid sending the _entire_ DOM from phantomjs to R:

```{r, eval = FALSE}
rdom("http://www.techstars.com/companies/stats/", "table") %>%
  html_table()
```

## Don't abuse your power

- If you scrape a website, please read the terms and conditions!!
- For [client-side dynamic sites](https://en.wikipedia.org/wiki/Dynamic_web_page#Client-side_scripting), it's sometimes more efficient/appropriate to [find the API](http://www.gregreda.com/2015/02/15/web-scraping-finding-the-api/) rather than rendering the entire DOM.
- If a website publicly offers an API, USE IT (instead of scraping)!!!

## Web APIs

- [Server-side Web APIs](https://en.wikipedia.org/wiki/Web_API#Server-side) are a popular way to provide easy access to data and other services.
- If you (the client) want data from a server, you typically need one HTTP verb -- `GET`.

```{r}
library(httr)
response <- GET("https://api.github.com/users/hadley")
content(response)[c("name", "company")]
```

- Other HTTP verbs -- `POST`, `PUT`, `DELETE`, etc...
    * You probably won't need these unless you're developing a web app.

## Request/response model

- When you (the client) _request_ a resource from the server, the server _responds_ with a bunch of additional information.

```{r}
response$headers[1:3]
```

- Nowadays content-type is usually XML or JSON (HTML is great for _sharing content_ between _people_, but it isn't great for _exchanging data_ between _machines_.)

## What is XML?

XML is a markup language that looks very similar to HTML.

```xml
<mariokart>
  <driver name="Bowser" occupation="Koopa">
    <vehicle speed="55" weight="25"> Wario Bike </vehicle>
    <vehicle speed="40" weight="67"> Piranha Prowler </vehicle>
  </driver>
  <driver name="Peach" occupation="Princess">
    <vehicle speed="54" weight="29"> Royal Racer </vehicle>
    <vehicle speed="50" weight="34"> Wild Wing </vehicle>
  </driver>
</mariokart>
```

- This example shows that XML can be (and is) used to store inherently tabular data ([thanks Jeroen Ooms for the fun example](http://arxiv.org/pdf/1403.2805v1.pdf))
- What are the observational units here? How many observations in total?
- Two units and 6 total observations (4 vehicles and 2 drivers).

## XML2R

[XML2R](https://github.com/cpsievert/XML2R) is a framework to simplify acquisition of tabular/relational XML.

```{r, eval = FALSE}
library(XML2R)
obs <- XML2Obs("https://gist.githubusercontent.com/cpsievert/85e340814cb855a60dc4/raw/651b7626e34751c7485cff2d7ea3ea66413609b8/mariokart.xml")
table(names(obs))
```

```{r, echo = FALSE}
library(XML2R)
obs <- XML2Obs("https://gist.githubusercontent.com/cpsievert/85e340814cb855a60dc4/raw/651b7626e34751c7485cff2d7ea3ea66413609b8/mariokart.xml", quiet = TRUE)
obs <- lapply(obs, function(x) x[, !colnames(x) %in% "url", drop = FALSE])
table(names(obs))
```

* The main idea of __XML2R__ is to coerce XML into a _flat_ list of observations.
* The list names track the "observational unit".
* The list values track the "observational attributes".

---

```{r}
obs
```

---

```{r}
collapse_obs(obs) # group into table(s) by observational name/unit
```

- What information have I lost?
- I can't map vehicles to the drivers!

---

```{r}
obs <- add_key(obs, parent = "mariokart//driver", recycle = "name")
collapse_obs(obs)
```

---

Now (if I want) I can merge the tables into a single table...

```{r}
tabs <- collapse_obs(obs)
merge(tabs[[1]], tabs[[2]], by = "name")
```

## What about JSON?

- JSON is quickly becoming _the_ format for data on the web.
- JavaScript Object Notation (JSON) consists of two components:
    * arrays => [value1, value2]
    * objects => {"key1": value1, "key2": [value2, value3]}

## Back to Mariokart {.smaller}

```json
[
  {
    "driver": "Bowser",
    "occupation": "Koopa",
    "vehicles": [
      {
        "model": "Wario Bike",
        "speed": 55,
        "weight": 25
      },
      {
        "model": "Piranha Prowler",
        "speed": 40,
        "weight": 67
      }
    ]
  },
  {
    "driver": "Peach",
    "occupation": "Princess",
    "vehicles": [
      {
        "model": "Royal Racer",
        "speed": 54,
        "weight": 29
      },
      {
        "model": "Wild Wing",
        "speed": 50,
        "weight": 34
      }
    ]
  }
]
```

---

```{r}
library(jsonlite)
mario <- fromJSON("http://bit.ly/mario-json")
str(mario) # nested data.frames?!?
```

---

```{r}
mario$driver
mario$vehicles
```

How do we get two tables (with a common id) like the XML example?

---

```{r}
# this Map() call is essentially equivalent to add_key
vehicles <- Map(function(x, y) cbind(x, driver = y), mario$vehicles, mario$driver)
Reduce(rbind, vehicles)
mario[!grepl("vehicle", names(mario))]
```

# Hello shiny!

## What is shiny?

* *courtesy of :

## Motivating Example

```{r, eval = FALSE}
# install dependencies and run first example (press ESC to quit)
if (!require("shiny")) install.packages("shiny")
if (!require("leaflet")) install.packages("leaflet")
library(shiny) # make sure runGitHub() is available after a fresh install
runGitHub("rstudio/shiny-examples", subdir = "063-superzip-example")
```

## Learn by example

```{r, eval = FALSE}
library(shiny)
library(ggplot2)
ui <- fluidPage(
  numericInput(
    inputId = "size",
    label = "Choose a point size",
    value = 3,
    min = 1,
    max = 10
  ),
  plotOutput("plotId")
)
server <- function(input, output) {
  output$plotId <- renderPlot({
    ggplot(mtcars, aes(wt, mpg)) +
      geom_point(size = input$size)
  })
}
shinyApp(ui, server)
```

---

```{r, eval = FALSE}
ui <- fluidPage(
  sidebarPanel(
    selectInput(
      inputId = "x",
      label = "Choose an x variable",
      choices = names(mtcars)
    ),
    selectInput(
      inputId = "y",
      label = "Choose a y variable",
      choices = names(mtcars)
    )
  ),
  mainPanel(
    plotOutput("plotId")
  )
)
server <- function(input, output) {
  output$plotId <- renderPlot({
    ggplot(mtcars, aes_string(input$x, input$y)) +
      geom_point()
  })
}
shinyApp(ui, server)
```

## Your Turn

* Add a control for `colour`.
* Get creative!
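
---

One possible starting point for the `colour` control, sketched by extending the point-size example. The input id `"colour"` and the three colour choices are assumptions made for illustration, not part of the original exercise:

```r
library(shiny)
library(ggplot2)

# Hypothetical input id "colour" and an arbitrary set of colour choices
ui <- fluidPage(
  numericInput(
    inputId = "size",
    label = "Choose a point size",
    value = 3, min = 1, max = 10
  ),
  selectInput(
    inputId = "colour",
    label = "Choose a point colour",
    choices = c("black", "red", "blue")
  ),
  plotOutput("plotId")
)

server <- function(input, output) {
  output$plotId <- renderPlot({
    # both inputs are fixed (not mapped to data), so they go outside aes()
    ggplot(mtcars, aes(wt, mpg)) +
      geom_point(size = input$size, colour = input$colour)
  })
}

# build the app object; launch it only in an interactive session
app <- shinyApp(ui, server)
if (interactive()) runApp(app)
```

Mapping `colour` to a variable instead (for example a column of `mtcars`) would move it inside the aesthetic mapping, which is a natural next step.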