2016-02-25

Your turn

  • What is a model?

Outline

  • Linear Regression

Real estate prices

library(readr)
realestate <- read_csv("http://dicook.github.io/Monash-R/4-Modelling/data/realestate.csv")
str(realestate)
#> Classes 'tbl_df', 'tbl' and 'data.frame':    522 obs. of  19 variables:
#>  $ id          : int  1 2 3 4 5 6 7 8 9 10 ...
#>  $ price       : int  360000 340000 250000 205500 275500 248000 229900 150000 195000 160000 ...
#>  $ sqft        : int  3032 2058 1780 1638 2196 1966 2216 1597 1622 1976 ...
#>  $ bed         : int  4 4 4 4 4 4 3 2 3 3 ...
#>  $ bath        : int  4 2 3 2 3 3 2 1 2 3 ...
#>  $ ac          : chr  "yes" "yes" "yes" "yes" ...
#>  $ cars        : int  2 2 2 2 2 5 2 1 2 1 ...
#>  $ pool        : chr  "no" "no" "no" "no" ...
#>  $ built       : int  1972 1976 1980 1963 1968 1972 1972 1955 1975 1918 ...
#>  $ quality     : chr  "medium" "medium" "medium" "medium" ...
#>  $ style       : int  1 1 1 1 7 1 7 1 1 1 ...
#>  $ lot         : int  22221 22912 21345 17342 21786 18902 18639 22112 14321 32358 ...
#>  $ highway     : chr  "no" "no" "no" "no" ...
#>  $ fitted.m3   : num  12.9 12.4 12.2 12.1 12.5 ...
#>  $ fitted.m4   : num  12.7 12.4 12.3 12.2 12.5 ...
#>  $ lot.discrete: chr  "large" "large" "large" "large" ...
#>  $ resid.m5    : num  -0.0212 0.3448 0.0735 0.1227 0.0938 ...
#>  $ fitted.m5   : num  12.8 12.4 12.4 12.1 12.4 ...
#>  $ resid.m5b   : num  -30705 70168 7858 28684 3620 ...

Real estate prices

The real estate data set consists of observations on the sale price of 522 homes, along with numerous characteristics of each home and property.

We might be interested in how the price is affected by these characteristics.

A first model?

  • not so fast!
  • getting a quick overview of your data using visualizations pays off enormously, because you're not modelling in the dark.

A generalized scatterplot matrix

library(GGally)
ggpairs(realestate[,2:7])

Fitting a regression

Fitting a regression model in R is accomplished through the lm command

lm(formula, data, weights, na.action)

A formula has the form

y ~ x1 + x2 + x3

where y is the dependent variable (here: price) and x1, x2, x3 are the covariates
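
A formula is an ordinary R object, so it can be built and inspected before any model is fitted. A minimal sketch using only base R; the object name f and the choice of covariates are picked here purely for illustration:

```r
# Build a formula without fitting anything; the variables need not exist yet
f <- price ~ sqft + bed + bath

class(f)       # "formula"
all.vars(f)    # variable names, response first
f[[2]]         # left-hand side (the response)
f[[3]]         # right-hand side (the covariates)
```

Storing the formula in an object makes it easy to reuse the same model specification in several lm calls.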

A first model

lm(price ~ sqft, data=realestate)
#> 
#> Call:
#> lm(formula = price ~ sqft, data = realestate)
#> 
#> Coefficients:
#> (Intercept)         sqft  
#>      -81433          159

Estimate for Intercept is the average price of a 0 sqft home … doesn't make much sense :)

… but the price increases with the square footage (by about 159 per additional sq ft) … that DOES make sense

(1 sq ft = 0.093 sq m or 1 sq m = 10.8 sq ft)
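
To make the slope concrete, the rounded coefficients from the printout can be plugged in by hand. The 2000 sqft home is a hypothetical example, and because the printed coefficients are rounded, the results are only approximate:

```r
# Rounded coefficients copied from the lm() printout
b0 <- -81433   # intercept
b1 <- 159      # change in price per additional square foot

# Predicted price for a hypothetical 2000 sqft home
pred_2000 <- b0 + b1 * 2000
pred_2000

# The same slope expressed per square metre (1 sq m is about 10.8 sq ft)
b1_per_sqm <- b1 * 10.8
b1_per_sqm
```

With a fitted model object, predict() with a newdata argument would do the same calculation using the unrounded coefficients.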

Problem: the variance of the prices increases with the size of the homes.

Solution: use a transformation (square root or log) for the price

Using a log transformation, variability stabilizes (not perfect, but better)
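
The effect of the transformation can be checked numerically as well as visually. The sketch below uses simulated data, not the real estate file: the data frame sim, the helper spread_ratio and the parameter values (which loosely echo the log-scale fit reported in this deck) are all assumptions made for illustration.

```r
# Simulated stand-in data with multiplicative errors (NOT the real estate data)
set.seed(1)
sim <- data.frame(sqft = runif(500, 1000, 4000))
sim$price <- exp(11.3 + 0.0005 * sim$sqft + rnorm(500, sd = 0.23))

fit_raw <- lm(price ~ sqft, data = sim)
fit_log <- lm(log(price) ~ sqft, data = sim)

# Hypothetical helper: residual sd for large homes divided by that for small homes
spread_ratio <- function(model, x) {
  r <- resid(model)
  sd(r[x > median(x)]) / sd(r[x <= median(x)])
}

ratio_raw <- spread_ratio(fit_raw, sim$sqft)  # well above 1: spread grows with size
ratio_log <- spread_ratio(fit_log, sim$sqft)  # close to 1: spread roughly constant
c(raw = ratio_raw, log = ratio_log)
```

A ratio near 1 suggests similar residual spread for small and large homes; a plot of residuals against fitted values remains the more informative check.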

First model then becomes

lm(log(price)~sqft, data=realestate)
#> 
#> Call:
#> lm(formula = log(price) ~ sqft, data = realestate)
#> 
#> Coefficients:
#> (Intercept)         sqft  
#>   1.128e+01    5.097e-04

Models are objects, too (so save model into a named object)

summary provides a nicer overview of the model output

m1 <- lm(log(price)~sqft, data=realestate)
summary(m1)
#> 
#> Call:
#> lm(formula = log(price) ~ sqft, data = realestate)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -0.76772 -0.14834 -0.01448  0.11921  0.84182 
#> 
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) 1.128e+01  3.427e-02  329.24   <2e-16 ***
#> sqft        5.097e-04  1.446e-05   35.24   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.2347 on 520 degrees of freedom
#> Multiple R-squared:  0.7049, Adjusted R-squared:  0.7043 
#> F-statistic:  1242 on 1 and 520 DF,  p-value: < 2.2e-16
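
Because the model is an object, individual pieces of the summary can be pulled out with accessor functions instead of being read off the printout. A minimal sketch on simulated stand-in data (the data frame sim and the object names fit and s are assumptions, not part of the real estate analysis):

```r
# Simulated stand-in data (NOT the real estate data)
set.seed(42)
sim <- data.frame(sqft = runif(200, 1000, 4000))
sim$price <- exp(11.3 + 0.0005 * sim$sqft + rnorm(200, sd = 0.23))

fit <- lm(log(price) ~ sqft, data = sim)
s <- summary(fit)

coef(fit)       # named vector of coefficient estimates
coef(s)         # matrix: estimate, std. error, t value, p value
s$r.squared     # multiple R-squared
s$sigma         # residual standard error
confint(fit)    # 95% confidence intervals for the coefficients
```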

Adding effects

update takes an existing model object and refits it with a modified set of effects

.~.+ac keeps left hand side (. on the left of the ~) the same and adds ac to the existing right hand side

m2 <- update(m1, .~.+ac)

summary(m2)
#> 
#> Call:
#> lm(formula = log(price) ~ sqft + ac, data = realestate)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -0.7323 -0.1407 -0.0204  0.1180  0.8196 
#> 
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) 1.120e+01  3.615e-02  309.88  < 2e-16 ***
#> sqft        4.872e-04  1.457e-05   33.45  < 2e-16 ***
#> acyes       1.589e-01  2.764e-02    5.75 1.52e-08 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.2278 on 519 degrees of freedom
#> Multiple R-squared:  0.7226, Adjusted R-squared:  0.7215 
#> F-statistic: 675.9 on 2 and 519 DF,  p-value: < 2.2e-16
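
The dot notation can be tried on bare formulas, without fitting anything, since update also works on formula objects. A small sketch; the names f1, f2 and f3 are chosen here for illustration:

```r
f1 <- log(price) ~ sqft

# Keep the left-hand side, add ac to the right-hand side
f2 <- update(f1, . ~ . + ac)
f2

# Terms can be removed the same way
f3 <- update(f2, . ~ . - ac)
f3
```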

R can deal with categorical and quantitative variables in lm

only the coefficient for houses with AC (acyes) is shown - houses without AC (ac = "no") are used as the baseline, whose effect is set to 0 by default

options()$contrasts
#>         unordered           ordered 
#> "contr.treatment"      "contr.poly"
?contr.treatment
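
What treatment contrasts do to a categorical variable can be seen by looking at the design matrix directly. A minimal sketch with a made-up four-element vector ac (illustrative values, not rows from the data):

```r
# Toy categorical variable (made-up values)
ac <- c("yes", "no", "yes", "no")

# Design matrix: an intercept plus one 0/1 indicator column for "yes"
X <- model.matrix(~ ac)
X

# The first level in alphabetical order ("no") is the baseline
levels(factor(ac))

# Choosing a different baseline with relevel()
ac2 <- relevel(factor(ac), ref = "yes")
colnames(model.matrix(~ ac2))
```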

Interpreting Coefficients

!!!Beware transformations!!! they make interpretations tricky sometimes

log of price is expected to be higher by 1.589e-01 for houses with an AC … same as …

price of the house is on average exp(1.589e-01) = 1.172221 fold higher with AC than for a house of the same size without an AC (i.e. AC is associated with a price that is on average about 17% higher)
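
The back-transformation is a one-liner. The sketch below uses the estimate and standard error printed in the summary of m2; the interval is an approximate one built on the log scale and then exponentiated, which is one common convention rather than anything stated in these slides:

```r
# Estimate and standard error for acyes, copied from summary(m2)
b_ac  <- 1.589e-01
se_ac <- 2.764e-02

ratio <- exp(b_ac)          # multiplicative effect on price
pct   <- (ratio - 1) * 100  # the same effect as a percentage
ratio
pct

# Approximate 95% interval: build it on the log scale, then exponentiate
ci <- exp(b_ac + c(-1, 1) * qt(0.975, df = 519) * se_ac)
ci
```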

Model comparisons

Is model m2 actually an improvement over m1?

Statistical answer:

anova(m1, m2)
#> Analysis of Variance Table
#> 
#> Model 1: log(price) ~ sqft
#> Model 2: log(price) ~ sqft + ac
#>   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
#> 1    520 28.648                                  
#> 2    519 26.932  1    1.7157 33.063 1.524e-08 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Yes, duh.
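
The F statistic in the table can be reproduced by hand from the residual sums of squares. A sketch using the rounded numbers printed in the anova output, so the results agree with the table only up to rounding:

```r
# Values copied from the anova(m1, m2) table
rss2   <- 26.932   # residual sum of squares of the larger model m2
df2    <- 519      # residual degrees of freedom of m2
ss_add <- 1.7157   # reduction in RSS from adding ac (1 extra parameter)

# F = (reduction in RSS per added parameter) / (residual variance of m2)
F_stat <- (ss_add / 1) / (rss2 / df2)
F_stat

# Upper-tail p-value of the F distribution with 1 and 519 df
p_val <- pf(F_stat, df1 = 1, df2 = df2, lower.tail = FALSE)
p_val

# With a single added term, F equals the squared t value of acyes
5.75^2
```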

Visually:

qplot(sqft, log(price), colour=ac, data=realestate) + geom_smooth(method="lm")