April 17, 2019

Announcements

  • No meetup next week.
  • There will not be a lab or homework for the Bayesian unit.
  • Project proposals have been graded. Final projects are due May 15th.

Meetup Presentations

Weights of Books

data(allbacks, package='DAAG')
head(allbacks)
##   volume area weight cover
## 1    885  382    800    hb
## 2   1016  468    950    hb
## 3   1125  387   1050    hb
## 4    239  371    350    hb
## 5    701  371    750    hb
## 6    641  367    600    hb

From: Maindonald, J.H. & Braun, W.J. (2007). Data Analysis and Graphics Using R, 2nd ed.

Weights of Books (cont)

lm.out <- lm(weight ~ volume, data=allbacks)

\[ \hat{weight} = 108 + 0.71 volume \] \[ R^2 = 80\% \]
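
These numbers can be recovered from the fitted object, and the model used for prediction (a minimal sketch; the 1000 cubic centimeter volume is an illustrative value, not from the original):

coef(lm.out)                  # intercept about 107.7, volume slope about 0.709
summary(lm.out)$r.squared     # about 0.80
predict(lm.out, newdata = data.frame(volume = 1000))   # about 816 grams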

Modeling weights of books using volume

summary(lm.out)
## 
## Call:
## lm(formula = weight ~ volume, data = allbacks)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -189.97 -109.86   38.08  109.73  145.57 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 107.67931   88.37758   1.218    0.245    
## volume        0.70864    0.09746   7.271 6.26e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 123.9 on 13 degrees of freedom
## Multiple R-squared:  0.8026, Adjusted R-squared:  0.7875 
## F-statistic: 52.87 on 1 and 13 DF,  p-value: 6.262e-06

Weights of hardcover and paperback books

  • Can you identify a trend in the relationship between volume and weight of hardcover and paperback books?

  • Paperbacks generally weigh less than hardcover books after controlling for the book’s volume.

Modeling weights of books using volume and cover type

lm.out2 <- lm(weight ~ volume + cover, data=allbacks)
summary(lm.out2)
## 
## Call:
## lm(formula = weight ~ volume + cover, data = allbacks)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -110.10  -32.32  -16.10   28.93  210.95 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  197.96284   59.19274   3.344 0.005841 ** 
## volume         0.71795    0.06153  11.669  6.6e-08 ***
## coverpb     -184.04727   40.49420  -4.545 0.000672 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 78.2 on 12 degrees of freedom
## Multiple R-squared:  0.9275, Adjusted R-squared:  0.9154 
## F-statistic: 76.73 on 2 and 12 DF,  p-value: 1.455e-07

Linear Model

\[ \hat{weight} = 198 + 0.72 volume - 184 coverpb \]

  1. For hardcover books: plug in 0 for cover.
    \[ \hat{weight} = 197.96 + 0.72 volume - 184.05 \times 0 = 197.96 + 0.72 volume \]

  2. For paperback books: plug in 1 for cover.
    \[ \hat{weight} = 197.96 + 0.72 volume - 184.05 \times 1 = 13.91 + 0.72 volume \]
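
As a check, both fitted lines can be evaluated with predict(); the 1000 cubic centimeter volume below is an illustrative value, not one from the original:

predict(lm.out2, newdata = data.frame(volume = 1000, cover = c("hb", "pb")))
# hardcover: about 916 grams; paperback: about 732 grams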

Visualising the linear model
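
The slide here showed the fitted model plotted over the data; a minimal base-R sketch of one way to draw it (the colour and line-type choices are assumptions):

plot(weight ~ volume, data = allbacks, pch = 19,
     col = ifelse(allbacks$cover == "hb", "black", "red"))
abline(coef(lm.out2)[1], coef(lm.out2)[2])                       # hardcover line
abline(coef(lm.out2)[1] + coef(lm.out2)[3], coef(lm.out2)[2],
       col = "red", lty = 2)                                     # paperback line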

Interpretation of the regression coefficients

              Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)   197.9628     59.1927     3.34    0.0058
volume          0.7180      0.0615    11.67    0.0000
coverpb      -184.0473     40.4942    -4.55    0.0007



  • Slope of volume: All else held constant, books that are 1 more cubic centimeter in volume tend to weigh about 0.72 grams more.
  • Slope of cover: All else held constant, the model predicts that paperback books weigh about 184 grams less than hardcover books.
  • Intercept: Hardcover books with no volume are expected on average to weigh 198 grams.
    • Obviously, the intercept does not make sense in context. It only serves to adjust the height of the line.
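
The coefficient table above can be reproduced directly from the fitted object (a minimal sketch):

round(summary(lm.out2)$coefficients, 4)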

Modeling Poverty

poverty <- read.table("../Data/poverty.txt", header = TRUE, sep = "\t")
names(poverty) <- c("state", "metro_res", "white", "hs_grad", "poverty", "female_house")
poverty <- poverty[,c(1,5,2,3,4,6)]
head(poverty)
##        state poverty metro_res white hs_grad female_house
## 1    Alabama    14.6      55.4  71.3    79.9         14.2
## 2     Alaska     8.3      65.6  70.8    90.6         10.8
## 3    Arizona    13.3      88.2  87.7    83.8         11.1
## 4   Arkansas    18.0      52.5  81.0    80.9         12.1
## 5 California    12.8      94.4  77.5    81.1         12.6
## 6   Colorado     9.4      84.5  90.2    88.7          9.6

From: Gelman, A. & Hill, J. (2007). Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press.

Modeling Poverty
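
The slide under this heading likely showed exploratory plots of the data; a minimal sketch of a scatterplot matrix (an assumption, not code from the original):

pairs(poverty[, -1])   # pairwise scatterplots of the numeric variables, dropping state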

Predicting Poverty using Percent Female Householder

lm.poverty <- lm(poverty ~ female_house, data=poverty)
summary(lm.poverty)
## 
## Call:
## lm(formula = poverty ~ female_house, data = poverty)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.7537 -1.8252 -0.0375  1.5565  6.3285 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    3.3094     1.8970   1.745   0.0873 .  
## female_house   0.6911     0.1599   4.322 7.53e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.664 on 49 degrees of freedom
## Multiple R-squared:  0.276,  Adjusted R-squared:  0.2613 
## F-statistic: 18.68 on 1 and 49 DF,  p-value: 7.534e-05

Another look at \(R^2\)

\(R^2\) can be calculated in three ways:

  1. square the correlation coefficient of x and y (how we have been calculating it)
  2. square the correlation coefficient of y and \(\hat{y}\)
  3. based on definition:
    \[ R^2 = \frac{\text{explained variability in } y}{\text{total variability in } y} \]
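
The first two approaches can be checked directly on this model (a minimal sketch; the ANOVA-based third approach is worked out below):

cor(poverty$poverty, poverty$female_house)^2   # 1. square of the correlation of x and y
cor(poverty$poverty, fitted(lm.poverty))^2     # 2. square of the correlation of y and y-hat
summary(lm.poverty)$r.squared                  # the value lm reports, about 0.276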

Using ANOVA we can calculate the explained variability and total variability in y.

Sum of Squares

library(xtable)   # needed for the xtable() calls below
anova.poverty <- anova(lm.poverty)
print(xtable(anova.poverty, digits = 2), type='html')
              Df  Sum Sq  Mean Sq  F value  Pr(>F)
female_house   1  132.57   132.57    18.68    0.00
Residuals     49  347.68     7.10


Sum of squares of y: \(SS_{Total} = \sum (y - \bar{y})^2 = 480.25\) → total variability

Sum of squares of residuals: \(SS_{Error} = \sum e_i^2 = 347.68\) → unexplained variability

Model sum of squares: \(SS_{Model} = SS_{Total} - SS_{Error} = 132.57\) → explained variability

\[ R^2 = \frac{\text{explained variability in } y}{\text{total variability in } y} = \frac{132.57}{480.25} = 0.28 \]
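
The same ratio can be pulled straight from the ANOVA object (a minimal sketch):

ss <- anova.poverty[["Sum Sq"]]   # model and residual sums of squares: 132.57 and 347.68
ss[1] / sum(ss)                   # explained / total, about 0.276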

Why bother?

  • For single-predictor linear regression, having three ways to calculate the same value may seem like overkill.
  • However, in multiple linear regression, we can’t calculate \(R^2\) as the square of the correlation between x and y because we have multiple xs.
  • And next we’ll learn another measure of explained variability, adjusted \(R^2\), which requires the third approach: the ratio of explained to total variability.

Predicting poverty using % female householder and % white

lm.poverty2 <- lm(poverty ~ female_house + white, data=poverty)
print(xtable(lm.poverty2), type='html')
              Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)    -2.5789      5.7849    -0.45    0.6577
female_house    0.8869      0.2419     3.67    0.0006
white           0.0442      0.0410     1.08    0.2868

anova.poverty2 <- anova(lm.poverty2)
print(xtable(anova.poverty2, digits = 2), type='html')

              Df  Sum Sq  Mean Sq  F value  Pr(>F)
female_house   1  132.57   132.57    18.74    0.00
white          1    8.21     8.21     1.16    0.29
Residuals     48  339.47     7.07

\[ R^2 = \frac{\text{explained variability in } y}{\text{total variability in } y} = \frac{132.57 + 8.21}{480.25} = 0.29 \]
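
Again, this can be read off the ANOVA object (a minimal sketch):

ss2 <- anova.poverty2[["Sum Sq"]]   # 132.57, 8.21, 339.47
sum(ss2[1:2]) / sum(ss2)            # (132.57 + 8.21) / 480.25, about 0.29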

Does adding the variable white to the model add valuable information that wasn’t provided by female_house?

Collinearity between explanatory variables

poverty vs % female head of household

              Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)     3.3094      1.8970     1.74    0.0873
female_house    0.6911      0.1599     4.32    0.0001

poverty vs % female head of household and % white

              Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)    -2.5789      5.7849    -0.45    0.6577
female_house    0.8869      0.2419     3.67    0.0006
white           0.0442      0.0410     1.08    0.2868

Note the difference in the estimate for female_house between the two models: it changes from 0.69 to 0.89 once white is added, because the two explanatory variables are correlated with each other (collinear).
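
One way to quantify how related the two explanatory variables are (a minimal sketch; the resulting value is not given in the original):

cor(poverty$female_house, poverty$white)   # correlation between the two predictors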