Linear Regression in R

Throughout this handout, we'll make use of the heights data, which pairs the heights of wives and their husbands.

Let's try to discover whether there is a linear relationship between a wife's height and her husband's height.

In [13]:
library(lattice)
library(PBImisc)
library(dplyr)
library(ggplot2)
In [14]:
model2 <- lm(Husband ~ Wife, data = heights)
summary(model2)
Out[14]:
Call:
lm(formula = Husband ~ Wife, data = heights)

Residuals:
     Min       1Q   Median       3Q      Max 
-16.7438  -4.2838  -0.1615   4.2562  17.7500 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 37.81005   11.93231   3.169  0.00207 ** 
Wife         0.83292    0.07269  11.458  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6.468 on 94 degrees of freedom
Multiple R-squared:  0.5828,	Adjusted R-squared:  0.5783 
F-statistic: 131.3 on 1 and 94 DF,  p-value: < 2.2e-16

Ok, what does it all mean? Let's go through that line by line.

Coefficients

Recall that Linear Regression is all about explaining $Y$ with $\beta$ like this:

$$ Y = X\beta + \epsilon = \sum_{i=1}^p X_i \beta_i + \epsilon$$

Note that if we have $n$ observations and $p$ explanatory variables as above, then $Y, X_i, \epsilon \in \mathbb{R}^n$ and $X \in \mathbb{R}^{n\times p}$. It's important to realize that $X$ and $\epsilon$ are random variables.

How do we look for $\beta$? Well, we find the $\hat{\beta}$ that minimizes the Mean Squared Error (MSE) or, equivalently, the Residual Sum of Squares (RSS). The bottom line is that we minimize this expression:

$$ RSS(\tilde{\beta}) = \sum_{i=1}^n (y_i - \tilde{y_i})^2 \in \mathbb{R} $$

where $\tilde{y}_i = (X\tilde{\beta})_i$ is the $i$-th fitted value.
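
As a concrete illustration (a sketch using the model2 fit from above), the residual sum of squares of a fitted model is readily available in R:

sum((heights$Husband - fitted(model2))^2)   # RSS computed by hand
deviance(model2)                            # the same quantity, reported by R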

$\hat{\beta}$ is our estimator, and the Gauss-Markov theorem states that if the errors $\epsilon_i$ all have mean $0$, share a common variance $Var(\epsilon_i) = \sigma^2$, are uncorrelated, and $X^TX$ is invertible, then this estimator is the best linear unbiased estimator (BLUE). We are not going to dive into this topic.

Estimate column

So, the column Estimate of Coefficients is $\hat\beta$.

This is the crucial parameter that allows us to make predictions of $Y$ given new values of $X$. In our case, we can easily plot the fitted line:

In [15]:
xyplot(Husband~Wife, data=heights, typ=c("p", "r"))
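
The Estimate column is also all we need to predict a husband's height for new values of Wife. A minimal sketch, using a few made-up wife heights on the same scale as the data:

new_wives <- data.frame(Wife = c(155, 165, 175))       # hypothetical new observations
predict(model2, newdata = new_wives)                   # uses the Estimate column internally
coef(model2)[1] + coef(model2)[2] * new_wives$Wife     # the same, computed by hand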

If you wonder, like I did, why the Intercept term showed up (and therefore why the line does not cross $(0,0)$), that's because the lm function always adds an intercept term unless you explicitly remove it.

So lm(Husband ~ Wife) is in fact lm(Husband ~ Wife + 1), unless you explicitly write lm(Husband ~ Wife + 0).

In [16]:
summary(lm(Husband~Wife+0, data = heights))$coefficients
Out[16]:
         Estimate     Std. Error   t value      Pr(>|t|)
Wife     1.062914e+00 4.208631e-03 2.525558e+02 3.976268e-136

Std. Error column

Recall that our goal is to minimize

$$ RSS(\tilde{\beta}) = \sum_{i=1}^n (y_i - \tilde{y_i})^2 \in \mathbb{R} $$

By differentiating $RSS$ with respect to $\tilde{\beta}$ and setting the derivative to $0$, we get the equation

$$ \hat{\beta} = (X^TX)^{-1}X^Ty. $$

Note at this point that $\hat\beta$ is a random variable, as it is a transformation of $X$ and $y$. The Estimate column is in fact a realization of $\hat{\beta}$ given the observed values of $X$ and $y$. It therefore makes sense to speak about the distribution of $\hat{\beta}$ and about its Std. Error.
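
We can reproduce both the Estimate and Std. Error columns by hand. A minimal sketch, assuming the usual formula $\widehat{Var}(\hat\beta) = \hat\sigma^2 (X^TX)^{-1}$ with $\hat\sigma^2 = RSS / (n - p)$:

X <- model.matrix(model2)                                 # design matrix: intercept column + Wife
y <- heights$Husband

beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y              # (X'X)^{-1} X'y
sigma2_hat <- sum(resid(model2)^2) / df.residual(model2)  # estimate of sigma^2
se_hat <- sqrt(diag(sigma2_hat * solve(t(X) %*% X)))      # standard errors of beta-hat

cbind(beta_hat, se_hat)
summary(model2)$coefficients[, 1:2]                       # should match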

t value and $Pr(>|t|)$

Having the distribution, we are finally good to go to find the infamous p-value (the $Pr(>|t|)$ column). More precisely, row $i$ tests the null hypothesis $\beta_i = 0$ (with the other parameters kept in the model). A low p-value in a row suggests that we can reject this hypothesis and keep the $i$-th parameter in the model.

What is the statistic used for this test? I won't cover it in detail here.
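
For the curious, the last two columns can be reproduced directly (a sketch): the t value is the estimate divided by its standard error, and the p-value is the two-sided tail probability of a Student's $t$ distribution with $n - p$ degrees of freedom.

coefs <- summary(model2)$coefficients
t_val <- coefs[, "Estimate"] / coefs[, "Std. Error"]                       # t value column
p_val <- 2 * pt(abs(t_val), df = df.residual(model2), lower.tail = FALSE)  # Pr(>|t|) column
cbind(t_val, p_val)                                                        # should match the summary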

Residuals

Let's recall the summary

In [17]:
summary(model2)
Out[17]:
Call:
lm(formula = Husband ~ Wife, data = heights)

Residuals:
     Min       1Q   Median       3Q      Max 
-16.7438  -4.2838  -0.1615   4.2562  17.7500 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 37.81005   11.93231   3.169  0.00207 ** 
Wife         0.83292    0.07269  11.458  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6.468 on 94 degrees of freedom
Multiple R-squared:  0.5828,	Adjusted R-squared:  0.5783 
F-statistic: 131.3 on 1 and 94 DF,  p-value: < 2.2e-16

The Residuals are $\hat{\epsilon}$, where $\hat{\epsilon} = y - \hat{y}$ and $\hat{y} = X\hat{\beta}$. Let's see this in a plot.

In [18]:
# fitted coefficients of model2: Husband = intercept + wife * Wife
intercept = coef(model2)[1]
wife = coef(model2)[2]

# fitted Husband height for every observed Wife height; grp pairs each
# observation with its estimate so geom_line can draw the residual segments
husbandEstimates = (heights %>% transmute(Wife = Wife, Husband = wife * heights$Wife + intercept, estimates = T)
                               %>% mutate(grp = row_number()))
dfheights = heights %>% transform(estimates = F) %>% mutate(grp = row_number())

heightsAndHusbandEstimates = rbind(dfheights, husbandEstimates)

ggplot() + 
   geom_point(data = heights, aes(Wife, Husband))+
   geom_line(data = heightsAndHusbandEstimates, aes(Wife, Husband, group = grp)) +
   geom_abline(intercept = intercept, slope = wife, colour='blue') +
   geom_point(data = husbandEstimates, aes(Wife, Husband), colour = "white")

For each observation $y$ (black dot), the white dot is the estimate $\hat{y}$ and the black segment represents the residual $\hat{\epsilon}$ for that observation. The blue line is the fitted line derived from $\hat{\beta}$.
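
As a quick numeric check (a sketch), resid() and fitted() return exactly these quantities, so the residuals are just the observations minus the fitted values:

head(fitted(model2))                      # y-hat, the white dots
head(resid(model2))                       # epsilon-hat, the black segments
head(heights$Husband - fitted(model2))    # the same thing, by hand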

At this point it's good to look at the other model, lm(Wife ~ Husband) (red line), to understand why the two models yield different lines.

In [19]:
intercept = (coef(model2))[1]
wife = coef(model2)[2]

model3 = lm(Wife ~ Husband, data = heights)

# express model3 (Wife = a + b * Husband) as a line in the (Wife, Husband) plot:
# Husband = -a/b + (1/b) * Wife
iIntercept = - coef(model3)[1] / coef(model3)[2]
iWife = 1 / coef(model3)[2]

ggplot(heights, aes(Wife, Husband)) + 
   geom_point() +
   geom_abline(intercept = intercept, slope = wife, colour='blue') +
   geom_abline(intercept = iIntercept, slope = iWife, colour='red')
In [20]:
hIntercept = coef(model3)[1]
husband = coef(model3)[2]

wifeEstimates = (heights %>% transmute(Husband = Husband, Wife = husband * heights$Husband + hIntercept, estimates = T)
                               %>% mutate(grp = row_number()))
dfheights = heights %>% transform(estimates = F) %>% mutate(grp = row_number())

heightsAndWifeEstimates = rbind(dfheights, wifeEstimates)

ggplot() + 
   geom_point(data = heights, aes(Wife, Husband)) +
   geom_line(data = heightsAndWifeEstimates, aes(Wife, Husband, group = grp)) +
   geom_abline(intercept = iIntercept, slope = iWife, colour='red') +
   geom_point(data = wifeEstimates, aes(Wife, Husband), colour = "white")

At this point you should ask yourself: hey, why don't we minimize the distance to a common plane (here, a line) instead of the difference in $y$ values?

Well... if you did, you'd find yourself performing Principal Component Analysis (PCA), and the resulting line would look like the one below.

In [21]:
# centre the data and take the eigendecomposition of its covariance matrix
heights.centered = cbind(Husband = (heights$Husband - mean(heights$Husband)), 
                         Wife = (heights$Wife - mean(heights$Wife)))
heights.cov = cov(heights.centered)
heights.eigen = eigen(heights.cov)

# slope of the first principal component in (Wife, Husband) coordinates
slope = heights.eigen$vectors[1,1] / heights.eigen$vectors[2,1]
# the PCA line passes through the centre of the data
intercept = mean(heights$Husband) - slope * mean(heights$Wife)

ggplot() +
  geom_point(data = heights, aes(Wife, Husband)) +
  geom_abline(intercept=intercept,slope=slope,col="red")
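
As a sanity check (a sketch; prcomp centres the data and performs the eigendecomposition for us), the same direction can be recovered with prcomp:

pc <- prcomp(heights[, c("Wife", "Husband")])          # principal components of the two heights
pc$rotation[, 1]                                       # loadings of the first component
pc$rotation["Husband", 1] / pc$rotation["Wife", 1]     # should match the slope above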

But that was a digression.

Residuals

Let's see once more our model summary

In [22]:
summary(model2)
Out[22]:
Call:
lm(formula = Husband ~ Wife, data = heights)

Residuals:
     Min       1Q   Median       3Q      Max 
-16.7438  -4.2838  -0.1615   4.2562  17.7500 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 37.81005   11.93231   3.169  0.00207 ** 
Wife         0.83292    0.07269  11.458  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6.468 on 94 degrees of freedom
Multiple R-squared:  0.5828,	Adjusted R-squared:  0.5783 
F-statistic: 131.3 on 1 and 94 DF,  p-value: < 2.2e-16

Recall that we assumed the errors have mean $0$ and share the same variance $\sigma^2$. The Residuals line is a summary of the realized residuals $\hat{\epsilon}$, and it should look roughly symmetric around $0$, which is exactly what we see. The Residual standard error below it is the estimate $\hat{\sigma}$ of $\sigma$, reported with its $n - p = 94$ degrees of freedom.
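
Both lines can be reproduced directly (a sketch): the Residuals line is a five-number summary of $\hat{\epsilon}$, and the Residual standard error is $\sqrt{RSS/(n-p)}$.

quantile(resid(model2))                          # Min, 1Q, Median, 3Q, Max of the residuals
sigma(model2)                                    # Residual standard error
sqrt(deviance(model2) / df.residual(model2))     # the same, computed by hand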

Multiple R-squared

Here we measure how well the model $M$ explains the data compared to a model $M_0$ containing only an intercept. More precisely, the value $R^2(M)$ is calculated as

$$ R^2(M) = 1 - \frac{RSS_M}{RSS_{M_0}} $$

When you compare models $M_1$ and $M_2$, the one with $R^2$ closer to $1$ fits the data better.

One should only use this comparison when the parameters of model $M_1$ are not a subset of the parameters of model $M_2$: for nested models the larger one always has at least as high an $R^2$, so the comparison is uninformative.
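
A quick check (a sketch; $M_0$ is the intercept-only model lm(Husband ~ 1)):

model0 <- lm(Husband ~ 1, data = heights)        # intercept-only model M_0
1 - deviance(model2) / deviance(model0)          # R^2(M) computed by hand
summary(model2)$r.squared                        # should match Multiple R-squared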

Adjusted R-squared

This measure is very similar, but it takes into account the number of parameters in the model. With $n$ observations and $p$ estimated coefficients (including the intercept),

$$ R^2_{adj}(M) = 1 - (1 - R^2(M))\,\frac{n-1}{n-p}, $$

which penalizes models that add parameters without improving the fit.
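
A sketch verifying this against the summary:

n <- nobs(model2)                                # number of observations
p <- length(coef(model2))                        # estimated coefficients, incl. intercept
r2 <- summary(model2)$r.squared

1 - (1 - r2) * (n - 1) / (n - p)                 # should match Adjusted R-squared
summary(model2)$adj.r.squared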

F-statistic

For a model with a single predictor, as here, this is essentially the same test as the one for the Wife coefficient ($F = t^2$, and the p-values agree). In general, the F-statistic tests the null hypothesis that all parameters except the intercept are equal to $0$, i.e. it compares our model against the intercept-only model $M_0$.
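
A sketch pulling the F statistic out of the summary and recomputing its p-value, plus the equivalent explicit comparison against the intercept-only model:

fstat <- summary(model2)$fstatistic              # named vector: value, numdf, dendf
fstat
pf(fstat["value"], fstat["numdf"], fstat["dendf"], lower.tail = FALSE)   # the p-value

anova(lm(Husband ~ 1, data = heights), model2)   # the same F test, done explicitly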