Throughout this handout, we'll make use of the heights data, which compares the heights of wives and husbands.
Let's try to discover whether there is a linear relationship between a wife's and her husband's height.
library(lattice)
library(PBImisc)   # provides the heights data set
library(dplyr)
library(ggplot2)
model2 <- lm(Husband ~ Wife, data = heights)   # regress husband's height on wife's height
summary(model2)
OK, what does it all mean? Let's go through it line by line.
Recall that Linear Regression is all about explaining $Y$ with $\beta$ like this:
$$ Y = X\beta + \epsilon = \sum_{i=1}^p X_i \beta_i + \epsilon $$
Note that if we have $n$ examples and $p$ explanatory variables as above, then $Y, X_i, \epsilon \in \mathbb{R}^n$ and $X \in \mathbb{R}^{n\times p}$. It's important to realize that $X$ and $\epsilon$ are random variables.
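For concreteness, here is a minimal sketch of the $y$ and $X$ that stand behind model2 (X and y are just illustrative names; model.matrix returns the design matrix that lm builds):
y = heights$Husband                               # the response, one entry per couple
X = model.matrix(Husband ~ Wife, data = heights)  # the design matrix
head(X)                                           # note the column of ones (more on the intercept below)
dim(X)                                            # n rows, p columns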
How do we look for $\beta$? Well, we find the $\hat{\beta}$ that minimizes the Mean Squared Error (MSE), or equivalently the RSS (Residual Sum of Squares). The bottom line is that we minimize the expression
$$ RSS(\tilde{\beta}) = \sum_{i=1}^n (y_i - \tilde{y}_i)^2 \in \mathbb{R}, $$
where $\tilde{y}_i = (X\tilde{\beta})_i$.
$\hat{\beta}$ is our estimator, and the Gauss-Markov theorem states that if the $\epsilon_i$ are uncorrelated, each with mean $0$ and variance $Var(\epsilon_i) = \sigma^2$, and $X$ has full column rank (so that $X^TX$ is invertible), then this estimator is the best linear unbiased one. We are not going to dive into this topic.
So, the Estimate column of the Coefficients table is $\hat\beta$. This is a crucial parameter that allows us to make predictions of $Y$ given new values of $X$. In our case, we can easily plot it:
xyplot(Husband ~ Wife, data = heights, type = c("p", "r"))   # points plus the regression line
If you wonder, as I did, why an Intercept term showed up (so the line does not cross $(0,0)$), that's because the lm function always adds an intercept term unless you explicitly remove it. So lm(Husband ~ Wife) is in fact lm(Husband ~ Wife + 1), unless you explicitly write lm(Husband ~ Wife + 0):
summary(lm(Husband~Wife+0, data = heights))$coefficients
Recall that our goal is to minimize
$$ RSS(\tilde{\beta}) = \sum_{i=1}^n (y_i - \tilde{y}_i)^2. $$
Differentiating $RSS$ with respect to $\tilde{\beta}$ and setting the derivative to $0$, we get the equation
$$ \hat{\beta} = (X^TX)^{-1}X^Ty. $$
Note at this point that $\hat\beta$ is a random variable, as it is a transformation of the random variables $X$ and $y$. The Estimate column is in fact a realization of $\hat{\beta}$ given the observed data. It makes sense, however, to speak about the distribution of $\hat{\beta}$ and about its Std. Error.
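As a sanity check, we can compute $\hat{\beta}$ directly from this formula and compare it with what lm reported; a minimal sketch (X, y and betaHat are just illustrative names):
X = model.matrix(model2)                                # the design matrix lm used
y = heights$Husband
betaHat = solve(t(X) %*% X) %*% t(X) %*% y              # (X^T X)^{-1} X^T y
cbind(byHand = as.vector(betaHat), fromLm = coef(model2))   # the two columns should agree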
Having the distribution, we are finally ready to find the infamous p-value (the $Pr(>|t|)$ column). More precisely, row $i$ tests the null hypothesis $\beta_i = 0$. A low p-value in a row suggests that we can reject this hypothesis and keep the $i$-th parameter in the model.
The statistic used for this test is the t value from the previous column; I won't derive it here.
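A minimal sketch of how the columns of the coefficient table fit together (the t value is the estimate divided by its standard error, and the p-value comes from a t distribution with the residual degrees of freedom):
cf = summary(model2)$coefficients
tValue = cf[, "Estimate"] / cf[, "Std. Error"]                        # reproduces the "t value" column
pValue = 2 * pt(abs(tValue), df = df.residual(model2), lower.tail = FALSE)
cbind(tValue, pValue, cf[, c("t value", "Pr(>|t|)")])                 # should agree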
Let's recall the summary
summary(model2)
Residuals is $\hat{\epsilon}$, where $\hat{\epsilon} = y - \hat{y}$ and $\hat{y} = X\hat{\beta}$. Let's see this in a plot:
intercept = coef(model2)[1]
wife = coef(model2)[2]
# Fitted values y-hat, kept in the same (Wife, Husband) columns as heights
husbandEstimates = heights %>%
  transmute(Wife = Wife, Husband = intercept + wife * Wife, estimates = TRUE) %>%
  mutate(grp = row_number())
# Original observations, flagged and numbered so each (observed, fitted) pair shares a grp
dfheights = heights %>% mutate(estimates = FALSE, grp = row_number())
heightsAndHusbandEstimates = rbind(dfheights, husbandEstimates)
ggplot() +
  geom_point(data = heights, aes(Wife, Husband)) +
  geom_line(data = heightsAndHusbandEstimates, aes(Wife, Husband, group = grp)) +  # residual segments
  geom_abline(intercept = intercept, slope = wife, colour = "blue") +              # regression line
  geom_point(data = husbandEstimates, aes(Wife, Husband), colour = "white")        # fitted values
For each observation $y$ (black dot), the white dot is the estimate $\hat{y}$ and the black segment represents the residual $\hat{\epsilon}$ for that observation. The blue line is derived from $\hat{\beta}$.
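As a quick sanity check that the picture matches the definitions, fitted() and residuals() return exactly $\hat{y}$ and $\hat{\epsilon}$ for model2:
head(fitted(model2))      # y-hat
head(residuals(model2))   # epsilon-hat = y - y-hat
all.equal(unname(residuals(model2)), heights$Husband - unname(fitted(model2)))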
At this point it's good to look at the other model, lm(Wife ~ Husband) (red line), to understand why the two models yield different solutions.
intercept = coef(model2)[1]
wife = coef(model2)[2]
model3 = lm(Wife ~ Husband, data = heights)
# Invert model3 (Wife = a + b * Husband) so it can be drawn in the (Wife, Husband) axes:
# Husband = -a/b + (1/b) * Wife
iIntercept = -coef(model3)[1] / coef(model3)[2]
iWife = 1 / coef(model3)[2]
ggplot(heights, aes(Wife, Husband)) +
  geom_point() +
  geom_abline(intercept = intercept, slope = wife, colour = "blue") +   # Husband ~ Wife
  geom_abline(intercept = iIntercept, slope = iWife, colour = "red")    # Wife ~ Husband
hIntercept = coef(model3)[1]
husband = coef(model3)[2]
# Fitted values of Wife from model3, kept in the same (Wife, Husband) columns as heights
wifeEstimates = heights %>%
  transmute(Husband = Husband, Wife = hIntercept + husband * Husband, estimates = TRUE) %>%
  mutate(grp = row_number())
dfheights = heights %>% mutate(estimates = FALSE, grp = row_number())
heightsAndWifeEstimates = rbind(dfheights, wifeEstimates)
ggplot() +
  geom_point(data = heights, aes(Wife, Husband)) +
  geom_line(data = heightsAndWifeEstimates, aes(Wife, Husband, group = grp)) +  # horizontal residual segments
  geom_abline(intercept = iIntercept, slope = iWife, colour = "red") +          # model3 drawn in these axes
  geom_point(data = wifeEstimates, aes(Wife, Husband), colour = "white")        # fitted values
At this point you should ask yourself: hey, why don't we minimize the distance to a common plane (here, a line) instead of the difference in $y$ values?
Well... if you did that, you'd find yourself performing Principal Component Analysis (PCA), and the resulting line would look like the one below.
# Center the data and take the leading eigenvector of the covariance matrix
heights.centered = cbind(Husband = heights$Husband - mean(heights$Husband),
                         Wife = heights$Wife - mean(heights$Wife))
heights.cov = cov(heights.centered)
heights.eigen = eigen(heights.cov)
# Slope of the first principal direction, expressed in the (Wife, Husband) axes
slope = heights.eigen$vectors[1, 1] / heights.eigen$vectors[2, 1]
# The PCA line passes through the point of means
intercept = mean(heights$Husband) - slope * mean(heights$Wife)
ggplot() +
  geom_point(data = heights, aes(Wife, Husband)) +
  geom_abline(intercept = intercept, slope = slope, col = "red")
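If you don't want to do the eigen decomposition by hand, prcomp should give the same direction; a minimal sketch (pcaSlope and pcaIntercept are just illustrative names):
pca = prcomp(heights[, c("Wife", "Husband")])     # prcomp centers the data for us
pcaSlope = pca$rotation["Husband", "PC1"] / pca$rotation["Wife", "PC1"]
pcaIntercept = mean(heights$Husband) - pcaSlope * mean(heights$Wife)
c(slope = pcaSlope, intercept = pcaIntercept)     # should match the values above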
But that was a digression.
Let's look at our model summary once more:
summary(model2)
Recall that we assumed the residuals all share the same variance $\sigma^2$. So what you see in the Residuals line is exactly what you expect: a five-number summary (min, quartiles, max) of $\hat{\epsilon}$.
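You can reproduce that line yourself; a quick sketch:
quantile(residuals(model2))   # the five-number summary of epsilon-hat shown by summary()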
The Multiple R-squared entry measures how well the model $M$ explains the data compared to a model $M_0$ with the intercept only. More precisely, the value $R^2(M)$ is calculated as
$$ R^2(M) = 1 - \frac{RSS_M}{RSS_{M_0}}. $$
When you compare model $M_1$ with $M_2$, the one with $R^2$ closer to $1$ fits the data better.
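A minimal sketch of computing this by hand for model2, fitting $M_0$ as an intercept-only model (model0 is just an illustrative name):
model0 = lm(Husband ~ 1, data = heights)   # M_0: intercept only
RSS  = sum(residuals(model2)^2)
RSS0 = sum(residuals(model0)^2)
c(byHand = 1 - RSS / RSS0, fromSummary = summary(model2)$r.squared)   # should agree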
One should consider this comparison only if the parameters of model $M_1$ are not a subset of the parameters of model $M_2$; otherwise the larger model always has an $R^2$ at least as high.
Adjusted R-squared is very similar, but we take into account the number $p$ of parameters in the model to calculate $R^2_{adj}(M)$.
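One common formulation (and, I believe, the one behind R's Adjusted R-squared, with $p$ counting the intercept) is
$$ R^2_{adj}(M) = 1 - \bigl(1 - R^2(M)\bigr)\,\frac{n-1}{n-p}. $$
A minimal sketch to check this against the summary:
n = nrow(heights)
p = length(coef(model2))                   # number of estimated coefficients, intercept included
R2 = summary(model2)$r.squared
c(byHand = 1 - (1 - R2) * (n - 1) / (n - p),
  fromSummary = summary(model2)$adj.r.squared)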
The F-statistic is the same kind of statistic as the one used for testing the significance of the Coefficients (with a single predictor, $F$ is just $t^2$). Here the null hypothesis is the model with all parameters (except the intercept) equal to $0$.
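Equivalently, this F test compares model2 against the intercept-only model; a minimal sketch (model0 is just an illustrative name):
model0 = lm(Husband ~ 1, data = heights)   # null model: intercept only
anova(model0, model2)                      # F test comparing the two nested models
summary(model2)$fstatistic                 # the same F value and its degrees of freedom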