Linear regression


Linear regression lets you summarize a dataset by fitting a straight line through it. It is often used to describe the relationship between a continuous response variable and a continuous predictor (independent) variable. Examples are numerous; the relationship between bodyweight and height is one of them.

Let’s use the following dataset as an example:

bodyweight <- c(70, 75, 72, 58, 80, 80, 48, 56, 103, 51)
size <- c(177, 178, 167, 153, 174, 177, 152, 134, 191, 136)
dataset.df <- data.frame(bodyweight, size) 

 

Everything starts with a plot. A scatter plot of the dataset is usually a good beginning.

plot(bodyweight~size, ylab="bodyweight (kg)", xlab="size (cm)", col="blue", pch = 10)

[Figure: scatter plot of bodyweight (kg) against size (cm)]

 

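Then we fit a linear model with the function lm(), which we have already encountered when performing analysis of variance (ANOVA). In the formula bodyweight~size, the response variable is on the left of the tilde and the predictor is on the right.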

lm(bodyweight~size)

[Screenshot: output of lm(bodyweight~size), listing the intercept and the slope]

Note that this output contains everything you need to draw the fitted line: the intercept (-56.2716), followed by the slope (0.7661). Let’s pass these values to the function abline() with the syntax abline(intercept, slope), which draws the regression line on the existing plot:

abline(-56.2716, 0.7661)

[Figure: scatter plot with the regression line added]
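Rather than copying the coefficients by hand, they can be read from the fitted model with coef(). The sketch below also recomputes them from the closed-form least-squares formulas as a cross-check. The names fit, slope and intercept are only illustrative and are not used elsewhere in this post.

```r
bodyweight <- c(70, 75, 72, 58, 80, 80, 48, 56, 103, 51)
size <- c(177, 178, 167, 153, 174, 177, 152, 134, 191, 136)

# Fit the model and read the coefficients programmatically
fit <- lm(bodyweight ~ size)
coef(fit)

# Cross-check with the closed-form least-squares estimates:
# slope = cov(x, y) / var(x), intercept = mean(y) - slope * mean(x)
slope <- cov(size, bodyweight) / var(size)
intercept <- mean(bodyweight) - slope * mean(size)
c(intercept = intercept, slope = slope)

# Draw the line from the stored coefficients instead of typed-in numbers
plot(bodyweight ~ size)
abline(a = coef(fit)[["(Intercept)"]], b = coef(fit)[["size"]])
```

Both approaches should agree with the rounded values printed by lm(), and using coef() avoids the small rounding error that comes with retyping the numbers.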

Note also that we can pass the result of lm() directly to the function abline() to obtain the exact same graph:

abline(lm(bodyweight~size))
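The same fit can also be made from the data frame created at the start, using the data argument of lm(). The sketch below does that and then uses predict() to get the expected bodyweight for a new size. The name fit.df and the size of 170 cm are hypothetical, chosen only for illustration.

```r
bodyweight <- c(70, 75, 72, 58, 80, 80, 48, 56, 103, 51)
size <- c(177, 178, 167, 153, 174, 177, 152, 134, 191, 136)
dataset.df <- data.frame(bodyweight, size)

# Fit from the data frame instead of the free-standing vectors
fit.df <- lm(bodyweight ~ size, data = dataset.df)

# Redraw the scatter plot and add the fitted line
plot(bodyweight ~ size, data = dataset.df)
abline(fit.df)

# Expected bodyweight for a (hypothetical) size of 170 cm
new.size <- data.frame(size = 170)
predict(fit.df, newdata = new.size)
```

Referring to the columns through data = keeps the model tied to one data frame, which helps when several datasets live in the same session.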

 

At any time, it is of course possible to store the result of lm() in an object for later use. Here we’ll simply call it lin.mod. Using the function summary(), more information about the model may be obtained:

lin.mod <- lm(bodyweight~size)
summary(lin.mod)

[Screenshot: output of summary(lin.mod)]

This output provides several useful values: the minimum, quartiles, median and maximum of the residuals at the top, the coefficient estimates in the middle, and the R-squared and adjusted R-squared (R²) at the bottom, which describe how much of the variation in the response the model explains (NB: be careful when interpreting R-squared, see this blogpost for some info).
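The values printed by summary() can also be extracted individually, which is handy when they are needed in later calculations. The following is a minimal sketch; the name mod.sum is an assumption made for this example.

```r
bodyweight <- c(70, 75, 72, 58, 80, 80, 48, 56, 103, 51)
size <- c(177, 178, 167, 153, 174, 177, 152, 134, 191, 136)

lin.mod <- lm(bodyweight ~ size)
mod.sum <- summary(lin.mod)

# Minimum, quartiles, median and maximum of the residuals
quantile(residuals(lin.mod))

# R-squared and adjusted R-squared
mod.sum$r.squared
mod.sum$adj.r.squared

# Coefficient table: estimates, standard errors, t values, p-values
mod.sum$coefficients

# With a single predictor, R-squared equals the squared correlation
cor(size, bodyweight)^2
```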

Finally, it is good practice to check the model by plotting the fitted model object itself. This produces a few diagnostic plots that show how well your model fits the actual data:

plot(lin.mod)

[Screenshots: diagnostic plots produced by plot(lin.mod)]
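By default, plot() on a model object shows the diagnostic plots one after another. A common alternative, sketched here, is to arrange them in a single 2 x 2 panel with par(mfrow = ...). In current R versions the four default plots are residuals vs fitted, normal Q-Q, scale-location and residuals vs leverage. The name old.par is only illustrative.

```r
bodyweight <- c(70, 75, 72, 58, 80, 80, 48, 56, 103, 51)
size <- c(177, 178, 167, 153, 174, 177, 152, 134, 191, 136)

lin.mod <- lm(bodyweight ~ size)

# Arrange the default diagnostic plots in one 2 x 2 panel
old.par <- par(mfrow = c(2, 2))
plot(lin.mod)
par(old.par)  # restore the previous plotting layout

# The first plot is built from these two vectors
fitted(lin.mod)
residuals(lin.mod)
```

Roughly speaking, residuals scattered evenly around zero with no clear pattern suggest that a straight line is a reasonable description of the data.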