Comparing two variables – Pearson’s product-moment correlation


The Pearson product-moment correlation (often called Pearson’s r) is a parametric measure of the strength and direction of the linear relationship between two variables. In brief, imagine a line of best fit drawn through the data points; the coefficient r, which ranges from -1 to +1, tells you how closely the points cluster around that line.
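For reference, the coefficient can be written out explicitly. The notation below is introduced here for illustration and is not part of the original text: $x_i$ and $y_i$ are the paired observations, $\bar{x}$ and $\bar{y}$ their means, and $n$ the number of pairs.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

r equals +1 or -1 when all points fall exactly on a rising or falling straight line, and is close to 0 when there is no linear trend.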

This test comes with assumptions, which must be checked before going any further:

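  • both variables are normally distributed, since this is a parametric test (run the Shapiro-Wilk test),
  • the variables are continuous,
  • the observations are paired (each individual contributes one value of each variable),
  • there are no marked outliers,
  • the variances of these variables are “relatively” similar (run Fisher’s F-test); strictly speaking, the underlying assumption is homoscedasticity, i.e. the spread of one variable stays roughly constant across the range of the other.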

Let’s see this with an example. Here, we consider the weight and height of 16 individuals. Both weight and height are continuous variables, arranged in pairs (one weight entry and one height entry per individual).

We need to check that both variables are normally distributed:

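# Paired measurements for the 16 individuals
weight <- c(84, 64, 73, 78, 70, 79, 74, 68, 73, 63, 62, 69, 54, 64, 66, 70)
height <- c(183, 174, 179, 174, 164, 184, 179, 154, 167, 170, 168, 164, 166, 163, 154, 174)
# Two histograms side by side, on a density scale
par(mfrow = c(1, 2))
hist(weight, col = "red", prob = TRUE)
hist(height, col = "green", prob = TRUE)
# Shapiro-Wilk normality test for each variable
shapiro.test(weight)
shapiro.test(height)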

Judging from the histograms and the Shapiro-Wilk tests, neither variable departs significantly from a normal distribution. Let’s draw the boxplots and check for similar variance:

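# Two boxplots side by side to look for outliers
par(mfrow = c(1, 2))
boxplot(weight, main = "weight")
boxplot(height, main = "height")
# Fisher's F-test comparing the two variances
var.test(weight, height)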

According to Fisher’s F-test, the variances are not significantly different, and no outliers show up on the boxplots. We can proceed…
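As an optional numerical complement to reading the boxplots by eye, the points that boxplot() would draw as outliers can be listed directly. This is only a sketch: the variable names weight_outliers and height_outliers are introduced here for illustration, and the data are repeated so that the snippet runs on its own.

```r
weight <- c(84, 64, 73, 78, 70, 79, 74, 68, 73, 63, 62, 69, 54, 64, 66, 70)
height <- c(183, 174, 179, 174, 164, 184, 179, 154, 167, 170, 168, 164, 166, 163, 154, 174)

# boxplot.stats() uses the same rule as boxplot(): values lying more than
# 1.5 times the box length beyond the hinges are returned in $out
weight_outliers <- boxplot.stats(weight)$out
height_outliers <- boxplot.stats(height)$out

# An empty vector means no point falls outside the whiskers
weight_outliers
height_outliers
```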
We may now visualise these two variables in a scatter plot to which we add a line of best fit:

# Scatter plot of weight against height, with the least-squares line on top
plot(weight ~ height)
abline(lm(weight ~ height))
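The line of best fit and Pearson’s r are closely linked. In a simple linear regression, the squared correlation equals the model’s R-squared, and the slope equals r rescaled by the ratio of the standard deviations. The following sketch checks both identities on the example data; the names fit, r, r_squared_lm and slope are chosen here for illustration.

```r
weight <- c(84, 64, 73, 78, 70, 79, 74, 68, 73, 63, 62, 69, 54, 64, 66, 70)
height <- c(183, 174, 179, 174, 164, 184, 179, 154, 167, 170, 168, 164, 166, 163, 154, 174)

fit <- lm(weight ~ height)   # same model as the one passed to abline()
r   <- cor(weight, height)   # Pearson's r (the default method of cor())

# Identity 1: r squared equals the R-squared of the simple regression
r_squared_lm <- summary(fit)$r.squared
all.equal(r^2, r_squared_lm)

# Identity 2: slope of the fitted line = r * sd(response) / sd(predictor)
slope <- unname(coef(fit)["height"])
all.equal(slope, r * sd(weight) / sd(height))
```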

Now that the assumptions have been checked and we have a rough idea of the linear relationship, let’s compute Pearson’s product-moment correlation. The function is cor.test(), which is also the function used for Spearman’s rho and Kendall’s tau. The argument method= defines which correlation coefficient is tested (choose between "pearson", "spearman" and "kendall"; if method is omitted, the default is Pearson’s r).
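To make the role of the method argument concrete, here is a small sketch that runs the three variants on the example data and pulls the coefficient out of each result. The object names (res_default, res_pearson, and so on) are illustrative only. Setting exact = FALSE for the rank-based tests is a choice made here because the data contain tied values.

```r
weight <- c(84, 64, 73, 78, 70, 79, 74, 68, 73, 63, 62, 69, 54, 64, 66, 70)
height <- c(183, 174, 179, 174, 164, 184, 179, 154, 167, 170, 168, 164, 166, 163, 154, 174)

# Pearson is the default, so these two calls give the same result
res_default <- cor.test(height, weight)
res_pearson <- cor.test(height, weight, method = "pearson")

# Rank-based alternatives; exact = FALSE asks for the approximate p-value,
# because exact p-values are not computed when the data contain ties
res_spearman <- cor.test(height, weight, method = "spearman", exact = FALSE)
res_kendall  <- cor.test(height, weight, method = "kendall",  exact = FALSE)

# Each result is a list of class "htest"; $estimate holds the coefficient
res_pearson$estimate    # named "cor"
res_spearman$estimate   # named "rho"
res_kendall$estimate    # named "tau"
```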

In this test, the null hypothesis H0 states that there is no linear relationship between the variables (the true correlation is 0).

cor.test(height, weight, method="pearson")

The p-value is below 0.05, so the null hypothesis of no relationship is rejected in favour of the alternative hypothesis: there is a significant linear relationship between height and weight.
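For readers who want to see where the p-value comes from, the sketch below recomputes the test by hand. For Pearson’s r, cor.test() relies on a t statistic with n - 2 degrees of freedom, t = r·sqrt(n - 2)/sqrt(1 - r²). The names res, t_manual and p_manual are introduced here for illustration.

```r
weight <- c(84, 64, 73, 78, 70, 79, 74, 68, 73, 63, 62, 69, 54, 64, 66, 70)
height <- c(183, 174, 179, 174, 164, 184, 179, 154, 167, 170, 168, 164, 166, 163, 154, 174)

res <- cor.test(height, weight, method = "pearson")

r <- unname(res$estimate)   # Pearson's r
n <- length(height)         # number of pairs

# t statistic and two-sided p-value computed by hand
t_manual <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_manual <- 2 * pt(-abs(t_manual), df = n - 2)

# Both should agree with what cor.test() reports
all.equal(t_manual, unname(res$statistic))
all.equal(p_manual, res$p.value)

# 95% confidence interval for r (provided for the Pearson method)
res$conf.int
```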