Supplement 2.1: Point Clouds and Linear Regression Lines (3/3)
The correlation coefficient
We consider a number of data points $P_i$, with $i = 1, ..., n$. The points are defined by a set of n pairs of features $x_i$ and $y_i$. As already shown, one can calculate the centroid $(\bar{x}, \bar{y})$ of the point cloud and the best-fit straight line. We are interested in a quantity that measures how far the points deviate from the best-fit line. As an example we focus on point $P_5$ in the graph below, and on its position as predicted by the best-fit line.
The position of $P_5$ relative to the centroid is now explained in two steps, as illustrated in the graph below:
- the location of the point $\hat{P}_5$ with the same $x = x_5$, predicted by the best-fit line: $\hat{y}_5 - \bar{y}$ (deviation predicted by the best-fit line)
- the deviation of $P_5$ with respect to $\hat{P}_5$: $y_5 - \hat{y}_5$ (unexplained deviation)
This concept can be applied to all points of the point cloud in order to describe their deviations from the best-fit line.
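The two-step decomposition can be checked numerically. The following is a minimal sketch, not part of the original page: the sample data are hypothetical, and the slope is assumed to be the usual least-squares slope, which is taken here to match the one given on the preceding page.

```python
import numpy as np

# Hypothetical sample data (not from the original text).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.5])

# Centroid of the point cloud.
x_bar, y_bar = x.mean(), y.mean()

# Assumed least-squares slope a and intercept b of the line y = a*x + b.
a = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
b = y_bar - a * x_bar

y_hat = a * x + b            # positions predicted by the best-fit line
explained = y_hat - y_bar    # deviation predicted by the best-fit line
unexplained = y - y_hat      # unexplained deviation
total = y - y_bar            # total deviation from the centroid

# For every point, the two steps add up to the total deviation.
assert np.allclose(explained + unexplained, total)

for i in range(len(x)):
    print(f"P{i + 1}: explained={explained[i]:+.3f}, "
          f"unexplained={unexplained[i]:+.3f}")
```

The closer the points lie to the best-fit line, the smaller the unexplained deviations become in comparison with the explained ones.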
The linear correlation coefficient
The equation for the correlation coefficient given in chapter 2 on the page Linear Regression Analysis (3/3) differs from the equation given in the left column. The chapter 2 equation holds for best-fit straight lines only (hence: linear correlation coefficient), while the equation given in the left column is also valid for curved best-fit lines.
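The difference can be illustrated with points that lie on a curve. The sketch below assumes that the left-column equation is the usual ratio of explained to total variation, $r = \sqrt{\sum (\hat{y}_i - \bar{y})^2 / \sum (y_i - \bar{y})^2}$. The data, the helper name `general_r` and the use of `np.polyfit` for the fits are illustrative choices, not taken from the original page.

```python
import numpy as np

# Hypothetical data lying exactly on a parabola (not from the original text).
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = x ** 2


def general_r(y, y_hat):
    """sqrt(explained variation / total variation); assumed left-column form."""
    y_bar = y.mean()
    return np.sqrt(np.sum((y_hat - y_bar) ** 2) / np.sum((y - y_bar) ** 2))


# Least-squares straight line (degree 1) and curved line (degree 2).
line = np.polyval(np.polyfit(x, y, 1), x)
curve = np.polyval(np.polyfit(x, y, 2), x)

r_line = general_r(y, line)
r_curve = general_r(y, curve)

print(f"straight best-fit line: r = {r_line:.3f}")
print(f"curved best-fit line:   r = {r_curve:.3f}")
```

With these symmetric data the straight line explains essentially none of the variation, while the parabola explains all of it, so only the general form reflects the curved relationship.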
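The equation for the linear correlation coefficient can be obtained by squaring the equation given in the left column, and by replacing $\hat{y}_i - \bar{y}$ with $a(x_i - \bar{x})$, which is possible because the best-fit line passes through the centroid:
$$r^2 = \frac{a^2 \sum_{i=1}^{n} (x_i - \bar{x})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$
Using the slope a given on the preceding page yields:
$$r^2 = \left( \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \right)^2 \cdot \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$
This can be simplified by cancelling one factor $\sum_{i=1}^{n} (x_i - \bar{x})^2$:
$$r^2 = \frac{\left( \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) \right)^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}$$
It follows by taking the square root:
$$|r| = \frac{\left| \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) \right|}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}$$
Omitting the absolute value bars,
$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}$$
allows the correlation coefficient to take on negative values: r then has the same sign as the slope a, because both share the same numerator and have positive denominators.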
An advantage of this equation is that it involves only the coordinates of the centroid and of the data points. The equation given in the left column also requires the coordinate values of the best-fit line.
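The relationship between the two forms can be sketched numerically. The example below uses hypothetical data with a falling trend and assumes, as above, that the left-column equation is the square root of explained over total variation. The variable names are illustrative only.

```python
import numpy as np

# Hypothetical sample data with a falling trend (not from the original text).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([9.8, 8.1, 6.2, 3.9, 2.0])

x_bar, y_bar = x.mean(), y.mean()
dx, dy = x - x_bar, y - y_bar

# Linear correlation coefficient: needs the centroid and the data points only.
r_linear = np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))

# Assumed general form: additionally needs the best-fit line values y_hat.
a = np.sum(dx * dy) / np.sum(dx ** 2)   # least-squares slope
y_hat = y_bar + a * dx                  # line through the centroid
r_general = np.sqrt(np.sum((y_hat - y_bar) ** 2) / np.sum(dy ** 2))

print(f"slope a             = {a:.3f}")
print(f"r (linear formula)  = {r_linear:.4f}")
print(f"r (general formula) = {r_general:.4f}")
```

Both values agree in magnitude, but only the linear formula carries the sign of the slope. In this sketch the result of the linear formula also matches `np.corrcoef(x, y)[0, 1]`.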