Supplement 2.1: Point Clouds and Linear Regression Lines (1/3)

A point cloud is considered, consisting of n data points P(x_i, y_i), i = 1, ..., n, where n is a natural number (left graph below). We search for a regression line which best approximates the distribution of the data points. For this, the equation of a straight line, shown as a blue broken line in the diagrams,

f(x)=ax+b

is investigated, and in particular the points P(x_i, f(x_i)) on that line having the same x coordinates as the data points (right graph below).

The difference between a data point and the point on the line having the same x value is y_i - f(x_i). These differences, shown in the graph below for the points P(x_1, y_1) and P(x_1, f(x_1)) as an example, provide a means to find the best-fit line representing the data points. This best-fit line should indeed have minimum differences to the data points!

Alternatively, one could also investigate the differences between data points and points on the line having the same y values. This would lead to a regression analysis with respect to y.

In the following equations, we investigate the regression analysis with respect to x.

It would not be useful simply to add up the differences and to find a minimum by varying the line parameters, since the differences are positive and negative. Instead, we get rid of the sign of the differences by squaring them, and we calculate the mean square deviation

S = \frac{1}{n} \sum_{i=1}^{n} \bigl( y_i - f(x_i) \bigr)^2 ,

here:  S = \frac{ \bigl( y_1 - f(x_1) \bigr)^2 + \ldots + \bigl( y_6 - f(x_6) \bigr)^2 }{6}

Substituting f(x_i) with f(x_i) = a x_i + b yields:

S(a, b) = \frac{1}{n} \sum_{i=1}^{n} \bigl( y_i - (a x_i + b) \bigr)^2

where a is the slope and b is the intercept of the straight line.
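The mean square deviation can be computed directly. Below is a minimal Python sketch; the data values are hypothetical, chosen only for illustration and not taken from the diagrams:

```python
# Hypothetical data points P(x_i, y_i); values chosen for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.2, 1.9, 3.2, 3.8, 5.1, 6.1]

def S(a, b):
    """Mean square deviation S(a, b) = (1/n) * sum of (y_i - (a*x_i + b))^2."""
    n = len(xs)
    return sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys)) / n
```

A line close to the data points gives a small S; for these values, S(1, 0) is far smaller than S(0, 0), since the line f(x) = x runs close to the points while f(x) = 0 does not.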

The centroid of data points

To find the best-fit straight line, the parameters a and b of the straight line shall be chosen such that the mean square deviation

S(a, b) = \frac{1}{n} \sum_{i=1}^{n} \bigl( y_i - (a x_i + b) \bigr)^2

is a minimum. For this, we calculate the derivative of S and set it to zero.

However, S depends on two variables, a and b. Therefore we calculate the derivative in two steps. In the first step we differentiate S with respect to b, keeping a constant. In the second step we differentiate S with respect to a, keeping b constant. (In advanced calculus, S is called a function of two variables, and we calculate the partial derivatives of S with respect to a and b.)

Differentiation with respect to b yields:

S'(b) = -\frac{2}{n} \sum_{i=1}^{n} \bigl( y_i - (a x_i + b) \bigr)

S'(b) = -\frac{2}{n} \left( \sum_{i=1}^{n} y_i - a \sum_{i=1}^{n} x_i - \sum_{i=1}^{n} b \right)

With  \frac{1}{n} \sum_{i=1}^{n} y_i = \bar{y}, \qquad \frac{1}{n} \sum_{i=1}^{n} x_i = \bar{x}, \qquad \frac{1}{n} \sum_{i=1}^{n} b = b,

where x̄ and ȳ are the arithmetic means of the x_i and y_i coordinates, it follows:

S'(b) = -2\bar{y} + 2a\bar{x} + 2b
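This expression for S'(b) can be checked numerically against a central finite difference of S. A short Python sketch; the data values and the test values for a and b are hypothetical, chosen only for the check:

```python
# Hypothetical data; any values work for this check.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.2, 1.9, 3.2, 3.8, 5.1, 6.1]
n = len(xs)
x_bar = sum(xs) / n  # arithmetic mean of the x_i
y_bar = sum(ys) / n  # arithmetic mean of the y_i

def S(a, b):
    """Mean square deviation of the data from the line f(x) = a*x + b."""
    return sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys)) / n

def dS_db(a, b, h=1e-6):
    """Central finite-difference approximation of the derivative of S wrt b."""
    return (S(a, b + h) - S(a, b - h)) / (2 * h)

a, b = 0.7, 0.3                                  # arbitrary test values
analytic = -2 * y_bar + 2 * a * x_bar + 2 * b    # S'(b) from the derivation
```

Since S is quadratic in b, the central difference and the analytic derivative agree up to rounding error for any choice of a and b.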

The minimum of S(b) follows from setting S'(b) = 0, and one obtains:

\bar{y} = a\bar{x} + b

This looks like the equation of a straight line. However, x̄ and ȳ are point coordinates (the mean values). The point P(x̄, ȳ) is the arithmetic mean of the points P(x_i, y_i), called their centroid.

Centroid P(x̄, ȳ) of the data points.
The centroid P(x̄, ȳ) is a point lying on the regression line of the point cloud P(x_1, y_1), P(x_2, y_2), ..., P(x_n, y_n).
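This property can be illustrated numerically: for any fixed slope a, the intercept b = ȳ - a x̄ obtained from S'(b) = 0 gives a line through the centroid, and no other intercept yields a smaller S for that slope. A small Python sketch with hypothetical data (the values of xs, ys, and a are illustrative assumptions):

```python
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]   # hypothetical data, for illustration
ys = [1.2, 1.9, 3.2, 3.8, 5.1, 6.1]
n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

def S(a, b):
    """Mean square deviation of the data from the line f(x) = a*x + b."""
    return sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys)) / n

a = 0.95                       # any fixed slope (arbitrary choice)
b_opt = y_bar - a * x_bar      # intercept from y_bar = a*x_bar + b

# The resulting line passes through the centroid P(x_bar, y_bar) ...
through_centroid = abs((a * x_bar + b_opt) - y_bar) < 1e-12
# ... and among all intercepts b, b_opt gives the smallest S for this slope:
no_better = all(S(a, b_opt) <= S(a, b_opt + d) for d in (-0.5, -0.1, 0.1, 0.5))
```

In particular the best-fit line itself, whose slope a will be determined in the next step of the derivation, passes through the centroid.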