

General solution of the linear regression problem

Problem: Let $\{(x_1,y_1), \ldots, (x_n,y_n) \}$ be a finite set of points in the plane. Find a formula for the linear function $f(x) = ax + b$ which minimizes the sum of the squares of the errors $f(x_i) - y_i$.


Sigma notation

If $a_1, \ldots, a_m$ is a sequence of numbers, then

\begin{displaymath}\sum_{i=1}^m a_i := a_1 + \cdots + a_m\end{displaymath}

For example, if $a_1 = 3$, $a_2 = 20$, $a_3 = 12$, then $\sum_{i=1}^3 a_i = 3 + 20 + 12 = 35$, while $\sum_{i=2}^3 a_i = 20 + 12 = 32$.
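The same two sums can be checked in a few lines of Python. This is only an illustrative sketch; the variable names `total` and `partial` are chosen here and are not part of the notes.

```python
# The sequence from the example. Python lists are 0-indexed, so a_1 is a[0].
a = [3, 20, 12]

total = sum(a)        # a_1 + a_2 + a_3
partial = sum(a[1:])  # a_2 + a_3, i.e. the sum from i = 2 to 3

print(total, partial)
```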

Sometimes the indices are omitted, so that we write $\sum a$ for $\sum_{i=1}^n a_i$. Similarly, $\sum xy$ stands for $\sum_{i=1}^n x_i y_i$ and $\sum x^2$ for $\sum_{i=1}^n x_i^2$.


Formula for sum of squares of errors

\begin{eqnarray*}
S(a,b) & = & (f(x_1) - y_1)^2 + \cdots + (f(x_n) - y_n)^2 \\
& = & \sum_{i=1}^n (a x_i + b - y_i)^2 \\
& = & (\sum x^2) a^2 + n b^2 + 2ab (\sum x) \\
& & - 2a (\sum xy) - 2b (\sum y) + (\sum y^2)
\end{eqnarray*}
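As a sanity check on the expansion, the sketch below evaluates $S(a,b)$ both directly from the definition and from the form written in terms of $\sum x^2$, $\sum x$, $\sum xy$, $\sum y$ and $\sum y^2$. The function names and the sample points are hypothetical, chosen only for illustration.

```python
def sse_direct(points, a, b):
    """Sum of squared errors straight from the definition."""
    return sum((a * x + b - y) ** 2 for x, y in points)

def sse_expanded(points, a, b):
    """The same quantity, computed from the five sums in the expansion."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxy = sum(x * y for x, y in points)
    sxx = sum(x * x for x, _ in points)
    syy = sum(y * y for _, y in points)
    return (sxx * a ** 2 + n * b ** 2 + 2 * a * b * sx
            - 2 * a * sxy - 2 * b * sy + syy)

# Hypothetical sample data and a trial line f(x) = 2x - 1.
points = [(1, 2), (2, 3), (4, 7)]
direct = sse_direct(points, 2, -1)
expanded = sse_expanded(points, 2, -1)
print(direct, expanded)
```

With integer inputs both computations are exact, so the two values should agree exactly rather than only up to rounding.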


Derivatives of the sum of squares of errors


\begin{eqnarray}
\frac{\partial S}{\partial a} & = & 2 (\sum x^2) a + 2 (\sum x) b - 2 (\sum xy) \\
\frac{\partial S}{\partial b} & = & 2 (\sum x) a + 2n b - 2 (\sum y)
\end{eqnarray}
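One way to gain confidence in these two formulas is to compare them with central differences of $S$. Because $S$ is quadratic in $a$ and in $b$, a central difference reproduces the partial derivative exactly when the arithmetic is exact, so the sketch below uses Python's `fractions.Fraction`. The helper names, the step size `h` and the sample points are assumptions made for this illustration.

```python
from fractions import Fraction

def sse(points, a, b):
    """Sum of squared errors S(a, b) for the line f(x) = a*x + b."""
    return sum((a * x + b - y) ** 2 for x, y in points)

def gradient(points, a, b):
    """Partial derivatives of S, written with the sums from the formulas."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxy = sum(x * y for x, y in points)
    sxx = sum(x * x for x, _ in points)
    dS_da = 2 * sxx * a + 2 * sx * b - 2 * sxy
    dS_db = 2 * sx * a + 2 * n * b - 2 * sy
    return dS_da, dS_db

def central_difference(points, a, b, h=Fraction(1, 1000)):
    """Central-difference estimates of the same two partial derivatives."""
    dS_da = (sse(points, a + h, b) - sse(points, a - h, b)) / (2 * h)
    dS_db = (sse(points, a, b + h) - sse(points, a, b - h)) / (2 * h)
    return dS_da, dS_db

# Hypothetical sample data and evaluation point.
points = [(1, 2), (2, 3), (4, 7)]
a, b = Fraction(2), Fraction(-1)
formula = gradient(points, a, b)
numeric = central_difference(points, a, b)
print(formula, numeric)
```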


Solving for $a$ and $b$

Setting $\frac{\partial S}{\partial b} = 0$ we find


\begin{displaymath}b = \frac{(\sum y) - a (\sum x)}{n}\end{displaymath}

Substituting in the equation $\frac{\partial S}{\partial a} = 0$ and clearing denominators, we find


\begin{displaymath}0 = n (\sum x^2) a + (\sum x) (\sum y) - (\sum x)^2 a - n (\sum xy)\end{displaymath}

which yields


\begin{displaymath}a = \frac{n(\sum xy) - (\sum x)(\sum y)}{n (\sum x^2) - (\sum x)^2}\end{displaymath}
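The two closed-form expressions translate directly into code. The following is a minimal sketch, assuming the coordinates are integers (or `Fraction` values) so that the arithmetic stays exact; the function name `fit_line` and the error raised for a zero denominator are choices made here, not part of the notes.

```python
from fractions import Fraction

def fit_line(points):
    """Return (a, b) for the least-squares line f(x) = a*x + b.

    Assumes integer or Fraction coordinates so results are exact.
    """
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxy = sum(x * y for x, y in points)
    sxx = sum(x * x for x, _ in points)

    denom = n * sxx - sx ** 2
    if denom == 0:
        # Happens exactly when all x-values coincide (including n = 1).
        raise ValueError("all x-values coincide; the slope is undetermined")

    a = Fraction(n * sxy - sx * sy, denom)
    b = (sy - a * sx) / n
    return a, b

# Hypothetical data lying exactly on y = 2x + 1.
a, b = fit_line([(0, 1), (1, 3), (2, 5)])
print(a, b)
```

For points that already lie on a line, the fitted line should be that line, which gives a quick plausibility test of the formulas.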


Verifying minimization


\begin{eqnarray}
\frac{\partial S}{\partial a} & = & 2 (\sum x^2) a + 2 (\sum x) b - 2 (\sum xy) \\
\frac{\partial S}{\partial b} & = & 2 (\sum x) a + 2n b - 2 (\sum y)
\end{eqnarray}


\begin{eqnarray}
\frac{\partial^2 S}{\partial a^2} & = & 2 (\sum x^2) \\
\frac{\partial^2 S}{\partial b^2} & = & 2 n \\
\frac{\partial^2 S}{\partial a \partial b} & = & 2 (\sum x) \\
D_S & = & \frac{\partial^2 S}{\partial a^2} \frac{\partial^2 S}{\partial b^2} - (\frac{\partial^2 S}{\partial a \partial b})^2 = 4 n (\sum x^2) - 4 (\sum x)^2
\end{eqnarray}


Verification, continued

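By the Cauchy-Schwarz inequality, $n \sum_{i=1}^n x_i^2 \geq (\sum_{i=1}^n x_i)^2$, with equality exactly when all the $x_i$ are equal. Thus, $D_S > 0$ unless every point has the same $x$-coordinate (in particular, unless $n = 1$). As $\frac{\partial^2 S}{\partial b^2} = 2n > 0$, the critical point we found is a minimum.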
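One way to see the sign of $D_S$ directly, sketched here as a supplementary step: expanding $(\sum x)^2$ and collecting terms gives

\begin{displaymath}n \sum_{i=1}^n x_i^2 - (\sum_{i=1}^n x_i)^2 = \sum_{1 \leq i < j \leq n} (x_i - x_j)^2 \geq 0\end{displaymath}

so $D_S = 4 \sum_{i<j} (x_i - x_j)^2$, which is strictly positive as soon as two of the $x_i$ differ. For $n = 2$ the identity reads $2(x_1^2 + x_2^2) - (x_1 + x_2)^2 = (x_1 - x_2)^2$.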


Example

Find the line which best fits $\{(0,2), (-3,8), (5,2), (2,1) \}$.


Solution


\begin{eqnarray}
n & = & 4 \\
\sum x & = & 4 \\
\sum y & = & 13 \\
\sum xy & = & -12 \\
\sum x^2 & = & 38
\end{eqnarray}


Solution, continued

\begin{eqnarray*}
a & = & \frac{n(\sum xy) - (\sum x)(\sum y)}{n (\sum x^2) - (\sum x)^2} \\
& = & \frac{4(-12) - (4)(13)}{4(38) - (4)^2} \\
& = & \frac{-100}{136} \\
& = & \frac{-25}{34} \\
& \approx & -0.74
\end{eqnarray*}


Solution, continued

So

\begin{eqnarray*}
b & = & \frac{(\sum y) - a (\sum x)}{n} \\
& = & \frac{13 - (\frac{-25}{34})4}{4} \\
& = & \frac{271}{68} \\
& \approx & 3.99
\end{eqnarray*}
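A worked answer like this can be checked by substituting it back into the two first-order conditions: at the minimizing $(a, b)$ both partial derivatives of $S$ must vanish. The sketch below recomputes the sums for the four example points in exact arithmetic and evaluates both derivatives; the variable names are chosen here for illustration.

```python
from fractions import Fraction

points = [(0, 2), (-3, 8), (5, 2), (2, 1)]

n = len(points)
sx = sum(x for x, _ in points)
sy = sum(y for _, y in points)
sxy = sum(x * y for x, y in points)
sxx = sum(x * x for x, _ in points)

# Closed-form slope and intercept, kept as exact fractions.
a = Fraction(n * sxy - sx * sy, n * sxx - sx ** 2)
b = (sy - a * sx) / n

# Both partial derivatives of S should be exactly zero at (a, b).
dS_da = 2 * sxx * a + 2 * sx * b - 2 * sxy
dS_db = 2 * sx * a + 2 * n * b - 2 * sy

print("a =", a, "b =", b)
print("dS/da =", dS_da, "dS/db =", dS_db)
```

If either derivative comes out nonzero, the candidate pair $(a, b)$ contains an arithmetic slip somewhere.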




Thomas Scanlon 2004-02-05