
  • Least-Squares Lines (最小二乘直线)
  • The General Linear Model
  • Least-Squares Fitting of Other Curves
  • Multiple Regression
  • References


  • For easy application of the discussion to real problems that you may encounter later in your career, we choose notation that is commonly used in the statistical analysis of scientific and engineering data:

    • Instead of $A\boldsymbol x=\boldsymbol b$, we write $X\boldsymbol\beta=\boldsymbol y$ and refer to $X$ as the design matrix (设计矩阵), $\boldsymbol\beta$ as the parameter vector (参数向量), and $\boldsymbol y$ as the observation vector (观测向量).

Least-Squares Lines (最小二乘直线)

  • The simplest relation between two variables $x$ and $y$ is the linear equation $y=\beta_0+\beta_1 x$. Experimental data often produce points $(x_1, y_1),\dots,(x_n, y_n)$ that, when graphed, seem to lie close to a line. We want to determine the parameters $\beta_0$ and $\beta_1$ that make the line as “close” to the points as possible.
  • Suppose $\beta_0$ and $\beta_1$ are fixed, and consider the line $y=\beta_0+\beta_1 x$ in Figure 1. Corresponding to each data point $(x_j, y_j)$ there is a point $(x_j, \beta_0+\beta_1 x_j)$ on the line with the same $x$-coordinate. We call $y_j$ the observed value of $y$ and $\beta_0+\beta_1 x_j$ the predicted $y$-value. The difference between an observed $y$-value and a predicted $y$-value is called a residual (余差).
  • There are several ways to measure how “close” the line is to the data. The usual choice (primarily because the mathematical calculations are simple) is to add the squares of the residuals:
    • The least-squares line is the line $y=\beta_0+\beta_1 x$ that minimizes the sum of the squares of the residuals. This line is also called a line of regression of $y$ on $x$ ($y$ 对 $x$ 的回归直线), because any errors in the data are assumed to be only in the $y$-coordinates. The coefficients $\beta_0, \beta_1$ of the line are called (linear) regression coefficients (回归系数).
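  • If the data points were on the line, the parameters $\beta_0$ and $\beta_1$ would satisfy the equations
    $$\begin{aligned}\beta_0+\beta_1x_1&=y_1\\\beta_0+\beta_1x_2&=y_2\\&\ \,\vdots\\\beta_0+\beta_1x_n&=y_n\end{aligned}$$
    We can write this system as
    $$X\boldsymbol\beta=\boldsymbol y,\quad\text{where}\quad X=\begin{bmatrix}1&x_1\\1&x_2\\\vdots&\vdots\\1&x_n\end{bmatrix},\quad\boldsymbol\beta=\begin{bmatrix}\beta_0\\\beta_1\end{bmatrix},\quad\boldsymbol y=\begin{bmatrix}y_1\\y_2\\\vdots\\y_n\end{bmatrix}$$
    If the data points do not lie on a line, this system has no solution, and we have a least-squares problem. The square of the distance between the vectors $X\boldsymbol\beta$ and $\boldsymbol y$ is precisely the sum of the squares of the residuals. Computing the least-squares solution of $X\boldsymbol\beta=\boldsymbol y$ is therefore equivalent to finding the $\boldsymbol\beta$ that determines the least-squares line in Figure 1.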
  • A common practice before computing a least-squares line is to compute the average $\overline x$ of the original $x$-values and form a new variable $x^* = x-\overline x$. The new $x$-data are said to be in mean-deviation form (平均偏差形式). In this case, the two columns of the design matrix will be orthogonal, because the entries of the new $x$-column sum to zero: $\sum_{i=1}^{n}(x_i-\overline x)=0$.
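The orthogonality claim for mean-deviation form can be checked numerically. The following is a minimal sketch using NumPy; the four $x$-values are illustrative numbers chosen here, not data from the text.

```python
import numpy as np

# Illustrative x-values (an assumption for this sketch, not data from the text).
x = np.array([2.0, 5.0, 7.0, 8.0])

# Mean-deviation form: subtract the average from every x-value.
x_star = x - x.mean()

# Design matrices for the original and the centered data.
X = np.column_stack([np.ones_like(x), x])
X_star = np.column_stack([np.ones_like(x_star), x_star])

# Dot product of the two columns: sum(x) before centering, 0 after.
dot_raw = X[:, 0] @ X[:, 1]
dot_centered = X_star[:, 0] @ X_star[:, 1]

# With orthogonal columns, X*^T X* is diagonal, so the normal
# equations decouple into two one-variable equations.
gram = X_star.T @ X_star
print(dot_raw, dot_centered)
print(gram)
```

Because the Gram matrix is diagonal in this case, each coefficient can be solved for independently of the other.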


Show that the least-squares line for the data $(x_1, y_1),\dots,(x_n, y_n)$ must pass through $(\overline x,\overline y)$. That is, show that $\overline x$ and $\overline y$ satisfy the linear equation $\overline y=\hat\beta_0+\hat\beta_1\overline x$.


  • Derive this equation from the vector equation $\boldsymbol y=X\hat{\boldsymbol\beta}+\boldsymbol\epsilon$. Denote the first column of $X$ by $\boldsymbol 1$. Use the fact that the residual vector $\boldsymbol\epsilon$ is orthogonal to the column space of $X$ and hence is orthogonal to $\boldsymbol 1$. Thus $\sum_{i=1}^{n}\epsilon_i=0$.
    $$\begin{aligned}\because y_i&=\hat\beta_{0}+x_i\hat\beta_{1}+\epsilon_i\\\therefore \sum_{i=1}^{n}y_i&=n\hat\beta_{0}+\hat\beta_{1}\sum_{i=1}^{n}x_i\\\therefore \overline y&=\hat\beta_0+\hat\beta_1\overline x\end{aligned}$$
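The two facts used in this derivation (the residuals sum to zero, and the fitted line passes through the point of averages) can be confirmed on a small example. This is a sketch with made-up data; `np.linalg.lstsq` is NumPy's least-squares solver.

```python
import numpy as np

# Made-up sample data (an assumption for illustration).
x = np.array([2.0, 5.0, 7.0, 8.0])
y = np.array([1.0, 2.0, 3.0, 3.0])

# Design matrix with a column of ones and a column of x-values.
X = np.column_stack([np.ones_like(x), x])

# Least-squares solution beta_hat = (beta0_hat, beta1_hat).
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Residual vector epsilon = y - X beta_hat.
residuals = y - X @ beta_hat

# epsilon is orthogonal to the column of ones, so its entries sum to zero.
print(residuals.sum())

# Hence the fitted line evaluated at x-bar reproduces y-bar.
print(beta_hat[0] + beta_hat[1] * x.mean(), y.mean())
```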

  • Given data for a least-squares problem, $(x_1, y_1),\dots,(x_n, y_n)$, the following abbreviations are helpful:
    $$\sum x=\sum_{i=1}^{n}x_i,\quad\sum x^2=\sum_{i=1}^{n}x_i^2,\quad\sum y=\sum_{i=1}^{n}y_i,\quad\sum xy=\sum_{i=1}^{n}x_iy_i$$
  • The normal equations for a least-squares line $y=\hat\beta_0+\hat\beta_1x$ are $X^TX\boldsymbol\beta=X^T\boldsymbol y$. Write $X=\begin{bmatrix}\boldsymbol 1&\boldsymbol x\end{bmatrix}$, where $\boldsymbol 1$ is the column of ones and $\boldsymbol x$ is the column of $x$-values.
    $$\because X^TX=\begin{bmatrix}\boldsymbol 1^T\\\boldsymbol x^T\end{bmatrix}\begin{bmatrix}\boldsymbol 1&\boldsymbol x\end{bmatrix}=\begin{bmatrix}n&\sum x\\\sum x&\sum x^2\end{bmatrix}$$
    The normal equations may be written in the form
    $$\begin{bmatrix}n&\sum x\\\sum x&\sum x^2\end{bmatrix}\hat{\boldsymbol\beta}=\begin{bmatrix}\boldsymbol 1^T\\\boldsymbol x^T\end{bmatrix}\boldsymbol y=\begin{bmatrix}\sum y\\\sum xy\end{bmatrix}$$
    $$\therefore n\hat\beta_0+\hat\beta_1\sum x=\sum y,\qquad\hat\beta_0\sum x+\hat\beta_1\sum x^2=\sum xy$$
  • If $X$ has 2 linearly independent columns, then
    $$\begin{aligned}\hat{\boldsymbol\beta}&=\begin{bmatrix}n&\sum x\\\sum x&\sum x^2\end{bmatrix}^{-1}\begin{bmatrix}\sum y\\\sum xy\end{bmatrix}\\&=\frac{1}{n\sum x^2-(\sum x)^2}\begin{bmatrix}\sum x^2&-\sum x\\-\sum x&n\end{bmatrix}\begin{bmatrix}\sum y\\\sum xy\end{bmatrix}\end{aligned}$$
    $$\therefore\hat\beta_0=\frac{\sum x^2\sum y-\sum x\sum xy}{n\sum x^2-(\sum x)^2},\qquad\hat\beta_1=\frac{n\sum xy-\sum x\sum y}{n\sum x^2-(\sum x)^2}$$
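The closed-form expressions for $\hat\beta_0$ and $\hat\beta_1$ translate directly into code. The sketch below is one possible implementation; the function name `least_squares_line` and the sample data are assumptions introduced here for illustration.

```python
import numpy as np

def least_squares_line(x, y):
    """Return (beta0_hat, beta1_hat) from the closed-form normal-equation solution.

    Hypothetical helper for illustration; assumes the x-values are not all
    equal, so that the design matrix has two linearly independent columns.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    sx, sxx = x.sum(), (x * x).sum()   # sum of x, sum of x^2
    sy, sxy = y.sum(), (x * y).sum()   # sum of y, sum of x*y
    det = n * sxx - sx ** 2            # determinant of X^T X
    if det == 0:
        raise ValueError("all x-values are equal; the line is not unique")
    beta0 = (sxx * sy - sx * sxy) / det
    beta1 = (n * sxy - sx * sy) / det
    return beta0, beta1

# Made-up sample data (an assumption for illustration).
b0, b1 = least_squares_line([2, 5, 7, 8], [1, 2, 3, 3])
print(b0, b1)  # 2/7 and 5/14, up to rounding
```

For these data, $n\sum x^2-(\sum x)^2 = 4\cdot 142-22^2=84$, which gives $\hat\beta_0=24/84=2/7$ and $\hat\beta_1=30/84=5/14$.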

Consider the following numbers.

Every statistics text that discusses regression and the linear model $\boldsymbol y=X\boldsymbol\beta+\boldsymbol\epsilon$ introduces these numbers.

  • (i) $\left\|X\hat{\boldsymbol\beta}\right\|^2$: the sum of the squares of the “regression term.” Denote this number by $SS(R)$.
  • (ii) $\left\|\boldsymbol y-X\hat{\boldsymbol\beta}\right\|^2$: the sum of the squares for the error term. Denote this number by $SS(E)$.
  • (iii) $\left\|\boldsymbol y\right\|^2$: the “total” sum of the squares of the $y$-values. Denote this number by $SS(T)$.


Justify the equation $SS(T)=SS(R)+SS(E)$. This equation is extremely important in statistics, both in regression theory and in the analysis of variance.


  • This follows from the Pythagorean Theorem (in Section 6.1): $X\hat{\boldsymbol\beta}$ lies in the column space of $X$, while $\boldsymbol\epsilon=\boldsymbol y-X\hat{\boldsymbol\beta}$ is orthogonal to that space, so $\left\|\boldsymbol y\right\|^2=\left\|X\hat{\boldsymbol\beta}\right\|^2+\left\|\boldsymbol y-X\hat{\boldsymbol\beta}\right\|^2$. Then
    $$\begin{aligned}SS(E)&=SS(T)-SS(R)\\&=\left\|\boldsymbol y\right\|^2-\left\|X\hat{\boldsymbol\beta}\right\|^2\\&=\boldsymbol y^T\boldsymbol y-\hat{\boldsymbol\beta}^TX^TX\hat{\boldsymbol\beta}\\&=\boldsymbol y^T\boldsymbol y-(\hat{\boldsymbol\beta}^TX^TX\hat{\boldsymbol\beta}+\hat{\boldsymbol\beta}^TX^T\boldsymbol\epsilon)\\&=\boldsymbol y^T\boldsymbol y-\hat{\boldsymbol\beta}^TX^T(X\hat{\boldsymbol\beta}+\boldsymbol\epsilon)\\&=\boldsymbol y^T\boldsymbol y-\hat{\boldsymbol\beta}^TX^T\boldsymbol y\end{aligned}$$
    The fourth line adds a zero term, since $X^T\boldsymbol\epsilon=\boldsymbol 0$. This is the standard formula for $SS(E)$.
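The identity $SS(T)=SS(R)+SS(E)$ and the shortcut formula for $SS(E)$ can both be checked numerically. This is a minimal sketch with made-up data; the variable names are chosen here for illustration.

```python
import numpy as np

# Made-up sample data (an assumption for illustration).
x = np.array([2.0, 5.0, 7.0, 8.0])
y = np.array([1.0, 2.0, 3.0, 3.0])
X = np.column_stack([np.ones_like(x), x])

# Least-squares solution of X beta = y.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

fitted = X @ beta_hat
ss_r = fitted @ fitted               # SS(R) = ||X beta_hat||^2
ss_e = (y - fitted) @ (y - fitted)   # SS(E) = ||y - X beta_hat||^2
ss_t = y @ y                         # SS(T) = ||y||^2

# Shortcut formula: SS(E) = y^T y - beta_hat^T X^T y.
ss_e_formula = y @ y - beta_hat @ (X.T @ y)

print(ss_t, ss_r + ss_e)
print(ss_e, ss_e_formula)
```

Note that $SS(T)$ here is the uncentered total $\left\|\boldsymbol y\right\|^2$, as defined in item (iii), not the sum of squared deviations from $\overline y$.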

The General Linear Model

  • In some applications, it is necessary to fit data points with something other than a straight line.

    • In the examples that follow, the matrix equation is still $X\boldsymbol\beta=\boldsymbol y$, but the specific form of $X$ changes from one problem to the next.
    • Statisticians usually introduce a residual vector (余差向量) $\boldsymbol\epsilon$, defined by $\boldsymbol\epsilon=\boldsymbol y-X\boldsymbol\beta$, and write
      $$\boldsymbol y=X\boldsymbol\beta+\boldsymbol\epsilon$$
      Any equation of this form is referred to as a linear model. Once $X$ and $\boldsymbol y$ are determined, the goal is to minimize the length of $\boldsymbol\epsilon$, which amounts to finding a least-squares solution of $X\boldsymbol\beta=\boldsymbol y$. In each case, the least-squares solution $\hat{\boldsymbol\beta}$ is a solution of the normal equations
      $$X^TX\boldsymbol\beta=X^T\boldsymbol y$$
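Since every linear model leads to the same normal equations, a single routine can serve all of the examples. The sketch below solves $X^TX\boldsymbol\beta=X^T\boldsymbol y$ directly; the helper name `solve_normal_equations` is hypothetical, and the approach assumes the columns of $X$ are linearly independent so that $X^TX$ is invertible.

```python
import numpy as np

def solve_normal_equations(X, y):
    """Solve X^T X beta = X^T y for beta.

    Hypothetical helper for illustration. Assumes X has linearly
    independent columns; otherwise np.linalg.solve raises LinAlgError.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.linalg.solve(X.T @ X, X.T @ y)

# Made-up sample data (an assumption for illustration).
x = np.array([2.0, 5.0, 7.0, 8.0])
y = np.array([1.0, 2.0, 3.0, 3.0])
X = np.column_stack([np.ones_like(x), x])

beta_normal = solve_normal_equations(X, y)

# Cross-check against NumPy's general least-squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_normal, beta_lstsq)
```

Forming $X^TX$ explicitly is the most direct reading of the normal equations, but it can lose accuracy when the columns of $X$ are nearly dependent; `np.linalg.lstsq` avoids forming that product and is usually the safer choice in practice.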

Least-Squares Fitting of Other Curves

  • The next example shows how to fit data by curves that have the general form
    $$y=\beta_0f_0(x)+\beta_1f_1(x)+\dots+\beta_kf_k(x)\qquad(2)$$
    where $f_0,\dots,f_k$ are known functions and $\beta_0,\dots,\beta_k$ are parameters that must be determined.
  • As we will see, equation (2) describes a linear model because it is linear in the unknown parameters.


Suppose we wish to approximate the data by an equation of the form
$$y=\beta_0+\beta_1x+\beta_2x^2\qquad(3)$$
Describe the linear model that produces a “least-squares fit” of the data by equation (3).


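  • Writing one equation $y_i=\beta_0+\beta_1x_i+\beta_2x_i^2+\epsilon_i$ for each data point gives the linear model $\boldsymbol y=X\boldsymbol\beta+\boldsymbol\epsilon$, where
    $$\boldsymbol y=\begin{bmatrix}y_1\\y_2\\\vdots\\y_n\end{bmatrix},\quad X=\begin{bmatrix}1&x_1&x_1^2\\1&x_2&x_2^2\\\vdots&\vdots&\vdots\\1&x_n&x_n^2\end{bmatrix},\quad\boldsymbol\beta=\begin{bmatrix}\beta_0\\\beta_1\\\beta_2\end{bmatrix},\quad\boldsymbol\epsilon=\begin{bmatrix}\epsilon_1\\\epsilon_2\\\vdots\\\epsilon_n\end{bmatrix}$$
  • The design matrix above is a Vandermonde matrix (范德蒙德矩阵). Example 4 in Section 2.1 and Theorem 14 in Section 6.5 show that if at least $3$ of the values $x_1,\dots,x_n$ are distinct, then the least-squares solution $\hat{\boldsymbol\beta}$ will be unique.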
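A quadratic fit of the form (3) can be sketched as follows. The data values are made up for illustration; `np.vander` with `increasing=True` builds the columns $1, x, x^2$ in that order.

```python
import numpy as np

# Made-up data that roughly follow a parabola (an assumption for illustration).
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = np.array([4.1, 0.9, 0.2, 1.1, 3.8])

# Design matrix with columns 1, x, x^2.
X = np.vander(x, N=3, increasing=True)

# Least-squares solution and the rank of X.
beta_hat, _, rank, _ = np.linalg.lstsq(X, y, rcond=None)

# The residual vector is orthogonal to every column of X.
residual = y - X @ beta_hat

print(rank)              # 3, since at least three x-values are distinct
print(beta_hat)          # estimates of beta0, beta1, beta2
print(X.T @ residual)    # approximately the zero vector
```

The model is linear in $\beta_0,\beta_1,\beta_2$ even though it is quadratic in $x$, which is why the same least-squares machinery applies unchanged.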

Multiple Regression


  • Suppose an experiment involves two independent variables (独立变量), say $u$ and $v$, and one dependent variable (相关变量), $y$. A simple equation for predicting $y$ from $u$ and $v$ has the form
    $$y=\beta_0+\beta_1u+\beta_2v\qquad(4)$$
    A more general prediction equation might have the form
    $$y=\beta_0+\beta_1u+\beta_2v+\beta_3u^2+\beta_4uv+\beta_5v^2\qquad(5)$$
  • Equations (4) and (5) both lead to a linear model because they are linear in the unknown parameters (even though $u$ and $v$ are multiplied). In general, a linear model will arise whenever $y$ is to be predicted by an equation of the form
    $$y=\beta_0f_0(u, v)+\beta_1f_1(u, v)+\dots+\beta_kf_k(u, v)$$
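For equation (5), each row of the design matrix holds the values $1, u, v, u^2, uv, v^2$ at one data point. The sketch below uses synthetic, noise-free data generated from known coefficients, so the fit should recover them; the helper name `quadratic_design_matrix` and all numeric values are assumptions for illustration.

```python
import numpy as np

def quadratic_design_matrix(u, v):
    """Build the design matrix for equation (5): columns 1, u, v, u^2, uv, v^2.

    Hypothetical helper for illustration.
    """
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return np.column_stack([np.ones_like(u), u, v, u ** 2, u * v, v ** 2])

# Sample points on a 3-by-3 grid (assumed inputs).
uu, vv = np.meshgrid([0.0, 1.0, 2.0], [0.0, 1.0, 2.0])
u, v = uu.ravel(), vv.ravel()

# Synthetic observations from known parameters, with no noise added.
true_beta = np.array([1.0, 2.0, -1.0, 0.5, 3.0, -2.0])
X = quadratic_design_matrix(u, v)
y = X @ true_beta

# Least-squares estimate; with noise-free data it matches true_beta.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(X.shape)   # (9, 6): nine observations, six parameters
print(beta_hat)
```

With real measurements the observations would include error, and $\hat{\boldsymbol\beta}$ would only approximate the underlying parameters.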


References

  • Linear Algebra and Its Applications

