算法原理

线性回归遇到的问题一般是这样的。我们有m个样本，每个样本对应于n维特征和一个结果输出，如下： $(x^{(0)}_1,x^{(0)}_2,…x^{(0)}_n,y_0),(x^{(1)}_1,x^{(1)}_2,…x^{(1)}_n,y_1),…(x^{(m)}_1,x^{(m)}_2,…x^{(m)}_n,y_m)$
我们的问题是，对于一个新的 $(x^{(x)}_1,x^{(x)}_2,…x^{(x)}_n)$, 其所对应的 $y_x $是多少呢？如果这个问题里面的 $y $是连续的，则是一个回归问题，否则是一个分类问题。

对于 $n$维特征的样本数据，如果我们使用线性回归，那么对应的模型是这样的：

$h_{\theta} (x_1, x_2,...x_n)=\theta_0+\theta_1 x_1+...+\theta_n x_n$

其中 $θ_i (i = 0,1,2… n)$为模型参数， $x_i (i = 0,1,2… n)$为每个样本的 $n$个特征值。

这个表示可以简化，我们增加一个特征 $x_{0} = 1$ ，这样

$h_{\theta} (x_{0},x_{1},...x_{n})=∑_{i=0}^n\theta_{i}x_{i}$

进一步用矩阵形式表达更加简洁:

$h_{\mathbf{\theta}}(\mathbf{X}) = \mathbf{X\theta}$

其中，假设函数 $h_{\theta}(\bf X)$为 $m\times 1$的向量, $\mathbf \theta$为 $n\times 1$的向量，里面有 $n$个模型参数。 $\mathbf X$为 $m\times n$维的矩阵。 $m$代表样本的个数， $n$代表样本的特征数。

得到了模型，我们需要求出损失函数，一般线性回归我们用均方误差作为损失函数。损失函数的代数法表示如下：

$J(\theta_0,\theta_1...,\theta_n)=∑_{i=0}^m(h_\theta(x_0,x_1,...x_n)−y_i)^2$

进一步用矩阵形式表达损失函数：

$J(\bf \theta)=\frac{1}{2}(\bf X\bf \theta−\bf Y)^T(\bf X\bf \theta−\bf Y)$

我们常用的有两种方法来求损失函数最小化时候的θ参数：一种是梯度下降法，一种是最小二乘法。

公式推导

梯度下降法

假设样本数目为m，样本特征维度为n，$x$的形状为(m, n)，$y$的形状为(m,)。

对应的，$w$的形状为(n, 1)，$b$形状为(1,1)。

线性回归的初始抽象为下式，对于$x^{(i)}$的每个特征$x_j^{(i)}$，求解一个权重$w_j$，以及一个常数项（截距）$b$，使得加权得到的预测值与真实值尽可能相等。

$y=w^Tx+b$

为了方便，可以将$w$和$b$合并，即一个形状为(n+1, 1)的向量$\theta$：

$y = \theta ^ T x$

定义均方差误差（MSE）作为损失函数（关于为何采用MSE损失函数，详见这里），其损失函数为：

$\begin{equation}\begin{split}J &= \frac{1}{2}(y'-y)^2 \\ &= \frac{1}{2}(\theta^Tx-y)^2 \end{split}\end{equation}$

对应参数$\theta_j$的求导为：

$\begin{equation}\begin{split} \frac{\partial{J}}{\partial{\theta_j}} &= (y'-y)\frac{\partial{\theta^Tx}}{\partial{\theta_j}} \\ &= (y'-y) x_j \end{split}\end{equation}$

最小二乘法

线性回归的目的是求解参数向量$\mathbf w$使得 $\mathbf {x} \mathbf {w}=\mathbf {y}$。

一般来说，在等式两边同时左乘 $\mathbf x^{-1}$即可得到 $\mathbf {w}=\mathbf {x^{-1}y}$。

然而，前提是得保证 $\mathbf {x}$可逆。因此为了使得 $\mathbf {x}$可逆，我们将其转换为 $\mathbf {x^Tx}$，即先同时左乘 $\mathbf{x^T}$使其成为一个方阵，再求逆。

具体的公式推导为：

$\begin {align} \mathbf {xw}&= \mathbf{y} \\ \mathbf {x^Txw}&= \mathbf{x^Ty} \\ \mathbf {(x^Tx)^{-1}(x^Tx)w}&= \mathbf{(x^Tx)^{-1}x^Ty} \\ \mathbf {w}&= \mathbf{(x^Tx)^{-1}x^Ty} \end{align}$

Python 实现

# -*- utf-8 -*-
# source: https://github.com/liuph0119/KnowledgeCloud/blob/master/Code/python/algorithms/MachineLearning/linear_models/linear_regression.py

import numpy as np
from Code.python.algorithms.MachineLearning.utils.utils import mini_batches, log_msg
from Code.python.algorithms.MachineLearning.metrics.losses import mse_loss, mse_loss_grad
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_diabetes, load_boston


class LinearRegression:
    def __init__(self, fit_intercept=True):
        """ initialize model parameters.
        :param fit_intercept: bool, default True.
            whether to fit intercept.
        """
        self.w = None
        self.coef_ = None
        self.intercept_ = None
        self.fit_intercept = fit_intercept

    def fit(self, x, y, learning_rate=0.001, batch_size=256, n_epochs=100, verbose=2):
        """ fit model via mini-batch gradient descent
            y = xw
            loss = 1/2 * (y' - y) ^ 2 = 1/2 * (xw - y) ^ 2
            grad = (xw - y) * x.T
        :param x: array of shape '(n_sample, n_features)'.
            training features.
        :param y: array of shape '(n_samples, 1)'.
            training labels.
        :param learning_rate: float, default 1e-3.
            learning rate of gradient descent.
        :param batch_size: int, default 256.
            batch size.
        :param n_epochs: int, default 1000.
            number of training epochs.
        :param verbose: int, default 1.
            whether to print log, '1' for printing while '2' for not printing.
        :return: None.
        """
        n_samples, n_features = x.shape[0], x.shape[1]
        batch_size = n_samples if batch_size == -1 else batch_size
        if self.fit_intercept:
            x = np.c_[np.ones(n_samples), x]
            n_features += 1
        self.w = np.random.rand(n_features) * 0.1

        loss = np.inf
        for i in range(1, n_epochs + 1):
            data_iter, _ = mini_batches(x, batch_size, shuffle=True)
            # bc: batch
            _iter = 1
            for bc_idx in data_iter:
                bc_x, bc_y = x[bc_idx], y[bc_idx]
                bc_h = np.dot(bc_x, self.w)
                loss, grad = mse_loss(bc_y, bc_h), mse_loss_grad(bc_y, bc_h, bc_x)
                self.w = self.w - learning_rate * grad
                _iter += 1
                if verbose == 1:
                    log_msg('# epoch {} iter {}: loss={:.4f}'.format(i, _iter, loss))
            if verbose == 2:
                log_msg('# epoch {}: loss={:.4f}'.format(i, loss))

        if self.fit_intercept:
            self.coef_ = self.w[1:]
            self.intercept_ = self.w[0]

    def fit_ols(self, x, y):
        """ fit model via the ordinary least squares method.
        xw = y  => X.Txw = x.Ty  => (X.Tx)^(-1)(X.Tx)w = (X^Tx)^(-1)X.Ty
                =>w = (X^Tx)^(-1)X.Ty

        :param x: array of shape '(n_sample, n_features)'.
            training features.
        :param y: array of shape '(n_samples, 1)'.
            training labels.
        :return: None.
        """
        n_samples = x.shape[0]
        if self.fit_intercept:
            x = np.c_[np.ones(n_samples), x]
        self.w = np.dot(np.dot(np.linalg.inv(np.dot(x.T, x)), x.T), y)

        if self.fit_intercept:
            self.coef_ = self.w[1:]
            self.intercept_ = self.w[0]

    def predict(self, x):
        n_samples = x.shape[0]
        if self.fit_intercept:
            x = np.c_[np.ones(n_samples), x]
        return np.dot(x, self.w)

    def evaluate(self, x, y):
        preds = self.predict(x)
        mse = mse_loss(y, preds)
        return {'mse': mse}
   

if __name__ == '__main__':
    # x, y = make_regression(n_samples=1000, n_features=5, random_state=0)
    data = load_diabetes()
    x, y = data.data, data.target

    train_x, val_x, train_y, val_y = train_test_split(x, y, train_size=0.8, random_state=0)
    lm = LinearRegression(fit_intercept=True)
    # fit via mini-batch gradient descent
    lm.fit(train_x, train_y, batch_size=256, learning_rate=5e-1, n_epochs=1000, verbose=2)
    print(lm.evaluate(val_x, val_y))
    print(lm.coef_, lm.intercept_)
	# fit via ordinal least square
    lm.fit_ols(train_x, train_y)
    print(lm.evaluate(val_x, val_y))
    print(lm.coef_, lm.intercept_)

# result of mini-batch gradient descent
{'mse': 1686.1390843240692}
[ -17.21044662 -206.78540224  528.58121669  284.19942111  -59.31520498
 -136.10297667 -226.90958011  130.10864544  434.07454611   95.87515587] 151.3963140083344

# result of ordinal least square
{'mse': 1712.1583441068672}
[ -35.55683674 -243.1692265   562.75404632  305.47203008 -662.78772128
  324.27527477   24.78193291  170.33056502  731.67810787   43.02846824] 152.53813351954057

Scikit-learn API

from sklearn.linear_model import LinearRegression, SGDRegressor

lm = LinearRegression(fit_intercept=True)   # or SGDRegressor
lm.fit(train_x, train_y)
print(mse_loss(val_y, lm.predict(val_x)))
print(lm.coef_, lm.intercept_)

1
2
3

mse: 1712.1583441068674
[ -35.55683674 -243.1692265   562.75404632  305.47203008 -662.78772128
  324.27527477   24.78193291  170.33056502  731.67810787   43.02846824] 152.53813351954062