机器学习:回归算法(1)线性回归

算法原理

线性回归遇到的问题一般是这样的。我们有m个样本,每个样本对应于n维特征和一个结果输出,如下: $(x^{(0)}_1,x^{(0)}_2,…x^{(0)}_n,y_0),(x^{(1)}_1,x^{(1)}_2,…x^{(1)}_n,y_1),…(x^{(m)}_1,x^{(m)}_2,…x^{(m)}_n,y_m)$
我们的问题是,对于一个新的 $(x^{(x)}_1,x^{(x)}_2,…x^{(x)}_n)$, 其所对应的 $y_x $是多少呢? 如果这个问题里面的 $y $是连续的,则是一个回归问题,否则是一个分类问题。

对于 $n$维特征的样本数据,如果我们使用线性回归,那么对应的模型是这样的:

其中 $θ_i (i = 0,1,2… n)$为模型参数, $x_i (i = 0,1,2… n)$为每个样本的 $n$个特征值。

这个表示可以简化,我们增加一个特征 $x_{0} = 1$ ,这样

进一步用矩阵形式表达更加简洁:

其中, 假设函数 $h_{\theta}(\bf X)$为 $m\times 1$的向量, $\mathbf \theta$为 $n\times 1$的向量,里面有 $n$个模型参数。 $\mathbf X$为 $m\times n$维的矩阵。 $m$代表样本的个数, $n$代表样本的特征数。

得到了模型,我们需要求出损失函数,一般线性回归我们用均方误差作为损失函数。损失函数的代数法表示如下:

进一步用矩阵形式表达损失函数:

我们常用的有两种方法来求损失函数最小化时候的θ参数:一种是梯度下降法,一种是最小二乘法

公式推导

梯度下降法

假设样本数目为m,样本特征维度为n,$x$的形状为(m, n),$y$的形状为(m,)

对应的,$w$的形状为(n, 1),$b$形状为(1,1)

线性回归的初始抽象为下式,对于$x^{(i)}$的每个特征$x_j^{(i)}$,求解一个权重$w_j$,以及一个常数项(截距)$b$,使得加权得到的预测值与真实值尽可能相等。

为了方便,可以将$w$和$b$合并,即一个形状为(n+1, 1)的向量$\theta$:

定义均方差误差(MSE)作为损失函数(关于为何采用MSE损失函数,详见这里),其损失函数为:

对应参数$\theta_j$的求导为:

最小二乘法

线性回归的目的是求解参数向量$\mathbf w$使得 $\mathbf {x} \mathbf {w}=\mathbf {y}$。

一般来说,在等式两边同时左乘 $\mathbf x^{-1}$即可得到 $\mathbf {w}=\mathbf {x^{-1}y}$。

然而,前提是得保证 $\mathbf {x}$可逆。因此为了使得 $\mathbf {x}$可逆,我们将其转换为 $\mathbf {x^Tx}$,即先同时左乘 $\mathbf{x^T}$使其成为一个方阵,再求逆。

具体的公式推导为:

Python 实现

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
# -*- utf-8 -*-
# source: https://github.com/liuph0119/KnowledgeCloud/blob/master/Code/python/algorithms/MachineLearning/linear_models/linear_regression.py

import numpy as np
from Code.python.algorithms.MachineLearning.utils.utils import mini_batches, log_msg
from Code.python.algorithms.MachineLearning.metrics.losses import mse_loss, mse_loss_grad
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_diabetes, load_boston


class LinearRegression:
def __init__(self, fit_intercept=True):
""" initialize model parameters.
:param fit_intercept: bool, default True.
whether to fit intercept.
"""
self.w = None
self.coef_ = None
self.intercept_ = None
self.fit_intercept = fit_intercept

def fit(self, x, y, learning_rate=0.001, batch_size=256, n_epochs=100, verbose=2):
""" fit model via mini-batch gradient descent
y = xw
loss = 1/2 * (y' - y) ^ 2 = 1/2 * (xw - y) ^ 2
grad = (xw - y) * x.T
:param x: array of shape '(n_sample, n_features)'.
training features.
:param y: array of shape '(n_samples, 1)'.
training labels.
:param learning_rate: float, default 1e-3.
learning rate of gradient descent.
:param batch_size: int, default 256.
batch size.
:param n_epochs: int, default 1000.
number of training epochs.
:param verbose: int, default 1.
whether to print log, '1' for printing while '2' for not printing.
:return: None.
"""
n_samples, n_features = x.shape[0], x.shape[1]
batch_size = n_samples if batch_size == -1 else batch_size
if self.fit_intercept:
x = np.c_[np.ones(n_samples), x]
n_features += 1
self.w = np.random.rand(n_features) * 0.1

loss = np.inf
for i in range(1, n_epochs + 1):
data_iter, _ = mini_batches(x, batch_size, shuffle=True)
# bc: batch
_iter = 1
for bc_idx in data_iter:
bc_x, bc_y = x[bc_idx], y[bc_idx]
bc_h = np.dot(bc_x, self.w)
loss, grad = mse_loss(bc_y, bc_h), mse_loss_grad(bc_y, bc_h, bc_x)
self.w = self.w - learning_rate * grad
_iter += 1
if verbose == 1:
log_msg('# epoch {} iter {}: loss={:.4f}'.format(i, _iter, loss))
if verbose == 2:
log_msg('# epoch {}: loss={:.4f}'.format(i, loss))

if self.fit_intercept:
self.coef_ = self.w[1:]
self.intercept_ = self.w[0]

def fit_ols(self, x, y):
""" fit model via the ordinary least squares method.
xw = y => X.Txw = x.Ty => (X.Tx)^(-1)(X.Tx)w = (X^Tx)^(-1)X.Ty
=>w = (X^Tx)^(-1)X.Ty

:param x: array of shape '(n_sample, n_features)'.
training features.
:param y: array of shape '(n_samples, 1)'.
training labels.
:return: None.
"""
n_samples = x.shape[0]
if self.fit_intercept:
x = np.c_[np.ones(n_samples), x]
self.w = np.dot(np.dot(np.linalg.inv(np.dot(x.T, x)), x.T), y)

if self.fit_intercept:
self.coef_ = self.w[1:]
self.intercept_ = self.w[0]

def predict(self, x):
n_samples = x.shape[0]
if self.fit_intercept:
x = np.c_[np.ones(n_samples), x]
return np.dot(x, self.w)

def evaluate(self, x, y):
preds = self.predict(x)
mse = mse_loss(y, preds)
return {'mse': mse}


if __name__ == '__main__':
# x, y = make_regression(n_samples=1000, n_features=5, random_state=0)
data = load_diabetes()
x, y = data.data, data.target

train_x, val_x, train_y, val_y = train_test_split(x, y, train_size=0.8, random_state=0)
lm = LinearRegression(fit_intercept=True)
# fit via mini-batch gradient descent
lm.fit(train_x, train_y, batch_size=256, learning_rate=5e-1, n_epochs=1000, verbose=2)
print(lm.evaluate(val_x, val_y))
print(lm.coef_, lm.intercept_)
# fit via ordinal least square
lm.fit_ols(train_x, train_y)
print(lm.evaluate(val_x, val_y))
print(lm.coef_, lm.intercept_)
1
2
3
4
5
6
7
8
9
# result of mini-batch gradient descent
{'mse': 1686.1390843240692}
[ -17.21044662 -206.78540224 528.58121669 284.19942111 -59.31520498
-136.10297667 -226.90958011 130.10864544 434.07454611 95.87515587] 151.3963140083344

# result of ordinal least square
{'mse': 1712.1583441068672}
[ -35.55683674 -243.1692265 562.75404632 305.47203008 -662.78772128
324.27527477 24.78193291 170.33056502 731.67810787 43.02846824] 152.53813351954057

Scikit-learn API

1
2
3
4
5
6
from sklearn.linear_model import LinearRegression, SGDRegressor

lm = LinearRegression(fit_intercept=True) # or SGDRegressor
lm.fit(train_x, train_y)
print(mse_loss(val_y, lm.predict(val_x)))
print(lm.coef_, lm.intercept_)
1
2
3
mse: 1712.1583441068674
[ -35.55683674 -243.1692265 562.75404632 305.47203008 -662.78772128
324.27527477 24.78193291 170.33056502 731.67810787 43.02846824] 152.53813351954062