This exercise is about understanding the bias and variance components of generalization error.

The code in this chapter provides a Python-based solution to the fifth programming exercise of the Coursera Machine Learning course. Please refer to the exercise handout for detailed instructions and formulas.

Importing modules

  • NumPy: adds Python support for large, multi-dimensional arrays and matrices, along with high-level mathematical functions to operate on them.
  • SciPy: builds on NumPy and collects a range of mathematical algorithms and convenience functions. It offers high-level commands and classes for data manipulation and visualization, making it a powerful tool for interactive Python sessions.
  • Pandas: a Python library for data manipulation and analysis, providing data structures and operations for numerical tables and time series.
  • Matplotlib: the most commonly used plotting library in Python, capable of producing high-quality figures in cross-platform interactive environments. The higher-level plotting library Seaborn was later built on top of it.

In short: use Pandas when you want to understand and manipulate the data at hand, NumPy and SciPy when you need to perform heavy computation, and Matplotlib when you want to visualize the results.

import numpy as np
import scipy.io as sio
import scipy.optimize as opt
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Loading the data

  • A quick map() example
    • map() applies a given function to every element of a sequence.
    • It calls function with each element of the iterable(s) in turn and collects the return values.
    • In Python 3.x, map(function, iterable, ...) returns an iterator, so wrap it in list() to see the values.
map(lambda x, y: x + y, [1, 3, 5, 7, 9], [2, 4, 6, 8, 10])
<map at 0x18734e11a08>
list(map(lambda x, y: x + y, [1, 3, 5, 7, 9], [2, 4, 6, 8, 10]))
[3, 7, 11, 15, 19]
list(map(lambda x: x ** 2, [1, 2, 3, 4, 5]))
[1, 4, 9, 16, 25]
  • Load the data
d = sio.loadmat('ex5data1.mat')
list(map(np.ravel, [d['X'], d['y'], d['Xval'], d['yval'], d['Xtest'], d['ytest']]))
[array([-15.93675813, -29.15297922,  36.18954863,  37.49218733,
        -48.05882945,  -8.94145794,  15.30779289, -34.70626581,
          1.38915437, -44.38375985,   7.01350208,  22.76274892]),
 array([ 2.13431051,  1.17325668, 34.35910918, 36.83795516,  2.80896507,
         2.12107248, 14.71026831,  2.61418439,  3.74017167,  3.73169131,
         7.62765885, 22.7524283 ]),
 array([-16.74653578, -14.57747075,  34.51575866, -47.01007574,
         36.97511905, -40.68611002,  -4.47201098,  26.53363489,
        -42.7976831 ,  25.37409938, -31.10955398,  27.31176864,
         -3.26386201,  -1.81827649, -40.7196624 , -50.01324365,
        -17.41177155,   3.5881937 ,   7.08548026,  46.28236902,
         14.61228909]),
 array([ 4.17020201e+00,  4.06726280e+00,  3.18730676e+01,  1.06236562e+01,
         3.18360213e+01,  4.95936972e+00,  4.45159880e+00,  2.22763185e+01,
        -4.38738274e-05,  2.05038016e+01,  3.85834476e+00,  1.93650529e+01,
         4.88376281e+00,  1.10971588e+01,  7.46170827e+00,  1.47693464e+00,
         2.71916388e+00,  1.09269007e+01,  8.34871235e+00,  5.27819280e+01,
         1.33573396e+01]),
 array([-33.31800399, -37.91216403, -51.20693795,  -6.13259585,
         21.26118327, -40.31952949, -14.54153167,  32.55976024,
         13.39343255,  44.20988595,  -1.14267768, -12.76686065,
         34.05450539,  39.22350028,   1.97449674,  29.6217551 ,
        -23.66962971,  -9.01180139, -55.94057091, -35.70859752,
          9.51020533]),
 array([ 3.31688953,  5.39768952,  0.13042984,  6.1925982 , 17.08848712,
         0.79950805,  2.82479183, 28.62123334, 17.04639081, 55.38437334,
         4.07936733,  8.27039793, 31.32355102, 39.15906103,  8.08727989,
        24.11134389,  2.4773548 ,  6.56606472,  6.0380888 ,  4.69273956,
        10.83004606])]
d['X'].shape
(12, 1)
d['y']
array([[ 2.13431051],
       [ 1.17325668],
       [34.35910918],
       [36.83795516],
       [ 2.80896507],
       [ 2.12107248],
       [14.71026831],
       [ 2.61418439],
       [ 3.74017167],
       [ 3.73169131],
       [ 7.62765885],
       [22.7524283 ]])

Data-loading function

def load_data():
    """for ex5
    d['X'] shape = (12, 1)
    pandas has trouble taking this 2d ndarray to construct a DataFrame, so I ravel
    the results
    """
    d = sio.loadmat('ex5data1.mat')
    return map(np.ravel, [d['X'], d['y'], d['Xval'], d['yval'], d['Xtest'], d['ytest']])
X, y, Xval, yval, Xtest, ytest = load_data()

Constructing a DataFrame

df = pd.DataFrame({'water_level': X, 'flow': y})  # construct from a dict: a DataFrame is a 2-D structure whose data are arranged in rows and columns
df

water_level flow
0 -15.936758 2.134311
1 -29.152979 1.173257
2 36.189549 34.359109
3 37.492187 36.837955
4 -48.058829 2.808965
5 -8.941458 2.121072
6 15.307793 14.710268
7 -34.706266 2.614184
8 1.389154 3.740172
9 -44.383760 3.731691
10 7.013502 7.627659
11 22.762749 22.752428

seaborn.lmplot (regression plot)

sns.lmplot(x='water_level', y='flow', data=df, fit_reg=False, height=5)  # keyword args and height= for current seaborn versions
plt.show()

(figure: scatter plot of flow against water_level)

sns.lmplot(x='water_level', y='flow', data=df, fit_reg=True, height=5)
plt.show()

(figure: the same scatter with a fitted regression line)

X, Xval, Xtest = [np.insert(x.reshape(x.shape[0], 1), 0, np.ones(x.shape[0]), axis=1) for x in (X, Xval, Xtest)]
X
array([[  1.        , -15.93675813],
       [  1.        , -29.15297922],
       [  1.        ,  36.18954863],
       [  1.        ,  37.49218733],
       [  1.        , -48.05882945],
       [  1.        ,  -8.94145794],
       [  1.        ,  15.30779289],
       [  1.        , -34.70626581],
       [  1.        ,   1.38915437],
       [  1.        , -44.38375985],
       [  1.        ,   7.01350208],
       [  1.        ,  22.76274892]])

Cost function

$$J(\theta)= \frac{1}{2m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})^2$$

def cost(theta, X, y):
    """
    X: R(m*n), m records, n features
    y: R(m)
    theta: R(n), linear regression parameters
    """
    m = X.shape[0]

    inner = X @ theta - y  # R(m)

    # 1*m @ m*1 = 1*1 in matrix multiplication, but NumPy does not transpose
    # 1-d arrays, so this is simply the inner product of the vector with itself
    square_sum = inner.T @ inner
    cost = square_sum / (2 * m)

    return cost
theta = np.ones(X.shape[1])

cost(theta, X, y)
303.9515255535976
  • The @ operator performs matrix multiplication.
a = np.array([[1,3],[2,4]])
b = np.array([[2,1],[3,3]])
a @ b
array([[11, 10],
       [16, 14]])

Gradient

$$ \theta_j := \theta_j - \alpha \frac{\partial J(\theta)}{\partial \theta_j}$$
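The partial derivative required by this update rule is the standard linear-regression gradient; in vectorized form (this identity is what the code below implements):

$$\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m}\sum_{i=1}^{m}\left(h_{\theta}(x^{(i)})-y^{(i)}\right)x_j^{(i)}, \qquad \nabla_{\theta} J(\theta) = \frac{1}{m}X^T(X\theta - y)$$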

def gradient(theta, X, y):
    m = X.shape[0]

    inner = X.T @ (X @ theta - y)  # (n, m) @ (m,) -> (n,)

    return inner / m
gradient(theta, X, y)
array([-15.30301567, 598.16741084])

Regularized gradient

$$\frac{\partial J(\theta)}{\partial \theta_0} = \frac{1}{m}\sum_{i=1}^{m}\left(h_{\theta}(x^{(i)})-y^{(i)}\right)x_j^{(i)} \qquad \text{for } j = 0$$
$$\frac{\partial J(\theta)}{\partial \theta_j} = \left(\frac{1}{m}\sum_{i=1}^{m}\left(h_{\theta}(x^{(i)})-y^{(i)}\right)x_j^{(i)}\right)+\frac{\lambda}{m}\theta_j \qquad \text{for } j \geq 1$$

# gradient of the L2 penalty term is added to the plain gradient
def regularized_gradient(theta, X, y, l=1):
    m = X.shape[0]

    regularized_term = theta.copy()  # same shape as theta
    regularized_term[0] = 0  # don't regularize the intercept theta_0

    regularized_term = (l / m) * regularized_term

    return gradient(theta, X, y) + regularized_term
regularized_gradient(theta, X, y)
array([-15.30301567, 598.25074417])
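
As a quick sanity check (an aside, not part of the exercise), the analytic gradient can be compared against central finite differences of the cost; numeric_gradient is a helper introduced here purely for illustration:

def numeric_gradient(f, theta, eps=1e-6):
    """Approximate the gradient of f at theta with central finite differences."""
    grad = np.zeros_like(theta)
    for j in range(theta.size):
        step = np.zeros_like(theta)
        step[j] = eps  # perturb one coordinate at a time
        grad[j] = (f(theta + step) - f(theta - step)) / (2 * eps)
    return grad

# the two gradients should agree to several decimal places
print(np.allclose(numeric_gradient(lambda t: cost(t, X, y), theta),
                  gradient(theta, X, y)))  # expect True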

Fitting the data

Set the regularization parameter $\lambda=0$.

def linear_regression_np(X, y, l=1):
    """linear regression
    args:
        X: feature matrix, (m, n+1)  # with intercept x0=1
        y: target vector, (m, )
        l: lambda constant for regularization

    return: trained parameters
    """
    # init theta
    theta = np.ones(X.shape[1])

    # train it
    res = opt.minimize(fun=regularized_cost,
                       x0=theta,
                       args=(X, y, l),
                       method='TNC',
                       jac=regularized_gradient,
                       options={'disp': True})
    return res

def regularized_cost(theta, X, y, l=1):
    m = X.shape[0]

    regularized_term = (l / (2 * m)) * np.power(theta[1:], 2).sum()

    return cost(theta, X, y) + regularized_term
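
For reference, this implements the regularized cost from the exercise text, where the intercept $\theta_0$ is excluded from the penalty:

$$J(\theta)= \frac{1}{2m}\sum_{i=1}^{m}\left(h_{\theta}(x^{(i)})-y^{(i)}\right)^2 + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$$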
theta = np.ones(X.shape[1])

final_theta = linear_regression_np(X, y, l=0).get('x')
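
As a minimal cross-check (an aside; the exercise uses only the iterative optimizer), the unregularized optimum also has a closed form, so the normal equation should reproduce final_theta closely:

theta_ne = np.linalg.pinv(X.T @ X) @ X.T @ y  # normal equation, lambda = 0
print(theta_ne, final_theta)  # the two parameter vectors should nearly coincide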
b = final_theta[0] # intercept
m = final_theta[1] # slope

plt.scatter(X[:,1], y, label="Training data")
plt.plot(X[:, 1], X[:, 1]*m + b, label="Prediction")
plt.legend(loc=2)
plt.show()

(figure: training data with the fitted regression line)

training_cost, cv_cost = [], []

1. Fit the model using subsets of the training set.

2. Do not use regularization when computing the training cost and the cross-validation cost.

3. Remember to use the same training-set subset to compute the training cost.

m = X.shape[0]
for i in range(1, m + 1):
    # print('i={}'.format(i))
    res = linear_regression_np(X[:i, :], y[:i], l=0)

    tc = regularized_cost(res.x, X[:i, :], y[:i], l=0)
    cv = regularized_cost(res.x, Xval, yval, l=0)
    # print('tc={}, cv={}'.format(tc, cv))

    training_cost.append(tc)
    cv_cost.append(cv)
plt.plot(np.arange(1, m+1), training_cost, label='training cost')
plt.plot(np.arange(1, m+1), cv_cost, label='cv cost')
plt.legend(loc=1)
plt.show()

(figure: learning curves of the linear model)

This model does not fit the data well; it is underfitting.

Creating polynomial features

def prepare_poly_data(*args, power):
    """
    args: keep feeding in X, Xval, or Xtest
          will return in the same order
    """
    def prepare(x):
        # expand features
        df = poly_features(x, power=power)

        # normalization (`.values` replaces the deprecated `.as_matrix()`)
        ndarr = normalize_feature(df).values

        # add intercept term
        return np.insert(ndarr, 0, np.ones(ndarr.shape[0]), axis=1)

    return [prepare(x) for x in args]
def poly_features(x, power, as_ndarray=False):
    data = {'f{}'.format(i): np.power(x, i) for i in range(1, power + 1)}
    df = pd.DataFrame(data)

    return df.values if as_ndarray else df
X, y, Xval, yval, Xtest, ytest = load_data()
  • A step-by-step breakdown
data = {'f{}'.format(i): np.power(X, i) for i in range(1, 3 + 1)}
df = pd.DataFrame(data)
df

f1 f2 f3
0 -15.936758 253.980260 -4047.621971
1 -29.152979 849.896197 -24777.006175
2 36.189549 1309.683430 47396.852168
3 37.492187 1405.664111 52701.422173
4 -48.058829 2309.651088 -110999.127750
5 -8.941458 79.949670 -714.866612
6 15.307793 234.328523 3587.052500
7 -34.706266 1204.524887 -41804.560890
8 1.389154 1.929750 2.680720
9 -44.383760 1969.918139 -87432.373590
10 7.013502 49.189211 344.988637
11 22.762749 518.142738 11794.353058
# standardize using the normalize_feature() helper (defined in the next section);
# `.values` replaces the deprecated `.as_matrix()`
normalize_feature(df).values
array([[-3.62140776e-01, -7.55086688e-01,  1.82225876e-01],
       [-8.03204845e-01,  1.25825266e-03, -2.47936991e-01],
       [ 1.37746700e+00,  5.84826715e-01,  1.24976856e+00],
       [ 1.42093988e+00,  7.06646754e-01,  1.35984559e+00],
       [-1.43414853e+00,  1.85399982e+00, -2.03716308e+00],
       [-1.28687086e-01, -9.75968776e-01,  2.51385075e-01],
       [ 6.80581552e-01, -7.80028951e-01,  3.40655738e-01],
       [-9.88534310e-01,  4.51358004e-01, -6.01281871e-01],
       [ 2.16075753e-01, -1.07499276e+00,  2.66275156e-01],
       [-1.31150068e+00,  1.42280595e+00, -1.54812094e+00],
       [ 4.03776736e-01, -1.01501039e+00,  2.73378511e-01],
       [ 9.29375305e-01, -4.19807932e-01,  5.10968368e-01]])
poly_features(X, power=3)

f1 f2 f3
0 -15.936758 253.980260 -4047.621971
1 -29.152979 849.896197 -24777.006175
2 36.189549 1309.683430 47396.852168
3 37.492187 1405.664111 52701.422173
4 -48.058829 2309.651088 -110999.127750
5 -8.941458 79.949670 -714.866612
6 15.307793 234.328523 3587.052500
7 -34.706266 1204.524887 -41804.560890
8 1.389154 1.929750 2.680720
9 -44.383760 1969.918139 -87432.373590
10 7.013502 49.189211 344.988637
11 22.762749 518.142738 11794.353058
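
As an aside, the same expansion can be reproduced with scikit-learn; the exercise itself does not use sklearn, so treat this only as a cross-check:

from sklearn.preprocessing import PolynomialFeatures

# columns x, x^2, x^3 -- matches poly_features(X, power=3) up to column names
PolynomialFeatures(degree=3, include_bias=False).fit_transform(X.reshape(-1, 1))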

Preparing the polynomial regression data

  1. Expand the features up to degree 8, or to whatever degree you need.
  2. Use normalization to put the $x^n$ features on comparable scales.
  3. Don't forget the intercept term.
def normalize_feature(df):
    """Applies function along input axis (default 0) of DataFrame."""
    return df.apply(lambda column: (column - column.mean()) / column.std())  # z-score: subtract the column mean, divide by its standard deviation
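
Note that this helper normalizes each split with its own mean and standard deviation, whereas the course's reference implementation reuses the training split's statistics for the validation and test sets. A minimal sketch of that variant (normalize_with_train_stats is a name introduced here, not from the exercise):

def normalize_with_train_stats(train_df, *other_dfs):
    """Scale every split with the training split's mean/std."""
    mu, sigma = train_df.mean(), train_df.std()
    return [(d - mu) / sigma for d in (train_df, *other_dfs)]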
X_poly, Xval_poly, Xtest_poly= prepare_poly_data(X, Xval, Xtest, power=8)
X_poly[:3, :]
array([[ 1.00000000e+00, -3.62140776e-01, -7.55086688e-01,
         1.82225876e-01, -7.06189908e-01,  3.06617917e-01,
        -5.90877673e-01,  3.44515797e-01, -5.08481165e-01],
       [ 1.00000000e+00, -8.03204845e-01,  1.25825266e-03,
        -2.47936991e-01, -3.27023420e-01,  9.33963187e-02,
        -4.35817606e-01,  2.55416116e-01, -4.48912493e-01],
       [ 1.00000000e+00,  1.37746700e+00,  5.84826715e-01,
         1.24976856e+00,  2.45311974e-01,  9.78359696e-01,
        -1.21556976e-02,  7.56568484e-01, -1.70352114e-01]])

Plotting the learning curves

First, we use no regularization, so $\lambda=0$.

def plot_learning_curve(X, y, Xval, yval, l=0):
    training_cost, cv_cost = [], []
    m = X.shape[0]

    for i in range(1, m + 1):
        # regularization applies here for fitting parameters
        res = linear_regression_np(X[:i, :], y[:i], l=l)

        # remember, when you compute the cost here, you are computing the
        # non-regularized cost. Regularization is used to fit parameters only
        tc = cost(res.x, X[:i, :], y[:i])
        cv = cost(res.x, Xval, yval)

        training_cost.append(tc)
        cv_cost.append(cv)

    plt.plot(np.arange(1, m + 1), training_cost, label='training cost')
    plt.plot(np.arange(1, m + 1), cv_cost, label='cv cost')
    plt.legend(loc=1)
plot_learning_curve(X_poly, y, Xval_poly, yval, l=0)
plt.show()

(figure: learning curves of the polynomial model, lambda = 0)

You can see that the training cost is unrealistically low. This is overfitting.

Try $\lambda=1$:

plot_learning_curve(X_poly, y, Xval_poly, yval, l=1)
plt.show()

(figure: learning curves of the polynomial model, lambda = 1)

The training cost has increased a little and is no longer zero; in other words, we have reduced the overfitting.

Try $\lambda=100$:

plot_learning_curve(X_poly, y, Xval_poly, yval, l=100)
plt.show()

(figure: learning curves of the polynomial model, lambda = 100)

That is too much regularization: the model is now underfitting.

Finding the best $\lambda$

l_candidate = [0, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3, 10]
training_cost, cv_cost = [], []
for l in l_candidate:
    res = linear_regression_np(X_poly, y, l)

    tc = cost(res.x, X_poly, y)
    cv = cost(res.x, Xval_poly, yval)

    training_cost.append(tc)
    cv_cost.append(cv)
plt.plot(l_candidate, training_cost, label='training')
plt.plot(l_candidate, cv_cost, label='cross validation')
plt.legend(loc=2)

plt.xlabel('lambda')

plt.ylabel('cost')
plt.show()

(figure: training and cross-validation cost versus lambda)

# best cv I got from all those candidates
l_candidate[np.argmin(cv_cost)]
1
# use the test data to compute the cost
for l in l_candidate:
    theta = linear_regression_np(X_poly, y, l).x
    print('test cost(l={}) = {}'.format(l, cost(theta, Xtest_poly, ytest)))
test cost(l=0) = 10.055426362410126
test cost(l=0.001) = 11.001927632262907
test cost(l=0.003) = 11.26474655167747
test cost(l=0.01) = 10.880780731411715
test cost(l=0.03) = 10.022100517865269
test cost(l=0.1) = 8.63190793331871
test cost(l=0.3) = 7.3366077892272585
test cost(l=1) = 7.466283751156784
test cost(l=3) = 11.643941860536106
test cost(l=10) = 27.715080254176254

After tuning, $\lambda = 0.3$ is the best choice here, since it gives the smallest test cost.
