線性回歸（Linear Regression）

Linear regression 是一個資料分析技術，且使用線性函數來預測未知的資料。雖然 linear regression 模型相對簡單，但它是一個成熟的統計技術。

完整程式碼可以在下載。

線性回歸
1. 線性回歸模型（Linear Regression Model）
2. 成本函數（Cost Function）
求解參數 w 和 b
多元線性迴歸（Multiple Linear Regression）
結語
參考

線性回歸

線性回歸模型（Linear Regression Model）

Linear regression 的 model function 如下。其中 w 和 b 是參數。當我們將變數 x 帶入後，函數 f_w,b 會回傳一個預測值 ŷ（y hat）。

$\hat{y}^{(i)}=f_{w,b}(x^{(i)})=wx^{(i)}+b \\\\ \text{paramters}: w,b \\\\ i: i \text{-th example}$

要注意的是，f_w,b 回傳的是一個預測值 ŷ（prediction value），而非真正的值 y（target value），如下圖所示。當 f_w,b 預測出來的 ŷ 很接近 y 時，則我們可以說 f_w,b 的準確率很高。我們透過調整 w 和 b 來提高 f_w,b 的準確率。

Linear regression's model function. — Linear regression’s model function.

以下的程式碼中，f_wb() 實作了 linear regression model。

def f_wb(x, w, b):
    """
    Linear regression model

    Parameters
    ----------
    x: (ndarray (m,)) or (scalar)
        m is the number of samples
    w: (scalar)
    b: (scalar)

    Returns
    -------
    (ndarray (m,)) or (scalar)
    """

    return w * x + b

成本函數（Cost Function）

在介紹如何求得 w 和 b 之前，我們要先了解什麼是 cost function。Cost function 是用來衡量 f_w,b 的 prediction values 與 target values 之間的殘差（residual），也就是它們之間的差值。有了 cost function，我們就可以衡量調整後的 w 和 b 是否比原來的好。

Linear regression 的 cost function 是 squared error，如下。

$J(w,b)=\frac{1}{2m} \displaystyle\sum_{i=1}^{m}(\hat{y}^{(i)}-y^{(i)})^2=\frac{1}{2m} \displaystyle\sum_{i=1}^{m}(f_{w,b}(x^{(i)})-y^{(i)})^2 \\\\ \text{Find } w,b: \hat{y}^{(i)} \text{ is close to } y^{(i)} \text{ for all } (x^{(i)},y^{(i)}) \text{, where } i=0...m \\\\ m: \text{number of examples} \\\\ \text{Objective: } minimize_{w,b}J(w,b)$

我們最終想要找到一組 w 和 b，使得 J(w,b) 會是最小。

Linear regression's cost function: Square error. — Linear regression’s cost function: Square error.

以下的程式碼中，compute_cost() 實作了 cost function J(w,b)。

import numpy as np


def compute_cost(x, y, w, b):
    """
    Compute cost

    Parameters
    ----------
    x: (ndarray (m,))
        m is the number of samples
    y: (ndarray (m,))
    w: (scalar)
    b: (scalar)

    Returns
    -------
    (scalar)
    """

    m = x.shape[0]
    y_hat = f_wb(x, w, b)
    return 1 / (2 * m) * np.sum((y_hat - y) ** 2)

求解參數 `w` 和 `b`

Linear regression 的 model function 很簡單，然而問題是要如何求解 w 和 b，並使得 J(w,b) 會是最小呢？在 machine learning 中，我們使用的是梯度下降（gradient descent）來求解 w 和 b。不過我們會先介紹另外兩種方法，來幫助讀者了解 gradient descent 的優點。

在介紹各個方法時，我們會使用以下的範例來講解。

import matplotlib.pyplot as plt
import numpy as np

x_train = np.array([10, 20, 32, 44, 66])
y_train = np.array([10, 30, 40, 44, 54])

plt.scatter(x_train, y_train, 60, marker='o', c='c', label='y_train')
plt.xlabel('x_train')
plt.ylabel('y_train')
plt.legend()

聯立方程式（Simultaneous Equations）

我們都知道平面上的任意兩點可構成一條直線。我們挑選 (10, 10) 和 (66, 54) 來計算 w 和 b。

$\left\{\begin{matrix} 10=10w+b \hspace{1cm} (1) \\ 54=66w+b \hspace{1cm} (2) \end{matrix}\right. \\\\ \\\\ (2)-(1) \\\\ \Rightarrow 44=56w \\\\ \Rightarrow w=\frac{11}{14} \\\\ \\\\ \text{Substiture } w \text{ with } \frac{11}{14} \text{ in }(1)\\\\ \Rightarrow 10=10 \times \frac{11}{14}b \\\\ \Rightarrow b=10 - \frac{110}{14}=\frac{30}{14} \\\\ \therefore y=\frac{11}{14}x+\frac{30}{14}$

我們將計算出來的 w 和 b 代入 model function，並在座標圖上畫出用 simultaneous equations 算出的 model function 以及 prediction values。

# simultaneous equations
w_se = 11 / 14  # 0.7857
b_se = 30 / 14  # 2.1428
y_hat_se = f_wb(x_train, w_se, b_se)
plt.scatter(x_train, y_train, 60, marker='o', c='c', label='y_train')
plt.plot(x_train, y_hat_se, 'ro-', label='y_hat_se')
plt.xlabel('x_train')
plt.ylabel('y_train and y_hat_se')
plt.legend()

以下顯示該 model 的 cost。利用範例中的 5 個點，我們可以計算出 10 組 w 和 b，然後計算它們的 cost。最後，選擇 cost 最小的 w 和 b。

cost_se = compute_cost(x_train, y_train, w_se, b_se)
cost_se

# Output:
# np.float64(36.21836734693877)

使用聯立方程式來求解 w 和 b 看似簡單，然而當 training examples 多的時候，計算會趨於複雜。此外，如果最佳的 w 和 b 的線不會經過任何一個 training example，那我們將無法找到此 w 和 b。

最小平方法（Least Squares）

Least squares 是一個統計學的優化建模方法。它求解的 w 和 b 可使 residual 的平方和最小，也就是 cost 最小。Least squares 的概念是在 model function 中加入一個誤差值 ε（epsilon），來使得 prediction value 可以等於 target value。而且，盡量地使 ε 最小。

將所有的 training examples 代入 model function。可以看出，每一個 training example 的 ε 都可能不同。

$\left\{\begin{matrix} 10=10w+b+\epsilon_1 \\ 30=20w+b+\epsilon_2 \\ 40=32w+b+\epsilon_3 \\ 44=44w+b+\epsilon_4 \\ 54=66w+b+\epsilon_5 \\ \end{matrix}\right. \hspace{0.5cm} \Rightarrow \hspace{0.5cm} \left\{\begin{matrix} \epsilon_1=10-10w-b \\ \epsilon_2=30-20w-b \\ \epsilon_3=40-32w-b \\ \epsilon_4=44-44w-b \\ \epsilon_5=54-66w-b \\ \end{matrix}\right.$

接著，計算每個 ε 的平方。

$\epsilon_1^2 = (10-10w-b)^2 = 100+100w^2+b^2-200w+20wb-20b \\\\ \epsilon_2^2 = (30-20w-b)^2 = 900+400w^2+b^2-1200w+40wb-60b \\\\ \epsilon_3^2 = (40-32w-b)^2 = 1600+1024w^2+b^2-2560w+64wb-80b \\\\ \epsilon_4^2 = (44-44w-b)^2 = 1936+1936w^2+b^2-3872w+88wb-80b \\\\ \epsilon_5^2 = (54-66w-b)^2 = 2916+4356w^2+b^2-7128w+132wb-108b$

計算所有 ε 的平方的和。然後，我們希望求解出一組 w 和 b，使得所有 ε 的平方的和為最小，也就是越接近零越好。

$\epsilon_1^2+\epsilon_2^2+\epsilon_3^2+\epsilon_4^2+\epsilon_5^2 \\\\ = (100+100w^2+b^2-200w+20wb-20b) \\\\ + (900+400w^2+b^2-1200w+40wb-60b) \\\\ + (1600+1024w^2+b^2-2560w+64wb-80b) \\\\ + (1936+1936w^2+b^2-3872w+88wb-80b) \\\\ + (2916+4356w^2+b^2-7128w+132wb-108b) \\\\ = 7452+7816w^2+5b^2-14960w+344wb-348b$

利用配方法（completing the square）來化簡式子。

$7816w^2+344bw-14960w+5b^2-348b+7452 \\\\ =7816w^2+(344b-14960)w+5b^2-348b+7452 \\\\ =7816(w^2+\frac{344b-14960}{7816}w)+5b^2-348b+7452 \\\\ =7816(w^2+2 \times \frac{344b-14960}{2 \times 7816}w)+5b^2-348b+7452 \\\\ =7816(w+\frac{43b-1870}{1954})^2-7816(\frac{43b-1870}{1954})^2+5b^2-348b+7452$

我們希望誤差可以最小的話，w 的值應該要如下：

$w=-\frac{43b-1870}{1954}$

這樣子的話，前面的部分就會等於零。我們繼續用配方法化簡剩下的部分。

$-7816(\frac{43b-1870}{1954})^2+5b^2-348b+7452 \\\\ =-2 \times \frac{1849b^2+3498770-1608206}{977}+5b^2-348b+7452 \\\\ =-\frac{3689}{977}b^2-\frac{6997540}{977}+\frac{321640}{977}b+5b^2-348b+7452 \\\\ =(5 \times \frac{3689}{977})b^2+(\frac{321640}{977}-348)b+(7452-\frac{6997540}{977}) \\\\ =\frac{1196}{977}b^2-\frac{18356}{977}b+\frac{283064}{977} \\\\ = \frac{1}{977}(1196b^2-18356b+283064) \\\\ =\frac{1}{977}(1196(b^2-\frac{4589}{299}b)+283064) \\\\ =\frac{1}{977}(1196(b-\frac{4589}{598})^2-(\frac{4589}{299})^2+283064)$

要讓上面的式子最小的話，那 b 的值應該要如下，這樣前面的部分就會等於零。

$b=\frac{4589}{598}$

將 b 值代入到 w 的式子裡，並得出 w 的值。

$\text{Substitute } b \text{ in } w: \\\\ w=-\frac{43b-1870}{1954} \\\\ \hphantom{w}=-\frac{43 \times \frac{4589}{598}-1870}{1954} \\\\ \hphantom{w}=-\frac{43 \times 4589 - 1870 \times 598}{1954 \times 598} \\\\ \hphantom{w}=-\frac{197327-1118260}{1168492} \\\\ \hphantom{w}=\frac{920933}{1168492}$

將算出來的 w 和 b 代入 model function，則可以得出我們要的 model。

$w=\frac{920933}{1158492} \approx 0.788 \\\\ b=\frac{4589}{598} \approx 7.674 \\\\ \therefore y=0.788x+7.674$

以下的程式碼中，我們在座標圖上畫出用 least squares 算出的 model function 以及 prediction values。

# least squares
w_ls = 0.788
b_ls = 7.674

y_hat_ls = f_wb(x_train, w_ls, b_ls)
plt.scatter(x_train, y_train, 60, marker='o', c='c', label='y_train')
plt.plot(x_train, y_hat_se, 'm^-', label='y_hat_se')
plt.plot(x_train, y_hat_ls, 'gs-', label='y_hat_ls')
plt.xlabel('x_train')
plt.ylabel('y_train, y_hat_se and y_hat_ls')
plt.legend()

以下顯示該 model 的 cost。

cost_ls = compute_cost(x_train, y_train, w_ls, b_ls)
cost_ls

# Output
# np.float64(15.953221200000002)

相較於用聯立方程式來計算 w 和 b，最小平方法可以算出較好的 w 和 b。但是，當 training examples 很多的時候，整個計算過程會異常複雜。

梯度下降（Gradient Descent）

Gradient descent 是一個最佳化演算法，用來找到一個函數的局部值。其基本原理是，首先，隨便挑選一組 w 和 b，所以可能會在 cost function 的左邊或是右邊，然後下一組選擇谷底方向的 w 和 b。重複此動作直到 J(w,b) 無法再更小。

下一個問題是要如何知道 cost function 的谷底是在哪個方向呢？我們可以對 cost function J(w,b) 做微分，然後就可以得到目前的斜率（slope）。有了斜率，我們就知道要往哪邊移動。

以下為 gradient descent 的演算法。首先，先隨機選一組 w 和 b，或直接設為零。Cost function 的導數乘上一個 learning rate $\alpha$ 會是 w 和 b 要下降的梯度。此外，我們還會給定一個 iteration 次數，代表執行 gradient descent 的次數。因此，我們也許不會得到最佳的 w 和 b。

$\text{repeat until convergence } \{ \\\\ \phantom{xxxx} w=w-\alpha \frac{\partial J(w,b)}{\partial w} \\\\ \phantom{xxxx} b=b-\alpha \frac{\partial J(w,b)}{\partial b} \\\\ \}$

J(w,b) 對 w 的偏導數（partial derivative）為：

$\frac{\partial J(w,b)}{\partial w} \\\\ =\frac{\partial}{\partial w} \times \frac{1}{2m} \displaystyle\sum^{m}_{i=1}(f_{w,b}(x^{(i)})-y^{(i)})^2 \\\\ =\frac{\partial}{\partial w} \times \frac{1}{2m} \displaystyle\sum^{m}_{i=1}(wx^{(i)}+b-y^{(i)})^2 \\\\ =\frac{\partial}{\partial w} \times \frac{1}{2m} \displaystyle\sum^{m}_{i=1}((x^{(i)})^2w^2+b^2+(y^{(i)})^2+2bx^{(i)}w-2y^{(i)}x^{(i)}w-2by^{(i)}) \\\\ =\frac{1}{2m} \displaystyle\sum^{m}_{i=1}(2(x^{(i)})^2w+2bx^{(i)}-2y^{(i)}x^{(i)}) \\\\ =\frac{1}{2m} \displaystyle\sum^{m}_{i=1}2x^{(i)}(wx^{(i)}+b-y^{(i)}) \\\\ =\frac{1}{m} \displaystyle\sum^{m}_{i=1}(wx^{(i)}+b-y^{(i)})x^{(i)} \\\\ =\frac{1}{m} \displaystyle\sum^{m}_{i=1} (f_{w,b}(x^{(i)})-y^{(i)}) x^{(i)}$

J(w,b) 對 b 的偏導數為：

$\frac{\partial J(w,b)}{\partial w} \\\\ =\frac{\partial}{\partial w} \times \frac{1}{2m} \displaystyle\sum^{m}_{i=1}(f_{w,b}(x^{(i)})-y^{(i)})^2 \\\\ =\frac{\partial}{\partial w} \times \frac{1}{2m} \displaystyle\sum^{m}_{i=1}(wx^{(i)}+b-y^{(i)})^2 \\\\ =\frac{\partial}{\partial w} \times \frac{1}{2m} \displaystyle\sum^{m}_{i=1}((x^{(i)})^2w^2+b^2+(y^{(i)})^2+2bx^{(i)}w-2y^{(i)}x^{(i)}w-2by^{(i)}) \\\\ =\frac{1}{2m} \displaystyle\sum^{m}_{i=1}(2b+2x^{(i)}w-2y^{(i)}) \\\\ =\frac{1}{2m} \displaystyle\sum^{m}_{i=1}(x^{(i)}w+b-y^{(i)})2 \\\\ =\frac{1}{m} \displaystyle\sum^{m}_{i=1}(x^{(i)}w+b-y^{(i)}) \\\\ =\frac{1}{m} \displaystyle\sum^{m}_{i=1}(f_{w,b}(x^{(i)})-y^{(i)})$

以下的程式碼實作 J(w,b) 對 w 和 b 的偏導數計算。

def compute_gradient(x, y, w, b):
    """
    Compute the gradient for linear regression

    Parameters
    ----------
    x: (ndarray (m,))
        m is the number of samples
    y: (ndarray (m,))
    w: (scalar)
    b: (scalar)

    Returns
    -------
    dj_dw: (scalar)
    dj_db: (scalar)
    """

    m = x.shape[0]
    dj_dw = 1 / m * np.sum((f_wb(x, w, b) - y) * x)
    dj_db = 1 / m * np.sum(f_wb(x, w, b) - y)
    return dj_dw, dj_db

以下的程式碼實作 gradient descent。其中參數 alpha 是 learning rate，而 epochs 是 iteration 次數。

def perform_gradient_descent(x, y, w_init, b_init, alpha, epochs):
    """
    Perform gradient descent

    Parameters
    ----------
    x: (ndarray (m,))
        m is the number of samples
    y: (ndarray (m,))
    w_init: (scalar)
    b_init: (scalar)
    alpha: (float)
    epochs: (int)

    Returns
    -------
    w: (scalar)
    b: (scalar)
    """

    w = w_init
    b = b_init
    for i in range(epochs):
        dj_dw, dj_db = compute_gradient(x, y, w, b)
        w -= alpha * dj_dw
        b -= alpha * dj_db
    return w, b

以下的程式碼中，我們令 learning rate 為 0.001，以及 iteration 次數為 3000 次。然後，在座標圖上畫出用 gradient descent 算出的 model function 以及 prediction values。

w_dg, b_dg = perform_gradient_descent(x_train, y_train, 0, 0, 0.001, 3000)
y_hat_gd = f_wb(x_train, w_dg, b_dg)

plt.scatter(x_train, y_train, 60, marker='o', c='c', label='y_train')
plt.plot(x_train, y_hat_se, 'm^-', label='y_hat_se')
plt.plot(x_train, y_hat_ls, 'gs-', label='y_hat_ls')
plt.plot(x_train, y_hat_gd, 'y*-', label='y_hat_gd')
plt.xlabel('x_train')
plt.ylabel('y_train, y_hat_se, y_hat_ls, and y_hat_gd')
plt.legend()

以下顯示該 model 的 cost。

cost_gd = compute_cost(x_train, y_train, w_dg, b_dg)
cost_gd

# Output
# np.float64(18.014426543403744)

雖然在推導 cost function 的導數過程有些複雜，但是整體演算法不會太複雜。當 training examples 多的時候，程式碼也不會變複雜。

以下列出由以上三個方法算出的一些數值，供讀者比較和參考。

(w_se, b_se), (w_ls, b_ls), (w_dg, b_dg)

# Output
((0.7857142857142857, 2.142857142857143),
 (0.788, 7.674),
 (np.float64(0.8312661524754889), np.float64(5.714916639215474)))

(y_train, y_hat_se, y_hat_ls, y_hat_gd)

# Output
(array([10, 30, 40, 44, 54]),
 array([10.        , 17.85714286, 27.28571429, 36.71428571, 54.        ]),
 array([15.554, 23.434, 32.89 , 42.346, 59.682]),
 array([14.02757816, 22.34023969, 32.31543352, 42.29062735, 60.5784827 ]))

(cost_se, cost_ls, cost_gd)

# Output
(np.float64(36.21836734693877),
 np.float64(15.953221200000002),
 np.float64(18.014426543403744))

Learning Rate

在 gradient descent 中，有一個 learning rate 叫 $\alpha$ 。由 gradient descent 演算法可以看出，當 $\alpha$ 較小時，需要較多的次數才能接近谷底，因此效能會較差。反之，則會比較快速地到達底谷。但是，當 $\alpha$ 太大時，則有可能會無法到達谷底。

多元線性迴歸（Multiple Linear Regression）

至目前為止，我們介紹的是 simple linear regression，只有一個變數，實用性可能不大。接下來，我們將介紹 multiple linear regression。

多元線性迴歸模型（Multiple Linear Regression Model）

Multiple linear regression 有一個以上的變數，如下是 4 個變數的 model function。

$f_{w,b}(x)=w_1x_1+w_2x_2+w_3x_3+w_4x_4+b$

所以，multiple linear regression 的 model function 如下。

$f_{w,b}(x^{(i)}) =w_1x_1^{(i)}+w_2x_2^{(i)}+\cdots+w_nx_n^{(i)}+b$

向量化的 model function 則如下，其中 $\vec{x}, \vec{w}$ 是 vectors。

$f_{\vec{w},b}(\vec{x}^{(i)})=\displaystyle\sum_{j=1}^nw_jx_j+b=\vec{w}\cdot \vec{x}^{(i)}+b \\\\ \vec{w}=[w_1,w_2,\cdots,w_n] \\\\ \vec{x}^{(i)}=[x_1^{(i)},x_2^{(i)},\cdots,x_n^{(i)}]$

如果我們將所有的 training examples 放在一起，就會形成一個陣列（matrix）。一般我們會用大寫的 X 來代表 training examples 形成的 matrix。因此，model function 可寫成如下：

$f_{w,b}(X)=X\cdot w+b \\\\ X=\begin{bmatrix} x_1^{(1)} & x_2^{(1)} & \cdots & x_{n}^{(1)} \\ x_1^{(2)} & x_2^{(2)} & \cdots & x_{n}^{(2)} \\ \vdots \\ x_1^{(m)} & x_2^{(m)} & \cdots & x_{n}^{(m)} \\ \end{bmatrix}, w=\begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_{n} \end{bmatrix}, b \text{ is a scalar} \\\\ y=\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_{m} \end{bmatrix} \\\\ m: \text{number of examples} \\\\ n: \text{number of features in each example} \\\\ x_j^i: j \text{-th feature } \text{ in } i \text{-th example}$

向量化的 f_w,b 和 simple linear regression 的 model function 很像，它們的實作也差不多。差別在於，向量化的 f_w,b 使用內積（dot product）來相乘 X 和 w。

import numpy as np


def f_wb(X, w, b):
    """
    Linear regression model

    Parameters
    ----------
    X: (ndarray (m, n))
        m is the number of samples, n is the number of features
    w: (ndarray (n,))
    b: (scalar)

    Returns
    -------
    (ndarray (m,))
    """

    return np.dot(X, w) + b

成本函數（Cost Function）

Multiple linear regression 的 cost function 如下。

$J(\vec{w},b)=\frac{1}{2m}\displaystyle\sum_{i=1}^{m}(f_{\vec{w},b}(\vec{x}^{(i)})-y^{(i)})^2 \\\\ f_{\vec{w},b}(\vec{x}^{(i)})=\vec{w}\cdot \vec{x}^{(i)}+b$

以下程式碼中，compute_cost() 實作了向量化的 cost function。

import numpy as np


def compute_cost(X, y, w, b):
    """
    Compute cost

    Parameters
    ----------
    X: (ndarray (m, n))
        m is the number of samples, n is the number of features
    y: (ndarray (m,))
    w: (ndarray (n,))
    b: (scalar)

    Returns
    -------
    (scalar)
    """

    m = X.shape[0]
    y_hat = f_wb(X, w, b)
    return 1 / (2 * m) * np.sum((y_hat - y) ** 2)

梯度下降（Gradient Descent）

Multiple linear regression 的 gradient descent 如下。從式子中可以了解到，我們要計算每一個 feature 的參數 w_j。然後，計算 J(w,b) 對每一個 w_j 的偏導數。

$\text{repeat until convergence: \{} \\ \phantom{xxxx} w_j=w_j-\alpha\frac{\partial J(w,b)}{\partial w_j}, \text{ for } j=1\dots n \\ \phantom{xxxx} b=b-\alpha\frac{\partial J(w,b)}{\partial b} \\ \}$

J(w,b) 對每一個 w_j 和 b 的偏導數如下：

$\frac{\partial J(w,b)}{\partial w_j}=\frac{1}{m} \displaystyle\sum^{m}_{i=1} (f_{w,b}(x^{(i)})-y^{(i)})x^{(i)}_j \\\\ \frac{\partial J(w,b)}{\partial b}=\frac{1}{m} \displaystyle\sum^{m}_{i=1} (f_{w,b} (x^{(i)})-y^{(i)})$

以下的程式碼實作 J(w,b) 對 w_j 和 b 的偏導數計算。

import numpy as np


def compute_gradient(X, y, w, b):
    """
    Compute the gradient for linear regression

    Parameters
    ----------
    X: (ndarray (m, n))
        m is the number of samples, n is the number of features
    y: (ndarray (m,))
    w: (ndarray (n,))
    b: (scalar)

    Returns
    -------
    dj_dw: (ndarray (n,))
    dj_db: (scalar)
    """

    m = X.shape[0]
    y_hat = f_wb(X, w, b)
    dj_dw = 1 / m * np.dot(X.T, y_hat - y)
    dj_db = 1 / m * np.sum(y_hat - y)
    return dj_dw, dj_db

以下的程式碼實作 gradient descent。

import numpy as np


def perform_gradient_descent(X, y, w_init, b_init, alpha, epochs):
    """
    Perform gradient descent

    Parameters
    ----------
    X: (ndarray (m, n))
        m is the number of samples, n is the number of features
    y: (ndarray (m,))
    w_init: (ndarray (n,))
    b_init: (scalar)
    alpha: (scalar)
    epochs: (int)

    Returns
    -------
    w: (ndarray (n,))
    b: (scalar)
    """

    w = w_init
    b = b_init
    for i in range(epochs):
        dj_dw, dj_db = compute_gradient(X, y, w, b)
        w -= alpha * dj_dw
        b -= alpha * dj_db
        print(f'Epoch {i + 1}, Cost: {compute_cost(X, y, w, b)}')
    return w, b

以下的程式碼中，我們令 learning rate 為 0.0000001，以及 iteration 次數為 1000 次。然後，輸出 cost 在每個 iteration 的變化。可以看出，cost 在一開始時下降很快，之後越來越慢。

X_train = np.array([[2104, 5, 1, 45], [1416, 3, 2, 40], [852, 2, 1, 35]])
y_train = np.array([460, 232, 178])
w, b = perform_gradient_descent(X_train, y_train, np.zeros(X_train.shape[1]), 0., 0.0000001, 1000)
(w, b)

# Output
# Epoch 1, Cost: 28989.111501449268
# Epoch 2, Cost: 17092.48916583246
# Epoch 3, Cost: 10198.319962562276
# Epoch 4, Cost: 6203.104159134617
# Epoch 5, Cost: 3887.8502848154976
# Epoch 6, Cost: 2546.145030983679
# Epoch 7, Cost: 1768.6173933475704
# Epoch 8, Cost: 1318.0342718091906
# Epoch 9, Cost: 1056.9175763423789
# Epoch 10, Cost: 905.59787958039
# Epoch 11, Cost: 817.9062417617257
# Epoch 12, Cost: 767.0874687152398
# Epoch 13, Cost: 737.6367552876493
# Epoch 14, Cost: 720.5689683005712
# Epoch 15, Cost: 710.6771659424062
# ...
# Epoch 995, Cost: 694.9420356037724
# Epoch 996, Cost: 694.9399054268631
# Epoch 997, Cost: 694.9377752873567
# Epoch 998, Cost: 694.9356451852516
# Epoch 999, Cost: 694.9335151205503
# Epoch 1000, Cost: 694.9313850932496
# (array([ 0.20253263,  0.00112386, -0.00213202, -0.00933401]), np.float64(-0.0003572563919114557))

以下顯示利用算出來的 w 和 b 來預測 ŷ，並計算 ŷ 和 y 的 residual。

y_hat = f_wb(X_train, w, b)
(y_train, y_hat)

# Output
# (array([460, 232, 178]), array([425.71175692, 286.41159657, 172.23087045]))

cost = compute_cost(X_train, y_train, w, b)
cost

# Output
# np.float64(694.9313850932496)

結語

Linear regression 是一個相較簡單的模型。很適合用來講解模型背後的數學原理。不過，它還是很實用的。Simple linear regression 只有一個變數，實用性可能不大。實際上，我們會使用向量化的 multiple linear regression，因為通常變數會大於一個，而且可能會很多。

參考

Andrew Ng, Machine Learning Specialization, Coursera.
西内啓，機器學習的數學基礎 : AI、深度學習打底必讀，旗標。

線性回歸（Linear Regression）

Share

Table of Contents

線性回歸

線性回歸模型（Linear Regression Model）

成本函數（Cost Function）

求解參數 w 和 b

聯立方程式（Simultaneous Equations）

最小平方法（Least Squares）

梯度下降（Gradient Descent）

Learning Rate

多元線性迴歸（Multiple Linear Regression）

多元線性迴歸模型（Multiple Linear Regression Model）

成本函數（Cost Function）

梯度下降（Gradient Descent）

結語

參考

Related Tags

2 comments

發佈留言 取消回覆

You May Also Like

求解參數 `w` 和 `b`

發佈留言取消回覆