Linear regression is a data analysis technique that uses a linear function to predict values. Although the model is relatively simple, it is a mature statistical technique.
The complete code for this chapter can be found in .
Linear Regression
Linear Regression Model
The model function of linear regression is shown below, where w and b are parameters. When we substitute a value for x, the function f_{w,b} returns a predicted value ŷ (y hat).
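Concretely, the model is a straight line in x (this is what the f_wb() implementation below computes):

f_{w,b}(x) = wx + b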
It should be noted that f_{w,b} returns a prediction ŷ, not the real value y (the target value), as shown in the figure below. When the predictions ŷ are very close to y, we can say that the accuracy is high. We improve the accuracy of f_{w,b} by adjusting w and b.
In the following code, f_wb() implements the linear regression model.
def f_wb(x, w, b):
    """
    Linear regression model

    Parameters
    ----------
    x: (ndarray (m,)) or (scalar)
        m is the number of samples
    w: (scalar)
    b: (scalar)

    Returns
    -------
    (ndarray (m,)) or (scalar)
    """
    return w * x + b
Cost Function
Before introducing how to calculate w and b, we must first understand what a cost function is. A cost function measures the residual between the prediction values and the target values, that is, the difference between them. With a cost function, we can tell whether an adjusted w and b are better than the original ones.
The cost function of linear regression is squared error, as follows.
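With m training examples (x^{(i)}, y^{(i)}), the squared-error cost (as implemented in compute_cost() below) is:

J(w,b) = \frac{1}{2m} \sum_{i=1}^{m} \left( f_{w,b}(x^{(i)}) - y^{(i)} \right)^2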
We ultimately want to find w and b such that J(w,b) is at its minimum.
In the following code, compute_cost() implements the cost function J(w,b).
import numpy as np

def compute_cost(x, y, w, b):
    """
    Compute cost

    Parameters
    ----------
    x: (ndarray (m,))
        m is the number of samples
    y: (ndarray (m,))
    w: (scalar)
    b: (scalar)

    Returns
    -------
    (scalar)
    """
    m = x.shape[0]
    y_hat = f_wb(x, w, b)
    return 1 / (2 * m) * np.sum((y_hat - y) ** 2)
Solving Parameters w and b
The model function of linear regression is very simple. The question, however, is how to solve for w and b so that J(w,b) is minimized. In machine learning, we use gradient descent to solve for w and b. Before that, we will first introduce two other methods to help readers understand the advantages of gradient descent.
When introducing each method, we will use the following example to explain.
import matplotlib.pyplot as plt
import numpy as np

x_train = np.array([10, 20, 32, 44, 66])
y_train = np.array([10, 30, 40, 44, 54])

plt.scatter(x_train, y_train, 60, marker='o', c='c', label='y_train')
plt.xlabel('x_train')
plt.ylabel('y_train')
plt.legend()
Simultaneous Equations
We all know that any two points on a plane determine a straight line. We pick (10, 10) and (66, 54) to calculate w and b.
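Substituting the two points into f_{w,b}(x) = wx + b gives a pair of simultaneous equations:

10 = 10w + b
54 = 66w + b

Subtracting the first equation from the second gives 56w = 44, so w = 44/56 = 11/14 ≈ 0.7857 and b = 10 - 10w = 30/14 ≈ 2.1429, which are the values used in the code below.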
We substitute the calculated w and b into the model function, and draw the model function and the prediction values calculated using simultaneous equations on the coordinate graph.
# simultaneous equations
w_se = 11 / 14  # 0.7857
b_se = 30 / 14  # 2.1428
y_hat_se = f_wb(x_train, w_se, b_se)

plt.scatter(x_train, y_train, 60, marker='o', c='c', label='y_train')
plt.plot(x_train, y_hat_se, 'ro-', label='y_hat_se')
plt.xlabel('x_train')
plt.ylabel('y_train and y_hat_se')
plt.legend()
The following shows the cost of this model. Since any two of the 5 points in the example determine a line, we can calculate 10 groups of w and b in this way, compute the cost of each, and finally select the w and b with the smallest cost.
cost_se = compute_cost(x_train, y_train, w_se, b_se)
cost_se

# Output:
# np.float64(36.21836734693877)
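As a sketch of the enumeration described above (reusing f_wb() and compute_cost() defined earlier, plus itertools from the standard library), the following code fits a line through every pair of training points and keeps the pair with the lowest cost:

from itertools import combinations

best = None
for (x1, y1), (x2, y2) in combinations(zip(x_train, y_train), 2):
    # line through the two points
    w = (y2 - y1) / (x2 - x1)
    b = y1 - w * x1
    cost = compute_cost(x_train, y_train, w, b)
    if best is None or cost < best[0]:
        best = (cost, w, b)

best  # (smallest cost, w, b)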
It seems simple to use simultaneous equations to solve for w and b. However, when there are many training examples, the calculation becomes complicated. In addition, if the optimal line does not pass through two of the training examples, this approach cannot find that w and b at all.
Least Squares
Least squares is a statistical optimization method. The w and b it solves for minimize the sum of squared residuals, that is, they minimize the cost. The idea of least squares is to add an error term ε (epsilon) to the model function so that the prediction value equals the target value, and then to make ε as small as possible.
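Written out for training example i, adding the error term to the model function gives:

y^{(i)} = w x^{(i)} + b + \varepsilon^{(i)}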
Substitute all training examples into the model function. It can be seen that ε of each training example may be different.
Next, calculate the square of each ε.
Calculate the sum of all the squares of ε. Then, we hope to solve for a set of w and b such that this sum of squares is at a minimum, that is, as close to zero as possible.
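In other words, the least squares objective is to minimize:

\sum_{i=1}^{m} \left( \varepsilon^{(i)} \right)^2 = \sum_{i=1}^{m} \left( y^{(i)} - (w x^{(i)} + b) \right)^2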
Use completing the square to simplify the expression.
If we want the error to be minimized, the value of w should be as follows. In this case, the first part of the expression becomes zero. We then continue to complete the square on the remaining part.
To minimize the above expression, the value of b should be as follows, so that the first part becomes zero.
Substitute the value of b into the equation for w and find the value of w.
Substituting the calculated w and b into the model function, we can get the model we want.
In the following code, we draw the model function and prediction values calculated using least squares on the coordinate graph.
# least squares
w_ls = 0.788
b_ls = 7.674
y_hat_ls = f_wb(x_train, w_ls, b_ls)

plt.scatter(x_train, y_train, 60, marker='o', c='c', label='y_train')
plt.plot(x_train, y_hat_se, 'm^-', label='y_hat_se')
plt.plot(x_train, y_hat_ls, 'gs-', label='y_hat_ls')
plt.xlabel('x_train')
plt.ylabel('y_train, y_hat_se and y_hat_ls')
plt.legend()
The following shows the cost of the model.
cost_ls = compute_cost(x_train, y_train, w_ls, b_ls)
cost_ls

# Output
# np.float64(15.953221200000002)
Compared with using simultaneous equations, least squares calculates a better w and b. However, when there are many training examples, the whole derivation becomes extremely tedious.
Gradient Descent
Gradient descent is an optimization algorithm used to find a local minimum of a function. The basic principle is: first, randomly select a set of w and b, which may land on either side of the valley of the cost function; then select the next set of w and b in the direction of the valley. Repeat this until J(w,b) cannot get any smaller.
The next question is how to know in which direction the bottom of the cost function is. We can differentiate the cost function J(w,b) to get the current slope. With the slope, we know which way to move.
The following is the gradient descent algorithm. First, randomly select a set of w and b, or simply set them to zero. The partial derivatives of the cost function multiplied by a learning rate give the amount by which w and b decrease. In addition, we also specify an iteration count, which is the number of times gradient descent is performed. Therefore, we may not end up with the optimal w and b.
The partial derivative of J(w,b) with respect to w is shown below.
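Differentiating the squared-error cost gives (dj_dw in compute_gradient() below):

\frac{\partial J(w,b)}{\partial w} = \frac{1}{m} \sum_{i=1}^{m} \left( f_{w,b}(x^{(i)}) - y^{(i)} \right) x^{(i)}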
The partial derivative of J(w,b) with respect to b is shown below.
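It is the same sum without the x^{(i)} factor (dj_db in compute_gradient() below):

\frac{\partial J(w,b)}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} \left( f_{w,b}(x^{(i)}) - y^{(i)} \right)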
The following code implements the calculation of the partial derivatives of J(w,b) with respect to w and b.
def compute_gradient(x, y, w, b):
    """
    Compute the gradient for linear regression

    Parameters
    ----------
    x: (ndarray (m,))
        m is the number of samples
    y: (ndarray (m,))
    w: (scalar)
    b: (scalar)

    Returns
    -------
    dj_dw: (scalar)
    dj_db: (scalar)
    """
    m = x.shape[0]
    dj_dw = 1 / m * np.sum((f_wb(x, w, b) - y) * x)
    dj_db = 1 / m * np.sum(f_wb(x, w, b) - y)
    return dj_dw, dj_db
The following code implements gradient descent. The parameter alpha is the learning rate, and epochs is the number of iterations.
def perform_gradient_descent(x, y, w_init, b_init, alpha, epochs):
    """
    Perform gradient descent

    Parameters
    ----------
    x: (ndarray (m,))
        m is the number of samples
    y: (ndarray (m,))
    w_init: (scalar)
    b_init: (scalar)
    alpha: (float)
    epochs: (int)

    Returns
    -------
    w: (scalar)
    b: (scalar)
    """
    w = w_init
    b = b_init
    for i in range(epochs):
        dj_dw, dj_db = compute_gradient(x, y, w, b)
        w -= alpha * dj_dw
        b -= alpha * dj_db
    return w, b
In the following code, we set the learning rate to 0.001 and the number of iterations to 3000. Then, draw the model function and prediction values calculated using gradient descent on the coordinate graph.
w_dg, b_dg = perform_gradient_descent(x_train, y_train, 0, 0, 0.001, 3000)
y_hat_gd = f_wb(x_train, w_dg, b_dg)

plt.scatter(x_train, y_train, 60, marker='o', c='c', label='y_train')
plt.plot(x_train, y_hat_se, 'm^-', label='y_hat_se')
plt.plot(x_train, y_hat_ls, 'gs-', label='y_hat_ls')
plt.plot(x_train, y_hat_gd, 'y*-', label='y_hat_gd')
plt.xlabel('x_train')
plt.ylabel('y_train, y_hat_se, y_hat_ls, and y_hat_gd')
plt.legend()
The following shows the cost of the model.
cost_gd = compute_cost(x_train, y_train, w_dg, b_dg)
cost_gd

# Output
# np.float64(18.014426543403744)
Although deriving the derivatives of the cost function takes some work, the algorithm itself is not complicated. Moreover, the code does not become more complicated when there are many training examples.
Some values calculated by the above three methods are listed below for readers’ comparison and reference.
(w_se, b_se), (w_ls, b_ls), (w_dg, b_dg)

# Output
# ((0.7857142857142857, 2.142857142857143),
#  (0.788, 7.674),
#  (np.float64(0.8312661524754889), np.float64(5.714916639215474)))

(y_train, y_hat_se, y_hat_ls, y_hat_gd)

# Output
# (array([10, 30, 40, 44, 54]),
#  array([10. , 17.85714286, 27.28571429, 36.71428571, 54. ]),
#  array([15.554, 23.434, 32.89 , 42.346, 59.682]),
#  array([14.02757816, 22.34023969, 32.31543352, 42.29062735, 60.5784827 ]))

(cost_se, cost_ls, cost_gd)

# Output
# (np.float64(36.21836734693877),
#  np.float64(15.953221200000002),
#  np.float64(18.014426543403744))
Learning Rate
In gradient descent, there is a hyperparameter called the learning rate α. It can be seen from the gradient descent algorithm that when α is small, it takes more iterations to get close to the bottom, so performance will be poor. Conversely, a larger α reaches the bottom more quickly. However, when α is too large, the steps may overshoot and never reach the bottom.
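As a quick sketch of this behavior (reusing perform_gradient_descent(), compute_cost(), and the x_train/y_train data above; the specific rates are only illustrative):

# Compare the effect of different learning rates on the same data.
for alpha in (0.0001, 0.001):
    w, b = perform_gradient_descent(x_train, y_train, 0, 0, alpha, 3000)
    # The smaller rate moves toward the bottom more slowly, so after the
    # same number of iterations its cost is usually still higher.
    print(alpha, compute_cost(x_train, y_train, w, b))

# A rate that is clearly too large for this data keeps overshooting the
# bottom, so the cost grows instead of shrinking; a few iterations are
# enough to see it diverge.
w, b = perform_gradient_descent(x_train, y_train, 0, 0, 0.01, 20)
print(0.01, compute_cost(x_train, y_train, w, b))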
Multiple Linear Regression
So far, what we have introduced is simple linear regression, which has only one variable and may not be very practical. Next, we will introduce multiple linear regression.
Multiple Linear Regression Model
Multiple linear regression has more than one variable. The following is the model function of 4 variables.
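With features x_1, x_2, x_3, x_4 and corresponding weights w_1, w_2, w_3, w_4:

f_{w,b}(x) = w_1 x_1 + w_2 x_2 + w_3 x_3 + w_4 x_4 + b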
In general, the model function of multiple linear regression is as follows.
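With n features x_1, …, x_n and weights w_1, …, w_n:

f_{w,b}(x) = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n + b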
The vectorized model function is as follows, where w and x are vectors.
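Using the dot product of w and x:

f_{w,b}(x) = w \cdot x + b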
If we put all the training examples together, they form a matrix. Generally, we use a capital X to represent the matrix formed by the training examples. Therefore, the model function can be written as follows:
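This is what the vectorized code below computes with np.dot(X, w) + b:

f_{w,b}(X) = Xw + b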
The vectorized f_{w,b} is very similar to the model function of simple linear regression, and their implementations are similar. The difference is that the vectorized f_{w,b} uses the dot product to multiply X and w.
import numpy as np

def f_wb(X, w, b):
    """
    Linear regression model

    Parameters
    ----------
    X: (ndarray (m, n))
        m is the number of samples, n is the number of features
    w: (ndarray (n,))
    b: (scalar)

    Returns
    -------
    (ndarray (m,))
    """
    return np.dot(X, w) + b
Cost Function
The cost function of multiple linear regression is as follows.
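It has the same squared-error form as before, with each x^{(i)} now a feature vector:

J(w,b) = \frac{1}{2m} \sum_{i=1}^{m} \left( f_{w,b}(x^{(i)}) - y^{(i)} \right)^2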
In the following code, compute_cost() implements the vectorized cost function.
import numpy as np

def compute_cost(X, y, w, b):
    """
    Compute cost

    Parameters
    ----------
    X: (ndarray (m, n))
        m is the number of samples, n is the number of features
    y: (ndarray (m,))
    w: (ndarray (n,))
    b: (scalar)

    Returns
    -------
    (scalar)
    """
    m = X.shape[0]
    y_hat = f_wb(X, w, b)
    return 1 / (2 * m) * np.sum((y_hat - y) ** 2)
Gradient Descent
The gradient descent algorithm for multiple linear regression is as follows. It can be understood from the formula that we need to update the parameter w_j of each feature. Therefore, we calculate the partial derivative of J(w,b) with respect to each w_j.
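In each iteration, every w_j and b are updated simultaneously with learning rate α:

w_j := w_j - \alpha \frac{\partial J(w,b)}{\partial w_j} \qquad (j = 1, \ldots, n)
b := b - \alpha \frac{\partial J(w,b)}{\partial b}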
The partial derivatives of J(w,b) with respect to each w_j and to b are as follows:
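With x_j^{(i)} denoting feature j of training example i, these are (dj_dw and dj_db in compute_gradient() below):

\frac{\partial J(w,b)}{\partial w_j} = \frac{1}{m} \sum_{i=1}^{m} \left( f_{w,b}(x^{(i)}) - y^{(i)} \right) x_j^{(i)}
\frac{\partial J(w,b)}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} \left( f_{w,b}(x^{(i)}) - y^{(i)} \right)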
The following code implements the calculation of the partial derivatives of J(w,b) with respect to w_j and b.
import numpy as np

def compute_gradient(X, y, w, b):
    """
    Compute the gradient for linear regression

    Parameters
    ----------
    X: (ndarray (m, n))
        m is the number of samples, n is the number of features
    y: (ndarray (m,))
    w: (ndarray (n,))
    b: (scalar)

    Returns
    -------
    dj_dw: (ndarray (n,))
    dj_db: (scalar)
    """
    m = X.shape[0]
    y_hat = f_wb(X, w, b)
    dj_dw = 1 / m * np.dot(X.T, y_hat - y)
    dj_db = 1 / m * np.sum(y_hat - y)
    return dj_dw, dj_db
The following code implements gradient descent.
import numpy as np

def perform_gradient_descent(X, y, w_init, b_init, alpha, epochs):
    """
    Perform gradient descent

    Parameters
    ----------
    X: (ndarray (m, n))
        m is the number of samples, n is the number of features
    y: (ndarray (m,))
    w_init: (ndarray (n,))
    b_init: (scalar)
    alpha: (scalar)
    epochs: (int)

    Returns
    -------
    w: (ndarray (n,))
    b: (scalar)
    """
    w = w_init
    b = b_init
    for i in range(epochs):
        dj_dw, dj_db = compute_gradient(X, y, w, b)
        w -= alpha * dj_dw
        b -= alpha * dj_db
        print(f'Epoch {i + 1}, Cost: {compute_cost(X, y, w, b)}')
    return w, b
In the following code, we set the learning rate to 0.0000001 and the number of iterations to 1000. Then, we output the change in cost at each iteration. It can be seen that the cost decreases quickly at the beginning and then more and more slowly.
X_train = np.array([[2104, 5, 1, 45],
                    [1416, 3, 2, 40],
                    [852, 2, 1, 35]])
y_train = np.array([460, 232, 178])

w, b = perform_gradient_descent(X_train, y_train, np.zeros(X_train.shape[1]), 0., 0.0000001, 1000)
(w, b)

# Output
# Epoch 1, Cost: 28989.111501449268
# Epoch 2, Cost: 17092.48916583246
# Epoch 3, Cost: 10198.319962562276
# Epoch 4, Cost: 6203.104159134617
# Epoch 5, Cost: 3887.8502848154976
# Epoch 6, Cost: 2546.145030983679
# Epoch 7, Cost: 1768.6173933475704
# Epoch 8, Cost: 1318.0342718091906
# Epoch 9, Cost: 1056.9175763423789
# Epoch 10, Cost: 905.59787958039
# Epoch 11, Cost: 817.9062417617257
# Epoch 12, Cost: 767.0874687152398
# Epoch 13, Cost: 737.6367552876493
# Epoch 14, Cost: 720.5689683005712
# Epoch 15, Cost: 710.6771659424062
# ...
# Epoch 995, Cost: 694.9420356037724
# Epoch 996, Cost: 694.9399054268631
# Epoch 997, Cost: 694.9377752873567
# Epoch 998, Cost: 694.9356451852516
# Epoch 999, Cost: 694.9335151205503
# Epoch 1000, Cost: 694.9313850932496
# (array([ 0.20253263, 0.00112386, -0.00213202, -0.00933401]), np.float64(-0.0003572563919114557))
The following shows using the calculated w and b to predict ŷ and to calculate the residual between ŷ and y.
y_hat = f_wb(X_train, w, b)
(y_train, y_hat)

# Output
# (array([460, 232, 178]), array([425.71175692, 286.41159657, 172.23087045]))

cost = compute_cost(X_train, y_train, w, b)
cost

# Output
# np.float64(694.9313850932496)
Conclusion
Linear regression is a relatively simple model, which makes it great for explaining the mathematics behind machine learning models, yet it is still practical. Simple linear regression has only one variable, so its use on its own is limited. In practice, we use vectorized multiple linear regression, because there is usually more than one variable, and often many.
Reference
- Andrew Ng, Machine Learning Specialization, Coursera.
- 西内啓 (Hiroshi Nishiuchi), 統計学が最強の学問である[数学編]――データ分析と機械学習のための新しい教科書 (Statistics Is the Strongest Discipline, Mathematics Edition: A New Textbook for Data Analysis and Machine Learning), ダイヤモンド社 (Diamond, Inc.).