Logistic regression is a data analysis technique used for classification. It is probably the simplest of the classification algorithms, which makes it a very suitable first classification algorithm for beginners.
The complete code for this chapter can be found in .
Logistic Regression
Sigmoid Function
Before introducing logistic regression, let's first look at the sigmoid function. The sigmoid function is as follows; it converts $z$ into a value between 0 and 1.

$$g(z) = \frac{1}{1 + e^{-z}}$$

As shown in the figure below, the two ends of the sigmoid function get very close to 0 and 1.
In the following code, sigmoid() implements the sigmoid function.

```python
def sigmoid(z):
    """
    Sigmoid function

    Parameters
    ----------
    z: (ndarray (m,)) or (scalar)
        m is the number of samples

    Returns
    -------
    (ndarray (m,)) or (scalar)
    """
    return 1 / (1 + np.exp(-z))
```
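As a quick sanity check, here is a minimal usage sketch (the input values are arbitrary and chosen only for illustration): sigmoid(0) returns exactly 0.5, while large negative and positive inputs approach 0 and 1.

```python
import numpy as np

# Arbitrary test inputs, for illustration only.
z = np.array([-10.0, 0.0, 10.0])
print(sigmoid(z))  # approximately [4.54e-05, 0.5, 0.9999546]
```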
Logistic Regression Model
Logistic regression is a regression model used for classification, and classification is naturally related to probability. The output value of the sigmoid function is between 0 and 1, meeting the basic requirement that a probability must be between 0% and 100%. The logistic regression model first substitutes $x$ into the linear regression model to get $z$, and then substitutes $z$ into the sigmoid function to get a value between 0 and 1. The model function of logistic regression is as follows.

$$f_{w,b}(x) = g(wx + b) = \frac{1}{1 + e^{-(wx + b)}}$$
Assume that $x$ is the size of a tumor: when $y$ is 0, it means a non-malignant tumor, and when $y$ is 1, it means a malignant tumor. If we pass $x$ into a logistic regression model and obtain 0.7, this means there is a 70% chance that $y$ is 1. Because the probability that $y$ is 0 and the probability that $y$ is 1 sum to 1, a 70% probability that $y$ is 1 means there is a 30% probability that $y$ is 0.
In the following code, f_wb() implements the logistic regression model.

```python
def f_wb(x, w, b):
    """
    Logistic regression model

    Parameters
    ----------
    x: (ndarray (m,)) or (scalar)
        m is the number of samples
    w: (scalar)
    b: (scalar)

    Returns
    -------
    (ndarray (m,)) or (scalar)
    """
    return sigmoid(w * x + b)
```
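For example, with hypothetical parameters w and b (chosen only for illustration), f_wb() turns raw feature values into probabilities between 0 and 1.

```python
# Hypothetical parameters and inputs, for illustration only.
w, b = 1.0, -5.0
x = np.array([1.0, 5.0, 9.0])
print(f_wb(x, w, b))  # approximately [0.018, 0.5, 0.982]
```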
Decision Boundary
As mentioned just now, the output value of logistic regression is between 0 and 1, which represents the probability of $y$ being 1. Therefore, we need to set a threshold. When $f_{w,b}$ is greater than or equal to the threshold, $\hat{y}$ is 1; conversely, when $f_{w,b}$ is less than the threshold, $\hat{y}$ is 0.
Looking carefully, when $z$ is greater than or equal to 0, $f_{w,b}$, that is, $g(z)$, is greater than or equal to the threshold 0.5. So, on the left side of $z = 0$ we have $\hat{y} = 0$, and on the right side of $z = 0$ we have $\hat{y} = 1$. This boundary is called the decision boundary. In the example below, we can calculate the decision boundary by setting $z = 0$.
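As a sketch of that calculation for the one-variable model (with hypothetical values of w and b, purely for illustration), setting $z = wx + b = 0$ and solving for $x$ gives the boundary $x = -b / w$.

```python
# Hypothetical parameters, for illustration only.
w, b = 2.0, -6.0

# Setting z = w * x + b = 0 gives the decision boundary x = -b / w.
boundary = -b / w
print(boundary)  # 3.0: y_hat is 0 to the left of x = 3 and 1 to the right
```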
Cost Function
Likelihood Function and Maximum Likelihood Estimation
The likelihood function is a statistical method used to describe how likely the observed data are given the parameters. Suppose we toss a coin 4 times and the results are as follows, where the probability of heads for each toss is $\theta$.

| $x_i$ | Result | Probability |
|---|---|---|
| $x_1$ | Heads | $\theta$ |
| $x_2$ | Heads | $\theta$ |
| $x_3$ | Heads | $\theta$ |
| $x_4$ | Tails | $1 - \theta$ |

According to the above results, its likelihood function is as follows. $P(x_i \mid \theta)$ is a conditional probability, which refers to the probability of $x_i$ given the parameter $\theta$.

$$L(\theta) = P(x_1 \mid \theta) P(x_2 \mid \theta) P(x_3 \mid \theta) P(x_4 \mid \theta) = \theta \cdot \theta \cdot \theta \cdot (1 - \theta) = \theta^3 (1 - \theta)$$

Assuming that the probability of tossing heads is 0.6, the likelihood is $0.6^3 \times 0.4 = 0.0864$.

After obtaining the likelihood function, differentiate it and set the derivative equal to 0 to find the maximum. Here $\frac{dL}{d\theta} = 3\theta^2 - 4\theta^3 = 0$ gives the probability of tossing heads $\theta = \frac{3}{4}$. This is maximum likelihood estimation.
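To make this concrete, the following sketch evaluates the likelihood $\theta^3 (1 - \theta)$ over a grid of $\theta$ values and picks the maximum numerically; it agrees with the analytic answer $\theta = 3/4$.

```python
import numpy as np

# Likelihood of observing 3 heads and 1 tail, as a function of theta.
theta = np.linspace(0, 1, 10001)
likelihood = theta**3 * (1 - theta)

print(theta[np.argmax(likelihood)])  # 0.75
```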
Cost Function and Loss Function
The cost function is used to measure the accuracy of $f_{w,b}$. With the cost function, we can measure whether the adjusted $w$ and $b$ are better than the original ones.
Taking the previous coin toss example, the probability of each coin toss is as follows, where $y_i$ is 1 for heads and 0 for tails:

$$P(x_i \mid \theta) = \begin{cases} \theta & \text{if } y_i = 1 \\ 1 - \theta & \text{if } y_i = 0 \end{cases}$$

For convenience, we can combine the two cases of $P(x_i \mid \theta)$ into one equation, as follows:

$$P(x_i \mid \theta) = \theta^{y_i} (1 - \theta)^{1 - y_i}$$
The likelihood function when tossing a coin $n$ times is as follows. According to maximum likelihood estimation, we look for the $\theta$ that maximizes this likelihood function, and we will derive the cost function from it.

$$L(\theta) = \prod_{i=1}^{n} \theta^{y_i} (1 - \theta)^{1 - y_i}$$
Since the multiplication in the formula is unfavorable for calculation, we first simplify it by taking the logarithm of both sides.

$$\log L(\theta) = \log \prod_{i=1}^{n} \theta^{y_i} (1 - \theta)^{1 - y_i}$$

The logarithm of the product can be expanded into a sum as follows.

$$\log \prod_{i=1}^{n} \theta^{y_i} (1 - \theta)^{1 - y_i} = \sum_{i=1}^{n} \left[ y_i \log \theta + (1 - y_i) \log (1 - \theta) \right]$$

Finally, our likelihood function becomes the log-likelihood function, as follows.

$$\log L(\theta) = \sum_{i=1}^{n} \left[ y_i \log \theta + (1 - y_i) \log (1 - \theta) \right]$$

According to maximum likelihood estimation, we must find the maximum value of the log-likelihood function. However, by convention, we are used to finding the minimum value of a cost function. Therefore, we multiply the log-likelihood function by $-1$ so that we can find the minimum value instead.

$$-\log L(\theta) = \sum_{i=1}^{n} \left[ -y_i \log \theta - (1 - y_i) \log (1 - \theta) \right]$$
In logistic regression, the probability is computed by $f_{w,b}$, so by substituting $f_{w,b}(x^{(i)})$ for $\theta$, we get the loss function of logistic regression.

$$L\left(f_{w,b}(x^{(i)}), y^{(i)}\right) = -y^{(i)} \log f_{w,b}(x^{(i)}) - \left(1 - y^{(i)}\right) \log\left(1 - f_{w,b}(x^{(i)})\right)$$

Finally, the cost function of logistic regression is as follows:

$$J(w, b) = \frac{1}{m} \sum_{i=1}^{m} \left[ -y^{(i)} \log f_{w,b}(x^{(i)}) - \left(1 - y^{(i)}\right) \log\left(1 - f_{w,b}(x^{(i)})\right) \right]$$
In the following code, compute_cost() implements the cost function $J(w, b)$.

```python
def compute_cost(x, y, w, b):
    """
    Compute cost

    Parameters
    ----------
    x: (ndarray (m,))
        m is the number of samples
    y: (ndarray (m,))
    w: (scalar)
    b: (scalar)

    Returns
    -------
    (scalar)
    """
    m = x.shape[0]
    y_hat = f_wb(x, w, b)
    cost = 1 / m * np.sum(-y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat))
    return cost
```
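As a quick check, the sketch below evaluates the cost on a tiny hand-made dataset (the values and parameters are arbitrary, for illustration only); the cost is small when the parameters separate the labels well and much larger when they do not.

```python
# Arbitrary toy data and parameters, for illustration only.
x = np.array([1.0, 2.0, 8.0, 9.0])
y = np.array([0, 0, 1, 1])

print(compute_cost(x, y, w=1.0, b=-5.0))   # small cost: this w, b fits the labels
print(compute_cost(x, y, w=-1.0, b=5.0))   # much larger cost: predictions are flipped
```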
Gradient Descent
Gradient descent is an optimization algorithm used to find a local minimum of a function. Readers can first refer to the chapter on linear regression to understand gradient descent.
The following is the gradient descent algorithm of logistic regression. First, randomly select a set of $w$ and $b$, or set them directly to zero. The partial derivatives of the cost function multiplied by a learning rate $\alpha$ give the amounts by which $w$ and $b$ descend. In addition, we also set a number of iterations, which represents how many times gradient descent is performed; therefore, we may not obtain the optimal $w$ and $b$.

$$\begin{aligned} w &:= w - \alpha \frac{\partial J(w, b)}{\partial w} \\ b &:= b - \alpha \frac{\partial J(w, b)}{\partial b} \end{aligned}$$
The partial derivatives of $J(w, b)$ with respect to $w$ and $b$ are calculated as follows:

$$\frac{\partial J(w, b)}{\partial w} = \frac{1}{m} \sum_{i=1}^{m} \left( f_{w,b}(x^{(i)}) - y^{(i)} \right) x^{(i)}$$

$$\frac{\partial J(w, b)}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} \left( f_{w,b}(x^{(i)}) - y^{(i)} \right)$$
The derivatives of the cost functions of logistic regression and linear regression look the same, but note that their $f_{w,b}$ are different.
In the following code, compute_gradient() implements the calculation of the partial derivatives of $J(w, b)$ with respect to $w$ and $b$.
```python
def compute_gradient(x, y, w, b):
    """
    Compute the gradient for logistic regression

    Parameters
    ----------
    x: (ndarray (m,))
        m is the number of samples
    y: (ndarray (m,))
    w: (scalar)
    b: (scalar)

    Returns
    -------
    dj_dw: (scalar)
    dj_db: (scalar)
    """
    m = x.shape[0]
    dj_dw = 1 / m * np.sum((f_wb(x, w, b) - y) * x)
    dj_db = 1 / m * np.sum(f_wb(x, w, b) - y)
    return dj_dw, dj_db
```
The following code implements gradient descent. The parameter alpha is the learning rate and epochs is the number of iterations.
```python
def perform_gradient_descent(x, y, w_init, b_init, alpha, epochs):
    """
    Perform gradient descent

    Parameters
    ----------
    x: (ndarray (m,))
        m is the number of samples
    y: (ndarray (m,))
    w_init: (scalar)
    b_init: (scalar)
    alpha: (float)
    epochs: (int)

    Returns
    -------
    w: (scalar)
    b: (scalar)
    """
    w = w_init
    b = b_init
    for i in range(epochs):
        dj_dw, dj_db = compute_gradient(x, y, w, b)
        w -= alpha * dj_dw
        b -= alpha * dj_db
    return w, b
```
Example
Next, we will use an example to explain how to use logistic regression. In the following example, there are 20 values. Values less than 10 are given a label of 0, and values greater than or equal to 10 are given a label of 1.
```python
x_train = np.array([1, 0, 17, 8, 13, 19, 15, 10, 8, 7, 3, 6, 17, 3, 4, 17, 11, 12, 16, 13])
y_train = np.int_(x_train[:] >= 10)

plt.scatter(x_train[y_train == 0], y_train[y_train == 0], 60, marker='^', c='c', label='y_train == 0')
plt.scatter(x_train[y_train == 1], y_train[y_train == 1], 60, marker='o', c='m', label='y_train == 1')
plt.xlabel('x_train')
plt.ylabel('y_train')
plt.legend()
```
In the following code, we set the learning rate to 0.01 and the number of iterations to 10,000.
```python
w, b = perform_gradient_descent(x_train, y_train, 0, 0, 0.01, 10000)
w, b

# Output
(np.float64(0.6199251592044446), np.float64(-5.42258186823073))
```
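With these learned parameters we can also read off the decision boundary of this one-variable model: setting $wx + b = 0$ gives $x = -b / w \approx 8.75$, which is close to the cut-off of 10 used to generate the labels. A minimal sketch:

```python
# Decision boundary of the 1-D model: solve w * x + b = 0 for x.
boundary = -b / w
print(boundary)  # approximately 8.75
```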
After that, use the calculated $w$ and $b$ to predict x_train. The resulting prediction is a series of probabilities. We set the threshold to 0.5, so if the probability is less than 0.5, the label 0 is assigned, and if the probability is greater than or equal to 0.5, the label 1 is assigned.
```python
prediction = f_wb(x_train, w, b)
y_hat = np.int_(prediction >= 0.5)
(y_train, prediction, y_hat)
```
Plot y_hat on the coordinate chart and compare it with the chart of y_train above.
```python
plt.scatter(x_train[y_hat == 0], y_train[y_hat == 0], 60, marker='^', c='c', label='y_hat == 0')
plt.scatter(x_train[y_hat == 1], y_train[y_hat == 1], 60, marker='o', c='m', label='y_hat == 1')
plt.xlabel('x_train')
plt.ylabel('y_train')
plt.legend()
```
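To quantify how well the predictions match the labels, we can also compute a simple accuracy score (this snippet is an addition, not part of the original example).

```python
# Fraction of training samples whose predicted label matches the true label.
accuracy = np.mean(y_hat == y_train)
print(accuracy)  # 1.0 for this training set
```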
Multiple Logistic Regression
So far, we have introduced simple logistic regression, which has only one variable. Next, we will introduce multiple logistic regression.
Multiple Logistic Regression Model
The model function of multiple logistic regression is as follows.

$$f_{w,b}(x) = g(w_1 x_1 + w_2 x_2 + \cdots + w_n x_n + b)$$

The vectorized model function is as follows, where $\mathbf{w}$ and $\mathbf{x}$ are vectors.

$$f_{\mathbf{w},b}(\mathbf{x}) = g(\mathbf{w} \cdot \mathbf{x} + b)$$

If we put all the training examples together, they form a matrix. Generally, we use the uppercase letter $X$ to represent the matrix formed by the training examples. Therefore, the model function can be written as follows:

$$f_{\mathbf{w},b}(X) = g(X \mathbf{w} + b)$$
The vectorized $f_{w,b}$ is very similar to the model function of simple logistic regression, and their implementations are also similar. The difference is that the vectorized $f_{w,b}$ uses the inner product (dot product) to multiply $X$ and $\mathbf{w}$.
```python
def sigmoid(z):
    """
    Sigmoid function

    Parameters
    ----------
    z: (ndarray (m,)) or (scalar)
        m is the number of samples

    Returns
    -------
    (ndarray (m,)) or (scalar)
    """
    return 1 / (1 + np.exp(-z))


def f_wb(X, w, b):
    """
    Logistic regression model

    Parameters
    ----------
    X: (ndarray (m, n))
        m is the number of samples, n is the number of features
    w: (ndarray (n,))
    b: (scalar)

    Returns
    -------
    (ndarray (m,))
    """
    return sigmoid(np.dot(X, w) + b)
```
Cost Function
The cost function of multiple logistic regression is as follows.

$$J(\mathbf{w}, b) = \frac{1}{m} \sum_{i=1}^{m} \left[ -y^{(i)} \log f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - \left(1 - y^{(i)}\right) \log\left(1 - f_{\mathbf{w},b}(\mathbf{x}^{(i)})\right) \right]$$
In the following code, compute_cost() implements the vectorized cost function.
```python
def compute_cost(X, y, w, b):
    """
    Compute cost

    Parameters
    ----------
    X: (ndarray (m, n))
        m is the number of samples, n is the number of features
    y: (ndarray (m,))
    w: (ndarray (n,))
    b: (scalar)

    Returns
    -------
    (scalar)
    """
    m = X.shape[0]
    y_hat = f_wb(X, w, b)
    cost = 1 / m * np.sum(-y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat))
    return cost
```
Gradient Descent
The gradient descent of multiple logistic regression is as follows. As can be seen from the formula, we need to update the parameter $w_j$ of each feature, which requires the partial derivative of $J(\mathbf{w}, b)$ with respect to each $w_j$.

$$\begin{aligned} w_j &:= w_j - \alpha \frac{\partial J(\mathbf{w}, b)}{\partial w_j} \\ b &:= b - \alpha \frac{\partial J(\mathbf{w}, b)}{\partial b} \end{aligned}$$

The partial derivatives of $J(\mathbf{w}, b)$ with respect to each $w_j$ and $b$ are as follows:

$$\frac{\partial J(\mathbf{w}, b)}{\partial w_j} = \frac{1}{m} \sum_{i=1}^{m} \left( f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)} \right) x_j^{(i)}$$

$$\frac{\partial J(\mathbf{w}, b)}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} \left( f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)} \right)$$
The following code implements the calculation of the partial derivatives of $J(\mathbf{w}, b)$ with respect to $w_j$ and $b$.
```python
def compute_gradient(X, y, w, b):
    """
    Compute the gradient for logistic regression

    Parameters
    ----------
    X: (ndarray (m, n))
        m is the number of samples, n is the number of features
    y: (ndarray (m,))
    w: (ndarray (n,))
    b: (scalar)

    Returns
    -------
    dj_dw: (ndarray (n,))
    dj_db: (scalar)
    """
    m = X.shape[0]
    y_hat = f_wb(X, w, b)
    dj_dw = 1 / m * np.dot(X.T, y_hat - y)
    dj_db = 1 / m * np.sum(y_hat - y)
    return dj_dw, dj_db
```
The following code implements gradient descent.
```python
def perform_gradient_descent(X, y, w_init, b_init, alpha, epochs):
    """
    Perform gradient descent

    Parameters
    ----------
    X: (ndarray (m, n))
        m is the number of samples, n is the number of features
    y: (ndarray (m,))
    w_init: (ndarray (n,))
    b_init: (scalar)
    alpha: (float)
    epochs: (int)

    Returns
    -------
    w: (ndarray (n,))
    b: (scalar)
    """
    w = w_init
    b = b_init
    for i in range(epochs):
        dj_dw, dj_db = compute_gradient(X, y, w, b)
        w -= alpha * dj_dw
        b -= alpha * dj_db
        if i % 1000 == 0:
            # Log the cost every 1000 epochs to track convergence.
            print(f'Epoch {i + 1:4}, Cost: {float(compute_cost(X, y, w, b)):8.2f}')
    return w, b
```
Example
In the following example, there are 6 points on the coordinate plane and they are divided into two groups.
```python
X_train = np.array([[10, 10], [10, 16], [20, 8], [10, 24], [20, 16], [30, 30]])
y_train = np.array([0, 0, 0, 1, 1, 1])

plt.scatter(X_train[y_train == 0, 0], X_train[y_train == 0, 1], 60, marker='^', c='c', label='y_train == 0')
plt.scatter(X_train[y_train == 1, 0], X_train[y_train == 1, 1], 60, marker='o', c='m', label='y_train == 1')
plt.xlabel('x_train[:,0]')
plt.ylabel('x_train[:,1]')
plt.legend()
```
Use gradient descent to train the model and find $\mathbf{w}$ and $b$.
```python
np.random.seed(1)
initial_w = 0.01 * (np.random.rand(2) - 0.5)
initial_b = -8
w, b = perform_gradient_descent(X_train, y_train, initial_w, initial_b, 0.001, 10000)
w, b

# Output
Epoch    1, Cost:     3.75
Epoch 1001, Cost:     0.17
Epoch 2001, Cost:     0.17
Epoch 3001, Cost:     0.17
Epoch 4001, Cost:     0.17
Epoch 5001, Cost:     0.17
Epoch 6001, Cost:     0.17
Epoch 7001, Cost:     0.17
Epoch 8001, Cost:     0.17
Epoch 9001, Cost:     0.17
(array([0.17254296, 0.36047453]), np.float64(-8.219025094555942))
```
After calculating $\mathbf{w}$ and $b$, we can draw the decision boundary.
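The boundary line comes from solving $\mathbf{w} \cdot \mathbf{x} + b = 0$ for the second feature, which is exactly what the plot_y expression below computes:

$$w_1 x_1 + w_2 x_2 + b = 0 \;\Rightarrow\; x_2 = -\frac{w_1 x_1 + b}{w_2}$$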
```python
plt.scatter(X_train[y_train == 0, 0], X_train[y_train == 0, 1], 60, marker='^', c='c', label='y_train == 0')
plt.scatter(X_train[y_train == 1, 0], X_train[y_train == 1, 1], 60, marker='o', c='m', label='y_train == 1')

plot_x = np.array([min(X_train[:, 0]), max(X_train[:, 0])])
plot_y = (-1. / w[1]) * (w[0] * plot_x + b)
plt.plot(plot_x, plot_y, c='g')

plt.xlabel('x_train[:,0]')
plt.ylabel('x_train[:,1]')
plt.legend()
```
Finally, we add a new point (25, 25) and use the model to classify it.
```python
plt.scatter(X_train[y_train == 0, 0], X_train[y_train == 0, 1], 60, marker='^', c='c', label='y_train == 0')
plt.scatter(X_train[y_train == 1, 0], X_train[y_train == 1, 1], 60, marker='o', c='m', label='y_train == 1')

plot_x = np.array([min(X_train[:, 0]), max(X_train[:, 0])])
plot_y = (-1. / w[1]) * (w[0] * plot_x + b)
plt.plot(plot_x, plot_y, c='g')

x = np.array([25, 25])
y_hat = f_wb(x, w, b)
plt.scatter(x[0], x[1], 60, marker='*', c='m' if y_hat >= 0.5 else 'c', label=f'Prediction: {y_hat:.2f}')

plt.xlabel('x_train[:,0]')
plt.ylabel('x_train[:,1]')
plt.legend()
```
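Using the parameters printed above, we can also verify the classification of the new point by hand:

$$z = 0.1725 \times 25 + 0.3605 \times 25 - 8.219 \approx 5.11, \qquad g(5.11) \approx 0.99$$

Since this probability is well above the 0.5 threshold, the point (25, 25) is classified as 1.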
Conclusion
The logistic regression model is very similar to linear regression. It uses the linear regression model function to calculate a value, and then uses the sigmoid function to cleverly convert that value into a probability. Although the concept of logistic regression is quite simple, it is often used in practice.