Logistic regression is a data analysis technique used for classification. It is probably the simplest of the classification algorithms, which makes it a very suitable first classification algorithm for beginners.
The complete code for this chapter can be found in .
Logistic Regression
Sigmoid Function
Before introducing logistic regression, let's first look at the sigmoid function. The sigmoid function is as follows; it converts $z$ into a value between 0 and 1.

$$g(z) = \frac{1}{1 + e^{-z}}$$

As shown in the figure below, the two ends of the sigmoid function get very close to 0 and 1.
In the following code, sigmoid() implements the sigmoid function.

```python
def sigmoid(z):
    """
    Sigmoid function

    Parameters
    ----------
    z: (ndarray (m,)) or (scalar)
        m is the number of samples

    Returns
    -------
    (ndarray (m,)) or (scalar)
    """
    return 1 / (1 + np.exp(-z))
```
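As a quick sanity check, here is a minimal usage sketch (the input values are arbitrary and chosen only for illustration): sigmoid(0) returns exactly 0.5, while large negative and positive inputs approach 0 and 1.

```python
import numpy as np

# Arbitrary test inputs, for illustration only.
z = np.array([-10.0, 0.0, 10.0])
print(sigmoid(z))  # approximately [4.54e-05, 0.5, 0.9999546]
```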
Logistic Regression Model
Logistic regression is a regression model used for classification, and classification is naturally related to probability. The output value of the sigmoid function is between 0 and 1, meeting the basic requirement that a probability must be between 0% and 100%. The logistic regression model first substitutes $x$ into the linear regression model to get $z$, and then substitutes $z$ into the sigmoid function to get a value between 0 and 1. The model function of logistic regression is as follows.

$$f_{w,b}(x) = g(wx + b) = \frac{1}{1 + e^{-(wx + b)}}$$
Assume that $x$ is the size of a tumor: when $y$ is 0, it means a non-malignant tumor, and when $y$ is 1, it means a malignant tumor. If we pass $x$ into a logistic regression model and obtain 0.7, this means there is a 70% chance that $y$ is 1. Because the probability that $y$ is 0 and the probability that $y$ is 1 sum to 1, a 70% probability that $y$ is 1 means there is a 30% probability that $y$ is 0.
In the following code, f_wb() implements the logistic regression model.

```python
def f_wb(x, w, b):
    """
    Logistic regression model

    Parameters
    ----------
    x: (ndarray (m,)) or (scalar)
        m is the number of samples
    w: (scalar)
    b: (scalar)

    Returns
    -------
    (ndarray (m,)) or (scalar)
    """
    return sigmoid(w * x + b)
```
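For example, with hypothetical parameters w and b (chosen only for illustration), f_wb() turns raw feature values into probabilities between 0 and 1.

```python
# Hypothetical parameters and inputs, for illustration only.
w, b = 1.0, -5.0
x = np.array([1.0, 5.0, 9.0])
print(f_wb(x, w, b))  # approximately [0.018, 0.5, 0.982]
```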
Decision Boundary
As mentioned just now, the output value of logistic regression is between 0 and 1, which represents the probability of $y$ being 1. Therefore, we need to set a threshold. When $f_{w,b}$ is greater than or equal to the threshold, $\hat{y}$ is 1; conversely, when $f_{w,b}$ is less than the threshold, $\hat{y}$ is 0.
Looking carefully, when $z$ is greater than or equal to 0, $f_{w,b}$, that is, $g(z)$, is greater than or equal to the threshold 0.5. So, on the left side of $z = 0$ we have $\hat{y} = 0$, and on the right side of $z = 0$ we have $\hat{y} = 1$. This boundary is called the decision boundary. In the example below, we can calculate the decision boundary by setting $z = 0$.
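As a sketch of that calculation for the one-variable model (with hypothetical values of w and b, purely for illustration), setting $z = wx + b = 0$ and solving for $x$ gives the boundary $x = -b / w$.

```python
# Hypothetical parameters, for illustration only.
w, b = 2.0, -6.0

# Setting z = w * x + b = 0 gives the decision boundary x = -b / w.
boundary = -b / w
print(boundary)  # 3.0: y_hat is 0 to the left of x = 3 and 1 to the right
```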
Cost Function
Likelihood Function and Maximum Likelihood Estimation
The likelihood function is a statistical method used to describe how likely the observed data are given the parameters. Suppose we toss a coin 4 times and the results are as follows, where the probability of heads for each toss is $\theta$.

| $x_i$ | Result | Probability |
|---|---|---|
| $x_1$ | Heads | $\theta$ |
| $x_2$ | Heads | $\theta$ |
| $x_3$ | Heads | $\theta$ |
| $x_4$ | Tails | $1 - \theta$ |

According to the above results, its likelihood function is as follows. $P(x_i \mid \theta)$ is a conditional probability, which refers to the probability of $x_i$ given the parameter $\theta$.

$$L(\theta) = P(x_1 \mid \theta) P(x_2 \mid \theta) P(x_3 \mid \theta) P(x_4 \mid \theta) = \theta \cdot \theta \cdot \theta \cdot (1 - \theta) = \theta^3 (1 - \theta)$$

Assuming that the probability of tossing heads is 0.6, the likelihood is $0.6^3 \times 0.4 = 0.0864$.

After obtaining the likelihood function, differentiate it and set the derivative equal to 0 to find the maximum. Here $\frac{dL}{d\theta} = 3\theta^2 - 4\theta^3 = 0$ gives the probability of tossing heads $\theta = \frac{3}{4}$. This is maximum likelihood estimation.
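To make this concrete, the following sketch evaluates the likelihood $\theta^3 (1 - \theta)$ over a grid of $\theta$ values and picks the maximum numerically; it agrees with the analytic answer $\theta = 3/4$.

```python
import numpy as np

# Likelihood of observing 3 heads and 1 tail, as a function of theta.
theta = np.linspace(0, 1, 10001)
likelihood = theta**3 * (1 - theta)

print(theta[np.argmax(likelihood)])  # 0.75
```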
Cost Function and Loss Function
The cost function is used to measure the accuracy of $f_{w,b}$. With the cost function, we can measure whether the adjusted $w$ and $b$ are better than the original ones.
Taking the previous coin toss example, the probability of each coin toss is as follows, where $y_i$ is 1 for heads and 0 for tails:

$$P(x_i \mid \theta) = \begin{cases} \theta & \text{if } y_i = 1 \\ 1 - \theta & \text{if } y_i = 0 \end{cases}$$

For convenience, we can combine the two cases of $P(x_i \mid \theta)$ into one equation, as follows:

$$P(x_i \mid \theta) = \theta^{y_i} (1 - \theta)^{1 - y_i}$$
The likelihood function when tossing a coin $n$ times is as follows. According to maximum likelihood estimation, we look for the $\theta$ that maximizes this likelihood function, and we will derive the cost function from it.

$$L(\theta) = \prod_{i=1}^{n} \theta^{y_i} (1 - \theta)^{1 - y_i}$$
Since the multiplication in the formula is unfavorable for calculation, we first simplify it by taking the logarithm of both sides.

$$\log L(\theta) = \log \prod_{i=1}^{n} \theta^{y_i} (1 - \theta)^{1 - y_i}$$

The logarithm of the product can be expanded into a sum as follows.

$$\log \prod_{i=1}^{n} \theta^{y_i} (1 - \theta)^{1 - y_i} = \sum_{i=1}^{n} \left[ y_i \log \theta + (1 - y_i) \log (1 - \theta) \right]$$

Finally, our likelihood function becomes the log-likelihood function, as follows.

$$\log L(\theta) = \sum_{i=1}^{n} \left[ y_i \log \theta + (1 - y_i) \log (1 - \theta) \right]$$

According to maximum likelihood estimation, we must find the maximum value of the log-likelihood function. However, by convention, we are used to finding the minimum value of a cost function. Therefore, we multiply the log-likelihood function by $-1$ so that we can find the minimum value instead.

$$-\log L(\theta) = \sum_{i=1}^{n} \left[ -y_i \log \theta - (1 - y_i) \log (1 - \theta) \right]$$
In logistic regression, the probability is computed by $f_{w,b}$, so by substituting $f_{w,b}(x^{(i)})$ for $\theta$, we get the loss function of logistic regression.

$$L\left(f_{w,b}(x^{(i)}), y^{(i)}\right) = -y^{(i)} \log f_{w,b}(x^{(i)}) - \left(1 - y^{(i)}\right) \log\left(1 - f_{w,b}(x^{(i)})\right)$$

Finally, the cost function of logistic regression is as follows:

$$J(w, b) = \frac{1}{m} \sum_{i=1}^{m} \left[ -y^{(i)} \log f_{w,b}(x^{(i)}) - \left(1 - y^{(i)}\right) \log\left(1 - f_{w,b}(x^{(i)})\right) \right]$$
In the following code, compute_cost() implements the cost function $J(w, b)$.

```python
def compute_cost(x, y, w, b):
    """
    Compute cost

    Parameters
    ----------
    x: (ndarray (m,))
        m is the number of samples
    y: (ndarray (m,))
    w: (scalar)
    b: (scalar)

    Returns
    -------
    (scalar)
    """
    m = x.shape[0]
    y_hat = f_wb(x, w, b)
    cost = 1 / m * np.sum(-y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat))
    return cost
```
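As a quick check, the sketch below evaluates the cost on a tiny hand-made dataset (the values and parameters are arbitrary, for illustration only); the cost is small when the parameters separate the labels well and much larger when they do not.

```python
# Arbitrary toy data and parameters, for illustration only.
x = np.array([1.0, 2.0, 8.0, 9.0])
y = np.array([0, 0, 1, 1])

print(compute_cost(x, y, w=1.0, b=-5.0))   # small cost: this w, b fits the labels
print(compute_cost(x, y, w=-1.0, b=5.0))   # much larger cost: predictions are flipped
```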
Gradient Descent
Gradient descent is an optimization algorithm used to find a local minimum of a function. Readers can first refer to the chapter on linear regression to understand gradient descent.
The following is the gradient descent algorithm of logistic regression. First, randomly select a set of $w$ and $b$, or set them directly to zero. The partial derivatives of the cost function multiplied by a learning rate $\alpha$ give the amounts by which $w$ and $b$ descend. In addition, we also set a number of iterations, which represents how many times gradient descent is performed; therefore, we may not obtain the optimal $w$ and $b$.

$$\begin{aligned} w &:= w - \alpha \frac{\partial J(w, b)}{\partial w} \\ b &:= b - \alpha \frac{\partial J(w, b)}{\partial b} \end{aligned}$$
The partial derivatives of $J(w, b)$ with respect to $w$ and $b$ are calculated as follows:

$$\frac{\partial J(w, b)}{\partial w} = \frac{1}{m} \sum_{i=1}^{m} \left( f_{w,b}(x^{(i)}) - y^{(i)} \right) x^{(i)}$$

$$\frac{\partial J(w, b)}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} \left( f_{w,b}(x^{(i)}) - y^{(i)} \right)$$
The derivatives of the cost functions of logistic regression and linear regression look the same, but note that their $f_{w,b}$ are different.
In the following code, compute_gradient() implements the calculation of the partial derivatives of $J(w, b)$ with respect to $w$ and $b$.
```python
def compute_gradient(x, y, w, b):
    """
    Compute the gradient for logistic regression

    Parameters
    ----------
    x: (ndarray (m,))
        m is the number of samples
    y: (ndarray (m,))
    w: (scalar)
    b: (scalar)

    Returns
    -------
    dj_dw: (scalar)
    dj_db: (scalar)
    """
    m = x.shape[0]
    dj_dw = 1 / m * np.sum((f_wb(x, w, b) - y) * x)
    dj_db = 1 / m * np.sum(f_wb(x, w, b) - y)
    return dj_dw, dj_db
```
The following code implements gradient descent. The parameter alpha is the learning rate and epochs is the number of iterations.
```python
def perform_gradient_descent(x, y, w_init, b_init, alpha, epochs):
    """
    Perform gradient descent

    Parameters
    ----------
    x: (ndarray (m,))
        m is the number of samples
    y: (ndarray (m,))
    w_init: (scalar)
    b_init: (scalar)
    alpha: (float)
    epochs: (int)

    Returns
    -------
    w: (scalar)
    b: (scalar)
    """
    w = w_init
    b = b_init
    for i in range(epochs):
        dj_dw, dj_db = compute_gradient(x, y, w, b)
        w -= alpha * dj_dw
        b -= alpha * dj_db
    return w, b
```
Example
Next, we will use an example to explain how to use logistic regression. In the following example, there are 20 values. Values less than 10 are given a label of 0, and values greater than or equal to 10 are given a label of 1.
```python
x_train = np.array([1, 0, 17, 8, 13, 19, 15, 10, 8, 7, 3, 6, 17, 3, 4, 17, 11, 12, 16, 13])
y_train = np.int_(x_train[:] >= 10)

plt.scatter(x_train[y_train == 0], y_train[y_train == 0], 60, marker='^', c='c', label='y_train == 0')
plt.scatter(x_train[y_train == 1], y_train[y_train == 1], 60, marker='o', c='m', label='y_train == 1')
plt.xlabel('x_train')
plt.ylabel('y_train')
plt.legend()
```
In the following code, we set the learning rate to 0.01 and the number of iterations to 10,000.
```python
w, b = perform_gradient_descent(x_train, y_train, 0, 0, 0.01, 10000)
w, b

# Output
(np.float64(0.6199251592044446), np.float64(-5.42258186823073))
```
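With these learned parameters we can also read off the decision boundary of this one-variable model: setting $wx + b = 0$ gives $x = -b / w \approx 8.75$, which is close to the cut-off of 10 used to generate the labels. A minimal sketch:

```python
# Decision boundary of the 1-D model: solve w * x + b = 0 for x.
boundary = -b / w
print(boundary)  # approximately 8.75
```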
After that, use the calculated $w$ and $b$ to predict x_train. The resulting prediction is a series of probabilities. We set the threshold to 0.5, so if the probability is less than 0.5, the label 0 is assigned, and if the probability is greater than or equal to 0.5, the label 1 is assigned.
```python
prediction = f_wb(x_train, w, b)
y_hat = np.int_(prediction >= 0.5)
(y_train, prediction, y_hat)
```
Plot y_hat on the coordinate chart and compare it with the chart of y_train above.
```python
plt.scatter(x_train[y_hat == 0], y_train[y_hat == 0], 60, marker='^', c='c', label='y_hat == 0')
plt.scatter(x_train[y_hat == 1], y_train[y_hat == 1], 60, marker='o', c='m', label='y_hat == 1')
plt.xlabel('x_train')
plt.ylabel('y_train')
plt.legend()
```
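To quantify how well the predictions match the labels, we can also compute a simple accuracy score (this snippet is an addition, not part of the original example).

```python
# Fraction of training samples whose predicted label matches the true label.
accuracy = np.mean(y_hat == y_train)
print(accuracy)  # 1.0 for this training set
```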
Multiple Logistic Regression
So far, we have introduced simple logistic regression, which has only one variable. Next, we will introduce multiple logistic regression.
Multiple Logistic Regression Model
The model function of multiple logistic regression is as follows.

$$f_{w,b}(x) = g(w_1 x_1 + w_2 x_2 + \cdots + w_n x_n + b)$$

The vectorized model function is as follows, where $\mathbf{w}$ and $\mathbf{x}$ are vectors.

$$f_{\mathbf{w},b}(\mathbf{x}) = g(\mathbf{w} \cdot \mathbf{x} + b)$$

If we put all the training examples together, they form a matrix. Generally, we use the uppercase letter $X$ to represent the matrix formed by the training examples. Therefore, the model function can be written as follows:

$$f_{\mathbf{w},b}(X) = g(X \mathbf{w} + b)$$
The vectorized $f_{w,b}$ is very similar to the model function of simple logistic regression, and their implementations are also similar. The difference is that the vectorized $f_{w,b}$ uses the inner product (dot product) to multiply $X$ and $\mathbf{w}$.
```python
def sigmoid(z):
    """
    Sigmoid function

    Parameters
    ----------
    z: (ndarray (m,)) or (scalar)
        m is the number of samples

    Returns
    -------
    (ndarray (m,)) or (scalar)
    """
    return 1 / (1 + np.exp(-z))


def f_wb(X, w, b):
    """
    Logistic regression model

    Parameters
    ----------
    X: (ndarray (m, n))
        m is the number of samples, n is the number of features
    w: (ndarray (n,))
    b: (scalar)

    Returns
    -------
    (ndarray (m,))
    """
    return sigmoid(np.dot(X, w) + b)
```
Cost Function
The cost function of multiple logistic regression is as follows.

$$J(\mathbf{w}, b) = \frac{1}{m} \sum_{i=1}^{m} \left[ -y^{(i)} \log f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - \left(1 - y^{(i)}\right) \log\left(1 - f_{\mathbf{w},b}(\mathbf{x}^{(i)})\right) \right]$$
In the following code, compute_cost() implements the vectorized cost function.
```python
def compute_cost(X, y, w, b):
    """
    Compute cost

    Parameters
    ----------
    X: (ndarray (m, n))
        m is the number of samples, n is the number of features
    y: (ndarray (m,))
    w: (ndarray (n,))
    b: (scalar)

    Returns
    -------
    (scalar)
    """
    m = X.shape[0]
    y_hat = f_wb(X, w, b)
    cost = 1 / m * np.sum(-y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat))
    return cost
```
Gradient Descent
The gradient descent of multiple logistic regression is as follows. As can be seen from the formula, we need to update the parameter $w_j$ of each feature, which requires the partial derivative of $J(\mathbf{w}, b)$ with respect to each $w_j$.

$$\begin{aligned} w_j &:= w_j - \alpha \frac{\partial J(\mathbf{w}, b)}{\partial w_j} \\ b &:= b - \alpha \frac{\partial J(\mathbf{w}, b)}{\partial b} \end{aligned}$$

The partial derivatives of $J(\mathbf{w}, b)$ with respect to each $w_j$ and $b$ are as follows:

$$\frac{\partial J(\mathbf{w}, b)}{\partial w_j} = \frac{1}{m} \sum_{i=1}^{m} \left( f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)} \right) x_j^{(i)}$$

$$\frac{\partial J(\mathbf{w}, b)}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} \left( f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)} \right)$$
The following code implements the calculation of the partial derivatives of $J(\mathbf{w}, b)$ with respect to $w_j$ and $b$.
```python
def compute_gradient(X, y, w, b):
    """
    Compute the gradient for logistic regression

    Parameters
    ----------
    X: (ndarray (m, n))
        m is the number of samples, n is the number of features
    y: (ndarray (m,))
    w: (ndarray (n,))
    b: (scalar)

    Returns
    -------
    dj_dw: (ndarray (n,))
    dj_db: (scalar)
    """
    m = X.shape[0]
    y_hat = f_wb(X, w, b)
    dj_dw = 1 / m * np.dot(X.T, y_hat - y)
    dj_db = 1 / m * np.sum(y_hat - y)
    return dj_dw, dj_db
```
The following code implements gradient descent.
```python
def perform_gradient_descent(X, y, w_init, b_init, alpha, epochs):
    """
    Perform gradient descent

    Parameters
    ----------
    X: (ndarray (m, n))
        m is the number of samples, n is the number of features
    y: (ndarray (m,))
    w_init: (ndarray (n,))
    b_init: (scalar)
    alpha: (float)
    epochs: (int)

    Returns
    -------
    w: (ndarray (n,))
    b: (scalar)
    """
    w = w_init
    b = b_init
    for i in range(epochs):
        dj_dw, dj_db = compute_gradient(X, y, w, b)
        w -= alpha * dj_dw
        b -= alpha * dj_db
        if i % 1000 == 0:
            # Log the cost every 1000 epochs to track convergence.
            print(f'Epoch {i + 1:4}, Cost: {float(compute_cost(X, y, w, b)):8.2f}')
    return w, b
```
Example
In the following example, there are 6 points on the coordinate plane and they are divided into two groups.
```python
X_train = np.array([[10, 10], [10, 16], [20, 8], [10, 24], [20, 16], [30, 30]])
y_train = np.array([0, 0, 0, 1, 1, 1])

plt.scatter(X_train[y_train == 0, 0], X_train[y_train == 0, 1], 60, marker='^', c='c', label='y_train == 0')
plt.scatter(X_train[y_train == 1, 0], X_train[y_train == 1, 1], 60, marker='o', c='m', label='y_train == 1')
plt.xlabel('x_train[:,0]')
plt.ylabel('x_train[:,1]')
plt.legend()
```
Use gradient descent to train the model and find $\mathbf{w}$ and $b$.
```python
np.random.seed(1)
initial_w = 0.01 * (np.random.rand(2) - 0.5)
initial_b = -8
w, b = perform_gradient_descent(X_train, y_train, initial_w, initial_b, 0.001, 10000)
w, b

# Output
Epoch    1, Cost:     3.75
Epoch 1001, Cost:     0.17
Epoch 2001, Cost:     0.17
Epoch 3001, Cost:     0.17
Epoch 4001, Cost:     0.17
Epoch 5001, Cost:     0.17
Epoch 6001, Cost:     0.17
Epoch 7001, Cost:     0.17
Epoch 8001, Cost:     0.17
Epoch 9001, Cost:     0.17
(array([0.17254296, 0.36047453]), np.float64(-8.219025094555942))
```
After calculating $\mathbf{w}$ and $b$, we can draw the decision boundary.
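The boundary line comes from solving $\mathbf{w} \cdot \mathbf{x} + b = 0$ for the second feature, which is exactly what the plot_y expression below computes:

$$w_1 x_1 + w_2 x_2 + b = 0 \;\Rightarrow\; x_2 = -\frac{w_1 x_1 + b}{w_2}$$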
```python
plt.scatter(X_train[y_train == 0, 0], X_train[y_train == 0, 1], 60, marker='^', c='c', label='y_train == 0')
plt.scatter(X_train[y_train == 1, 0], X_train[y_train == 1, 1], 60, marker='o', c='m', label='y_train == 1')

plot_x = np.array([min(X_train[:, 0]), max(X_train[:, 0])])
plot_y = (-1. / w[1]) * (w[0] * plot_x + b)
plt.plot(plot_x, plot_y, c='g')

plt.xlabel('x_train[:,0]')
plt.ylabel('x_train[:,1]')
plt.legend()
```
Finally, we add a new point (25, 25) and use the model to classify it.
```python
plt.scatter(X_train[y_train == 0, 0], X_train[y_train == 0, 1], 60, marker='^', c='c', label='y_train == 0')
plt.scatter(X_train[y_train == 1, 0], X_train[y_train == 1, 1], 60, marker='o', c='m', label='y_train == 1')

plot_x = np.array([min(X_train[:, 0]), max(X_train[:, 0])])
plot_y = (-1. / w[1]) * (w[0] * plot_x + b)
plt.plot(plot_x, plot_y, c='g')

x = np.array([25, 25])
y_hat = f_wb(x, w, b)
plt.scatter(x[0], x[1], 60, marker='*', c='m' if y_hat >= 0.5 else 'c', label=f'Prediction: {y_hat:.2f}')

plt.xlabel('x_train[:,0]')
plt.ylabel('x_train[:,1]')
plt.legend()
```
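Using the parameters printed above, we can also verify the classification of the new point by hand:

$$z = 0.1725 \times 25 + 0.3605 \times 25 - 8.219 \approx 5.11, \qquad g(5.11) \approx 0.99$$

Since this probability is well above the 0.5 threshold, the point (25, 25) is classified as 1.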
Conclusion
The logistic regression model is very similar to linear regression. It uses the linear regression model function to calculate a value, and then uses the sigmoid function to cleverly convert that value into a probability. Although the concept of logistic regression is quite simple, it is often used in practice.