Logistic Regression

Logistic regression is a data analysis technique used for classification. It is probably the simplest model among classification algorithms. Therefore, it is very suitable as the first classification algorithm for beginners.

The complete code for this chapter can be found in .

Logistic Regression

Sigmoid Function

Before introducing logistic regression, let's first look at the sigmoid function. The sigmoid function, shown below, converts z into a value between 0 and 1.

g(z)=\frac{1}{1+e^{-z}}, 0<g(z)<1

As shown in the figure below, the two ends of the sigmoid function will be very close to 0 and 1.

Sigmoid function, outputs between 0 and 1.

In the following code, sigmoid() implements the sigmoid function.

# NumPy is used for numerical computation and Matplotlib for plotting throughout this chapter
import numpy as np
import matplotlib.pyplot as plt


def sigmoid(z):
    """
    Sigmoid function

    Parameters
    ----------
    z: (ndarray (m,)) or (scalar)
        m is the number of samples

    Returns
    -------
    (ndarray (m,)) or (scalar)
    """

    return 1 / (1 + np.exp(-z))
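
For a quick sanity check (my own addition, not part of the original code), we can evaluate sigmoid() at a few points: it returns 0.5 at z = 0 and approaches 0 and 1 at the two tails, matching the figure above.

print(sigmoid(0))                       # 0.5
print(sigmoid(np.array([-10, 0, 10])))  # approximately [0.0000454, 0.5, 0.99995]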

Logistic Regression Model

Logistic regression is a regression model used for classification, and classification is naturally related to probability. The output of the sigmoid function lies between 0 and 1, meeting the basic requirement that a probability must be between 0% and 100%. The logistic regression model first substitutes x into the linear regression model to obtain z, and then passes z through the sigmoid function to obtain a value between 0 and 1. The model function of logistic regression is as follows.

z=wx^{(i)}+b \\\\ g(z)=\frac{1}{1+e^{-z}} \\\\ f_{w,b}(x^{(i)})=g(wx^{(i)}+b)=\frac{1}{1+e^{-(wx^{(i)}+b)}} \\\\ \text{parameters}:w,b \\\\ i:i \text{-th example}

Assume that x is the size of a tumor: when y is 0, the tumor is non-malignant, and when y is 1, it is malignant. If we pass x into a logistic regression model and obtain 0.7, this means there is a 70% chance that y is 1. Because the probability that y is 0 and the probability that y is 1 sum to 1, a 70% probability that y is 1 means there is a 30% probability that y is 0.

P(y=0)+P(y=1)=1

In the following code, f_wb() implements the logistic regression model.

def f_wb(x, w, b):
    """
    Logistic regression model

    Parameters
    ----------
    x: (ndarray (m,)) or (scalar)
        m is the number of samples
    w: (scalar)
    b: (scalar)

    Returns
    -------
    (ndarray (m,)) or (scalar)
    """

    return sigmoid(w * x + b)

Decision Boundary

As mentioned just now, the output value of logistic regression is between 0 and 1, which represents the probability of y being 1. Therefore, we need to set a threshold. When fw,b is greater than or equal to the threshold, ŷ is 1; conversely, when fw,b is less than the threshold, ŷ is 0.

When the threshold is 0.5.

Looking carefully, when z is greater than or equal to 0, fw,b, that is, g(z), will be greater than or equal to the threshold 0.5. So, the left side of z = 0 is ŷ = 0, and the right side of z = 0 is ŷ = 1. This boundary is called the decision boundary.

Decision boundary: z=0.

In the example below, we can calculate the decision boundary by setting z = 0.

Example of decision boundary.
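
As a minimal sketch (the helper names predict() and decision_boundary_1d() below are my own, not from the original code), thresholding at 0.5 and solving z = wx + b = 0 for the single-feature case could look like this:

def predict(x, w, b, threshold=0.5):
    # y_hat = 1 when f_wb(x) >= threshold, otherwise 0
    return np.int_(f_wb(x, w, b) >= threshold)


def decision_boundary_1d(w, b):
    # With threshold 0.5, y_hat flips where z = w * x + b = 0, i.e. at x = -b / w
    return -b / w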

Cost Function

Likelihood Function and Maximum Likelihood Estimation

The likelihood function is a statistical tool used to describe how likely the observed data are given the parameters. Suppose we toss a coin 4 times and get the results below, where the probability of heads on each toss is \theta .

x_i | Result | Probability
x_1 | Heads  | \theta
x_2 | Heads  | \theta
x_3 | Heads  | \theta
x_4 | Tails  | 1-\theta

According to the above results, the likelihood function L(\theta) is as follows. p(x_i|\theta) is a conditional probability: the probability of observing x_i given the parameter \theta .

L(\theta)=\displaystyle\prod_{i=1}^{4}p(x_i|\theta) \\\\ \hphantom{L(\theta)}=p(x_1|\theta) \cdot p(x_2|\theta) \cdot p(x_3|\theta) \cdot p(x_4|\theta) \\\\ \hphantom{L(\theta)}=\theta \cdot \theta \cdot \theta \cdot (1-\theta) \\\\ \hphantom{L(\theta)}=\theta^3 \cdot (1-\theta)

Assuming that the probability of tossing heads \theta is 0.6, the likelihood function L(\theta) is 0.0864.

L(\theta)=0.6^3 \cdot 0.4=0.0864

After obtaining the likelihood function, we differentiate it and set the derivative equal to 0 to find the maximum. This gives the probability of heads \theta=\frac{3}{4} . This is maximum likelihood estimation.

\frac{d}{d\theta}L(\theta) =0 \\\\ \Rightarrow \frac{d}{d\theta}(\theta^3 \cdot (1-\theta))=0 \\ \Rightarrow \frac{d}{d\theta}(\theta^3-\theta^4)=0 \\\\ \Rightarrow 3\theta^2-4\theta^3=0 \\\\ \Rightarrow \theta=\frac{3}{4}
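
As a quick numerical check (my own addition, not part of the original derivation), we can evaluate L(\theta)=\theta^3(1-\theta) on a grid and confirm that it peaks at \theta = 3/4.

theta = np.linspace(0, 1, 1001)        # candidate values of theta
likelihood = theta ** 3 * (1 - theta)  # L(theta) for the 4 tosses above
print(theta[np.argmax(likelihood)])    # 0.75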

Cost Function and Loss Function

The cost function is used to measure the accuracy of fw,b. With the cost function, we can tell whether adjusted values of w and b are better than the original ones.

Taking the previous coin toss example, the probability of each coin toss is as follows:

p(x^{(i)}|\theta)=\begin{cases} \theta & \text{if } y^{(i)}=1 \\ 1-\theta & \text{if } y^{(i)}=0 \end{cases} \\\\ \text{0: tails, 1: heads}

For convenience, we can combine the two equations of p(x^{(i)}|\theta) into one equation, as follows:

p(x^{(i)}|\theta)=\theta^{y^{(i)}} \cdot (1-\theta)^{1-y^{(i)}}

The likelihood function when tossing a coin n times is as follows. According to maximum likelihood estimation, we look for the parameters that maximize the likelihood function; the cost function will be derived from it.

L(\theta)=\displaystyle\prod_{i=1}^{n}p(x_i|\theta)

Since the product in the formula is inconvenient to work with, we first simplify it by taking the logarithm of both sides.

L(\theta)=\displaystyle\prod_{i=1}^{n}p(x_i|\theta) \\\\ \Rightarrow L(\theta)=p(x_1|\theta) \cdot p(x_2|\theta) \cdots p(x_n|\theta) \\\\ \Rightarrow \ln L(\theta)=\ln p(x_1|\theta) + \ln p(x_2|\theta) +\cdots+ \ln p(x_n|\theta) \\\\ \Rightarrow \ln L(\theta)=\displaystyle\sum_{i=1}^{n} \ln p(x_i|\theta)

The \ln p(x_i|\theta) can be expanded as follows.

\ln p(x_i|\theta)=\ln (\theta^{y^{(i)}} \cdot (1-\theta)^{1-y^{(i)}}) \\\\ \hphantom{\ln p(x_i|\theta)}=y^{(i)}\ln\theta+(1-y^{(i)})\ln(1-\theta)

Finally, our likelihood function becomes the log-likelihood function, as follows.

\ln L(\theta)=\displaystyle\sum_{i=1}^{n} \ln p(x_i|\theta) \\\\ \hphantom{\ln L(\theta)}=\displaystyle\sum_{i=1}^{n}\left[y^{(i)}\ln\theta+(1-y^{(i)})\ln(1-\theta)\right]

According to maximum likelihood estimation, we must find the maximum value of the likelihood function. However, by convention, we minimize a cost function rather than maximize it. Therefore, we multiply the log-likelihood by -1 so that we can find the minimum value instead.

-\ln L(\theta)=\displaystyle\sum_{i=1}^{n} \left[-y^{(i)}\ln\theta -(1-y^{(i)})\ln(1-\theta)\right]

In logistic regression, the probability \theta is given by fw,b, so substituting it into p(x_i|\theta) yields the loss function of logistic regression.

loss(f_{w,b}(x^{(i)}),y^{(i)})=-y^{(i)}\log(f_{w,b}(x^{(i)}))-(1-y^{(i)})\log(1-f_{w,b}(x^{(i)}))
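
To get a feel for this loss (a small illustration of my own, not from the original text): when y is 1, the loss is -log(f), which is near 0 if f is close to 1 and grows without bound as f approaches 0; when y is 0, the roles are reversed.

for y_i, f_i in [(1, 0.99), (1, 0.01), (0, 0.01), (0, 0.99)]:
    loss_i = -y_i * np.log(f_i) - (1 - y_i) * np.log(1 - f_i)
    # confident correct predictions give a small loss, confident wrong ones a large loss
    print(y_i, f_i, round(loss_i, 2))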

Finally, the cost function of logistic regression is as follows:

J(w,b)=\frac{1}{m}\displaystyle\sum^{m}_{i=1}loss(f_{w,b}(x^{(i)}),y^{(i)}) \\\\ \hphantom{J(w,b)}=\frac{1}{m}\displaystyle\sum^{m}_{i=1}-y^{(i)}\log(f_{w,b}(x^{(i)}))-(1-y^{(i)})\log(1-f_{w,b}(x^{(i)})) \\\\ m:\text{number of examples} \\\\ \text{Objective}: \text{minimize}_{w,b}J(w,b)

In the following code, compute_cost() implements the cost function J(w,b).

def compute_cost(x, y, w, b):
    """
    Compute cost

    Parameters
    ----------
    x: (ndarray (m,))
        m is the number of samples
    y: (ndarray (m,))
    w: (scalar)
    b: (scalar)

    Returns
    -------
    (scalar)
    """

    m = x.shape[0]
    y_hat = f_wb(x, w, b)
    cost = 1 / m * np.sum(-y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat))
    return cost

Gradient Descent

Gradient descent is an optimization algorithm used to find a local minimum of a function. Readers can refer to the Linear Regression chapter first to understand gradient descent.

The following is the gradient descent algorithm for logistic regression. First, randomly select a set of w and b, or simply set them to zero. The partial derivatives of the cost function, multiplied by the learning rate \alpha, give the amount by which w and b descend in each step. In addition, we also specify a number of iterations, which is how many times gradient descent is performed. Therefore, we may not obtain the optimal w and b.

\text{repeat until convergence } \{ \\\\ \phantom{xxxx} w=w-\alpha \frac{\partial J(w,b)}{\partial w} \\\\ \phantom{xxxx} b=b-\alpha \frac{\partial J(w,b)}{\partial b} \\\\ \}

The partial derivatives of J(w, b) with respect to w and b are calculated as follows:

\frac{\partial}{\partial w}J(w,b)=\frac{1}{m} \displaystyle\sum^{m}_{i=1} (f_{w,b}(x^{(i)})-y^{(i)})x^{(i)} \\\\ \frac{\partial}{\partial b}J(w,b)=\frac{1}{m} \displaystyle\sum^{m}_{i=1} (f_{w,b}(x^{(i)})-y^{(i)})

The derivatives of the cost functions of logistic regression and linear regression look the same, but note that their fw,b are different.

\text{Linear regression: } f_{w,b}(x^{(i)})=wx^{(i)}+b \\ \text{Logistic regression: } f_{w,b}(x^{(i)})=\frac{1}{1+e^{-(wx^{(i)}+b)}}

In the following code, compute_gradient() implements the calculation of the partial derivatives of J(w,b) with respect to w and b.

def compute_gradient(x, y, w, b):
    """
    Compute the gradient for logistic regression

    Parameters
    ----------
    x: (ndarray (m,))
        m is the number of samples
    y: (ndarray (m,))
    w: (scalar)
    b: (scalar)

    Returns
    -------
    dj_dw: (scalar)
    dj_db: (scalar)
    """

    m = x.shape[0]
    dj_dw = 1 / m * np.sum((f_wb(x, w, b) - y) * x)
    dj_db = 1 / m * np.sum(f_wb(x, w, b) - y)
    return dj_dw, dj_db

The following code implements gradient descent. The parameter alpha is the learning rate and epochs is the number of iterations.

def perform_gradient_descent(x, y, w_init, b_init, alpha, epochs):
    """
    Perform gradient descent

    Parameters
    ----------
    x: (ndarray (m))
        m is the number of samples
    y: (ndarray (m,))
    w_init: (scalar)
    b_init: (scalar)
    alpha: (float)
    epochs: (int)

    Returns
    -------
    w: (scalar)
    b: (scalar)
    """

    w = w_init
    b = b_init
    for i in range(epochs):
        dj_dw, dj_db = compute_gradient(x, y, w, b)
        w -= alpha * dj_dw
        b -= alpha * dj_db
    return w, b

Example

Next, we will use an example to explain how to use logistic regression. In the following example, there are 20 values. Values less than 10 are given the label 0, and values greater than or equal to 10 are given the label 1.

x_train = np.array([1, 0, 17, 8, 13, 19, 15, 10, 8, 7, 3, 6, 17, 3, 4, 17, 11, 12, 16, 13])
y_train = np.int_(x_train[:] >= 10)

plt.scatter(x_train[y_train == 0], y_train[y_train == 0], 60, marker='^', c='c', label='y_train == 0')
plt.scatter(x_train[y_train == 1], y_train[y_train == 1], 60, marker='o', c='m', label='y_train == 1')
plt.xlabel('x_train')
plt.ylabel('y_train')
plt.legend()
Example: 0: < 10, 1 >= 10.

In the following code, we set the learning rate to 0.01 and the number of iterations to 10,000.

w, b = perform_gradient_descent(x_train, y_train, 0, 0, 0.01, 10000)
w, b

# Output
(np.float64(0.6199251592044446), np.float64(-5.42258186823073))
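
Since ŷ switches at f_wb = 0.5, that is, at z = wx + b = 0, the learned boundary sits at x = -b/w (a quick check of my own, not in the original). With the w and b above this is roughly 8.75, reasonably close to the true split at 10.

print(-b / w)  # roughly 8.75 with the w and b printed above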

After that, use the calculated w and b to make predictions on x_train. The resulting predictions are a series of probabilities. We set the threshold to 0.5, so a probability less than 0.5 is given the label 0, and a probability greater than or equal to 0.5 is given the label 1.

prediction = f_wb(x_train, w, b)
y_hat = np.int_(prediction >= 0.5)
(y_train, prediction, y_hat)
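
As an additional sanity check (my own addition), we can compare y_hat with y_train to compute the training accuracy; here it should come out to 1.0, since no training value falls between the learned boundary (about 8.75) and the true split at 10.

accuracy = np.mean(y_hat == y_train)
print(accuracy)  # expected to be 1.0 for this training set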

Plot y_hat on the coordinate chart and compare it with the coordinate chart of y_train above.

plt.scatter(x_train[y_hat == 0], y_train[y_hat == 0], 60, marker='^', c='c', label='y_hat == 0')
plt.scatter(x_train[y_hat == 1], y_train[y_hat == 1], 60, marker='o', c='m', label='y_hat == 1')
plt.xlabel('x_train')
plt.ylabel('y_train')
plt.legend()
y_hat with decision boundary 0.5, 0: < 0.5, 1: >= 0.5.

Multiple Logistic Regression

So far, we have introduced simple logistic regression, which has only one variable. Next, we will introduce multiple logistic regression.

Multiple Logistic Regression Model

The model function of multiple logistic regression is as follows.

f_{w,b}(x^{(i)})=g(z^{(i)})=\frac{1}{1+e^{-z^{(i)}}} \\\\ z^{(i)}=w_1x_1^{(i)}+w_2x_2^{(i)}+\cdots+w_nx_n^{(i)}+b

The vectorized model function is as follows, where \vec{x},\vec{w} are vectors.

f_{\vec{w},b}(\vec{x}^{(i)})=g(z^{(i)})=\frac{1}{1+e^{-z^{(i)}}} \\\\ z^{(i)}=\displaystyle\sum_{j=1}^{n}w_jx_j^{(i)}+b=\vec{w}\cdot\vec{x}^{(i)}+b \\\\ \vec{w}=[w_1,w_2,\cdots,w_n] \\\\ \vec{x}^{(i)}=[x_1^{(i)},x_2^{(i)},\cdots,x_n^{(i)}]

If we put all the training examples together, they form a matrix. Generally, we use the uppercase letter X to represent the matrix formed by the training examples. Therefore, the model function can be written as follows:

f_{w,b}(X)=g(Z)=g(X\cdot w+b)=\frac{1}{1+e^{-(X\cdot w+b)}} \\\\ X=\begin{bmatrix} x_1^{(1)} & x_2^{(1)} & \cdots & x_{n}^{(1)} \\ x_1^{(2)} & x_2^{(2)} & \cdots & x_{n}^{(2)} \\ \vdots & \vdots & \ddots & \vdots \\ x_1^{(m)} & x_2^{(m)} & \cdots & x_{n}^{(m)} \\ \end{bmatrix}, w=\begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_{n} \end{bmatrix}, b \text{ is a scalar} \\\\ y=\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_{m} \end{bmatrix} \\\\ m: \text{number of examples} \\\\ n: \text{number of features in each example} \\\\ x_j^{(i)}: j \text{-th feature in the } i \text{-th example}

The vectorized fw,b is very similar to the model function of simple logistic regression, and their implementations are also similar. The difference is that the vectorized fw,b uses the dot product to multiply X and w.

def sigmoid(z):
    """
    Sigmoid function

    Parameters
    ----------
    z: (ndarray (m,)) or (scalar)
        m is the number of samples

    Returns
    -------
    (ndarray (m,)) or (scalar)
    """

    return 1 / (1 + np.exp(-z))


def f_wb(X, w, b):
    """
    Logistic regression model

    Parameters
    ----------
    X: (ndarray (m, n))
        m is the number of samples, n is the number of features
    w: (ndarray (n,))
    b: (scalar)

    Returns
    -------
    (ndarray (m,))
    """

    return sigmoid(np.dot(X, w) + b)

Cost Function

The cost function of multiple logistic regression is as follows.

J(\vec{w},b)=\frac{1}{m}\displaystyle\sum^{m}_{i=1}-y^{(i)}\log(f_{\vec{w},b}(\vec{x}^{(i)}))-(1-y^{(i)})\log(1-f_{\vec{w},b}(\vec{x}^{(i)})) \\\\ f_{\vec{w},b}(\vec{x}^{(i)})=g(\vec{w}\cdot \vec{x}^{(i)}+b)=\frac{1}{1+e^{-(\vec{w}\cdot \vec{x}^{(i)}+b)}}

In the following code, compute_cost() implements the vectorized cost function.

def compute_cost(X, y, w, b):
    """
    Compute cost

    Parameters
    ----------
    X: (ndarray (m, n))
        m is the number of samples, n is the number of features
    y: (ndarray (m,))
    w: (ndarray (n,))
    b: (scalar)

    Returns
    -------
    (scalar)
    """

    m = X.shape[0]
    y_hat = f_wb(X, w, b)
    cost = 1 / m * np.sum(-y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat))
    return cost

Gradient Descent

The gradient descent of multiple logistic regression is as follows. As the formula shows, each feature has its own parameter wj, so we calculate the partial derivative of J(w,b) with respect to each wj.

\text{repeat until convergence: \{} \\ \phantom{xxxx} w_j=w_j-\alpha\frac{\partial J(w,b)}{\partial w_j}, \text{ for } j=1\dots n \\ \phantom{xxxx} b=b-\alpha\frac{\partial J(w,b)}{\partial b} \\ \}

The partial derivatives of J(w,b) with respect to each wj and b are as follows:

\frac{\partial J(w,b)}{\partial w_j}=\frac{1}{m} \displaystyle\sum^{m}_{i=1} (f_{w,b}(x^{(i)})-y^{(i)})x^{(i)}_j \\\\ \frac{\partial J(w,b)}{\partial b}=\frac{1}{m} \displaystyle\sum^{m}_{i=1} (f_{w,b} (x^{(i)})-y^{(i)})

The following code implements the calculation of the partial derivatives of J(w,b) with respect to wj and b.

def compute_gradient(X, y, w, b):
    """
    Compute the gradient for logistic regression

    Parameters
    ----------
    X: (ndarray (m, n))
        m is the number of samples, n is the number of features
    y: (ndarray (m,))
    w: (ndarray (n,))
    b: (scalar)

    Returns
    -------
    dj_dw: (ndarray (n,))
    dj_db: (scalar)
    """

    m = X.shape[0]
    y_hat = f_wb(X, w, b)
    dj_dw = 1 / m * np.dot(X.T, y_hat - y)
    dj_db = 1 / m * np.sum(y_hat - y)
    return dj_dw, dj_db

The following code implements gradient descent.

def perform_gradient_descent(X, y, w_init, b_init, alpha, epochs):
    """
    Perform gradient descent

    Parameters
    ----------
    X: (ndarray (m, n))
        m is the number of samples, n is the number of features
    y: (ndarray (m,))
    w_init: (ndarray (n,))
    b_init: (scalar)
    alpha: (float)
    epochs: (int)

    Returns
    -------
    w: (ndarray (n,))
    b: (scalar)
    """

    w = w_init
    b = b_init
    for i in range(epochs):
        dj_dw, dj_db = compute_gradient(X, y, w, b)
        w -= alpha * dj_dw
        b -= alpha * dj_db
        # print the cost every 1000 epochs, matching the output shown below
        if i % 1000 == 0:
            print(f'Epoch {i + 1:4}, Cost: {float(compute_cost(X, y, w, b)):8.2f}')
    return w, b

Example

In the following example, there are 6 points on the coordinate plane and they are divided into two groups.

X_train = np.array([[10, 10], [10, 16], [20, 8], [10, 24], [20, 16], [30, 30]])
y_train = np.array([0, 0, 0, 1, 1, 1])

plt.scatter(X_train[y_train == 0, 0], X_train[y_train == 0, 1], 60, marker='^', c='c', label='y_train == 0')
plt.scatter(X_train[y_train == 1, 0], X_train[y_train == 1, 1], 60, marker='o', c='m', label='y_train == 1')
plt.xlabel('x_train[:,0]')
plt.ylabel('x_train[:,1]')
plt.legend()
Example with 6 points.

Use gradient descent to train the model to find w and b.

np.random.seed(1)
initial_w = 0.01 * (np.random.rand(2) - 0.5)
initial_b = -8
w, b = perform_gradient_descent(X_train, y_train, initial_w, initial_b, 0.001, 10000)
w, b

# Output
Epoch    1, Cost:     3.75
Epoch 1001, Cost:     0.17
Epoch 2001, Cost:     0.17
Epoch 3001, Cost:     0.17
Epoch 4001, Cost:     0.17
Epoch 5001, Cost:     0.17
Epoch 6001, Cost:     0.17
Epoch 7001, Cost:     0.17
Epoch 8001, Cost:     0.17
Epoch 9001, Cost:     0.17
(array([0.17254296, 0.36047453]), np.float64(-8.219025094555942))
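
As with the one-dimensional example, we can check (my own addition, not in the original) how the trained model classifies the training points themselves.

y_hat = np.int_(f_wb(X_train, w, b) >= 0.5)
print(y_hat)  # with the w and b above, this should match y_train = [0, 0, 0, 1, 1, 1]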

After calculating w and b, we can draw the decision boundary.

plt.scatter(X_train[y_train == 0, 0], X_train[y_train == 0, 1], 60, marker='^', c='c', label='y_train == 0')
plt.scatter(X_train[y_train == 1, 0], X_train[y_train == 1, 1], 60, marker='o', c='m', label='y_train == 1')
plot_x = np.array([min(X_train[:, 0]), max(X_train[:, 0])])
plot_y = (-1. / w[1]) * (w[0] * plot_x + b)
plt.plot(plot_x, plot_y, c='g')
plt.xlabel('x_train[:,0]')
plt.ylabel('x_train[:,1]')
plt.legend()
A decision boundary separates points into 2 groups.

Finally, we add a new point (25, 25) and use the model to classify it.

plt.scatter(X_train[y_train == 0, 0], X_train[y_train == 0, 1], 60, marker='^', c='c', label='y_train == 0')
plt.scatter(X_train[y_train == 1, 0], X_train[y_train == 1, 1], 60, marker='o', c='m', label='y_train == 1')
plot_x = np.array([min(X_train[:, 0]), max(X_train[:, 0])])
plot_y = (-1. / w[1]) * (w[0] * plot_x + b)
plt.plot(plot_x, plot_y, c='g')

x = np.array([25, 25])
y_hat = f_wb(x, w, b)
plt.scatter(x[0], x[1], 60, marker='*', c='m' if y_hat >= 0.5 else 'c', label=f'Prediction: {y_hat:.2f}')

plt.xlabel('x_train[:,0]')
plt.ylabel('x_train[:,1]')
plt.legend()

Conclusion

The logistic regression model is very similar to linear regression. It uses the linear regression model function to compute a value, and then cleverly uses the sigmoid function to convert that value into a probability. Although the concept of logistic regression is quite simple, it is often used in practice.
