Regularization

Photo by Ashley Knedler on Unsplash
When a model performs poorly, it cannot predict the data accurately. The usual cause is either overfitting or underfitting. If the problem is overfitting, regularization is one way to address it.

Regularization

Regularization is a method for preventing overfitting. Suppose we have an overfitted model, as shown on the lower left. We can make w_3 and w_4 very small, or push them toward 0, to reduce the impact of x_3 and x_4 on the model, so that the model becomes simpler, as shown on the lower right. This is the basic idea of regularization.

Use regularization to reduce the size of parameters.

Gradient descent searches for the minimum of the cost function. If we append 1000w_3^2 and 1000w_4^2 to the cost function, any sizable value of w_3 or w_4 makes the cost very large, so the minimum found by gradient descent will have w_3 and w_4 small or close to 0. Therefore, by modifying the cost function, we can reduce the impact of x_3 and x_4 on the model during training.

\min_{\vec{w},b} \left[ \frac{1}{2m} \displaystyle\sum_{i=1}^m (f_{\vec{w},b}(\vec{x}^{(i)})-y^{(i)})^2 + 1000 w_3^2 + 1000 w_4^2 \right]
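
To see why such a large penalty pushes a weight toward 0, here is a one-parameter sketch. It assumes, purely for illustration, that the fit term near its minimum behaves like \frac{1}{2}(w_3-a)^2, where a is a hypothetical value that w_3 would take without the penalty. Neither a nor g comes from the model above.

g(w_3)=\frac{1}{2}(w_3-a)^2+1000w_3^2 \\\\ \frac{dg}{dw_3}=(w_3-a)+2000w_3=0 \\\\ w_3=\frac{a}{2001}

Under this assumption the penalized minimum is about 2000 times smaller than a, which is the sense in which w_3 tends to 0.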

The following formula is the cost function with regularization added. The term added at the end is called the regularization term, and λ is called the regularization parameter. If λ is set to a very large value, such as 10^{10}, then every w_j will tend to 0. Therefore, we can control how strongly the w_j are shrunk by adjusting λ.

J(\vec{w},b)=\frac{1}{2m} \displaystyle\sum_{i=1}^{m} (f_{\vec{w},b}(\vec{x}^{(i)})-y^{(i)})^2 + \frac{\lambda}{2m} \displaystyle\sum_{j=1}^n w_j^2
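
The regularized cost J can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names `predict` and `regularized_cost`, the parameter name `lam` (for λ), and the toy data are all assumptions made for this example.

```python
def predict(w, b, x):
    """Linear model f_{w,b}(x) = w . x + b for a single example x."""
    return sum(w_j * x_j for w_j, x_j in zip(w, x)) + b


def regularized_cost(X, y, w, b, lam):
    """Squared-error cost plus the term (lambda / 2m) * sum_j w_j^2."""
    m = len(X)
    squared_error = sum((predict(w, b, x_i) - y_i) ** 2 for x_i, y_i in zip(X, y))
    penalty = sum(w_j ** 2 for w_j in w)  # b is not part of the penalty
    return squared_error / (2 * m) + lam * penalty / (2 * m)


# Hypothetical toy data: two examples, two features, fitted exactly by w and b.
X = [[1.0, 2.0], [3.0, 4.0]]
y = [5.0, 11.0]
w = [1.0, 2.0]
b = 0.0

print(regularized_cost(X, y, w, b, lam=0.0))  # 0.0 (perfect fit, no penalty)
print(regularized_cost(X, y, w, b, lam=4.0))  # 5.0 (penalty only: 4 * 5 / 4)
```

Because the fit is perfect here, the whole cost at lam=4.0 comes from the regularization term, which makes it easy to see how the size of w is charged independently of the prediction error.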

Regularized Linear Regression

Regularized linear regression uses the cost function of linear regression with the regularization term added, as follows.

\min_{\vec{w},b} J(\vec{w},b) = \min_{\vec{w},b} \left[ \frac{1}{2m} \displaystyle\sum_{i=1}^m (f_{\vec{w},b}(\vec{x}^{(i)})-y^{(i)})^2 + \frac{\lambda}{2m} \displaystyle\sum_{j=1}^n w_j^2 \right]

The gradient descent algorithm is as follows.

\text{repeat \{} \\\\ \phantom{xxxx} w_j=w_j-\alpha \frac{\partial}{\partial w_j}J(\vec{w},b) \\\\ \phantom{xxxx} b=b-\alpha \frac{\partial}{\partial b}J(\vec{w},b) \\\\ \text{\}}

After we expand the derivative part, it becomes the following formula.

\text{repeat \{} \\\\ \phantom{xxxx} w_j=w_j-\alpha \left[ \frac{1}{m} \displaystyle\sum_{i=1}^m \left[ (f_{\vec{w},b}(\vec{x}^{(i)})-y^{(i)})x_j^{(i)} \right] + \frac{\lambda}{m} w_j \right] \\\\ \phantom{xxxx} b=b-\alpha \frac{1}{m} \displaystyle\sum_{i=1}^m (f_{\vec{w},b}(\vec{x}^{(i)})-y^{(i)}) \\\\ \text{\}}

The update for w_j simplifies as follows. Each iteration first multiplies w_j by (1 - αλ/m), a factor slightly less than 1 for typical values of α and λ, so a larger λ shrinks w_j more at each step.

w_j=w_j(1-\alpha \frac{\lambda}{m})- \alpha \frac{1}{m} \displaystyle\sum_{i=1}^m (f_{\vec{w},b}(\vec{x}^{(i)})-y^{(i)}) x_j^{(i)}
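
The simplified update can be tried out on a small problem. The sketch below uses the shrink form of the update directly. The names `gradient_step` and `train`, the parameter `lam` (for λ), the learning rate, the step count, and the toy data are assumptions chosen for illustration.

```python
def predict(w, b, x):
    """Linear model f_{w,b}(x) = w . x + b for a single example x."""
    return sum(w_j * x_j for w_j, x_j in zip(w, x)) + b


def gradient_step(X, y, w, b, alpha, lam):
    """One simultaneous update of every w_j and of b."""
    m = len(X)
    errors = [predict(w, b, x_i) - y_i for x_i, y_i in zip(X, y)]
    new_w = []
    for j, w_j in enumerate(w):
        grad_fit = sum(e * x_i[j] for e, x_i in zip(errors, X)) / m
        # Shrink w_j by (1 - alpha * lambda / m), then apply the usual step.
        new_w.append(w_j * (1 - alpha * lam / m) - alpha * grad_fit)
    new_b = b - alpha * sum(errors) / m  # b is not regularized
    return new_w, new_b


def train(X, y, alpha, lam, steps):
    """Run gradient descent from w = 0, b = 0 for a fixed number of steps."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(steps):
        w, b = gradient_step(X, y, w, b, alpha, lam)
    return w, b


# Hypothetical toy data generated by y = 2x.
X = [[1.0], [2.0], [3.0]]
y = [2.0, 4.0, 6.0]

w_plain, b_plain = train(X, y, alpha=0.05, lam=0.0, steps=2000)
w_reg, b_reg = train(X, y, alpha=0.05, lam=10.0, steps=2000)
print(w_plain, b_plain)
print(w_reg, b_reg)
```

On this data the unregularized run recovers a weight close to 2, while the run with lam=10.0 settles near 1/3, with b absorbing the difference because b is not penalized.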

Regularized Logistic Regression

Adding the regularization term to the cost function of logistic regression gives the following formula.

z=w_1x_1+w_2x_2+w_3x_1^2x_2+w_4x_1^2x_2^2+w_5x_1^2x_2^3+ \cdots +b \\\\ f_{\vec{w},b}(\vec{x})=\frac{1}{1+e^{-z}} \\\\ J(\vec{w},b)=-\frac{1}{m} \displaystyle\sum_{i=1}^m \left[ y^{(i)} \log(f_{\vec{w},b}(\vec{x}^{(i)})) + (1-y^{(i)}) \log(1-f_{\vec{w},b}(\vec{x}^{(i)})) \right] \\\\ \phantom{xxxxxxxx} + \frac{\lambda}{2m} \displaystyle\sum_{j=1}^n w_j^2
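
As a rough sketch, the regularized logistic cost can be computed like this in plain Python. The function names, the parameter `lam` (for λ), and the toy data are assumptions for illustration. Each row of X is assumed to already hold the expanded features (for example the polynomial terms in z), so the code only takes the weighted sum.

```python
import math


def sigmoid(z):
    """Logistic function 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(w, b, x):
    """f_{w,b}(x) = sigmoid(w . x + b) for a single example x."""
    return sigmoid(sum(w_j * x_j for w_j, x_j in zip(w, x)) + b)


def regularized_logistic_cost(X, y, w, b, lam):
    """Mean log loss plus the term (lambda / 2m) * sum_j w_j^2."""
    m = len(X)
    loss = 0.0
    for x_i, y_i in zip(X, y):
        f = predict_proba(w, b, x_i)
        loss += -(y_i * math.log(f) + (1 - y_i) * math.log(1 - f))
    penalty = lam / (2 * m) * sum(w_j ** 2 for w_j in w)  # b is not penalized
    return loss / m + penalty


# Hypothetical toy data: two examples, two features.
X = [[1.0, 2.0], [2.0, 1.0]]
y = [0, 1]
w = [1.0, -1.0]

plain = regularized_logistic_cost(X, y, w, 0.0, lam=0.0)
penalized = regularized_logistic_cost(X, y, w, 0.0, lam=2.0)
print(plain, penalized)
```

The two printed values differ by exactly the regularization term, here 2 / (2 * 2) * (1 + 1) = 1, since the log loss part does not depend on `lam`.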

The gradient descent algorithm is as follows.

\text{repeat \{} \\\\ \phantom{xxxx} w_j=w_j-\alpha \frac{\partial}{\partial w_j}J(\vec{w},b) \\\\ \phantom{xxxx} b=b-\alpha \frac{\partial}{\partial b}J(\vec{w},b) \\\\ \text{\}}

After expanding the derivatives, we get the following formula. It looks exactly the same as the update for regularized linear regression, but note that f_{\vec{w},b} here is the logistic (sigmoid) function rather than the linear function.

\text{repeat \{} \\\\ \phantom{xxxx} w_j=w_j-\alpha \left[ \frac{1}{m} \displaystyle\sum_{i=1}^m \left[ (f_{\vec{w},b}(\vec{x}^{(i)})-y^{(i)})x_j^{(i)} \right] + \frac{\lambda}{m} w_j \right] \\\\ \phantom{xxxx} b=b-\alpha \frac{1}{m} \displaystyle\sum_{i=1}^m (f_{\vec{w},b}(\vec{x}^{(i)})-y^{(i)}) \\\\ \text{\}}
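
The update above can be sketched in plain Python as follows. The only difference from the linear case is that the prediction goes through the sigmoid. The names `gradient_step` and `train`, the parameter `lam` (for λ), the learning rate, the step count, and the toy data are assumptions for illustration.

```python
import math


def predict_proba(w, b, x):
    """f_{w,b}(x) = sigmoid(w . x + b) for a single example x."""
    z = sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


def gradient_step(X, y, w, b, alpha, lam):
    """One simultaneous update of every w_j and of b."""
    m = len(X)
    errors = [predict_proba(w, b, x_i) - y_i for x_i, y_i in zip(X, y)]
    new_w = []
    for j, w_j in enumerate(w):
        grad = sum(e * x_i[j] for e, x_i in zip(errors, X)) / m + lam / m * w_j
        new_w.append(w_j - alpha * grad)
    new_b = b - alpha * sum(errors) / m  # b is not regularized
    return new_w, new_b


def train(X, y, alpha, lam, steps):
    """Run gradient descent from w = 0, b = 0 for a fixed number of steps."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(steps):
        w, b = gradient_step(X, y, w, b, alpha, lam)
    return w, b


# Hypothetical, perfectly separable toy data with one feature.
X = [[-2.0], [-1.0], [1.0], [2.0]]
y = [0, 0, 1, 1]

w_plain, b_plain = train(X, y, alpha=0.1, lam=0.0, steps=1000)
w_reg, b_reg = train(X, y, alpha=0.1, lam=1.0, steps=1000)
print(w_plain, w_reg)
```

Because this toy data is perfectly separable, the unregularized weight keeps growing the longer training runs, while with lam=1.0 it settles near 1. This is the same shrinking effect as in the linear case.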

Conclusion


Regularization reduces the size of the parameters to address overfitting. The larger a parameter is, the larger its penalty, and the more it is reduced at each gradient descent step.
