L1 and L2 Regularization

author: daodeiv (David Stankov) daodavid

Polynomial Regression, Bias and Variance

Polynomial Regression is a supervised regression algorithm in which the relationship between the feature vector $X$ and the target $Y$ is modeled as an $n^{th}$-degree polynomial function.
We will use a data set prepared in advance from a $3^{rd}$-degree polynomial:

$$y_i = f(x_i) = \theta_3 x_i^3 + \theta_2 x_i^2 + \theta_1 x_i + \theta_0 + \varepsilon_i $$
$$ \; \; \; = 0.2x_i^3 - 2x_i^2 + x_i + \varepsilon_i $$
The noise term $\varepsilon$ is Gaussian distributed with mean zero and some variance $\sigma^2$:

$$\varepsilon \sim N(0,\sigma^2) $$

Now we will use the already prepared data and try to find the best-fitting polynomial with a regression model.
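
Below is a minimal sketch of that setup, assuming we generate the data ourselves from the cubic above (the sample size, $x$-range, and noise level are illustrative assumptions, not the exact values of the original data set) and fit a plain straight line as a baseline:

```python
# Generate noisy samples from f(x) = 0.2x^3 - 2x^2 + x and fit a straight line.
# All constants below (sample size, range, noise scale) are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(42)
x = np.sort(rng.uniform(-3, 10, size=30))
y = 0.2 * x**3 - 2 * x**2 + x + rng.normal(0, 3, size=x.shape)  # f(x) + Gaussian noise

X = x.reshape(-1, 1)                       # sklearn expects a 2-D feature matrix
line = LinearRegression().fit(X, y)
print("R^2 of the straight-line fit:", line.score(X, y))   # low -> the line underfits
```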

As we can see above, using a straight line we've achieved roughly 16% accuracy, which is very poor. When the model is too simple to fit the data, we say the model is underfitting. Simple linear regression is not able to capture the points well; this is called high bias. A model with high bias pays very little attention to the training data and oversimplifies the problem, which always leads to high error on both the training and the test data. We can improve the model by increasing the degree of the polynomial, i.e. by adding new features that are powers of the original one.

$$X = \begin{bmatrix} x_1 \\ \vdots \\ x_n \end{bmatrix} \rightarrow X_{polyData} = T(X) = \begin{bmatrix} 1 & x_1 & x_1^2 & \dots & x_1^p \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_n & x_n^2 & \dots & x_n^p \end{bmatrix} $$

Thus we can extend the model from $Y = \omega_0 + \omega_1x$ to $Y = \omega_0 + \omega_1 x^1 + \dots + \omega_p x^p$, but the problem remains linear, because $Y$ is still a linear combination of the features $x^j$. That means we can again use LinearRegression(). We will transform the data into a $39^{th}$-degree polynomial.
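
A sketch of that transformation, assuming the `X`, `y` arrays from the snippet above:

```python
# Expand X into polynomial features 1, x, x^2, ..., x^39 and refit LinearRegression.
# Such a high degree is numerically ill-conditioned; it is used here only to
# illustrate overfitting.
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

poly = PolynomialFeatures(degree=39)
X_poly39 = poly.fit_transform(X)

model = LinearRegression().fit(X_poly39, y)
print("train R^2 of the 39th-degree fit:", model.score(X_poly39, y))  # close to 1
```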

We can see that the accuracy of the $39^{th}$-degree polynomial is 99.9%. This model cannot be right, because the data set contains noise. When a model fits the noise itself, we say it is overfitting, or that it has high variance. On new, unseen data such a model will perform badly.

There are quite a number of techniques that help us prevent overfitting; regularization is one of them. Regularization penalizes large weights, and in the L1 case it also performs feature selection, which helps avoid over-fitting.

Lasso Regression (L1 Regularization)

This is a regularization technique used for feature selection via a shrinkage method, also referred to as penalized regression. Lasso is short for Least Absolute Shrinkage and Selection Operator, and it is used both for regularization and for model selection. If a model uses the L1 penalty, it is called lasso regression. Lasso achieves regularization by completely diminishing the importance given to some features (driving their weights to zero).

The loss function of the Lasso is the ordinary least squares objective with a constraint:

$$J(\varTheta) = \frac{1}{2n}\sum_{i=1}^{n}\Big( y_i - \sum_{j=1}^{m} x_{ij}\theta_j \Big)^2 $$ $$\text{subject to} \hspace{1cm} \sum_{j=1}^{m}|\theta_j| \le t $$ Using a Lagrange multiplier (a smaller budget $t$ corresponds to a larger multiplier $\lambda$), the solution of the above problem is obtained by minimizing:

$$\text{Lasso loss} = J(\varTheta) = \frac{1}{2n}\sum_{i=1}^{n}\Big( y_i - \sum_{j=1}^{m} x_{ij}\theta_j \Big)^2 + \lambda \sum_{j=1}^{m}|\theta_j|$$

The term $\lambda \sum_{j=1}^{m}|\theta_j|$ represents the penalty.
The tuning parameter $\lambda$ controls the strength of the L1 penalty: it is essentially the amount of shrinkage.

Let's investigate how the lasso performs; we will use Lasso from sklearn.
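
A minimal sketch of that step, assuming the `X`, `y` arrays from earlier. Note that sklearn discourages `Lasso(alpha=0)` (it is then just ordinary least squares), so a tiny alpha is used instead; the scaling step is an assumption that keeps the penalty comparable across the polynomial powers:

```python
# 12th-degree polynomial features fed into sklearn's Lasso with a near-zero penalty.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Lasso

lasso_pipe = make_pipeline(
    PolynomialFeatures(degree=12, include_bias=False),
    StandardScaler(),                      # assumed: put all powers on the same scale
    Lasso(alpha=1e-6, max_iter=100_000),   # alpha ~ 0, i.e. (almost) no regularization
)
lasso_pipe.fit(X, y)
print("train R^2:", lasso_pipe.score(X, y))
```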

Above we've used a $12^{th}$-degree polynomial and a Lasso with alpha close to zero. We will now train the lasso with different tuning parameters $\lambda = [10^{-4},10^{-3},10^{-2},10^{-1},1]$ in order to see their impact on the model.
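
A sketch of that experiment: refit the Lasso for each value and record the learned weights, which can then be plotted against $\lambda$ (again assuming the `X`, `y` arrays from earlier):

```python
# Coefficient paths: one row of coef_paths per alpha, one column per polynomial feature.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Lasso

X_poly = PolynomialFeatures(degree=12, include_bias=False).fit_transform(X)
X_scaled = StandardScaler().fit_transform(X_poly)

alphas = [1e-4, 1e-3, 1e-2, 1e-1, 1]
coef_paths = np.array([
    Lasso(alpha=a, max_iter=100_000).fit(X_scaled, y).coef_ for a in alphas
])
print(coef_paths.shape)   # (n_alphas, n_features); plot each column versus alpha
```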

From the graphic above we can see that as $\lambda$ increases, the weights (predictors) $w_i$ move toward zero. Another important point is that the rate at which a weight decreases is related to its size, roughly $\frac{\partial w_i}{\partial \lambda} \propto w_i$.

Let's see what happens with larger values, $\lambda = [1, 10, 20, 30, 100]$.

From the graph above we see that with a large enough $\lambda$ the weights become exactly zero. Because of this, we can use the Lasso for feature selection.

The lasso performs shrinkage so that there are "corners" in the constraint region, which in two dimensions corresponds to a diamond. If the sum of squares "hits" one of these corners, then the coefficient corresponding to the axis is shrunk to zero.


Lasso for feature selection

We've seen that the regularization process is controlled by the alpha parameter of the Lasso model: with higher alpha, some weights become zero. Once we've found an alpha for which the model works well, we can drop the features whose weights are zero. In this way we perform feature selection.

Let's try to find the best alpha
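
One way to sketch that search is with a simple train/test split (the split size and the alpha grid are assumptions):

```python
# Score a 12th-degree Lasso pipeline on train and test data for a grid of alphas.
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Lasso

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

for alpha in [1e-4, 1e-3, 1e-2, 1e-1, 1, 10]:
    pipe = make_pipeline(
        PolynomialFeatures(degree=12, include_bias=False),
        StandardScaler(),
        Lasso(alpha=alpha, max_iter=100_000),
    )
    pipe.fit(X_train, y_train)
    print(f"alpha={alpha:g}  train R^2={pipe.score(X_train, y_train):.3f}"
          f"  test R^2={pipe.score(X_test, y_test):.3f}")
```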

For $\alpha$ higher than $0.1$, both train and test accuracy start to decrease. Therefore, from the graph we can pick $0.1$ as the best $\alpha$, for which the model achieves a score of about $0.94$ on the test data.

For the model's predictors (weights) we have:
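
A sketch of inspecting the weights at the chosen alpha (the value $0.1$ follows the discussion above; `X_scaled` and `y` are from the earlier snippets):

```python
# Fit the Lasso at alpha=0.1 and list which weights survived.
import numpy as np
from sklearn.linear_model import Lasso

best_lasso = Lasso(alpha=0.1, max_iter=100_000).fit(X_scaled, y)
print("weights:", np.round(best_lasso.coef_, 4))
print("indices of non-zero weights:", np.flatnonzero(best_lasso.coef_))
```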

We can perform feature selection by keeping only the columns (features) with non-zero weights and creating a new data set.

Let's train a new LinearRegression model using only the features selected by the Lasso.
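
A sketch of that refit, assuming `X_scaled`, `y`, and `best_lasso` from the previous snippets:

```python
# Keep only the columns whose Lasso weight is non-zero, then refit plain OLS on them.
import numpy as np
from sklearn.linear_model import LinearRegression

selected = np.flatnonzero(best_lasso.coef_)        # indices of surviving features
X_selected = X_scaled[:, selected]                 # reduced design matrix
print("shape before/after selection:", X_scaled.shape, X_selected.shape)

lin = LinearRegression().fit(X_selected, y)
print("R^2 on the reduced feature set:", lin.score(X_selected, y))
```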

We've achieved the same result, although the dimensionality of the data is reduced from the original (30, 13) design matrix to one containing only the columns with non-zero weights.

Ridge regression (L2 regularization)

Ridge regression, like the Lasso, is a form of regularized regression. Both methods seek to alleviate the consequences of multicollinearity, poorly conditioned equations, and overfitting.

Ridge regression, like the Lasso, is motivated by a constrained minimization problem: its loss function is the ordinary least squares objective, this time with a constraint on the squared weights.

$$J(\varTheta) = \frac{1}{2n}\sum_{i=1}^{n}\Big( y_i - \sum_{j=1}^{m} x_{ij}\theta_j \Big)^2 $$ $$\text{subject to} \hspace{1cm} \sum_{j=1}^{m}\theta_j^2 \le t $$

Using a Lagrange multiplier we can rewrite the problem as: $$\text{Ridge loss} = J(\varTheta) = \frac{1}{2n}\sum_{i=1}^{n}\Big( y_i - \sum_{j=1}^{m} x_{ij}\theta_j \Big)^2 + \lambda \sum_{j=1}^{m}\theta_j^2$$

Ridge regression decreases the complexity of a model but does not reduce the number of variables, since it never drives a coefficient to exactly zero; it only shrinks it. Hence, this model is not suitable for feature reduction.

Considering the geometry of both the lasso (left) and ridge (right) models, the elliptical contours (red circles) are the cost functions for each. Relaxing the constraints introduced by the penalty factor leads to an increase in the constrained region (diamond, circle). Doing this continually, we will hit the center of the ellipse, where the results of both lasso and ridge models are similar to a linear regression model.

However, both methods determine coefficients by finding the first point where the elliptical contours hit the region of constraints. Since lasso regression takes a diamond shape in the plot for the constrained region, each time the elliptical regions intersect with these corners, at least one of the coefficients becomes zero. This is impossible in the ridge regression model as it forms a circular shape and therefore values can be shrunk close to zero, but never equal to zero.
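
A small sketch contrasting the two penalties on the same design matrix (assuming `X_scaled` and `y` from earlier): the Lasso drives some weights exactly to zero, while Ridge only shrinks them.

```python
# Count exactly-zero coefficients under L1 (Lasso) and L2 (Ridge) penalties.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

lasso = Lasso(alpha=0.1, max_iter=100_000).fit(X_scaled, y)
ridge = Ridge(alpha=0.1).fit(X_scaled, y)

print("lasso zero weights:", np.sum(lasso.coef_ == 0))
print("ridge zero weights:", np.sum(ridge.coef_ == 0))   # typically 0: shrunk, not removed
```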

K-fold cross validation

Cross-validation is a resampling method that uses different portions of the data to test and train a model on different iterations. The goal of cross-validation is to test the model's ability to predict new data that was not used in estimating it, in order to flag problems like overfitting.

In k-fold cross-validation, the original sample is randomly partitioned into k equal-sized subsamples. Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k − 1 subsamples are used as training data. The cross-validation process is then repeated k times, with each of the k subsamples used exactly once as the validation data. The k results can then be averaged to produce a single estimate. The advantage of this method over repeated random sub-sampling is that all observations are used for both training and validation, and each observation is used for validation exactly once. 10-fold cross-validation is commonly used, but in general k remains an unfixed parameter.

Let's implement k-fold cross-validation.
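
A sketch using sklearn's `KFold` and `cross_val_score` over the same alpha grid (the fold count and the grid are assumptions); the mean score per alpha is what the graph refers to:

```python
# 5-fold cross-validation of the 12th-degree Lasso pipeline for several alphas.
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Lasso

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
for alpha in [1e-3, 1e-2, 1e-1, 1, 10]:
    pipe = make_pipeline(
        PolynomialFeatures(degree=12, include_bias=False),
        StandardScaler(),
        Lasso(alpha=alpha, max_iter=100_000),
    )
    scores = cross_val_score(pipe, X, y, cv=kfold, scoring="r2")
    print(f"alpha={alpha:g}  mean CV R^2={scores.mean():.3f} +/- {scores.std():.3f}")
```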

From the graph we can see that at $\alpha = 0.1$ the model achieves the best accuracy.
