author: daodeiv (David Stankov) daodavid

Logistic Regression

Logistic regression estimates the probability of an event occurring, such as voted or didn't vote, based on a given dataset of independent variables. Since the outcome is a probability, the dependent variable is bounded between 0 and 1. The best way to think about logistic regression is that it is linear regression, but for classification problems. Logistic regression essentially uses the logistic function defined below to model a binary output variable (Tolles & Meurer, 2016). The primary difference between linear regression and logistic regression is that the range of logistic regression is bounded between 0 and 1. In addition, unlike linear regression, logistic regression does not require a linear relationship between the inputs and the output variable.

Log-odds or Logit function

The odds of an event $A$ is the probability that $A$ occurs divided by the probability that it does not occur:

$$1) \; \; \; odds(p) = \frac{p(A)}{1-p(A)} $$ where $p(A)\in [0,1)$. Let's see the graph of the function.
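As a small sketch (assuming NumPy and matplotlib are available, and using a grid of probabilities that excludes the endpoints to avoid division by zero), the odds function can be plotted like this:

```python
import numpy as np
import matplotlib.pyplot as plt

# grid of probabilities, excluding the endpoints to avoid division by zero
p = np.linspace(0.01, 0.99, 200)
odds = p / (1 - p)

plt.plot(p, odds)
plt.xlabel("p", size=20)
plt.ylabel("odds", size=20)
plt.show()
```

Note how the odds blow up as $p(A)$ approaches 1.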

We can transform the odds function into another one that is more convenient for interpretation, without losing the underlying information. It is called the log-odds or logit function:


$$2)\; \; \; logit(P) = \log(odds)=\log{ \frac{P(A)}{1- P(A)} } $$

```python
# reuse p and odds defined above
plt.plot(p, np.log(odds))   # log-odds (logit) as a function of p
plt.xlabel("p", size=20)
plt.ylabel("log-odds", size=20)
plt.show()
```

The properties of the log-odds(p) function that we should point out:

- it is defined for $p \in (0, 1)$ and its range is the whole real line $(-\infty, +\infty)$;
- it is monotonically increasing in $p$;
- $logit(0.5) = 0$: even odds correspond to a log-odds of zero;
- it is antisymmetric around $p = 0.5$: $logit(1-p) = -logit(p)$.

The mathematical origin of the sigmoid function

Logistic regression is based on the assumption that the log-odds is a linear function of the feature values of the dataset.

$$ 3) \; \; log(\frac{p^i}{1-p^i})=h(x^i)$$


where
$$h(x^i)= \vartheta_0 + \vartheta_1 x^i_{1} + \vartheta_2 x^i_{2}+ \dots + \vartheta_p x^i_{p} $$

where $i$ is the index of the observation, $x^i_m$ are the feature values, $\vartheta_0$ is the intercept and $\vartheta_{m}$ is the weight (slope coefficient) of each explanatory variable.
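As a small illustration (the names X, theta0 and theta below are our own, not from the original text), $h$ is just an affine function of the features and can be evaluated for all observations at once:

```python
import numpy as np

# toy design matrix: 4 observations, 2 features
X = np.array([[1.0, 2.0],
              [2.0, 0.5],
              [0.3, 1.7],
              [1.5, 1.5]])

theta0 = -1.0                  # intercept
theta = np.array([0.8, -0.4])  # one weight per feature

# h(x^i) = theta0 + theta1*x1^i + theta2*x2^i for every row i
h = theta0 + X @ theta
print(h)
```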

Let's do some math

If we exponentiate both sides of eq. 3) with base $e$:

$$e^{log(\frac{p^i}{1-p^i})} = e^{h(x^i)} \Leftrightarrow $$

$$ \frac{p^i}{1-p^i} = e^{h(x^i)} \Leftrightarrow $$

$$ p^i = e^{h(x^i)}-p^i e^{h(x^i)} \Leftrightarrow $$

$$ p^i(1+e^{h(x^i)} ) = e^{h(x^i)} \Leftrightarrow $$

$$p^i= \frac {e^{h(x^i)}} {1 +e^{h(x^i)}} \Leftrightarrow $$

$$ p^i= \frac {e^{h(x^i)}e^{-h(x^i)}} {(1 +e^{h(x^i)})e^{-h(x^i)}} \Leftrightarrow $$



$$p^i= \frac{1}{1 +e^{-h(x^i)} } $$

Usually $p^i$ is written as $\sigma^i$ or $\sigma_i$, and this function is called the sigmoid:

$$\sigma^i= \frac{1}{1 +e^{-h(x^i)} } $$
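A minimal NumPy sketch of the sigmoid (the function name sigmoid is our own choice):

```python
import numpy as np

def sigmoid(h):
    """Map the linear score h(x) to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-h))

print(sigmoid(0.0))   # 0.5
print(sigmoid(4.0))   # close to 1
print(sigmoid(-4.0))  # close to 0
```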

Properties and Identities Of Sigmoid Function

The sigmoid function has very interesting properties. Let's check out, for example: $$\sigma =\frac{1}{1 +e^{-(2x + 4)} }$$

The graph of the sigmoid function is an S-shaped curve. Its main properties are:

- its range is the open interval $(0, 1)$, so its output can be interpreted as a probability;
- $\sigma(0) = 0.5$;
- it is monotonically increasing;
- $\sigma(-x) = 1 - \sigma(x)$;
- its derivative can be expressed through the function itself, $\sigma'(x) = \sigma(x)(1 - \sigma(x))$, an identity we will use below.
Using the sigmoid function we can model a binary output: if $\sigma(x) < 0.5$ the predicted class is $0$ ('False'), otherwise it is $1$ ('True').
Let's see the sigmoid with different weights $\vartheta_{0}$ and $\vartheta_{1}$.
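Below is a small matplotlib sketch (the chosen weight values are arbitrary, our own) showing how the intercept $\vartheta_0$ shifts the S-curve along the x-axis and the slope $\vartheta_1$ controls its steepness:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-10, 10, 400)

# a few (intercept, slope) pairs to compare
for theta0, theta1 in [(0, 1), (0, 3), (2, 1), (-2, 0.5)]:
    sigma = 1.0 / (1.0 + np.exp(-(theta0 + theta1 * x)))
    plt.plot(x, sigma, label=f"$\\vartheta_0$={theta0}, $\\vartheta_1$={theta1}")

plt.xlabel("x", size=20)
plt.ylabel("$\\sigma(x)$", size=20)
plt.legend()
plt.show()
```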

The purpose of logistic regression is to find the curve that fits the given dataset as well as possible.

Maximum Likelihood of Logistic regression, Cross-entropy loss

The logistic regression model is literally a model for the $p$ parameter of a binomial distribution.

$$P(y_i|X,\theta) = \frac{n!}{y_i!(n-y_i)!} \sigma_i^{y_i}(1 - \sigma_i )^{n-y_i} $$

when $n=1$

$$P(y_i|x_i,\theta) =\sigma_i^{y_i}(1 - \sigma_i )^{1-y_i}$$

so that the probability reduces to $\sigma_i$ when $y_i = 1$ and to $1-\sigma_i$ when $y_i = 0$.

Maximum Likelihood Estimation, or MLE for short, is a probabilistic framework for estimating the parameters of a model. In MLE we wish to maximize the conditional probability of the observed data given a specific probability distribution and its parameters $(\theta)$, stated formally as $P(y|X,\theta)$. Because the $y_i$ are independent, we can write it as a joint probability:

$$ \arg\max_{\Theta} P(y|X,\theta) = \arg\max_{\Theta} P(y_1|x_1,\theta)P(y_2|x_2,\theta)\cdots P(y_m|x_m,\theta)= \arg\max_{\Theta} \prod_{i=1}^{m}\sigma_i^{y_i}(1 - \sigma_i )^{1-y_i}$$

where $\sigma_i$ is the sigmoid, i.e. the probability of class $1$ that our model predicts for observation $i$.

As with linear regression, we can maximize the log-likelihood instead, because the logarithm is a monotonic function: $$\log \big(\prod_{i=1}^{m}\sigma_i^{y_i}(1 - \sigma_i )^{1-y_i} \big) = \sum_{i=1}^m \big (y^i\log\sigma^i + (1-y^i)\log{(1 -\sigma^i)}\big)$$

If we take the log-likelihood above with a minus sign, then instead of maximizing we can minimize that function:

$$\Theta = \arg\max_{\Theta}\log \big(\prod_{i=1}^{m}\sigma_i^{y_i}(1 - \sigma_i )^{1-y_i} \big) = \arg\min_{\Theta}\Big(-\log\big(\prod_{i=1}^{m}\sigma_i^{y_i}(1 - \sigma_i )^{1-y_i} \big)\Big) $$

Our purpose is to minimize $$4) \; \; \; \mathcal{L} = -\frac{1}{m}\sum_{i=1}^m \big (y^i\log\sigma^i + (1-y^i)\log{(1 -\sigma^i)}\big)$$ where the $\frac{1}{m}$ factor averages the loss over the observations and does not change the location of the minimum.

This equation is called the $cross\text{-}entropy \; loss$. The next step is to optimize it in order to find its minimum.
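A minimal NumPy sketch of eq. 4) (the function name cross_entropy and the eps clipping are our own additions, the latter to avoid $\log(0)$):

```python
import numpy as np

def cross_entropy(y, sigma, eps=1e-12):
    """Average cross-entropy loss between labels y and predicted probabilities sigma."""
    sigma = np.clip(sigma, eps, 1 - eps)
    return -np.mean(y * np.log(sigma) + (1 - y) * np.log(1 - sigma))

y = np.array([1, 0, 1, 1])
sigma = np.array([0.9, 0.2, 0.7, 0.6])
print(cross_entropy(y, sigma))
```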

Mathematical derivation of the cross-entropy loss gradient. Gradient Descent

Gradient descent updates each weight in the direction opposite to the gradient of the loss: $$5) \; \; \theta_m = \theta_m - \eta\,\nabla_{\theta_m} \mathcal{L}(\Theta)$$ where $\eta$ is the learning rate (step size).

Let's find the partial derivative $\frac{\partial \mathcal{L} (\Theta)}{\partial \theta_k}$:

$\; \;\frac{\partial \mathcal{L} (\Theta)}{\partial \theta_k} = -\frac{1}{m}\sum_i^m\big(y^i\frac{\partial \log(\sigma^i)}{\partial \theta_k}+ (1-y^i)\frac{\partial \log(1-\sigma^i)}{\partial \theta_k}\big) $

$= -\frac{1}{m}\sum_i^m\big(y^i \frac{d\log(\sigma^i)}{d\sigma^i}\frac{\partial \sigma^i}{\partial \theta_k}+ (1-y^i) \frac{d\log(1- \sigma^i)}{d\sigma^i}\frac{\partial \sigma^i}{\partial \theta_k}\big) $

$= -\frac{1}{m}\sum_i^m\big(y^i \frac{1}{\sigma^i}\frac{\partial \sigma^i}{\partial \theta_k}+ (1-y^i) \frac{1 }{1- \sigma^i}\frac{-\partial \sigma^i}{\partial \theta_k}\big) $

Let's calculate $\frac{\partial \sigma^i}{\partial \theta_k} = \frac{d}{dh(x^i)}\Big(\frac{1}{1 +e^{-h(x^i)}}\Big)\frac{\partial h(x^i)}{\partial \theta_k} =\frac{e^{-h(x^i)}}{(1+e^{-h(x^i)})^2}\frac{\partial h(x^i)}{\partial \theta_k} =\frac{1}{1+e^{- h(x^i)}}\Big(1 -\frac{1}{1+e^{-h(x^i)}}\Big)\frac{\partial h(x^i)}{\partial \theta_k} = \sigma^i (1 - \sigma^i)\frac{\partial h(x^i)}{\partial \theta_k} $, and applying it above we obtain

$\; \; \frac{\partial \mathcal{L} (\Theta)}{\partial \theta_k} = -\frac{1}{m}\sum_i^m\big(y^i \frac{1}{\sigma^i}\sigma^i (1 - \sigma^i) \frac{\partial h(x^i)}{\partial \theta_k} + (1-y^i) \frac{1 }{1- \sigma^i}\big(-\sigma^i(1 - \sigma^i)\big)\frac{\partial h(x^i)}{\partial \theta_k}\big) $

$= -\frac{1}{m}\sum_i^m\big(y^i\frac{\partial h(x^i)}{\partial \theta_k} -y^i \sigma^i\frac{\partial h(x^i)}{\partial \theta_k} +y^i\sigma^i\frac{\partial h(x^i)}{\partial \theta_k}- \sigma^i\frac{\partial h(x^i)}{\partial \theta_k}\big)$

$= -\frac{1}{m}\sum_i^m\big(y^i\frac{\partial h(x^i)}{\partial \theta_k}- \sigma^i\frac{\partial h(x^i)}{\partial \theta_k}\big)= -\frac{1}{m}\sum_i^m\big(y^i - \sigma^i\big) \frac{\partial h(x^i)}{\partial \theta_k}$

Since $h(x^i) = \theta_m x^i_m$ (with summation over the repeated index $m$ implied), we have $\frac{\partial h(x^i)}{\partial \theta_k} = \frac{\partial (\theta_m x^i_m)}{\partial \theta_k} = \delta_{mk}x^i_m = x^i_k$. Applying this result above (and renaming the index $k$ back to $m$) we obtain

$$\frac{\partial \mathcal{L}(\Theta)}{\partial \theta_m} = - \frac{1}{m}\sum_i^m\big(y^i - \sigma^i\big)x^i_m$$

Collecting the partial derivatives for all weights, we can write the result above in matrix form, suitable for NumPy computation:

$$ \nabla_{\Theta} \mathcal{L} = -\frac{1}{m} X^T(Y - \sigma) = \frac{1}{m} X^T(\sigma - Y)$$
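A NumPy sketch of this formula (the function name gradient is our own; X is assumed to already contain a column of ones for the intercept):

```python
import numpy as np

def gradient(theta, X, y):
    """Gradient of the average cross-entropy loss: (1/m) * X^T (sigma - y)."""
    m = X.shape[0]
    sigma = 1.0 / (1.0 + np.exp(-(X @ theta)))
    return X.T @ (sigma - y) / m
```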

Implementation of BinaryLogisticRegression using numpy
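Below is a minimal sketch of such an implementation, trained with batch gradient descent according to eqs. 4) and 5) (the class interface, default learning rate and iteration count are our own assumptions, not necessarily the original code):

```python
import numpy as np

class BinaryLogisticRegression:
    """Binary logistic regression trained with batch gradient descent."""

    def __init__(self, learning_rate=0.1, n_iterations=5000):
        self.learning_rate = learning_rate
        self.n_iterations = n_iterations
        self.theta = None

    @staticmethod
    def _sigmoid(h):
        return 1.0 / (1.0 + np.exp(-h))

    @staticmethod
    def _add_intercept(X):
        # prepend a column of ones so that theta[0] plays the role of the intercept
        return np.hstack([np.ones((X.shape[0], 1)), X])

    def fit(self, X, y):
        X = self._add_intercept(np.asarray(X, dtype=float))
        y = np.asarray(y, dtype=float)
        m, n = X.shape
        self.theta = np.zeros(n)
        for _ in range(self.n_iterations):
            sigma = self._sigmoid(X @ self.theta)
            grad = X.T @ (sigma - y) / m          # (1/m) X^T (sigma - y)
            self.theta -= self.learning_rate * grad
        return self

    def predict_proba(self, X):
        X = self._add_intercept(np.asarray(X, dtype=float))
        return self._sigmoid(X @ self.theta)

    def predict(self, X):
        # class 1 if the predicted probability is at least 0.5
        return (self.predict_proba(X) >= 0.5).astype(int)
```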

We will test our implementation on the banknote authentication dataset.
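A usage sketch, assuming the UCI banknote authentication data has been downloaded locally as data_banknote_authentication.txt (four numeric features and the class label in the last column):

```python
import numpy as np

# columns: variance, skewness, curtosis, entropy, class
data = np.loadtxt("data_banknote_authentication.txt", delimiter=",")
X, y = data[:, :-1], data[:, -1]

# simple hold-out split
rng = np.random.default_rng(0)
idx = rng.permutation(len(y))
split = int(0.8 * len(y))
train, test = idx[:split], idx[split:]

model = BinaryLogisticRegression(learning_rate=0.1, n_iterations=5000)
model.fit(X[train], y[train])

accuracy = (model.predict(X[test]) == y[test]).mean()
print(f"test accuracy: {accuracy:.3f}")
```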

References

[1] What is logistic regression?
[2] Maximum Likelihood and Logistic Regression
[3] Scott A. Czepiel, Maximum Likelihood Estimation of Logistic Regression Models: Theory and Implementation