import warnings
import datetime
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
from IPython.display import display, Markdown, Math
sns.set()
warnings.filterwarnings('ignore')
def printmd(string): display(Markdown(string))
def latex(out): printmd(f'{out}')
def pr(string): printmd('***{}***'.format(string))
The softmax function is one of the most popular building blocks in machine learning. It turns arbitrary real values into probabilities by means of the exponential function and can be seen as a generalization of the sigmoid function to more than two classes. We can use softmax for multi-class classification, and it also appears in many other fields of science, such as statistical physics (Gibbs distributions), quantum statistics, information theory, and neural networks. Softmax is attractive in classification problems because it is simple to implement and in many cases gives satisfying results and good enough performance.
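As a quick numerical illustration (a small sketch with arbitrary example scores, added here only for intuition), exponentiating and normalizing already produces a valid probability distribution:
scores = np.array([2.0, 1.0, 0.1])             # arbitrary real-valued scores
probs = np.exp(scores) / np.exp(scores).sum()  # exponentiate, then normalize
print(probs, probs.sum())                      # roughly [0.659 0.242 0.099], summing to 1.0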
We are going to use the Iris dataset because it is comparatively simple and very convenient for studying machine learning.
iris = pd.read_csv("../../../resources/data/IRIS.csv")
iris.head()
 | sepal_length | sepal_width | petal_length | petal_width | species
---|---|---|---|---|---
0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
pr('label values : ' + str(iris['species'].unique()))
label values : ['Iris-setosa' 'Iris-versicolor' 'Iris-virginica']
sns.FacetGrid(iris,hue='species',height=6).map(plt.scatter,'sepal_length','sepal_width').add_legend()
(Figure: scatter plot of sepal_length vs. sepal_width, colored by species.)
We are going to express the label values as one-hot encoded variables, also known as dummy variables.
x_train = iris.drop('species', axis=1)
y_train = pd.get_dummies(iris['species'])
pr('One-hot encoding representation')
y_train.head()
One-hot encoding representation
 | Iris-setosa | Iris-versicolor | Iris-virginica
---|---|---|---
0 | 1 | 0 | 0 |
1 | 1 | 0 | 0 |
2 | 1 | 0 | 0 |
3 | 1 | 0 | 0 |
4 | 1 | 0 | 0 |
x_train, y_train = np.array(x_train), np.array(y_train)
pr('shape X :'+ str(x_train.shape))
pr('shape y :'+ str(y_train.shape))
shape X :(150, 4)
shape y :(150, 3)
X_train, X_test, y_train, y_test = train_test_split(x_train, y_train, test_size=0.33, random_state=42) # split into train and test samples
The softmax function $\sigma: \; \Re^k \; \rightarrow \; \Re^k $ can be defined by the formula:
$\; \; z_{ij} = \sum_p x_{ip} w_{jp} + b_j $
$\; \; \vec{z_i} = z^i= [z_{i1}, z_{i2}, ..., z_{ik} ] $
$$\sigma_{softmax}(W,b,X^{i})_{ij}=\frac{e^{z_{ij}}}{\sum_{j'}^k e^{z_{ij'}}}=\frac{e^{ \sum_p x_{ip} w_{jp} + b_j}}{\sum_{j'}^k e^{ \sum_p x_{ip} w_{j'p} + b_{j'}}} $$
The softmax function takes as input a vector $z^i$ with $K$ components $z_{i1}, z_{i2}, ...,z_{iK}$ and normalizes it into a probability distribution $p^{i}$ consisting of $K$ probabilities $p_{i1},...,p_{iK}$ proportional to the exponentials of the input values $z^{i}$. Before applying the softmax function, some components of $z^{i}$ may be negative or greater than 1 and need not sum to 1; after applying it, larger input components correspond to larger probabilities and $\sum_j p_{ij}=1$. The $w_{jp}$ are the weights, or estimators, $w_{jp}\in W^{K\times N}$, where $K$ corresponds to the number of class labels and $N$ to the number of attributes (features) of the training data, and $ b = [b_1, ..., b_K]$ is the bias (intercept) term, with one component per class label.
The fact that the softmax function outputs a probability distribution makes it suitable for a probabilistic interpretation in classification tasks.
According to our dataset, we can write the following expressions.
$W= \begin{bmatrix}
weight^1\rightarrow class \; 1(Iris-setosa)
\\ weight^2\rightarrow class\; 2(Iris-versicolor) \;
\\ weight^{3}\rightarrow class \;3(Iris-virginica) \;
\end{bmatrix} =
\begin{bmatrix}
\vec W^1 \\ \vec W^2\ \\ \vec W^3 \end{bmatrix} =
\begin{bmatrix}
w_{11} & w_{12} & w_{13} & w_{14}
\\ w_{21} & w_{22} & w_{23} & w_{24}
\\ w_{31} & w_{32} & w_{33} & w_{34}
\end{bmatrix} $
$ B= \begin{bmatrix} b_1 \\ b_2\ \\ b_3 \end{bmatrix}\;\;\;$
The vector $\vec W^i=[w_{i1},..,w_{in}]$ is the estimator vector for target class (label) $i$, where $n$ corresponds to the number of features (predictors) of $X$.
In matrix form, $Z$ is expressed as $Z = XW^T + B$:
$Z =
\begin{bmatrix} x_{11} & x_{12} & x_{13} & x_{14} \\ x_{21} & x_{22} & x_{23} & x_{24}\\ ... & ... & ... & ...\\ x_{m1} & x_{m2} & x_{m3} & x_{m4} \end{bmatrix} \times \begin{bmatrix}
w_{11} & w_{21} & w_{31}
\\ w_{12} & w_{22} & w_{32}
\\ w_{13} & w_{23} & w_{33}
\\ w_{14} & w_{24} & w_{34}
\end{bmatrix} + \begin{bmatrix} b_1 & b_2 & b_3 \end{bmatrix} = \begin{bmatrix} z_{11} & z_{12} & z_{13} \\ z_{21} & z_{22} & z_{23} \\ ... & ... & ... \\ z_{m1} & z_{m2} & z_{m3} \end{bmatrix} $
where the bias row is added (broadcast) to every row of $XW^T$.
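As a quick shape check (an illustrative sketch with random numbers; the `_demo` names are my own and are not used elsewhere), the same product can be formed with NumPy broadcasting:
X_demo = np.random.rand(150, 4)          # m x n feature matrix, like the iris features
W_demo = np.random.rand(3, 4)            # k x n weight matrix, one row per class
b_demo = np.random.rand(3)               # bias, one entry per class
Z_demo = X_demo.dot(W_demo.T) + b_demo   # broadcasting adds the bias to every row
print(Z_demo.shape)                      # (150, 3): one score per record and per class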
The softmax function computes the probability that a training example $X^{(i)}$ belongs to class $y^{j}$ given
the weight matrix $W$ and the bias (intercept) vector $\vec b$.
So we compute the probability for every $P_{ij}$: given the target components $Y = [y^1\;(Iris\text{-}setosa),\; y^2\;(Iris\text{-}versicolor),\; y^3\;(Iris\text{-}virginica)]$ we can write:
$$ P = \begin{bmatrix} p(y^{1} |z^{1})_{11} & p(y^{2} |z^{1} )_{12} & p(y^{3} |z^{1} )_{13} \\ ... & ... & ... \\ p(y^{1} |z^{m} )_{m1} & p(y^{2} |z^{m} )_{m2} & p(y^{3} |z^{m} )_{m3} \end{bmatrix} = \begin{bmatrix} \frac{e^{z_{11}}}{\sum_{j}^3 e^{z_{1j}}} & \frac{e^{z_{12}}}{\sum_{j}^3 e^{z_{1j}}} & \frac{e^{z_{13}}}{\sum_{j}^3 e^{z_{1j}}} \\ \\ ... & ... & ... \\ \\ \frac{e^{z_{m1}}}{\sum_{j}^3 e^{z_{mj}}} & \frac{e^{z_{m2}}}{\sum_{j}^3 e^{z_{mj}}} & \frac{e^{z_{m3}}}{\sum_{j}^3 e^{z_{mj}}}\end{bmatrix} $$
For example, the probability that record $x^1$ belongs to target label $y^2$ (Iris-versicolor) is calculated as
$$p_{12} = \frac{e^{z_{12}}}{e^{z_{11}} + e^{z_{12}} + e^{z_{13}}}$$
In every case $p_{ij} \in [0,1]$ and $\sum_j p_{ij}= 1$.
Let's see how the softmax function can be applied concretely to our training dataset.
First, let us define a weight matrix $W$ and a bias vector $\vec b$.
$W =\begin{bmatrix}
w_{11} & w_{12} & w_{13} & w_{14}
\\ w_{21} & w_{22} & w_{23} & w_{24}
\\ w_{31} & w_{32} & w_{33} & w_{34}
\end{bmatrix}
= \begin{bmatrix} 1.38618464 & 1.9151765 & -0.28863154 & 0.40849489
\\1.31642223 & 0.76753677 & 1.1482473 & 0.74274245
\\0.29739313 & 0.31728673 & 2.14038423 & 1.84876265\end{bmatrix}$
$B = \begin{bmatrix} 1.18749764 \\ 1.16215506 \\0.6503473 \end{bmatrix} $
I've prepared the weight matrix $W$ and bias $B$ in advance. How? We will see later.
W = np.array([[ 1.38618464, 1.9151765 , -0.28863154, 0.40849489],
[ 1.31642223, 0.76753677, 1.1482473 , 0.74274245],
[ 0.29739313, 0.31728673, 2.14038423, 1.84876265]]) # define a weight matrix
b = np.array([1.18749764, 1.16215506, 0.6503473 ]) #bias vector (intercept)
The implementation of the softmax function:
def softmax(X, weight, b):
    '''
    perform the softmax function
    Parameters :
    X : ndarray
        train data
    weight : ndarray
        weight matrix
    b : ndarray
        bias vector
    Returns
        ndarray
    '''
    #dot product of the X data matrix with the transposed weight matrix, plus the bias, gives the matrix of z_ij
    Z = X.dot(weight.T) + b
    #matrix of exponentials of the net input Z
    exp_z = np.exp(Z)
    #array containing the sum of every row (sum_k e^z_{ik})
    sums = np.sum(exp_z, axis=1)
    #return softmax(Z)_{ij}
    return (exp_z.T/sums).T
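A common refinement (a sketch under my own naming; softmax_stable is not used in the rest of the notebook) is to shift each row of $Z$ by its maximum before exponentiating. The constant cancels in the ratio, so the result is unchanged, but np.exp can no longer overflow for large net inputs:
def softmax_stable(X, weight, b):
    '''same result as softmax(), but numerically safer for large Z'''
    Z = X.dot(weight.T) + b
    Z = Z - Z.max(axis=1, keepdims=True)             # constant shift per row cancels in the ratio
    exp_z = np.exp(Z)
    return exp_z / exp_z.sum(axis=1, keepdims=True)
A quick sanity check is that every row of the output is a probability distribution, e.g. np.allclose(softmax_stable(X_train, W, b).sum(axis=1), 1.0) should be True.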
def accuracy(Y, P):
    '''
    evaluate accuracy for one-hot (dummy) encoded labels
    Parameters :
    Y : ndarray
        actual (true) values
    P : ndarray
        predicted values (probabilities)
    Return
        float
    '''
    C = np.argmax(Y, axis=1) == np.argmax(P, axis=1)
    D = np.where(C == True)
    return len(D[0])/len(C)
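A tiny toy check (illustrative values only, not iris data) shows what accuracy() measures: the first toy row is predicted correctly, the second is not, so the result is 0.5:
Y_toy = np.array([[1, 0, 0], [0, 1, 0]])              # true one-hot labels
P_toy = np.array([[0.8, 0.1, 0.1], [0.2, 0.3, 0.5]])  # predicted probabilities
print(accuracy(Y_toy, P_toy))                         # 0.5: one of two rows has the right argmax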
pr( "original data label " +r'$ Y_{{M\times N}} :$'.format(X_test.shape))
actual_label = pd.DataFrame(y_train,columns = ['Iris-setosa', 'Iris-versicolor','Iris-virginica'])
actual_label.head()
original data label $ Y_{M\times N} :$
 | Iris-setosa | Iris-versicolor | Iris-virginica
---|---|---|---
0 | 0 | 1 | 0 |
1 | 0 | 0 | 1 |
2 | 0 | 1 | 0 |
3 | 1 | 0 | 0 |
4 | 0 | 0 | 1 |
pr("predicted data label " +r'$ P_{{M\times N}} :$'.format(X_test.shape))
predict = softmax(X_train,W,b)
predict = pd.DataFrame(predict,columns = ['Iris-setosa', 'Iris-versicolor','Iris-virginica'])
predict.head()
predicted data label $ P_{M\times N} :$
 | Iris-setosa | Iris-versicolor | Iris-virginica
---|---|---|---
0 | 0.050053 | 0.777979 | 0.171968 |
1 | 0.001113 | 0.555302 | 0.443584 |
2 | 0.030317 | 0.696057 | 0.273627 |
3 | 0.907907 | 0.091569 | 0.000524 |
4 | 0.000807 | 0.561198 | 0.437995 |
a =accuracy(np.array(y_train),np.array(predict))
pr('accuracy : '+str(a))
accuracy : 0.75
From the $P_{M\times N}$ output, let us consider the row with index $1$:
$P_{1,1} = 0.004 \rightarrow $ about a $0$% chance that record $X^{1}$ belongs to class 1 'Iris-setosa'
$P_{1,2} = 0.167 \rightarrow $ about a $17$% chance that record $X^{1}$ belongs to class 2 'Iris-versicolor'
$P_{1,3} = 0.83 \rightarrow $ about an $83$% chance that it belongs to class 3 'Iris-virginica'
From the above result we can draw a conclusion:
We cannot be certain which class label record $X^{1}$ belongs to, but its third column, at about $83$%, is the largest, so we would assume the record belongs to class 'Iris-virginica'. In fact, that is the correct assumption when compared with the actual data $Y$. By applying this evaluation to all rows we can validate that the weight matrix $W_{K \times N}$ and $b$ give about $95$% accuracy.
How did I find the weight matrix $W$ and bias $B$?
I simply used LogisticRegression from scikit-learn and took its coefficients. Let us now try to work out a way of finding the weights $W$ and bias $b$ ourselves, and see whether it is possible to improve these estimators.
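For reference, a minimal sketch of that shortcut (the solver and regularization settings below are my assumptions; depending on the scikit-learn version the multi_class argument may be unnecessary, and because scikit-learn regularizes by default its coefficients will not match an unregularized softmax exactly):
from sklearn.linear_model import LogisticRegression

# multinomial logistic regression is softmax regression;
# scikit-learn expects integer class labels rather than one-hot vectors
clf = LogisticRegression(multi_class='multinomial', max_iter=1000)
clf.fit(X_train, np.argmax(y_train, axis=1))
W_sk, b_sk = clf.coef_, clf.intercept_     # shapes (3, 4) and (3,)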
For the objective (loss) function we will use cross-entropy, which is also used in binary logistic regression. The softmax loss is defined as:
$$ (2)\;\; \mathcal{L}(Y,Z)=-\sum_i^m\sum_j^k y_{ij} \log (p(Z)_{ij})$$
where $m$ is the number of records, $k$ is the number of classes, $y_{ij}$ are the label values, $p_{ij}=\phi_{softmax}(Z)_{ij}$ are the predicted class probabilities, and $Z$ is the net input, which is a function of the weight matrix $W$, the bias $b$ and the data $X$.
Our goal is to minimize eq.(2) in order to find the best estimators $w_{ij}\in W$ and $b$, given the iris data $X$ and the label data $Y$.
We are going to use gradient descent for the optimization process.
Note that eq.(2) is a function of all weights $w_{ij}$, all biases $b_j$, all training data $X$ and label data $Y$.
Gradient descent is defined as :
$$ \; \; \; \; \; \; \;\begin{matrix} w_{ij} = w_{ij} - \lambda \nabla w_{ij}L(W,b,X,Y) \\ \\ b_{j} = b_j - \lambda\nabla b_{j}L(W,b,X,Y)
\end{matrix} $$
where $\lambda$ is learning rate or step size
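To make the update rule concrete, here is a tiny one-dimensional illustration (unrelated to the iris data): minimizing $f(w) = (w-3)^2$, whose gradient is $2(w-3)$, by repeatedly stepping against the gradient:
w, lam = 0.0, 0.1
for epoch in range(50):
    w = w - lam * 2*(w - 3)    # w <- w - lambda * gradient
print(w)                       # very close to the minimum at w = 3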
Plugging eq.(2) into gradient descent, the gradients we need are given by the general formula:
$$(3) \; \; \; \; \; \;
\begin{matrix} \nabla_{w_{ij}}L(Y,P) &= -\frac{\partial}{\partial w_{ij}}\Big(\sum_{m}\sum_{n} y_{mn} \log {p_{mn}}\Big) \\ \\
\nabla_{b_{j}}L(Y,P) & = -\frac{\partial}{\partial b_{j}}\Big(\sum_{m}\sum_{n} y_{mn} \log{p_{mn}}\Big)
\end{matrix} $$
Before we deal with $\nabla_{w_{ij}}L(W,b)$, we are going to introduce some mathematical techniques that will make our work easier.
To simplify the summations over indices, we will introduce the Kronecker delta symbol:
$$\delta_{ij} =
\begin{cases}
1 & \text{if } i=j \\
0 & \text{if } i\ne j
\end{cases}$$
$$ \delta_{ij} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}$$
In many of the summations over indices that follow, we will omit the $\sum$ symbol; it is implied (hidden) according to the Einstein summation convention.
For example, consider the equation
$$z_{ij} = x_{ip} w_{jp} + b_j$$
The sign $\sum_p$ is omitted. The summation over $p$ is implied (by default) because $p$ is repeated twice. Whenever an index is repeated, that is the indicator of an implied $\sum$ over it, whose sign is simply not written.
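NumPy exposes this convention directly through np.einsum. As an illustrative aside (not used elsewhere in the notebook), the net input $z_{ij} = x_{ip} w_{jp} + b_j$ can be computed with the repeated index $p$ summed implicitly:
Z_ein = np.einsum('ip,jp->ij', X_train, W) + b     # sum over the repeated index p
print(np.allclose(Z_ein, X_train.dot(W.T) + b))    # True: identical to the matrix form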
In order to minimize the loss, we must take up the minimization of the cross-entropy with respect to $w_{ij}$. The cross-entropy loss is a function of all feature vectors $X_{M\times N}$, all labels $Y_{M\times K}$, the weights $W_{K\times N}$ and the bias $B_K$:
$L = L(X,Y,W,B)$
$\frac{\partial L}{\partial w_{ij}}=-\frac{\partial}{\partial w_{ij}}\Big(\sum_m\sum_n y_{mn} \log {p_{mn}}\Big)$ $=-\sum_m\sum_n y_{mn}\frac{\partial \log {p_{mn}}}{\partial w_{ij}}$
$=-\sum_m\sum_n\frac{y_{mn}}{p_{mn}}\frac{\partial p_{mn}}{\partial w_{ij}} $ $=-\sum_m\sum_n \frac{y_{mn}}{p_{mn}}\frac{\partial p_{mn}}{\partial z_{vp}}\frac{\partial z_{vp}}{\partial w_{ij}} $
$ =-\sum_m\sum_n \frac{y_{mn}}{p_{mn}}\,\delta_{mv}\frac{\partial p_{mn}}{\partial z_{vp}}\, \delta_{pi}\frac{\partial z_{vp}}{\partial w_{ij}} $
(the summation over the repeated indices $v$ and $p$ is implied; $p_{mn}$ depends only on the $m$-th row of $Z$, and $z_{vp}$ depends on $w_{ij}$ only when $p=i$, which the Kronecker deltas encode)
We've successfully reduced the number of sum operations using Einstein's convention and the Kronecker symbol, and obtained
$\frac{\partial L}{\partial w_{ij}}=-\sum_m\sum_n \frac{y_{mn}}{p_{mn}}\frac{\partial p_{mn}}{\partial z_{mi}}\frac{\partial z_{mi}}{\partial w_{ij}}$
Let us focus on the terms $\frac{\partial p_{mn}}{\partial z_{mi}}$ and $\frac{\partial z_{mi}}{\partial w_{ij}}$.
$\frac{\partial p_{mn}} {\partial z_{mi}}=\frac{\partial\frac { e^{z_{mn}} }{ \sum_ke^{z_{mk}}} }{\partial z_{mi}}$ $=\frac{1}{(\sum_ke^{z_{mk}})^2}\times \Big(\frac{\partial e^{z_{mn}} }{\partial z_{mi}}\times(\sum_ke^{z_{mk}}) - e^{z_{mn}}\times\frac{\partial (\sum_ke^{z_{mk}})}{\partial z_{mi}} \Big)$
$=\frac{e^{z_{mn}}\times\frac{\partial z_{mn}}{\partial z_{mi}}}{\sum_ke^{z_{mk}}} - \frac{e^{z_{mn}}}{\sum_ke^{z_{mk}}}\times\frac{ \sum_k e^{z_{mk}} \frac{ \partial z_{mk}}{\partial z_{mi}}} {\sum_ke^{z_{mk}}}$
$=\frac{e^{z_{mn}}\, \delta_{ni} }{\sum_ke^{z_{mk}}} - \frac{e^{z_{mn}}}{\sum_ke^{z_{mk}}}\times\frac{ \sum_k e^{z_{mk}}\,\delta_{ki}} {\sum_ke^{z_{mk}}}$
$=\frac{e^{z_{mn}}\, \delta_{ni} }{\sum_ke^{z_{mk}}} - \frac{e^{z_{mn}}}{\sum_ke^{z_{mk}}}\times \frac{ e^{z_{mi}}}{\sum_ke^{z_{mk}}}$
$=p_{mn}\, \delta_{ni} - p_{mn}\, p_{mi}$
$=p_{mn}(\delta_{ni} - p_{mi})$
For the term $\frac{\partial p_{mn}} {\partial z_{mi}}$ we obtain:
$$(4) \; \; \; \; \frac{\partial p_{mn}} {\partial z_{mi}}=p_{mn}(\delta_{ni} - p_{mi})$$
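Equation (4) is easy to verify numerically for a single record: for one row $\vec{z}$ the Jacobian of the softmax output is $\mathrm{diag}(p) - p\,p^T$. A small finite-difference check (a sketch with arbitrary numbers, added only as a sanity test) confirms it:
z = np.array([1.0, 2.0, 0.5])                  # one row of net inputs
p = np.exp(z) / np.exp(z).sum()                # softmax of that row
J_analytic = np.diag(p) - np.outer(p, p)       # p_n * (delta_ni - p_i)
eps = 1e-6
J_numeric = np.zeros((3, 3))
for i in range(3):
    z_eps = z.copy()
    z_eps[i] += eps                            # perturb one component z_i
    p_eps = np.exp(z_eps) / np.exp(z_eps).sum()
    J_numeric[:, i] = (p_eps - p) / eps        # finite-difference column
print(np.allclose(J_analytic, J_numeric, atol=1e-5))   # True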
$\frac{\partial z_{mi}}{\partial w_{ij}} =\frac{\partial \big( \sum_k x_{mk}\,w_{ik} + b_i\big)}{\partial w_{ij}}= \sum_k x_{mk}\,\frac{\partial w_{ik}}{\partial w_{ij}}$
$= \sum_k x_{mk}\, \delta_{kj} = x_{mj}$
$\frac{\partial L}{\partial w_{ij}}=-\sum_m\sum_n \frac{y_{mn}}{p_{mn}}p_{mn}(\delta_{ni} - p_{mi})x_{mj}=-\sum_m\sum_n y_{mn}(\delta_{ni} - p_{mi})x_{mj} $
$ =-\sum_m\sum_n y_{mn}\delta_{ni}x_{mj} + \sum_m\sum_n y_{mn} p_{mi}x_{mj} $
$ =-\sum_m y_{mi}x_{mj} + \sum_m\Big(\sum_n y_{mn}\Big) p_{mi}x_{mj} $
Since each row of $Y$ is one-hot encoded, $\sum_n y_{mn}=1$, so
$ =-\sum_m y_{mi}x_{mj} + \sum_m p_{mi}x_{mj} = \sum_m p_{mi}x_{mj}-\sum_m y_{mi}x_{mj}$
$ = \sum_m( p_{mi}-y_{mi})x_{mj}$
We've arrived at the most important result, the gradient of the cross-entropy loss with respect to $w_{ij}$:
$$(6) \;\;\;\;\nabla_{w_{ij}}L(W,b)= \sum_m( p_{mi}-y_{mi})x_{mj}$$
If we apply the same steps to $\nabla_{b_{i}}L(W,b)$ (which is even easier), we obtain the corresponding formula for the bias:
$$(7)\;\;\;\;\nabla_{b_{i}}L(W,b)= \sum_m( p_{mi}-y_{mi})$$
Although eq.(6) looks simple and elegant, it is written in index (tensor) form rather than matrix form, which makes it harder to implement, especially when we want to use our beloved library NumPy. We can, however, rewrite it in matrix form:
$(8)\;\;\;\;\nabla_W L = \begin{bmatrix} \nabla_{w_{11}} L & \nabla_{w_{12}}L &... &\nabla_{ w_{1j}}L \\ \nabla_{ w_{21}}L & \nabla_{ w_{22}}L &... &\nabla_{ w_{2j}}L \\ ... & ... & ... & ... \\ \nabla_{ w_{i1}}L & \nabla_{ w_{i2}}L & ...& \nabla_{ w_{ij}}L \end{bmatrix} $ $ =\begin{bmatrix} p_{11} -y_{11} & p_{21}-y_{21} & ... & p_{m1}-y_{m1}\\ p_{12} -y_{12} & p_{22}-y_{22} & ... & p_{m2}-y_{m2}\\ \;\;...\;\;\; & \;\;...\;\;\; &\;\;...\;\;\; &\;\;...\;\;\; \\ p_{1i} -y_{1i} & p_{2i}-y_{2i} & ... & p_{mi}-y_{mi} \end{bmatrix}$ $\begin{bmatrix} x_{11} & x_{12} &...& x_{1j} \\ x_{21} & x_{22} &...& x_{2j} \\ ... & ... & ... &... \\ x_{m1} & x_{m2} &...& x_{mj}\end{bmatrix} = (P-Y)^T X$
$$\;\;\;\;\nabla_{b_i} L = \sum_m (P-Y)_{mi}\;\;\;\text{or, in vector form,}$$
$$9)\;\;\;\;\nabla_{b} L =\Big[\sum_m (P-Y)_{m1}, \sum_m (P-Y)_{m2},..., \sum_m (P-Y)_{mi}\Big]$$
We've written eq.(6) and eq.(7) in matrix form, and the result is surprisingly simple and easy to implement. If we plug eq.(8) and eq.(9) into the gradient descent equations
$$ \; \; \; \; \; \; \;\begin{matrix} w_{ij} = w_{ij} - \lambda \nabla w_{ij}L(W,b) \\ \\ b_{j} = b_j - \lambda\nabla b_{j}L(W,b)
\end{matrix} $$
we obtain our minimization algorithm for the cross-entropy loss, which finds the best estimators $w_{ij}$. The minimization algorithm using gradient descent is defined as:
$$(10) \; \; \; \; \; \; \; W= W -\lambda(P-Y)^T.X $$
$$(11) \; \; \; \; \; \; \; b_i = b_i -\lambda \sum_m(P-Y)_{mi} $$
where $W$ is the weight matrix, $\lambda$ is the learning rate or step size, $P$ holds the prediction values produced by softmax, $Y$ is the target values, and $X$ is the training data (feature vectors).
The implementation using NumPy could be written as:
$$ W = W - \gamma\,\big(\mathrm{softmax}(X,W,b) - Y\big)^T.dot(X)$$
$$ b_i = b_i - \gamma\,{\textstyle\sum_m}\big(\mathrm{softmax}(X,W,b)_{mi} - Y_{mi}\big) $$
where $W$ is the matrix of our estimator coefficients, $\gamma = \lambda \cdot (1/m)$ is the step size, $\lambda$ is the learning rate, $Y$ is the target values and $X$ is the training data (feature vectors).
The implementation of gradient descent
def gradient_descent(X, y, W, b, step_size):
    '''
    perform one iteration (epoch) of gradient descent
    Parameters :
    X : ndarray
        train data
    y : ndarray
        target data
    W : ndarray
        weight matrix
    b : ndarray
        bias
    step_size : float
        gradient descent setting
    '''
    P_y = softmax(X, W, b) - y              # (P - Y)
    W = W - step_size*(P_y.T).dot(X)        # eq.(10)
    b = b - step_size*np.sum(P_y, axis=0)   # eq.(11)
    return W, b
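Before training, one optional sanity check (not part of the original notebook; cross_entropy is a helper defined only for this check) is to compare the analytic gradient of eq.(6) with a finite-difference approximation of the loss in eq.(2):
def cross_entropy(X, y, W, b):
    '''loss of eq.(2): -sum_i sum_j y_ij * log(p_ij)'''
    P = softmax(X, W, b)
    return -np.sum(y * np.log(P))

# analytic gradient of one weight entry, eq.(6)
P_y = softmax(X_train, W, b) - y_train
grad_analytic = (P_y.T).dot(X_train)[0, 0]
# finite-difference approximation of the same entry
eps = 1e-6
W_eps = W.copy()
W_eps[0, 0] += eps
grad_numeric = (cross_entropy(X_train, y_train, W_eps, b) - cross_entropy(X_train, y_train, W, b)) / eps
print(grad_analytic, grad_numeric)   # the two values should nearly coincide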
def train(X, y, max_iter=100, learning_rate=0.1, initial_value=1, debug_W=None):
    '''
    Train by softmax regression
    Parameters :
    X : ndarray
        train data
    y : ndarray
        target data
    max_iter : int
        number of epochs (iterations)
    debug_W : tuple
        index of the weight parameter to track for debugging
    Returns
    W, b : ndarray
        weight and bias
    in debug mode
    W, b, debug : ndarray
        weight, bias and the tracked debugging parameter
    '''
    if type(X) != np.ndarray or type(y) != np.ndarray:
        raise ValueError('X and y must be ndarray')
    #init weight and bias
    b = np.full((y.shape[1],), initial_value)
    W = np.full((y.shape[1], X.shape[1]), initial_value)
    m = X.shape[0]
    step_size = (1/m)*learning_rate
    if debug_W is not None:
        debug_mode = True
        debug = W[debug_W]
    else:
        debug_mode = False
    for i in range(max_iter):
        W, b = gradient_descent(X, y, W, b, step_size)
        if debug_mode: debug = np.append(debug, W[debug_W])
    if debug_mode: return W, b, debug
    return W, b
Let us test our implementation and train on the iris data.
Training the data
W,b = train(np.array(X_train), np.array(y_train),max_iter=100)
print('weight vector :'+ str(W))
print('')
print('bias :'+ str(b))
weight vector :[[ 1.35411091  1.84190926 -0.19105989  0.45297526]
 [ 1.27780858  0.77747064  1.17029213  0.79586893]
 [ 0.36808051  0.3806201   2.02076776  1.75115581]]

bias :[1.17282574 1.13488628 0.69228799]
predict = softmax(np.array(X_test), W,b)
pr('accuracy: ' + str(accuracy(np.array(y_test), predict)))
accuracy: 0.7
We've achieved 70% accuracy with learning rate $\lambda = 0.1$ and 100 iterations.
Testing on data X_test and y_test
We will now examine the behaviour of gradient descent for different learning rates and maximum iteration counts.
Debugging tool implementation
def debug_gradient(max_iter, l_rates=(100, 50, 10, 5, 1, 0.1), call_function=train, debug_weight=(1,1), y_size=[-20,20]):
    '''
    Parameters
    max_iter : number
        maximum count of iterations (epochs)
    l_rates : tuple
        different learning rates to compare
    debug_weight : tuple
        index of the weight parameter which will be tracked
    '''
    i = -1
    plt.figure(figsize=(15,5))
    axes = plt.gca()
    axes.set_ylim(y_size)
    axes.set_xlabel(r'$log(number\; of\; iterations)$')
    axes.set_ylabel(r'weight parameter: $W_{}$'.format(debug_weight))
    for learning_rate in l_rates:
        i += 1
        start = datetime.datetime.now()
        W, b, w_track = call_function(X_test, y_test, max_iter=max_iter, learning_rate=learning_rate, debug_W=debug_weight)
        accr = accuracy(y_test, softmax(X_test, W, b))*100
        time = int((datetime.datetime.now()-start).total_seconds()*1000)
        x = np.linspace(1, len(w_track), len(w_track))
        plt.plot(np.log(x), w_track, label=r'$\lambda$ = {} , accr:{} %'.format(learning_rate, accr), color=cmap[i])
    plt.title("Curve of weight estimator $W${}, {} iterations, execution time {}ms".format(debug_weight, max_iter, time))
    plt.legend()
    plt.show()
cmap = ['#701b1b','#a7ba1a','#1aba25','#1a75ba','#9a1aba','#7d1aba','#ba1a45']
debug_gradient(10, debug_weight=(2, 2), y_size=[-30,150])
debug_gradient(10,y_size=[-20,30])
In both graphics above we see the evolution of the weight components $w_{22}$ and $w_{11}$ with respect to log(epoch), i.e. the number of iterations. From the graphics we may conclude that all values are still essentially arbitrary; moreover, a larger $\lambda$ corresponds to larger fluctuations, leading to a larger interval in which the values can vary. The reason for that is the small number of iterations.
Increasing the maximum number of iterations to 150:
debug_gradient(150, debug_weight=(2, 2), y_size=[-20,100])
debug_gradient(150)
For $\lambda \in [10, 5, 1, 0.1]$ the weights have a limit $$\lim_{epoch \to 150} w_{ij}(epoch) = L$$ where $L$ is neither arbitrary nor infinite. This means that gradient descent becomes stable, with a tendency to give plausible results, and $w_{ij}$ can be considered approximately correct. For $\lambda \in [10, 5, 0.1]$ the achieved accuracy is $100\%$, which is a sign that the algorithm may be overfitting; it has perhaps learned the noise in the data, which means the accuracy on new, unseen data would decrease. For $\lambda \in [100, 50]$ there is still underfitting, because the learning rate is too large and gradient descent makes big jumps.
In fact, we are not able to judge which $\lambda$ and $M$ are best, since the dataset is small, although we could assume that a smaller $\lambda$ and a bigger $M$ often lead to good results at the expense of execution time and a greater risk of overfitting. In summary, every dataset has its own set of $\lambda$ and $M$ that will give satisfying accuracy, and in order to find the best ones we have to experiment with them.
We have walked step by step from the math to an implementation of multi-class classification using softmax regression. This can serve as a guide for implementing logistic regression, and some of the notes related to optimization could be useful for neural networks.