Commit 8917f662 authored by Lateefah Bello

Lesson 4 review

@shwars there are some comments in the jupyter notebooks that aren't in english.
parent a4adc3ef
@@ -10,7 +10,7 @@ Using the code we have developed in this lesson for binary classification of MNI
1. For each digit, create a dataset for a binary classifier of "this digit vs. all other digits"
1. Train 10 different perceptrons for binary classification (one for each digit)
1. Define a function that will classify an input digit
> **Hint**: If we combine the weights of all 10 perceptrons into one matrix, we can apply all 10 perceptrons to the input digits with a single matrix multiplication. The most probable digit can then be found by applying the `argmax` operation to the output.
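
The hint can be sketched as follows. This is only an illustration, not the lesson's solution: the matrix `W_all` is filled with random numbers standing in for 10 trained perceptrons, the input size of 784 assumes flattened 28x28 images, and the names `W_all` and `classify` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

n_features = 784   # assumption: 28x28 images flattened into vectors
n_digits = 10

# Stand-in for trained weights: row d would hold the weights of the
# "digit d vs. all other digits" perceptron.
W_all = rng.normal(size=(n_digits, n_features))

def classify(x, W):
    """Apply all perceptrons at once and pick the highest-scoring digit."""
    scores = x @ W.T                  # shape: (n_samples, n_digits)
    return np.argmax(scores, axis=1)

x = rng.normal(size=(5, n_features))  # five fake input "images"
pred = classify(x, W_all)
print(pred.shape)                     # (5,)
```

With real weights, each row of `scores` holds the outputs of the 10 binary classifiers for one input, so `argmax` selects the digit whose classifier responds most strongly.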
...
%% Cell type:markdown id: tags:
## Multi-Layered Perceptrons
## Building our own Neural Framework
> This notebook is a part of [AI for Beginners Curricula](http://github.com/microsoft/ai-for-beginners). Visit the repository for the complete set of learning materials.
In this notebook, we will gradually build our own neural framework capable of solving multi-class classification tasks as well as regression with multi-layered perceptrons.
First, let's import some required libraries.
%% Cell type:code id: tags:
```
%matplotlib nbagg
import matplotlib.pyplot as plt
from matplotlib import gridspec
from sklearn.datasets import make_classification
import numpy as np
# pick the seed for reproducibility - change it to explore the effects of random variations
np.random.seed(0)
import random
```
%% Cell type:markdown id: tags:
## Sample Dataset
As before, we will start with a simple sample dataset with two features.
%% Cell type:code id: tags:
```
n = 100
X, Y = make_classification(n_samples = n, n_features=2,
                           n_redundant=0, n_informative=2, flip_y=0.2)
X = X.astype(np.float32)
Y = Y.astype(np.int32)
# Split into training and test sets
train_x, test_x = np.split(X, [n*8//10])
train_labels, test_labels = np.split(Y, [n*8//10])
```
%% Cell type:code id: tags:
```
def plot_dataset(suptitle, features, labels):
    # prepare the plot
    fig, ax = plt.subplots(1, 1)
    #pylab.subplots_adjust(bottom=0.2, wspace=0.4)
    fig.suptitle(suptitle, fontsize = 16)
    ax.set_xlabel('$x_i[0]$ -- (feature 1)')
    ax.set_ylabel('$x_i[1]$ -- (feature 2)')
    colors = ['r' if l else 'b' for l in labels]
    ax.scatter(features[:, 0], features[:, 1], marker='o', c=colors, s=100, alpha = 0.5)
    fig.show()
```
%% Cell type:code id: tags:
```
plot_dataset('Scatterplot of the training data', train_x, train_labels)
plt.show()
```
%% Output
%% Cell type:code id: tags:
```
print(train_x[:5])
print(train_labels[:5])
```
%% Output
[[ 1.3382818  -0.98613256]
 [ 0.5128146   0.43299454]
 [-0.4473693  -0.2680512 ]
 [-0.9865851  -0.28692   ]
 [-1.0693829   0.41718036]]
[1 1 0 0 0]
%% Cell type:markdown id: tags:
## Machine Learning Problem
Suppose we have an input dataset $\langle X,Y\rangle$, where $X$ is a set of features and $Y$ is the set of corresponding labels. For a regression problem, $y_i\in\mathbb{R}$; for classification, each label is a class number $y_i\in\{0,\dots,n\}$.
Any machine learning model can be represented by a function $f_\theta(x)$, where $\theta$ is a set of **parameters**. Our goal is to find parameters $\theta$ such that the model fits the dataset in the best way. The criterion is defined by a **loss function** $\mathcal{L}$, and we need to find the optimal value
$$
\theta = \mathrm{argmin}_\theta \mathcal{L}(f_\theta(X),Y)
$$
%% Cell type:markdown id: tags:
The loss function depends on the problem being solved.
### Loss functions for regression
For regression, we often use **absolute error** $\mathcal{L}_{abs}(\theta) = \sum_{i=1}^n |y_i - f_{\theta}(x_i)|$, or **squared error** $\mathcal{L}_{sq}(\theta) = \sum_{i=1}^n (y_i - f_{\theta}(x_i))^2$ (dividing the latter by $n$ gives the **mean squared error**).
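As a small worked example of the two formulas, the sketch below evaluates both losses with NumPy. The arrays `y_true` and `y_pred` are made-up numbers chosen for illustration, not data from this notebook.

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0])   # made-up targets y_i
y_pred = np.array([1.5, 2.0, 2.0])   # made-up model outputs f_theta(x_i)

abs_loss = np.sum(np.abs(y_true - y_pred))   # L_abs: 0.5 + 0 + 1
sq_loss = np.sum((y_true - y_pred) ** 2)     # L_sq: 0.25 + 0 + 1

print(abs_loss)   # 1.5
print(sq_loss)    # 1.25
```

Note how the squared error penalizes the largest deviation relatively more than the absolute error does.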
%% Cell type:code id: tags:
```
# helper function for plotting various loss functions
# note: the curves are drawn against the global array `x`, which must be defined before the call
def plot_loss_functions(suptitle, functions, ylabels, xlabel):
    fig, ax = plt.subplots(1,len(functions), figsize=(9, 3))
    plt.subplots_adjust(bottom=0.2, wspace=0.4)
    fig.suptitle(suptitle)
    for i, fun in enumerate(functions):
        ax[i].set_xlabel(xlabel)
        if len(ylabels) > i:
            ax[i].set_ylabel(ylabels[i])
        ax[i].plot(x, fun)
    plt.show()
```
%% Cell type:code id: tags:
```
x = np.linspace(-2, 2, 101)
plot_loss_functions(
    suptitle = 'Common loss functions for regression',
    functions = [np.abs(x), np.power(x, 2)],
    ylabels = [r'$\mathcal{L}_{abs}$ (absolute loss)',
               r'$\mathcal{L}_{sq}$ (squared loss)'],
    xlabel = '$y - f(x_i)$')
```
%% Output
%% Cell type:markdown id: tags:
### Loss functions for classification
Let's consider binary classification for a moment. In this case we have two classes, numbered 0 and 1. The output of the network $f_\theta(x_i)\in [0,1]$ essentially defines the probability of choosing class 1.
**0-1 loss**
0-1 loss is closely related to the accuracy of the model: it counts the number of incorrect classifications, so a lower value means a higher accuracy:
$$\mathcal{L}_{0-1} = \sum_{i=1}^n l_i \quad l_i = \begin{cases}
0 & (f(x_i)<0.5 \land y_i=0) \lor (f(x_i)\ge 0.5 \land y_i=1) \\
1 & \mathrm{otherwise}
\end{cases}
$$
However, accuracy itself does not show how far we are from the right classification. It could be that we missed the correct class just by a little bit, and that is in a way "better" (in the sense that we need to correct the weights much less) than missing significantly. Thus, logistic loss, which takes this into account, is used more often.
**Logistic Loss**
$$\mathcal{L}_{log} = \sum_{i=1}^n -y_i\log(f_{\theta}(x_i)) - (1-y_i)\log(1-f_\theta(x_i))$$
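To make the difference between the two losses concrete, here is a minimal sketch that evaluates both on a handful of made-up predictions. The arrays `y` and `f` are illustrative values, and the 0.5 threshold for turning a probability into a class is the usual convention assumed here.

```python
import numpy as np

y = np.array([1, 0, 1, 0])            # made-up labels
f = np.array([0.9, 0.2, 0.4, 0.6])    # made-up predicted probabilities of class 1

pred = (f >= 0.5).astype(int)         # predicted class after thresholding
zero_one_loss = np.sum(pred != y)     # number of misclassified samples

log_loss = np.sum(-y * np.log(f) - (1 - y) * np.log(1 - f))

print(zero_one_loss)   # 2
print(log_loss)
```

The last two samples are both counted as errors by the 0-1 loss, while the logistic loss also reflects how confident each wrong prediction was.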
%% Cell type:code id: tags:
```
x = np.linspace(0,1,100)
def zero_one(d):
    if d < 0.5:
        return 1  # the expected class is 1, so p < 0.5 is a misclassification
    return 0
zero_one_v = np.vectorize(zero_one)
def logistic_loss(fx):
    # assumes y == 1
    return -np.log(fx)
```
%% Cell type:code id: tags:
```
plot_loss_functions(suptitle = 'Common loss functions for classification (class=1)',
                    functions = [zero_one_v(x), logistic_loss(x)],
                    ylabels = [r'$\mathcal{L}_{0-1}$ (0-1 loss)',
                               r'$\mathcal{L}_{log}$ (logistic loss)'],
                    xlabel = '$p$')
```
%% Output
C:\Users\dmitryso\AppData\Local\Temp/ipykernel_55820/331859503.py:10: RuntimeWarning: divide by zero encountered in log
  return -np.log(fx)
%% Cell type:markdown id: tags:
To understand logistic loss, consider two cases of the expected output:
* If we expect the output to be 1 ($y=1$), then the loss is $-\log f_\theta(x_i)$. The loss is 0 if the network predicts 1 with probability 1, and grows larger as the probability of 1 gets smaller.
* If we expect the output to be 0 ($y=0$), the loss is $-\log(1-f_\theta(x_i))$. Here, $1-f_\theta(x_i)$ is the probability of 0 predicted by the network, and the meaning of log-loss is the same as in the previous case.
%% Cell type:markdown id: tags:
## Neural Network Architecture
We have generated a dataset for a binary classification problem. However, let's treat it as multi-class classification right from the start, so that we can easily switch our code to more classes later. In this case, our one-layer perceptron will have the following architecture:
<img src="images/NeuroArch.png" width="50%"/>
The two outputs of the network correspond to the two classes, and the class with the highest output value is taken as the prediction.
The model is defined as
$$
f_\theta(x) = W\times x + b
$$
where $\theta = \langle W,b\rangle$ are the parameters.
We will define this linear layer as a Python class with a `forward` function that performs the calculation. It receives an input value $x$ and produces the output of the layer. The parameters `W` and `b` are stored within the layer class, and are initialized upon creation with random values and zeroes respectively.
%% Cell type:code id: tags:
```
class Linear:
    def __init__(self,nin,nout):
        self.W = np.random.normal(0, 1.0/np.sqrt(nin), (nout, nin))
        self.b = np.zeros((1,nout))
    def forward(self, x):
        return np.dot(x, self.W.T) + self.b

net = Linear(2,2)
net.forward(train_x[0:5])
```
%% Output
array([[ 1.77202116, -0.25384488],
       [ 0.28370828, -0.39610552],
       [-0.30097433,  0.30513182],
       [-0.8120485 ,  0.56079421],
       [-1.23519653,  0.3394973 ]])
%% Cell type:markdown id: tags:
In many cases, it is more efficient to operate not on a single input value, but on a whole vector of input values. Because we use NumPy operations, we can pass a vector of input values to our network, and it will give us the vector of output values.
## Softmax: Turning Outputs into Probabilities
As you can see, our outputs are not probabilities - they can take any values. In order to convert them into probabilities, we need to normalize the values across all classes. This is done using the **softmax** function: $$\sigma(\mathbf{z}_c) = \frac{e^{z_c}}{\sum_{j} e^{z_j}}, \quad\mathrm{for}\quad c\in 1 .. |C|$$
<img src="https://raw.githubusercontent.com/shwars/NeuroWorkshop/master/images/NeuroArch-softmax.PNG" width="50%">
> Output of the network $\sigma(\mathbf{z})$ can be interpreted as a probability distribution on the set of classes $C$: $q = \sigma(\mathbf{z}_c) = \hat{p}(c | x)$
We will define the `Softmax` layer in the same manner, as a class with a `forward` function:
%% Cell type:code id: tags:
```
class Softmax:
    def forward(self,z):
        zmax = z.max(axis=1,keepdims=True)
        expz = np.exp(z-zmax)
        Z = expz.sum(axis=1,keepdims=True)
        return expz / Z

softmax = Softmax()
softmax.forward(net.forward(train_x[0:10]))
```
%% Output
array([[0.88348621, 0.11651379],
       [0.66369714, 0.33630286],
       [0.35294795, 0.64705205],
       [0.20216095, 0.79783905],
       [0.17154828, 0.82845172],
       [0.24279153, 0.75720847],
       [0.18915732, 0.81084268],
       [0.17282951, 0.82717049],
       [0.13897531, 0.86102469],
       [0.72746882, 0.27253118]])
%% Cell type:markdown id: tags:
You can see that we are now getting probabilities as outputs, i.e. each output vector sums to 1 (up to floating-point rounding).
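The two properties that matter here can be checked with a short stand-alone sketch: every row sums to 1, and subtracting the row maximum before exponentiating does not change the result but keeps `np.exp` from overflowing. The function `softmax` below is a hypothetical free-function variant written for this check, and the input values are made up.

```python
import numpy as np

def softmax(z):
    # subtract the row-wise maximum before exponentiating to avoid overflow
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

z = np.array([[1.0, 2.0, 3.0],
              [1000.0, 1001.0, 1002.0]])   # the second row would overflow a naive np.exp
p = softmax(z)

print(np.allclose(p.sum(axis=1), 1.0))   # True
print(np.allclose(p[0], p[1]))           # True: adding a constant to every input changes nothing
```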
In case we have more than 2 classes, softmax will normalize probabilities across all of them. Here is a diagram of network architecture that does MNIST digit classification:
![MNIST Classifier](images/Cross-Entropy-Loss.PNG)
%% Cell type:markdown id: tags:
## Cross-Entropy Loss
The loss function in classification is typically the logistic loss, which can be generalized as **cross-entropy loss**. Cross-entropy loss measures how much two arbitrary probability distributions differ. You can find a more detailed discussion of it [on Wikipedia](https://en.wikipedia.org/wiki/Cross_entropy).
In our case, the first distribution is the probabilistic output of our network, and the second one is the so-called **one-hot** distribution, which specifies that a given class $c$ has probability 1 (all the rest being 0). In such a case cross-entropy loss can be calculated as $-\log p_c$, where $c$ is the expected class, and $p_c$ is the probability of this class given by our neural network.
> If the network returns probability 1 for the expected class, cross-entropy loss is 0. The closer the probability of the actual class is to 0, the higher the cross-entropy loss (and it can go up to infinity!).
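
The reduction to $-\log p_c$ can be verified numerically. The sketch below uses the general cross-entropy between a target distribution and a predicted one, $-\sum_c t_c \log q_c$, with a one-hot target; the vector `q` and the class index `c` are made-up values for illustration.

```python
import numpy as np

q = np.array([0.7, 0.2, 0.1])   # made-up network output (probabilities)
c = 0                           # expected class
one_hot = np.zeros_like(q)
one_hot[c] = 1.0

# general cross-entropy between the one-hot target and the prediction
ce_general = -np.sum(one_hot * np.log(q))
# shortcut that is valid for a one-hot target
ce_shortcut = -np.log(q[c])

print(np.isclose(ce_general, ce_shortcut))   # True
```

All terms with a zero target probability vanish, which is why only the probability of the expected class remains.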
%% Cell type:code id: tags:
```
def plot_cross_ent():
    p = np.linspace(0.01, 0.99, 101) # estimated probability p(y|x)
    cross_ent_v = np.vectorize(cross_ent)
    f3, ax = plt.subplots(1,1, figsize=(8, 3))
    l1, = plt.plot(p, cross_ent_v(p, 1), 'r--')
    l2, = plt.plot(p, cross_ent_v(p, 0), 'r-')
    plt.legend([l1, l2], ['$y = 1$', '$y = 0$'], loc = 'upper center', ncol = 2)
    plt.xlabel(r'$\hat{p}(y|x)$', size=18)
    plt.ylabel(r'$\mathcal{L}_{CE}$', size=18)
    plt.show()
```
%% Cell type:code id: tags:
```
def cross_ent(prediction, ground_truth):
    t = 1 if ground_truth > 0.5 else 0
    return -t * np.log(prediction) - (1 - t) * np.log(1 - prediction)
plot_cross_ent()
```
%% Output
%% Cell type:markdown id: tags:
Cross-entropy loss will again be defined as a separate layer, but its `forward` function will have two input values: the output of the previous layers of the network `p`, and the expected class `y`:
%% Cell type:code id: tags:
```
class CrossEntropyLoss:
    def forward(self,p,y):
        self.p = p
        self.y = y
        p_of_y = p[np.arange(len(y)), y]
        log_prob = np.log(p_of_y)
        return -log_prob.mean() # average over all input samples

cross_ent_loss = CrossEntropyLoss()
p = softmax.forward(net.forward(train_x[0:10]))
cross_ent_loss.forward(p,train_labels[0:10])
```
%% Output
1.429664938969559
%% Cell type:markdown id: tags:
> **IMPORTANT**: The loss function returns a number that shows how well (or how badly) our network performs. It should return one number for the whole dataset, or for a part of the dataset (minibatch). Thus, after calculating cross-entropy loss for each individual sample of the input minibatch, we need to average (or add) all the values together, which is done by the call to `.mean()`.
## Computational Graph
<img src="images/ComputeGraph.png" width="600px"/>
Up to this moment, we have defined different classes for different layers of the network. The composition of those layers can be represented as a **computational graph**. Now we can compute the loss for a given training dataset (or part of it) in the following manner:
%% Cell type:code id: tags:
```
z = net.forward(train_x[0:10])
p = softmax.forward(z)
loss = cross_ent_loss.forward(p,train_labels[0:10])
print(loss)
```
%% Output
1.429664938969559
%% Cell type:markdown id: tags:
## Loss Minimization Problem and Network Training
Once we have defined our network as $f_\theta$, and given the loss function $\mathcal{L}(Y,f_\theta(X))$, we can consider $\mathcal{L}$ as a function of $\theta$ under our fixed training dataset: $\mathcal{L}(\theta) = \mathcal{L}(Y,f_\theta(X))$
In this case, the network training would be a minimization problem of $\mathcal{L}$ under argument $\theta$:
$$
\theta = \mathrm{argmin}_{\theta} \mathcal{L}(Y,f_\theta(X))
$$
There is a well-known method of function optimization called **gradient descent**. The idea is that we can compute the derivative (in the multi-dimensional case called the **gradient**) of the loss function with respect to the parameters, and vary the parameters in such a way that the error decreases.
Gradient descent works as follows:
* Initialize parameters with some random values $W^{(0)}$, $b^{(0)}$
* Repeat the following step many times, where $\eta$ is the step size (learning rate):
$$\begin{align}
W^{(i+1)}&=W^{(i)}-\eta\frac{\partial\mathcal{L}}{\partial W}\\
b^{(i+1)}&=b^{(i)}-\eta\frac{\partial\mathcal{L}}{\partial b}
\end{align}
$$
During training, the optimization steps are supposed to be calculated considering the whole dataset (remember that the loss is calculated as a sum/average over all training samples). However, in real life we take small portions of the dataset called **minibatches**, and calculate gradients based on a subset of the data. Because the subset is taken randomly each time, this method is called **stochastic gradient descent** (SGD).
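The update rule and the minibatch idea can be tried on a problem much simpler than a neural network. The following sketch fits a line with minibatch SGD under several assumptions made only for this example: synthetic data generated from $y = 3x + 1$, a mean squared error loss, parameters started at zero rather than at random values, and a learning rate and batch size picked by hand.

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic data: y = 3*x + 1 plus a little noise (made up for this example)
x = rng.uniform(-1, 1, size=200)
y = 3.0 * x + 1.0 + 0.01 * rng.normal(size=200)

w, b = 0.0, 0.0     # initial parameters
eta = 0.1           # learning rate
batch_size = 20

for step in range(500):
    idx = rng.choice(len(x), size=batch_size, replace=False)   # random minibatch
    xb, yb = x[idx], y[idx]
    err = w * xb + b - yb             # prediction error on the minibatch
    grad_w = 2 * np.mean(err * xb)    # derivative of the mean squared error w.r.t. w
    grad_b = 2 * np.mean(err)         # derivative of the mean squared error w.r.t. b
    w -= eta * grad_w                 # gradient descent step
    b -= eta * grad_b

print(w, b)   # should end up close to the true values 3 and 1
```

Each step only sees 20 of the 200 samples, so individual gradients are noisy, yet the parameters still settle near the values used to generate the data.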
%% Cell type:markdown id: tags:
## Backward Propagation
<img src="images/ComputeGraph.png" width="300px" align="left"/>
$$\def\L{\mathcal{L}}\def\zz#1#2{\frac{\partial#1}{\partial#2}}
\begin{align}
\zz{\L}{W} =& \zz{\L}{p}\zz{p}{z}\zz{z}{W}\cr
\zz{\L}{b} =& \zz{\L}{p}\zz{p}{z}\zz{z}{b}
\end{align}
$$
%% Cell type:markdown id: tags:
To compute $\partial\mathcal{L}/\partial W$ we can use the **chain rule** for computing derivatives of a composite function, as you can see in the formulae above. It corresponds to the following idea:
* Suppose under given input we have obtanes loss $\Delta\mathcal{L}$ * Suppose under given input we have obtained loss $\Delta\mathcal{L}$
* To minimize it, we would have to adjust softmax output $p$ by value $\Delta p = (\partial\mathcal{L}/\partial p)\Delta\mathcal{L}$ * To minimize it, we would have to adjust softmax output $p$ by value $\Delta p = (\partial\mathcal{L}/\partial p)\Delta\mathcal{L}$
* This corresponds to the changes to node $z$ by $\Delta z = (\partial\mathcal{p}/\partial z)\Delta p$ * This corresponds to the changes to node $z$ by $\Delta z = (\partial\mathcal{p}/\partial z)\Delta p$
* To minimize this error, we need to adjust parameters accordingly: $\Delta W = (\partial\mathcal{z}/\partial W)\Delta z$ (and the same for $b$) * To minimize this error, we need to adjust parameters accordingly: $\Delta W = (\partial\mathcal{z}/\partial W)\Delta z$ (and the same for $b$)
<img src="images/ComputeGraphGrad.PNG" width="400px" align="right"/> <img src="images/ComputeGraphGrad.PNG" width="400px" align="right"/>
This process starts distributing the loss error from the output of the network back to its parameters. Thus the process is called **back propagation**. This process starts distributing the loss error from the output of the network back to its parameters. Thus the process is called **back propagation**.
One pass of the network training consists of two parts: One pass of the network training consists of two parts:
* **Forward pass**, when we calculate the value of loss function for a given input minibatch * **Forward pass**, when we calculate the value of loss function for a given input minibatch
* **Backward pass**, when we try to minimize this error by distributing it back to the model parameters through the computational graph. * **Backward pass**, when we try to minimize this error by distributing it back to the model parameters through the computational graph.
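
As a small illustration of the chain rule at work, here is a sketch on a single scalar example. The sigmoid node and squared-error loss are assumptions chosen to keep the arithmetic short; they are not the softmax and cross-entropy used in the rest of this notebook. The derivative obtained by multiplying local derivatives is compared with a numerical estimate.

```python
import numpy as np

def forward(w, b, x, y):
    z = w * x + b                  # linear node
    p = 1.0 / (1.0 + np.exp(-z))   # squashing node (sigmoid, for illustration only)
    loss = (p - y) ** 2            # squared-error loss
    return z, p, loss

w, b, x, y = 0.5, -0.2, 1.5, 1.0
z, p, loss = forward(w, b, x, y)

# backward pass: multiply local derivatives from the loss back to the parameter
dL_dp = 2.0 * (p - y)
dp_dz = p * (1.0 - p)
dz_dw = x
dL_dw = dL_dp * dp_dz * dz_dw

# central-difference estimate of the same derivative
eps = 1e-6
numeric = (forward(w + eps, b, x, y)[2] - forward(w - eps, b, x, y)[2]) / (2 * eps)
print(dL_dw, numeric)
```

The two printed numbers should agree to several decimal places, which is a quick way to sanity-check any hand-written `backward` step.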
%% Cell type:markdown id: tags:
### Implementation of Back Propagation

* Let's add a `backward` function to each of our nodes that will compute the derivative and propagate the error during the backward pass.
* We also need to implement parameter updates according to the procedure described above

We need to compute derivatives for each layer manually, for example for the linear layer $z = x\times W+b$:
$$\begin{align}
\frac{\partial z}{\partial W} &= x \\
\frac{\partial z}{\partial b} &= 1 \\
\end{align}$$
If we need to compensate for the error $\Delta z$ at the output of the layer, we need to update the weights accordingly:
$$\begin{align}
\Delta x &= \Delta z \times W \\
\Delta W &= \frac{\partial z}{\partial W} \Delta z = \Delta z \times x \\
\Delta b &= \frac{\partial z}{\partial b} \Delta z = \Delta z \\
\end{align}$$
**IMPORTANT:** Calculations are done not for each training sample independently, but rather for a whole **minibatch**. Required parameter updates $\Delta W$ and $\Delta b$ are computed across the whole minibatch, and the respective arrays have dimensions $x\in\mathbb{R}^{\mathrm{minibatch}\, \times\, \mathrm{nin}}$ and $\Delta z\in\mathbb{R}^{\mathrm{minibatch}\, \times\, \mathrm{nclass}}$.
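
To make the minibatch shapes concrete, here is a minimal NumPy sketch of the three formulas. The sizes, the random values and the `toy_loss` helper are illustrative assumptions; `W` is stored as `(nout, nin)` so that the forward pass reads `x @ W.T + b`.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, nin, nout = 4, 3, 2
x = rng.normal(size=(batch, nin))     # one row per sample in the minibatch
W = rng.normal(size=(nout, nin))      # weights, so z = x @ W.T + b
b = np.zeros((1, nout))
dz = rng.normal(size=(batch, nout))   # error arriving from the next layer

dx = dz @ W            # (batch, nin): error passed on to the previous layer
dW = dz.T @ x          # (nout, nin): contributions summed over the minibatch
db = dz.sum(axis=0)    # (nout,): contributions summed over the minibatch

# hypothetical toy loss whose gradient with respect to z is exactly dz
def toy_loss(W_):
    return ((x @ W_.T + b) * dz).sum()

# compare one entry of dW with a finite-difference estimate
eps = 1e-6
W_shift = W.copy()
W_shift[1, 2] += eps
numeric = (toy_loss(W_shift) - toy_loss(W)) / eps
print(dW[1, 2], numeric)
```

Note that `dW` has the same shape as `W` regardless of the minibatch size, because the matrix product sums the per-sample contributions.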
%% Cell type:code id: tags:
```
class Linear:
    def __init__(self,nin,nout):
        self.W = np.random.normal(0, 1.0/np.sqrt(nin), (nout, nin))
        self.b = np.zeros((1,nout))
        self.dW = np.zeros_like(self.W)
        self.db = np.zeros_like(self.b)

    def forward(self, x):
        # remember the input, it is needed to compute dW in the backward pass
        self.x=x
        return np.dot(x, self.W.T) + self.b

    def backward(self, dz):
        dx = np.dot(dz, self.W)
        dW = np.dot(dz.T, self.x)
        db = dz.sum(axis=0)
        self.dW = dW
        self.db = db
        return dx

    def update(self,lr):
        self.W -= lr*self.dW
        self.b -= lr*self.db
```
%% Cell type:markdown id: tags:
In the same manner we can define the `backward` function for the rest of our layers:
%% Cell type:code id: tags:
```
class Softmax:
    def forward(self,z):
        self.z = z
        zmax = z.max(axis=1,keepdims=True)
        expz = np.exp(z-zmax)
        Z = expz.sum(axis=1,keepdims=True)
        return expz / Z

    def backward(self,dp):
        p = self.forward(self.z)
        pdp = p * dp
        return pdp - p * pdp.sum(axis=1, keepdims=True)

class CrossEntropyLoss:
    def forward(self,p,y):
        self.p = p
        self.y = y
        p_of_y = p[np.arange(len(y)), y]
        log_prob = np.log(p_of_y)
        return -log_prob.mean()

    def backward(self,loss):
        dlog_softmax = np.zeros_like(self.p)
        dlog_softmax[np.arange(len(self.y)), self.y] -= 1.0/len(self.y)
        return dlog_softmax / self.p
```
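
A useful sanity check on these two `backward` steps: chained together, they should reduce to the well-known closed form $(p - y_{\mathrm{onehot}})/n$ for softmax followed by cross-entropy. The sketch below repeats the same arithmetic with plain arrays (random logits and hand-picked labels are assumptions made for the example) rather than calling the classes, so it can be run on its own.

```python
import numpy as np

rng = np.random.default_rng(1)
z = rng.normal(size=(5, 3))          # logits for 5 samples, 3 classes
y = np.array([0, 2, 1, 1, 0])        # integer class labels
n = len(y)

# softmax, computed in the numerically stable way
expz = np.exp(z - z.max(axis=1, keepdims=True))
p = expz / expz.sum(axis=1, keepdims=True)

onehot = np.zeros_like(p)
onehot[np.arange(n), y] = 1.0

# two-step route: cross-entropy gradient w.r.t. p, then softmax gradient w.r.t. z
dp = -onehot / n / p
pdp = p * dp
dz_two_step = pdp - p * pdp.sum(axis=1, keepdims=True)

# closed form for softmax followed by cross-entropy
dz_closed = (p - onehot) / n

print(np.allclose(dz_two_step, dz_closed))
```

Each row of the resulting gradient sums to zero, which reflects the fact that softmax outputs always sum to one.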
%% Cell type:markdown id: tags:
## Training the Model

Now we are ready to write the **training loop**, which will go through our dataset and perform the optimization minibatch by minibatch. One complete pass through the dataset is often called **an epoch**:
%% Cell type:code id: tags:
```
lin = Linear(2,2)
softmax = Softmax()
cross_ent_loss = CrossEntropyLoss()

learning_rate = 0.1

pred = np.argmax(lin.forward(train_x),axis=1)
acc = (pred==train_labels).mean()
print("Initial accuracy: ",acc)

batch_size=4
for i in range(0,len(train_x),batch_size):
    xb = train_x[i:i+batch_size]
    yb = train_labels[i:i+batch_size]

    # forward pass
    z = lin.forward(xb)
    p = softmax.forward(z)
    loss = cross_ent_loss.forward(p,yb)

    # backward pass
    dp = cross_ent_loss.backward(loss)
    dz = softmax.backward(dp)
    dx = lin.backward(dz)
    lin.update(learning_rate)

pred = np.argmax(lin.forward(train_x),axis=1)
acc = (pred==train_labels).mean()
print("Final accuracy: ",acc)
```
%% Output %% Output
Initial accuracy: 0.725
Final accuracy: 0.825
%% Cell type:markdown id: tags:
It is nice to see the accuracy of the model increase from about 73% to around 83% in one epoch (the exact starting value depends on the random weight initialization).

## Network Class

Since in many cases a neural network is just a composition of layers, we can build a class that will allow us to stack layers together and make forward and backward passes through them without explicitly programming that logic. We will store the list of layers inside the `Net` class, and use the `add()` function to add new layers:
%% Cell type:code id: tags:
```
class Net:
    def __init__(self):
        self.layers = []

    def add(self,l):
        self.layers.append(l)

    def forward(self,x):
        for l in self.layers:
            x = l.forward(x)
        return x

    def backward(self,z):
        # walk the layers in reverse order, passing the error backwards
        for l in self.layers[::-1]:
            z = l.backward(z)
        return z

    def update(self,lr):
        # only layers with trainable parameters define update()
        for l in self.layers:
            if hasattr(l, 'update'):
                l.update(lr)
```
%% Cell type:markdown id: tags:
With this `Net` class our model definition and training become neater:
%% Cell type:code id: tags:
```
net = Net()
net.add(Linear(2,2))
net.add(Softmax())
loss = CrossEntropyLoss()

def get_loss_acc(x,y,loss=CrossEntropyLoss()):
    p = net.forward(x)
    l = loss.forward(p,y)
    pred = np.argmax(p,axis=1)
    acc = (pred==y).mean()
    return l,acc

print("Initial loss={}, accuracy={}: ".format(*get_loss_acc(train_x,train_labels)))

def train_epoch(net, train_x, train_labels, loss=CrossEntropyLoss(), batch_size=4, lr=0.1):
    for i in range(0,len(train_x),batch_size):
        xb = train_x[i:i+batch_size]
        yb = train_labels[i:i+batch_size]

        p = net.forward(xb)
        l = loss.forward(p,yb)
        dp = loss.backward(l)
        dx = net.backward(dp)
        net.update(lr)

train_epoch(net,train_x,train_labels)

print("Final loss={}, accuracy={}: ".format(*get_loss_acc(train_x,train_labels)))
print("Test loss={}, accuracy={}: ".format(*get_loss_acc(test_x,test_labels)))
```
%% Output %% Output
Initial loss=0.6212072429381601, accuracy=0.6875:
Final loss=0.44369925927417986, accuracy=0.8:
Test loss=0.4767711377257787, accuracy=0.85:
%% Cell type:markdown id: tags:
## Plotting the Training Process

It would be nice to see visually how the network is being trained! We will define a `train_and_plot` function for that. To visualize the state of the network we will use a level map, i.e. we will represent different values of the network output using different colors.

> Do not worry if you do not understand some of the plotting code below - it is more important to understand the underlying neural network concepts.
%% Cell type:code id: tags:
```
def train_and_plot(n_epoch, net, loss=CrossEntropyLoss(), batch_size=4, lr=0.1):
    fig, ax = plt.subplots(2, 1)
    ax[0].set_xlim(0, n_epoch + 1)
    ax[0].set_ylim(0,1)

    # one row per epoch: [epoch, loss, accuracy]; NaN marks epochs not yet run
    train_acc = np.empty((n_epoch, 3))
    train_acc[:] = np.nan
    valid_acc = np.empty((n_epoch, 3))
    valid_acc[:] = np.nan

    for epoch in range(1, n_epoch + 1):

        train_epoch(net,train_x,train_labels,loss,batch_size,lr)
        tloss, taccuracy = get_loss_acc(train_x,train_labels,loss)
        train_acc[epoch-1, :] = [epoch, tloss, taccuracy]
        vloss, vaccuracy = get_loss_acc(test_x,test_labels,loss)
        valid_acc[epoch-1, :] = [epoch, vloss, vaccuracy]

        ax[0].set_ylim(0, max(max(train_acc[:, 2]), max(valid_acc[:, 2])) * 1.1)

        plot_training_progress(train_acc[:, 0], (train_acc[:, 2],
                                                 valid_acc[:, 2]), fig, ax[0])
        plot_decision_boundary(net, fig, ax[1])
        fig.canvas.draw()
        fig.canvas.flush_events()

    return train_acc, valid_acc
```
%% Cell type:code id: tags:
```
import matplotlib.cm as cm

def plot_decision_boundary(net, fig, ax):
    draw_colorbar = True
    # remove previous plot; Artist.remove() is used because recent Matplotlib
    # releases no longer allow modifying ax.collections in place
    for c in list(ax.collections):
        c.remove()
        draw_colorbar = False

    # generate contour grid
    x_min, x_max = train_x[:, 0].min() - 1, train_x[:, 0].max() + 1
    y_min, y_max = train_x[:, 1].min() - 1, train_x[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1),
                         np.arange(y_min, y_max, 0.1))
    grid_points = np.c_[xx.ravel().astype('float32'), yy.ravel().astype('float32')]
    n_classes = max(train_labels)+1
    while train_x.shape[1] > grid_points.shape[1]:
        # pad dimensions (plot only the first two)
        grid_points = np.c_[grid_points,
                            np.empty(len(xx.ravel())).astype('float32')]
        grid_points[:, -1].fill(train_x[:, grid_points.shape[1]-1].mean())

    # evaluate predictions
    prediction = np.array(net.forward(grid_points))
    # for two classes: prediction difference
    if (n_classes == 2):
        Z = np.array([0.5+(p[0]-p[1])/2.0 for p in prediction]).reshape(xx.shape)
    else:
        Z = np.array([p.argsort()[-1]/float(n_classes-1) for p in prediction]).reshape(xx.shape)

    # draw contour
    levels = np.linspace(0, 1, 40)
    cs = ax.contourf(xx, yy, Z, alpha=0.4, levels = levels)
    if draw_colorbar:
        fig.colorbar(cs, ax=ax, ticks = [0, 0.5, 1])
    c_map = [cm.jet(x) for x in np.linspace(0.0, 1.0, n_classes) ]
    colors = [c_map[l] for l in train_labels]
    ax.scatter(train_x[:, 0], train_x[:, 1], marker='o', c=colors, s=60, alpha = 0.5)
```
%% Cell type:code id: tags:
```
def plot_training_progress(x, y_data, fig, ax):
    styles = ['k--', 'g-']
    # remove previous plot (ax.lines cannot be modified in place in recent Matplotlib)
    for line in list(ax.lines):
        line.remove()
    # draw updated lines
    for i in range(len(y_data)):
        ax.plot(x, y_data[i], styles[i])
    ax.legend(list(ax.lines), ['training accuracy', 'validation accuracy'],
              loc='upper center', ncol = 2)
```
%% Cell type:code id: tags:
```
%matplotlib nbagg
net = Net()
net.add(Linear(2,2))
net.add(Softmax())

res = train_and_plot(30,net,lr=0.005)
```
%% Output %% Output
%% Cell type:markdown id: tags:
After running the cell above you should be able to see interactively how the boundary between classes changes during training. Note that we have chosen a very small learning rate so that we can see how the process happens.

## Multi-Layered Models

The network above has been constructed from several layers, but we still had only one `Linear` layer, which does the actual classification. What happens if we decide to add several such layers?

Surprisingly, our code will work! A very important thing to note, however, is that in between linear layers we need to have a non-linear **activation function**, such as `tanh`. Without such non-linearity, several linear layers would have the same expressive power as just one layer, because a composition of linear functions is also linear!
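
The claim about composing linear layers can be checked numerically. In the sketch below (random weights and layer sizes are arbitrary assumptions), two stacked linear layers are collapsed into a single equivalent one, and inserting `tanh` between them breaks that equivalence.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=(6, 2))
W1, b1 = rng.normal(size=(10, 2)), rng.normal(size=(1, 10))
W2, b2 = rng.normal(size=(2, 10)), rng.normal(size=(1, 2))

# two linear layers applied one after another, with nothing in between
two_layers = (x @ W1.T + b1) @ W2.T + b2

# exactly the same mapping written as one linear layer
W = W2 @ W1             # shape (2, 2)
b = b1 @ W2.T + b2      # shape (1, 2)
one_layer = x @ W.T + b

# with a non-linearity in between, no single (W, b) reproduces the output in general
with_tanh = np.tanh(x @ W1.T + b1) @ W2.T + b2

print(np.allclose(two_layers, one_layer))   # the purely linear stack collapses
print(np.allclose(with_tanh, one_layer))    # the tanh version does not
```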
%% Cell type:code id: tags:
```
class Tanh:
    def forward(self,x):
        y = np.tanh(x)
        self.y = y
        return y

    def backward(self,dy):
        # d tanh(x)/dx = 1 - tanh(x)^2
        return (1.0-self.y**2)*dy
```
%% Cell type:markdown id: tags:
Adding several layers makes sense, because unlike a one-layer network, a multi-layered model will be able to accurately classify sets that are not linearly separable. I.e., a model with several layers will be **richer**.

> It can be demonstrated that with a sufficient number of neurons a two-layered model is capable of classifying any convex set of data points, and a three-layered network can classify virtually any set.

Mathematically, a multi-layered perceptron would be represented by a more complex function $f_\theta$ that can be computed in several steps:

* $z_1 = W_1\times x+b_1$
* $z_2 = W_2\times\alpha(z_1)+b_2$
* $f = \sigma(z_2)$

Here, $\alpha$ is a **non-linear activation function**, $\sigma$ is a softmax function, and $\theta=\langle W_1,b_1,W_2,b_2\rangle$ are parameters.

The gradient descent algorithm would remain the same, but it would be more difficult to calculate gradients. Given the chain differentiation rule, we can calculate derivatives as:
$$\begin{align}
\frac{\partial\mathcal{L}}{\partial W_2} &= \color{red}{\frac{\partial\mathcal{L}}{\partial\sigma}\frac{\partial\sigma}{\partial z_2}}\color{black}{\frac{\partial z_2}{\partial W_2}} \\
\frac{\partial\mathcal{L}}{\partial W_1} &= \color{red}{\frac{\partial\mathcal{L}}{\partial\sigma}\frac{\partial\sigma}{\partial z_2}}\color{black}{\frac{\partial z_2}{\partial\alpha}\frac{\partial\alpha}{\partial z_1}\frac{\partial z_1}{\partial W_1}}
\end{align}
$$
Note that the beginning of all those expressions is still the same, and thus we can continue back propagation beyond one linear layer to adjust further weights up the computational graph.
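
To see the shared factor being reused, here is a standalone sketch of one backward pass through a two-layer model with `tanh` as $\alpha$. The data, layer sizes and the `loss_fn` helper are illustrative assumptions, and the shared (red) factor is written using the standard closed form for softmax followed by cross-entropy, $(p - y_{\mathrm{onehot}})/n$. One entry of the first-layer gradient is compared with a finite-difference estimate.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=(4, 2))
y = np.array([0, 1, 1, 0])
W1, b1 = rng.normal(size=(3, 2)), np.zeros((1, 3))
W2, b2 = rng.normal(size=(2, 3)), np.zeros((1, 2))

def loss_fn(W1_, W2_):
    a = np.tanh(x @ W1_.T + b1)                      # alpha(z_1)
    z2 = a @ W2_.T + b2
    e = np.exp(z2 - z2.max(axis=1, keepdims=True))
    p = e / e.sum(axis=1, keepdims=True)             # sigma(z_2)
    return -np.log(p[np.arange(len(y)), y]).mean(), a, p

loss, a, p = loss_fn(W1, W2)

dz2 = (p - np.eye(2)[y]) / len(y)   # shared factor dL/dz_2, computed once
dW2 = dz2.T @ a                     # gradient for the second layer
da = dz2 @ W2                       # keep propagating: dz_2/d alpha
dz1 = (1.0 - a ** 2) * da           # d alpha/dz_1 for tanh
dW1 = dz1.T @ x                     # gradient for the first layer

# finite-difference check of a single entry of dW1
eps = 1e-6
W1_shift = W1.copy()
W1_shift[0, 1] += eps
numeric = (loss_fn(W1_shift, W2)[0] - loss) / eps
print(dW1[0, 1], numeric)
```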

Let's now experiment with a two-layered network:
%% Cell type:code id: tags:
```
net = Net()
net.add(Linear(2,10))
net.add(Tanh())
net.add(Linear(10,2))
net.add(Softmax())
loss = CrossEntropyLoss()
```
%% Cell type:code id: tags:
```
res = train_and_plot(30,net,lr=0.01)
```
%% Output %% Output
%% Cell type:markdown id: tags:
## Why Not Always Use Multi-Layered Model?

We have seen that a multi-layered model is more *powerful* and *expressive* than a one-layered one. You may be wondering why we don't always use a many-layered model. The answer to this question is **overfitting**.

We will deal with this term more in later sections, but the idea is the following: **the more powerful the model is, the better it can approximate training data, and the more data it needs to properly generalize** to new data it has not seen before.

**A linear model:**

* We are likely to get high training loss - so-called **underfitting**, when the model does not have enough power to correctly separate all data.
* Validation loss and training loss are more or less the same. The model is likely to generalize well to test data.

**Complex multi-layered model:**

* Low training loss - the model can approximate training data well, because it has enough expressive power.
* Validation loss can be much higher than training loss and can start to increase during training - this is because the model "memorizes" training points, and loses the "overall picture"

![Overfitting](images/overfit.png)

> On this picture, `x` stands for training data, `o` - validation data. Left - linear model (one-layer), it approximates the nature of the data pretty well. Right - overfitted model, the model approximates training data perfectly well, but stops making sense with any other data (validation error is very high)
%% Cell type:markdown id: tags:
## Takeaways

* Simple models (fewer layers, fewer neurons) with a low number of parameters ("low capacity") are less likely to overfit
* More complex models (more layers, more neurons on each layer, high capacity) are likely to overfit. We need to monitor validation error to make sure it does not start to rise with further training
* More complex models need more data to train on.
* You can solve the overfitting problem by either:
    - simplifying your model
    - increasing the amount of training data
* **Bias-variance trade-off** is a term that shows that you need to find a compromise
    - between power of the model and amount of data,
    - between overfitting and underfitting
* There is no single recipe for how many layers or parameters you need - the best way is to experiment
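
One simple way to act on the advice about monitoring validation error is to stop once it has not improved for a few epochs. The sketch below is only an illustration: `best_epoch`, the `patience` parameter and the list of losses are hypothetical and not part of the framework built in this notebook.

```python
def best_epoch(valid_losses, patience=3):
    """Return the index of the epoch with the lowest validation loss,
    scanning until the loss has failed to improve for `patience` epochs."""
    best, best_idx = float("inf"), -1
    for i, v in enumerate(valid_losses):
        if v < best:
            best, best_idx = v, i
        elif i - best_idx >= patience:
            break   # validation loss has been rising for too long
    return best_idx

# made-up validation losses that first fall and then rise (overfitting)
losses = [0.9, 0.7, 0.6, 0.58, 0.61, 0.65, 0.7, 0.8]
print(best_epoch(losses))
```

In practice the per-epoch validation losses could be taken from the arrays returned by `train_and_plot`, and the model weights from the best epoch would be kept.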
%% Cell type:markdown id: tags:
## Credits

This notebook is a part of [AI for Beginners Curricula](http://github.com/microsoft/ai-for-beginners), and has been prepared by [Dmitry Soshnikov](http://soshnikov.com). It is inspired by Neural Network Workshop at Microsoft Research Cambridge. Some code and illustrative materials are taken from presentations by [Katja Hoffmann](https://www.microsoft.com/en-us/research/people/kahofman/), [Matthew Johnson](https://www.microsoft.com/en-us/research/people/matjoh/) and [Ryoto Tomioka](https://www.microsoft.com/en-us/research/people/ryoto/), and from [NeuroWorkshop](http://github.com/shwars/NeuroWorkshop) repository.
......
...@@ -17,7 +17,7 @@ Let's start with formalizing the Machine Learning problem. Suppose we have a tra ...@@ -17,7 +17,7 @@ Let's start with formalizing the Machine Learning problem. Suppose we have a tra
* For a regression problem, when we need to predict a number, we can use **absolute error** &sum;<sub>i</sub>|f(x<sup>(i)</sup>)-y<sup>(i)</sup>|, or **squared error** &sum;<sub>i</sub>(f(x<sup>(i)</sup>)-y<sup>(i)</sup>)<sup>2</sup>
* For classification, we use **0-1 loss** (which is essentially the error rate, i.e. one minus the **accuracy** of the model), or **logistic loss**.

For one-level perceptron, function *f* was defined as a linear function *f(x)=wx+b* (here *w* is the weight matrix, *x* is the vector of input features, and *b* is the bias vector). For different neural network architectures, this function can take a more complex form.

> In the case of classification, it is often desirable to get probabilities of corresponding classes as network output. To convert arbitrary numbers to probabilities (e.g. to normalize the output), we often use the **softmax** function &sigma;, and the function *f* becomes *f(x)=&sigma;(wx+b)*
...@@ -56,6 +56,7 @@ Note that the left-most part of all those expressions is the same, and thus we c ...@@ -56,6 +56,7 @@ Note that the left-most part of all those expressions is the same, and thus we c
<img src="images/ComputeGraphGrad.PNG" width="400px" align="right"/> <img src="images/ComputeGraphGrad.PNG" width="400px" align="right"/>
We will cover back prop in much more detail in our notebook example.

## [Proceed to Notebook](OwnFramework.ipynb)

In the accompanying notebook, we will implement our own framework for building and training multi-layered perceptrons. You will be able to see in detail how modern neural networks operate. Proceed to [OwnFramework](OwnFramework.ipynb) notebook.
......
...@@ -6,7 +6,6 @@ Lab Assignment from [AI for Beginners Curriculum](https://github.com/microsoft/a ...@@ -6,7 +6,6 @@ Lab Assignment from [AI for Beginners Curriculum](https://github.com/microsoft/a
Solve the MNIST handwritten digit classification problem using 1-, 2- and 3-layered perceptron. Use the neural network framework we have developed in the lesson.

## Starting Notebook

Start the lab by opening [MyFW_MNIST.ipynb](MyFW_MNIST.ipynb)
......