# Introduction to Neural Networks. Multi-Layered Perceptron
In the previous section, we learned about the simplest neural network model, the one-layered perceptron, which is a linear two-class classification model.
In this section we will extend this model into a more flexible framework, allowing us to:
* perform **multi-class classification** in addition to two-class
* solve **regression problems** in addition to classification
* separate classes that are not linearly separable
We will also develop our own modular framework in Python that will allow us to construct different neural network architectures.
## Formalization of Machine Learning
Let's start by formalizing the Machine Learning problem. Suppose we have a training dataset **X** with labels **Y**, and we need to build a model *f* that makes the most accurate predictions. The quality of the predictions is measured by a **loss function** ℒ. The following loss functions are often used:
* For regression problems, when we need to predict a number, we can use the **absolute error** &sum;<sub>i</sub>|f(x<sup>(i)</sup>)-y<sup>(i)</sup>|, or the **squared error** &sum;<sub>i</sub>(f(x<sup>(i)</sup>)-y<sup>(i)</sup>)<sup>2</sup>
* For classification, we use the **0-1 loss** (which is essentially the same as the **accuracy** of the model), or the **logistic loss**.
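
Here is a minimal NumPy sketch of these losses (the function names are ours for illustration only; the notebook may define them differently):

```python
import numpy as np

def absolute_error(y_pred, y_true):
    """Sum of absolute differences (L1 loss) for regression."""
    return np.sum(np.abs(y_pred - y_true))

def squared_error(y_pred, y_true):
    """Sum of squared differences (L2 loss) for regression."""
    return np.sum((y_pred - y_true) ** 2)

def zero_one_loss(y_pred, y_true):
    """Number of misclassified samples, i.e. N * (1 - accuracy)."""
    return np.sum(y_pred != y_true)

def logistic_loss(p, y_true):
    """Logistic (cross-entropy) loss for binary labels y in {0, 1},
    where p is the predicted probability of class 1."""
    eps = 1e-12                      # avoid log(0)
    p = np.clip(p, eps, 1 - eps)
    return -np.sum(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
```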
For the one-layered perceptron, the function *f* was defined as a linear function *f(x)=wx+b* (here *w* is the weight matrix, *x* is the vector of input features, and *b* is the bias vector). For different neural network architectures, this function can take a more complex form.
> In the case of classification, it is often desirable to get probabilities of the corresponding classes as the network output. To convert arbitrary numbers to probabilities (i.e. to normalize the output), we often use the **softmax** function &sigma;, so that the function *f* becomes *f=&sigma;(wx+b)*.
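
For example, a numerically stable softmax can be sketched in NumPy as follows:

```python
import numpy as np

def softmax(z):
    """Convert a vector of arbitrary scores z into probabilities that sum to 1."""
    z = z - np.max(z)        # subtract the maximum for numerical stability
    e = np.exp(z)
    return e / np.sum(e)

# Example: raw network outputs for three classes
print(softmax(np.array([2.0, 1.0, 0.1])))   # ~[0.66, 0.24, 0.10]
```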
In the definition of *f* above, *w* and *b* are called **parameters** &theta;=&lt;*w,b*&gt;. Given the dataset &lt;**X**,**Y**&gt;, we can compute the overall error on the whole dataset as a function of the parameters &theta;.
**The goal of neural network training is to minimize the error by varying parameters &theta;**
## Gradient Descent Optimization
There is a well-known method of function optimization called **gradient descent**. The idea is that we can compute the derivative (in the multi-dimensional case called the **gradient**) of the loss function with respect to the parameters, and adjust the parameters so that the error decreases. This can be formalized as follows:
* Initialize parameters by some random values w<sup>(0)</sup>, b<sup>(0)</sup>
* Repeat the following step many times:
- w<sup>(i+1)</sup> = w<sup>(i)</sup>-&eta;&part;&lagran;/&part;w
- b<sup>(i+1)</sup> = b<sup>(i)</sup>-&eta;&part;&lagran;/&part;b
During training, the optimization steps are supposed to be computed over the whole dataset (remember that the loss is a sum over all training samples). However, in practice we take small portions of the dataset called **minibatches**, and compute the gradients on a subset of the data. Because the subset is taken randomly each time, this method is called **stochastic gradient descent** (SGD).
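
A schematic minibatch SGD loop following these update rules might look as shown below (the `loss_grad` helper is hypothetical and stands for whatever computes &part;&lagran;/&part;w and &part;&lagran;/&part;b on a minibatch):

```python
import numpy as np

def sgd(X, Y, w, b, loss_grad, eta=0.1, n_epochs=10, batch_size=16):
    """Minimal stochastic gradient descent sketch.
    loss_grad(xb, yb, w, b) is assumed to return (dL/dw, dL/db) for a minibatch."""
    n = len(X)
    for epoch in range(n_epochs):
        perm = np.random.permutation(n)           # shuffle the dataset each epoch
        for i in range(0, n, batch_size):
            idx = perm[i:i + batch_size]          # indices of the current minibatch
            dw, db = loss_grad(X[idx], Y[idx], w, b)
            w -= eta * dw                         # w(i+1) = w(i) - eta * dL/dw
            b -= eta * db                         # b(i+1) = b(i) - eta * dL/db
    return w, b
```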
## Multi-Layered Perceptrons and Back Propagation
A one-layer network, as we have seen above, is only capable of classifying linearly separable classes. To build a richer model, we can combine several layers of the network. Mathematically, this just means that the function *f* takes a more complex form, such as *f(x) = &sigma;(w<sub>1</sub>&alpha;(w<sub>2</sub>x+b<sub>2</sub>)+b<sub>1</sub>)*, where &alpha; is a **non-linear activation function**, and &theta;=&lt;*w<sub>1</sub>,b<sub>1</sub>,w<sub>2</sub>,b<sub>2</sub>*&gt; are the parameters.
The gradient descent algorithm remains the same, but the gradients become more difficult to calculate. Using the chain rule of differentiation, we can calculate the derivatives as:
* &part;&lagran;/&part;w<sub>1</sub> = (&part;&lagran;/&part;&sigma;)(&part;&sigma;/&part;w<sub>1</sub>)
* &part;&lagran;/&part;w<sub>2</sub> = (&part;&lagran;/&part;&sigma;)(&part;&sigma;/&part;&alpha;)(&part;&alpha;/&part;w<sub>2</sub>)
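
To see how these chain-rule factors translate into code, here is a simplified sketch of one forward and backward pass for such a two-layer network, assuming a sigmoid activation for &alpha; and a squared-error loss (the actual framework built in the notebook is more modular than this):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_step(x, y, w1, b1, w2, b2, eta=0.1):
    """One gradient step for f(x) = sigmoid(w1 @ alpha(w2 @ x + b2) + b1)
    with alpha = sigmoid and squared-error loss (illustrative sketch)."""
    # Forward pass: keep the intermediate values needed for the backward pass
    z2 = w2 @ x + b2
    a  = sigmoid(z2)                    # alpha(w2 x + b2)
    z1 = w1 @ a + b1
    out = sigmoid(z1)                   # f(x)

    # Backward pass: apply the chain rule layer by layer
    d_out = 2 * (out - y)               # dL/d(out) for squared error
    d_z1  = d_out * out * (1 - out)     # dL/dz1 via sigmoid'
    d_w1  = np.outer(d_z1, a)           # dL/dw1
    d_b1  = d_z1
    d_a   = w1.T @ d_z1                 # propagate the error to the hidden layer
    d_z2  = d_a * a * (1 - a)           # dL/dz2 via alpha'
    d_w2  = np.outer(d_z2, x)           # dL/dw2
    d_b2  = d_z2

    # Gradient descent update
    w1 -= eta * d_w1; b1 -= eta * d_b1
    w2 -= eta * d_w2; b2 -= eta * d_b2
    return w1, b1, w2, b2
```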
## [Proceed to Notebook](OwnFramework.ipynb)
To see how we can use a perceptron to solve both toy and real-life problems, and to continue learning, go to the [OwnFramework](OwnFramework.ipynb) notebook.