# Introduction to Neural Networks. Multi-Layered Perceptron
In the previous section, we learned about the simplest neural network model, the one-layered perceptron, which is a linear two-class classification model.
In this section we will extend this model into a more flexible framework, allowing us to:
* perform **multi-class classification** in addition to two-class
* solve **regression problems** in addition to classification
* separate classes that are not linearly separable
We will also develop our own modular framework in Python that will allow us to construct different neural network architectures.
## Formalization of Machine Learning
Let's start by formalizing the Machine Learning problem. Suppose we have a training dataset **X** with labels **Y**, and we need to build a model *f* that makes the most accurate predictions. The quality of the predictions is measured by a **loss function** ℒ. The following loss functions are often used:
* For regression problems, when we need to predict a number, we can use the **absolute error** &sum;<sub>i</sub>|f(x<sup>(i)</sup>)-y<sup>(i)</sup>|, or the **squared error** &sum;<sub>i</sub>(f(x<sup>(i)</sup>)-y<sup>(i)</sup>)<sup>2</sup>
* For classification, we use the **0-1 loss** (which is essentially the same as the **accuracy** of the model), or the **logistic loss**.
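
Here is a minimal NumPy sketch of these losses (the function names are ours for illustration only; the notebook may define them differently):

```python
import numpy as np

def absolute_error(y_pred, y_true):
    """Sum of absolute differences (L1 loss) for regression."""
    return np.sum(np.abs(y_pred - y_true))

def squared_error(y_pred, y_true):
    """Sum of squared differences (L2 loss) for regression."""
    return np.sum((y_pred - y_true) ** 2)

def zero_one_loss(y_pred, y_true):
    """Number of misclassified samples, i.e. N * (1 - accuracy)."""
    return np.sum(y_pred != y_true)

def logistic_loss(p, y_true):
    """Logistic (cross-entropy) loss for binary labels y in {0, 1},
    where p is the predicted probability of class 1."""
    eps = 1e-12                      # avoid log(0)
    p = np.clip(p, eps, 1 - eps)
    return -np.sum(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
```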
For the one-layered perceptron, the function *f* was defined as a linear function *f(x)=wx+b* (here *w* is the weight matrix, *x* is the vector of input features, and *b* is the bias vector). For different neural network architectures, this function can take a more complex form.
> In the case of classification, it is often desirable to get probabilities of the corresponding classes as the network output. To convert arbitrary numbers to probabilities (i.e. to normalize the output), we often use the **softmax** function &sigma;, so that the function *f* becomes *f=&sigma;(wx+b)*.
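
For example, a numerically stable softmax can be sketched in NumPy as follows:

```python
import numpy as np

def softmax(z):
    """Convert a vector of arbitrary scores z into probabilities that sum to 1."""
    z = z - np.max(z)        # subtract the maximum for numerical stability
    e = np.exp(z)
    return e / np.sum(e)

# Example: raw network outputs for three classes
print(softmax(np.array([2.0, 1.0, 0.1])))   # ~[0.66, 0.24, 0.10]
```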
In the definition of *f* above, *w* and *b* are called **parameters** &theta;=&lt;*w,b*&gt;. Given the dataset &lt;**X**,**Y**&gt;, we can compute the overall error on the whole dataset as a function of the parameters &theta;.
**The goal of neural network training is to minimize the error by varying parameters &theta;**
## Gradient Descent Optimization
There is a well-known method of function optimization called **gradient descent**. The idea is that we can compute the derivative (in the multi-dimensional case called the **gradient**) of the loss function with respect to the parameters, and adjust the parameters so that the error decreases. This can be formalized as follows:
* Initialize parameters by some random values w<sup>(0)</sup>, b<sup>(0)</sup>
* Repeat the following step many times:
- w<sup>(i+1)</sup> = w<sup>(i)</sup>-&eta;&part;&lagran;/&part;w
- b<sup>(i+1)</sup> = b<sup>(i)</sup>-&eta;&part;&lagran;/&part;b
During training, the optimization steps are supposed to be computed over the whole dataset (remember that the loss is a sum over all training samples). However, in practice we take small portions of the dataset called **minibatches**, and compute the gradients on a subset of the data. Because the subset is taken randomly each time, this method is called **stochastic gradient descent** (SGD).
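
A schematic minibatch SGD loop following these update rules might look as shown below (the `loss_grad` helper is hypothetical and stands for whatever computes &part;&lagran;/&part;w and &part;&lagran;/&part;b on a minibatch):

```python
import numpy as np

def sgd(X, Y, w, b, loss_grad, eta=0.1, n_epochs=10, batch_size=16):
    """Minimal stochastic gradient descent sketch.
    loss_grad(xb, yb, w, b) is assumed to return (dL/dw, dL/db) for a minibatch."""
    n = len(X)
    for epoch in range(n_epochs):
        perm = np.random.permutation(n)           # shuffle the dataset each epoch
        for i in range(0, n, batch_size):
            idx = perm[i:i + batch_size]          # indices of the current minibatch
            dw, db = loss_grad(X[idx], Y[idx], w, b)
            w -= eta * dw                         # w(i+1) = w(i) - eta * dL/dw
            b -= eta * db                         # b(i+1) = b(i) - eta * dL/db
    return w, b
```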
## Multi-Layered Perceptrons and Back Propagation
A one-layer network, as we have seen above, is only capable of classifying linearly separable classes. To build a richer model, we can combine several layers of the network. Mathematically, this just means that the function *f* takes a more complex form, such as *f(x) = &sigma;(w<sub>1</sub>&alpha;(w<sub>2</sub>x+b<sub>2</sub>)+b<sub>1</sub>)*, where &alpha; is a **non-linear activation function**, and &theta;=&lt;*w<sub>1</sub>,b<sub>1</sub>,w<sub>2</sub>,b<sub>2</sub>*&gt; are the parameters.
The gradient descent algorithm remains the same, but the gradients become more difficult to calculate. Using the chain rule of differentiation, we can calculate the derivatives as:
* &part;&lagran;/&part;w<sub>1</sub> = (&part;&lagran;/&part;&sigma;)(&part;&sigma;/&part;w<sub>1</sub>)
* &part;&lagran;/&part;w<sub>2</sub> = (&part;&lagran;/&part;&sigma;)(&part;&sigma;/&part;&alpha;)(&part;&alpha;/&part;w<sub>2</sub>)
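
To see how these chain-rule factors translate into code, here is a simplified sketch of one forward and backward pass for such a two-layer network, assuming a sigmoid activation for &alpha; and a squared-error loss (the actual framework built in the notebook is more modular than this):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_step(x, y, w1, b1, w2, b2, eta=0.1):
    """One gradient step for f(x) = sigmoid(w1 @ alpha(w2 @ x + b2) + b1)
    with alpha = sigmoid and squared-error loss (illustrative sketch)."""
    # Forward pass: keep the intermediate values needed for the backward pass
    z2 = w2 @ x + b2
    a  = sigmoid(z2)                    # alpha(w2 x + b2)
    z1 = w1 @ a + b1
    out = sigmoid(z1)                   # f(x)

    # Backward pass: apply the chain rule layer by layer
    d_out = 2 * (out - y)               # dL/d(out) for squared error
    d_z1  = d_out * out * (1 - out)     # dL/dz1 via sigmoid'
    d_w1  = np.outer(d_z1, a)           # dL/dw1
    d_b1  = d_z1
    d_a   = w1.T @ d_z1                 # propagate the error to the hidden layer
    d_z2  = d_a * a * (1 - a)           # dL/dz2 via alpha'
    d_w2  = np.outer(d_z2, x)           # dL/dw2
    d_b2  = d_z2

    # Gradient descent update
    w1 -= eta * d_w1; b1 -= eta * d_b1
    w2 -= eta * d_w2; b2 -= eta * d_b2
    return w1, b1, w2, b2
```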
## [Proceed to Notebook](OwnFramework.ipynb)
To see how we can use a perceptron to solve both toy and real-life problems, and to continue learning, go to the [OwnFramework](OwnFramework.ipynb) notebook.