diff --git a/.gitignore b/.gitignore
index 08286b3179176d1f7a06a4ab093054fa49bb2283..55d493e77fc6a036d84d9208426def3193be20d9 100644
--- a/.gitignore
+++ b/.gitignore
@@ -14,6 +14,7 @@
 *.userprefs
 
 .ipynb_checkpoints/
+data/
 
 # Mono auto generated files
 mono_crash.*
diff --git a/3-NeuralNetworks/05-Frameworks/Overfitting.md b/3-NeuralNetworks/05-Frameworks/Overfitting.md
index 58eafcb7cc42e5958c71d11b32f472c25f9f1cee..515036f9c6df05743877fea48bd46554a8ab557e 100644
--- a/3-NeuralNetworks/05-Frameworks/Overfitting.md
+++ b/3-NeuralNetworks/05-Frameworks/Overfitting.md
@@ -25,7 +25,7 @@ Thus it is very important to strike a correct balance between richness of the mo
 
 As you can see from the graph above, overfitting can be detected by very low training error, and high validation error. Normally during training we will see both training and validation errors starting to decrease, and then at some point validaton error might stop decreasing and start rising. This will be a sign of overfitting, and the indicator that we should probably stop training at this point (or at least make a snapshot of the model).
 
-<img src="../Overfitting.png" width="90%"/>
+<img src="../images/Overfitting.png" width="90%"/>
 
 ## How to prevent overfitting
 
diff --git a/3-NeuralNetworks/05-Frameworks/README.md b/3-NeuralNetworks/05-Frameworks/README.md
index 0fb50286137515fee0183f4e649001dc81b46838..df3e6a940370b23b9bc7122bbc0d897e1bb9f740 100644
--- a/3-NeuralNetworks/05-Frameworks/README.md
+++ b/3-NeuralNetworks/05-Frameworks/README.md
@@ -35,3 +35,4 @@ Low-Level API | [TensorFlow+Keras Notebook](IntroKerasTF.ipynb) | [PyTorch](Intr
 --------------|-------------------------------------|--------------------------------
 High-level API| [Keras](IntroKeras.ipynb) | *PyTorch Lightning*
 
+After mastering the frameworks, let's recap the notion of [overfitting](Overfitting.md).
\ No newline at end of file
diff --git a/3-NeuralNetworks/README.md b/3-NeuralNetworks/README.md
index 75fb6552febc9d051dda301ab8f9a726e92fcdc3..db82190bdd60e2cd3cf1ec6beea5898d44c79210 100644
--- a/3-NeuralNetworks/README.md
+++ b/3-NeuralNetworks/README.md
@@ -39,6 +39,6 @@ where f is some non-linear **activation function**.
 
 In this section we will learn about:
 * [Perceptron](03-Perceptron/README.md), one of the earliest neural network models for two-class classification
-* [Modern multi-layered networks](04-OwnFramework/README.md) and [how to build our own framework](04-OwnFramework/OwnFramework.ipynb)
+* [Multi-layered networks](04-OwnFramework/README.md) and [how to build our own framework](04-OwnFramework/OwnFramework.ipynb)
 * [Neural Network Frameworks](05-Frameworks/README.md), such as [PyTorch](05-Frameworks/IntroPyTorch.ipynb) and [Keras/Tensorflow](05-Frameworks/IntroKerasTF.ipynb)
-
+* [Overfitting](05-Frameworks/Overfitting.md)
diff --git a/5-NLP/13-TextRep/README.md b/5-NLP/13-TextRep/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..13ab39bc4d95828d9404ced9eed827014d03da69
--- /dev/null
+++ b/5-NLP/13-TextRep/README.md
@@ -0,0 +1,48 @@
+# Representing Text as Tensors
+
+## Text Classification
+
+Throughout the first part of this course, we will focus on the **text classification** task. We will use the [AG News](https://www.kaggle.com/amananandrai/ag-news-classification-dataset) dataset, which contains news articles like the following:
+
+* Category: Sci/Tech
+* Title: Ky. Company Wins Grant to Study Peptides (AP)
+* Body: AP - A company founded by a chemistry researcher at the University of Louisville won a grant to develop...
+
+Our goal is to classify a news item into one of the categories based on its text.
+
+## Representing Text
+
+If we want to solve Natural Language Processing (NLP) tasks with neural networks, we need some way to represent text as tensors. Computers already represent textual characters as numbers that map to fonts on your screen using encodings such as ASCII or UTF-8.
+
+![Image showing diagram mapping a character to an ASCII and binary representation](images/ascii-character-map.png)
+
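+A quick sanity check of this character-to-number mapping, using nothing but standard Python (the snippet below is illustrative only):
+
+```python
+# Characters are already numbers to the computer:
+print(ord('H'))                  # 72 -- the ASCII/Unicode code point of 'H'
+print('Hello'.encode('utf-8'))   # b'Hello' -- the UTF-8 byte representation
+```
+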
+We understand what each letter **represents**, and how all characters come together to form the words of a sentence. However, computers by themselves do not have such an understanding, and a neural network has to learn the meaning during training.
+
+Therefore, we can use different approaches when representing text:
+* **Character-level representation**, in which we represent text by treating each character as a number. Given that we have *C* different characters in our text corpus, the word *Hello* would be represented by a 5x*C* tensor. Each letter would correspond to a tensor column in one-hot encoding.
+* **Word-level representation**, in which we create a **vocabulary** of all words in our text, and then represent words using one-hot encoding. This approach is somewhat better, because each letter by itself does not have much meaning, and thus by using higher-level semantic concepts - words - we simplify the task for the neural network. However, given the large dictionary size, we need to deal with high-dimensional sparse tensors.
+
+Regardless of the representation, we first need to convert text into a sequence of **tokens**, one token being either a character, a word, or sometimes even a part of a word. Then, we convert each token into a number, typically using a **vocabulary**, and this number can be fed into a neural network using one-hot encoding.
+
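+The following sketch walks through the word-level pipeline in plain Python: tokenize, build a vocabulary, and one-hot encode each token (all names here are illustrative; the notebooks below use library tokenizers and vocabularies):
+
+```python
+text = "I like to go fishing"
+tokens = text.lower().split()                      # naive word-level tokenizer
+vocab = {w: i for i, w in enumerate(sorted(set(tokens)))}
+print(vocab)        # {'fishing': 0, 'go': 1, 'i': 2, 'like': 3, 'to': 4}
+
+# one-hot encode every token: a len(tokens) x len(vocab) matrix of 0/1 values
+one_hot = [[1 if vocab[t] == i else 0 for i in range(len(vocab))] for t in tokens]
+print(one_hot[0])   # [0, 0, 1, 0, 0] -- the token 'i' sits at vocabulary index 2
+```
+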
+## N-Grams
+
+In natural language, the precise meaning of words can only be determined in context. For example, the meanings of *neural network* and *fishing network* are completely different. One of the ways to take this into account is to build our model on pairs of words, considering each word pair as a separate vocabulary token. In this way, the sentence *I like to go fishing* will be represented by the following sequence of tokens: *I like*, *like to*, *to go*, *go fishing*. The problem with this approach is that the dictionary size grows significantly, and combinations like *go fishing* and *go shopping* are represented by different tokens, which do not share any semantic similarity despite the same verb.
+
+In some cases, we may consider using tri-grams -- combinations of three words -- as well. Thus such an approach is often called **n-grams**. It also makes sense to use n-grams with character-level representation, in which case n-grams will roughly correspond to syllables.
+
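+Here is a minimal sketch of turning a tokenized sentence into bigram tokens (the `bigrams` helper is illustrative, not part of any library):
+
+```python
+def bigrams(tokens):
+    """Pair each token with its right neighbour."""
+    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]
+
+print(bigrams("I like to go fishing".split()))
+# ['I like', 'like to', 'to go', 'go fishing']
+```
+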
+## Bag-of-Words and TF/IDF
+
+When solving tasks like text classification, we need to be able to represent text by one fixed-size vector, which we will use as an input to the final dense classifier. One of the simplest ways to do that is to combine all individual word representations, e.g. by adding them. If we add one-hot encodings of each word, we will end up with a vector of frequencies, showing how many times each word appears inside the text. Such a representation of text is called **bag of words** (BOW).
+
+<img src="images/bow.png" width="30%"/>
+
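+A bag-of-words vector is simply a table of word counts; here is a minimal sketch using only the standard library (the toy sentence is illustrative):
+
+```python
+from collections import Counter
+
+text = "the cat sat on the mat"
+bow = Counter(text.split())
+print(bow)  # Counter({'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1})
+```
+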
+BOW essentially represents which words appear in the text and in which quantities, which can indeed be a good indication of what the text is about. For example, a news article on politics is likely to contain words such as *president* and *country*, while a scientific publication would have something like *collider*, *discovered*, etc. Thus, word frequencies can in many cases be a good indicator of text content.
+
+The problem with BOW is that certain common words, such as *and*, *is*, etc. appear in most of the texts, and they have the highest frequencies, masking out the words that are really important. We may lower the importance of those words by taking into account the frequency at which words occur in the whole document collection. This is the main idea behind the TF/IDF approach, which is covered in more detail in the notebooks below.
+
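+The core of TF/IDF fits in a few lines; below is a minimal sketch on a toy corpus (the corpus and helper are illustrative, and the notebooks use library implementations):
+
+```python
+import math
+
+corpus = ["the cat sat on the mat", "the dog ran fast", "the cat and the dog"]
+docs = [doc.split() for doc in corpus]
+N = len(docs)
+
+def tf_idf(word, doc):
+    tf = doc.count(word)                    # how often the word occurs in this document
+    df = sum(1 for d in docs if word in d)  # how many documents contain the word
+    return tf * math.log(N / df)
+
+print(tf_idf("the", docs[0]))  # 0.0 -- 'the' appears in every document
+print(tf_idf("cat", docs[0]))  # ~0.41 -- 'cat' appears in only 2 of 3 documents
+```
+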
+However, none of those approaches can fully take into account the semantics of text. We need more powerful neural network models, which we will discuss later in this course.
+
+## Continue to Notebooks
+
+* [Text Representation with PyTorch](TextRepresentationPyTorch.ipynb)
+* [Text Representation with Tensorflow](TextRepresentationTF.ipynb)
diff --git a/5-NLP/13-TextRep/TextRepresentationPyTorch.ipynb b/5-NLP/13-TextRep/TextRepresentationPyTorch.ipynb
new file mode 100644
index 0000000000000000000000000000000000000000..463039a89676aeb3bffaa1042b0bb8371c5115a9
--- /dev/null
+++ b/5-NLP/13-TextRep/TextRepresentationPyTorch.ipynb
@@ -0,0 +1,571 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Text classification task\n",
+    "\n",
+    "As we have mentioned, we will focus on simple text classification task based on **AG_NEWS** dataset, which is to classify news headlines into one of 4 categories: World, Sports, Business and Sci/Tech.\n",
+    "\n",
+    "### The Dataset\n",
+    "\n",
+    "This dataset is built into [`torchtext`](https://github.com/pytorch/text) module, so we can easily access it."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "d:\\WORK\\ai-for-beginners\\5-NLP\\13-TextRep\\data\\train.csv: 29.5MB [00:00, 48.1MB/s]                            \n",
+      "d:\\WORK\\ai-for-beginners\\5-NLP\\13-TextRep\\data\\test.csv: 1.86MB [00:00, 9.53MB/s]                          \n"
+     ]
+    }
+   ],
+   "source": [
+    "import torch\n",
+    "import torchtext\n",
+    "import os\n",
+    "import collections\n",
+    "os.makedirs('./data',exist_ok=True)\n",
+    "train_dataset, test_dataset = torchtext.datasets.AG_NEWS(root='./data')\n",
+    "classes = ['World', 'Sports', 'Business', 'Sci/Tech']"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Here, `train_dataset` and `test_dataset` contain iterators that return pairs of label (number of class) and text respectively, for example:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "(3,\n",
+       " \"Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\\\\band of ultra-cynics, are seeing green again.\")"
+      ]
+     },
+     "execution_count": 3,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "next(train_dataset)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "So, let's print out the first 10 new headlines from our dataset: "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "**Sci/Tech** -> Carlyle Looks Toward Commercial Aerospace (Reuters) Reuters - Private investment firm Carlyle Group,\\which has a reputation for making well-timed and occasionally\\controversial plays in the defense industry, has quietly placed\\its bets on another part of the market.\n",
+      "**Sci/Tech** -> Oil and Economy Cloud Stocks' Outlook (Reuters) Reuters - Soaring crude prices plus worries\\about the economy and the outlook for earnings are expected to\\hang over the stock market next week during the depth of the\\summer doldrums.\n",
+      "**Sci/Tech** -> Iraq Halts Oil Exports from Main Southern Pipeline (Reuters) Reuters - Authorities have halted oil export\\flows from the main pipeline in southern Iraq after\\intelligence showed a rebel militia could strike\\infrastructure, an oil official said on Saturday.\n",
+      "**Sci/Tech** -> Oil prices soar to all-time record, posing new menace to US economy (AFP) AFP - Tearaway world oil prices, toppling records and straining wallets, present a new economic menace barely three months before the US presidential elections.\n",
+      "**Sci/Tech** -> Stocks End Up, But Near Year Lows (Reuters) Reuters - Stocks ended slightly higher on Friday\\but stayed near lows for the year as oil prices surged past  #36;46\\a barrel, offsetting a positive outlook from computer maker\\Dell Inc. (DELL.O)\n"
+     ]
+    }
+   ],
+   "source": [
+    "for i,x in zip(range(5),train_dataset):\n",
+    "    print(f\"**{classes[x[0]]}** -> {x[1]}\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Because datasets are iterators, if we want to use the data multiple times we need to convert it to list:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "train_dataset, test_dataset = torchtext.datasets.AG_NEWS(root='./data')\n",
+    "train_dataset = list(train_dataset)\n",
+    "test_dataset = list(test_dataset)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Tokenization\n",
+    "\n",
+    "Now we need to convert text into **numbers** that can be represented as tensors. If we want word-level representation, we need to do two things:\n",
+    "* use **tokenizer** to split text into **tokens**\n",
+    "* build a **vocabulary** of those tokens."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "['he', 'said', 'hello']"
+      ]
+     },
+     "execution_count": 9,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "tokenizer = torchtext.data.utils.get_tokenizer('basic_english')\n",
+    "tokenizer('He said: hello')"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "counter = collections.Counter()\n",
+    "for (label, line) in train_dataset:\n",
+    "    counter.update(tokenizer(line))\n",
+    "vocab = torchtext.vocab.Vocab(counter, min_freq=1)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Using vocabulary, we can easily encode out tokenized string into a set of numbers:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 11,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Vocab size if 95812\n"
+     ]
+    },
+    {
+     "data": {
+      "text/plain": [
+       "[283, 2321, 5, 337, 19, 1301, 2357]"
+      ]
+     },
+     "execution_count": 11,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "vocab_size = len(vocab)\n",
+    "print(f\"Vocab size if {vocab_size}\")\n",
+    "\n",
+    "def encode(x):\n",
+    "    return [vocab.stoi[s] for s in tokenizer(x)]\n",
+    "\n",
+    "encode('I love to play with my words')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Bag of Words text representation\n",
+    "\n",
+    "Because words represent meaning, sometimes we can figure out the meaning of a text by just looking at the individual words, regardless of their order in the sentence. For example, when classifying news, words like *weather*, *snow* are likely to indicate *weather forecast*, while words like *stocks*, *dollar* would count towards *financial news*.\n",
+    "\n",
+    "**Bag of Words** (BoW) vector representation is the most commonly used traditional vector representation. Each word is linked to a vector index, vector element contains the number of occurrences of a word in a given document.\n",
+    "\n",
+    "![Image showing how a bag of words vector representation is represented in memory.](images/bag-of-words-example.png) \n",
+    "\n",
+    "> **Note**: You can also think of BoW as a sum of all one-hot-encoded vectors for individual words in the text.\n",
+    "\n",
+    "Below is an example of how to generate a bag of word representation using the Scikit Learn python library:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 12,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "array([[1, 1, 0, 2, 0, 0, 0, 0, 0]], dtype=int64)"
+      ]
+     },
+     "execution_count": 12,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "from sklearn.feature_extraction.text import CountVectorizer\n",
+    "vectorizer = CountVectorizer()\n",
+    "corpus = [\n",
+    "        'I like hot dogs.',\n",
+    "        'The dog ran fast.',\n",
+    "        'Its hot outside.',\n",
+    "    ]\n",
+    "vectorizer.fit_transform(corpus)\n",
+    "vectorizer.transform(['My dog likes hot dogs on a hot day.']).toarray()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "To compute bag-of-words vector from the vector representation of our AG_NEWS dataset, we can use the following function:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 13,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "tensor([0., 0., 2.,  ..., 0., 0., 0.])\n"
+     ]
+    }
+   ],
+   "source": [
+    "vocab_size = len(vocab)\n",
+    "\n",
+    "def to_bow(text,bow_vocab_size=vocab_size):\n",
+    "    res = torch.zeros(bow_vocab_size,dtype=torch.float32)\n",
+    "    for i in encode(text):\n",
+    "        if i<bow_vocab_size:\n",
+    "            res[i] += 1\n",
+    "    return res\n",
+    "\n",
+    "print(to_bow(train_dataset[0][1]))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "> **Note:** Here we are using global `vocab_size` variable to specify default size of the vocabulary. Since often vocabulary size is pretty big, we can limit the size of the vocabulary to most frequent words. Try lowering `vocab_size` value and running the code below, and see how it affects the accuracy. You should expect some accuracy drop, but not dramatic, in lieu of higher performance."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Training BoW classifier\n",
+    "\n",
+    "Now that we have learned how to build Bag-of-Words representation of our text, let's train a classifier on top of it. First, we need to convert our dataset for training in such a way, that all positional vector representations are converted to bag-of-words representation. This can be achieved by passing `bowify` function as `collate_fn` parameter to standard torch `DataLoader`:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 14,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from torch.utils.data import DataLoader\n",
+    "import numpy as np \n",
+    "\n",
+    "# this collate function gets list of batch_size tuples, and needs to \n",
+    "# return a pair of label-feature tensors for the whole minibatch\n",
+    "def bowify(b):\n",
+    "    return (\n",
+    "            torch.LongTensor([t[0]-1 for t in b]),\n",
+    "            torch.stack([to_bow(t[1]) for t in b])\n",
+    "    )\n",
+    "\n",
+    "train_loader = DataLoader(train_dataset, batch_size=16, collate_fn=bowify, shuffle=True)\n",
+    "test_loader = DataLoader(test_dataset, batch_size=16, collate_fn=bowify, shuffle=True)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Now let's define a simple classifier neural network that contains one linear layer. The size of the input vector equals to `vocab_size`, and output size corresponds to the number of classes (4). Because we are solving classification task, the final activation function is `LogSoftmax()`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 15,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "net = torch.nn.Sequential(torch.nn.Linear(vocab_size,4),torch.nn.LogSoftmax(dim=1))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Now we will define standard PyTorch training loop. Because our dataset is quite large, for our teaching purpose we will train only for one epoch, and sometimes even for less than an epoch (specifying the `epoch_size` parameter allows us to limit training). We would also report accumulated training accuracy during training; the frequency of reporting is specified using `report_freq` parameter."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 16,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def train_epoch(net,dataloader,lr=0.01,optimizer=None,loss_fn = torch.nn.NLLLoss(),epoch_size=None, report_freq=200):\n",
+    "    optimizer = optimizer or torch.optim.Adam(net.parameters(),lr=lr)\n",
+    "    net.train()\n",
+    "    total_loss,acc,count,i = 0,0,0,0\n",
+    "    for labels,features in dataloader:\n",
+    "        optimizer.zero_grad()\n",
+    "        out = net(features)\n",
+    "        loss = loss_fn(out,labels) #cross_entropy(out,labels)\n",
+    "        loss.backward()\n",
+    "        optimizer.step()\n",
+    "        total_loss+=loss\n",
+    "        _,predicted = torch.max(out,1)\n",
+    "        acc+=(predicted==labels).sum()\n",
+    "        count+=len(labels)\n",
+    "        i+=1\n",
+    "        if i%report_freq==0:\n",
+    "            print(f\"{count}: acc={acc.item()/count}\")\n",
+    "        if epoch_size and count>epoch_size:\n",
+    "            break\n",
+    "    return total_loss.item()/count, acc.item()/count"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 17,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "3200: acc=0.80625\n",
+      "6400: acc=0.841875\n",
+      "9600: acc=0.8564583333333333\n",
+      "12800: acc=0.86640625\n"
+     ]
+    },
+    {
+     "data": {
+      "text/plain": [
+       "(0.025393278614036056, 0.8710021321961621)"
+      ]
+     },
+     "execution_count": 17,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "train_epoch(net,train_loader,epoch_size=15000)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### BiGrams, TriGrams and N-Grams\n",
+    "\n",
+    "One limitation of a bag of words approach is that some words are part of multi word expressions, for example, the word 'hot dog' has a completely different meaning than the words 'hot' and 'dog' in other contexts. If we represent words 'hot` and 'dog' always by the same vectors, it can confuse our model.\n",
+    "\n",
+    "To address this, **N-gram representations** are often used in methods of document classification, where the frequency of each word, bi-word or tri-word is a useful feature for training classifiers. In bigram representation, for example, we will add all word pairs to the vocabulary, in addition to original words. \n",
+    "\n",
+    "Below is an example of how to generate a bigram bag of word representation using the Scikit Learn:\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 18,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Vocabulary:\n",
+      " {'i': 7, 'like': 11, 'hot': 4, 'dogs': 2, 'i like': 8, 'like hot': 12, 'hot dogs': 5, 'the': 16, 'dog': 0, 'ran': 14, 'fast': 3, 'the dog': 17, 'dog ran': 1, 'ran fast': 15, 'its': 9, 'outside': 13, 'its hot': 10, 'hot outside': 6}\n"
+     ]
+    },
+    {
+     "data": {
+      "text/plain": [
+       "array([[1, 0, 1, 0, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],\n",
+       "      dtype=int64)"
+      ]
+     },
+     "execution_count": 18,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "bigram_vectorizer = CountVectorizer(ngram_range=(1, 2), token_pattern=r'\\b\\w+\\b', min_df=1)\n",
+    "corpus = [\n",
+    "        'I like hot dogs.',\n",
+    "        'The dog ran fast.',\n",
+    "        'Its hot outside.',\n",
+    "    ]\n",
+    "bigram_vectorizer.fit_transform(corpus)\n",
+    "print(\"Vocabulary:\\n\",bigram_vectorizer.vocabulary_)\n",
+    "bigram_vectorizer.transform(['My dog likes hot dogs on a hot day.']).toarray()\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The main drawback of N-gram approach is that vocabulary size starts to grow extremely fast. In practice, we need to combine N-gram representation with some dimensionality reduction techniques, such as *embeddings*, which we will discuss in the next unit.\n",
+    "\n",
+    "To use N-gram representation in our **AG News** dataset, we need to build special ngram vocabulary:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 19,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Bigram vocabulary length =  1308844\n"
+     ]
+    }
+   ],
+   "source": [
+    "counter = collections.Counter()\n",
+    "for (label, line) in train_dataset:\n",
+    "    l = tokenizer(line)\n",
+    "    counter.update(torchtext.data.utils.ngrams_iterator(l,ngrams=2))\n",
+    "    \n",
+    "bi_vocab = torchtext.vocab.Vocab(counter, min_freq=1)\n",
+    "\n",
+    "print(\"Bigram vocabulary length = \",len(bi_vocab))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We could then use the same code as above to train the classifier, however, it would be very memory-inefficient. In the next unit, we will train bigram classifier using embeddings.\n",
+    "\n",
+    "> **Note:** You can only leave those ngrams that occur in the text more than specified number of times. This will make sure that infrequent bigrams will be omitted, and will decrease the dimensionality significantly. To do this, set `min_freq` parameter to a higher value, and observe the length of vocabulary change."
+   ]
+  },
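+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# A minimal sketch (assuming the bigram `counter` built above is still in memory):\n",
+    "# raising min_freq drops infrequent n-grams and shrinks the vocabulary considerably.\n",
+    "bi_vocab_small = torchtext.vocab.Vocab(counter, min_freq=2)\n",
+    "print(\"Bigram vocabulary length (min_freq=2) = \", len(bi_vocab_small))"
+   ]
+  },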
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Term Frequency Inverse Document Frequency TF-IDF\n",
+    "\n",
+    "In BoW representation, word occurrences are evenly weighted, regardless of the word itself. However, it is clear that frequent words, such as *a*, *in*, etc. are much less important for the classification, than specialized terms. In fact, in most NLP tasks some words are more relevant than others.\n",
+    "\n",
+    "**TF-IDF** stands for **term frequency鈥搃nverse document frequency**. It is a variation of bag of words, where instead of a binary 0/1 value indicating the appearance of a word in a document, a floating-point value is used, which is related to the frequency of word occurrence in the corpus.\n",
+    "\n",
+    "More formally, the weight $w_{ij}$ of a word $i$ in the document $j$ is defined as:\n",
+    "$$\n",
+    "w_{ij} = tf_{ij}\\times\\log({N\\over df_i})\n",
+    "$$\n",
+    "where\n",
+    "* $tf_{ij}$ is the number of occurrences of $i$ in $j$, i.e. the BoW value we have seen before\n",
+    "* $N$ is the number of documents in the collection\n",
+    "* $df_i$ is the number of documents containing the word $i$ in the whole collection\n",
+    "\n",
+    "TF-IDF value $w_{ij}$ increases proportionally to the number of times a word appears in a document and is offset by the number of documents in the corpus that contains the word, which helps to adjust for the fact that some words appear more frequently than others. For example, if the word appears in *every* document in the collection, $df_i=N$, and $w_{ij}=0$, and those terms would be completely disregarded.\n",
+    "\n",
+    "You can easily create TF-IDF vectorization of text using Scikit Learn:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 20,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "array([[0.43381609, 0.        , 0.43381609, 0.        , 0.65985664,\n",
+       "        0.43381609, 0.        , 0.        , 0.        , 0.        ,\n",
+       "        0.        , 0.        , 0.        , 0.        , 0.        ,\n",
+       "        0.        ]])"
+      ]
+     },
+     "execution_count": 20,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "from sklearn.feature_extraction.text import TfidfVectorizer\n",
+    "vectorizer = TfidfVectorizer(ngram_range=(1,2))\n",
+    "vectorizer.fit_transform(corpus)\n",
+    "vectorizer.transform(['My dog likes hot dogs on a hot day.']).toarray()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Conclusion \n",
+    "\n",
+    "However even though TF-IDF representations provide frequency weight to different words they are unable to represent meaning or order. As the famous linguist J. R. Firth said in 1935, 鈥淭he complete meaning of a word is always contextual, and no study of meaning apart from context can be taken seriously.鈥�. We will learn later in the course how to capture contextual information from text using language modeling.\n"
+   ]
+  }
+ ],
+ "metadata": {
+  "interpreter": {
+   "hash": "0cb620c6d4b9f7a635928804c26cf22403d89d98d79684e4529119355ee6d5a5"
+  },
+  "kernelspec": {
+   "display_name": "py37_pytorch",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.8.12"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
diff --git a/5-NLP/13-TextRep/TextRepresentationTF.ipynb b/5-NLP/13-TextRep/TextRepresentationTF.ipynb
new file mode 100644
index 0000000000000000000000000000000000000000..2675380e11dbc78561b876e59d4757603acf5b41
--- /dev/null
+++ b/5-NLP/13-TextRep/TextRepresentationTF.ipynb
@@ -0,0 +1,675 @@
+{
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "# Text classification task\n",
+        "\n",
+        "In this module, we will start with a simple text classification task based on the **[AG_NEWS](http://www.di.unipi.it/~gulli/AG_corpus_of_news_articles.html)** dataset: we'll classify news headlines into one of 4 categories: World, Sports, Business and Sci/Tech. \n",
+        "\n",
+        "### The Dataset\n",
+        "\n",
+        "To load the dataset, we will use the **[TensorFlow Datasets](https://www.tensorflow.org/datasets)** API."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 2,
+      "metadata": {},
+      "outputs": [
+        {
+          "name": "stdout",
+          "output_type": "stream",
+          "text": [
+            "\u001b[1mDownloading and preparing dataset Unknown size (download: Unknown size, generated: Unknown size, total: Unknown size) to C:\\Users\\dmitryso\\tensorflow_datasets\\ag_news_subset\\1.0.0...\u001b[0m\n"
+          ]
+        },
+        {
+          "name": "stderr",
+          "output_type": "stream",
+          "text": [
+            "Dl Completed...: 0 url [00:00, ? url/s]\n",
+            "Dl Completed...:   0%|          | 0/1 [00:00<?, ? url/s]\n",
+            "Dl Completed...:   0%|          | 0/1 [00:32<?, ? url/s]\n",
+            "Dl Completed...:   0%|          | 0/1 [00:32<?, ? url/s]\n",
+            "Dl Completed...:   0%|          | 0/1 [00:32<?, ? url/s]\n",
+            "Dl Completed...:   0%|          | 0/1 [00:32<?, ? url/s]\n",
+            "Dl Completed...:   0%|          | 0/1 [00:32<?, ? url/s]\n",
+            "Dl Completed...:   0%|          | 0/1 [00:32<?, ? url/s]\n",
+            "Dl Completed...:   0%|          | 0/1 [00:32<?, ? url/s]\n",
+            "Dl Completed...:   0%|          | 0/1 [00:32<?, ? url/s]\n",
+            "Dl Completed...:   0%|          | 0/1 [00:33<?, ? url/s]\n",
+            "Dl Completed...:   0%|          | 0/1 [00:33<?, ? url/s]\n",
+            "Dl Completed...:   0%|          | 0/1 [00:33<?, ? url/s]\n",
+            "Dl Completed...:   0%|          | 0/1 [00:33<?, ? url/s]\n",
+            "Dl Completed...: 100%|鈻堚枅鈻堚枅鈻堚枅鈻堚枅鈻堚枅| 1/1 [00:33<00:00, 33.33s/ url]\n",
+            "Dl Completed...: 100%|鈻堚枅鈻堚枅鈻堚枅鈻堚枅鈻堚枅| 1/1 [00:33<00:00, 33.33s/ url]\n",
+            "\u001b[A\n",
+            "Dl Completed...: 100%|鈻堚枅鈻堚枅鈻堚枅鈻堚枅鈻堚枅| 1/1 [00:34<00:00, 33.33s/ url]\n",
+            "Extraction completed...: 100%|鈻堚枅鈻堚枅鈻堚枅鈻堚枅鈻堚枅| 1/1 [00:34<00:00, 34.43s/ file]\n",
+            "Dl Size...: 100%|鈻堚枅鈻堚枅鈻堚枅鈻堚枅鈻堚枅| 11/11 [00:34<00:00,  3.13s/ MiB]\n",
+            "Dl Completed...: 100%|鈻堚枅鈻堚枅鈻堚枅鈻堚枅鈻堚枅| 1/1 [00:34<00:00, 34.45s/ url]\n",
+            "                                                                         \r"
+          ]
+        },
+        {
+          "name": "stdout",
+          "output_type": "stream",
+          "text": [
+            "\u001b[1mDataset ag_news_subset downloaded and prepared to C:\\Users\\dmitryso\\tensorflow_datasets\\ag_news_subset\\1.0.0. Subsequent calls will reuse this data.\u001b[0m\n"
+          ]
+        }
+      ],
+      "source": [
+        "import tensorflow as tf\n",
+        "from tensorflow import keras\n",
+        "import tensorflow_datasets as tfds\n",
+        "\n",
+        "# In this tutorial, we will be training a lot of models. In order to use GPU memory cautiously,\n",
+        "# we will set tensorflow option to grow GPU memory allocation when required.\n",
+        "physical_devices = tf.config.list_physical_devices('GPU') \n",
+        "if len(physical_devices)>0:\n",
+        "    tf.config.experimental.set_memory_growth(physical_devices[0], True)\n",
+        "\n",
+        "dataset = tfds.load('ag_news_subset')"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "We can now access the training and test portions of the dataset by using `dataset['train']` and `dataset['test']` respectively:"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 4,
+      "metadata": {},
+      "outputs": [
+        {
+          "name": "stdout",
+          "output_type": "stream",
+          "text": [
+            "120000\n",
+            "7600\n"
+          ]
+        }
+      ],
+      "source": [
+        "ds_train = dataset['train']\n",
+        "ds_test = dataset['test']\n",
+        "\n",
+        "print(f\"Length of train dataset = {len(ds_train)}\")\n",
+        "print(f\"Length of test dataset = {len(ds_test)}\")"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "Let's print out the first 10 new headlines from our dataset: "
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 5,
+      "metadata": {},
+      "outputs": [
+        {
+          "name": "stdout",
+          "output_type": "stream",
+          "text": [
+            "3 (Sci/Tech) -> b'AMD Debuts Dual-Core Opteron Processor' b'AMD #39;s new dual-core Opteron chip is designed mainly for corporate computing applications, including databases, Web services, and financial transactions.'\n",
+            "1 (Sports) -> b\"Wood's Suspension Upheld (Reuters)\" b'Reuters - Major League Baseball\\\\Monday announced a decision on the appeal filed by Chicago Cubs\\\\pitcher Kerry Wood regarding a suspension stemming from an\\\\incident earlier this season.'\n",
+            "2 (Business) -> b'Bush reform may have blue states seeing red' b'President Bush #39;s  quot;revenue-neutral quot; tax reform needs losers to balance its winners, and people claiming the federal deduction for state and local taxes may be in administration planners #39; sights, news reports say.'\n",
+            "3 (Sci/Tech) -> b\"'Halt science decline in schools'\" b'Britain will run out of leading scientists unless science education is improved, says Professor Colin Pillinger.'\n",
+            "1 (Sports) -> b'Gerrard leaves practice' b'London, England (Sports Network) - England midfielder Steven Gerrard injured his groin late in Thursday #39;s training session, but is hopeful he will be ready for Saturday #39;s World Cup qualifier against Austria.'\n"
+          ]
+        }
+      ],
+      "source": [
+        "classes = ['World', 'Sports', 'Business', 'Sci/Tech']\n",
+        "\n",
+        "for i,x in zip(range(5),ds_train):\n",
+        "    print(f\"{x['label']} ({classes[x['label']]}) -> {x['title']} {x['description']}\")"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "## Text vectorization\n",
+        "\n",
+        "Now we need to convert text into **numbers** that can be represented as tensors. If we want word-level representation, we need to do two things:\n",
+        "\n",
+        "* Use a **tokenizer** to split text into **tokens**.\n",
+        "* Build a **vocabulary** of those tokens.\n",
+        "\n",
+        "### Limiting vocabulary size\n",
+        "\n",
+        "In the AG News dataset example, the vocabulary size is rather big, more than 100k words. Generally speaking, we don't need words that are rarely present in the text &mdash; only a few sentences will have them, and the model will not learn from them. Thus, it makes sense to limit the vocabulary size to a smaller number by passing an argument to the vectorizer constructor:\n",
+        "\n",
+        "Both of those steps can be handled using the **TextVectorization** layer. Let's instantiate the vectorizer object, and then call the `adapt` method to go through all text and build a vocabulary:\n",
+        "\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 16,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "vocab_size = 50000\n",
+        "vectorizer = keras.layers.experimental.preprocessing.TextVectorization(max_tokens=vocab_size)\n",
+        "vectorizer.adapt(ds_train.take(500).map(lambda x: x['title']+' '+x['description']))"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "> **Note** that we are using only subset of the whole dataset to build a vocabulary. We do it to speed up the execution time and not keep you waiting. However, we are taking the risk that some of the words from the whole dateset would not be included into the vocabulary, and will be ignored during training. Thus, using the whole vocabulary size and running through all dataset during `adapt` should increase the final accuracy, but not significantly.\n",
+        "\n",
+        "Now we can access the actual vocabulary:"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 17,
+      "metadata": {},
+      "outputs": [
+        {
+          "name": "stdout",
+          "output_type": "stream",
+          "text": [
+            "['', '[UNK]', 'the', 'to', 'a', 'of', 'in', 'and', 'on', 'for']\n",
+            "Length of vocabulary: 50000\n"
+          ]
+        }
+      ],
+      "source": [
+        "vocab = vectorizer.get_vocabulary()\n",
+        "vocab_size = len(vocab)\n",
+        "print(vocab[:10])\n",
+        "print(f\"Length of vocabulary: {vocab_size}\")"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "Using the tokenizer, we can easily encode any text into a set of numbers:"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 18,
+      "metadata": {},
+      "outputs": [
+        {
+          "data": {
+            "text/plain": [
+              "<tf.Tensor: shape=(7,), dtype=int64, numpy=array([ 372, 2297,    3,  312,   12, 1293, 2314])>"
+            ]
+          },
+          "execution_count": 18,
+          "metadata": {},
+          "output_type": "execute_result"
+        }
+      ],
+      "source": [
+        "vectorizer('I love to play with my words')"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "## Bag-of-words text representation\n",
+        "\n",
+        "Because words represent meaning, sometimes we can figure out the meaning of a piece of text by just looking at the individual words, regardless of their order in the sentence. For example, when classifying news, words like *weather* and *snow* are likely to indicate *weather forecast*, while words like *stocks* and *dollar* would count towards *financial news*.\n",
+        "\n",
+        "**Bag-of-words** (BoW) vector representation is the most simple to understand traditional vector representation. Each word is linked to a vector index, and a vector element contains the number of occurrences of each word in a given document.\n",
+        "\n",
+        "![Image showing how a bag of words vector representation is represented in memory.](images/bag-of-words-example.png) \n",
+        "\n",
+        "> **Note**: You can also think of BoW as a sum of all one-hot-encoded vectors for individual words in the text.\n",
+        "\n",
+        "Below is an example of how to generate a bag-of-words representation using the Scikit Learn python library:"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 19,
+      "metadata": {},
+      "outputs": [
+        {
+          "data": {
+            "text/plain": [
+              "array([[1, 1, 0, 2, 0, 0, 0, 0, 0]])"
+            ]
+          },
+          "execution_count": 19,
+          "metadata": {},
+          "output_type": "execute_result"
+        }
+      ],
+      "source": [
+        "from sklearn.feature_extraction.text import CountVectorizer\n",
+        "sc_vectorizer = CountVectorizer()\n",
+        "corpus = [\n",
+        "        'I like hot dogs.',\n",
+        "        'The dog ran fast.',\n",
+        "        'Its hot outside.',\n",
+        "    ]\n",
+        "sc_vectorizer.fit_transform(corpus)\n",
+        "sc_vectorizer.transform(['My dog likes hot dogs on a hot day.']).toarray()\n",
+        "\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "We can also use the Keras vectorizer that we defined above, converting each word number into a one-hot encoding and adding all those vectors up:"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 20,
+      "metadata": {},
+      "outputs": [
+        {
+          "data": {
+            "text/plain": [
+              "array([0., 0., 0., ..., 0., 0., 0.], dtype=float32)"
+            ]
+          },
+          "execution_count": 20,
+          "metadata": {},
+          "output_type": "execute_result"
+        }
+      ],
+      "source": [
+        "def to_bow(text):\n",
+        "    return tf.reduce_sum(tf.one_hot(vectorizer(text),vocab_size),axis=0)\n",
+        "\n",
+        "to_bow('My dog likes hot dogs on a hot day.').numpy()"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "> **Note**: You may be surprised that the result differs from the previous example. The reason is that in the Keras example the length of the vector corresponds to the vocabulary size, which was built from the whole AG News dataset, while in the Scikit Learn example we built the vocabulary from the sample text on the fly. \n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "## Training the BoW classifier\n",
+        "\n",
+        "Now that we have learned how to build the bag-of-words representation of our text, let's train a classifier that uses it. First, we need to convert our dataset to a bag-of-words representation. This can be achieved by using `map` function in the following way:"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 21,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "batch_size = 128\n",
+        "\n",
+        "ds_train_bow = ds_train.map(lambda x: (to_bow(x['title']+x['description']),x['label'])).batch(batch_size)\n",
+        "ds_test_bow = ds_test.map(lambda x: (to_bow(x['title']+x['description']),x['label'])).batch(batch_size)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "Now let's define a simple classifier neural network that contains one linear layer. The input size is `vocab_size`, and the output size corresponds to the number of classes (4). Because we're solving a classification task, the final activation function is **softmax**:"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 22,
+      "metadata": {},
+      "outputs": [
+        {
+          "name": "stdout",
+          "output_type": "stream",
+          "text": [
+            "938/938 [==============================] - 88s 94ms/step - loss: 0.5466 - acc: 0.8759 - val_loss: 0.3682 - val_acc: 0.8950\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\n"
+          ]
+        },
+        {
+          "data": {
+            "text/plain": [
+              "<tensorflow.python.keras.callbacks.History at 0x7fb00217a810>"
+            ]
+          },
+          "execution_count": 22,
+          "metadata": {},
+          "output_type": "execute_result"
+        }
+      ],
+      "source": [
+        "model = keras.models.Sequential([\n",
+        "    keras.layers.Dense(4,activation='softmax',input_shape=(vocab_size,))\n",
+        "])\n",
+        "model.compile(loss='sparse_categorical_crossentropy',optimizer='adam',metrics=['acc'])\n",
+        "model.fit(ds_train_bow,validation_data=ds_test_bow)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "Since we have 4 classes, an accuracy of above 80% is a good result.\n",
+        "\n",
+        "## Training a classifier as one network\n",
+        "\n",
+        "Because the vectorizer is also a Keras layer, we can define a network that includes it, and train it end-to-end. This way we don't need to vectorize the dataset using `map`, we can just pass the original dataset to the input of the network.\n",
+        "\n",
+        "> **Note**: We would still have to apply maps to our dataset to convert fields from dictionaries (such as `title`, `description` and `label`) to tuples. However, when loading data from disk, we can build a dataset with the required structure in the first place."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 23,
+      "metadata": {},
+      "outputs": [
+        {
+          "name": "stdout",
+          "output_type": "stream",
+          "text": [
+            "Model: \"functional_1\"\n",
+            "_________________________________________________________________\n",
+            "Layer (type)                 Output Shape              Param #   \n",
+            "=================================================================\n",
+            "input_1 (InputLayer)         [(None, 1)]               0         \n",
+            "_________________________________________________________________\n",
+            "text_vectorization_6 (TextVe (None, None)              0         \n",
+            "_________________________________________________________________\n",
+            "tf_op_layer_OneHot (TensorFl [(None, None, 50000)]     0         \n",
+            "_________________________________________________________________\n",
+            "tf_op_layer_Sum (TensorFlowO [(None, 50000)]           0         \n",
+            "_________________________________________________________________\n",
+            "dense_1 (Dense)              (None, 4)                 200004    \n",
+            "=================================================================\n",
+            "Total params: 200,004\n",
+            "Trainable params: 200,004\n",
+            "Non-trainable params: 0\n",
+            "_________________________________________________________________\n",
+            "938/938 [==============================] - 79s 84ms/step - loss: 0.5221 - acc: 0.8804 - val_loss: 0.3447 - val_acc: 0.9024\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\n"
+          ]
+        },
+        {
+          "data": {
+            "text/plain": [
+              "<tensorflow.python.keras.callbacks.History at 0x7fb003184d10>"
+            ]
+          },
+          "execution_count": 23,
+          "metadata": {},
+          "output_type": "execute_result"
+        }
+      ],
+      "source": [
+        "def extract_text(x):\n",
+        "    return x['title']+' '+x['description']\n",
+        "\n",
+        "def tupelize(x):\n",
+        "    return (extract_text(x),x['label'])\n",
+        "\n",
+        "inp = keras.Input(shape=(1,),dtype=tf.string)\n",
+        "x = vectorizer(inp)\n",
+        "x = tf.reduce_sum(tf.one_hot(x,vocab_size),axis=1)\n",
+        "out = keras.layers.Dense(4,activation='softmax')(x)\n",
+        "model = keras.models.Model(inp,out)\n",
+        "model.summary()\n",
+        "\n",
+        "model.compile(loss='sparse_categorical_crossentropy',optimizer='adam',metrics=['acc'])\n",
+        "model.fit(ds_train.map(tupelize).batch(batch_size),validation_data=ds_test.map(tupelize).batch(batch_size))\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "## Bigrams, trigrams and n-grams\n",
+        "\n",
+        "One limitation of the bag-of-words approach is that some words are part of multi-word expressions, for example, the word 'hot dog' has a completely different meaning from the words 'hot' and 'dog' in other contexts. If we represent the words 'hot' and 'dog' always using the same vectors, it can confuse our model.\n",
+        "\n",
+        "To address this, **n-gram representations** are often used in methods of document classification, where the frequency of each word, bi-word or tri-word is a useful feature for training classifiers. In bigram representations, for example, we will add all word pairs to the vocabulary, in addition to original words.\n",
+        "\n",
+        "Below is an example of how to generate a bigram bag of word representation using Scikit Learn:"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 24,
+      "metadata": {},
+      "outputs": [
+        {
+          "name": "stdout",
+          "output_type": "stream",
+          "text": [
+            "Vocabulary:\n",
+            " {'i': 7, 'like': 11, 'hot': 4, 'dogs': 2, 'i like': 8, 'like hot': 12, 'hot dogs': 5, 'the': 16, 'dog': 0, 'ran': 14, 'fast': 3, 'the dog': 17, 'dog ran': 1, 'ran fast': 15, 'its': 9, 'outside': 13, 'its hot': 10, 'hot outside': 6}\n"
+          ]
+        },
+        {
+          "data": {
+            "text/plain": [
+              "array([[1, 0, 1, 0, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])"
+            ]
+          },
+          "execution_count": 24,
+          "metadata": {},
+          "output_type": "execute_result"
+        }
+      ],
+      "source": [
+        "bigram_vectorizer = CountVectorizer(ngram_range=(1, 2), token_pattern=r'\\b\\w+\\b', min_df=1)\n",
+        "corpus = [\n",
+        "        'I like hot dogs.',\n",
+        "        'The dog ran fast.',\n",
+        "        'Its hot outside.',\n",
+        "    ]\n",
+        "bigram_vectorizer.fit_transform(corpus)\n",
+        "print(\"Vocabulary:\\n\",bigram_vectorizer.vocabulary_)\n",
+        "bigram_vectorizer.transform(['My dog likes hot dogs on a hot day.']).toarray()\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "The main drawback of the n-gram approach is that the vocabulary size starts to grow extremely fast. In practice, we need to combine the n-gram representation with a dimensionality reduction technique, such as *embeddings*, which we will discuss in the next unit.\n",
+        "\n",
+        "To use an n-gram representation in our **AG News** dataset, we need to pass the `ngrams` parameter to our `TextVectorization` constructor. The length of a bigram vocaculary is **significantly larger**, in our case it is more than 1.3 million tokens! Thus it makes sense to limit bigram tokens as well by some reasonable number.\n",
+        "\n",
+        "We could use the same code as above to train the classifier, however, it would be very memory-inefficient. In the next unit, we will train the bigram classifier using embeddings. In the meantime, you can experiment with bigram classifier training in this notebook and see if you can get higher accuracy."
+      ]
+    },
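+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "# A minimal sketch (not trained here): a vectorizer that also emits bigram tokens,\n",
+        "# with the vocabulary capped at the vocab_size most frequent tokens.\n",
+        "bigram_keras_vectorizer = keras.layers.experimental.preprocessing.TextVectorization(\n",
+        "    max_tokens=vocab_size, ngrams=2)\n",
+        "bigram_keras_vectorizer.adapt(ds_train.take(500).map(extract_text))\n",
+        "print(len(bigram_keras_vectorizer.get_vocabulary()))"
+      ]
+    },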
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "## Automatically calculating BoW Vectors\n",
+        "\n",
+        "In the example above we calculated BoW vectors by hand by summing the one-hot encodings of individual words. However, the latest version of TensorFlow allows us to calculate BoW vectors automatically by passing the `output_mode='count` parameter to the vectorizer constructor. This makes defining and training our model significanly easier:"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 25,
+      "metadata": {},
+      "outputs": [
+        {
+          "name": "stdout",
+          "output_type": "stream",
+          "text": [
+            "Training vectorizer\n",
+            "938/938 [==============================] - 10s 11ms/step - loss: 0.5207 - acc: 0.8826 - val_loss: 0.3430 - val_acc: 0.9051\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\n"
+          ]
+        },
+        {
+          "data": {
+            "text/plain": [
+              "<tensorflow.python.keras.callbacks.History at 0x7fb002c0b290>"
+            ]
+          },
+          "execution_count": 25,
+          "metadata": {},
+          "output_type": "execute_result"
+        }
+      ],
+      "source": [
+        "model = keras.models.Sequential([\n",
+        "    keras.layers.experimental.preprocessing.TextVectorization(max_tokens=vocab_size,output_mode='count'),\n",
+        "    keras.layers.Dense(4,input_shape=(vocab_size,), activation='softmax')\n",
+        "])\n",
+        "print(\"Training vectorizer\")\n",
+        "model.layers[0].adapt(ds_train.take(500).map(extract_text))\n",
+        "model.compile(loss='sparse_categorical_crossentropy',optimizer='adam',metrics=['acc'])\n",
+        "model.fit(ds_train.map(tupelize).batch(batch_size),validation_data=ds_test.map(tupelize).batch(batch_size))"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "## Term frequency - inverse document frequency (TF-IDF)\n",
+        "\n",
+        "In BoW representation, word occurrences are weighted using the same technique regardless of the word itself. However, it's clear that frequent words such as *a* and *in* are much less important for classification than specialized terms. In most NLP tasks some words are more relevant than others.\n",
+        "\n",
+        "**TF-IDF** stands for **term frequency - inverse document frequency**. It's a variation of bag-of-words, where instead of a binary 0/1 value indicating the appearance of a word in a document, a floating-point value is used, which is related to the frequency of the word occurrence in the corpus.\n",
+        "\n",
+        "More formally, the weight $w_{ij}$ of a word $i$ in the document $j$ is defined as:\n",
+        "$$\n",
+        "w_{ij} = tf_{ij}\\times\\log({N\\over df_i})\n",
+        "$$\n",
+        "where\n",
+        "* $tf_{ij}$ is the number of occurrences of $i$ in $j$, i.e. the BoW value we have seen before\n",
+        "* $N$ is the number of documents in the collection\n",
+        "* $df_i$ is the number of documents containing the word $i$ in the whole collection\n",
+        "\n",
+        "The TF-IDF value $w_{ij}$ increases proportionally to the number of times a word appears in a document and is offset by the number of documents in the corpus that contains the word, which helps to adjust for the fact that some words appear more frequently than others. For example, if the word appears in *every* document in the collection, $df_i=N$, and $w_{ij}=0$, and those terms would be completely disregarded.\n",
+        "\n",
+        "You can easily create TF-IDF vectorization of text using Scikit Learn:"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 20,
+      "metadata": {},
+      "outputs": [
+        {
+          "data": {
+            "text/plain": [
+              "array([[0.43381609, 0.        , 0.43381609, 0.        , 0.65985664,\n",
+              "        0.43381609, 0.        , 0.        , 0.        , 0.        ,\n",
+              "        0.        , 0.        , 0.        , 0.        , 0.        ,\n",
+              "        0.        ]])"
+            ]
+          },
+          "execution_count": 20,
+          "metadata": {},
+          "output_type": "execute_result"
+        }
+      ],
+      "source": [
+        "from sklearn.feature_extraction.text import TfidfVectorizer\n",
+        "vectorizer = TfidfVectorizer(ngram_range=(1,2))\n",
+        "vectorizer.fit_transform(corpus)\n",
+        "vectorizer.transform(['My dog likes hot dogs on a hot day.']).toarray()"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "In Keras, the `TextVectorization` layer can automatically compute TF-IDF frequencies by passing the `output_mode='tf-idf'` parameter. Let's repeat the code we used above to see if using TF-IDF increases accuracy: "
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 21,
+      "metadata": {},
+      "outputs": [
+        {
+          "name": "stdout",
+          "output_type": "stream",
+          "text": [
+            "Training vectorizer\n",
+            "938/938 [==============================] - 94s 101ms/step - loss: 0.3203 - acc: 0.9039 - val_loss: 0.2542 - val_acc: 0.9186\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\n"
+          ]
+        },
+        {
+          "data": {
+            "text/plain": [
+              "<tensorflow.python.keras.callbacks.History at 0x7f78f402e5d0>"
+            ]
+          },
+          "execution_count": 21,
+          "metadata": {},
+          "output_type": "execute_result"
+        }
+      ],
+      "source": [
+        "model = keras.models.Sequential([\n",
+        "    keras.layers.experimental.preprocessing.TextVectorization(max_tokens=vocab_size,output_mode='tf-idf'),\n",
+        "    keras.layers.Dense(4,input_shape=(vocab_size,), activation='softmax')\n",
+        "])\n",
+        "print(\"Training vectorizer\")\n",
+        "model.layers[0].adapt(ds_train.take(500).map(extract_text))\n",
+        "model.compile(loss='sparse_categorical_crossentropy',optimizer='adam',metrics=['acc'])\n",
+        "model.fit(ds_train.map(tupelize).batch(batch_size),validation_data=ds_test.map(tupelize).batch(batch_size))"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "## Conclusion \n",
+        "\n",
+        "Even though TF-IDF representations provide frequency weights to different words, they are unable to represent meaning or order. As the famous linguist J. R. Firth said in 1935, \"The complete meaning of a word is always contextual, and no study of meaning apart from context can be taken seriously.\" We will learn how to capture contextual information from text using language modeling in a later unit."
+      ]
+    }
+  ],
+  "metadata": {
+    "interpreter": {
+      "hash": "0cb620c6d4b9f7a635928804c26cf22403d89d98d79684e4529119355ee6d5a5"
+    },
+    "kernel_info": {
+      "name": "conda-env-py37_tensorflow-py"
+    },
+    "kernelspec": {
+      "display_name": "py37_tensorflow",
+      "language": "python",
+      "name": "python3"
+    },
+    "language_info": {
+      "codemirror_mode": {
+        "name": "ipython",
+        "version": 3
+      },
+      "file_extension": ".py",
+      "mimetype": "text/x-python",
+      "name": "python",
+      "nbconvert_exporter": "python",
+      "pygments_lexer": "ipython3",
+      "version": "3.8.12"
+    },
+    "nteract": {
+      "version": "nteract-front-end@1.0.0"
+    }
+  },
+  "nbformat": 4,
+  "nbformat_minor": 4
+}
diff --git a/5-NLP/13-TextRep/images/ascii-character-map.png b/5-NLP/13-TextRep/images/ascii-character-map.png
new file mode 100644
index 0000000000000000000000000000000000000000..4373dbdb8a752640801562a568fc0bed481681cb
Binary files /dev/null and b/5-NLP/13-TextRep/images/ascii-character-map.png differ
diff --git a/5-NLP/13-TextRep/images/bag-of-words-example.png b/5-NLP/13-TextRep/images/bag-of-words-example.png
new file mode 100644
index 0000000000000000000000000000000000000000..796c669dbd78f35134d2503253c4abe088d52509
Binary files /dev/null and b/5-NLP/13-TextRep/images/bag-of-words-example.png differ
diff --git a/5-NLP/13-TextRep/images/bow.png b/5-NLP/13-TextRep/images/bow.png
new file mode 100644
index 0000000000000000000000000000000000000000..a88e8e948128f92280914e3f13536a5fd6a5596b
Binary files /dev/null and b/5-NLP/13-TextRep/images/bow.png differ
diff --git a/5-NLP/14-Embeddings/EmbeddingsPyTorch.ipynb b/5-NLP/14-Embeddings/EmbeddingsPyTorch.ipynb
new file mode 100644
index 0000000000000000000000000000000000000000..beb6320e5bd9fe59cca6269316e1ffbf5b2c30ff
--- /dev/null
+++ b/5-NLP/14-Embeddings/EmbeddingsPyTorch.ipynb
@@ -0,0 +1,708 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Embeddings\n",
+    "\n",
+    "In our previous example, we operated on high-dimensional bag-of-words vectors with length `vocab_size`, and we were explicitly converting from low-dimensional positional representation vectors into sparse one-hot representation. This one-hot representation is not memory-efficient, in addition, each word is treated independently from each other, i.e. one-hot encoded vectors do not express any semantic similarity between words.\n",
+    "\n",
+    "In this unit, we will continue exploring **News AG** dataset. To begin, let's load the data and get some definitions from the previous unit.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {
+    "jupyter": {
+        "outputs_hidden": true
+      }
+   },
+   "outputs": [],
+   "source": [
+    "!wget -q https://raw.githubusercontent.com/MicrosoftDocs/pytorchfundamentals/main/nlp-pytorch/torchnlp.py"
+   ]
+  },  
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Loading dataset...\n",
+      "Building vocab...\n",
+      "Vocab size =  95812\n"
+     ]
+    }
+   ],
+   "source": [
+    "import torch\n",
+    "import torchtext\n",
+    "import numpy as np\n",
+    "from torchnlp import *\n",
+    "train_dataset, test_dataset, classes, vocab = load_dataset()\n",
+    "vocab_size = len(vocab)\n",
+    "print(\"Vocab size = \",vocab_size)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "\n",
+    "### What is embedding?\n",
+    "\n",
+    "The idea of **embedding** is to represent words by lower-dimensional dense vectors, which somehow reflect semantic meaning of a word. We will later discuss how to build meaningful word embeddings, but for now let's just think of embeddings as a way to lower dimensionality of a word vector. \n",
+    "\n",
+    "So, embedding layer would take a word as an input, and produce an output vector of specified `embedding_size`. In a sense, it is very similar to `Linear` layer, but instead of taking one-hot encoded vector, it will be able to take a word number as an input.\n",
+    "\n",
+    "By using embedding layer as a first layer in our network, we can switch from bag-or-words to **embedding bag** model, where we first convert each word in our text into corresponding embedding, and then compute some aggregate function over all those embeddings, such as `sum`, `average` or `max`.  \n",
+    "\n",
+    "![Image showing an embedding classifier for five sequence words.](./images/embedding-classifier-example.png)\n",
+    "\n",
+    "Our classifier neural network will start with embedding layer, then aggregation layer, and linear classifier on top of it:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "class EmbedClassifier(torch.nn.Module):\n",
+    "    def __init__(self, vocab_size, embed_dim, num_class):\n",
+    "        super().__init__()\n",
+    "        self.embedding = torch.nn.Embedding(vocab_size, embed_dim)\n",
+    "        self.fc = torch.nn.Linear(embed_dim, num_class)\n",
+    "\n",
+    "    def forward(self, x):\n",
+    "        x = self.embedding(x)\n",
+    "        x = torch.mean(x,dim=1)\n",
+    "        return self.fc(x)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Dealing with variable sequence size\n",
+    "\n",
+    "As a result of this architecture, minibatches to our network would need to be created in a certain way. In the previous unit, when using bag-of-words, all BoW tensors in a minibatch had equal size `vocab_size`, regardless of the actual length of our text sequence. Once we move to word embeddings, we would end up with variable number of words in each text sample, and when combining those samples into minibatches we would have to apply some padding.\n",
+    "\n",
+    "This can be done using the same technique of providing `collate_fn` function to the datasource:\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def padify(b):\n",
+    "    # b is the list of tuples of length batch_size\n",
+    "    #   - first element of a tuple = label, \n",
+    "    #   - second = feature (text sequence)\n",
+    "    # build vectorized sequence\n",
+    "    v = [encode(x[1]) for x in b]\n",
+    "    # first, compute max length of a sequence in this minibatch\n",
+    "    l = max(map(len,v))\n",
+    "    return ( # tuple of two tensors - labels and features\n",
+    "        torch.LongTensor([t[0]-1 for t in b]),\n",
+    "        torch.stack([torch.nn.functional.pad(torch.tensor(t),(0,l-len(t)),mode='constant',value=0) for t in v])\n",
+    "    )\n",
+    "\n",
+    "train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=16, collate_fn=padify, shuffle=True)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Training embedding classifier\n",
+    "\n",
+    "Now that we have defined proper dataloader, we can train the model using the training function we have defined in the previous unit:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "3200: acc=0.6428125\n",
+      "6400: acc=0.68453125\n",
+      "9600: acc=0.7123958333333333\n",
+      "12800: acc=0.725703125\n",
+      "16000: acc=0.7365625\n",
+      "19200: acc=0.7464583333333333\n",
+      "22400: acc=0.7548214285714285\n"
+     ]
+    },
+    {
+     "data": {
+      "text/plain": [
+       "(0.9526769402541186, 0.7595969289827256)"
+      ]
+     },
+     "execution_count": 4,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "net = EmbedClassifier(vocab_size,32,len(classes)).to(device)\n",
+    "train_epoch(net,train_loader, lr=1, epoch_size=25000)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "> **Note**: We are only training for 25k records here (less than one full epoch) for the sake of time, but you can continue training, write a function to train for several epochs, and experiment with learning rate parameter to achieve higher accuracy. You should be able to go to the accuracy of about 90%."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### EmbeddingBag Layer and Variable-Length Sequence Representation\n",
+    "\n",
+    "In the previous architecture, we needed to pad all sequences to the same length in order to fit them into a minibatch. This is not the most efficient way to represent variable length sequences - another apporach would be to use **offset** vector, which would hold offsets of all sequences stored in one large vector.\n",
+    "\n",
+    "![Image showing an offset sequence representation](./images/offset-sequence-representation.png)\n",
+    "\n",
+    "> **Note**: On the picture above, we show a sequence of characters, but in our example we are working with sequences of words. However, the general principle of representing sequences with offset vector remains the same.\n",
+    "\n",
+    "To work with offset representation, we use [`EmbeddingBag`](https://pytorch.org/docs/stable/generated/torch.nn.EmbeddingBag.html) layer. It is similar to `Embedding`, but it takes content vector and offset vector as input, and it also includes averaging layer, which can be `mean`, `sum` or `max`.\n",
+    "\n",
+    "Here is modified network that uses `EmbeddingBag`:\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "class EmbedClassifier(torch.nn.Module):\n",
+    "    def __init__(self, vocab_size, embed_dim, num_class):\n",
+    "        super().__init__()\n",
+    "        self.embedding = torch.nn.EmbeddingBag(vocab_size, embed_dim)\n",
+    "        self.fc = torch.nn.Linear(embed_dim, num_class)\n",
+    "\n",
+    "    def forward(self, text, off):\n",
+    "        x = self.embedding(text, off)\n",
+    "        return self.fc(x)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "To prepare the dataset for training, we need to provide a conversion function that will prepare the offset vector:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def offsetify(b):\n",
+    "    # first, compute data tensor from all sequences\n",
+    "    x = [torch.tensor(encode(t[1])) for t in b]\n",
+    "    # now, compute the offsets by accumulating the tensor of sequence lengths\n",
+    "    o = [0] + [len(t) for t in x]\n",
+    "    o = torch.tensor(o[:-1]).cumsum(dim=0)\n",
+    "    return ( \n",
+    "        torch.LongTensor([t[0]-1 for t in b]), # labels\n",
+    "        torch.cat(x), # text \n",
+    "        o\n",
+    "    )\n",
+    "\n",
+    "train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=16, collate_fn=offsetify, shuffle=True)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Note, that unlike in all previous examples, our network now accepts two parameters: data vector and offset vector, which are of different sizes. Sililarly, our data loader also provides us with 3 values instead of 2: both text and offset vectors are provided as features. Therefore, we need to slightly adjust our training function to take care of that:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "3200: acc=0.6334375\n",
+      "6400: acc=0.68234375\n",
+      "9600: acc=0.7072916666666667\n",
+      "12800: acc=0.72375\n",
+      "16000: acc=0.73575\n",
+      "19200: acc=0.743125\n",
+      "22400: acc=0.7497767857142857\n"
+     ]
+    },
+    {
+     "data": {
+      "text/plain": [
+       "(23.37446267194498, 0.754118682021753)"
+      ]
+     },
+     "execution_count": 7,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "net = EmbedClassifier(vocab_size,32,len(classes)).to(device)\n",
+    "\n",
+    "def train_epoch_emb(net,dataloader,lr=0.01,optimizer=None,loss_fn = torch.nn.CrossEntropyLoss(),epoch_size=None, report_freq=200):\n",
+    "    optimizer = optimizer or torch.optim.Adam(net.parameters(),lr=lr)\n",
+    "    loss_fn = loss_fn.to(device)\n",
+    "    net.train()\n",
+    "    total_loss,acc,count,i = 0,0,0,0\n",
+    "    for labels,text,off in dataloader:\n",
+    "        optimizer.zero_grad()\n",
+    "        labels,text,off = labels.to(device), text.to(device), off.to(device)\n",
+    "        out = net(text, off)\n",
+    "        loss = loss_fn(out,labels) #cross_entropy(out,labels)\n",
+    "        loss.backward()\n",
+    "        optimizer.step()\n",
+    "        total_loss+=loss\n",
+    "        _,predicted = torch.max(out,1)\n",
+    "        acc+=(predicted==labels).sum()\n",
+    "        count+=len(labels)\n",
+    "        i+=1\n",
+    "        if i%report_freq==0:\n",
+    "            print(f\"{count}: acc={acc.item()/count}\")\n",
+    "        if epoch_size and count>epoch_size:\n",
+    "            break\n",
+    "    return total_loss.item()/count, acc.item()/count\n",
+    "\n",
+    "\n",
+    "train_epoch_emb(net,train_loader, lr=4, epoch_size=25000)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "\n",
+    "## Semantic Embeddings: Word2Vec\n",
+    "\n",
+    "In our previous example, the model embedding layer learnt to map words to vector representation, however, this representation did not have much semantical meaning. It would be nice to learn such vector representation, that similar words or symonims would correspond to vectors that are close to each other in terms of some vector distance (eg. euclidian distance).\n",
+    "\n",
+    "To do that, we need to pre-train our embedding model on a large collection of text in a specific way. One of the first ways to train semantic embeddings is called [Word2Vec](https://en.wikipedia.org/wiki/Word2vec). It is based on two main architectures that are used to produce a distributed representation of words:\n",
+    "\n",
+    " - **Continuous bag-of-words** (CBoW) 鈥� in this architecture, we train the model to predict a word from surrounding context. Given the ngram $(W_{-2},W_{-1},W_0,W_1,W_2)$, the goal of the model is to predict $W_0$ from $(W_{-2},W_{-1},W_1,W_2)$.\n",
+    " - **Continuous skip-gram** is opposite to CBoW. The model uses surrounding window of context words to predict the current word.\n",
+    "\n",
+    "CBoW is faster, while skip-gram is slower, but does a better job of representing infrequent words.\n",
+    "\n",
+    "![Image showing both CBoW and Skip-Gram algorithms to convert words to vectors.](./images/example-algorithms-for-converting-words-to-vectors.png)\n",
+    "\n",
+    "To experiment with word2vec embedding pre-trained on Google News dataset, we can use **gensim** library. Below we find the words most similar to 'neural'\n",
+    "\n",
+    "> **Note:** When you first create word vectors, downloading them can take some time!"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import gensim.downloader as api\n",
+    "w2v = api.load('word2vec-google-news-300')"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "neuronal -> 0.780479907989502\n",
+      "neurons -> 0.7326500415802002\n",
+      "neural_circuits -> 0.7252851128578186\n",
+      "neuron -> 0.7174385190010071\n",
+      "cortical -> 0.6941086053848267\n",
+      "brain_circuitry -> 0.6923245787620544\n",
+      "synaptic -> 0.6699119210243225\n",
+      "neural_circuitry -> 0.6638563275337219\n",
+      "neurochemical -> 0.6555314064025879\n",
+      "neuronal_activity -> 0.6531826257705688\n"
+     ]
+    }
+   ],
+   "source": [
+    "for w,p in w2v.most_similar('neural'):\n",
+    "    print(f\"{w} -> {p}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We can also extract vector embeddings from the word, to be used in training classification model (we only show first 20 components of the vector for clarity):"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "array([ 0.01226807,  0.06225586,  0.10693359,  0.05810547,  0.23828125,\n",
+       "        0.03686523,  0.05151367, -0.20703125,  0.01989746,  0.10058594,\n",
+       "       -0.03759766, -0.1015625 , -0.15820312, -0.08105469, -0.0390625 ,\n",
+       "       -0.05053711,  0.16015625,  0.2578125 ,  0.10058594, -0.25976562],\n",
+       "      dtype=float32)"
+      ]
+     },
+     "execution_count": 10,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "w2v.word_vec('play')[:20]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Great thing about semantical embeddings is that you can manipulate vector encoding to change the semantics. For example, we can ask to find a word, whose vector representation would be as close as possible to words *king* and *woman*, and as far away from the word *man*:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 11,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "('queen', 0.7118192911148071)"
+      ]
+     },
+     "execution_count": 11,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "w2v.most_similar(positive=['king','woman'],negative=['man'])[0]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Both CBOW and Skip-Grams are 鈥減redictive鈥� embeddings, in that they only take local contexts into account. Word2Vec does not take advantage of global context. \n",
+    "\n",
+    "**FastText**, builds on Word2Vec by learning vector representations for each word and the charachter n-grams found within each word. The values of the representations are then averaged into one vector at each training step. While this adds a lot of additional computation to pre-training it enables word embeddings to encode sub-word information. \n",
+    "\n",
+    "Another method, **GloVe**, leverages the idea of co-occurence matrix, uses neural methods to decompose co-occurrence matrix into more expressive and non linear word vectors.\n",
+    "\n",
+    "You can play with the example by changing embeddings to FastText and GloVe, since gensim supports "
+   ]
+  },
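+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "As a small optional sketch (the model names below are standard gensim-data identifiers, and the downloads are large, so treat this as an illustration rather than part of the main pipeline), you could load GloVe or FastText vectors through the same gensim downloader and repeat the similarity query:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Optional sketch: gensim's downloader also provides GloVe and FastText vectors.\n",
+    "# The downloads are large, so this cell may take a while on first run.\n",
+    "import gensim.downloader as api\n",
+    "glove = api.load('glove-wiki-gigaword-100')\n",
+    "# ft = api.load('fasttext-wiki-news-subwords-300')   # FastText variant, even larger\n",
+    "glove.most_similar('neural')[:5]"
+   ]
+  },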
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Using Pre-Trained Embeddings in PyTorch\n",
+    "\n",
+    "We can modify the example above to pre-populate the matrix in our embedding layer with semantical embeddings, such as Word2Vec. We need to take into account that vocabularies of pre-trained embedding and our text corpus will likely not match, so we will initialize weights for the missing words with random values:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 12,
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Embedding size: 300\n",
+      "Populating matrix, this will take some time...Done, found 41080 words, 54732 words missing\n"
+     ]
+    }
+   ],
+   "source": [
+    "embed_size = len(w2v.get_vector('hello'))\n",
+    "print(f'Embedding size: {embed_size}')\n",
+    "\n",
+    "net = EmbedClassifier(vocab_size,embed_size,len(classes))\n",
+    "\n",
+    "print('Populating matrix, this will take some time...',end='')\n",
+    "found, not_found = 0,0\n",
+    "for i,w in enumerate(vocab.itos):\n",
+    "    try:\n",
+    "        net.embedding.weight[i].data = torch.tensor(w2v.get_vector(w))\n",
+    "        found+=1\n",
+    "    except:\n",
+    "        net.embedding.weight[i].data = torch.normal(0.0,1.0,(embed_size,))\n",
+    "        not_found+=1\n",
+    "\n",
+    "print(f\"Done, found {found} words, {not_found} words missing\")\n",
+    "net = net.to(device)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Now let's train our model. Note that the time it takes to train the model is significantly larger than in the previous example, due to larger embedding layer size, and thus much higher number of parameters. Also, because of this, we may need to train our model on more examples if we want to avoid overfitting."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 13,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "3200: acc=0.63875\n",
+      "6400: acc=0.693125\n",
+      "9600: acc=0.7176041666666667\n",
+      "12800: acc=0.7321875\n",
+      "16000: acc=0.7454375\n",
+      "19200: acc=0.7559375\n",
+      "22400: acc=0.7631696428571428\n"
+     ]
+    },
+    {
+     "data": {
+      "text/plain": [
+       "(218.64081493921944, 0.7667146513115803)"
+      ]
+     },
+     "execution_count": 13,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "train_epoch_emb(net,train_loader, lr=4, epoch_size=25000)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "In our case we do not see huge increase in accuracy, which is likely to quite different vocalularies. \n",
+    "To overcome the problem of different vocabularies, we can use one of the following solutions:\n",
+    "* Re-train word2vec model on our vocabulary\n",
+    "* Load our dataset with the vocabulary from the pre-trained word2vec model. Vocabulary used to load the dataset can be specified during loading.\n",
+    "\n",
+    "The latter approach seems easiter, especially because PyTorch `torchtext` framework contains built-in support for embeddings. We can, for example, instantiate GloVe-based vocabulary in the following manner:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 14,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "vocab = torchtext.vocab.GloVe(name='6B', dim=50)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Loaded vocabulary has the following basic operations:\n",
+    "* `vocab.stoi` dictionary allows us to convert word into its dictionary index\n",
+    "* `vocab.itos` does the opposite - converts number into word\n",
+    "* `vocab.vectors` is the array of embedding vectors, so to get the embedding of a word `s` we need to use `vocab.vectors[vocab.stoi[s]]`\n",
+    "\n",
+    "Here is the example of manipulating embeddings to demonstrate the equation **kind-man+woman = queen** (I had to tweak the coefficient a bit to make it work):"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 15,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "'queen'"
+      ]
+     },
+     "execution_count": 15,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "# get the vector corresponding to kind-man+woman\n",
+    "qvec = vocab.vectors[vocab.stoi['king']]-vocab.vectors[vocab.stoi['man']]+1.3*vocab.vectors[vocab.stoi['woman']]\n",
+    "# find the index of the closest embedding vector \n",
+    "d = torch.sum((vocab.vectors-qvec)**2,dim=1)\n",
+    "min_idx = torch.argmin(d)\n",
+    "# find the corresponding word\n",
+    "vocab.itos[min_idx]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "To train the classifier using those embeddings, we first need to encode our dataset using GloVe vocabulary:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 16,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def offsetify(b):\n",
+    "    # first, compute data tensor from all sequences\n",
+    "    x = [torch.tensor(encode(t[1],voc=vocab)) for t in b] # pass the instance of vocab to encode function!\n",
+    "    # now, compute the offsets by accumulating the tensor of sequence lengths\n",
+    "    o = [0] + [len(t) for t in x]\n",
+    "    o = torch.tensor(o[:-1]).cumsum(dim=0)\n",
+    "    return ( \n",
+    "        torch.LongTensor([t[0]-1 for t in b]), # labels\n",
+    "        torch.cat(x), # text \n",
+    "        o\n",
+    "    )"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "As we have seen above, all vector embeddings are stored in `vocab.vectors` matrix. It makes it super-easy to load those weights into weights of embedding layer using simple copying:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 17,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "net = EmbedClassifier(len(vocab),len(vocab.vectors[0]),len(classes))\n",
+    "net.embedding.weight.data = vocab.vectors\n",
+    "net = net.to(device)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Now let's train our model and see if we get better results:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 18,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "3200: acc=0.6271875\n",
+      "6400: acc=0.68078125\n",
+      "9600: acc=0.7030208333333333\n",
+      "12800: acc=0.71984375\n",
+      "16000: acc=0.7346875\n",
+      "19200: acc=0.7455729166666667\n",
+      "22400: acc=0.7529464285714286\n"
+     ]
+    },
+    {
+     "data": {
+      "text/plain": [
+       "(35.53972978646833, 0.7575175943698017)"
+      ]
+     },
+     "execution_count": 18,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=16, collate_fn=offsetify, shuffle=True)\n",
+    "train_epoch_emb(net,train_loader, lr=4, epoch_size=25000)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "One of the reasons we are not seeing significant increase in accuracy is due to the fact that some words from our dataset are missing in the pre-trained GloVe vocabulary, and thus they are essentially ignored. To overcome this fact, we can train our own embeddings on our dataset. \n",
+    "\n",
+    "\n",
+    "## Training your own embeddings\n",
+    "\n",
+    "In our examples, we have been using pre-trained semantic embeddings, but it is interesting to see how those embeddings can be trained using either CBoW, or Skip-gram architectures. This exercise goes beyond this module, but those interested might want to check out this [official PyTorch tutorial on Language Modeling](https://pytorch.org/tutorials/beginner/nlp/word_embeddings_tutorial.html). Also, **gensim** framework can be used to train most commonly used embeddings in a few lines of code, as described [in this documentation](https://pytorch.org/tutorials/beginner/nlp/word_embeddings_tutorial.html)."
+   ]
+  },
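+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Just to illustrate the point, here is a minimal sketch (assuming gensim 4.x and a tiny toy corpus, not our AG News pipeline) of how a skip-gram Word2Vec model can be trained in a few lines:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from gensim.models import Word2Vec\n",
+    "\n",
+    "# toy corpus: a list of tokenized sentences\n",
+    "toy_corpus = [\n",
+    "    ['neural', 'networks', 'learn', 'word', 'representations'],\n",
+    "    ['word', 'embeddings', 'capture', 'semantic', 'similarity'],\n",
+    "    ['skip', 'gram', 'predicts', 'context', 'words', 'from', 'the', 'current', 'word']\n",
+    "]\n",
+    "\n",
+    "# sg=1 selects skip-gram, sg=0 would select CBoW\n",
+    "toy_w2v = Word2Vec(toy_corpus, vector_size=32, window=2, min_count=1, sg=1)\n",
+    "toy_w2v.wv['word'][:5]  # first few components of the learned vector for 'word'"
+   ]
+  },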
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Contextual Embeddings\n",
+    "\n",
+    "One key limitation of tradition pretrained embedding representaitons such as Word2Vec is the problem of word sense disambigioution. While pretrained embeddings can capture some of the meaning of words in context, every possible meaning of a word is encoded into the same embedding. This can cause problems in downstream models, since many words such as the word 'play' have different meanings depending on the context they are used in.\n",
+    "\n",
+    "For example word 'play' in those two different sentences have quite different meaning:\n",
+    "- I went to a **play** at the theature.\n",
+    "- John wants to **play** with his friends.\n",
+    "\n",
+    "The pretrained embeddings above represent both of these meanings of the word 'play' in the same embedding. To overcome this limitation, we need to build embeddings based on the **language model**, which is trained on a large corpus of text, and *knows* how words can be put together in different contexts. Discussing contextual embeddings is out of scope for this tutorial, but we will come back to them when talking about language models in the next unit.\n"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "py37_pytorch",
+   "language": "python",
+   "name": "conda-env-py37_pytorch-py"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.7.7"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
diff --git a/5-NLP/14-Embeddings/EmbeddingsTF.ipynb b/5-NLP/14-Embeddings/EmbeddingsTF.ipynb
new file mode 100644
index 0000000000000000000000000000000000000000..ec520127506b521822665a561d789df415aab602
--- /dev/null
+++ b/5-NLP/14-Embeddings/EmbeddingsTF.ipynb
@@ -0,0 +1,1100 @@
+{
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "source": [
+        "## Embeddings\n",
+        "\n",
+        "In our previous example, we operated on high-dimensional bag-of-words vectors with length `vocab_size`, and we explicitly converted low-dimensional positional representation vectors into sparse one-hot representation. This one-hot representation is not memory-efficient. In addition, each word is treated independently from each other, so one-hot encoded vectors don't express semantic similarities between words.\n",
+        "\n",
+        "In this unit, we will continue exploring the **News AG** dataset. To begin, let's load the data and get some definitions from the previous unit."
+      ],
+      "metadata": {}
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "import sys\n",
+        "!{sys.executable} -m pip install --quiet tensorflow_datasets==4.4.0\n",
+        "!cd ~ && wget -q -O - https://mslearntensorflowlp.blob.core.windows.net/data/tfds-ag-news.tgz | tar xz"
+      ],
+      "outputs": [],
+      "execution_count": 2,
+      "metadata": {}
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "import tensorflow as tf\n",
+        "from tensorflow import keras\n",
+        "import tensorflow_datasets as tfds\n",
+        "import numpy as np\n",
+        "\n",
+        "# In this tutorial, we will be training a lot of models. In order to use GPU memory cautiously,\n",
+        "# we will set tensorflow option to grow GPU memory allocation when required.\n",
+        "physical_devices = tf.config.list_physical_devices('GPU') \n",
+        "if len(physical_devices)>0:\n",
+        "    tf.config.experimental.set_memory_growth(physical_devices[0], True)\n",
+        "\n",
+        "ds_train, ds_test = tfds.load('ag_news_subset').values()"
+      ],
+      "outputs": [],
+      "execution_count": 3,
+      "metadata": {}
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "\n",
+        "### What's an embedding?\n",
+        "\n",
+        "The idea of **embedding** is to represent words using lower-dimensional dense vectors that reflect the semantic meaning of the word. We will later discuss how to build meaningful word embeddings, but for now let's just think of embeddings as a way to reduce the dimensionality of a word vector. \n",
+        "\n",
+        "So, an embedding layer takes a word as input, and produces an output vector of specified `embedding_size`. In a sense, it is very similar to a `Dense` layer, but instead of taking a one-hot encoded vector as input, it's able to take a word number.\n",
+        "\n",
+        "By using an embedding layer as the first layer in our network, we can switch from bag-or-words to an **embedding bag** model, where we first convert each word in our text into the corresponding embedding, and then compute some aggregate function over all those embeddings, such as `sum`, `average` or `max`.  \n",
+        "\n",
+        "![Image showing an embedding classifier for five sequence words.](images/embedding-classifier-example.png)\n",
+        "\n",
+        "Our classifier neural network consists of the following layers:\n",
+        "\n",
+        "* `TextVectorization` layer, which takes a string as input, and produces a tensor of token numbers. We will specify some reasonable vocabulary size `vocab_size`, and ignore less-frequently used words. The input shape will be 1, and the output shape will be $n$, since we'll get $n$ tokens as a result, each of them containing numbers from 0 to `vocab_size`.\n",
+        "* `Embedding` layer, which takes $n$ numbers, and reduces each number to a dense vector of a given length (100 in our example). Thus, the input tensor of shape $n$ will be transformed into an $n\\times 100$ tensor. \n",
+        "* Aggregation layer, which takes the average of this tensor along the first axis, i.e. it will compute the average of all $n$ input tensors corresponding to different words. To implement this layer, we will use a `Lambda` layer, and pass into it the function to compute the average. The output will have shape of 100, and it will be the numeric representation of the whole input sequence.\n",
+        "* Final `Dense` linear classifier."
+      ],
+      "metadata": {}
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "vocab_size = 30000\n",
+        "batch_size = 128\n",
+        "\n",
+        "vectorizer = keras.layers.experimental.preprocessing.TextVectorization(max_tokens=vocab_size,input_shape=(1,))\n",
+        "\n",
+        "model = keras.models.Sequential([\n",
+        "    vectorizer,    \n",
+        "    keras.layers.Embedding(vocab_size,100),\n",
+        "    keras.layers.Lambda(lambda x: tf.reduce_mean(x,axis=1)),\n",
+        "    keras.layers.Dense(4, activation='softmax')\n",
+        "])\n",
+        "model.summary()"
+      ],
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stdout",
+          "text": [
+            "Model: \"sequential_1\"\n",
+            "_________________________________________________________________\n",
+            "Layer (type)                 Output Shape              Param #   \n",
+            "=================================================================\n",
+            "text_vectorization_1 (TextVe (None, None)              0         \n",
+            "_________________________________________________________________\n",
+            "embedding_1 (Embedding)      (None, None, 100)         3000000   \n",
+            "_________________________________________________________________\n",
+            "lambda_1 (Lambda)            (None, 100)               0         \n",
+            "_________________________________________________________________\n",
+            "dense_1 (Dense)              (None, 4)                 404       \n",
+            "=================================================================\n",
+            "Total params: 3,000,404\n",
+            "Trainable params: 3,000,404\n",
+            "Non-trainable params: 0\n",
+            "_________________________________________________________________\n"
+          ]
+        }
+      ],
+      "execution_count": 6,
+      "metadata": {}
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "In the `summary` printout, in the **output shape** column, the first tensor dimension `None` corresponds to the minibatch size, and the second corresponds to the length of the token sequence. All token sequences in the minibatch have different lengths. We'll discuss how to deal with it in the next section.\n",
+        "\n",
+        "Now let's train the network:"
+      ],
+      "metadata": {}
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "def extract_text(x):\n",
+        "    return x['title']+' '+x['description']\n",
+        "\n",
+        "def tupelize(x):\n",
+        "    return (extract_text(x),x['label'])\n",
+        "\n",
+        "print(\"Training vectorizer\")\n",
+        "vectorizer.adapt(ds_train.take(500).map(extract_text))\n",
+        "\n",
+        "model.compile(loss='sparse_categorical_crossentropy',metrics=['acc'])\n",
+        "model.fit(ds_train.map(tupelize).batch(batch_size),validation_data=ds_test.map(tupelize).batch(batch_size))"
+      ],
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stdout",
+          "text": [
+            "Training vectorizer\n",
+            "938/938 [==============================] - 12s 13ms/step - loss: 0.7953 - acc: 0.8113 - val_loss: 0.4496 - val_acc: 0.8657\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\n"
+          ]
+        },
+        {
+          "output_type": "execute_result",
+          "execution_count": 7,
+          "data": {
+            "text/plain": "<tensorflow.python.keras.callbacks.History at 0x7f0f647e5490>"
+          },
+          "metadata": {}
+        }
+      ],
+      "execution_count": 7,
+      "metadata": {}
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "> **Note** that we are building vectorizer based on a subset of the data. This is done in order to speed up the process, and it might result in a situation when not all tokens from our text is present in the vocabulary. In this case, those tokens would be ignored, which may result in slightly lower accuracy. However, in real life a subset of text often gives a good vocabulary estimation."
+      ],
+      "metadata": {
+        "nteract": {
+          "transient": {
+            "deleting": false
+          }
+        }
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "### Dealing with variable sequence sizes\n",
+        "\n",
+        "Let's understand how training happens in minibatches. In the example above, the input tensor has dimension 1, and we use 128-long minibatches, so that actual size of the tensor is $128 \\times 1$. However, the number of tokens in each sentence is different. If we apply the `TextVectorization` layer to a single input, the number of tokens returned is different, depending on how the text is tokenized:"
+      ],
+      "metadata": {}
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "print(vectorizer('Hello, world!'))\n",
+        "print(vectorizer('I am glad to meet you!'))"
+      ],
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stdout",
+          "text": [
+            "tf.Tensor([ 1 45], shape=(2,), dtype=int64)\n",
+            "tf.Tensor([ 112 1271    1    3 1747  158], shape=(6,), dtype=int64)\n"
+          ]
+        }
+      ],
+      "execution_count": 8,
+      "metadata": {}
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "However, when we apply the vectorizer to several sequences, it has to produce a tensor of rectangular shape, so it fills unused elements with the PAD token (which in our case is zero):"
+      ],
+      "metadata": {}
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "vectorizer(['Hello, world!','I am glad to meet you!'])"
+      ],
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "execution_count": 9,
+          "data": {
+            "text/plain": "<tf.Tensor: shape=(2, 6), dtype=int64, numpy=\narray([[   1,   45,    0,    0,    0,    0],\n       [ 112, 1271,    1,    3, 1747,  158]])>"
+          },
+          "metadata": {}
+        }
+      ],
+      "execution_count": 9,
+      "metadata": {}
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "Here we can see the embeddings:"
+      ],
+      "metadata": {}
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "model.layers[1](vectorizer(['Hello, world!','I am glad to meet you!'])).numpy()"
+      ],
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "execution_count": 10,
+          "data": {
+            "text/plain": "array([[[-0.02485236, -0.00416857, -0.06599288, ..., -0.02404598,\n          0.03529833, -0.02100844],\n        [ 0.22493948,  0.01383338,  0.12420551, ...,  0.19531338,\n          0.13524376,  0.04216914],\n        [ 0.04510409,  0.00708018, -0.0310419 , ..., -0.0188726 ,\n         -0.0179676 , -0.04813331],\n        [ 0.04510409,  0.00708018, -0.0310419 , ..., -0.0188726 ,\n         -0.0179676 , -0.04813331],\n        [ 0.04510409,  0.00708018, -0.0310419 , ..., -0.0188726 ,\n         -0.0179676 , -0.04813331],\n        [ 0.04510409,  0.00708018, -0.0310419 , ..., -0.0188726 ,\n         -0.0179676 , -0.04813331]],\n\n       [[-0.00226152, -0.0972852 , -0.00063103, ...,  0.00504377,\n          0.22460397,  0.1497297 ],\n        [-0.15621698, -0.13758421, -0.02889572, ..., -0.02577994,\n          0.03472563,  0.08767739],\n        [-0.02485236, -0.00416857, -0.06599288, ..., -0.02404598,\n          0.03529833, -0.02100844],\n        [-0.06490357, -0.08200071, -0.06175491, ..., -0.02477042,\n         -0.06802022, -0.01040947],\n        [ 0.03279151,  0.12563369,  0.06062867, ..., -0.04349922,\n         -0.12154414, -0.12533969],\n        [-0.14435016, -0.304014  , -0.00378676, ...,  0.05609043,\n          0.20370889,  0.28518862]]], dtype=float32)"
+          },
+          "metadata": {}
+        }
+      ],
+      "execution_count": 10,
+      "metadata": {}
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "> **Note**: To minimize the amount of padding, in some cases it makes sense to sort all sequences in the dataset in the order of increasing length (or, more precisely, number of tokens). This will ensure that each minibatch contains sequences of similar length."
+      ],
+      "metadata": {}
+    },
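+    {
+      "cell_type": "markdown",
+      "source": [
+        "Just to make that idea concrete, here is a tiny sketch in plain Python (it is not wired into the `tf.data` pipeline above): sorting sequences by token count before grouping them into batches keeps the padding within each batch small."
+      ],
+      "metadata": {}
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "# toy token sequences of very different lengths\n",
+        "toy_seqs = [[1, 45], [112, 1271, 1, 3, 1747, 158], [7], [5, 8, 13]]\n",
+        "\n",
+        "# sort by length, then group into batches of 2\n",
+        "toy_sorted = sorted(toy_seqs, key=len)\n",
+        "toy_batches = [toy_sorted[i:i+2] for i in range(0, len(toy_sorted), 2)]\n",
+        "toy_batches  # sequences of similar length end up in the same batch"
+      ],
+      "outputs": [],
+      "execution_count": null,
+      "metadata": {}
+    },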
+    {
+      "cell_type": "markdown",
+      "source": [
+        "\n",
+        "## Semantic embeddings: Word2Vec\n",
+        "\n",
+        "In our previous example, the embedding layer learned to map words to vector representations, however, these representations did not have semantic meaning. It would be nice to learn a vector representation such that similar words or synonyms correspond to vectors that are close to each other in terms of some vector distance (for example euclidian distance).\n",
+        "\n",
+        "To do that, we need to pretrain our embedding model on a large collection of text using a technique such as [Word2Vec](https://en.wikipedia.org/wiki/Word2vec). It's based on two main architectures that are used to produce a distributed representation of words:\n",
+        "\n",
+        " - **Continuous bag-of-words** (CBoW), where we train the model to predict a word from the surrounding context. Given the ngram $(W_{-2},W_{-1},W_0,W_1,W_2)$, the goal of the model is to predict $W_0$ from $(W_{-2},W_{-1},W_1,W_2)$.\n",
+        " - **Continuous skip-gram** is opposite to CBoW. The model uses the surrounding window of context words to predict the current word.\n",
+        "\n",
+        "CBoW is faster, and while skip-gram is slower, it does a better job of representing infrequent words.\n",
+        "\n",
+        "![Image showing both CBoW and Skip-Gram algorithms to convert words to vectors.](images/example-algorithms-for-converting-words-to-vectors.png)\n",
+        "\n",
+        "To experiment with the Word2Vec embedding pretrained on Google News dataset, we can use the **gensim** library. Below we find the words most similar to 'neural'.\n",
+        "\n",
+        "> **Note:** When you first create word vectors, downloading them can take some time!"
+      ],
+      "metadata": {}
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "import gensim.downloader as api\n",
+        "w2v = api.load('word2vec-google-news-300')"
+      ],
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stdout",
+          "text": [
+            "[==================================================] 100.0% 1662.8/1662.8MB downloaded\n"
+          ]
+        },
+        {
+          "output_type": "stream",
+          "name": "stderr",
+          "text": [
+            "IOPub message rate exceeded.\n",
+            "The notebook server will temporarily stop sending output\n",
+            "to the client in order to avoid crashing it.\n",
+            "To change this limit, set the config variable\n",
+            "`--NotebookApp.iopub_msg_rate_limit`.\n",
+            "\n",
+            "Current values:\n",
+            "NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)\n",
+            "NotebookApp.rate_limit_window=3.0 (secs)\n",
+            "\n",
+            "IOPub message rate exceeded.\n",
+            "The notebook server will temporarily stop sending output\n",
+            "to the client in order to avoid crashing it.\n",
+            "To change this limit, set the config variable\n",
+            "`--NotebookApp.iopub_msg_rate_limit`.\n",
+            "\n",
+            "Current values:\n",
+            "NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)\n",
+            "NotebookApp.rate_limit_window=3.0 (secs)\n",
+            "\n",
+            "IOPub message rate exceeded.\n",
+            "The notebook server will temporarily stop sending output\n",
+            "to the client in order to avoid crashing it.\n",
+            "To change this limit, set the config variable\n",
+            "`--NotebookApp.iopub_msg_rate_limit`.\n",
+            "\n",
+            "Current values:\n",
+            "NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)\n",
+            "NotebookApp.rate_limit_window=3.0 (secs)\n",
+            "\n",
+            "IOPub message rate exceeded.\n",
+            "The notebook server will temporarily stop sending output\n",
+            "to the client in order to avoid crashing it.\n",
+            "To change this limit, set the config variable\n",
+            "`--NotebookApp.iopub_msg_rate_limit`.\n",
+            "\n",
+            "Current values:\n",
+            "NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)\n",
+            "NotebookApp.rate_limit_window=3.0 (secs)\n",
+            "\n",
+            "IOPub message rate exceeded.\n",
+            "The notebook server will temporarily stop sending output\n",
+            "to the client in order to avoid crashing it.\n",
+            "To change this limit, set the config variable\n",
+            "`--NotebookApp.iopub_msg_rate_limit`.\n",
+            "\n",
+            "Current values:\n",
+            "NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)\n",
+            "NotebookApp.rate_limit_window=3.0 (secs)\n",
+            "\n",
+            "IOPub message rate exceeded.\n",
+            "The notebook server will temporarily stop sending output\n",
+            "to the client in order to avoid crashing it.\n",
+            "To change this limit, set the config variable\n",
+            "`--NotebookApp.iopub_msg_rate_limit`.\n",
+            "\n",
+            "Current values:\n",
+            "NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)\n",
+            "NotebookApp.rate_limit_window=3.0 (secs)\n",
+            "\n",
+            "IOPub message rate exceeded.\n",
+            "The notebook server will temporarily stop sending output\n",
+            "to the client in order to avoid crashing it.\n",
+            "To change this limit, set the config variable\n",
+            "`--NotebookApp.iopub_msg_rate_limit`.\n",
+            "\n",
+            "Current values:\n",
+            "NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)\n",
+            "NotebookApp.rate_limit_window=3.0 (secs)\n",
+            "\n",
+            "IOPub message rate exceeded.\n",
+            "The notebook server will temporarily stop sending output\n",
+            "to the client in order to avoid crashing it.\n",
+            "To change this limit, set the config variable\n",
+            "`--NotebookApp.iopub_msg_rate_limit`.\n",
+            "\n",
+            "Current values:\n",
+            "NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)\n",
+            "NotebookApp.rate_limit_window=3.0 (secs)\n",
+            "\n",
+            "IOPub message rate exceeded.\n",
+            "The notebook server will temporarily stop sending output\n",
+            "to the client in order to avoid crashing it.\n",
+            "To change this limit, set the config variable\n",
+            "`--NotebookApp.iopub_msg_rate_limit`.\n",
+            "\n",
+            "Current values:\n",
+            "NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)\n",
+            "NotebookApp.rate_limit_window=3.0 (secs)\n",
+            "\n",
+            "IOPub message rate exceeded.\n",
+            "The notebook server will temporarily stop sending output\n",
+            "to the client in order to avoid crashing it.\n",
+            "To change this limit, set the config variable\n",
+            "`--NotebookApp.iopub_msg_rate_limit`.\n",
+            "\n",
+            "Current values:\n",
+            "NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)\n",
+            "NotebookApp.rate_limit_window=3.0 (secs)\n",
+            "\n",
+            "IOPub message rate exceeded.\n",
+            "The notebook server will temporarily stop sending output\n",
+            "to the client in order to avoid crashing it.\n",
+            "To change this limit, set the config variable\n",
+            "`--NotebookApp.iopub_msg_rate_limit`.\n",
+            "\n",
+            "Current values:\n",
+            "NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)\n",
+            "NotebookApp.rate_limit_window=3.0 (secs)\n",
+            "\n",
+            "IOPub message rate exceeded.\n",
+            "The notebook server will temporarily stop sending output\n",
+            "to the client in order to avoid crashing it.\n",
+            "To change this limit, set the config variable\n",
+            "`--NotebookApp.iopub_msg_rate_limit`.\n",
+            "\n",
+            "Current values:\n",
+            "NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)\n",
+            "NotebookApp.rate_limit_window=3.0 (secs)\n",
+            "\n",
+            "IOPub message rate exceeded.\n",
+            "The notebook server will temporarily stop sending output\n",
+            "to the client in order to avoid crashing it.\n",
+            "To change this limit, set the config variable\n",
+            "`--NotebookApp.iopub_msg_rate_limit`.\n",
+            "\n",
+            "Current values:\n",
+            "NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)\n",
+            "NotebookApp.rate_limit_window=3.0 (secs)\n",
+            "\n",
+            "IOPub message rate exceeded.\n",
+            "The notebook server will temporarily stop sending output\n",
+            "to the client in order to avoid crashing it.\n",
+            "To change this limit, set the config variable\n",
+            "`--NotebookApp.iopub_msg_rate_limit`.\n",
+            "\n",
+            "Current values:\n",
+            "NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)\n",
+            "NotebookApp.rate_limit_window=3.0 (secs)\n",
+            "\n",
+            "IOPub message rate exceeded.\n",
+            "The notebook server will temporarily stop sending output\n",
+            "to the client in order to avoid crashing it.\n",
+            "To change this limit, set the config variable\n",
+            "`--NotebookApp.iopub_msg_rate_limit`.\n",
+            "\n",
+            "Current values:\n",
+            "NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)\n",
+            "NotebookApp.rate_limit_window=3.0 (secs)\n",
+            "\n",
+            "IOPub message rate exceeded.\n",
+            "The notebook server will temporarily stop sending output\n",
+            "to the client in order to avoid crashing it.\n",
+            "To change this limit, set the config variable\n",
+            "`--NotebookApp.iopub_msg_rate_limit`.\n",
+            "\n",
+            "Current values:\n",
+            "NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)\n",
+            "NotebookApp.rate_limit_window=3.0 (secs)\n",
+            "\n",
+            "IOPub message rate exceeded.\n",
+            "The notebook server will temporarily stop sending output\n",
+            "to the client in order to avoid crashing it.\n",
+            "To change this limit, set the config variable\n",
+            "`--NotebookApp.iopub_msg_rate_limit`.\n",
+            "\n",
+            "Current values:\n",
+            "NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)\n",
+            "NotebookApp.rate_limit_window=3.0 (secs)\n",
+            "\n",
+            "IOPub message rate exceeded.\n",
+            "The notebook server will temporarily stop sending output\n",
+            "to the client in order to avoid crashing it.\n",
+            "To change this limit, set the config variable\n",
+            "`--NotebookApp.iopub_msg_rate_limit`.\n",
+            "\n",
+            "Current values:\n",
+            "NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)\n",
+            "NotebookApp.rate_limit_window=3.0 (secs)\n",
+            "\n",
+            "IOPub message rate exceeded.\n",
+            "The notebook server will temporarily stop sending output\n",
+            "to the client in order to avoid crashing it.\n",
+            "To change this limit, set the config variable\n",
+            "`--NotebookApp.iopub_msg_rate_limit`.\n",
+            "\n",
+            "Current values:\n",
+            "NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)\n",
+            "NotebookApp.rate_limit_window=3.0 (secs)\n",
+            "\n",
+            "IOPub message rate exceeded.\n",
+            "The notebook server will temporarily stop sending output\n",
+            "to the client in order to avoid crashing it.\n",
+            "To change this limit, set the config variable\n",
+            "`--NotebookApp.iopub_msg_rate_limit`.\n",
+            "\n",
+            "Current values:\n",
+            "NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)\n",
+            "NotebookApp.rate_limit_window=3.0 (secs)\n",
+            "\n",
+            "IOPub message rate exceeded.\n",
+            "The notebook server will temporarily stop sending output\n",
+            "to the client in order to avoid crashing it.\n",
+            "To change this limit, set the config variable\n",
+            "`--NotebookApp.iopub_msg_rate_limit`.\n",
+            "\n",
+            "Current values:\n",
+            "NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)\n",
+            "NotebookApp.rate_limit_window=3.0 (secs)\n",
+            "\n",
+            "IOPub message rate exceeded.\n",
+            "The notebook server will temporarily stop sending output\n",
+            "to the client in order to avoid crashing it.\n",
+            "To change this limit, set the config variable\n",
+            "`--NotebookApp.iopub_msg_rate_limit`.\n",
+            "\n",
+            "Current values:\n",
+            "NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)\n",
+            "NotebookApp.rate_limit_window=3.0 (secs)\n",
+            "\n",
+            "IOPub message rate exceeded.\n",
+            "The notebook server will temporarily stop sending output\n",
+            "to the client in order to avoid crashing it.\n",
+            "To change this limit, set the config variable\n",
+            "`--NotebookApp.iopub_msg_rate_limit`.\n",
+            "\n",
+            "Current values:\n",
+            "NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)\n",
+            "NotebookApp.rate_limit_window=3.0 (secs)\n",
+            "\n",
+            "IOPub message rate exceeded.\n",
+            "The notebook server will temporarily stop sending output\n",
+            "to the client in order to avoid crashing it.\n",
+            "To change this limit, set the config variable\n",
+            "`--NotebookApp.iopub_msg_rate_limit`.\n",
+            "\n",
+            "Current values:\n",
+            "NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)\n",
+            "NotebookApp.rate_limit_window=3.0 (secs)\n",
+            "\n",
+            "IOPub message rate exceeded.\n",
+            "The notebook server will temporarily stop sending output\n",
+            "to the client in order to avoid crashing it.\n",
+            "To change this limit, set the config variable\n",
+            "`--NotebookApp.iopub_msg_rate_limit`.\n",
+            "\n",
+            "Current values:\n",
+            "NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)\n",
+            "NotebookApp.rate_limit_window=3.0 (secs)\n",
+            "\n",
+            "IOPub message rate exceeded.\n",
+            "The notebook server will temporarily stop sending output\n",
+            "to the client in order to avoid crashing it.\n",
+            "To change this limit, set the config variable\n",
+            "`--NotebookApp.iopub_msg_rate_limit`.\n",
+            "\n",
+            "Current values:\n",
+            "NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)\n",
+            "NotebookApp.rate_limit_window=3.0 (secs)\n",
+            "\n",
+            "IOPub message rate exceeded.\n",
+            "The notebook server will temporarily stop sending output\n",
+            "to the client in order to avoid crashing it.\n",
+            "To change this limit, set the config variable\n",
+            "`--NotebookApp.iopub_msg_rate_limit`.\n",
+            "\n",
+            "Current values:\n",
+            "NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)\n",
+            "NotebookApp.rate_limit_window=3.0 (secs)\n",
+            "\n",
+            "IOPub message rate exceeded.\n",
+            "The notebook server will temporarily stop sending output\n",
+            "to the client in order to avoid crashing it.\n",
+            "To change this limit, set the config variable\n",
+            "`--NotebookApp.iopub_msg_rate_limit`.\n",
+            "\n",
+            "Current values:\n",
+            "NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)\n",
+            "NotebookApp.rate_limit_window=3.0 (secs)\n",
+            "\n",
+            "IOPub message rate exceeded.\n",
+            "The notebook server will temporarily stop sending output\n",
+            "to the client in order to avoid crashing it.\n",
+            "To change this limit, set the config variable\n",
+            "`--NotebookApp.iopub_msg_rate_limit`.\n",
+            "\n",
+            "Current values:\n",
+            "NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)\n",
+            "NotebookApp.rate_limit_window=3.0 (secs)\n",
+            "\n",
+            "IOPub message rate exceeded.\n",
+            "The notebook server will temporarily stop sending output\n",
+            "to the client in order to avoid crashing it.\n",
+            "To change this limit, set the config variable\n",
+            "`--NotebookApp.iopub_msg_rate_limit`.\n",
+            "\n",
+            "Current values:\n",
+            "NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)\n",
+            "NotebookApp.rate_limit_window=3.0 (secs)\n",
+            "\n",
+            "IOPub message rate exceeded.\n",
+            "The notebook server will temporarily stop sending output\n",
+            "to the client in order to avoid crashing it.\n",
+            "To change this limit, set the config variable\n",
+            "`--NotebookApp.iopub_msg_rate_limit`.\n",
+            "\n",
+            "Current values:\n",
+            "NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)\n",
+            "NotebookApp.rate_limit_window=3.0 (secs)\n",
+            "\n",
+            "IOPub message rate exceeded.\n",
+            "The notebook server will temporarily stop sending output\n",
+            "to the client in order to avoid crashing it.\n",
+            "To change this limit, set the config variable\n",
+            "`--NotebookApp.iopub_msg_rate_limit`.\n",
+            "\n",
+            "Current values:\n",
+            "NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)\n",
+            "NotebookApp.rate_limit_window=3.0 (secs)\n",
+            "\n",
+            "IOPub message rate exceeded.\n",
+            "The notebook server will temporarily stop sending output\n",
+            "to the client in order to avoid crashing it.\n",
+            "To change this limit, set the config variable\n",
+            "`--NotebookApp.iopub_msg_rate_limit`.\n",
+            "\n",
+            "Current values:\n",
+            "NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)\n",
+            "NotebookApp.rate_limit_window=3.0 (secs)\n",
+            "\n",
+            "IOPub message rate exceeded.\n",
+            "The notebook server will temporarily stop sending output\n",
+            "to the client in order to avoid crashing it.\n",
+            "To change this limit, set the config variable\n",
+            "`--NotebookApp.iopub_msg_rate_limit`.\n",
+            "\n",
+            "Current values:\n",
+            "NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)\n",
+            "NotebookApp.rate_limit_window=3.0 (secs)\n",
+            "\n",
+            "IOPub message rate exceeded.\n",
+            "The notebook server will temporarily stop sending output\n",
+            "to the client in order to avoid crashing it.\n",
+            "To change this limit, set the config variable\n",
+            "`--NotebookApp.iopub_msg_rate_limit`.\n",
+            "\n",
+            "Current values:\n",
+            "NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)\n",
+            "NotebookApp.rate_limit_window=3.0 (secs)\n",
+            "\n",
+            "IOPub message rate exceeded.\n",
+            "The notebook server will temporarily stop sending output\n",
+            "to the client in order to avoid crashing it.\n",
+            "To change this limit, set the config variable\n",
+            "`--NotebookApp.iopub_msg_rate_limit`.\n",
+            "\n",
+            "Current values:\n",
+            "NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)\n",
+            "NotebookApp.rate_limit_window=3.0 (secs)\n",
+            "\n",
+            "IOPub message rate exceeded.\n",
+            "The notebook server will temporarily stop sending output\n",
+            "to the client in order to avoid crashing it.\n",
+            "To change this limit, set the config variable\n",
+            "`--NotebookApp.iopub_msg_rate_limit`.\n",
+            "\n",
+            "Current values:\n",
+            "NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)\n",
+            "NotebookApp.rate_limit_window=3.0 (secs)\n",
+            "\n",
+            "IOPub message rate exceeded.\n",
+            "The notebook server will temporarily stop sending output\n",
+            "to the client in order to avoid crashing it.\n",
+            "To change this limit, set the config variable\n",
+            "`--NotebookApp.iopub_msg_rate_limit`.\n",
+            "\n",
+            "Current values:\n",
+            "NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)\n",
+            "NotebookApp.rate_limit_window=3.0 (secs)\n",
+            "\n",
+            "IOPub message rate exceeded.\n",
+            "The notebook server will temporarily stop sending output\n",
+            "to the client in order to avoid crashing it.\n",
+            "To change this limit, set the config variable\n",
+            "`--NotebookApp.iopub_msg_rate_limit`.\n",
+            "\n",
+            "Current values:\n",
+            "NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)\n",
+            "NotebookApp.rate_limit_window=3.0 (secs)\n",
+            "\n",
+            "IOPub message rate exceeded.\n",
+            "The notebook server will temporarily stop sending output\n",
+            "to the client in order to avoid crashing it.\n",
+            "To change this limit, set the config variable\n",
+            "`--NotebookApp.iopub_msg_rate_limit`.\n",
+            "\n",
+            "Current values:\n",
+            "NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)\n",
+            "NotebookApp.rate_limit_window=3.0 (secs)\n",
+            "\n",
+            "IOPub message rate exceeded.\n",
+            "The notebook server will temporarily stop sending output\n",
+            "to the client in order to avoid crashing it.\n",
+            "To change this limit, set the config variable\n",
+            "`--NotebookApp.iopub_msg_rate_limit`.\n",
+            "\n",
+            "Current values:\n",
+            "NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)\n",
+            "NotebookApp.rate_limit_window=3.0 (secs)\n",
+            "\n",
+            "IOPub message rate exceeded.\n",
+            "The notebook server will temporarily stop sending output\n",
+            "to the client in order to avoid crashing it.\n",
+            "To change this limit, set the config variable\n",
+            "`--NotebookApp.iopub_msg_rate_limit`.\n",
+            "\n",
+            "Current values:\n",
+            "NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)\n",
+            "NotebookApp.rate_limit_window=3.0 (secs)\n",
+            "\n",
+            "IOPub message rate exceeded.\n",
+            "The notebook server will temporarily stop sending output\n",
+            "to the client in order to avoid crashing it.\n",
+            "To change this limit, set the config variable\n",
+            "`--NotebookApp.iopub_msg_rate_limit`.\n",
+            "\n",
+            "Current values:\n",
+            "NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)\n",
+            "NotebookApp.rate_limit_window=3.0 (secs)\n",
+            "\n"
+          ]
+        }
+      ],
+      "execution_count": 11,
+      "metadata": {}
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "for w,p in w2v.most_similar('neural'):\n",
+        "    print(f\"{w} -> {p}\")"
+      ],
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stdout",
+          "text": [
+            "neuronal -> 0.7804799675941467\n",
+            "neurons -> 0.7326500415802002\n",
+            "neural_circuits -> 0.7252851724624634\n",
+            "neuron -> 0.7174385190010071\n",
+            "cortical -> 0.6941086649894714\n",
+            "brain_circuitry -> 0.6923246383666992\n",
+            "synaptic -> 0.6699118614196777\n",
+            "neural_circuitry -> 0.6638563275337219\n",
+            "neurochemical -> 0.6555314064025879\n",
+            "neuronal_activity -> 0.6531826257705688\n"
+          ]
+        }
+      ],
+      "execution_count": 12,
+      "metadata": {}
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "We can also extract the vector embedding from the word, to be used in training the classification model. The embedding has 300 components, but here we only show the first 20 components of the vector for clarity:"
+      ],
+      "metadata": {}
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "w2v['play'][:20]"
+      ],
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "execution_count": 13,
+          "data": {
+            "text/plain": "array([ 0.01226807,  0.06225586,  0.10693359,  0.05810547,  0.23828125,\n        0.03686523,  0.05151367, -0.20703125,  0.01989746,  0.10058594,\n       -0.03759766, -0.1015625 , -0.15820312, -0.08105469, -0.0390625 ,\n       -0.05053711,  0.16015625,  0.2578125 ,  0.10058594, -0.25976562],\n      dtype=float32)"
+          },
+          "metadata": {}
+        }
+      ],
+      "execution_count": 13,
+      "metadata": {}
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "The great thing about semantic embeddings is that you can manipulate the vector encoding based on semantics. For example, we can ask to find a word whose vector representation is as close as possible to the words *king* and *woman*, and as far as possible from the word *man*:"
+      ],
+      "metadata": {}
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "w2v.most_similar(positive=['king','woman'],negative=['man'])[0]"
+      ],
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "execution_count": 14,
+          "data": {
+            "text/plain": "('queen', 0.7118192911148071)"
+          },
+          "metadata": {}
+        }
+      ],
+      "execution_count": 14,
+      "metadata": {}
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "An example above uses some internal GenSym magic, but the underlying logic is actually quite simple. An interesting thing about embeddings is that you can perform normal vector operations on embedding vectors, and that would reflect operations on word **meanings**. The example above can be expressed in terms of vector operations: we calculate the vector corresponding to **KING-MAN+WOMAN** (operations `+` and `-` are performed on vector representations of corresponding words), and then find the closest word in the dictionary to that vector:"
+      ],
+      "metadata": {
+        "tags": []
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "# get the vector corresponding to kind-man+woman\n",
+        "qvec = w2v['king']-1.7*w2v['man']+1.7*w2v['woman']\n",
+        "# find the index of the closest embedding vector \n",
+        "d = np.sum((w2v.vectors-qvec)**2,axis=1)\n",
+        "min_idx = np.argmin(d)\n",
+        "# find the corresponding word\n",
+        "w2v.index2word[min_idx]"
+      ],
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "execution_count": 15,
+          "data": {
+            "text/plain": "'queen'"
+          },
+          "metadata": {}
+        }
+      ],
+      "execution_count": 15,
+      "metadata": {}
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "> **NOTE**: We had to add a small coefficients to *man* and *woman* vectors - try removing them to see what happens.\n",
+        "\n",
+        "To find the closest vector, we use TensorFlow machinery to compute a vector of distances between our vector and all vectors in the vocabulary, and then find the index of minimal word using `argmin`."
+      ],
+      "metadata": {}
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "While Word2Vec seems like a great way to express word semantics, it has many disadvantages, including the following:\n",
+        "\n",
+        "* Both CBoW and skip-gram models are **predictive embeddings**, and they only take local context into account. Word2Vec does not take advantage of global context.\n",
+        "* Word2Vec does not take into account word **morphology**, i.e. the fact that the meaning of the word can depend on different parts of the word, such as the root.  \n",
+        "\n",
+        "**FastText** tries to overcome the second limitation, and builds on Word2Vec by learning vector representations for each word and the charachter n-grams found within each word. The values of the representations are then averaged into one vector at each training step. While this adds a lot of additional computation to pretraining, it enables word embeddings to encode sub-word information.\n",
+        "\n",
+        "Another method, **GloVe**, uses a different approach to word embeddings, based on the factorization of the word-context matrix. First, it builds a large matrix that counts the number of word occurences in different contexts, and then it tries to represent this matrix in lower dimensions in a way that minimizes reconstruction loss.\n",
+        "\n",
+        "The gensim library supports those word embeddings, and you can experiment with them by changing the model loading code above."
+      ],
+      "metadata": {}
+    },
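+    {
+      "cell_type": "markdown",
+      "source": [
+        "> As an optional illustration (not required for the rest of this notebook), here is a minimal sketch of what changing the loading code might look like: it loads GloVe vectors through `gensim.downloader` and reruns the similarity query from above. The model name `glove-wiki-gigaword-100` comes from the gensim-data catalogue, and downloading it may take a few minutes."
+      ],
+      "metadata": {}
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "# Optional sketch: load 100-dimensional GloVe vectors via the gensim downloader\n",
+        "# ('glove-wiki-gigaword-100' is one of the models in the gensim-data catalogue)\n",
+        "# and check the nearest neighbours of the same word as before.\n",
+        "import gensim.downloader as api\n",
+        "\n",
+        "glove = api.load('glove-wiki-gigaword-100')\n",
+        "glove.most_similar('neural')[:3]"
+      ],
+      "outputs": [],
+      "execution_count": null,
+      "metadata": {}
+    },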
+    {
+      "cell_type": "markdown",
+      "source": [
+        "## Using pretrained embeddings in Keras\n",
+        "\n",
+        "We can modify the example above to prepopulate the matrix in our embedding layer with semantic embeddings, such as Word2Vec. The vocabularies of the pretrained embedding and the text corpus will likely not match, so we need to choose one. Here we explore the two possible options: using the tokenizer vocabulary, and using the vocabulary from Word2Vec embeddings.\n",
+        "\n",
+        "### Using tokenizer vocabulary\n",
+        "\n",
+        "When using the tokenizer vocabulary, some of the words from the vocabulary will have corresponding Word2Vec embeddings, and some will be missing. Given that our vocabulary size is `vocab_size`, and the Word2Vec embedding vector length is `embed_size`, the embedding layer will be repesented by a weight matrix of shape `vocab_size`$\\times$`embed_size`. We will populate this matrix by going through the vocabulary:"
+      ],
+      "metadata": {}
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "embed_size = len(w2v.get_vector('hello'))\n",
+        "print(f'Embedding size: {embed_size}')\n",
+        "\n",
+        "vocab = vectorizer.get_vocabulary()\n",
+        "W = np.zeros((vocab_size,embed_size))\n",
+        "print('Populating matrix, this will take some time...',end='')\n",
+        "found, not_found = 0,0\n",
+        "for i,w in enumerate(vocab):\n",
+        "    try:\n",
+        "        W[i] = w2v.get_vector(w)\n",
+        "        found+=1\n",
+        "    except:\n",
+        "        # W[i] = np.random.normal(0.0,0.3,size=(embed_size,))\n",
+        "        not_found+=1\n",
+        "\n",
+        "print(f\"Done, found {found} words, {not_found} words missing\")"
+      ],
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stdout",
+          "text": [
+            "Embedding size: 300\n",
+            "Populating matrix, this will take some time...Done, found 4551 words, 784 words missing\n"
+          ]
+        }
+      ],
+      "execution_count": 16,
+      "metadata": {
+        "tags": []
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "For words that are not present in the Word2Vec vocabulary, we can either leave them as zeroes, or generate a random vector.\n",
+        "\n",
+        "Now we can define an embedding layer with pretrained weights:"
+      ],
+      "metadata": {}
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "emb = keras.layers.Embedding(vocab_size,embed_size,weights=[W],trainable=False)\n",
+        "model = keras.models.Sequential([\n",
+        "    vectorizer, emb,\n",
+        "    keras.layers.Lambda(lambda x: tf.reduce_mean(x,axis=1)),\n",
+        "    keras.layers.Dense(4, activation='softmax')\n",
+        "])"
+      ],
+      "outputs": [],
+      "execution_count": 17,
+      "metadata": {}
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "Now let's train our model. "
+      ],
+      "metadata": {}
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "model.compile(loss='sparse_categorical_crossentropy',metrics=['acc'])\n",
+        "model.fit(ds_train.map(tupelize).batch(batch_size),\n",
+        "          validation_data=ds_test.map(tupelize).batch(batch_size))"
+      ],
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stdout",
+          "text": [
+            "938/938 [==============================] - 6s 7ms/step - loss: 1.1098 - acc: 0.7849 - val_loss: 0.9145 - val_acc: 0.8159\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\n"
+          ]
+        },
+        {
+          "output_type": "execute_result",
+          "execution_count": 18,
+          "data": {
+            "text/plain": "<tensorflow.python.keras.callbacks.History at 0x7f0da2028b90>"
+          },
+          "metadata": {}
+        }
+      ],
+      "execution_count": 18,
+      "metadata": {}
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "> **Note**: Notice that we set `trainable=False` when creating the `Embedding`, which means that we're not retraining the Embedding layer. This may cause accuracy to be slightly lower, but it speeds up the training.\n",
+        "\n",
+        "### Using embedding vocabulary\n",
+        "\n",
+        "One issue with the previous approach is that the vocabularies used in the TextVectorization and Embedding are different. To overcome this problem, we can use one of the following solutions:\n",
+        "* Re-train the Word2Vec model on our vocabulary.\n",
+        "* Load our dataset with the vocabulary from the pretrained Word2Vec model. Vocabularies used to load the dataset can be specified during loading.\n",
+        "\n",
+        "The latter approach seems easier, so let's implement it. First of all, we will create a `TextVectorization` layer with the specified vocabulary, taken from the Word2Vec embeddings:"
+      ],
+      "metadata": {}
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "vocab = list(w2v.vocab.keys())\n",
+        "vectorizer = keras.layers.experimental.preprocessing.TextVectorization(input_shape=(1,))\n",
+        "vectorizer.set_vocabulary(vocab)"
+      ],
+      "outputs": [],
+      "execution_count": 19,
+      "metadata": {}
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "The gensim word embeddings library contains a convenient function, `get_keras_embeddings`, which will automatically create the corresponding Keras embeddings layer for you."
+      ],
+      "metadata": {}
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "model = keras.models.Sequential([\n",
+        "    vectorizer, \n",
+        "    w2v.get_keras_embedding(train_embeddings=False),\n",
+        "    keras.layers.Lambda(lambda x: tf.reduce_mean(x,axis=1)),\n",
+        "    keras.layers.Dense(4, activation='softmax')\n",
+        "])\n",
+        "model.compile(loss='sparse_categorical_crossentropy',metrics=['acc'])\n",
+        "model.fit(ds_train.map(tupelize).batch(128),validation_data=ds_test.map(tupelize).batch(128),epochs=5)"
+      ],
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stdout",
+          "text": [
+            "Epoch 1/5\n",
+            "938/938 [==============================] - 7s 7ms/step - loss: 1.3381 - acc: 0.4961 - val_loss: 1.2996 - val_acc: 0.5682\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\n",
+            "Epoch 2/5\n",
+            "938/938 [==============================] - 7s 7ms/step - loss: 1.2591 - acc: 0.5714 - val_loss: 1.2340 - val_acc: 0.5839\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\n",
+            "Epoch 3/5\n",
+            "938/938 [==============================] - 7s 7ms/step - loss: 1.1983 - acc: 0.5883 - val_loss: 1.1827 - val_acc: 0.5951\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\n",
+            "Epoch 4/5\n",
+            "938/938 [==============================] - 7s 7ms/step - loss: 1.1505 - acc: 0.6001 - val_loss: 1.1417 - val_acc: 0.6021\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\n",
+            "Epoch 5/5\n",
+            "938/938 [==============================] - 7s 7ms/step - loss: 1.1122 - acc: 0.6093 - val_loss: 1.1084 - val_acc: 0.6103\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\n"
+          ]
+        },
+        {
+          "output_type": "execute_result",
+          "execution_count": 20,
+          "data": {
+            "text/plain": "<tensorflow.python.keras.callbacks.History at 0x7f0da1d67110>"
+          },
+          "metadata": {}
+        }
+      ],
+      "execution_count": 20,
+      "metadata": {}
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "One of the reasons we're not seeing higher accuracy is because some words from our dataset are missing in the pretrained GloVe vocabulary, and thus they are essentially ignored. To overcome this, we can train our own embeddings based on our dataset. \n",
+        "\n",
+        "\n",
+        "## Training your own embeddings\n",
+        "\n",
+        "In our examples, we have been using pretrained semantic embeddings, but it is interesting to see how those embeddings can be trained using either CBoW, or skip-gram architectures. This exercise goes beyond this module, but those interested might want to check out this [official TensorFlow tutorial on training Word2Vec model](https://www.tensorflow.org/tutorials/text/word2vec). Also, the **gensim** framework can be used to train the most commonly used embeddings in a few lines of code, as described [in the official documentation](https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html#training-your-own-model)."
+      ],
+      "metadata": {}
+    },
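+    {
+      "cell_type": "markdown",
+      "source": [
+        "> As a minimal sketch of what such training looks like, the cell below trains a tiny skip-gram Word2Vec model with gensim on a made-up toy corpus, so the resulting vectors are not meaningful - it only illustrates the API. Note that gensim 3.x passes the embedding dimensionality as `size`, while gensim 4.x renames this parameter to `vector_size`."
+      ],
+      "metadata": {}
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "# Minimal sketch: train a skip-gram Word2Vec model on a toy corpus with gensim.\n",
+        "# sg=1 selects skip-gram (sg=0 would be CBoW); min_count=1 keeps every word.\n",
+        "from gensim.models import Word2Vec\n",
+        "\n",
+        "toy_corpus = [\n",
+        "    ['i', 'like', 'neural', 'networks'],\n",
+        "    ['neural', 'networks', 'learn', 'word', 'embeddings'],\n",
+        "    ['embeddings', 'capture', 'word', 'meaning'],\n",
+        "]\n",
+        "toy_model = Word2Vec(toy_corpus, size=50, window=2, min_count=1, sg=1)\n",
+        "toy_model.wv.most_similar('neural')[:3]"
+      ],
+      "outputs": [],
+      "execution_count": null,
+      "metadata": {}
+    },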
+    {
+      "cell_type": "markdown",
+      "source": [
+        "## Contextual embeddings\n",
+        "\n",
+        "One key limitation of traditional pretrained embedding representations such as Word2Vec is the fact that, even though they can capture some meaning of a word, they can't differentiate between different meanings. This can cause problems in downstream models.\n",
+        "\n",
+        "For example the word 'play' has different meaning in these two different sentences:\n",
+        "- I went to a **play** at the theater.\n",
+        "- John wants to **play** with his friends.\n",
+        "\n",
+        "The pretrained embeddings we talked about represent both meanings of the word 'play' in the same embedding. To overcome this limitation, we need to build embeddings based on the **language model**, which is trained on a large corpus of text, and *knows* how words can be put together in different contexts. Discussing contextual embeddings is out of scope for this tutorial, but we will come back to them when talking about language models in the next unit.\n"
+      ],
+      "metadata": {}
+    }
+  ],
+  "metadata": {
+    "kernelspec": {
+      "name": "conda-env-py37_tensorflow-py",
+      "language": "python",
+      "display_name": "py37_tensorflow"
+    },
+    "language_info": {
+      "name": "python",
+      "version": "3.7.9",
+      "mimetype": "text/x-python",
+      "codemirror_mode": {
+        "name": "ipython",
+        "version": 3
+      },
+      "pygments_lexer": "ipython3",
+      "nbconvert_exporter": "python",
+      "file_extension": ".py"
+    },
+    "kernel_info": {
+      "name": "conda-env-py37_tensorflow-py"
+    },
+    "nteract": {
+      "version": "nteract-front-end@1.0.0"
+    }
+  },
+  "nbformat": 4,
+  "nbformat_minor": 4
+}
\ No newline at end of file
diff --git a/5-NLP/14-Embeddings/torchnlp.py b/5-NLP/14-Embeddings/torchnlp.py
new file mode 100644
index 0000000000000000000000000000000000000000..d6ca5e0c19c08862edc19d7720ae9d66d364b26a
--- /dev/null
+++ b/5-NLP/14-Embeddings/torchnlp.py
@@ -0,0 +1,104 @@
+import builtins
+import torch
+import torchtext
+import collections
+import os
+
+device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+
+vocab = None
+tokenizer = torchtext.data.utils.get_tokenizer('basic_english')
+
+def load_dataset(ngrams=1,min_freq=1):
+    global vocab, tokenizer
+    print("Loading dataset...")
+    train_dataset, test_dataset = torchtext.datasets.AG_NEWS(root='./data')
+    train_dataset = list(train_dataset)
+    test_dataset = list(test_dataset)
+    classes = ['World', 'Sports', 'Business', 'Sci/Tech']
+    print('Building vocab...')
+    counter = collections.Counter()
+    for (label, line) in train_dataset:
+        counter.update(torchtext.data.utils.ngrams_iterator(tokenizer(line),ngrams=ngrams))
+    vocab = torchtext.vocab.Vocab(counter, min_freq=min_freq)
+    return train_dataset,test_dataset,classes,vocab
+
+def encode(x,voc=None,unk=0,tokenizer=tokenizer):
+    v = vocab if voc is None else voc
+    return [v.stoi.get(s,unk) for s in tokenizer(x)]
+
+def train_epoch(net,dataloader,lr=0.01,optimizer=None,loss_fn = torch.nn.CrossEntropyLoss(),epoch_size=None, report_freq=200):
+    optimizer = optimizer or torch.optim.Adam(net.parameters(),lr=lr)
+    loss_fn = loss_fn.to(device)
+    net.train()
+    total_loss,acc,count,i = 0,0,0,0
+    for labels,features in dataloader:
+        optimizer.zero_grad()
+        features, labels = features.to(device), labels.to(device)
+        out = net(features)
+        loss = loss_fn(out,labels) #cross_entropy(out,labels)
+        loss.backward()
+        optimizer.step()
+        total_loss+=loss
+        _,predicted = torch.max(out,1)
+        acc+=(predicted==labels).sum()
+        count+=len(labels)
+        i+=1
+        if i%report_freq==0:
+            print(f"{count}: acc={acc.item()/count}")
+        if epoch_size and count>epoch_size:
+            break
+    return total_loss.item()/count, acc.item()/count
+
+def padify(b,voc=None,tokenizer=tokenizer):
+    # b is the list of tuples of length batch_size
+    #   - first element of a tuple = label, 
+    #   - second = feature (text sequence)
+    # build vectorized sequence
+    v = [encode(x[1],voc=voc,tokenizer=tokenizer) for x in b]
+    # compute max length of a sequence in this minibatch
+    l = max(map(len,v))
+    return ( # tuple of two tensors - labels and features
+        torch.LongTensor([t[0]-1 for t in b]),
+        torch.stack([torch.nn.functional.pad(torch.tensor(t),(0,l-len(t)),mode='constant',value=0) for t in v])
+    )
+
+def offsetify(b,voc=None):
+    # first, compute data tensor from all sequences
+    x = [torch.tensor(encode(t[1],voc=voc)) for t in b]
+    # now, compute the offsets by accumulating the tensor of sequence lengths
+    o = [0] + [len(t) for t in x]
+    o = torch.tensor(o[:-1]).cumsum(dim=0)
+    return ( 
+        torch.LongTensor([t[0]-1 for t in b]), # labels
+        torch.cat(x), # text 
+        o
+    )
+
+def train_epoch_emb(net,dataloader,lr=0.01,optimizer=None,loss_fn = torch.nn.CrossEntropyLoss(),epoch_size=None, report_freq=200,use_pack_sequence=False):
+    optimizer = optimizer or torch.optim.Adam(net.parameters(),lr=lr)
+    loss_fn = loss_fn.to(device)
+    net.train()
+    total_loss,acc,count,i = 0,0,0,0
+    for labels,text,off in dataloader:
+        optimizer.zero_grad()
+        labels,text = labels.to(device), text.to(device)
+        if use_pack_sequence:
+            off = off.to('cpu')
+        else:
+            off = off.to(device)
+        out = net(text, off)
+        loss = loss_fn(out,labels) #cross_entropy(out,labels)
+        loss.backward()
+        optimizer.step()
+        total_loss+=loss
+        _,predicted = torch.max(out,1)
+        acc+=(predicted==labels).sum()
+        count+=len(labels)
+        i+=1
+        if i%report_freq==0:
+            print(f"{count}: acc={acc.item()/count}")
+        if epoch_size and count>epoch_size:
+            break
+    return total_loss.item()/count, acc.item()/count
+
diff --git a/5-NLP/README.md b/5-NLP/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..5ea06aa81f00e72be138498066edf96c4a4bf3a8
--- /dev/null
+++ b/5-NLP/README.md
@@ -0,0 +1,39 @@
+# Natural Language Processing
+
+In this section, we will focus on using Neural Networks to handle tasks related to natural language processing (NLP). There are many NLP problems that we want computers to be able to solve:
+
+* **Text classification** is a typical classification problem on text sequences. Examples include classifying e-mail messages as spam vs. no-spam, or assigning a news article to one of the pre-defined categories (sport, business, politics, etc.). Also, when developing chat bots, we often need to understand what a user wanted to say -- in this case we are dealing with **intent classification**. Often, in intent classification we need to deal with many categories.
+* **Sentiment analysis** is a typical regression problem, where we need to attribute a number -- the sentiment -- corresponding to how positive/negative the meaning of a sentence is. A more advanced version of sentiment analysis is **aspect-based sentiment analysis** (ABSA), where we attribute sentiment not to the whole sentence, but to different parts of it (aspects), e.g. *In this restaurant, I liked the cuisine, but the atmosphere was awful*.
+* **Named Entity Recognition** (NER) refers to the problem of extracting certain entities from text. For example, we might need to understand that in the phrase *I need to fly to Paris tomorrow* the word *tomorrow* refers to DATE, and *Paris* is a LOCATION.  
+* **Keyword extraction** is similar to NER, but we need to extract words important to the meaning of the sentence automatically, without pre-training for specific entity types.
+* **Text clustering** can be useful when we want to group together similar sentences, for example, similar requests in technical support conversations.
+* **Question answering** refers to the ability of a model to answer a specific question. The model receives a text passage and a question as inputs, and it needs to provide a place in the text where the answer to the question is contained (or, sometimes, to generate the answer text).
+* **Text Generation** is the ability of a model to generate new text. It can be considered a classification task that predicts the next letter/word based on some *text prompt*. Advanced text generation models, such as GPT-3, are able to solve other NLP tasks such as classification using a technique called [prompt programming](https://towardsdatascience.com/software-3-0-how-prompting-will-change-the-rules-of-the-game-a982fbfe1e0) or [prompt engineering](https://medium.com/swlh/openai-gpt-3-and-prompt-engineering-dcdc2c5fcd29).
+* **Text summarization** is the task of having a computer "read" a long text and summarize it in a few sentences.
+* **Machine translation** can be viewed as a combination of text understanding in one language, and text generation in another one.
+
+Initially, most NLP tasks were solved using traditional methods such as grammars. For example, in machine translation, parsers were used to transform the initial sentence into a syntax tree, then higher-level semantic structures were extracted to represent the meaning of the sentence, and the result was generated based on this meaning and the grammar of the target language. Nowadays, many NLP tasks are solved more effectively using neural networks.
+
+Many classical NLP methods are implemented in [Natural Language Processing Toolkit (NLTK)](https://www.nltk.org) Python library. There is a great [NLTK Book](https://www.nltk.org/book/) available online that covers how different NLP tasks can be solved using NLTK.
+
+In our course, we will mostly focus on using Neural Networks for NLP, and we will use NLTK where needed.
+
+We have already learnt about using neural networks for dealing with tabular data and with images. The main difference between those types of data and text is that text is a sequence of variable length, while the input size for images is known in advance. While convolutional networks can extract patterns from input data, patterns in text are more complex. For example, a negation can be separated from the subject by arbitrarily many words (e.g. *I do not like oranges*, vs. *I do not like those big colorful tasty oranges*), and it should still be interpreted as one pattern. Thus, to handle language we need to introduce new neural network types, such as *recurrent networks* and *transformers*. 
+
+## Install Libraries
+
+If you are using a local Python installation to run this course, you may need to install all the required libraries for NLP using the following commands:
+
+**For PyTorch**
+```bash
+pip install -r requirements-pytorch.txt
+```
+**For Tensorflow**
+```bash
+pip install -r requirements-tf.txt
+```
+
+## Contents
+
+* [Representing text as tensors](13-TextRep/README.md)
+* [Word Embeddings](14-Embeddings/README.md)
diff --git a/5-NLP/requirements-pytorch.txt b/5-NLP/requirements-pytorch.txt
new file mode 100644
index 0000000000000000000000000000000000000000..3545f5598247f13be142e8f3b66e5429170b7179
--- /dev/null
+++ b/5-NLP/requirements-pytorch.txt
@@ -0,0 +1,15 @@
+gensim==3.8.3
+huggingface==0.0.1
+matplotlib
+nltk==3.5
+numpy==1.18.5
+opencv-python==4.5.1.48
+Pillow==7.1.2
+scikit-learn
+scipy
+torch==1.8.1
+torchaudio==0.8.1
+torchinfo==0.0.8
+torchtext==0.9.1
+torchvision==0.9.1
+transformers==4.3.3
\ No newline at end of file
diff --git a/5-NLP/requirements-tf.txt b/5-NLP/requirements-tf.txt
new file mode 100644
index 0000000000000000000000000000000000000000..8b7689c02b2843f576b295645e0eaacfae220436
--- /dev/null
+++ b/5-NLP/requirements-tf.txt
@@ -0,0 +1,12 @@
+gensim==3.8.3
+huggingface==0.0.1
+matplotlib
+nltk==3.5
+numpy==1.18.5
+opencv-python==4.5.1.48
+Pillow==7.1.2
+scikit-learn
+scipy
+tensorflow
+tensorflow_datasets
+transformers==4.3.3
\ No newline at end of file
diff --git a/README.md b/README.md
index 4706c59f3bc70b3e5da97c04277df4294a2401dd..0080b20a767bebe056cfa9742d48ca1f5a980c84 100644
--- a/README.md
+++ b/README.md
@@ -46,7 +46,7 @@ For a gentle introduction to *AI in the Cloud* topic you may consider taking [Ge
    <td><a href="3-NeuralNetworks/05-Frameworks/README.md">Text</a><br/><a href="3-NeuralNetworks/05-Frameworks/Overfitting.md">Text</a></td>
    <td><a href="3-NeuralNetworks/05-Frameworks/IntroPyTorch.ipynb">PyTorch</td>
    <td><a href="3-NeuralNetworks/05-Frameworks/IntroKerasTF.md">Keras/Tensorflow</td><td></td></tr>
-<tr><td>IV</td><td colspan="2"><b>Computer Vision</b></td>
+<tr><td>IV</td><td colspan="2"><b><a href="4-ComputerVision/README.md">Computer Vision</a></b></td>
   <td><a href="https://docs.microsoft.com/learn/modules/intro-computer-vision-pytorch/?WT.mc_id=academic-33554-dmitryso">MS Learn</a></td>
   <td><a href="https://docs.microsoft.com/learn/modules/intro-computer-vision-tensorflow/?WT.mc_id=academic-33554-dmitryso">MS Learn</a></td>
   <td>PAT</td></tr>
@@ -57,17 +57,17 @@ For a gentle introduction to *AI in the Cloud* topic you may consider taking [Ge
 <tr><td>10</td><td> Generative Adversarial Networks</td><td>Text</td><td>PyTorch</td><td>Tensorflow</td><td></td></tr>
 <tr><td>11</td><td>Object Detection</td><td>Text</td><td>PyTorch</td><td>Tensorflow</td><td></td></tr>
 <tr><td>12</td><td>Instance Segmentation. U-Net</td><td>Text</td><td>PyTorch</td><td>Tensorflow</td><td></td></tr>
-<tr><td>V</td><td colspan="2"><b>Natural Language Processing</b></td>
+<tr><td>V</td><td colspan="2"><b><a href="5-NLP/README.md">Natural Language Processing</a></b></td>
    <td><a href="https://docs.microsoft.com/learn/modules/intro-natural-language-processing-pytorch/?WT.mc_id=academic-33554-dmitryso">MS Learn</a></td>
    <td><a href="https://docs.microsoft.com/learn/modules/intro-natural-language-processing-tensorflow/?WT.mc_id=academic-33554-dmitryso">MS Learn</a></td>
    <td>PAT</td></tr>
-<tr><td>13</td><td>Text Representation. Bow/TF-IDF</td><td>Text</td><td>PyTorch</td><td>Tensorflow</td><td></td></tr>
-<tr><td>14</td><td>Semantic Word Embeddings</td><td>Text</td><td>PyTorch</td><td>Tensorflow</td><td></td></tr>
+<tr><td>13</td><td>Text Representation. Bow/TF-IDF</td><td><a href="5-NLP/13-TextRep/README.md">Text</a></td><td><a href="5-NLP/13-TextRep/TextRepresentationPyTorch.ipynb">PyTorch</a></td><td><a href="5-NLP/13-TextRep/TextRepresentationTF.ipynb">Tensorflow</a></td><td></td></tr>
+<tr><td>14</td><td>Semantic word embeddings</td><td><a href="5-NLP/14-Embeddings/README.md">Text</a></td><td><a href="5-NLP/14-Embeddings/EmbeddingsPyTorch.ipynb">PyTorch</a></td><td><a href="5-NLP/14-Embeddings/EmbeddingsTF.ipynb">Tensorflow</a></td><td></td></tr>
 <tr><td>15</td><td>Training your own embeddings</td><td>Text</td><td>PyTorch</td><td>Tensorflow</td><td></td></tr>
 <tr><td>16</td><td>Recurrent Neural Networks</td><td>Text</td><td>PyTorch</td><td>Tensorflow</td><td></td></tr>
 <tr><td>17</td><td>Generative Recurrent Networks</td><td>Text</td><td>PyTorch</td><td>Tensorflow</td><td></td></tr>
-<tr><td>18</td><td>Language Modelling. BERT. Transformers.</td><td>Text</td><td>PyTorch</td><td>Tensorflow</td><td></td></tr>
-<tr><td>19</td><td>Named Entity Recognition.</td><td>Text</td><td>PyTorch</td><td>Tensorflow</td><td></td></tr>
+<tr><td>18</td><td>Language Modelling. Transformers. BERT.</td><td>Text</td><td>PyTorch</td><td>Tensorflow</td><td></td></tr>
+<tr><td>19</td><td>Named Entity Recognition</td><td>Text</td><td>PyTorch</td><td>Tensorflow</td><td></td></tr>
 <tr><td>20</td><td>Text Generation using GPT</td><td>Text</td><td>PyTorch</td><td>Tensorflow</td><td></td></tr>
 <tr><td>VI</td><td colspan="4"><b>Other AI Techniques</b></td><td>PAT</td></tr>
 <tr><td>21</td><td>Genetic Algorithms</td><td>Text<td colspan="2">Notebook</td><td></td></tr>