{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Embeddings\n",
        "\n",
        "In our previous example, we operated on high-dimensional bag-of-words vectors of length `vocab_size`, and we explicitly converted low-dimensional positional representations into a sparse one-hot representation. This one-hot representation is not memory-efficient. In addition, each word is treated independently of the others, so one-hot encoded vectors don't express semantic similarity between words.\n",
        "\n",
        "In this unit, we will continue exploring the **AG News** dataset. To begin, let's load the data and get some definitions from the previous unit."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 2,
      "metadata": {},
      "outputs": [],
      "source": [
        "import tensorflow as tf\n",
        "from tensorflow import keras\n",
        "import tensorflow_datasets as tfds\n",
        "import numpy as np\n",
        "\n",
        "ds_train, ds_test = tfds.load('ag_news_subset').values()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "\n",
        "### What's an embedding?\n",
        "\n",
        "The idea of **embedding** is to represent words using lower-dimensional dense vectors that reflect the semantic meaning of the word. We will later discuss how to build meaningful word embeddings, but for now let's just think of embeddings as a way to reduce the dimensionality of a word vector. \n",
        "\n",
        "So, an embedding layer takes a word as input, and produces an output vector of specified `embedding_size`. In a sense, it is very similar to a `Dense` layer, but instead of taking a one-hot encoded vector as input, it's able to take a word number.\n",
        "\n",
        "By using an embedding layer as the first layer in our network, we can switch from bag-of-words to an **embedding bag** model, where we first convert each word in our text into the corresponding embedding, and then compute some aggregate function over all those embeddings, such as `sum`, `average` or `max`.  \n",
        "\n",
        "![Image showing an embedding classifier for five sequence words.](images/embedding-classifier-example.png)\n",
        "\n",
        "Our classifier neural network consists of the following layers:\n",
        "\n",
        "* `TextVectorization` layer, which takes a string as input, and produces a tensor of token numbers. We will specify some reasonable vocabulary size `vocab_size`, and ignore less-frequently used words. The input shape will be 1, and the output shape will be $n$, since we'll get $n$ tokens as a result, each of them containing numbers from 0 to `vocab_size`.\n",
        "* `Embedding` layer, which takes $n$ numbers, and reduces each number to a dense vector of a given length (100 in our example). Thus, the input tensor of shape $n$ will be transformed into an $n\\times 100$ tensor. \n",
        "* Aggregation layer, which takes the average of this tensor along the first axis, i.e. it will compute the average of all $n$ input tensors corresponding to different words. To implement this layer, we will use a `Lambda` layer, and pass into it the function to compute the average. The output will have shape of 100, and it will be the numeric representation of the whole input sequence.\n",
        "* Final `Dense` linear classifier."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 3,
      "metadata": {},
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Model: \"sequential\"\n",
            "_________________________________________________________________\n",
            " Layer (type)                Output Shape              Param #   \n",
            "=================================================================\n",
            " text_vectorization (TextVec  (None, None)             0         \n",
            " torization)                                                     \n",
            "                                                                 \n",
            " embedding (Embedding)       (None, None, 100)         3000000   \n",
            "                                                                 \n",
            " lambda (Lambda)             (None, 100)               0         \n",
            "                                                                 \n",
            " dense (Dense)               (None, 4)                 404       \n",
            "                                                                 \n",
            "=================================================================\n",
            "Total params: 3,000,404\n",
            "Trainable params: 3,000,404\n",
            "Non-trainable params: 0\n",
            "_________________________________________________________________\n"
          ]
        }
      ],
      "source": [
        "vocab_size = 30000\n",
        "batch_size = 128\n",
        "\n",
        "vectorizer = keras.layers.experimental.preprocessing.TextVectorization(max_tokens=vocab_size,input_shape=(1,))\n",
        "\n",
        "model = keras.models.Sequential([\n",
        "    vectorizer,    \n",
        "    keras.layers.Embedding(vocab_size,100),\n",
        "    keras.layers.Lambda(lambda x: tf.reduce_mean(x,axis=1)),\n",
        "    keras.layers.Dense(4, activation='softmax')\n",
        "])\n",
        "model.summary()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "In the `summary` printout, in the **output shape** column, the first tensor dimension `None` corresponds to the minibatch size, and the second corresponds to the length of the token sequence. All token sequences in the minibatch have different lengths; we'll discuss how to deal with this in the next section.\n",
        "\n",
        "Now let's train the network:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 4,
      "metadata": {},
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Training vectorizer\n",
            "938/938 [==============================] - 20s 20ms/step - loss: 0.7891 - acc: 0.8155 - val_loss: 0.4470 - val_acc: 0.8642\n"
          ]
        },
        {
          "data": {
            "text/plain": [
              "<keras.callbacks.History at 0x22255515100>"
            ]
          },
          "execution_count": 4,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "def extract_text(x):\n",
        "    return x['title']+' '+x['description']\n",
        "\n",
        "def tupelize(x):\n",
        "    return (extract_text(x),x['label'])\n",
        "\n",
        "print(\"Training vectorizer\")\n",
        "vectorizer.adapt(ds_train.take(500).map(extract_text))\n",
        "\n",
        "model.compile(loss='sparse_categorical_crossentropy',metrics=['acc'])\n",
        "model.fit(ds_train.map(tupelize).batch(batch_size),validation_data=ds_test.map(tupelize).batch(batch_size))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "nteract": {
          "transient": {
            "deleting": false
          }
        }
      },
      "source": [
        "> **Note** that we are building the vectorizer based on a subset of the data. This is done in order to speed up the process, and it might result in a situation where not all tokens from our text are present in the vocabulary. In this case, those tokens are ignored, which may result in slightly lower accuracy. However, in real life a subset of text often gives a good vocabulary estimate."
      ]
    },
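    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "As a quick, illustrative check of this (not part of the original notebook), we can count how often the vectorizer maps tokens from unseen text to the out-of-vocabulary index, which is 1 for `TextVectorization` (index 0 is reserved for padding):"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# estimate the out-of-vocabulary rate of the subset-trained vectorizer on a few test documents\n",
        "ids = vectorizer(list(ds_test.take(100).map(extract_text).as_numpy_iterator())).numpy()\n",
        "real_tokens = ids != 0                                # index 0 is padding\n",
        "oov_fraction = (ids == 1).sum() / real_tokens.sum()   # index 1 is the OOV token\n",
        "print(f\"OOV token fraction: {oov_fraction:.3f}\")"
      ]
    },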
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Dealing with variable sequence sizes\n",
        "\n",
        "Let's understand how training happens in minibatches. In the example above, the input tensor has dimension 1, and we use 128-long minibatches, so the actual size of the tensor is $128 \\times 1$. However, the number of tokens in each sentence is different. If we apply the `TextVectorization` layer to a single input, the number of tokens returned differs depending on how the text is tokenized:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 5,
      "metadata": {},
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "tf.Tensor([ 1 45], shape=(2,), dtype=int64)\n",
            "tf.Tensor([ 112 1271    1    3 1747  158], shape=(6,), dtype=int64)\n"
          ]
        }
      ],
      "source": [
        "print(vectorizer('Hello, world!'))\n",
        "print(vectorizer('I am glad to meet you!'))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "However, when we apply the vectorizer to several sequences, it has to produce a tensor of rectangular shape, so it fills unused elements with the PAD token (which in our case is zero):"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 6,
      "metadata": {},
      "outputs": [
        {
          "data": {
            "text/plain": [
              "<tf.Tensor: shape=(2, 6), dtype=int64, numpy=\n",
              "array([[   1,   45,    0,    0,    0,    0],\n",
              "       [ 112, 1271,    1,    3, 1747,  158]], dtype=int64)>"
            ]
          },
          "execution_count": 6,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "vectorizer(['Hello, world!','I am glad to meet you!'])"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Passing this batch through the embedding layer, we can see the resulting embeddings; note that all padded positions map to the same vector (the embedding of token 0):"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 7,
      "metadata": {},
      "outputs": [
        {
          "data": {
            "text/plain": [
              "array([[[ 1.53059261e-02,  6.80514947e-02,  3.14026810e-02, ...,\n",
              "         -8.92002955e-02,  1.52911525e-04, -5.65562584e-02],\n",
              "        [ 2.57456154e-01,  2.79364467e-01, -2.03605562e-01, ...,\n",
              "         -2.07474351e-01,  8.31158683e-02, -2.03911960e-01],\n",
              "        [ 3.98201384e-02, -8.03454965e-03,  2.39790026e-02, ...,\n",
              "         -7.18549127e-04,  2.66963355e-02, -4.30646613e-02],\n",
              "        [ 3.98201384e-02, -8.03454965e-03,  2.39790026e-02, ...,\n",
              "         -7.18549127e-04,  2.66963355e-02, -4.30646613e-02],\n",
              "        [ 3.98201384e-02, -8.03454965e-03,  2.39790026e-02, ...,\n",
              "         -7.18549127e-04,  2.66963355e-02, -4.30646613e-02],\n",
              "        [ 3.98201384e-02, -8.03454965e-03,  2.39790026e-02, ...,\n",
              "         -7.18549127e-04,  2.66963355e-02, -4.30646613e-02]],\n",
              "\n",
              "       [[ 1.89674050e-01,  2.61548996e-01, -3.67433839e-02, ...,\n",
              "         -2.07366899e-01, -1.05442435e-01, -2.36952081e-01],\n",
              "        [ 6.16133213e-02,  1.80511594e-01,  9.77298319e-02, ...,\n",
              "         -5.46628237e-02, -1.07340455e-01, -1.06589928e-01],\n",
              "        [ 1.53059261e-02,  6.80514947e-02,  3.14026810e-02, ...,\n",
              "         -8.92002955e-02,  1.52911525e-04, -5.65562584e-02],\n",
              "        [-4.84890305e-02, -8.41715634e-02,  1.51529670e-01, ...,\n",
              "          1.28192469e-01, -7.77286515e-02,  1.26041949e-01],\n",
              "        [-4.17212099e-02, -5.60694858e-02,  4.08860669e-02, ...,\n",
              "          8.70475471e-02,  8.92383084e-02,  1.67974353e-01],\n",
              "        [ 2.85779923e-01,  4.57767487e-01,  4.52292450e-02, ...,\n",
              "         -1.97419018e-01, -2.04659685e-01, -2.79758364e-01]]],\n",
              "      dtype=float32)"
            ]
          },
          "execution_count": 7,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "model.layers[1](vectorizer(['Hello, world!','I am glad to meet you!'])).numpy()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "> **Note**: To minimize the amount of padding, in some cases it makes sense to sort all sequences in the dataset in the order of increasing length (or, more precisely, number of tokens). This will ensure that each minibatch contains sequences of similar length."
      ]
    },
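    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Below is a minimal sketch of one way to do such grouping with `tf.data` (it is not part of the original pipeline): we tokenize first, then bucket sequences of similar token length together, so that little padding is needed within each minibatch. The bucket boundaries and the reuse of `batch_size` are illustrative assumptions."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# sketch: build minibatches of similar-length sequences using length bucketing\n",
        "ds_tokens = ds_train.map(tupelize).map(lambda text,label: (vectorizer(text),label))\n",
        "\n",
        "ds_bucketed = ds_tokens.apply(\n",
        "    tf.data.experimental.bucket_by_sequence_length(\n",
        "        element_length_func=lambda tokens,label: tf.shape(tokens)[0],\n",
        "        bucket_boundaries=[20,40,60],         # illustrative length boundaries\n",
        "        bucket_batch_sizes=[batch_size]*4))   # one batch size per bucket (+1 for the overflow bucket)"
      ]
    },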
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "\n",
        "## Semantic embeddings: Word2Vec\n",
        "\n",
        "In our previous example, the embedding layer learned to map words to vector representations; however, these representations did not carry semantic meaning. It would be nice to learn vector representations such that similar words or synonyms correspond to vectors that are close to each other in terms of some vector distance (for example, Euclidean distance).\n",
        "\n",
        "To do that, we need to pretrain our embedding model on a large collection of text using a technique such as [Word2Vec](https://en.wikipedia.org/wiki/Word2vec). It's based on two main architectures that are used to produce a distributed representation of words:\n",
        "\n",
        " - **Continuous bag-of-words** (CBoW), where we train the model to predict a word from the surrounding context. Given the ngram $(W_{-2},W_{-1},W_0,W_1,W_2)$, the goal of the model is to predict $W_0$ from $(W_{-2},W_{-1},W_1,W_2)$.\n",
        " - **Continuous skip-gram** is opposite to CBoW. The model uses the surrounding window of context words to predict the current word.\n",
        "\n",
        "CBoW is faster, and while skip-gram is slower, it does a better job of representing infrequent words.\n",
        "\n",
        "![Image showing both CBoW and Skip-Gram algorithms to convert words to vectors.](images/example-algorithms-for-converting-words-to-vectors.png)\n",
        "\n",
        "To experiment with the Word2Vec embedding pretrained on Google News dataset, we can use the **gensim** library. Below we find the words most similar to 'neural'.\n",
        "\n",
        "> **Note:** When you first create word vectors, downloading them can take some time!"
      ]
    },
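    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "As a brief illustrative aside before loading the pretrained vectors: the skip-gram objective amounts to predicting context words from a target word, and Keras ships a `skipgrams` helper that generates such (target, context) training pairs together with randomly sampled negative pairs. The toy token sequence below is made up just to show the output format."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# sketch: generate skip-gram training pairs (label 1 = real context pair, 0 = negative sample)\n",
        "toy_tokens = [2, 7, 13, 5, 9]   # a made-up tokenized sentence\n",
        "pairs, labels = tf.keras.preprocessing.sequence.skipgrams(\n",
        "    toy_tokens, vocabulary_size=vocab_size, window_size=2, negative_samples=1.0)\n",
        "for (target,context),label in list(zip(pairs,labels))[:5]:\n",
        "    print(f\"target={target} context={context} label={label}\")"
      ]
    },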
    {
      "cell_type": "code",
      "execution_count": 12,
      "metadata": {},
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "neuronal -> 0.7804799675941467\n",
            "neurons -> 0.7326500415802002\n",
            "neural_circuits -> 0.7252851724624634\n",
            "neuron -> 0.7174385190010071\n",
            "cortical -> 0.6941086649894714\n",
            "brain_circuitry -> 0.6923246383666992\n",
            "synaptic -> 0.6699118614196777\n",
            "neural_circuitry -> 0.6638563275337219\n",
            "neurochemical -> 0.6555314064025879\n",
            "neuronal_activity -> 0.6531826257705688\n"
          ]
        }
      ],
      "source": [
        "import gensim.downloader as api\n",
        "w2v = api.load('word2vec-google-news-300')  # pretrained Google News vectors (downloaded on first use)\n",
        "for w,p in w2v.most_similar('neural'):\n",
        "    print(f\"{w} -> {p}\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We can also extract the vector embedding for a given word, to be used in training the classification model. The embedding has 300 components, but here we only show the first 20 components of the vector for clarity:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 13,
      "metadata": {},
      "outputs": [
        {
          "data": {
            "text/plain": [
              "array([ 0.01226807,  0.06225586,  0.10693359,  0.05810547,  0.23828125,\n",
              "        0.03686523,  0.05151367, -0.20703125,  0.01989746,  0.10058594,\n",
              "       -0.03759766, -0.1015625 , -0.15820312, -0.08105469, -0.0390625 ,\n",
              "       -0.05053711,  0.16015625,  0.2578125 ,  0.10058594, -0.25976562],\n",
              "      dtype=float32)"
            ]
          },
          "execution_count": 13,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "w2v['play'][:20]"
      ]
    },
    {
      "cell_type": "markdown",
Dmitri Soshnikov's avatar
Dmitri Soshnikov committed
      "metadata": {},
      "source": [
        "The great thing about semantic embeddings is that you can manipulate the vector encoding based on semantics. For example, we can ask to find a word whose vector representation is as close as possible to the words *king* and *woman*, and as far as possible from the word *man*:"
      ]
    },
    {
      "cell_type": "code",
Dmitri Soshnikov's avatar
Dmitri Soshnikov committed
      "execution_count": 14,
      "metadata": {},
      "outputs": [
        {
          "data": {
            "text/plain": [
              "('queen', 0.7118192911148071)"
            ]
          },
          "execution_count": 14,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "w2v.most_similar(positive=['king','woman'],negative=['man'])[0]"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "tags": []
      },
      "source": [
        "The example above uses some internal gensim magic, but the underlying logic is actually quite simple. An interesting thing about embeddings is that you can perform normal vector operations on embedding vectors, and those operations reflect operations on word **meanings**. The example above can be expressed in terms of vector operations: we calculate the vector corresponding to **KING-MAN+WOMAN** (the `+` and `-` operations are performed on the vector representations of the corresponding words), and then find the word in the dictionary closest to that vector:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 15,
      "metadata": {},
      "outputs": [
        {
          "data": {
            "text/plain": [
              "'queen'"
            ]
          },
          "execution_count": 15,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "# get the vector corresponding to king-man+woman\n",
        "qvec = w2v['king']-1.7*w2v['man']+1.7*w2v['woman']\n",
        "# find the index of the closest embedding vector \n",
        "d = np.sum((w2v.vectors-qvec)**2,axis=1)\n",
        "min_idx = np.argmin(d)\n",
        "# find the corresponding word\n",
        "w2v.index2word[min_idx]"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "> **NOTE**: We had to add small coefficients to the *man* and *woman* vectors - try removing them to see what happens.\n",
        "\n",
        "To find the closest vector, we use NumPy to compute the squared distances between our vector and all vectors in the vocabulary, and then find the index of the closest word using `argmin`."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "While Word2Vec seems like a great way to express word semantics, it has many disadvantages, including the following:\n",
        "\n",
        "* Both CBoW and skip-gram models are **predictive embeddings**, and they only take local context into account. Word2Vec does not take advantage of global context.\n",
        "* Word2Vec does not take into account word **morphology**, i.e. the fact that the meaning of the word can depend on different parts of the word, such as the root.  \n",
        "\n",
        "**FastText** tries to overcome the second limitation, and builds on Word2Vec by learning vector representations for each word and the charachter n-grams found within each word. The values of the representations are then averaged into one vector at each training step. While this adds a lot of additional computation to pretraining, it enables word embeddings to encode sub-word information.\n",
        "\n",
        "Another method, **GloVe**, uses a different approach to word embeddings, based on the factorization of the word-context matrix. First, it builds a large matrix that counts the number of word occurences in different contexts, and then it tries to represent this matrix in lower dimensions in a way that minimizes reconstruction loss.\n",
        "\n",
        "The gensim library supports those word embeddings, and you can experiment with them by changing the model loading code above."
      ]
    },
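    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "For instance, GloVe and FastText vectors can be loaded through the gensim downloader in the same way as Word2Vec. The sketch below uses model names from the gensim-data catalog; downloading them takes time, so this step is optional."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# sketch: load alternative pretrained embeddings from the gensim-data catalog\n",
        "import gensim.downloader as api\n",
        "\n",
        "glove = api.load('glove-wiki-gigaword-100')   # 100-dimensional GloVe vectors\n",
        "print(glove.most_similar('neural')[:3])\n",
        "\n",
        "# fasttext = api.load('fasttext-wiki-news-subwords-300')  # FastText vectors with subword information"
      ]
    },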
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Using pretrained embeddings in Keras\n",
        "\n",
        "We can modify the example above to prepopulate the matrix in our embedding layer with semantic embeddings, such as Word2Vec. The vocabularies of the pretrained embedding and the text corpus will likely not match, so we need to choose one. Here we explore the two possible options: using the tokenizer vocabulary, and using the vocabulary from Word2Vec embeddings.\n",
        "\n",
        "### Using tokenizer vocabulary\n",
        "\n",
        "When using the tokenizer vocabulary, some of the words from the vocabulary will have corresponding Word2Vec embeddings, and some will be missing. Given that our vocabulary size is `vocab_size`, and the Word2Vec embedding vector length is `embed_size`, the embedding layer will be repesented by a weight matrix of shape `vocab_size`$\\times$`embed_size`. We will populate this matrix by going through the vocabulary:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 9,
      "metadata": {
        "tags": []
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Embedding size: 300\n",
            "Populating matrix, this will take some time...Done, found 4551 words, 784 words missing\n"
          ]
        }
      ],
      "source": [
        "embed_size = len(w2v.get_vector('hello'))\n",
        "print(f'Embedding size: {embed_size}')\n",
        "\n",
        "vocab = vectorizer.get_vocabulary()\n",
        "W = np.zeros((vocab_size,embed_size))\n",
        "print('Populating matrix, this will take some time...',end='')\n",
        "found, not_found = 0,0\n",
        "for i,w in enumerate(vocab):\n",
        "    try:\n",
        "        W[i] = w2v.get_vector(w)\n",
        "        found+=1\n",
        "    except:\n",
        "        # W[i] = np.random.normal(0.0,0.3,size=(embed_size,))\n",
        "        not_found+=1\n",
        "\n",
        "print(f\"Done, found {found} words, {not_found} words missing\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "For words that are not present in the Word2Vec vocabulary, we can either leave them as zeroes, or generate a random vector.\n",
        "\n",
        "Now we can define an embedding layer with pretrained weights:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 10,
      "metadata": {},
      "outputs": [],
      "source": [
        "emb = keras.layers.Embedding(vocab_size,embed_size,weights=[W],trainable=False)\n",
        "model = keras.models.Sequential([\n",
        "    vectorizer, emb,\n",
        "    keras.layers.Lambda(lambda x: tf.reduce_mean(x,axis=1)),\n",
        "    keras.layers.Dense(4, activation='softmax')\n",
        "])"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Now let's train our model. "
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 11,
      "metadata": {},
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "938/938 [==============================] - 10s 10ms/step - loss: 1.1075 - acc: 0.7822 - val_loss: 0.9134 - val_acc: 0.8175\n"
          ]
        },
        {
          "data": {
            "text/plain": [
              "<keras.callbacks.History at 0x2220226ef10>"
            ]
          },
          "execution_count": 11,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "model.compile(loss='sparse_categorical_crossentropy',metrics=['acc'])\n",
        "model.fit(ds_train.map(tupelize).batch(batch_size),\n",
        "          validation_data=ds_test.map(tupelize).batch(batch_size))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "> **Note**: We set `trainable=False` when creating the `Embedding` layer, which means that we're not retraining it. This may cause accuracy to be slightly lower, but it speeds up training.\n",
        "\n",
        "### Using embedding vocabulary\n",
        "\n",
        "One issue with the previous approach is that the vocabularies used in the TextVectorization and Embedding are different. To overcome this problem, we can use one of the following solutions:\n",
        "* Re-train the Word2Vec model on our vocabulary.\n",
        "* Load our dataset with the vocabulary from the pretrained Word2Vec model. Vocabularies used to load the dataset can be specified during loading.\n",
        "\n",
        "The latter approach seems easier, so let's implement it. First of all, we will create a `TextVectorization` layer with the specified vocabulary, taken from the Word2Vec embeddings:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 12,
      "metadata": {},
      "outputs": [],
      "source": [
        "vocab = list(w2v.vocab.keys())\n",
        "vectorizer = keras.layers.experimental.preprocessing.TextVectorization(input_shape=(1,))\n",
        "vectorizer.set_vocabulary(vocab)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The gensim word embeddings library contains a convenient method, `get_keras_embedding`, which will automatically create the corresponding Keras embedding layer for you."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 13,
      "metadata": {},
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Epoch 1/5\n",
            "938/938 [==============================] - 20s 14ms/step - loss: 1.3377 - acc: 0.4978 - val_loss: 1.2995 - val_acc: 0.5647\n",
            "Epoch 2/5\n",
            "938/938 [==============================] - 10s 10ms/step - loss: 1.2587 - acc: 0.5722 - val_loss: 1.2339 - val_acc: 0.5842\n",
            "Epoch 3/5\n",
            "938/938 [==============================] - 10s 10ms/step - loss: 1.1980 - acc: 0.5884 - val_loss: 1.1826 - val_acc: 0.5954\n",
            "Epoch 4/5\n",
            "938/938 [==============================] - 12s 13ms/step - loss: 1.1503 - acc: 0.6002 - val_loss: 1.1417 - val_acc: 0.6018\n",
            "Epoch 5/5\n",
            "938/938 [==============================] - 11s 12ms/step - loss: 1.1120 - acc: 0.6097 - val_loss: 1.1083 - val_acc: 0.6104\n"
          ]
        },
        {
          "data": {
            "text/plain": [
              "<keras.callbacks.History at 0x2220ccb81c0>"
            ]
          },
          "execution_count": 13,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "model = keras.models.Sequential([\n",
        "    vectorizer, \n",
        "    w2v.get_keras_embedding(train_embeddings=False),\n",
        "    keras.layers.Lambda(lambda x: tf.reduce_mean(x,axis=1)),\n",
        "    keras.layers.Dense(4, activation='softmax')\n",
        "])\n",
        "model.compile(loss='sparse_categorical_crossentropy',metrics=['acc'])\n",
        "model.fit(ds_train.map(tupelize).batch(128),validation_data=ds_test.map(tupelize).batch(128),epochs=5)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "One of the reasons we're not seeing higher accuracy is that some words from our dataset are missing from the pretrained Word2Vec vocabulary, and thus they are essentially ignored. To overcome this, we can train our own embeddings on our dataset, as sketched below."
      ]
    },
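    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Here is a rough sketch of how that could be done with gensim (it is not part of the original notebook): we split a subset of the raw text into word lists and train a small skip-gram Word2Vec model on it. The subset size and hyperparameters are arbitrary choices, and on gensim 4.x the `size` parameter is called `vector_size`."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# sketch: train our own Word2Vec embeddings on part of the AG News text\n",
        "from gensim.models import Word2Vec\n",
        "\n",
        "sentences = [(x['title'].numpy()+b' '+x['description'].numpy()).decode('utf-8').lower().split()\n",
        "             for x in ds_train.take(10000)]\n",
        "\n",
        "own_w2v = Word2Vec(sentences, size=100, window=5, min_count=2, sg=1)  # use vector_size=100 on gensim 4.x\n",
        "print(own_w2v.wv.most_similar('game')[:3])   # 'game' is just an example query word"
      ]
    },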
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Contextual embeddings\n",
        "\n",
        "One key limitation of traditional pretrained embedding representations such as Word2Vec is the fact that, even though they can capture some meaning of a word, they can't differentiate between different meanings. This can cause problems in downstream models.\n",
        "\n",
        "For example the word 'play' has different meaning in these two different sentences:\n",
        "- I went to a **play** at the theater.\n",
        "- John wants to **play** with his friends.\n",
        "\n",
        "The pretrained embeddings we talked about represent both meanings of the word 'play' in the same embedding. To overcome this limitation, we need to build embeddings based on the **language model**, which is trained on a large corpus of text, and *knows* how words can be put together in different contexts. Discussing contextual embeddings is out of scope for this tutorial, but we will come back to them when talking about language models in the next unit.\n"
      ]
    }
  ],
  "metadata": {
    "interpreter": {
      "hash": "0cb620c6d4b9f7a635928804c26cf22403d89d98d79684e4529119355ee6d5a5"
    },
    "kernel_info": {
      "name": "conda-env-py37_tensorflow-py"
    },
    "kernelspec": {
      "display_name": "py37_tensorflow",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.8.12"
    },
    "nteract": {
      "version": "nteract-front-end@1.0.0"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 4
}