# Language Modeling

Semantic embeddings, such as Word2Vec and GloVe, are in fact a first step towards **language modeling** - creating models that somehow *understand* (or *represent*) the nature of the language.

## [Pre-lecture quiz](https://green-forest-02da2d60f.1.azurestaticapps.net/quiz/115)

The main idea behind language modeling is training language models on unlabeled datasets in an unsupervised manner. This is important because we have huge amounts of unlabeled text available, while the amount of labeled text is always limited by the amount of effort we can spend on labeling. Most often, we build language models that **predict missing words** in the text, because it is easy to mask out a random word in a text and use it as a training sample.
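
To make this idea concrete, here is a tiny illustrative sketch (not taken from the lesson notebooks) of how a "predict the missing word" training sample can be produced by masking a random token; the function name and mask token are assumptions for illustration:

```python
# Illustrative sketch: create a (masked sentence, target word) training sample
# by hiding one randomly chosen token.
import random

def make_masked_sample(sentence, mask_token="[MASK]"):
    tokens = sentence.lower().split()
    i = random.randrange(len(tokens))              # pick a random position
    target = tokens[i]                             # the word the model should predict
    masked = tokens[:i] + [mask_token] + tokens[i + 1:]
    return " ".join(masked), target

print(make_masked_sample("language models learn from unlabeled text"))
# e.g. ('language models [MASK] from unlabeled text', 'learn')
```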

## Training Embeddings

In our previous examples, we used pre-trained semantic embeddings, but it is interesting to see how those embeddings can be trained. There are several possible approaches:

* **N-gram** language modeling, where we predict a token by looking at the N previous tokens
* **Continuous Bag-of-Words** (CBoW), where we predict the middle token $W_0$ from the surrounding tokens $W_{-N},\dots,W_{-1},W_1,\dots,W_N$
* **Skip-gram**, where we predict the set of neighboring tokens $\{W_{-N},\dots,W_{-1},W_1,\dots,W_N\}$ from the middle token $W_0$

![image from paper on converting words to vectors](../14-Embeddings/images/example-algorithms-for-converting-words-to-vectors.png)

> Image from [this paper](https://arxiv.org/pdf/1301.3781.pdf)
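
As a concrete illustration of the last two approaches, the sketch below shows how CBoW and skip-gram training pairs can be derived from the same token window. The function names and window size are assumptions for illustration, not code from the lesson notebooks:

```python
# Illustrative sketch: deriving CBoW and skip-gram training pairs
# from a token sequence with a fixed context window.

def cbow_pairs(tokens, window=2):
    """(context words, center word) pairs for CBoW."""
    pairs = []
    for i, center in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        pairs.append((context, center))
    return pairs

def skipgram_pairs(tokens, window=2):
    """(center word, context word) pairs for skip-gram."""
    pairs = []
    for i, center in enumerate(tokens):
        for ctx in tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]:
            pairs.append((center, ctx))
    return pairs

tokens = "the quick brown fox jumps".split()
print(cbow_pairs(tokens)[2])       # (['the', 'quick', 'fox', 'jumps'], 'brown')
print(skipgram_pairs(tokens)[:3])  # [('the', 'quick'), ('the', 'brown'), ('quick', 'the')]
```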

## ✍️ Example Notebooks: Training CBoW model

Continue your learning in the following notebooks:

* [Training CBoW Word2Vec with TensorFlow](CBoW-TF.ipynb)
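
If you want a feel for the model architecture before opening the notebook, here is a minimal CBoW sketch in Keras. The hyperparameters (vocabulary size, embedding dimension, window size) are assumptions for illustration, and the notebook may structure the model differently:

```python
# Minimal CBoW model sketch in Keras (illustrative hyperparameters).
import tensorflow as tf

vocab_size = 5000     # assumed vocabulary size
embedding_dim = 64    # assumed embedding dimensionality
window_size = 2       # context words taken on each side of the target

model = tf.keras.Sequential([
    tf.keras.Input(shape=(2 * window_size,), dtype="int32"),        # context token ids
    tf.keras.layers.Embedding(vocab_size, embedding_dim, name="embedding"),
    tf.keras.layers.GlobalAveragePooling1D(),                       # average the context vectors
    tf.keras.layers.Dense(vocab_size, activation="softmax"),        # predict the target token
])

model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()

# After fitting on (context ids, target id) pairs, the trained embedding
# matrix can be read with model.get_layer("embedding").get_weights()[0].
```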

## Conclusion

In the previous lesson we saw that word embeddings work like magic! Now we know that training word embeddings is not a very complex task, and we should be able to train our own word embeddings for domain-specific text if needed.

## [Post-lecture quiz](https://green-forest-02da2d60f.1.azurestaticapps.net/quiz/215)

## Review & Self Study

* [Official PyTorch tutorial on Language Modeling](https://pytorch.org/tutorials/beginner/nlp/word_embeddings_tutorial.html).
* [Official TensorFlow tutorial on training a Word2Vec model](https://www.tensorflow.org/tutorials/text/word2vec).
* Using the **gensim** framework to train the most commonly used embeddings in a few lines of code is described [in this documentation](https://radimrehurek.com/gensim/models/word2vec.html).

## 🚀 [Assignment: Train Skip-Gram Model](lab/README.md)

In the lab, we challenge you to modify the code from this lesson to train a skip-gram model instead of CBoW. [Read the details](lab/README.md)