
Language Modeling

Semantic embeddings, such as Word2Vec and GloVe, are in fact a first step towards language modeling: creating models that in some way capture (or represent) the nature of a language.

Pre-lecture quiz

The main idea behind language modeling is to train models on unlabeled datasets in an unsupervised manner. This is important because we have huge amounts of unlabeled text available, while the amount of labeled text will always be limited by the effort we can spend on labeling. Most often, we build language models that predict missing words in a text, because it is easy to mask out a random word and use it as a training sample.
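To make this concrete, here is a minimal sketch (not taken from the lesson notebooks) of how unlabeled text can be turned into training samples: we slide a window over the tokens and hide the central word, which the model then has to predict. The window size of 2 is an assumed value.

```python
# Turn raw, unlabeled text into (context, target) pairs by masking the
# central word of each sliding window.
text = "we can build language models that predict missing words in the text"
tokens = text.split()

window = 2  # number of context words on each side (assumed value)
samples = []
for i in range(window, len(tokens) - window):
    context = tokens[i - window:i] + tokens[i + 1:i + window + 1]
    target = tokens[i]  # the "masked" word the model must predict
    samples.append((context, target))

print(samples[0])
# (['we', 'can', 'language', 'models'], 'build')
```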

Training Embeddings

In our previous examples, we used pre-trained semantic embeddings, but it is interesting to see how those embeddings can be trained from scratch using either the CBoW or the Skip-gram architecture.

Image from this paper on converting words to vectors

The idea behind CBoW is to predict a missing word from its context. To do this, we take a small sliding window of text tokens, which we can denote W-2 to W2, and train a model to predict the central word W0 from the surrounding words.
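The following is a minimal CBoW sketch in PyTorch, for illustration only (it is not the course notebook, and the vocabulary size, embedding dimension and window size are assumed values): the embeddings of the context words W-2..W2 (excluding W0) are averaged and passed through a linear layer that scores every word in the vocabulary as a candidate for W0.

```python
import torch
import torch.nn as nn

class CBoW(nn.Module):
    def __init__(self, vocab_size, embed_dim=100):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)  # word id -> vector
        self.linear = nn.Linear(embed_dim, vocab_size)        # vector -> word scores

    def forward(self, context_ids):            # context_ids: (batch, 2 * window)
        vectors = self.embedding(context_ids)  # (batch, 2 * window, embed_dim)
        averaged = vectors.mean(dim=1)         # average the context embeddings
        return self.linear(averaged)           # logits over the vocabulary

# Hypothetical usage: vocabulary of 5000 words, window of 2 words on each side.
model = CBoW(vocab_size=5000)
context = torch.randint(0, 5000, (8, 4))  # batch of 8 contexts, 4 words each
logits = model(context)                   # (8, 5000); train with CrossEntropyLoss
```

After training, the weights of the embedding layer are the word vectors; the prediction head is only needed during training.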

Conclusion

TBD

:rocket: Challenge

TBD

Post-lecture quiz

Review & Self Study

Assignment: Notebooks - TBD