From c190a5fdccf611384607d47a55ea212544313f25 Mon Sep 17 00:00:00 2001
From: Jen Looper <jen.looper@gmail.com>
Date: Tue, 10 May 2022 13:50:28 -0400
Subject: [PATCH] ch 13

---
 lessons/5-NLP/13-TextRep/README.md     | 34 ++++++++++++++++++--------
 lessons/5-NLP/13-TextRep/assignment.md |  3 +++
 lessons/5-NLP/README.md                |  4 +++
 3 files changed, 31 insertions(+), 10 deletions(-)
 create mode 100644 lessons/5-NLP/13-TextRep/assignment.md

diff --git a/lessons/5-NLP/13-TextRep/README.md b/lessons/5-NLP/13-TextRep/README.md
index 670e1d3..d1983ac 100644
--- a/lessons/5-NLP/13-TextRep/README.md
+++ b/lessons/5-NLP/13-TextRep/README.md
@@ -4,13 +4,13 @@
 
 ## Text Classification
 
-Throughout the first part of this course, we will focus on **text classification** task. We will use [AG News](https://www.kaggle.com/amananandrai/ag-news-classification-dataset) Dataset, which contains news articles like the following:
+Throughout the first part of this section, we will focus on the **text classification** task. We will use the [AG News](https://www.kaggle.com/amananandrai/ag-news-classification-dataset) dataset, which contains news articles like the following:
 
 * Category: Sci/Tech
 * Title: Ky. Company Wins Grant to Study Peptides (AP)
 * Body: AP - A company founded by a chemistry researcher at the University of Louisville won a grant to develop...
 
-Our goal would be to classify the news item into one of the categories based on text.
+Our goal will be to classify the news item into one of the categories based on its text.
 
 ## Representing text
 
@@ -20,14 +20,14 @@ If we want to solve Natural Language Processing (NLP) tasks with neural networks
 
 > [Image source](https://www.seobility.net/en/wiki/ASCII)
 
-We understand what each letter **represents**, and how all characters come together to form the words of a sentence. However, computers by themselves do not have such an understanding, and neural network has to learn the meaning during training.
+As humans, we understand what each letter **represents**, and how all characters come together to form the words of a sentence. However, computers by themselves do not have such an understanding, and a neural network has to learn the meaning during training.
 
 Therefore, we can use different approaches when representing text:
 
 * **Character-level representation**, in which we represent text by treating each character as a number. Given that we have *C* different characters in our text corpus, the word *Hello* would be represented by a 5x*C* tensor. Each letter would correspond to a tensor column in one-hot encoding.
-* **Word-level representation**, in which we create a **vocabulary** of all words in our text, and then represent words using one-hot encoding. This approach is somehow better, because each letter by itself does not have much meaning, and thus by using higher-level semantic concepts - words - we simplify the task for the neural network. However, given large dictionary size, we need to deal with high-dimensional sparse tensors.
+* **Word-level representation**, in which we create a **vocabulary** of all words in our text, and then represent words using one-hot encoding. This approach is somewhat better, because each letter by itself does not have much meaning, and thus by using higher-level semantic concepts - words - we simplify the task for the neural network. However, given the large dictionary size, we need to deal with high-dimensional sparse tensors.
 
-Regardless of the representation, we first need to convert text into a sequence of **tokens**, one token being either a character, a word, or sometimes even part of a word. Then, we convert token into a number, typically using **vocabulary**, and this number can be fed into a neural network using one-hot encoding.
+Regardless of the representation, we first need to convert the text into a sequence of **tokens**, one token being either a character, a word, or sometimes even part of a word. Then, we convert the token into a number, typically using a **vocabulary**, and this number can be fed into a neural network using one-hot encoding.
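+
+As a rough sketch of this pipeline in plain Python (the notebooks below use the PyTorch and TensorFlow tokenizers instead; the tokens and index values here are purely illustrative), we can tokenize a sentence, build a vocabulary, and one-hot encode each token:
+
+```python
+# Minimal word-level tokenization, vocabulary and one-hot encoding sketch.
+text = "Ky. Company Wins Grant to Study Peptides"
+
+# 1. Tokenize: split the text into lower-cased word tokens
+tokens = text.lower().split()
+
+# 2. Build a vocabulary: map each distinct token to an integer index
+vocab = {token: idx for idx, token in enumerate(sorted(set(tokens)))}
+
+# 3. Convert tokens to numbers using the vocabulary
+token_ids = [vocab[t] for t in tokens]
+
+# 4. One-hot encode: each token becomes a vector of length len(vocab)
+one_hot = [[1 if i == idx else 0 for i in range(len(vocab))] for idx in token_ids]
+
+print(token_ids)   # indices into the vocabulary
+print(one_hot[0])  # a sparse vector with a single 1
+```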
 
 ## N-Grams
 
@@ -43,17 +43,31 @@ When solving tasks like text classification, we need to be able to represent tex
 
 > Image by author
 
-BOW essentially represents which words appear in text and in which quantities, which can indeed be a good indication of what the text is about. For example, news article on politics is likely to contains words such as *president* and *country*, while scientific publication would have something like *collider*, *discovered*, etc. Thus, word frequencies can in many cases be a good indicator of text content.
+A BOW essentially represents which words appear in a text and in which quantities, which can indeed be a good indication of what the text is about. For example, a news article on politics is likely to contain words such as *president* and *country*, while a scientific publication would have something like *collider*, *discovered*, etc. Thus, word frequencies can in many cases be a good indicator of text content.
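+
+As a minimal sketch of the idea (plain Python with `collections.Counter`, rather than the framework utilities used in the notebooks), a bag-of-words vector is simply the word counts of a text laid out over a fixed vocabulary:
+
+```python
+from collections import Counter
+
+# Toy corpus; the vocabulary is built from all documents
+corpus = [
+    "the president visited the country",
+    "physicists at the collider discovered a new particle",
+]
+vocab = sorted({w for doc in corpus for w in doc.lower().split()})
+
+def bow_vector(doc):
+    """Return the bag-of-words count vector of a document over `vocab`."""
+    counts = Counter(doc.lower().split())
+    return [counts[w] for w in vocab]
+
+for doc in corpus:
+    print(bow_vector(doc))
+```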
 
-The problem with BOW is that certain common words, such as *and*, *is*, etc. appear in most of the texts, and they have highest frequencies, masking out the words that are really important. We may lower the importance of those words by taking into account the frequency at which words occur in the whole document collection. This is the main idea behind TF/IDF approach, which is covered in more detail in the notebooks below.
+The problem with BOW is that certain common words, such as *and*, *is*, etc. appear in most of the texts, and they have the highest frequencies, masking out the words that are really important. We may lower the importance of those words by taking into account the frequency at which words occur in the whole document collection. This is the main idea behind the TF/IDF approach, which is covered in more detail in the notebooks attached to this lesson.
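+
+One quick way to experiment with this weighting (assuming scikit-learn is available; the lesson notebooks compute similar weights with framework-specific tools) is `TfidfVectorizer`:
+
+```python
+# TF/IDF sketch: common words shared by all documents get low weights,
+# rarer and more informative words get higher ones.
+from sklearn.feature_extraction.text import TfidfVectorizer
+
+corpus = [
+    "the president visited the country",
+    "the collider discovered a new particle",
+    "the president signed the bill",
+]
+
+vectorizer = TfidfVectorizer()
+tfidf = vectorizer.fit_transform(corpus)   # sparse matrix: documents x vocabulary terms
+
+print(vectorizer.get_feature_names_out())
+print(tfidf.toarray().round(2))
+```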
 
-However, none of those approaches can fully take into account the semantics of text. We need more powerful neural networks models, which we will discuss later in this course.
+However, none of those approaches can fully take into account the **semantics** of text. We need more powerful neural network models to do this, which we will discuss later in this section.
 
-## Continue to Notebooks
+## ✍️ Exercises: Text Representation
+
+Continue your learning in the following notebooks:
 
 * [Text Representation with PyTorch](TextRepresentationPyTorch.ipynb)
 * [Text Representation with TensorFlow](TextRepresentationTF.ipynb)
 
+## Conclusion
+
+So far, we have studied techniques that can assign frequency weights to different words. They are, however, unable to represent meaning or order. As the famous linguist J. R. Firth said in 1935, "The complete meaning of a word is always contextual, and no study of meaning apart from context can be taken seriously." We will learn later in the course how to capture contextual information from text using language modeling.
+
+## 🚀 Challenge
+
+Try some other exercises using bag-of-words and different data models. You might be inspired by this [competition on Kaggle](https://www.kaggle.com/competitions/word2vec-nlp-tutorial/overview/part-1-for-beginners-bag-of-words).
+
 ## [Post-lecture quiz](https://black-ground-0cc93280f.1.azurestaticapps.net/quiz/213)
 
-> ✅ Todo: conclusion, assignment, challenge, review.
+## Review & Self Study
+
+Practice your skills with text embeddings and bag-of-words techniques on [Microsoft Learn](https://docs.microsoft.com/learn/modules/intro-natural-language-processing-pytorch/?WT.mc_id=academic-57639-dmitryso).
+
+## [Assignment: Notebooks](assignment.md)
diff --git a/lessons/5-NLP/13-TextRep/assignment.md b/lessons/5-NLP/13-TextRep/assignment.md
new file mode 100644
index 0000000..a02b537
--- /dev/null
+++ b/lessons/5-NLP/13-TextRep/assignment.md
@@ -0,0 +1,3 @@
+# Assignment: Notebooks
+
+Using the notebooks associated with this lesson (either the PyTorch or the TensorFlow version), rerun them using your own dataset, perhaps one from Kaggle, used with proper attribution. Rewrite the notebook to highlight your own findings. Try some innovative datasets that might prove surprising, such as [this one about UFO sightings](https://www.kaggle.com/datasets/NUFORC/ufo-sightings) from NUFORC.
\ No newline at end of file
diff --git a/lessons/5-NLP/README.md b/lessons/5-NLP/README.md
index d8a1649..725ce85 100644
--- a/lessons/5-NLP/README.md
+++ b/lessons/5-NLP/README.md
@@ -35,6 +35,8 @@ pip install -r requirements-torch.txt
 pip install -r requirements-tf.txt
 ```
 
+> You can try NLP with TensorFlow on [Microsoft Learn](https://docs.microsoft.com/learn/modules/intro-natural-language-processing-tensorflow/?WT.mc_id=academic-57639-dmitryso)
+
 ## GPU Warning
 
 In this section, some of the examples involve training quite large models. It is advisable to run the notebooks on a GPU-enabled computer to minimize waiting time.
@@ -49,6 +51,8 @@ if len(physical_devices)>0:
     tf.config.experimental.set_memory_growth(physical_devices[0], True) 
 ```
 
+If you're interested in learning about NLP from a classic ML perspective, visit [this suite of lessons](https://github.com/microsoft/ML-For-Beginners/tree/main/6-NLP)
+
 ## In this Section
 In this section we will learn about:
 
-- 
GitLab