# Multi-Modal Networks
After the success of transformer models for solving NLP tasks, there have been many attempts to apply the same or similar architectures to computer vision tasks. There is also growing interest in building models that *combine* vision and natural language capabilities. One such attempt was made by OpenAI, and it is called CLIP.
## Contrastive Language-Image Pre-Training (CLIP)
The main idea of CLIP is to compare a text prompt with an image and determine how well the image corresponds to the prompt.
![CLIP Architecture](images/clip-arch.png)
> *Picture from [this blog post](https://openai.com/blog/clip/)*
The model is trained on images obtained from the Internet and their captions. For each batch, we take N pairs of (image, text) and convert them into vector representations I<sub>1</sub>, ..., I<sub>N</sub> and T<sub>1</sub>, ..., T<sub>N</sub>. Those representations are then matched against each other. The loss function is defined to maximize the cosine similarity between vectors corresponding to the same pair (e.g. I<sub>i</sub> and T<sub>i</sub>) and minimize the cosine similarity between all other pairs. That is why this approach is called **contrastive**.
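Conceptually, this objective can be written as a symmetric cross-entropy over the matrix of pairwise cosine similarities. The snippet below is a minimal sketch of that idea; the encoder outputs and the temperature value are illustrative assumptions, not CLIP's exact implementation:

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    # image_emb, text_emb: (N, d) embeddings of the N (image, text) pairs in a batch
    # (how they are produced, and the temperature value, are assumptions for this sketch)
    image_emb = F.normalize(image_emb, dim=-1)  # unit vectors -> dot product = cosine similarity
    text_emb = F.normalize(text_emb, dim=-1)

    logits = image_emb @ text_emb.t() / temperature     # (N, N) similarity matrix
    targets = torch.arange(len(image_emb))              # the i-th image matches the i-th text

    loss_images = F.cross_entropy(logits, targets)      # pick the right text for each image
    loss_texts = F.cross_entropy(logits.t(), targets)   # pick the right image for each text
    return (loss_images + loss_texts) / 2
```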
The CLIP model/library is available from [OpenAI GitHub](https://github.com/openai/CLIP). The approach is described in [this blog post](https://openai.com/blog/clip/), and in more detail in [this paper](https://arxiv.org/pdf/2103.00020.pdf).
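Zero-shot image classification with this library follows the usage pattern shown in its repository: encode an image together with a set of candidate text prompts, and take the softmax over the image-text similarity scores. In the sketch below, `cat.png` and the candidate labels are placeholders for your own image and classes:

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# "cat.png" and the labels below are placeholders -- substitute your own image and classes
image = preprocess(Image.open("cat.png")).unsqueeze(0).to(device)
text = clip.tokenize(["a photo of a cat", "a photo of a dog", "a photo of a car"]).to(device)

with torch.no_grad():
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1)   # probability of each prompt matching the image

print(probs)
```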
We can also do the opposite. If we have a collection of images, we can pass this collection through the network together with a text query, and find the image that best corresponds to the query.
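A rough sketch of that image-search direction, assuming a small placeholder collection of image files and a placeholder text query, could look like this: embed the whole collection once, then rank the images by cosine similarity with the query embedding.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# placeholder image collection and text query
image_files = ["beach.jpg", "forest.jpg", "city.jpg"]
images = torch.stack([preprocess(Image.open(f)) for f in image_files]).to(device)
query = clip.tokenize(["a sunny beach"]).to(device)

with torch.no_grad():
    image_features = model.encode_image(images)
    text_features = model.encode_text(query)

# normalize and rank images by cosine similarity with the query
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
scores = (image_features @ text_features.t()).squeeze(1)
best = scores.argmax().item()
print(f"Best match: {image_files[best]} (score {scores[best]:.3f})")
```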
## ✍️ Example: [Using CLIP for Image Classification and Image Search](Clip.ipynb)
Open the [Clip.ipynb](Clip.ipynb) notebook to see CLIP in action.
## Image Generation with VQGAN + CLIP
CLIP can also be used for **image generation** from a text prompt. In order to do this, we need a **generator model** that can produce images from some vector input. One such model is called [VQGAN](https://compvis.github.io/taming-transformers/) (Vector-Quantized GAN).
The main ideas of VQGAN that differentiate it from an ordinary [GAN](../../4-ComputerVision/10-GANs/README.md) are the following:
* Using an autoregressive transformer architecture to generate a sequence of context-rich visual parts that compose the image. Those visual parts are in turn learned by a [CNN](../../4-ComputerVision/07-ConvNets/README.md).
* Using a sub-image discriminator that detects whether parts of the image are "real" or "fake" (unlike the "all-or-nothing" approach of a traditional GAN).
Learn more about VQGAN at the [Taming Transformers](https://compvis.github.io/taming-transformers/) web site.
One of the important differences between VQGAN and a traditional GAN is that the latter can produce a decent image from any input vector, while VQGAN is likely to produce an image that is not coherent. Thus, we need to further guide the image creation process, and that can be done using CLIP.
![VQGAN+CLIP Architecture](images/vqgan.png)
To generate an image corresponding to a text prompt, we start with some random encoding vector that is passed through VQGAN to produce an image. Then CLIP is used to produce a loss function that shows how well the image corresponds to the text prompt. The goal is then to minimize this loss, using backpropagation to adjust the parameters of the input vector.
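The sketch below illustrates this CLIP-guided optimization loop. To keep it self-contained, it optimizes raw pixel values instead of VQGAN latent codes, so it will produce noisy images rather than VQGAN-quality output; the prompt, resolution, learning rate and step count are arbitrary placeholders, and it omits CLIP's input normalization and the augmentations a real VQGAN+CLIP pipeline uses.

```python
import torch
import torch.nn.functional as F
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)
model = model.float()  # keep everything in fp32 so we can backpropagate through the input

# placeholder prompt
text = clip.tokenize(["a watercolor painting of a lighthouse"]).to(device)
with torch.no_grad():
    text_features = F.normalize(model.encode_text(text), dim=-1)

# Stand-in for the generator: here the optimized parameters are raw pixels;
# in VQGAN+CLIP they would be VQGAN latent codes decoded into an image.
image = torch.rand(1, 3, 224, 224, device=device, requires_grad=True)
optimizer = torch.optim.Adam([image], lr=0.05)

for step in range(200):
    optimizer.zero_grad()
    img = image.clamp(0, 1)                                      # keep pixel values valid
    image_features = F.normalize(model.encode_image(img), dim=-1)
    loss = 1 - (image_features * text_features).sum()            # 1 - cosine similarity with the prompt
    loss.backward()                                              # gradients flow back to the input parameters
    optimizer.step()                                             # adjust the input to better match the prompt
```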
A great library that implements VQGAN+CLIP is [Pixray](http://github.com/pixray/pixray).