Well-Known CNN Architectures

VGG-16

VGG-16 is a network that achieved 92.7% accuracy in ImageNet top-5 classification in 2014. It has the following layer structure:

ImageNet Layers

As you can see, VGG follows a traditional pyramid architecture: a sequence of convolution and pooling layers in which the spatial size of the feature maps shrinks while the number of filters grows.
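To make the pyramid shape concrete, here is a minimal sketch that traces the feature map size through the convolutional part of the network. It assumes the standard VGG-16 configuration (224x224 input, 3x3 convolutions with padding 1, 2x2 max pooling with stride 2). The names `VGG16_BLOCKS` and `trace_shapes` are made up for this illustration and do not come from any library.

```python
# Assumed standard VGG-16 layout: (number of 3x3 conv layers, output channels)
# for each block. Every block is followed by a 2x2 max-pool with stride 2.
VGG16_BLOCKS = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]


def trace_shapes(size=224, blocks=VGG16_BLOCKS):
    """Return (spatial size, channels) after the pooling step of each block."""
    shapes = []
    for _num_convs, channels in blocks:
        # 3x3 convolutions with padding 1 keep the spatial size unchanged,
        # so only the pooling step changes it: it halves width and height.
        size //= 2
        shapes.append((size, channels))
    return shapes


shapes = trace_shapes()
for size, channels in shapes:
    print(f"{size}x{size}x{channels}")
```

The 13 convolutional layers counted here, plus the 3 fully connected layers at the end, give the 16 weight layers in the name.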

ImageNet Pyramid

Image from Researchgate

ResNet

ResNet is a family of models proposed by Microsoft Research in 2015. The main idea of ResNet is to use residual blocks:

Image from this paper

The reason for using the identity pass-through is that the layers inside the block only have to predict the difference between the block's input (the result of the previous layer) and the block's desired output, hence the name residual. Such blocks are much easier to train, and one can construct networks that are more than a hundred layers deep (the most common variants are ResNet-50, ResNet-101 and ResNet-152).

You can also think of this network as being able to adjust its complexity to the dataset. Initially, when you start training the network, the weight values are small, and most of the signal goes through the identity pass-through connections. As training progresses and the weights become larger, the significance of the network parameters grows, and the network adjusts to provide the expressive power required to correctly classify the training images.
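The residual idea can be sketched in a few lines of plain Python. This is a simplified illustration, not the block from the paper: it uses small fully connected layers instead of convolutions, has no batch normalization, and leaves out the ReLU that the paper applies after the addition. The helper names `linear`, `relu` and `residual_block` are hypothetical.

```python
def linear(weights, x):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def relu(v):
    return [max(0.0, a) for a in v]


def residual_block(x, w1, w2):
    # F(x): two weight layers with a ReLU in between.
    fx = linear(w2, relu(linear(w1, x)))
    # Identity shortcut: the block outputs x + F(x), so F only has to
    # learn the difference (residual) between input and desired output.
    return [a + b for a, b in zip(x, fx)]


x = [1.0, -2.0, 3.0]
zeros = [[0.0] * 3 for _ in range(3)]
small = [[0.01] * 3 for _ in range(3)]

y_zero = residual_block(x, zeros, zeros)   # F(x) is zero: pure pass-through
y_small = residual_block(x, small, small)  # F(x) is tiny: output stays near x
print(y_zero)
print(y_small)
```

With zero weights the block returns its input unchanged, and with small weights the output stays close to the input, which matches the picture of the signal initially flowing through the pass-through connections.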

Google Inception

Google Inception architecture takes this idea one step further, and builds each network layer as a combination of several different paths:

Image from Researchgate

Here, we need to emphasize the role of 1x1 convolutions, because at first they do not seem to make sense. Why would we need to run through the image with a 1x1 filter? However, you need to remember that convolution filters also work across several depth channels (RGB colors in the input image, and the outputs of different filters in subsequent layers), and a 1x1 convolution is used to mix those input channels together using different trainable weights. When it produces fewer output channels than it receives, it can also be viewed as downsampling (pooling) over the channel dimension.
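A 1x1 convolution amounts to applying the same small weight matrix to the channel vector of every pixel. The sketch below shows this with nested lists, assuming an image stored as height x width x channels and weights stored as output channels x input channels. It omits bias terms, and the function name `conv1x1` is made up for this example.

```python
def conv1x1(image, weights):
    """Apply a 1x1 convolution.

    image:   H x W x C_in nested lists
    weights: C_out x C_in nested lists (one row per output channel)
    returns: H x W x C_out nested lists
    """
    return [
        [
            # Each output channel is a weighted sum of the input channels
            # at this pixel; neighbouring pixels are never mixed.
            [sum(w * v for w, v in zip(row, pixel)) for row in weights]
            for pixel in image_row
        ]
        for image_row in image
    ]


# A 2x2 image with 3 channels per pixel.
image = [
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
    [[7.0, 8.0, 9.0], [0.0, 1.0, 0.0]],
]
# Two output channels, so the channel dimension shrinks from 3 to 2.
weights = [
    [1.0, 1.0, 1.0],   # sum of all input channels
    [1.0, 0.0, -1.0],  # first channel minus third channel
]

out = conv1x1(image, weights)
print(out[0][0])  # [6.0, -2.0]
```

The spatial size stays 2x2 while the depth goes from 3 to 2, which is the channel-wise reduction described above.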

Here is a good blog post on the subject, and the original paper.

MobileNet