VGG16
Introduction
The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) is an annual computer vision competition. Each year, teams compete on two tasks. The first is to detect objects within an image drawn from 200 classes, which is called object localization. The second is to classify images, each labeled with one of 1000 categories, which is called image classification. VGG16 was proposed by Karen Simonyan and Andrew Zisserman of the Visual Geometry Group Lab of Oxford University in 2014 in the paper “Very Deep Convolutional Networks for Large-Scale Image Recognition”. This model won 1st place in object localization and 2nd place in image classification in the 2014 ILSVRC challenge.
The Architecture
The input to the conv1 layer is a fixed-size 224 x 224 RGB image. The image is passed through a stack of convolutional (conv.) layers, which use filters with a very small receptive field: 3×3 (the smallest size that can capture the notion of left/right, up/down, center). In one of the configurations, it also utilizes 1×1 convolution filters, which can be seen as a linear transformation of the input channels (followed by a non-linearity). The convolution stride is fixed to 1 pixel; the spatial padding of the conv. layer input is chosen so that the spatial resolution is preserved after convolution, i.e. the padding is 1 pixel for the 3×3 conv. layers. Spatial pooling is carried out by five max-pooling layers, which follow some of the conv. layers (not all the conv. layers are followed by max-pooling). Max-pooling is performed over a 2×2 pixel window, with stride 2.
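The rules above (3×3 convolutions with stride 1 and 1-pixel padding, 2×2 max-pooling with stride 2) can be sketched as a single VGG-style block in PyTorch. This is an illustration of the first stage only, not the full network:

```python
import torch
import torch.nn as nn

# One VGG-style block: 3x3 convs with stride 1 and padding 1 preserve the
# spatial resolution; the 2x2 max-pool with stride 2 halves it.
block = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1),
    nn.ReLU(inplace=True),
    nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1),
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=2, stride=2),
)

x = torch.randn(1, 3, 224, 224)  # fixed-size 224 x 224 RGB input
y = block(x)
print(y.shape)  # torch.Size([1, 64, 112, 112])
```

Note that the spatial size stays 224 through both convolutions and only the pooling layer reduces it, exactly as described above.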
Three Fully-Connected (FC) layers follow a stack of convolutional layers (which has a different depth in different architectures): the first two have 4096 channels each, the third performs 1000-way ILSVRC classification and thus contains 1000 channels (one for each class). The final layer is the soft-max layer. The configuration of the fully connected layers is the same in all networks.
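The classifier head described above can be sketched as follows (the 512 × 7 × 7 input size assumes a 224 × 224 image after the conv stack; the dropout used in the original paper is omitted for brevity):

```python
import torch
import torch.nn as nn

# The three FC layers: 4096 -> 4096 -> 1000 channels, followed by soft-max.
classifier = nn.Sequential(
    nn.Linear(512 * 7 * 7, 4096),
    nn.ReLU(inplace=True),
    nn.Linear(4096, 4096),
    nn.ReLU(inplace=True),
    nn.Linear(4096, 1000),  # one channel per ILSVRC class
)

features = torch.randn(1, 512 * 7 * 7)  # flattened conv features
logits = classifier(features)
probs = torch.softmax(logits, dim=1)    # the final soft-max layer
print(logits.shape)  # torch.Size([1, 1000])
```

The soft-max output sums to 1 over the 1000 classes, turning the logits into a class distribution.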
All hidden layers are equipped with the rectification (ReLU) non-linearity. It is also noted that none of the networks (except for one) contain Local Response Normalisation (LRN); such normalization does not improve the performance on the ILSVRC dataset, but leads to increased memory consumption and computation time.
ConvNet Configuration
Configuration: The paper tabulates several VGG architectures, among which there are two versions of VGG-16 (C and D). They differ only in a few layers: where configuration C uses (1, 1) convolution filters, configuration D uses (3, 3) filters instead. The two contain 134 million and 138 million parameters respectively.
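The parameter count of configuration D can be verified directly from the layer shapes, as a sanity check: a 3×3 conv with c_in input and c_out output channels has 9·c_in·c_out weights plus c_out biases, and a fully connected layer has in·out weights plus out biases. A minimal pure-Python sketch:

```python
# Conv layers of VGG16 (configuration D), as (in_channels, out_channels):
convs = [(3, 64), (64, 64),
         (64, 128), (128, 128),
         (128, 256), (256, 256), (256, 256),
         (256, 512), (512, 512), (512, 512),
         (512, 512), (512, 512), (512, 512)]
# Fully connected layers, as (in_features, out_features):
fcs = [(512 * 7 * 7, 4096), (4096, 4096), (4096, 1000)]

conv_params = sum(3 * 3 * cin * cout + cout for cin, cout in convs)
fc_params = sum(cin * cout + cout for cin, cout in fcs)
total = conv_params + fc_params
print(total)  # 138357544, i.e. about 138 million
```

Most of the parameters (over 100 million) sit in the first fully connected layer alone, which is why the FC layers dominate the model size.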
VGG16 uses configuration D.
Code

First, take a look at the VGG16 built into PyTorch:

```python
import torch
import torchsummary
from torchvision.models import vgg16

model = vgg16()  # configuration D
print(model)     # the printout begins with "VGG("
torchsummary.summary(model, (3, 224, 224), device="cpu")
```
Computing the feature-map size:
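The standard output-size formula for a convolution or pooling layer is out = floor((in + 2·padding − kernel) / stride) + 1. A small sketch, applied to the VGG settings described above:

```python
import math

def out_size(in_size, kernel, stride=1, padding=0):
    """Spatial output size of a conv/pool layer."""
    return math.floor((in_size + 2 * padding - kernel) / stride) + 1

print(out_size(224, kernel=3, stride=1, padding=1))  # 224: 3x3 conv, pad 1 keeps the size
print(out_size(224, kernel=2, stride=2))             # 112: 2x2 max-pool, stride 2 halves it
```

Applying the halving five times (once per max-pool) takes 224 down to 7, which is where the 512 × 7 × 7 input to the fully connected layers comes from.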
MaxPool (max pooling) and AvgPool (average pooling):
MaxPool -> takes the maximum value within the window as the output
AvgPool -> takes the average value within the window as the output
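The two pooling operations differ only in the function applied to each window. A minimal pure-Python sketch of 2×2 pooling with stride 2 on a 4×4 input (the helper `pool2x2` is hypothetical, written here just for illustration):

```python
def pool2x2(x, op):
    """Apply a 2x2, stride-2 pooling operation op over a 2D list."""
    h, w = len(x), len(x[0])
    return [[op([x[i][j], x[i][j + 1], x[i + 1][j], x[i + 1][j + 1]])
             for j in range(0, w, 2)]
            for i in range(0, h, 2)]

x = [[1, 2, 5, 6],
     [3, 4, 7, 8],
     [0, 1, 9, 2],
     [1, 2, 3, 4]]

print(pool2x2(x, max))                   # [[4, 8], [2, 9]]   max of each window
print(pool2x2(x, lambda w: sum(w) / 4))  # [[2.5, 6.5], [1.0, 4.5]]  mean of each window
```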
Data dimensions in a convolutional neural network
The data dimensions are [B, H, W, C] (channels-last; note that PyTorch tensors actually use the channels-first order [B, C, H, W]):
B: batch (number of images)
H: height
W: width
C: channel (number of channels)
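The two layouts can be converted with a transpose. A small sketch using NumPy, for a hypothetical batch of 2 RGB images of size 224 × 224:

```python
import numpy as np

# Channels-last [B, H, W, C] -> channels-first [B, C, H, W]
batch_nhwc = np.zeros((2, 224, 224, 3))
batch_nchw = batch_nhwc.transpose(0, 3, 1, 2)
print(batch_nchw.shape)  # (2, 3, 224, 224)
```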