Yangyehan&UndGround.

VGG Convolutional Neural Network Notes

2023/07/26

VGG16

Introduction

The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) is an annual computer vision competition. Its tasks include object detection, which covers 200 object classes; object localization; and image classification, where each image is labeled with one of 1000 categories. VGG16 was proposed by Karen Simonyan and Andrew Zisserman of the Visual Geometry Group at the University of Oxford in 2014, in the paper “Very Deep Convolutional Networks for Large-Scale Image Recognition”. The model took 1st place in localization and 2nd place in classification in the 2014 ILSVRC challenge.

The Architecture

image-20230731120351482

The input to the conv1 layer is a fixed-size 224 × 224 RGB image. The image passes through a stack of convolutional (conv.) layers whose filters have a very small receptive field: 3×3, the smallest size that can capture the notion of left/right, up/down, and center. One of the configurations also uses 1×1 convolution filters, which can be seen as a linear transformation of the input channels (followed by a non-linearity). The convolution stride is fixed at 1 pixel, and the spatial padding of each conv. layer's input is chosen so that the spatial resolution is preserved after convolution, i.e. 1 pixel of padding for 3×3 conv. layers. Spatial pooling is carried out by five max-pooling layers, which follow some of the conv. layers (not every conv. layer is followed by max-pooling). Max-pooling is performed over a 2×2 pixel window with stride 2.
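
To make the effect of these settings concrete, the sketch below traces the spatial resolution through the five pooling stages in plain Python. The helper names `conv2d_out` and `trace_vgg_resolution` are illustrative and come from neither the paper nor PyTorch; the size rule used is the standard floor((n + 2p - k) / s) + 1.

```python
def conv2d_out(size, kernel, stride=1, padding=0):
    """Output length along one spatial axis for a convolution or pooling window."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_vgg_resolution(size=224, num_pools=5):
    """Return the feature-map side length at the input and after each pooling stage."""
    sizes = [size]
    for _ in range(num_pools):
        # 3x3 convolution, stride 1, padding 1: the resolution is unchanged.
        size = conv2d_out(size, kernel=3, stride=1, padding=1)
        # 2x2 max-pooling, stride 2, no padding: the resolution is halved.
        size = conv2d_out(size, kernel=2, stride=2, padding=0)
        sizes.append(size)
    return sizes


print(trace_vgg_resolution())
```

Any number of 3×3 convolutions can be stacked inside a stage without changing the resolution, so only the five pooling layers shrink the 224-pixel input.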

Three Fully-Connected (FC) layers follow a stack of convolutional layers (which has a different depth in different architectures): the first two have 4096 channels each, the third performs 1000-way ILSVRC classification and thus contains 1000 channels (one for each class). The final layer is the soft-max layer. The configuration of the fully connected layers is the same in all networks.
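
As a small illustration of the final soft-max step, the following plain-Python sketch turns raw class scores into probabilities. The function name `softmax` and the three example scores are made up for illustration; a real VGG output has 1000 scores. Note that torchvision's `vgg16` ends with the last fully connected layer and returns the raw scores, leaving the soft-max to the caller or the loss function.

```python
import math


def softmax(scores):
    """Convert raw class scores (logits) into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() from overflowing and does not change the result.
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])  # hypothetical scores for three classes
print(probs)
print("predicted class index:", probs.index(max(probs)))
```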

All hidden layers use the rectification (ReLU) non-linearity. None of the networks except one contains Local Response Normalisation (LRN): such normalization does not improve performance on the ILSVRC dataset, but it increases memory consumption and computation time.

ConvNet Configuration

Configuration: The table below lists the different VGG architectures. There are two 16-layer versions, C and D. They differ only in that D uses 3×3 convolutions in the three layers where C uses 1×1 convolutions. The two contain 134 million and 138 million parameters respectively.
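
The two parameter counts can be checked by hand. The sketch below is plain Python with no PyTorch involved; the list encoding and the name `count_parameters` are assumptions made for this example. It adds up weights and biases for every convolution and fully connected layer, assuming a 224 × 224 input so that the classifier receives 512 × 7 × 7 values.

```python
# Each entry is (kernel_size, out_channels) for a convolution, or "M" for max-pooling.
CONFIG_C = [(3, 64), (3, 64), "M", (3, 128), (3, 128), "M",
            (3, 256), (3, 256), (1, 256), "M",
            (3, 512), (3, 512), (1, 512), "M",
            (3, 512), (3, 512), (1, 512), "M"]
CONFIG_D = [(3, 64), (3, 64), "M", (3, 128), (3, 128), "M",
            (3, 256), (3, 256), (3, 256), "M",
            (3, 512), (3, 512), (3, 512), "M",
            (3, 512), (3, 512), (3, 512), "M"]


def count_parameters(config, in_channels=3, num_classes=1000):
    """Count weights plus biases of all conv and fully connected layers."""
    total = 0
    channels = in_channels
    for layer in config:
        if layer == "M":
            continue  # pooling layers have no parameters
        kernel, out_channels = layer
        total += kernel * kernel * channels * out_channels + out_channels
        channels = out_channels
    # Classifier: 512*7*7 -> 4096 -> 4096 -> num_classes
    fc_sizes = [channels * 7 * 7, 4096, 4096, num_classes]
    for fan_in, fan_out in zip(fc_sizes, fc_sizes[1:]):
        total += fan_in * fan_out + fan_out
    return total


print("C:", count_parameters(CONFIG_C))
print("D:", count_parameters(CONFIG_D))
```

Most of the parameters sit in the first fully connected layer (25088 × 4096 weights), which is why swapping three 1×1 convolutions for 3×3 ones changes the total by only about 4 million.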

image-20230731121028553

VGG16 uses configuration D

image-20230731120430851

# First, take a look at PyTorch's built-in VGG16
import torch
import torch.nn as nn
import torchvision

model = torchvision.models.vgg16()
print(model)

image-20230728183925956

VGG(
  (features): Sequential(
    (0): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (1): ReLU(inplace=True)
    (2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (3): ReLU(inplace=True)
    (4): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (5): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (6): ReLU(inplace=True)
    (7): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (8): ReLU(inplace=True)
    (9): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (10): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (11): ReLU(inplace=True)
    (12): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (13): ReLU(inplace=True)
    (14): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (15): ReLU(inplace=True)
    (16): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (17): Conv2d(256, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (18): ReLU(inplace=True)
    (19): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (20): ReLU(inplace=True)
    (21): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (22): ReLU(inplace=True)
    (23): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (24): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (25): ReLU(inplace=True)
    (26): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (27): ReLU(inplace=True)
    (28): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (29): ReLU(inplace=True)
    (30): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  )
  (avgpool): AdaptiveAvgPool2d(output_size=(7, 7))
  (classifier): Sequential(
    (0): Linear(in_features=25088, out_features=4096, bias=True)
    (1): ReLU(inplace=True)
    (2): Dropout(p=0.5, inplace=False)
    (3): Linear(in_features=4096, out_features=4096, bias=True)
    (4): ReLU(inplace=True)
    (5): Dropout(p=0.5, inplace=False)
    (6): Linear(in_features=4096, out_features=1000, bias=True)
  )
)

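## Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
Creates a 2D convolutional layer with 3 input channels and 64 output channels. It filters with a 3x3 kernel at stride 1, padding the border with 1 pixel of zeros, so it maps the input feature map to a 64-channel feature map of the same height and width.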
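
To show what one such filter does, here is a minimal plain-Python sketch of a stride-1 convolution with zero padding on a single channel. It is a simplification: the function name `conv2d_single_channel` is hypothetical, and a real `Conv2d(3, 64, ...)` additionally sums over the 3 input channels, adds a bias, and repeats this for each of the 64 output channels.

```python
def conv2d_single_channel(image, kernel, padding=1):
    """Stride-1 2D convolution (cross-correlation) of one channel with zero padding."""
    height, width = len(image), len(image[0])
    k = len(kernel)
    # Surround the image with `padding` rows and columns of zeros.
    padded = [[0] * (width + 2 * padding) for _ in range(height + 2 * padding)]
    for i in range(height):
        for j in range(width):
            padded[i + padding][j + padding] = image[i][j]
    out_h = height + 2 * padding - k + 1
    out_w = width + 2 * padding - k + 1
    # Each output value is the weighted sum of one k x k window.
    return [
        [
            sum(padded[i + di][j + dj] * kernel[di][dj]
                for di in range(k) for dj in range(k))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
identity = [[0, 0, 0],
            [0, 1, 0],
            [0, 0, 0]]
output = conv2d_single_channel(image, identity)
print(len(output), len(output[0]))  # same 4 x 4 size as the input
```

With a 3x3 kernel and 1 pixel of padding the output keeps the input's height and width, which is the property the VGG convolutions rely on.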

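## MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
kernel_size=2: the size of the pooling window. A value of 2 means a 2x2 window.

stride=2: how far the pooling window moves across the feature map at each step. A value of 2 means the window shifts 2 pixels horizontally and vertically.

padding=0: the amount of implicit padding added to each edge of the feature map (PyTorch pads max-pooling inputs with negative infinity rather than zeros). A value of 0 means no padding.

dilation=1: the spacing between elements inside the pooling window. A value of 1 means no dilation.

ceil_mode=False: controls how the output size is rounded when the window does not tile the feature map evenly. False means the size is rounded down rather than up.

In summary, MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False) creates a max-pooling layer with a 2x2 window and stride 2, without padding or dilation, rounding the output size down. The input feature map is split into non-overlapping 2x2 regions and the maximum of each region is kept, so the output has half the height and half the width of the input. Max pooling is a common way to shrink feature maps while keeping the strongest activations.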
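
The behaviour described above can be reproduced in a few lines of plain Python. This is only a sketch for a single channel without padding or dilation; the name `max_pool2d` and the sample values are chosen for this example.

```python
def max_pool2d(image, kernel_size=2, stride=2):
    """Max pooling on one channel, no padding, output size rounded down."""
    height, width = len(image), len(image[0])
    out_h = (height - kernel_size) // stride + 1
    out_w = (width - kernel_size) // stride + 1
    return [
        [
            max(image[i * stride + di][j * stride + dj]
                for di in range(kernel_size) for dj in range(kernel_size))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


feature_map = [[1, 3, 2, 4],
               [5, 6, 7, 8],
               [9, 2, 1, 0],
               [3, 4, 5, 6]]
pooled = max_pool2d(feature_map)
print(pooled)
```

For an odd-sized input such as 5x5, rounding down means the last row and column are simply dropped and the output is 2x2.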

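## (avgpool): AdaptiveAvgPool2d(output_size=(7, 7))
nn.AdaptiveAvgPool2d(output_size=(7, 7)) constructs PyTorch's adaptive average pooling layer. Its one parameter is:

output_size=(7, 7): the size of the output feature map. (7, 7) means the output height and width are both 7.

Adaptive average pooling reduces feature maps of varying input size to a specified output size. This is useful because the pooling window and stride are derived from the input size instead of being set by hand.

Concretely, nn.AdaptiveAvgPool2d pools an input feature map of any size down to the requested output size. Here the output is always 7x7, whatever the input size. This is commonly used to give the fully connected layers a fixed-size input even when the images or feature maps vary in size.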
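
A rough plain-Python sketch of the idea follows. It assumes the windows are chosen as start = floor(i * in / out) and end = ceil((i + 1) * in / out), which to my understanding matches how PyTorch picks them; treat that as an assumption rather than a statement about the library's internals. The function name `adaptive_avg_pool2d` here is a stand-in for illustration and handles one channel only.

```python
import math


def adaptive_avg_pool2d(image, output_size):
    """Average-pool one channel to exactly output_size = (out_h, out_w)."""
    in_h, in_w = len(image), len(image[0])
    out_h, out_w = output_size
    result = []
    for i in range(out_h):
        # Row range covered by output row i (assumed windowing rule, see lead-in).
        r0 = (i * in_h) // out_h
        r1 = math.ceil((i + 1) * in_h / out_h)
        row = []
        for j in range(out_w):
            c0 = (j * in_w) // out_w
            c1 = math.ceil((j + 1) * in_w / out_w)
            window = [image[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            row.append(sum(window) / len(window))
        result.append(row)
    return result


# A 14x14 map is reduced to 7x7; a 7x7 map passes through unchanged.
big = [[r * 14 + c for c in range(14)] for r in range(14)]
pooled = adaptive_avg_pool2d(big, (7, 7))
print(len(pooled), len(pooled[0]))
```

With the standard 224 × 224 input the feature map is already 7x7 when it reaches this layer, so the layer only matters for other input sizes.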
import torchsummary
# Parameter count: about 138 million
torchsummary.summary(model, input_size= (3,224,224), batch_size =2, device='cpu')

"""
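torchsummary is a Python library that displays a summary of a PyTorch model. It offers a simple, convenient way to inspect and verify a model's structure, including each layer's output shape and parameter count, which helps when debugging and optimizing deep learning models.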
"""

image-20230728175818510

Code

import torch
import torch.nn as nn
import torchsummary


def showmodel(model):
    # Print the layer structure, then a per-layer summary.
    print(model)
    # Parameter count: about 138 million
    torchsummary.summary(model, input_size=(3, 224, 224), batch_size=2, device='cpu')


class Vgg16(nn.Module):
    # Every convolution uses a 3x3 kernel with stride 1 and padding 1,
    # and every convolution is followed by a ReLU.
    def __init__(self, in_channel=3, out_channel=1000, num_hidden=25088):
        super(Vgg16, self).__init__()
        self.features = nn.Sequential(
            # block 1
            nn.Conv2d(in_channel, 64, 3, 1, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, 3, 1, 1),
            nn.ReLU(inplace=True),
            # 2x2 pooling window with stride 2
            nn.MaxPool2d(2, 2),

            # block 2
            nn.Conv2d(64, 128, 3, 1, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(128, 128, 3, 1, 1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),

            # block 3
            nn.Conv2d(128, 256, 3, 1, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, 3, 1, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, 3, 1, 1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),

            # block 4
            nn.Conv2d(256, 512, 3, 1, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(512, 512, 3, 1, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(512, 512, 3, 1, 1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),

            # block 5 (its input already has 512 channels)
            nn.Conv2d(512, 512, 3, 1, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(512, 512, 3, 1, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(512, 512, 3, 1, 1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),
        )

        # Adaptive average pooling layer.
        # output_size=(7, 7) means the pooled feature map is 7x7.
        self.avgpool = nn.AdaptiveAvgPool2d(output_size=(7, 7))

        # Classifier made of several fully connected layers.
        # The first Linear maps num_hidden (512 * 7 * 7 = 25088) inputs to 4096 outputs.
        # ReLU zeroes all negative values and keeps positive values unchanged.
        # Dropout randomly zeroes a fraction of activations during training to
        # reduce overfitting. No rate is given here, so PyTorch's default p=0.5 is used.
        self.classifier = nn.Sequential(
            nn.Linear(num_hidden, 4096),
            nn.ReLU(),
            nn.Dropout(),

            nn.Linear(4096, 4096),
            nn.ReLU(),
            nn.Dropout(),

            nn.Linear(4096, out_channel),
        )

    def forward(self, x):
        x = self.features(x)
        x = self.avgpool(x)
        # Flatten everything except the batch dimension: [B, 512, 7, 7] -> [B, 25088]
        x = torch.flatten(x, 1)
        x = self.classifier(x)
        return x


vgg = Vgg16(3, 1000, 25088)
showmodel(vgg)

Computing the feature map size:

image-20230814113602849
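
For reference, the standard output-size rule for a convolution or pooling layer can be written as follows, where $I$ is the input size, $K$ the kernel size, $P$ the padding and $S$ the stride (symbol names chosen here for illustration). The worked lines apply it to the VGG16 settings and show where the classifier's 25088 inputs come from.

```latex
O = \left\lfloor \frac{I + 2P - K}{S} \right\rfloor + 1

% 3x3 convolution, stride 1, padding 1: the size is preserved
O = \left\lfloor \frac{224 + 2 \cdot 1 - 3}{1} \right\rfloor + 1 = 224

% 2x2 max-pooling, stride 2, no padding: the size is halved
O = \left\lfloor \frac{224 + 2 \cdot 0 - 2}{2} \right\rfloor + 1 = 112

% five pooling stages: 224 / 2^5 = 7, so the flattened classifier input is
512 \times 7 \times 7 = 25088
```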

MaxPool (max pooling) and AvgPool (average pooling)

MaxPool -> outputs the maximum value inside the window

AvgPool -> outputs the mean of the values inside the window

Data dimensions in a convolutional neural network

Data dimensions: [B, H, W, C]

B: batch, the number of images

H: height

W: width

C: channel, the number of channels

image-20230815160110735
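
One practical caveat: PyTorch's `nn.Conv2d` expects its input laid out as [B, C, H, W], so data stored as [B, H, W, C] has to be reordered first. On tensors this is usually done with `torch.Tensor.permute`; the plain-Python sketch below shows the same reordering for a single image using nested lists. The helper name `hwc_to_chw` is made up for this example.

```python
def hwc_to_chw(image):
    """Reorder one image from [H][W][C] nesting to [C][H][W] nesting."""
    height, width, channels = len(image), len(image[0]), len(image[0][0])
    return [
        [[image[y][x][c] for x in range(width)] for y in range(height)]
        for c in range(channels)
    ]


# A 2x2 image with 3 channel values per pixel.
image_hwc = [[[1, 2, 3], [4, 5, 6]],
             [[7, 8, 9], [10, 11, 12]]]
image_chw = hwc_to_chw(image_hwc)
print(image_chw)
```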
