yolov3_PyTorch

A walkthrough of the YOLOv3 code


1.yolov3.cfg

[convolutional]
batch_normalize=1
filters=64
size=3
stride=2
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=32
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=64
size=3
stride=1
pad=1
activation=leaky

[shortcut]
from=-3
activation=linear

The YOLOv3 config file defines five types of layers:

convolutional

shortcut

[shortcut]
from=-3
activation=linear

This is a skip-connection layer, like the ones used in ResNet. from=-3 means the output of this layer is produced by adding the output of the previous layer to the output of the layer three back. A sketch of the corresponding forward logic follows.
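A minimal sketch of how this could be handled in the forward pass (assuming the parsed block dict is available as `block` and `outputs` is a dict caching each layer's feature map; the names are illustrative, not taken from the repo):

#hypothetical shortcut handling; `outputs` caches each layer's feature map
from_ = int(block["from"])                     # e.g. -3
feat = outputs[i - 1] + outputs[i + from_]     # element-wise sum; the two shapes must match
outputs[i] = feat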

upsample

[upsample]
stride=2

Bilinear interpolation with a scale factor of 2.

Route

[route]
layers = -4

[route]
layers = -1, 61

If the layers attribute has a single value, the layer outputs the feature map of the layer that offset points to; here -4 means the feature map from 4 layers back.

If layers has two values, the layer outputs the two referenced feature maps concatenated along the channel dimension; here -1, 61 means the previous layer's output concatenated with that of layer 61. A sketch of both cases follows.
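A rough sketch of the corresponding forward logic (same assumptions as the shortcut sketch above; illustrative only):

#hypothetical route handling; `outputs` caches each layer's feature map
layers = [int(l) for l in block["layers"].split(",")]
layers = [l if l < 0 else l - i for l in layers]   # turn absolute indices into offsets from layer i
if len(layers) == 1:
    feat = outputs[i + layers[0]]                  # reuse an earlier feature map as-is
else:
    feat = torch.cat((outputs[i + layers[0]], outputs[i + layers[1]]), 1)   # channel concat
outputs[i] = feat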

Yolo

[yolo]
mask = 0,1,2
anchors = 10,13, 16,30, 33,23, 30,61, 62,45, 59,119, 116,90, 156,198, 373,326
classes=80
num=9
jitter=.3
ignore_thresh = .5
truth_thresh = 1
random=1

The yolo block is the Detection layer described earlier. anchors lists all nine anchors, but only the ones indexed by mask are used; here mask = 0,1,2 means the 1st, 2nd and 3rd anchors are used. Each cell of the detection layer predicts 3 boxes, detection runs at 3 scales, so all 9 anchors are used in total.

net

[net]
# Testing
batch=1
subdivisions=1
# Training
# batch=64
# subdivisions=16
width= 320
height = 320
channels=3
momentum=0.9
decay=0.0005
angle=0
saturation = 1.5
exposure = 1.5
hue=.1

At the very start of the config file there is a net block, which describes the network input and training parameters. It is not a layer and is skipped in the forward pass. A sketch of how the cfg file can be parsed into blocks follows.
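Before any modules can be created, the cfg file has to be parsed into a list of blocks. A minimal sketch of such a parser (the function name and details are assumptions, not taken verbatim from the repo):

def parse_cfg(cfgfile):
    """Parse the cfg file into a list of dicts, one dict per block."""
    with open(cfgfile) as f:
        lines = [line.strip() for line in f]
    lines = [l for l in lines if l and not l.startswith('#')]   # drop blanks and comments

    blocks, block = [], {}
    for line in lines:
        if line.startswith('['):                  # '[convolutional]' starts a new block
            if block:
                blocks.append(block)              # flush the previous block
            block = {"type": line[1:-1].strip()}
        else:
            key, value = line.split('=', 1)
            block[key.strip()] = value.strip()
    blocks.append(block)
    return blocks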

2.create module

1.convolutional layers

if (x["type"] == "convolutional"):
    #Get the info about the layer
    activation = x["activation"]
    try:
        batch_normalize = int(x["batch_normalize"])
        bias = False
    except:
        batch_normalize = 0
        bias = True

    filters = int(x["filters"])
    padding = int(x["pad"])
    kernel_size = int(x["size"])
    stride = int(x["stride"])

    if padding:
        pad = (kernel_size - 1) // 2
    else:
        pad = 0

    #Add the convolutional layer
    conv = nn.Conv2d(prev_filters, filters, kernel_size, stride, pad, bias = bias)
    module.add_module("conv_{0}".format(index), conv)

    #Add the Batch Norm Layer
    if batch_normalize:
        bn = nn.BatchNorm2d(filters)
        module.add_module("batch_norm_{0}".format(index), bn)

    #Check the activation.
    #It is either Linear or a Leaky ReLU for YOLO
    if activation == "leaky":
        activn = nn.LeakyReLU(0.1, inplace = True)
        module.add_module("leaky_{0}".format(index), activn)

2.upsample

#If it's an upsampling layer
#We use Bilinear2dUpsampling
elif (x["type"] == "upsample"):
    stride = int(x["stride"])
    upsample = nn.Upsample(scale_factor = 2, mode = "bilinear")
    module.add_module("upsample_{}".format(index), upsample)

3.yolo

#Yolo is the detection layer
elif x["type"] == "yolo":
    mask = x["mask"].split(",")
    mask = [int(x) for x in mask]

    anchors = x["anchors"].split(",")
    anchors = [int(a) for a in anchors]
    anchors = [(anchors[i], anchors[i+1]) for i in range(0, len(anchors), 2)]
    anchors = [anchors[i] for i in mask]

    detection = DetectionLayer(anchors)
    module.add_module("Detection_{}".format(index), detection)

yolo is the detection layer; the three detection layers sit at layers 82, 94 and 106.
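DetectionLayer referenced above is not defined in this walkthrough; a minimal placeholder sketch consistent with how it is used (an assumption, not the repo's exact code):

import torch.nn as nn

class DetectionLayer(nn.Module):
    """Thin placeholder module that only stores the anchors for one yolo layer."""
    def __init__(self, anchors):
        super(DetectionLayer, self).__init__()
        self.anchors = anchors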

3.output processing

The center coordinates and the objectness score are passed through a sigmoid, mapping them into (0, 1):

#Sigmoid the centre_X, centre_Y and object confidence
prediction[:,:,0] = torch.sigmoid(prediction[:,:,0])   #each row: tx, ty, tw, th, p0, p1, ..., pc
prediction[:,:,1] = torch.sigmoid(prediction[:,:,1])
prediction[:,:,4] = torch.sigmoid(prediction[:,:,4])

Add the grid-cell offsets to the center coordinates:

#Add the center offsets
grid = np.arange(grid_size)
a, b = np.meshgrid(grid, grid)

x_offset = torch.FloatTensor(a).view(-1,1)
y_offset = torch.FloatTensor(b).view(-1,1)

if CUDA:
    x_offset = x_offset.cuda()
    y_offset = y_offset.cuda()

x_y_offset = torch.cat((x_offset, y_offset), 1).repeat(1,num_anchors).view(-1,2).unsqueeze(0)

prediction[:,:,:2] += x_y_offset
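One step this walkthrough skips: between the grid offsets and the class scores, the width and height predictions are put through a log-space transform with the anchors. A sketch consistent with the surrounding snippets (it assumes the anchors have already been divided by the stride):

#apply the anchors to the width/height predictions
anchors = torch.FloatTensor(anchors)
if CUDA:
    anchors = anchors.cuda()
anchors = anchors.repeat(grid_size*grid_size, 1).unsqueeze(0)   # one anchor pair per box
prediction[:,:,2:4] = torch.exp(prediction[:,:,2:4])*anchors    # bw = pw*e^tw, bh = ph*e^th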

Apply a sigmoid to the class scores:

prediction[:,:,5:5 + num_classes] = torch.sigmoid(prediction[:,:,5:5 + num_classes])

Scale the detections from feature-map coordinates back up to the input image size:

prediction[:,:,:4] *= stride

Finally, the detection maps from the different scales are concatenated into one big tensor:

#Transform
x = x.data
x = predict_transform(x, inp_dim, anchors, num_classes, CUDA)
if not write:   #if no collector has been initialised
    detections = x
    write = 1
else:
    detections = torch.cat((detections, x), 1)

The final prediction tensor has shape

$$1 \times 10647 \times 85$$

1-d: the image's index within the batch

2-d: $13\times13\times3+26\times26\times3+52\times52\times3=10647$ boxes

3-d: x, y, w, h, objectness_score, and the 80 class scores
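Where those shapes come from: inside predict_transform, each scale's raw map of shape (batch, 255, g, g) is flattened to (batch, g*g*3, 85) before any of the transforms above are applied. A sketch of that reshape (variable names follow the snippets above; the exact code is an assumption):

batch_size = prediction.size(0)
stride = inp_dim // prediction.size(2)      # e.g. 416 // 13 = 32
grid_size = inp_dim // stride
bbox_attrs = 5 + num_classes                # 85 for the 80 COCO classes
num_anchors = len(anchors)                  # 3 anchors per scale

#(B, 255, g, g) -> (B, g*g*3, 85)
prediction = prediction.view(batch_size, bbox_attrs*num_anchors, grid_size*grid_size)
prediction = prediction.transpose(1, 2).contiguous()
prediction = prediction.view(batch_size, grid_size*grid_size*num_anchors, bbox_attrs)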

4.non-maximum suppression

1.First, threshold on the object confidence:

The prediction tensor contains $B\times10647$ bounding boxes. For every box whose objectness score is below the 0.5 threshold, all of its attributes, i.e. the entire row representing that box, are set to 0.

conf_mask = (prediction[:,:,4] > confidence).float().unsqueeze(2)
#prediction[:,:,4] > confidence has shape [1, 10647]
#after unsqueeze(2) it becomes [1, 10647, 1]

prediction = prediction*conf_mask

2.Convert the box coordinates from (x, y, w, h) to corner form (x1, y1, x2, y2); a sketch of this conversion follows.
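This step is only mentioned here, not shown; a minimal sketch of the standard center-to-corner conversion:

#(center x, center y, w, h) -> (top-left x, top-left y, bottom-right x, bottom-right y)
box_corner = prediction.new(prediction.shape)
box_corner[:,:,0] = prediction[:,:,0] - prediction[:,:,2]/2
box_corner[:,:,1] = prediction[:,:,1] - prediction[:,:,3]/2
box_corner[:,:,2] = prediction[:,:,0] + prediction[:,:,2]/2
box_corner[:,:,3] = prediction[:,:,1] + prediction[:,:,3]/2
prediction[:,:,:4] = box_corner[:,:,:4]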

3.Process the objectness_score and class_score

Because the number of true detections can differ from image to image within a batch, NMS has to be done one image at a time; in other words, the NMS step cannot be vectorized across the batch.

batch_size = prediction.size(0)

write = False

for ind in range(batch_size):
    image_pred = prediction[ind]   #image tensor with 10647 boxes, shape 10647 x 85
    #confidence thresholding
    #NMS

The write flag indicates that the output tensor, output, has not been initialised yet.

We only care about the largest class score, i.e. the maximum over entries [5 : 5 + num_classes]. So in each row the 80 class scores are dropped and replaced by the index of the highest-scoring class plus the score of that class.

max_conf, max_conf_score = torch.max(image_pred[:,5:5 + num_classes], 1)
#torch.max returns two tensors: the max values and their indices;
#dim=1 takes the maximum over each row.
#both results are 10647-d vectors:
#max_conf: the highest class score
#max_conf_score: the index of that class

max_conf = max_conf.float().unsqueeze(1)
#unsqueeze(1) inserts a size-1 dimension at position 1,
#e.g. max_conf goes from shape [10647] to [10647, 1]
#(squeeze is the inverse: it removes size-1 dimensions, e.g. [1, 3] -> [3])

max_conf_score = max_conf_score.float().unsqueeze(1)

seq = (image_pred[:,:5], max_conf, max_conf_score)
#seq: three tensors of shapes [10647, 5], [10647, 1], [10647, 1]
#the first 5 columns are the box coordinates (x1, y1, x2, y2) and the objectness score

image_pred = torch.cat(seq, 1)
#torch.cat(..., 1) concatenates along dimension 1:
#[10647, 5] + [10647, 1] + [10647, 1] -> image_pred: [10647, 7]

The rows with objectness_score < 0.5 were already zeroed out above; now remove them entirely, i.e. drop those bounding boxes.

non_zero_ind = (torch.nonzero(image_pred[:,4]))
#[10647, 1] -> e.g. [15, 1]: only 15 boxes survive
try:
    image_pred_ = image_pred[non_zero_ind.squeeze(),:].view(-1,7)
    #[10647, 7] -> [15, 7]
    #in the running example all 15 boxes have class index 17 (dog)
except:
    continue

#Get the various classes detected in the image
img_classes = unique(image_pred_[:,-1])
#deduplicate to get the set of detected classes

#For PyTorch 0.4 compatibility
#Since the above code will not raise an exception for no detection,
#as scalars are supported in PyTorch 0.4
if image_pred_.shape[0] == 0:
    continue
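unique above is a small helper, not a torch builtin; one possible implementation matching how it is called here (a sketch, round-tripping through numpy):

import numpy as np
import torch

def unique(tensor):
    """Deduplicate a 1-d tensor, preserving its dtype and device."""
    tensor_np = tensor.cpu().numpy()
    unique_np = np.unique(tensor_np)
    unique_tensor = torch.from_numpy(unique_np)

    tensor_res = tensor.new(unique_tensor.shape)
    tensor_res.copy_(unique_tensor)
    return tensor_res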

4.Then run non-maximum suppression

First, pull out the detections belonging to one particular detected class:

for cls in img_classes:
    #get the detections with one particular class
    cls_mask = image_pred_*(image_pred_[:,-1] == cls).float().unsqueeze(1)
    class_mask_ind = torch.nonzero(cls_mask[:,-2]).squeeze()

    image_pred_class = image_pred_[class_mask_ind].view(-1,7)

Sort them by objectness_score from high to low:

conf_sort_index = torch.sort(image_pred_class[:,4], descending = True )[1]
image_pred_class = image_pred_class[conf_sort_index]
idx = image_pred_class.size(0)

The function that computes IoU:

def bbox_iou(box1, box2):
    #Get the coordinates of the bounding boxes
    b1_x1, b1_y1, b1_x2, b1_y2 = box1[:,0], box1[:,1], box1[:,2], box1[:,3]
    b2_x1, b2_y1, b2_x2, b2_y2 = box2[:,0], box2[:,1], box2[:,2], box2[:,3]

    #get the coordinates of the intersection rectangle
    inter_rect_x1 = torch.max(b1_x1, b2_x1)
    inter_rect_y1 = torch.max(b1_y1, b2_y1)
    inter_rect_x2 = torch.min(b1_x2, b2_x2)
    inter_rect_y2 = torch.min(b1_y2, b2_y2)

    #Intersection area (clamped at zero for non-overlapping boxes)
    if torch.cuda.is_available():
        inter_area = torch.max(inter_rect_x2 - inter_rect_x1 + 1, torch.zeros(inter_rect_x2.shape).cuda())*torch.max(inter_rect_y2 - inter_rect_y1 + 1, torch.zeros(inter_rect_x2.shape).cuda())
    else:
        inter_area = torch.max(inter_rect_x2 - inter_rect_x1 + 1, torch.zeros(inter_rect_x2.shape))*torch.max(inter_rect_y2 - inter_rect_y1 + 1, torch.zeros(inter_rect_x2.shape))

    #Union area
    b1_area = (b1_x2 - b1_x1 + 1)*(b1_y2 - b1_y1 + 1)
    b2_area = (b2_x2 - b2_x1 + 1)*(b2_y2 - b2_y1 + 1)

    iou = inter_area / (b1_area + b2_area - inter_area)

    return iou

The non-maximum suppression loop:

if nms:
    #For each detection
    for i in range(idx):
        #Get the IoUs of all boxes that come after the one we are looking at
        #in the loop
        try:
            ious = bbox_iou(image_pred_class[i].unsqueeze(0), image_pred_class[i+1:])
            #IoU of box i against every remaining box
        except ValueError:
            break
        except IndexError:
            break

        #Zero out all the detections that have IoU > threshold
        iou_mask = (ious < nms_conf).float().unsqueeze(1)
        #nms_conf = 0.4
        #example: detection 0 has the highest objectness_score; compute the IoU of the
        #remaining 14 boxes against it. Boxes with IoU < 0.4 are kept, the rest are
        #zeroed out, so maybe only 11 boxes survive into the next iteration.
        image_pred_class[i+1:] *= iou_mask

        #Remove the zeroed-out entries
        non_zero_ind = torch.nonzero(image_pred_class[:,4]).squeeze()
        image_pred_class = image_pred_class[non_zero_ind].view(-1,7)
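After NMS, each surviving detection still has to be tagged with its image index and appended to the output tensor. A sketch of that final collection step, following the write pattern shown earlier (batch_ind and output are illustrative names):

#prefix each surviving row of this class with the image's index in the batch
batch_ind = image_pred_class.new(image_pred_class.size(0), 1).fill_(ind)
seq = (batch_ind, image_pred_class)

if not write:
    output = torch.cat(seq, 1)
    write = True
else:
    out = torch.cat(seq, 1)
    output = torch.cat((output, out))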

Summary

  1. Filter by the objectness_score threshold, zeroing out the whole row of every box that fails; in the running example 15 boxes remain.

  2. Process the class_scores of the remaining boxes: take the maximum to get each box's class.

  3. Run non-maximum suppression for each class:

    a. sort that class's boxes by objectness_score from high to low,

    b. compute the IoU of box i with boxes i+1 through the last, and drop the boxes whose IoU exceeds the threshold,

    c. then iterate.