
Overview of R-CNN structure

image

image


Fast RCNN

Fast RCNN was designed to solve the problems of RCNN and SPPnet. Let us look at each of those structures, check the problems they run into, and then work through Fast RCNN.

RCNN

image

image

Structure of RCNN

  1. Generate RoIs from the input image with Selective Search.
  2. Crop and resize (warp) each RoI (Region of Interest).
  3. Pass each warped RoI through the CNN to produce a feature map.
  4. Save the resulting feature maps to a cache.
  5. Classify with the classifiers (SVMs).
  6. Localize the objects with the Bounding-box Regressor (Bbox reg).

Problems of RCNN

  1. Information is lost when each RoI is warped (cropped and resized) to fit the CNN input.

  2. Training is a multi-stage pipeline. The RCNN pipeline runs several separate stages: a) extracting RoIs with Selective Search, b) extracting features with the CNN, and c) classification and bounding-box regression for each task. Such a multi-stage pipeline cannot be trained end-to-end, so training one stage does not update the others.

  3. Training is expensive in space and time. The features the CNN produces for every RoI must all be collected and fed to the SVMs and the Bbox reg, so they have to be stored first. This requires a large amount of storage.

  4. Object detection is slow. RoIs are extracted for every test image and each RoI is passed through the CNN separately, so the amount of computation grows with the number of RoIs. This raises both the space and time cost and makes detection slow.


SPPnet

image

Structure of SPPnet

  1. Feed the input image to the CNN.
  2. Select RoIs with a proposal method such as Selective Search and locate them on the feature map extracted by the CNN.
  3. Apply the Spatial Pyramid Pooling layer to each RoI.
  4. Pass the result through the fully-connected layers and save the output to a cache (because the SVMs and the Bbox reg do not share computation).
  5. Classify with the SVMs.
  6. Obtain the bounding boxes with the Bbox reg and Non Maximum Suppression.
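The Non Maximum Suppression step in the list above can be sketched as follows. This is a minimal greedy version, assuming boxes in `(x1, y1, x2, y2)` form; the function names and the `iou_threshold=0.3` default are illustrative choices, not values taken from the SPPnet paper.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.3):
    """Greedy NMS: keep the best-scoring box, drop boxes overlapping it too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Keep only the candidates that do not overlap the chosen box too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


# Hypothetical detections: the first two boxes overlap heavily.
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)  # indices of the boxes that survive
```

Here the second box is suppressed because it overlaps the higher-scoring first box, while the third box is kept.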

Improvements of SPPnet over RCNN

  • SPPnet, which introduced Spatial Pyramid Pooling (SPP), improved on the following problems of RCNN.

image

  1. The high cost of applying the CNN to every RoI
  • RCNN → applies the CNN to each RoI extracted by Selective Search. → very high complexity
  • SPPnet → runs the CNN once on the whole image and takes each Selective Search RoI from the resulting feature map. → lower complexity
  2. Information loss caused by warping (crop, resize)
  • RCNN → uses warping to build the fixed-length vector fed to the FC layer. → loss of information and accuracy
  • SPPnet → uses Spatial Pyramid Pooling (SPP) to build a fixed-length vector without warping. → no information is lost to warping, and spatial information at several scales is captured.
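How SPP produces a fixed-length vector without warping can be illustrated with a small pure-Python sketch. It assumes a single-channel feature map stored as a list of rows and pyramid levels of 1x1, 2x2 and 4x4 bins; those levels and the function names are illustrative assumptions, not the configuration used in SPPnet.

```python
import math


def max_pool_to_grid(fmap, bins):
    """Max-pool a 2D feature map (list of rows) into a bins x bins grid."""
    h, w = len(fmap), len(fmap[0])
    out = []
    for by in range(bins):
        y0 = math.floor(by * h / bins)
        y1 = max(math.ceil((by + 1) * h / bins), y0 + 1)  # at least one row
        for bx in range(bins):
            x0 = math.floor(bx * w / bins)
            x1 = max(math.ceil((bx + 1) * w / bins), x0 + 1)  # at least one column
            out.append(max(fmap[y][x] for y in range(y0, y1) for x in range(x0, x1)))
    return out


def spatial_pyramid_pool(fmap, levels=(1, 2, 4)):
    """Concatenate max-pooled grids from every pyramid level into one vector."""
    vec = []
    for bins in levels:
        vec.extend(max_pool_to_grid(fmap, bins))
    return vec


# Two regions with different shapes (5x7 and 9x6) give the same output length.
small = [[r * 7 + c for c in range(7)] for r in range(5)]
large = [[r * 6 + c for c in range(6)] for r in range(9)]
vec_small = spatial_pyramid_pool(small)
vec_large = spatial_pyramid_pool(large)
```

Both vectors have 1 + 4 + 16 = 21 entries, which is why the FC layers can accept RoIs of any size.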

Problems of SPPnet

  • Like RCNN, it is made of several pipeline stages: a) feature extraction with the CNN, b) RoI extraction with Selective Search, and c) classification and bounding-box regression for each task.
  • SPPnet also has to store the features of each RoI after they pass through the SPP and FC layers.

Overall Structure of Fast RCNN

  • Fast RCNN was designed to solve the problems of RCNN and SPPnet at the same time.

image

image

  1. Extract RoIs (on the image) from the whole input image with Selective Search.
  2. Feed the whole input image to the CNN to extract a feature map.
  3. The feature map that comes out of the CNN is smaller than the input (e.g. 14x14), so project each RoI onto the matching location of the reduced feature map.
  4. Apply RoI Pooling to the projected RoI (on the feature map) to produce the fixed-length vector required as FC layer input, then pass it through the FC layers.
  5. Use softmax instead of SVMs to classify which object it is.
  6. Refine the detected box with the Bbox reg.
  • Fast RCNN used VGGNet as its CNN.
  • The last FC layer is split into two FC layers: the first is used for classification through softmax, and the second for bounding-box regression (Bbox reg).
  • As the connections in the figure above show, Fast RCNN passes the same feature vector to both the softmax and the bbox regressor, and the whole structure is connected as a single stage.
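The two outputs described above, class probabilities from softmax and a box refinement from the Bbox reg, can be sketched in a few lines. The box parameterization below (center shift scaled by the proposal size, log-space width and height) is the one commonly used for R-CNN-style regressors and is stated here as an assumption; the function names are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def apply_bbox_deltas(box, deltas):
    """Refine a proposal (x1, y1, x2, y2) with offsets (tx, ty, tw, th)."""
    x1, y1, x2, y2 = box
    tx, ty, tw, th = deltas
    w, h = x2 - x1, y2 - y1
    cx, cy = x1 + 0.5 * w, y1 + 0.5 * h
    # Shift the center relative to the box size, scale width/height in log space.
    new_cx, new_cy = cx + tx * w, cy + ty * h
    new_w, new_h = w * math.exp(tw), h * math.exp(th)
    return (new_cx - 0.5 * new_w, new_cy - 0.5 * new_h,
            new_cx + 0.5 * new_w, new_cy + 0.5 * new_h)


probs = softmax([1.0, 2.0, 3.0])                       # classification branch
unchanged = apply_bbox_deltas((0, 0, 10, 20), (0, 0, 0, 0))
shifted = apply_bbox_deltas((0, 0, 10, 20), (0.1, 0, 0, 0))
```

Zero offsets leave the proposal untouched, and `tx = 0.1` moves a 10-pixel-wide box one pixel to the right.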

Main Ideas of Fast RCNN

Let us look at the ideas Fast RCNN proposed.

RoI Pooling

RoI Pooling is applied at the point where the RoIs have been projected onto the feature map produced by the CNN. Projection means mapping the location of an RoI defined on the input image (e.g. 224x224) to the matching location on the feature map (e.g. 14x14). RoI Pooling is then applied to the RoI on the feature map.
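The projection can be sketched as a simple rescaling of the RoI coordinates. The sketch below assumes the feature map covers the whole image uniformly and rounds outward (floor for the start, ceil for the end); real implementations may round differently, and `project_roi` is a hypothetical helper name.

```python
import math


def project_roi(roi, image_size, fmap_size):
    """Map an RoI (x1, y1, x2, y2) in image pixels to feature-map cell indices.

    The returned box is end-exclusive and always covers at least one cell.
    """
    x1, y1, x2, y2 = roi
    img_h, img_w = image_size
    fmap_h, fmap_w = fmap_size
    sx = fmap_w / img_w  # horizontal scale, e.g. 14 / 224 = 1/16
    sy = fmap_h / img_h  # vertical scale
    fx1 = min(fmap_w - 1, int(math.floor(x1 * sx)))
    fy1 = min(fmap_h - 1, int(math.floor(y1 * sy)))
    fx2 = min(fmap_w, max(fx1 + 1, int(math.ceil(x2 * sx))))
    fy2 = min(fmap_h, max(fy1 + 1, int(math.ceil(y2 * sy))))
    return fx1, fy1, fx2, fy2


# An RoI on a 224x224 image projected onto a 14x14 feature map.
projected = project_roi((32, 48, 160, 200), (224, 224), (14, 14))
full = project_roi((0, 0, 224, 224), (224, 224), (14, 14))
```

With a 224x224 image and a 14x14 feature map, every coordinate is divided by 16, so the example RoI lands on columns 2 to 9 and rows 3 to 12.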

image

image

  • The figure above shows, as an example, an image passing through VGG up to the point where RoI Pooling is applied.
  1. The image passes through VGG and yields an 8x8 feature map.
  2. The RoI extracted from the original image by Selective Search is projected onto the feature map, where it occupies a 5x7 region.
  3. The region is split into a grid that matches the RoI Pooling output size. The figure above uses 2x2 RoI Pooling, and because h and w are both odd, the cells look asymmetric.
  4. Max pooling takes the maximum value inside each grid cell, producing an output of fixed size.
  • RoI Pooling is a special case of SPP (Spatial Pyramid Pooling).
  • The size of each grid cell is obtained as Projection_height / Pooling_height by Projection_width / Pooling_width.
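The steps above can be sketched in pure Python for the 8x8 feature map, 5x7 RoI and 2x2 output of the example. The cell boundaries here use integer division, which gives the uneven 2/3 and 3/4 splits; this boundary rule, the function name, and the synthetic feature-map values are assumptions made for illustration.

```python
def roi_max_pool(fmap, roi, out_h=2, out_w=2):
    """Max-pool the feature-map region roi=(x1, y1, x2, y2), end-exclusive.

    Assumes the region is at least out_h rows by out_w columns.
    """
    x1, y1, x2, y2 = roi
    h, w = y2 - y1, x2 - x1
    pooled = []
    for i in range(out_h):
        ys = y1 + (i * h) // out_h        # first row of this grid cell
        ye = y1 + ((i + 1) * h) // out_h  # one past the last row
        row = []
        for j in range(out_w):
            xs = x1 + (j * w) // out_w
            xe = x1 + ((j + 1) * w) // out_w
            row.append(max(fmap[y][x] for y in range(ys, ye) for x in range(xs, xe)))
        pooled.append(row)
    return pooled


# Synthetic 8x8 feature map whose value encodes its position (y * 8 + x).
fmap = [[y * 8 + x for x in range(8)] for y in range(8)]
# RoI of height 5 (rows 3..7) and width 7 (columns 0..6).
pooled = roi_max_pool(fmap, (0, 3, 7, 8))
```

The rows are split 2 + 3 and the columns 3 + 4, and whatever the RoI size, the output is always 2x2.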

Multi-task loss

  • The multi-task loss is $L(p, u, t^u, v) = L_{cls}(p, u) + \lambda [u \ge 1] L_{loc}(t^u, v)$.
  • $p = (p_0, \dots, p_K)$: K+1 class scores (K object classes + background), computed by softmax.
  • $u$: ground-truth object class (0 for background, 1 or greater for an object)
  • $t^u$: predicted bounding-box coordinate offsets for object class u among the K classes
  • $v$: ground-truth bounding-box coordinates for the object class
  • $\lambda$: hyperparameter that balances the weight between the classification and bbox loss functions
  • $[u \ge 1]$: 1 when $u \ge 1$ (an object), otherwise 0 (background)
  • $L_{cls}(p, u) = -\log p_u$: log loss on the class scores
  • $L_{loc}(t^u, v)$: Smooth L1 loss on the bounding-box coordinates
  • Smooth L1 is used because L1 loss is less sensitive to outliers, while L2 loss is so sensitive that the learning rate becomes hard to tune. Smooth L1 keeps the linear, outlier-robust behavior of L1 for large errors and is quadratic for small ones.
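A minimal sketch of this loss for a single RoI is given below. It assumes `p` is the list of softmax probabilities, `u` the ground-truth class index with 0 meaning background, and `t_u` and `v` four-element tuples; the threshold of 1 in `smooth_l1` follows the usual definition, and the function names are illustrative.

```python
import math


def smooth_l1(x):
    """Quadratic for |x| < 1, linear otherwise."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def multi_task_loss(p, u, t_u, v, lam=1.0):
    """Classification log loss plus a box loss that is skipped for background."""
    l_cls = -math.log(p[u])
    if u >= 1:
        # Indicator [u >= 1]: only object RoIs contribute a localization loss.
        l_loc = sum(smooth_l1(ti - vi) for ti, vi in zip(t_u, v))
    else:
        l_loc = 0.0
    return l_cls + lam * l_loc


# Background RoI: the box term is ignored even though the offsets differ.
bg_loss = multi_task_loss([0.5, 0.5], 0, (0, 0, 0, 0), (1, 1, 1, 1))
# Object RoI: one coordinate is off by 0.5.
fg_loss = multi_task_loss([0.2, 0.8], 1, (0.5, 0, 0, 0), (0, 0, 0, 0))
```

Because the box term is multiplied by the indicator, background RoIs train only the classifier, while object RoIs train both branches from the same forward pass.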

Mini-batch sampling

  • During fine-tuning, each SGD mini-batch uses N=2 images and R=128 RoIs, so R/N = 64 RoIs are sampled per image.
  • 25% of the RoIs are taken from proposals whose IoU with a ground-truth (GT) box is at least 0.5 and are labeled positive (a foreground object class); the remaining 75% are taken from proposals whose IoU is in [0.1, 0.5) and are labeled negative (the background class).
  • I am not sure how sampling was done in RCNN and SPPnet.
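The sampling rule for one image can be sketched as follows, assuming `ious[i]` already holds the highest IoU between proposal i and any ground-truth box. The IoU thresholds (0.5 for foreground, 0.1 to 0.5 for background) follow the Fast R-CNN paper; the function name and the synthetic IoU values are hypothetical.

```python
import random


def sample_rois(ious, rois_per_image=64, fg_fraction=0.25, rng=random):
    """Pick foreground and background proposal indices for one image."""
    fg = [i for i, v in enumerate(ious) if v >= 0.5]
    bg = [i for i, v in enumerate(ious) if 0.1 <= v < 0.5]
    # Take up to 25% foreground, fill the rest with background.
    n_fg = min(int(round(rois_per_image * fg_fraction)), len(fg))
    n_bg = min(rois_per_image - n_fg, len(bg))
    return rng.sample(fg, n_fg), rng.sample(bg, n_bg)


# N = 2 images and R = 128 RoIs per batch give 64 RoIs per image.
N, R = 2, 128
# Synthetic IoUs: 30 foreground candidates, 100 background, 20 ignored (< 0.1).
ious = [0.7] * 30 + [0.3] * 100 + [0.05] * 20
fg_idx, bg_idx = sample_rois(ious, rois_per_image=R // N, rng=random.Random(0))
```

With enough candidates of each kind, this gives 16 foreground and 48 background RoIs per image.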

Backpropagation through RoI Pooling Layers

[image]

  • $x_i$: the i-th activation input to the RoI Pooling layer
  • $y_{rj}$: the j-th output for the r-th RoI
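For reference, the backward pass through RoI Pooling in the Fast R-CNN paper can be written as below, where $x_i$ is the i-th input activation and $y_{rj}$ the j-th output for the r-th RoI as defined above. The symbols $i^{*}(r, j)$ and $\mathcal{R}(r, j)$ are the paper's notation for the index of the input chosen by max pooling and the set of inputs in that pooling cell; this is a restatement for convenience, not a derivation.

```latex
\frac{\partial L}{\partial x_i}
  = \sum_{r} \sum_{j} \left[\, i = i^{*}(r, j) \,\right] \frac{\partial L}{\partial y_{rj}},
\qquad
i^{*}(r, j) = \operatorname*{argmax}_{i' \in \mathcal{R}(r, j)} x_{i'}
```

In words, the gradient of each pooled output flows back only to the input that was the maximum of its cell, and an input shared by several RoIs accumulates the gradients from all of them.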

(Backpropagation and Truncated SVD → to be written up later)

Contributions of Fast RCNN

  1. It achieves higher detection performance than RCNN and SPPnet.
  2. Training is single-stage and end-to-end, using a multi-task loss.
  3. Backpropagation during training reaches every network layer, so all layers are trained.
  4. No storage is needed for caching features.

Result

[image]

  • The results confirm that Fast RCNN outperforms the earlier methods.

References