Overview

๊ธฐ์กด ๊ฐ์ฒด ํƒ์ง€ ๋ชจ๋ธ

(figure: traditional object detection pipeline)

  1. An arbitrary set of bboxes is generated heuristically: region proposals, anchors, or window centers.
  2. Near-duplicate predictions: many heavily overlapping bboxes are produced, which NMS then merges into a single bbox.
  3. The object captured by each surviving bbox is classified.
  4. In other words, this approach does not locate objects directly; it detects them indirectly by generating many candidate bboxes.
  • As described above, traditional object detection models relied on many hand-crafted components and were not fully end-to-end.
  • DETR replaces these components, building a fully end-to-end object detection model from a CNN and a Transformer.

DETR

(figures: DETR pipeline and architecture)

  • DETR ์€ CNN ๊ณผ Transformer ๋ฅผ ํ†ตํ•ด ๊ธฐ์กด ๊ฐ์ฒด ํƒ์ง€๊ธฐ๊ฐ€ ๊ฐ€์ง€๊ณ  ์žˆ๋˜ hand-crafted ๊ตฌ์„ฑ๋“ค์„ ์—†์• ๊ณ  1 ๋Œ€ 1 ๋Œ€์‘์„ ํ†ตํ•ด ๊ฐ์ฒด๋ฅผ ํƒ์ง€ํ•˜๋Š” ๊ตฌ์กฐ๋ฅผ ์ด๋ฃฌ๋‹ค.
  • ๋‹ค์Œ ๊ตฌ์กฐ๋ฅผ ํ†ตํ•ด ๊ธฐ์กด ๊ฐ์ฒด ํƒ์ง€ ๋ชจ๋ธ๊ณผ ๊ฒฝ์Ÿํ• ๋งŒํ•œ ๋›ฐ์–ด๋‚œ ์„ฑ๋Šฅ์„ ๊ฐ€์ง„๋‹ค.

Problem of DETR and Transformer

  • ํ•˜์ง€๋งŒ, DETR ์€ ๋‹ค์Œ์˜ ๋ฌธ์ œ์ ์„ ๊ฐ–๋Š”๋‹ค.

1. DETR ์€ ๊ธฐ์กด ๊ฐ์ฒด ํƒ์ง€๊ธฐ๋ณด๋‹ค ์ˆ˜๋ ดํ•˜๊ธฐ ์œ„ํ•ด ๋” ๋งŽ์€ ํ›ˆ๋ จ ์‹œ๊ฐ„์„ ํ•„์š”๋กœ ํ•œ๋‹ค. ์ด๋Š” Transformer ๋กœ ์ธํ•ด ๋ฐœ์ƒํ•˜๋Š”๋ฐ, Transformer ๊ฐ€ ๋” ๊ธด ํ›ˆ๋ จ ์‹œ๊ฐ„์„ ํ•„์š”๋กœ ํ•˜๋Š” ์ด์œ ๋Š” 2 ๊ฐ€์ง€๊ฐ€ ์žˆ๋‹ค.

  • ๋‹ค์Œ์€ Transformer ์—์„œ ์‚ฌ์šฉ๋˜๋Š” multi-head attention ์˜ ์‹์ด๋‹ค.
  • : Query element
  • : Key element
  • : Input feature map
  • , : Learnable weight
  • : Attention weight
  • : Number of query element
  • : Number of key element
  • : Learnable weight for query
  • : Learnable weight for key
  • DETR ์€ Transformer, self-attention ์„ ์‚ฌ์šฉํ•˜์—ฌ ์—ฐ์‚ฐ์„ ํ•œ๋‹ค. ์ฟผ๋ฆฌ, ํ‚ค์˜ ๊ฐ’์„ ์ฃผ๋กœ ํ”ฝ์…€๋กœ ์„ค์ •๋œ๋‹ค. ์ ์ ˆํ•œ ํŒŒ๋ผ๋ฏธํ„ฐ ์ดˆ๊ธฐํ™”๋ฅผ ํ•œ ๊ฒฝ์šฐ ์ฟผ๋ฆฌ, ํ‚ค๋ฅผ ๋งŒ๋“œ๋Š” , ๋Š” ํ‰๊ท ์ด 0, ๋ถ„์‚ฐ์ด 1 ์ธ ๋ถ„ํฌ๋ฅผ ๊ฐ€์ง€๊ณ , ํ”ฝ์…€์˜ ์ˆ˜๊ฐ€ ๋งŽ์„ ์ˆ˜๋ก attention weight ๊ฐ€ 1/ ์— ๊ฐ€๊นŒ์›Œ์ง€๊ฒŒ ๋œ๋‹ค. ์ฆ‰, backpropagation ์„ ํ†ตํ•ด gradient ๋ฅผ ๊ตฌํ•˜๋”๋ผ๋„ ๋งค์šฐ ์ž‘์€ ๊ฐ’์ด ๋‚˜์™€ ์ดˆ๊ธฐ ๊ฐ€์ค‘์น˜ ์กฐ์ •์ด ๋А๋ฆฌ๊ฒŒ ๋œ๋‹ค. ๊ฒฐ๊ณผ์ ์œผ๋กœ๋Š” ๋” ๋งŽ์€ ํ›ˆ๋ จ ์‹œ๊ฐ„์„ ํ•„์š”๋กœ ํ•œ๋‹ค.
  • The memory and computational complexity of multi-head attention grow rapidly with the number of queries and keys. For images, every pixel is both a query and a key, so the query and key counts are usually large; the resulting complexity is very high and convergence slows down.
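The first point, attention weights flattening toward $1/N_k$, can be checked numerically; this sketch uses random unit-variance projections standing in for freshly initialized $U_m$, $V_m$:

```python
import numpy as np

rng = np.random.default_rng(0)

def max_deviation_from_uniform(n_keys, dim=64):
    """Softmax attention of one random query against n_keys random keys,
    mimicking freshly initialized projections; returns how far the largest
    weight is from the uniform value 1 / n_keys."""
    q = rng.standard_normal(dim)
    k = rng.standard_normal((n_keys, dim))
    logits = k @ q / np.sqrt(dim)           # scaled dot-product scores
    w = np.exp(logits - logits.max())
    w /= w.sum()                            # softmax attention weights
    return np.abs(w - 1.0 / n_keys).max()

few, many = max_deviation_from_uniform(10), max_deviation_from_uniform(10000)
# With many keys (pixels), every weight sits close to 1/N_k, so each key's
# gradient contribution is tiny and early training moves slowly.
```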

2. DETR ์€ ์ž‘์€ ๊ฐ์ฒด๋ฅผ ํƒ์ง€ํ•˜๋Š” ๊ฒƒ์— ์ƒ๋Œ€์ ์œผ๋กœ ๋” ๋‚ฎ์€ ์„ฑ๋Šฅ์„ ๋ณด์ธ๋‹ค. ๋‹ค๋ฅธ ๊ธฐ์กด ๊ฐ์ฒด ํƒ์ง€๊ธฐ์˜ ๊ฒฝ์šฐ multi-scale ์˜ ํŠน์ง•๋งต์„ ์‚ฌ์šฉํ•˜๋Š” ๋“ฑ์˜ ๋‹ค์–‘ํ•œ ๋ฐฉ์‹์„ ํ†ตํ•ด์„œ ์—ฌ๋Ÿฌ ํฌ๊ธฐ์˜ ๊ฐ์ฒด๋ฅผ ํƒ์ง€ํ•œ๋‹ค. DETR ๋„ ์ž‘์€ ํฌ๊ธฐ์˜ ๊ฐ์ฒด๋ฅผ ํƒ์ง€ํ•˜๋ ค๋ฉด CNN ์„ ํ†ตํ•ด์„œ ์‚ฐ์ถœ๋˜๋Š” ํŠน์ง•๋งต์˜ ํฌ๊ธฐ๋ฅผ ํ‚ค์šฐ๋Š” ๋ฐฉ์‹์ด ์žˆ๋‹ค. ํ•˜์ง€๋งŒ, DETR ์€ ์—ฐ์‚ฐ๋Ÿ‰์ด quadratic ํ•˜๊ฒŒ ์ฆ๊ฐ€ํ•˜๊ธฐ ๋•Œ๋ฌธ์— ๋‹ค์–‘ํ•œ ํฌ๊ธฐ์˜ ํŠน์ง•๋งต์„ ์‚ฌ์šฉํ•˜์ง€ ๋ชปํ•˜์—ฌ ์ž‘์€ ๊ฐ์ฒด๋ฅผ ํƒ์ง€ํ•˜๋Š” ๊ฒƒ์— ์ƒ๋Œ€์ ์œผ๋กœ ๋” ๋‚ฎ์€ ์„ฑ๋Šฅ์„ ๋ณด์ธ๋‹ค.

Deformable DETR ์˜ ๊ฒฝ์šฐ ์œ„์˜ ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด ๊ณ ์•ˆ๋˜์—ˆ๋‹ค.

Background

Deformable Convolution

  • Deformable DETR ์€ DETR ์˜ ์—ฐ์‚ฐ๋Ÿ‰ ๋ฌธ์ œ์ ์„ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด deformable convolution ์˜ ์•„์ด๋””์–ด๋ฅผ ์‚ฌ์šฉํ–ˆ๋‹ค.

(figures: standard vs. deformable convolution sampling locations)

  • Deformable convolution ์€ ์œ„ ๊ทธ๋ฆผ๊ณผ ๊ฐ™๋‹ค.
  • ์ผ๋ฐ˜์ ์ธ convolution ์˜ ๊ฒฝ์šฐ ๊ณ ์ •๋œ ํ•„ํ„ฐ๋ฅผ ์ ์šฉํ•ด ํ•˜๋‚˜์˜ ํŠน์ง•๋งต์„ ์‚ฐ์ถœํ•œ๋‹ค.
  • Deformable convolution ์€ feature ๋ฅผ ํŠน์ • layer ์— ํƒœ์›Œ sampling point ๋ฅผ ์˜ˆ์ธกํ•˜๊ณ , ํ•ด๋‹น point ๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ convolution ์„ ์ˆ˜ํ–‰ํ•œ๋‹ค.
  • ์ด๋ฅผ ํ†ตํ•ด ํŠน์ • ์œ„์น˜์˜ ๊ฐ์ฒด์— ๋งž์ถฐ, sampling ์ด ์ด๋ฃจ์–ด์ง€๊ณ  ๋ณด๋‹ค ์œ ์—ฐํ•˜๊ฒŒ ํŠน์ง•์„ ์ถ”์ถœํ•  ์ˆ˜ ์žˆ๋‹ค.

Deformable DETR

Architecture

(figure: Deformable DETR architecture)

  • DETR ์˜ ๊ตฌ์กฐ๋Š” ์œ„ ๊ทธ๋ฆผ๊ณผ ๊ฐ™๋‹ค.
  • Multi-scale ์˜ ํŠน์ง•๋งต์„ Deformable Attention ์„ ํ†ตํ•ด์„œ reference point ๋ฅผ ๊ธฐ์ค€์œผ๋กœ ํ‚ค๋ฅผ sampling ํ•˜์—ฌ ๋„˜๊ฒจ์ค€๋‹ค. ์ž์„ธํ•œ ๊ฒƒ์€ ์•„๋ž˜๋ฅผ ํ†ตํ•ด ํ™•์ธํ•ด๋ณด์ž.

Constructing Multi-Scale Feature Maps

(figure: constructing multi-scale feature maps from the backbone)

  • ์ž…๋ ฅ ํŠน์ง•๋งต์— ๋Œ€ํ•œ multi-scale ํŠน์ง•๋งต์€ ์œ„ ๊ทธ๋ฆผ๊ณผ ๊ฐ™์ด CNN ์˜ ๊ฐ ์ธต์—์„œ 1x1 Conv ๋ฅผ ์‚ฌ์šฉํ•ด์„œ ๋ฝ‘์•„๋‚ด๊ณ  ๋งจ ๋งˆ์ง€๋ง‰๋งŒ 3x3 Conv ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์ถ”์ถœํ•œ๋‹ค.

Deformable Attention Module

(figure: Deformable Attention Module)

  • Transformer ๋Š” ์ด๋ฏธ์ง€ ํŠน์ง•๋งต์˜ ๊ฐ€๋Šฅํ•œ ๋ชจ๋“  ๊ณต๊ฐ„์„ ์‚ดํŽด๋ณด๊ธฐ ๋•Œ๋ฌธ์— ์—ฐ์‚ฐ๋Ÿ‰์ด ํฐ ๋ฌธ์ œ๋ฅผ ๊ฐ–๋Š”๋‹ค.
  • ์ด๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด Deformable Attention Module ์„ ์ œ์‹œํ–ˆ๋‹ค.
  • Deformable Attention Module ์€ ๊ธฐ์ค€์  ์ฃผ๋ณ€ ์ž‘์€ key sampling point ๋“ค์—๋งŒ ์ง‘์ค‘ํ•จ์œผ๋กœ์จ ํŠน์ • ์œ„์น˜๋งŒ ํ™•์ธํ•˜๋Š” ๋ฐฉ์‹์„ ํ†ตํ•ด์„œ ์—ฐ์‚ฐ๋Ÿ‰์„ ์ค„์ธ๋‹ค.
    $\mathrm{DeformAttn}(z_q, p_q, x) = \sum_{m=1}^{M} W_m \Big[ \sum_{k=1}^{K} A_{mqk} \cdot W'_m \, x(p_q + \Delta p_{mqk}) \Big]$

  • $z_q$: Query element
  • $p_q$: 2d reference point
  • $x$: Input feature map
  • $W_m$, $W'_m$: Learnable weights
  • $A_{mqk}$: Attention weight (normalized so that $\sum_{k=1}^{K} A_{mqk} = 1$)
  • $\Delta p_{mqk}$: Sampling offset
  • ์ˆ˜์‹ ๋ถ„์„ ๊ทธ๋ฆผ ๊ทธ๋ ค ๋„ฃ์„ ์˜ˆ์ •

Multi-scale Deformable Attention Module

    $\mathrm{MSDeformAttn}(z_q, \hat{p}_q, \{x^l\}_{l=1}^{L}) = \sum_{m=1}^{M} W_m \Big[ \sum_{l=1}^{L} \sum_{k=1}^{K} A_{mlqk} \cdot W'_m \, x^l\big(\phi_l(\hat{p}_q) + \Delta p_{mlqk}\big) \Big]$

  • $z_q$: Query element
  • $\hat{p}_q$: Normalized 2d reference point
  • $x^l$: Input feature map of the $l$-th level
  • $W_m$, $W'_m$: Learnable weights
  • $A_{mlqk}$: Attention weight (normalized so that $\sum_{l=1}^{L}\sum_{k=1}^{K} A_{mlqk} = 1$)
  • $\Delta p_{mlqk}$: Sampling offset
  • $\phi_l$: Re-scales the normalized coordinates to the input feature map of the $l$-th level
  • The Multi-scale Deformable Attention Module can be thought of as the Deformable Attention Module with levels added: deformable attention is applied to each feature-map level.
  • Because the feature maps differ in size, MSDeformAttn must keep positional information consistent across levels, so the reference point is normalized before being passed in.
  • $\phi_l$ then scales the normalized reference point back up to match the scale level being processed.

Details

(figures: encoder and decoder details)

  • ์ธ์ฝ”๋”์˜ ๊ฒฝ์šฐ ์ฟผ๋ฆฌ๋Š” multi-scale ํŠน์ง•๋งต์˜ ํ”ฝ์…€์ด๋‹ค. ํ•˜์ง€๋งŒ ๋””์ฝ”๋”์˜ ๊ฒฝ์šฐ object query ๊ฐ€ ์ฟผ๋ฆฌ๊ฐ€ ๋œ๋‹ค.

Result

(figures: convergence curves and COCO benchmark results)

  • Deformable DETR ์„ ์ ์šฉํ–ˆ์„ ๋•Œ DETR ์— ๋น„ํ•ด ๋™์ผ ์„ฑ๋Šฅ์—์„œ ํ›จ์”ฌ ๋น ๋ฅธ ์†๋„๋กœ ํ•™์Šต๋œ๋‹ค๋Š” ๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค.
  • ๋˜ํ•œ, ์ž‘์€ ๋ฌผ์ฒด ํƒ์ง€๊ฐ€ ์•ฝํ•œ DETR ์— ๋น„ํ•ด Deformable DETR ์€ Faster R-CNN ๊ณผ์˜ ๋น„๊ต์—์„œ๋„ ๊ฒฝ์Ÿ๋ ฅ์„ ์–ป์„ ๋งŒํผ์˜ ์„ฑ๋Šฅ ํ–ฅ์ƒ์„ ๋ณด์˜€๋‹ค.

(table: ablation study)

  • Using multi-scale inputs instead of a single-scale input improves both overall detection performance and small-object detection.
  • Using MS deformable attention brings a further improvement.
  • The model with FPN and the model without FPN show exactly the same performance.
  • The authors explain that because cross-attention already exchanges features across scales, adding an FPN cannot bring additional performance gains.

Conclusion

Deformable DETR ์€ end-to-end ๊ฐ์ฒด ํƒ์ง€๊ธฐ๋กœ, DETR ์˜ ๋А๋ฆฐ ์ˆ˜๋ ด, ์ž‘์€ ๊ฐ์ฒด ํƒ์ง€์— ์•ฝํ•˜๋‹ค๋Š” ๋‹จ์ ์„ multi-scale deformable attention module ์„ ํ†ตํ•ด์„œ ๋ณด์™„ํ–ˆ๋‹ค๋Š” ์ ์—์„œ ์˜์˜๋ฅผ ๊ฐ–๋Š”๋‹ค.

์ฐธ๊ณ ๋ฌธํ—Œ