
๐Ÿ“Understanding YOLO - YOLO ์•Œ์•„๋ณด๊ธฐ?!!

๐Ÿ“Understanding YOLO - YOLO ์•Œ์•„๋ณด๊ธฐ?!!

🧠 Understanding YOLO in One Page

๐Ÿ” Detecting objects lightning-fast with a single unified model!

Just as classic literature stays essential long after trending new releases fade,
today we're diving into the classic paper of object detection: YOLO!

[image: manhwa]

Paper: You Only Look Once: Unified, Real-Time Object Detection
Conference: CVPR 2016 (Joseph Redmon et al.)
🔗 Presentation Slides


💡 YOLO Highlights

  1. It's fast.
    • Runs at 45 FPS, while even Faster R-CNN manages only 7 FPS.
  2. It sees the whole image.
    • Unlike sliding-window methods, YOLO looks at the entire image at once, reducing background false positives.
  3. It works across domains.
    • It can be applied to artwork, drawings, cartoons, etc.

🧠 Background of YOLO's Emergence

While humans can understand a scene at a glance, previous models could not.
Traditional methods looped over many bounding boxes and performed complex post-processing!

  • Complex pipelines: region proposals → classification → post-processing, making joint optimization difficult.
  • Slow inference: each component runs independently; R-CNN-based models in particular are unsuitable for real-time use.
  • Inefficient sliding windows: a classifier runs over many positions and scales → computationally expensive.
  • No end-to-end training: the different parts of the pipeline are trained separately.

๐Ÿ” Traditional Methods: DPM and R-CNN

DPM

  • DPM (Deformable Parts Models): Breaks an object into a root filter and several part filters.
  • Key components:
    • Root filter: captures overall object shape.
    • Part filters: detect key parts (e.g., eyes, nose for face).
    • Deformation model: handles flexible placement of parts.
  • Used HoG features and sliding window detection.
  • Strengths: robust to occlusion and deformation.
  • Weaknesses: computationally heavy and hard to use in real time.

R-CNN

  • R-CNN (2014): Early deep learning-based object detector.
  • Steps:
    1. Generate ~2,000 region proposals via Selective Search.
    2. Extract features with a CNN per region.
    3. Classify using SVM.
    4. Refine with bounding box regression.
  • Pros: Higher accuracy than traditional methods.
  • Cons: Very slow, complex multi-stage training, not end-to-end.
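The four stages above are exactly what makes R-CNN slow and hard to optimize: one CNN pass per region, and separately trained components. A schematic sketch of the stage boundaries (every function here is a hypothetical stub for illustration, not real R-CNN code):

```python
def selective_search(image):            # stage 1: ~2,000 class-agnostic proposals
    return [(0, 0, 50, 80)] * 2000      # stub: (x, y, w, h) regions

def cnn_features(image, region):        # stage 2: one CNN forward pass PER region
    return [0.0] * 4096                 # stub feature vector

def svm_classify(features):             # stage 3: per-class SVMs on the features
    return "dog", 0.9                   # stub (label, score)

def bbox_regress(features, region):     # stage 4: refine the box coordinates
    return region                       # stub refinement

def rcnn_detect(image):
    detections = []
    for region in selective_search(image):   # ~2,000 CNN passes per image:
        f = cnn_features(image, region)      # this loop is why R-CNN is slow
        label, score = svm_classify(f)
        detections.append((label, score, bbox_regress(f, region)))
    return detections
```

Because the CNN, the SVMs, and the box regressors are trained separately, no gradient flows through the whole pipeline, which is the "not end-to-end" complaint YOLO addresses.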

📘 Dataset: PASCAL VOC Detection

  • Background: Developed for the PASCAL Visual Object Classes Challenge (2005–2012).
  • Objective: Benchmark for object detection with 20 labeled classes like person, dog, car, etc.
  • Each image may contain multiple objects.
  • Each object has bounding box (x, y, w, h) + label.
  • Evaluation metric: mAP (mean Average Precision)
    • AP is averaged over the 20 classes.
    • A detection counts as correct when IoU ≥ 0.5 (the VOC criterion).

🔗 PASCAL VOC Website
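The IoU ≥ 0.5 matching criterion is simple to compute directly. A minimal sketch (the `iou` helper and the corner-format `(x_min, y_min, x_max, y_max)` boxes are my own illustration, not code from the challenge kit):

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)   # overlap area (0 if disjoint)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # ≈ 0.333: overlaps, but below the 0.5 VOC threshold
```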


๐Ÿ–‡๏ธ YOLO Model Architecture

[image: archi]

  • Inspired by GoogLeNet for image classification.
  • 24 convolutional layers + 2 fully connected layers.
  • Input: 448×448 image → Output: 7×7×30 tensor.
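The 7×7×30 output decomposes as S×S×(B·5 + C): each of the 49 grid cells predicts B = 2 boxes of (x, y, w, h, confidence) plus 20 conditional class probabilities, giving 2·5 + 20 = 30 values per cell. A small sketch of that layout (the `cell_predictions` helper is illustrative, not from the paper):

```python
S, B, C = 7, 2, 20               # grid size, boxes per cell, classes (PASCAL VOC)
depth = B * 5 + C                # 2 boxes × (x, y, w, h, conf) + 20 class probs = 30

# The final FC layer emits S*S*depth = 1470 values for one image.
flat = [0.0] * (S * S * depth)

def cell_predictions(row, col):
    """Slice one grid cell's 30 values into its boxes and class probabilities."""
    start = (row * S + col) * depth
    cell = flat[start:start + depth]
    boxes = [cell[b * 5:(b + 1) * 5] for b in range(B)]  # two (x, y, w, h, conf)
    class_probs = cell[B * 5:]                           # 20 conditional class probs
    return boxes, class_probs

boxes, class_probs = cell_predictions(3, 3)
print(depth, len(flat), len(boxes), len(class_probs))  # 30 1470 2 20
```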

📌 YOLOv1 Layer Summary

| Block | Layer | Filters × Size / Stride | Output Size (448×448 input) |
|---|---|---|---|
| Input | Image | - | 448×448×3 |
| Conv 1 | Conv + LeakyReLU | 64 × 7×7 / 2 | 224×224×64 |
| MaxPool 1 | MaxPooling | 2×2 / 2 | 112×112×64 |
| Conv 2 | Conv + LeakyReLU | 192 × 3×3 / 1 | 112×112×192 |
| MaxPool 2 | MaxPooling | 2×2 / 2 | 56×56×192 |
| Conv 3–4 | Conv + LeakyReLU | 128×1×1, 256×3×3 | 56×56×256 |
| Conv 5–6 | Conv + LeakyReLU | 256×1×1, 512×3×3 | 56×56×512 |
| MaxPool 3 | MaxPooling | 2×2 / 2 | 28×28×512 |
| Conv 7–12 | Repeated Conv Blocks (4×) | 256×1×1, 512×3×3 | 28×28×512 |
| Conv 13–14 | Conv | 512×1×1, 1024×3×3 | 28×28×1024 |
| MaxPool 4 | MaxPooling | 2×2 / 2 | 14×14×1024 |
| Conv 15–20 | Repeated Conv Blocks (2×) | 512×1×1, 1024×3×3 | 14×14×1024 |
| Conv 21–22 | Conv | 1024×3×3, 1024×3×3 | 7×7×1024 |
| FC 1 | Fully Connected | 4096 | 1×1×4096 |
| FC 2 | Fully Connected (Detection Output) | 7×7×30 (S=7, B=2, C=20) | 7×7×30 |
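The spatial sizes in the table follow standard conv/pool arithmetic. A quick sanity check on the first rows (the padding values are my assumption, since the table does not list them: 3 for the 7×7 conv, 1 for 3×3 convs, 0 for the 2×2 pools):

```python
def conv_out(size, kernel, stride, padding):
    """Spatial output size of a conv/pool layer: floor((size + 2*padding - kernel) / stride) + 1."""
    return (size + 2 * padding - kernel) // stride + 1

s = conv_out(448, 7, 2, 3)   # Conv 1    -> 224 (stride 2 halves the resolution)
s = conv_out(s, 2, 2, 0)     # MaxPool 1 -> 112
s = conv_out(s, 3, 1, 1)     # Conv 2    -> 112 (stride 1 preserves size)
s = conv_out(s, 2, 2, 0)     # MaxPool 2 -> 56
print(s)  # 56
```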

🔄 YOLO Training

🎯 Loss Function (Sum-Squared Error)

L = λ_coord ∑(obj) [(x - x̂)^2 + (y - ŷ)^2]
  + λ_coord ∑(obj) [(√w - √ŵ)^2 + (√h - √ĥ)^2]
  + ∑(obj) (C - Ĉ)^2
  + λ_noobj ∑(noobj) (C - Ĉ)^2
  + ∑(obj) ∑_class (p(c) - p̂(c))^2

🎯 Loss Function and Training Parameters

  • Uses sum-squared error by default, which is simple to implement.
  • However:
    • It weights classification and localization errors equally, which is not ideal.
    • Most grid cells contain no object, so confidence scores are pushed toward zero, causing unstable gradients.
  • Solution: adjust the loss weights.
    • λ_coord = 5: increase the weight on bounding-box coordinate loss.
    • λ_noobj = 0.5: decrease the weight on confidence loss for background cells.
    • Also, to reduce sensitivity to large boxes, the model predicts the square roots of width and height (√w, √h).
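The weighted terms above can be sketched for a single grid cell as follows. This is a toy illustration under the stated λ values, not the paper's full implementation (which also sums over all cells and assigns the "responsible" box by highest IoU with the ground truth):

```python
import math

LAMBDA_COORD, LAMBDA_NOOBJ = 5.0, 0.5

def cell_loss(pred, target, has_object):
    """Sum-squared-error loss for the responsible box of one grid cell.

    pred/target are dicts with keys x, y, w, h, conf, and a 'classes' list.
    """
    if not has_object:
        # Background cell: only the confidence term, down-weighted by λ_noobj.
        return LAMBDA_NOOBJ * (pred["conf"] - target["conf"]) ** 2
    loss = LAMBDA_COORD * ((pred["x"] - target["x"]) ** 2 + (pred["y"] - target["y"]) ** 2)
    # Square-rooted w/h damp the penalty on large boxes relative to small ones.
    loss += LAMBDA_COORD * ((math.sqrt(pred["w"]) - math.sqrt(target["w"])) ** 2 +
                            (math.sqrt(pred["h"]) - math.sqrt(target["h"])) ** 2)
    loss += (pred["conf"] - target["conf"]) ** 2          # object confidence term
    loss += sum((p - t) ** 2 for p, t in zip(pred["classes"], target["classes"]))
    return loss

# A prediction that is perfect except for confidence (0.8 vs 1.0):
box = {"x": 0.5, "y": 0.5, "w": 0.25, "h": 0.25, "conf": 0.8, "classes": [1.0] + [0.0] * 19}
truth = dict(box, conf=1.0)
print(cell_loss(box, truth, has_object=True))  # only the confidence term contributes (≈ 0.04)
```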

๐Ÿ‹๏ธ Training Configuration

  • Epochs: 135
  • Dataset: VOC 2007 + 2012 train/val sets
  • Batch Size: 64
  • Momentum: 0.9
  • Weight Decay: 0.0005
  • Learning Rate Schedule:
    • Start at 1e-3 and gradually increase to 1e-2.
    • Hold at 1e-2 for 75 epochs → 1e-3 for 30 epochs → 1e-4 for 30 epochs.
  • Dropout (rate = 0.5) after the first FC layer to prevent overfitting
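The schedule above can be written as a simple piecewise function. The 5-epoch linear warm-up length is my assumption; the paper only says the rate is raised slowly from 1e-3 to 1e-2 at the start:

```python
def learning_rate(epoch, warmup_epochs=5):
    """Learning rate for a given epoch (0-indexed), per the schedule above."""
    if epoch < warmup_epochs:
        # Assumed linear warm-up from 1e-3 toward 1e-2.
        return 1e-3 + (1e-2 - 1e-3) * epoch / warmup_epochs
    if epoch < 75:
        return 1e-2   # main phase
    if epoch < 105:
        return 1e-3   # first decay
    return 1e-4       # final 30 epochs (135 total)

print([learning_rate(e) for e in (0, 10, 80, 120)])  # [0.001, 0.01, 0.001, 0.0001]
```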

🧩 YOLO Model Evaluation Results

  1. Speed
    • Significantly faster than previous detectors!

    | Model | mAP (%) | FPS | Inference Time (per image) |
    |---|---|---|---|
    | DPM v5 | 33.7 | 0.07 | 14 s/img |
    | R-CNN | 66.0 | 0.05 | 20 s/img |
    | Fast R-CNN | 70.0 | 0.5 | 2 s/img |
    | Faster R-CNN | 73.2 | 7 | 140 ms/img |
    | YOLO | 63.4 | 45 | 22 ms/img |
  2. Global Context Reduces Background Errors
    • Because YOLO sees the entire image, it makes fewer background mistakes!
    • So, combining YOLO with Fast R-CNN boosts performance significantly.
    • Since YOLO is so fast, it adds minimal overhead.

[image: yolo_rcnn]

  3. Cross-domain Applicability
    • Can be used for artworks, illustrations, cartoons, and more!

[image: domain]


🧠 Final Thoughts

From YOLOv1 to v2, v3, v4… and all the way to YOLO-World,
this model has become a classic in object detection, cited over 60,000 times!

๐Ÿ“ What I learned while revisiting this work:

  • Clearly defining an open problem, like the speed bottleneck in detection,
    is often the first step toward innovation.
  • Once a clear limitation is identified, new ideas like YOLO can emerge naturally to solve it.

โ— Without clearly identifying open problems,
we may end up overanalyzing existing methods and merely making incremental tweaks,
rather than aiming for truly transformative improvements.




This post is licensed under CC BY 4.0 by the author.