
๐Ÿ“Understanding YOLO - YOLO ์•Œ์•„๋ณด๊ธฐ?!!

๐Ÿ“Understanding YOLO - YOLO ์•Œ์•„๋ณด๊ธฐ?!!

🧠 Understanding YOLO in One Page

๐Ÿ” Detecting objects lightning-fast with a single unified model!

Just as classic literature stays essential long after trending new releases fade,
today we're diving into the classic paper of object detection: YOLO!

[image: manhwa]

Paper: You Only Look Once: Unified, Real-Time Object Detection
Conference: CVPR 2016 (Joseph Redmon et al.)
🔗 Presentation Slides


💡 YOLO Highlights

  1. It's fast.
    • Runs at 45 FPS, while even Faster R-CNN manages only 7 FPS.
  2. It sees the whole image.
    • Unlike sliding-window methods, YOLO looks at the entire image at once, reducing background false positives.
  3. It works across domains.
    • It can be applied to artwork, drawings, cartoons, etc.

🧠 Background of YOLO's Emergence

While humans can understand a scene at a glance, previous models could not.
Traditional methods looped over many bounding boxes and performed complex post-processing!

  • Complex pipelines: region proposals → classification → post-processing, making joint optimization difficult.
  • Slow inference: each component runs independently; R-CNN-based models in particular are unsuitable for real-time use.
  • Inefficient sliding windows: a classifier runs over many positions and scales → computationally expensive.
  • No end-to-end training: the different parts of the pipeline are trained separately.

๐Ÿ” Traditional Methods: DPM and R-CNN

DPM

  • DPM (Deformable Parts Models): Breaks an object into a root filter and several part filters.
  • Key components:
    • Root filter: captures overall object shape.
    • Part filters: detect key parts (e.g., eyes, nose for face).
    • Deformation model: handles flexible placement of parts.
  • Used HoG features and sliding window detection.
  • Strengths: robust to occlusion and deformation.
  • Weaknesses: computationally heavy and hard to use in real time.

R-CNN

  • R-CNN (2014): Early deep learning-based object detector.
  • Steps:
    1. Generate ~2,000 region proposals via Selective Search.
    2. Extract features with a CNN per region.
    3. Classify using SVM.
    4. Refine with bounding box regression.
  • Pros: Higher accuracy than traditional methods.
  • Cons: Very slow, complex multi-stage training, not end-to-end.
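The four stages above are exactly what makes R-CNN slow and hard to optimize: one CNN pass per region, and separately trained components. A schematic sketch of the stage boundaries (every function here is a hypothetical stub for illustration, not real R-CNN code):

```python
def selective_search(image):            # stage 1: ~2,000 class-agnostic proposals
    return [(0, 0, 50, 80)] * 2000      # stub: (x, y, w, h) regions

def cnn_features(image, region):        # stage 2: one CNN forward pass PER region
    return [0.0] * 4096                 # stub feature vector

def svm_classify(features):             # stage 3: per-class SVMs on the features
    return "dog", 0.9                   # stub (label, score)

def bbox_regress(features, region):     # stage 4: refine the box coordinates
    return region                       # stub refinement

def rcnn_detect(image):
    detections = []
    for region in selective_search(image):   # ~2,000 CNN passes per image:
        f = cnn_features(image, region)      # this loop is why R-CNN is slow
        label, score = svm_classify(f)
        detections.append((label, score, bbox_regress(f, region)))
    return detections
```

Because the CNN, the SVMs, and the box regressors are trained separately, no gradient flows through the whole pipeline, which is the "not end-to-end" complaint YOLO addresses.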

📘 Dataset: PASCAL VOC Detection

  • Background: Developed for the PASCAL Visual Object Classes Challenge (2005–2012).
  • Objective: Benchmark for object detection with 20 labeled classes like person, dog, car, etc.
  • Each image may contain multiple objects.
  • Each object has bounding box (x, y, w, h) + label.
  • Evaluation metric: mAP (mean Average Precision)
    • AP is averaged over the 20 classes.
    • A detection counts as correct when IoU ≥ 0.5 (the VOC criterion).

🔗 PASCAL VOC Website
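The IoU ≥ 0.5 matching criterion is simple to compute directly. A minimal sketch (the `iou` helper and the corner-format `(x_min, y_min, x_max, y_max)` boxes are my own illustration, not code from the challenge kit):

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)   # overlap area (0 if disjoint)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # ≈ 0.333: overlaps, but below the 0.5 VOC threshold
```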


๐Ÿ–‡๏ธ YOLO Model Architecture

[image: archi]

  • Inspired by GoogLeNet for image classification.
  • 24 convolutional layers + 2 fully connected layers.
  • Input: 448×448 image → Output: 7×7×30 tensor.
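The 7×7×30 output decomposes as S×S×(B·5 + C): each of the 49 grid cells predicts B = 2 boxes of (x, y, w, h, confidence) plus 20 conditional class probabilities, giving 2·5 + 20 = 30 values per cell. A small sketch of that layout (the `cell_predictions` helper is illustrative, not from the paper):

```python
S, B, C = 7, 2, 20               # grid size, boxes per cell, classes (PASCAL VOC)
depth = B * 5 + C                # 2 boxes × (x, y, w, h, conf) + 20 class probs = 30

# The final FC layer emits S*S*depth = 1470 values for one image.
flat = [0.0] * (S * S * depth)

def cell_predictions(row, col):
    """Slice one grid cell's 30 values into its boxes and class probabilities."""
    start = (row * S + col) * depth
    cell = flat[start:start + depth]
    boxes = [cell[b * 5:(b + 1) * 5] for b in range(B)]  # two (x, y, w, h, conf)
    class_probs = cell[B * 5:]                           # 20 conditional class probs
    return boxes, class_probs

boxes, class_probs = cell_predictions(3, 3)
print(depth, len(flat), len(boxes), len(class_probs))  # 30 1470 2 20
```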

📌 YOLOv1 Layer Summary

| Block | Layer | Filters × Size / Stride | Output Size (448×448 input) |
|---|---|---|---|
| Input | Image | - | 448×448×3 |
| Conv 1 | Conv + LeakyReLU | 64 × 7×7 / 2 | 224×224×64 |
| MaxPool 1 | MaxPooling | 2×2 / 2 | 112×112×64 |
| Conv 2 | Conv + LeakyReLU | 192 × 3×3 / 1 | 112×112×192 |
| MaxPool 2 | MaxPooling | 2×2 / 2 | 56×56×192 |
| Conv 3–4 | Conv + LeakyReLU | 128×1×1, 256×3×3 | 56×56×256 |
| Conv 5–6 | Conv + LeakyReLU | 256×1×1, 512×3×3 | 56×56×512 |
| MaxPool 3 | MaxPooling | 2×2 / 2 | 28×28×512 |
| Conv 7–12 | Repeated Conv Blocks (4×) | 256×1×1, 512×3×3 | 28×28×512 |
| Conv 13–14 | Conv | 512×1×1, 1024×3×3 | 28×28×1024 |
| MaxPool 4 | MaxPooling | 2×2 / 2 | 14×14×1024 |
| Conv 15–20 | Repeated Conv Blocks (2×) | 512×1×1, 1024×3×3 | 14×14×1024 |
| Conv 21–22 | Conv | 1024×3×3, 1024×3×3 | 7×7×1024 |
| FC 1 | Fully Connected | 4096 | 1×1×4096 |
| FC 2 | Fully Connected (Detection Output) | 7×7×30 (S=7, B=2, C=20) | 7×7×30 |
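The spatial sizes in the table follow standard conv/pool arithmetic. A quick sanity check on the first rows (the padding values are my assumption, since the table does not list them: 3 for the 7×7 conv, 1 for 3×3 convs, 0 for the 2×2 pools):

```python
def conv_out(size, kernel, stride, padding):
    """Spatial output size of a conv/pool layer: floor((size + 2*padding - kernel) / stride) + 1."""
    return (size + 2 * padding - kernel) // stride + 1

s = conv_out(448, 7, 2, 3)   # Conv 1    -> 224 (stride 2 halves the resolution)
s = conv_out(s, 2, 2, 0)     # MaxPool 1 -> 112
s = conv_out(s, 3, 1, 1)     # Conv 2    -> 112 (stride 1 preserves size)
s = conv_out(s, 2, 2, 0)     # MaxPool 2 -> 56
print(s)  # 56
```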

🔄 YOLO Training

🎯 Loss Function (Sum-Squared Error)

L = λ_coord ∑(obj) [(x - x̂)^2 + (y - ŷ)^2]
  + λ_coord ∑(obj) [(√w - √ŵ)^2 + (√h - √ĥ)^2]
  + ∑(obj) (C - Ĉ)^2
  + λ_noobj ∑(noobj) (C - Ĉ)^2
  + ∑(obj) ∑_class (p(c) - p̂(c))^2

🎯 Loss Function and Training Parameters

  • Uses sum-squared error by default, which is simple to implement.
  • However:
    • It weights classification and localization errors equally, which is not ideal.
    • Most grid cells contain no object, so confidence scores are pushed toward zero, causing unstable gradients.
  • Solution: adjust the loss weights.
    • λ_coord = 5: increase the weight on bounding-box coordinate loss.
    • λ_noobj = 0.5: decrease the weight on confidence loss for background cells.
    • Also, to reduce sensitivity to large boxes, the model predicts the square roots of width and height (√w, √h).
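The weighted terms above can be sketched for a single grid cell as follows. This is a toy illustration under the stated λ values, not the paper's full implementation (which also sums over all cells and assigns the "responsible" box by highest IoU with the ground truth):

```python
import math

LAMBDA_COORD, LAMBDA_NOOBJ = 5.0, 0.5

def cell_loss(pred, target, has_object):
    """Sum-squared-error loss for the responsible box of one grid cell.

    pred/target are dicts with keys x, y, w, h, conf, and a 'classes' list.
    """
    if not has_object:
        # Background cell: only the confidence term, down-weighted by λ_noobj.
        return LAMBDA_NOOBJ * (pred["conf"] - target["conf"]) ** 2
    loss = LAMBDA_COORD * ((pred["x"] - target["x"]) ** 2 + (pred["y"] - target["y"]) ** 2)
    # Square-rooted w/h damp the penalty on large boxes relative to small ones.
    loss += LAMBDA_COORD * ((math.sqrt(pred["w"]) - math.sqrt(target["w"])) ** 2 +
                            (math.sqrt(pred["h"]) - math.sqrt(target["h"])) ** 2)
    loss += (pred["conf"] - target["conf"]) ** 2          # object confidence term
    loss += sum((p - t) ** 2 for p, t in zip(pred["classes"], target["classes"]))
    return loss

# A prediction that is perfect except for confidence (0.8 vs 1.0):
box = {"x": 0.5, "y": 0.5, "w": 0.25, "h": 0.25, "conf": 0.8, "classes": [1.0] + [0.0] * 19}
truth = dict(box, conf=1.0)
print(cell_loss(box, truth, has_object=True))  # only the confidence term contributes (≈ 0.04)
```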

๐Ÿ‹๏ธ Training Configuration

  • Epochs: 135
  • Dataset: VOC 2007 + 2012 train/val sets
  • Batch Size: 64
  • Momentum: 0.9
  • Weight Decay: 0.0005
  • Learning Rate Schedule:
    • Start at 1e-3 and gradually increase to 1e-2.
    • Hold at 1e-2 for 75 epochs → 1e-3 for 30 epochs → 1e-4 for 30 epochs.
  • Dropout (rate = 0.5) after the first FC layer to prevent overfitting
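The schedule above can be written as a simple piecewise function. The 5-epoch linear warm-up length is my assumption; the paper only says the rate is raised slowly from 1e-3 to 1e-2 at the start:

```python
def learning_rate(epoch, warmup_epochs=5):
    """Learning rate for a given epoch (0-indexed), per the schedule above."""
    if epoch < warmup_epochs:
        # Assumed linear warm-up from 1e-3 toward 1e-2.
        return 1e-3 + (1e-2 - 1e-3) * epoch / warmup_epochs
    if epoch < 75:
        return 1e-2   # main phase
    if epoch < 105:
        return 1e-3   # first decay
    return 1e-4       # final 30 epochs (135 total)

print([learning_rate(e) for e in (0, 10, 80, 120)])  # [0.001, 0.01, 0.001, 0.0001]
```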

🧩 YOLO Model Evaluation Results

  1. Speed
    • Significantly faster than previous detectors!

    | Model | mAP (%) | FPS | Inference Time (per image) |
    |---|---|---|---|
    | DPM v5 | 33.7 | 0.07 | 14 s/img |
    | R-CNN | 66.0 | 0.05 | 20 s/img |
    | Fast R-CNN | 70.0 | 0.5 | 2 s/img |
    | Faster R-CNN | 73.2 | 7 | 140 ms/img |
    | YOLO | 63.4 | 45 | 22 ms/img |
  2. Global Context Reduces Background Errors
    • Because YOLO sees the entire image, it makes fewer background mistakes!
    • So, combining YOLO with Fast R-CNN boosts performance significantly.
    • Since YOLO is so fast, it adds minimal overhead.

[image: yolo_rcnn]

  3. Cross-domain Applicability
    • Can be used for artworks, illustrations, cartoons, and more!

[image: domain]


🧠 Final Thoughts

From YOLOv1 to v2, v3, v4… and all the way to YOLO-World,
this model has become a classic in object detection, cited over 60,000 times!

๐Ÿ“ What I learned while revisiting this work:

  • Clearly defining an open problem, like the speed bottleneck in detection,
    is often the first step toward innovation.
  • Once a clear limitation is identified, new ideas like YOLO can emerge naturally to solve it.

โ— Without clearly identifying open problems,
we may end up overanalyzing existing methods and merely making incremental tweaks,
rather than aiming for truly transformative improvements.




This post is licensed under CC BY 4.0 by the author.