
๐Ÿ“ Segment Anything, You are amazing! - ๋ˆ„๋ผ์˜ ๊ดด๋ฌผ, SAM์˜ ๋“ฑ์žฅ!! (ICCV, 2023)

๐Ÿ“ Segment Anything, You are amazing! - ๋ˆ„๋ผ์˜ ๊ดด๋ฌผ, SAM์˜ ๋“ฑ์žฅ!! (ICCV, 2023)

🧠 What is SAM?

Studying 『Segment Anything』 (ICCV, 2023)

[Figure: SAM_paper]

📖 Paper Title: Segment Anything
✍️ Authors: Meta AI Research, FAIR (Alexander Kirillov, Eric Mintun, et al.)
🌟 One-line Summary: A general-purpose segmentation model that can segment anything, in any image, from any prompt!


📚 Key Idea

[Figure: manwha]

  • SAM stands for Segment Anything Model
  • Unlike traditional segmentation models, SAM is a universal segmentation AI that can extract any object using a single pre-trained model
  • Without predefined classes, SAM can segment targets specified by user prompts
  • It's often called the "GPT for segmentation" due to its generalization ability

๐Ÿ” Background of the SAM Research

  • The era of Foundation Models:
    • Language models work remarkably well when trained on large-scale data
    • In vision, image encoders such as CLIP and ALIGN have emerged
    • But segmentation still lacks data at that scale and diversity
  • Research Goal: build a foundation model for image segmentation
    • Three key questions: (a) Task: which segmentation task to define? (b) Model: which architecture to use? (c) Data: how to collect it?

🎯 The Task Definition in SAM

  • Limitations of existing segmentation models:
    • They rely on predefined classes
    • They require labelled data
    • They need fine-tuning for new objects
  • The need for prompt-based, open-vocabulary segmentation:
    • With multimodal models like CLIP available, we now want
    • models that can segment user-specified targets from text, point, or box prompts

👉 So SAM was defined as a "segment anything" universal segmentation system


โš™๏ธ SAM Model Architecture

[Figure: architecture]

| Component | Description |
| --- | --- |
| Image Encoder | Encodes the entire image into a fixed embedding (computed once per image) |
| Prompt Encoder | Encodes prompts such as points, boxes, and masks |
| Mask Decoder | Combines image and prompt embeddings to predict the segmentation mask |

Components in Detail

  1. Image Encoder (ViT-H, MAE pre-trained)
    • Uses a ViT trained with Masked Autoencoder (MAE) pre-training
    • Produces a rich visual representation
    • Image embeddings are computed once and reused across multiple prompts
  2. Prompt Encoder
    • Handles two types of inputs:

| Type | Examples | Encoding | Notes |
| --- | --- | --- | --- |
| Sparse | Point, Box, Text | Positional encoding + learned embeddings | Text uses the CLIP text encoder |
| Dense | Mask | Convolution, then element-wise sum with the image embedding | Used for dense prompts like masks |
  3. Mask Decoder

[Figure: mask_encoder]

  • The core logic that fuses prompt and image embeddings to output the final mask

| Step | Description |
| --- | --- |
| 1. Input | Image embedding + prompt embeddings + output token |
| 2. Decoder blocks (×2) | Modified Transformer decoder with self- and cross-attention |
| 3. Upsampling | Upsamples the decoder output using the image embedding |
| 4. Dynamic prediction | MLP → dynamic linear classifier producing per-pixel foreground probabilities |
| 5. Output | Three candidate masks with confidence scores (to resolve ambiguity) |
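
To make the division of labor concrete, here is a minimal conceptual sketch of the forward pass (my own illustration based on the paper's description; module names and shapes are assumptions, not the released implementation):

```python
def sam_forward(image_encoder, prompt_encoder, mask_decoder, image, prompts):
    """Conceptual data flow of SAM; names and shapes are illustrative only."""
    # 1) The heavy ViT-H image encoder runs ONCE per image; its output
    #    can be cached and reused for every subsequent prompt.
    image_embedding = image_encoder(image)           # e.g. (1, 256, 64, 64)

    # 2) The lightweight prompt encoder maps points/boxes/text to sparse
    #    tokens, and mask prompts to a dense embedding.
    sparse_tokens, dense_embedding = prompt_encoder(prompts)

    # 3) The lightweight mask decoder fuses both and predicts 3 candidate
    #    masks plus a predicted-IoU (confidence) score for each.
    masks, iou_scores = mask_decoder(
        image_embedding + dense_embedding,           # dense prompts: element-wise sum
        sparse_tokens,
    )
    return masks, iou_scores                         # (1, 3, H, W), (1, 3)
```

This split is the key design choice: the expensive encoding is amortized, so each new prompt costs only the small prompt encoder and decoder, which the paper reports running in roughly 50 ms on CPU in a web browser.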

๐Ÿ—๏ธ SA-1B Dataset and the Data Engine

[Figure: datasets]

  • SA-1B: the largest segmentation dataset ever, built by Meta for SAM
  • Contains 11M images and over 1.1B masks
  • 400× more masks than prior datasets
  • ✅ Fully automatic annotation, ✅ high diversity and quality

๐Ÿ› ๏ธ 3-Stage Data Engine

| Stage | Name | Who | Method | Key Features |
| --- | --- | --- | --- | --- |
| 1️⃣ | Assisted-manual | Human + SAM | Humans segment, SAM assists | Interactive tool, semantic-free |
| 2️⃣ | Semi-automatic | SAM + Human | SAM segments, humans fill in the rest | Efficient + diverse |
| 3️⃣ | Fully-automatic | SAM only | Grid prompts, full automation | ~100 masks/image, 99.1% of SA-1B |

Assisted-manual Stage

  • Professional annotators use a browser tool backed by SAM
  • They click foreground/background points
  • Refinement via brush & eraser
  • Focused on recognizable objects (but no labels stored)
  • Moved on to the next image if a mask took >30 seconds

| Metric | Result |
| --- | --- |
| Avg. annotation time | 34 → 14 sec (6.5× faster than COCO) |
| Masks per image | 20 → 44 |
| Total | 120K images, 4.3M masks |
| Retraining | 6 times in total |

Semi-automatic Stage

| Metric | Result |
| --- | --- |
| Additional masks | +5.9M (10.2M total) |
| Images | 180K |
| Retraining | 5 more times |
| Time per image | 34 sec (excluding auto-masks) |
| Masks per image | 44 → 72 |

Fully Automatic Stage

  • Grid of 32×32 point prompts
  • Predicts multiple masks per point (sub-part, part, whole)
  • An IoU prediction module filters out unreliable masks
  • Stability check via probability thresholding
  • Non-Max Suppression (NMS) removes duplicates
  • Cropped regions help improve small-object coverage (the released library wraps this whole pipeline; see the sketch below)
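
The released `segment_anything` package exposes exactly this pipeline as `SamAutomaticMaskGenerator`. A sketch of how the knobs above map onto its parameters (the checkpoint path and image file are placeholders; the threshold values shown are, to the best of my knowledge, the library defaults):

```python
import cv2
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

# Load a ViT-H SAM checkpoint (file path is a placeholder).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")

generator = SamAutomaticMaskGenerator(
    sam,
    points_per_side=32,           # the 32x32 point grid from the paper
    pred_iou_thresh=0.88,         # keep masks the IoU-prediction head trusts
    stability_score_thresh=0.95,  # stability check via probability thresholding
    box_nms_thresh=0.7,           # NMS to drop duplicate masks
    crop_n_layers=1,              # also prompt zoomed-in crops for small objects
)

image = cv2.cvtColor(cv2.imread("photo.jpg"), cv2.COLOR_BGR2RGB)
masks = generator.generate(image)
# Each entry holds 'segmentation', 'predicted_iou', 'stability_score', ...
```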

📦 Final SA-1B Dataset Summary

| Aspect | Description |
| --- | --- |
| Image count | 11M |
| Resolution | Avg. 3300×4950 px |
| Licensing | Licensed from photographers |
| Privacy | Faces and license plates blurred |
| Released images | Resized (short side 1500 px) |
| Comparison | Much higher resolution than COCO (480×640) |

| Masks | Details |
| --- | --- |
| Total | 1.1B masks |
| Auto-generated | 99.1% |
| Human-level quality | 94% of masks have IoU > 90% with expert corrections |
| Fair & diverse | Balanced across genders and regions |

🔬 Zero-Shot Transfer Experiments

SAM proves it's not just a segmentation tool, but a universal model.
Evaluated on 5 tasks without fine-tuning:

| Task | Outcome |
| --- | --- |
| 1. Single-Point Mask | Outperforms RITM (automatic & human evaluation) |
| 2. Edge Detection | Strong edges from prompts (even without edge training) |
| 3. Object Proposal | Excellent for medium/rare objects (beats ViTDet) |
| 4. Instance Segmentation | Better visual quality than ViTDet, even if AP is lower |
| 5. Text-to-Mask | Uses CLIP text embeddings for free-text segmentation |

1๏ธโƒฃ Single-Point Valid Mask

  • Only one foreground point โ†’ segment object
  • Evaluation: mIoU + human rating (1โ€“10)
  • SAM beats RITM on 16/23 datasets (mIoU), and all datasets (oracle mode)
  • Human ratings: 7โ€“9 (higher than RITM)
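
With the released `SamPredictor`, the single-point protocol looks like this (checkpoint path, image file, and click coordinates are placeholders):

```python
import numpy as np
import cv2
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("photo.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)               # heavy image encoder runs once here

masks, scores, _ = predictor.predict(
    point_coords=np.array([[500, 375]]), # one foreground click, (x, y)
    point_labels=np.array([1]),          # 1 = foreground, 0 = background
    multimask_output=True,               # 3 candidates to handle ambiguity
)
best = masks[np.argmax(scores)]          # SAM's most confident of the 3 candidates
```

Note the evaluation distinction: standard mode scores SAM's most confident mask, while oracle mode picks the candidate that best matches the ground truth, which is where SAM beats RITM across the board.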

2๏ธโƒฃ Edge Detection

  • Dataset: BSDS500
  • Prompted via 16ร—16 grid
  • Sobel edge detection applied to mask probabilities
  • Matches early DL models like HED
  • Recallโ†‘, Precisionโ†“ due to over-segmentation (expected)
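
A rough reconstruction of that post-processing (my own sketch, not the paper's exact code): collect the mask probability maps produced by the gridded prompts and turn them into a soft edge map with a Sobel filter.

```python
import numpy as np
import cv2

def masks_to_edges(prob_maps: np.ndarray) -> np.ndarray:
    """prob_maps: (N, H, W) float32 mask probabilities from the 16x16 grid."""
    edges = np.zeros(prob_maps.shape[1:], dtype=np.float32)
    for prob in prob_maps:
        gx = cv2.Sobel(prob, cv2.CV_32F, 1, 0)  # horizontal gradient
        gy = cv2.Sobel(prob, cv2.CV_32F, 0, 1)  # vertical gradient
        edges = np.maximum(edges, np.hypot(gx, gy))
    # BSDS-style evaluation would apply edge NMS and thresholding after this.
    return edges / max(float(edges.max()), 1e-6)
```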

3๏ธโƒฃ Object Proposal (LVIS)

  • Method: Mask output used as object proposals
  • Compared to ViTDet-H + Mask R-CNN
  • SAM outperforms in:
    • Medium/large objects
    • Common/rare categories
  • Falls behind on small/frequent ones

4๏ธโƒฃ Instance Segmentation

  • ViTDet boxes โ†’ fed as prompt to SAM
  • COCO/LVIS: SAM slightly behind in AP, but
  • Visual quality better (confirmed via human study)
  • Less biased by noisy ground truth (unlike ViTDet)

5๏ธโƒฃ Text-to-Mask

  • Uses CLIP text encoder as prompt
  • Training with CLIP image embedding โ†’ inference with text embedding
  • Example prompts: โ€œa wheelโ€, โ€œwipersโ€
  • Additional point improves ambiguous cases
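
The released checkpoints do not ship a text head, so the following is only a conceptual sketch of the paper's recipe. The CLIP calls are real API (OpenAI's `clip` package); the SAM-side hook `prompt_with_text_embedding` is hypothetical:

```python
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git

model, _ = clip.load("ViT-L/14")
with torch.no_grad():
    text_emb = model.encode_text(clip.tokenize(["a wheel"]))
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)  # unit-normalize

# Hypothetical: a SAM variant trained as in the paper (prompted with CLIP
# *image* embeddings at train time) would accept this *text* embedding as a
# sparse prompt token, optionally alongside a point prompt:
# masks = sam_text.prompt_with_text_embedding(image, text_emb, points=[(500, 375)])
```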

✨ Final Thoughts

Meta didn't just build a model: they released the model plus high-quality data with strong fairness,
making a true contribution to the open AI community.

Let's hope we can do the same in the future, building & sharing great models and datasets!


🧠 What is SAM? (Korean Version)

Studying 『Segment Anything』 (ICCV, 2023)

[Figure: SAM_paper]

📖 Paper Title: Segment Anything
✍️ Authors: Meta AI Research (Kirillov, Mintun, et al.)
🌟 One-line Summary: The arrival of a universal segmentation model that can cut out "anything": any object, any image, any input!!


📚 Key Idea

[Figure: manwha]

  • SAM is short for Segment Anything Model,
  • and unlike existing segmentation models,
  • it is a universal segmentation AI that can cut out any object with a single pre-trained model!
  • In other words, even with no predefined classes, a user prompt alone is enough to isolate the desired target.
  • SAM's generality is strong enough that it is called the "GPT for segmentation".

🔍 Background of the SAM Research

  • This is truly the era of Foundation Models!!
    • Trained on massive data, language models work astonishingly well!!
    • In computer vision, image encoders such as CLIP and ALIGN have appeared and even influenced image generation!!
    • But a shortage of vision data imposed limits!!
  • So!! the goal of this work was defined as "build a foundation model for image segmentation"!!
    • And for the research to succeed, three ingredients were worked out:
      a. Task: which task to set up!
      b. Model: which model to use!
      c. Data: which data to use!

Defining the Task of the SAM Research

  • Limitations of existing segmentation models
    • Most segmentation models can be trained only with predefined classes
    • They can segment only specific objects (e.g., cat, dog, car) and depend heavily on labelled data
    • New classes require retraining (fine-tuning)
  • The need for open-vocabulary, prompt-based models
    • With the advent of multimodal models such as CLIP,
    • models that segment user-specified objects via "text", "points", etc. came to be required
  • So!! the task was defined as "a universal segmenter that cuts out anything"!

โš™๏ธ SAM์˜ ๋ชจ๋ธ ๊ตฌ์กฐ

[Figure: architecture]

| Component | Description |
| --- | --- |
| Image Encoder | Encodes the whole image into a fixed image embedding (run only once) |
| Prompt Encoder | Encodes the various prompts: points, boxes, masks |
| Mask Decoder | Combines the image and prompt encodings to predict the mask |

SAM์€ ์„ธ ๊ฐ€์ง€ ์ฃผ์š” ๊ตฌ์„ฑ ์š”์†Œ ๋ฐ ๊ทธ ๊ธฐ๋Šฅ!!

  1. Image Encoder (ViT-H based)
    • Uses a ViT trained in the MAE (Masked Autoencoder) style: worth studying what MAE is!!
      | Masked Autoencoders Are Scalable Vision Learners (CVPR, 2022)
    • Encodes the image at high resolution into a rich visual representation
    • Once encoded, an image can be reused across multiple prompts
  2. Prompt Encoder
    • Encodes the user's input
    • Two broad input types:

| Type | Examples | Encoding | Details |
| --- | --- | --- | --- |
| Sparse | Point, Box, Text | Positional encoding + learned embeddings / CLIP text encoder | Positional encoding of locations plus learned embeddings; text uses the CLIP text encoder |
| Dense | Mask | Convolution + element-wise sum | The mask is embedded by convolutions, then summed element-wise with the image embedding |
  3. Mask Decoder

[Figure: mask_encoder]

    • Generates the final mask from the encoder outputs
    • The core component that fuses image and prompt information into the mask
    • 🔧 Main components and processing steps:

| Step | Description |
| --- | --- |
| 1. Input | Image embedding; prompt embeddings (point, box, text, ...); output token |
| 2. Decoder blocks (×2) | Modified Transformer decoder; prompt self-attention; two-way cross-attention (prompt ↔ image embedding) |
| 3. Upsampling | Upsamples the image embedding from the decoder output |
| 4. Dynamic mask prediction | Output token → MLP → dynamic linear classifier; computes a foreground probability at every pixel |
| 5. Final output | Foreground-probability map → binary mask |

    ※ To resolve ambiguity: three candidate masks are output, each with its own confidence score


๐Ÿ—๏ธ SAM์˜ ๋ฐ์ดํ„ฐ (SA-1B) ๋ฐ ๋ฐ์ดํ„ฐ ์—”์ง„

[Figure: datasets]

  • SA-1B: an ultra-large segmentation dataset built by Meta to train SAM
  • 1B+ masks collected automatically from 11M images in total
  • 400× more masks than any previous segmentation dataset
  • ✅ Fully automatic collection, ✅ high quality & diversity guaranteed
  • Played a key role in securing SAM's generality and robustness
  • 📚 Also serves as a public resource for future foundation-model research
  • Summary of the SA-1B creation procedure:

| Stage | Name | Who | Main Work | Key Features |
| --- | --- | --- | --- | --- |
| 1️⃣ | Assisted-manual | Human + SAM | Humans draw the masks, SAM assists | Interactive segmentation; secured initial quality |
| 2️⃣ | Semi-automatic | SAM + Human | SAM generates some masks, humans annotate the rest | More diversity, better time efficiency |
| 3️⃣ | Fully automatic | SAM | SAM generates all masks from point prompts | ~100 masks per image on average; makes up most of SA-1B |

Stage 1: Assisted-Manual Stage

  • SAM assists annotation in real time inside a browser-based interactive tool
  • Professional annotators click foreground/background points to create masks
  • Precise correction is possible with brush & eraser
  • Annotators freely labelled "describable" objects (no semantic restriction)
  • No names or descriptions were stored for the masks
  • If a mask took more than 30 seconds, they moved on to the next image
  • SAM was retrained 6 times on the collected masks!!

🔁 Model improvement process

| Item | Details |
| --- | --- |
| Initial model | SAM trained on public segmentation datasets |
| Iterative training | Retrained 6 times in total on the newly collected masks alone |
| ViT backbone | Progressively scaled from ViT-B to ViT-H |
| Architecture | Plus assorted structural refinements along the way |

📈 Performance gains

| Metric | Change |
| --- | --- |
| Avg. annotation time | 34 sec → 14 sec (6.5× faster than COCO) |
| Avg. masks per image | 20 → 44 |
| Collected | 120K images, 4.3M masks |

Stage 2: Semi-Automatic Stage

  • This stage is an "automatic + manual" collaboration, playing a key role in covering harder and more diverse objects
  • The added mask diversity strengthened SAM's general segmentation ability
    1. Using the stage-1 masks, train a bounding-box detector for a single generic "object" class
    2. Pre-fill each image with the confidently auto-detected masks
    3. Annotators then manually annotate only the missed objects

📈 Numbers

| Item | Details |
| --- | --- |
| Masks collected | +5.9M (reaching 10.2M in total) |
| Images | 180K |
| SAM retrainings | 5 more iterations |
| Avg. annotation time | 34 sec (excluding automatic masks) |
| Avg. masks per image | 44 → 72 (automatic + manual) |

Stage 3: Fully Automatic Stage

  • With the model trained on the data from stages 1-2, the dataset was generated fully automatically!!
  • This is how the SA-1B dataset of Segment Anything was completed
  • The SAM model matters, but providing an unprecedented resource for training universal segmentation models is just as meaningful!!

🔧 Automatic generation procedure (a small sketch of the stability check follows the list)
  1. Prompt the image with a 32×32 point grid
  2. Predict multiple masks for each point
    • e.g., a point on an "arm" → arm / arm + torso / whole-person masks
  3. Keep only high-confidence masks via the IoU prediction module
  4. Stability check:
    • if thresholding the probability map at 0.5, 0.55, etc. yields similar masks, the mask is "stable"
  5. Remove duplicates with NMS (Non-Max Suppression)
  6. Zoomed-in image crops are processed in parallel to cover small objects
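
A minimal numpy sketch of that stability check, assuming `prob` is one candidate's pre-threshold probability map (my reconstruction; the released repo implements the same idea in `segment_anything.utils.amg`):

```python
import numpy as np

def stability_score(prob: np.ndarray, tau: float = 0.5, delta: float = 0.05) -> float:
    """IoU between the mask cut at tau+delta and the mask cut at tau-delta."""
    strict = prob > (tau + delta)   # stricter cutoff -> slightly smaller mask
    loose = prob > (tau - delta)    # looser cutoff  -> slightly larger mask
    inter = np.logical_and(strict, loose).sum()
    union = np.logical_or(strict, loose).sum()
    return float(inter) / max(float(union), 1.0)

# A mask is kept as 'stable' when this score stays near 1.0, i.e. the mask
# barely changes as the probability cutoff is perturbed.
```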

So what does the final SA-1B dataset look like!?

🖼️ Image composition

| Item | Details |
| --- | --- |
| Number of images | 11,000,000 |
| Resolution | 3300 × 4950 pixels on average |
| Source | Licensed from a provider working directly with photographers |
| Privacy protection | Faces and license plates blurred |
| Release format | Downsampled version provided, shortest side 1500 px |
| Comparison | COCO: 480×640 → SA-1B is far higher resolution |

🧩 Mask composition

| Item | Details |
| --- | --- |
| Total masks | 1.1B (1.1 billion) |
| Generation | 99.1% automatically generated (fully automatic stage) |
| Included masks | The final release contains only the automatically generated masks |
| Quality | 94% agree with expert corrections at IoU > 90% |

🔍 Quality verification: automatic vs. expert!! (a tiny IoU helper follows)
  • 500 images (about 50K masks) were randomly sampled,
    and experts refined those masks carefully with brush & eraser
  • Results:
    • 94% of mask pairs had IoU > 90%
    • 97% had IoU > 75%
  • For reference, inter-annotator agreement reported in earlier papers is around 85-91% IoU
  • ⇒ SAM's automatic masks are of expert-level quality
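
All of these agreement numbers reduce to plain mask IoU; for completeness, a generic helper (not code from the paper):

```python
import numpy as np

def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    """IoU between two boolean masks of identical shape."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return float(inter) / max(float(union), 1.0)

# '94% of pairs with IoU > 90%' means: for 94% of the sampled masks,
# mask_iou(auto_mask, expert_corrected_mask) > 0.9.
```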
Fair (PC) data, too!!
  • The data is not skewed in any respect: balanced across men and women, and across regions such as Europe, Asia, and Africa!!^^*

Zero-Shot Transfer Experiments (Testing SAM's Generality)

SAM goes beyond being a tool that simply draws masks on images!!
The experiments demonstrate that it is a universal model, directly applicable to a variety of vision tasks without additional training.
Five experiments in total measure SAM's zero-shot performance.


🧭 Overview of the zero-shot experiments

  • Zero-Shot Transfer: SAM is applied directly to datasets and tasks never used in training
  • The 5 evaluated tasks:
    1. Single-Point Valid Mask
    2. Edge Detection
    3. Object Proposal Generation
    4. Instance Segmentation
    5. Text-to-Mask
  • Summary of the experiments!

| Experiment | Result Summary |
| --- | --- |
| Single-Point Mask | Beats RITM both qualitatively and quantitatively |
| Edge Detection | Extracts meaningful edges with no edge training |
| Object Proposal | Top-tier proposals for medium/rare objects |
| Instance Segmentation | Lower AP, but superior visual quality and human ratings |
| Text-to-Mask | Successfully extends to natural-language segmentation via CLIP embeddings |

1๏ธโƒฃ ๋‹จ์ผ ํฌ์ธํŠธ ๊ฐ์ฒด ๋ถ„ํ•  (Single-Point Valid Mask)
  • ์„ค์ •: ์ „๊ฒฝ ํฌ์ธํŠธ ํ•˜๋‚˜๋งŒ์œผ๋กœ ๊ฐ์ฒด ๋ถ„ํ• 
  • ํ‰๊ฐ€: mIoU + ์‚ฌ๋žŒ ์ฃผ์„์ž ํ‰๊ฐ€ (1~10์ )
  • ๊ฒฐ๊ณผ:
    • 23๊ฐœ ์ค‘ 16๊ฐœ ๋ฐ์ดํ„ฐ์…‹์—์„œ RITM ๋Œ€๋น„ mIoU ์šฐ์œ„
    • Oracle ์„ ํƒ ์‹œ ์ „ ๋ฐ์ดํ„ฐ์…‹์—์„œ RITM ๋Šฅ๊ฐ€
    • ์‚ฌ๋žŒ ํ‰๊ฐ€๋Š” 7~9์ ์œผ๋กœ RITM๋ณด๋‹ค ์ผ๊ด€๋˜๊ฒŒ ๋†’์Œ
    • SAM์€ ๋ชจํ˜ธํ•œ ์ž…๋ ฅ์—์„œ๋„ ์œ ํšจํ•œ ๋งˆ์Šคํฌ ์ƒ์„ฑ ๋Šฅ๋ ฅ ์ž…์ฆ

2๏ธโƒฃ ์—ฃ์ง€ ๊ฐ์ง€ (Zero-Shot Edge Detection)
  • ์„ค์ •: BSDS500์—์„œ ์—์ง€ ๊ฐ์ง€ ์ˆ˜ํ–‰
    • 16ร—16 ํฌ์ธํŠธ๋กœ SAM์„ ํ”„๋กฌํ”„ํŠธ โ†’ Sobel ํ•„ํ„ฐ๋กœ ๊ฒฝ๊ณ„ ์ถ”์ถœ
  • ๊ฒฐ๊ณผ:
    • ์—์ง€ ๊ฐ์ง€์šฉ ํ•™์Šต ์—†์ด๋„ ์˜๋ฏธ ์žˆ๋Š” ์—์ง€ ๋งต ์ƒ์„ฑ
    • ์ตœ์‹  ๊ธฐ๋ฒ•๋ณด๋‹จ ์ •๋ฐ€๋„๋Š” ๋‚ฎ์ง€๋งŒ, HED ๋“ฑ ์ดˆ๊ธฐ ๋”ฅ๋Ÿฌ๋‹ ๋ชจ๋ธ ์ˆ˜์ค€ ์ด์ƒ
    • Zero-shot ์น˜๊ณ  ๋งค์šฐ ์šฐ์ˆ˜ํ•œ ์„ฑ๋Šฅ

3๏ธโƒฃ ๊ฐ์ฒด ์ œ์•ˆ (Zero-Shot Object Proposal)
  • ์„ค์ •: LVIS์—์„œ ์ œ์•ˆ๋œ ๋งˆ์Šคํฌ๋“ค๋กœ ๊ฐ์ฒด ์ œ์•ˆ
  • ๋น„๊ต: ViTDet-H + Mask R-CNN (DMP ๋ฐฉ๋ฒ•)
  • ํ‰๊ฐ€ ์ง€ํ‘œ: Average Recall (AR@1000)
  • ๊ฒฐ๊ณผ:
    • ์ค‘๊ฐ„/ํฐ ๊ฐ์ฒด, ํฌ๊ท€/์ผ๋ฐ˜ ๊ฐ์ฒด์—์„œ ViTDet-H๋ณด๋‹ค ์šฐ์ˆ˜
    • ์ž‘์€ ๊ฐ์ฒด์—์„œ๋Š” ViTDet-H๊ฐ€ ์šฐ์„ธ (LVIS์— ํŠนํ™”๋œ ํ•™์Šต ๋•Œ๋ฌธ)
    • Ambiguity-aware ๋ฒ„์ „์ด ์••๋„์  ์„ฑ๋Šฅ ํ–ฅ์ƒ ์ œ๊ณต
4๏ธโƒฃ ์ธ์Šคํ„ด์Šค ์„ธ๋ถ„ํ™” (Zero-Shot Instance Segmentation)
  • ์„ค์ •: ๊ฐ์ง€๊ธฐ(ViTDet) ๋ฐ•์Šค๋ฅผ ํ”„๋กฌํ”„ํŠธ๋กœ SAM์— ๋งˆ์Šคํฌ ์ƒ์„ฑ
  • ๊ฒฐ๊ณผ:
    • COCO/LVIS์—์„œ AP๋Š” ViTDet๋ณด๋‹ค ๋‚ฎ์ง€๋งŒ
    • ๊ฒฝ๊ณ„ ํ’ˆ์งˆ์€ SAM์ด ๋” ์šฐ์ˆ˜
    • ์‚ฌ๋žŒ ํ‰๊ฐ€์—์„œ๋„ SAM ๋งˆ์Šคํฌ๊ฐ€ ๋” ๋†’๊ฒŒ ํ‰๊ฐ€๋จ
  • ๋ถ„์„:
    • COCO๋Š” ํ’ˆ์งˆ ๋‚ฎ์€ GT โ†’ ViTDet๋Š” ๋ฐ์ดํ„ฐ ํŽธํ–ฅ ํ•™์Šต
    • SAM์€ ๊ทธ๋Ÿฐ ํŽธํ–ฅ ์—†์ด ๋ณด๋‹ค ์ผ๋ฐ˜์ ์ธ ๋ถ„ํ•  ์ˆ˜ํ–‰
5๏ธโƒฃ ํ…์ŠคํŠธ โ†’ ๋งˆ์Šคํฌ (Zero-Shot Text-to-Mask)
  • ์„ค์ •: ํ…์ŠคํŠธ ํ”„๋กฌํ”„ํŠธ๋งŒ์œผ๋กœ ๊ฐ์ฒด ๋ถ„ํ• 
    • CLIP์˜ ์ด๋ฏธ์ง€ ์ž„๋ฒ ๋”ฉ โ†” ํ…์ŠคํŠธ ์ž„๋ฒ ๋”ฉ ์ •๋ ฌ์„ ์ด์šฉํ•ด ํ•™์Šต
  • ๊ฒฐ๊ณผ:
    • โ€œa wheelโ€, โ€œbeaver tooth grilleโ€ ๋“ฑ ์ž์—ฐ์–ด๋กœ ๊ฐ์ฒด ๋ถ„ํ•  ๊ฐ€๋Šฅ
    • ํ…์ŠคํŠธ๋งŒ์œผ๋กœ ์ž˜ ์•ˆ๋  ๊ฒฝ์šฐ, ํฌ์ธํŠธ ํ”„๋กฌํ”„ํŠธ๋ฅผ ์ถ”๊ฐ€ํ•˜๋ฉด ๊ฐœ์„ ๋จ
  • ์‹œ์‚ฌ์ :
    • SAM์€ ๋ฉ€ํ‹ฐ๋ชจ๋‹ฌ ์ธํ„ฐํŽ˜์ด์Šค๋กœ ๋ฐœ์ „ ๊ฐ€๋Šฅ์„ฑ์ด ํผ

✨ Closing Thoughts

Just doing the research keeps you busy enough, yet a big-tech company went and released both the model and the dataset!
On top of that, providing good data with fairness built in is something to be genuinely thankful for!
May the day come when we, too, release high-quality datasets and high-performing models!!


This post is licensed under CC BY 4.0 by the author.