
🧠 Understanding SAM2?!!

🦖 (English) 🧠 Understanding SAM2

🔍 A next-generation segmentation model with unified image & video support, real-time speed, and high accuracy!


Paper: SAM 2: SEGMENT ANYTHING IN IMAGES AND VIDEOS
Conference: ICLR 2025 (by META Research)
Code: facebookresearch/SAM2
Comment: SAM2 follows SAM, now moving beyond static images to dynamic videos!


In a previous post, we explored SAM, released by Facebook Research.
Today, let's dive into SAM2, the follow-up model released by the same team!

โ— Limitations of the Original SAM

As the era of AR/VR and video content expands, SAM, which was designed for static images, has the following limitations:

  • Designed for static images only
    → SAM does not account for temporal dynamics across frames.

  • Cannot track spatio-temporal continuity
    → Cannot handle changes due to motion, deformation, occlusion, or lighting variations.

  • Vulnerable to low video quality
    → Performance drops with blur, noise, or low resolution common in videos.

  • Processes each frame independently
    → Lacks consistent tracking or segmentation continuity across frames.

  • Inefficient on long videos
    → Cannot scale well to thousands of frames due to memory and speed limitations.

➡️ Modern video applications need a model that can handle both spatial and temporal segmentation in a unified way.


✅ Key Features of SAM2

🔍 1. Promptable Visual Segmentation (PVS)

(Figure: Promptable Visual Segmentation workflow)

  • Have you ever tried background removal in PowerPoint?
  • You can click on areas to keep or remove; SAM2's prompting works similarly, but for video frames!
  • You can provide point, box, or mask prompts at any video frame,
    and SAM2 generates a spatio-temporal mask (masklet).
  • Additional prompts help refine the segmentation over time.
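
To make the PVS workflow concrete, here is a minimal sketch of point-prompting a video with the predictor from the facebookresearch/SAM2 repository. The checkpoint and config paths are placeholders, and the function names follow the repo's README as best I recall, so treat the exact signatures as assumptions rather than verified API.

```python
import numpy as np
import torch
from sam2.build_sam import build_sam2_video_predictor

# Placeholder paths; use the checkpoint/config pair that matches your download.
checkpoint = "checkpoints/sam2_hiera_large.pt"
model_cfg = "sam2_hiera_l.yaml"

predictor = build_sam2_video_predictor(model_cfg, checkpoint)

with torch.inference_mode():
    # A directory of extracted JPEG frames (newer repo versions also accept video files).
    state = predictor.init_state(video_path="./video_frames")

    # One positive click (label=1) on frame 0 selects the object to segment.
    _, obj_ids, mask_logits = predictor.add_new_points_or_box(
        inference_state=state,
        frame_idx=0,
        obj_id=1,
        points=np.array([[210, 350]], dtype=np.float32),
        labels=np.array([1], dtype=np.int32),
    )

    # Propagate the prompt through the whole clip to obtain a spatio-temporal masklet.
    masklet = {}
    for frame_idx, obj_ids, logits in predictor.propagate_in_video(state):
        masklet[frame_idx] = (logits[0] > 0.0).cpu().numpy()
```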

🧠 2. The SAM2 Model: Memory-based Streaming Architecture

SAM2 is a unified model for both images and videos,
extending the original SAM with a streaming architecture and memory module for video support.

(Figure: SAM2 model architecture)


🔧 Core Components
🔹 1. Image Encoder
  • Uses Hiera-based hierarchical encoder (MAE pre-trained)
  • Supports streaming frame-by-frame processing
  • Enables high-resolution segmentation via multiscale features
  • Outputs unconditioned embeddings
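
Since these embeddings are unconditioned, they can be computed once per frame and reused for any number of prompts. A minimal sketch of that pattern with the repo's image predictor follows; the class and config names are taken from the public README as I recall it, so treat them as assumptions.

```python
import numpy as np
from PIL import Image
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

# Placeholder config/checkpoint names.
predictor = SAM2ImagePredictor(build_sam2("sam2_hiera_l.yaml", "checkpoints/sam2_hiera_large.pt"))

image = np.array(Image.open("frame_000.jpg").convert("RGB"))
predictor.set_image(image)  # runs the Hiera encoder once and caches the embeddings

# Any number of prompts can now be decoded against the same cached embeddings.
masks, scores, _ = predictor.predict(
    point_coords=np.array([[500, 375]]),
    point_labels=np.array([1]),
    multimask_output=True,  # return several candidate masks for an ambiguous click
)
```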

🔹 2. Memory Attention
  • Conditions the current frame's features on memory from previous frames and prompts
  • Uses L transformer blocks
    (self-attention → cross-attention → MLP)
  • Leverages modern attention kernel optimizations
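
Below is a minimal PyTorch sketch of one such block. The dimensions, normalization placement, and class name are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class MemoryAttentionBlock(nn.Module):
    """Illustrative block: self-attention over the current frame's tokens,
    cross-attention into memory tokens (recent + prompted frames), then an MLP."""

    def __init__(self, d_model: int = 256, n_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, frame_tokens: torch.Tensor, memory_tokens: torch.Tensor) -> torch.Tensor:
        # frame_tokens: (B, T_frame, C); memory_tokens: (B, T_mem, C)
        x = frame_tokens
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h)[0]                      # self-attention
        x = x + self.cross_attn(self.norm2(x), memory_tokens,   # cross-attention to memory
                                memory_tokens)[0]
        x = x + self.mlp(self.norm3(x))                         # MLP
        return x
```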

🔹 3. Prompt Encoder & Mask Decoder
  • Same prompt encoder as SAM: supports clicks, boxes, masks
  • Uses two-way transformers to update prompt/frame embeddings
  • Can generate multiple masks for ambiguous prompts
  • Adds object presence head to detect frames where the object is absent
  • Adds high-resolution skip connections for improved decoding
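
To illustrate how these outputs could be consumed downstream, here is a hypothetical sketch: pick the candidate mask with the highest predicted quality, and suppress the output entirely when the object-presence head says the target is not in the frame. The tensor layout and the 0.5 threshold are my assumptions, not the repository's internals.

```python
import torch
from typing import Optional

def select_mask(mask_logits: torch.Tensor,        # (K, H, W) candidate mask logits
                iou_scores: torch.Tensor,         # (K,) predicted quality per candidate
                presence_logit: torch.Tensor) -> Optional[torch.Tensor]:
    """Return the best binary mask, or None when the object is judged absent."""
    if torch.sigmoid(presence_logit) < 0.5:       # illustrative presence threshold
        return None                               # object not visible in this frame
    best = int(torch.argmax(iou_scores))
    return mask_logits[best] > 0.0                # binarize the chosen candidate
```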

🔹 4. Memory Encoder
  • Downsamples predicted masks
  • Combines them with unconditioned embeddings via element-wise summation
  • Fuses features via lightweight CNN layers
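
A rough PyTorch sketch of that fusion step is below; the channel widths, downsampling factor, and layer choices are placeholder assumptions.

```python
import torch
import torch.nn as nn

class MemoryEncoder(nn.Module):
    """Illustrative only: downsample the predicted mask to feature resolution,
    add it element-wise to the unconditioned frame embedding, then fuse the
    result with a lightweight convolutional stack."""

    def __init__(self, embed_dim: int = 256, mem_dim: int = 64):
        super().__init__()
        # Project the 1-channel mask to the embedding width at 1/16 resolution.
        self.mask_downsample = nn.Conv2d(1, embed_dim, kernel_size=16, stride=16)
        self.fuser = nn.Sequential(
            nn.Conv2d(embed_dim, embed_dim, 3, padding=1), nn.GELU(),
            nn.Conv2d(embed_dim, mem_dim, 1),   # memory features are kept compact
        )

    def forward(self, pred_mask_logits: torch.Tensor, frame_embed: torch.Tensor) -> torch.Tensor:
        # pred_mask_logits: (B, 1, H, W); frame_embed: (B, C, H/16, W/16)
        mask_feat = self.mask_downsample(torch.sigmoid(pred_mask_logits))
        return self.fuser(frame_embed + mask_feat)   # element-wise sum, then CNN fusion
```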

🔹 5. Memory Bank
  • Stores memory features for:
    • N recent frames (auto-segmented)
    • M prompted frames (user-guided)
  • Each memory is a spatial feature map
  • Stores object pointers as high-level semantic vectors
  • Adds temporal position embeddings to N frames
    → Helps track short-term motion

+ Summary of Memory Encoder & Bank (like a tracker!)

  1. Segment current frame using prompts → mask decoder
  2. Encode memory → summarized memory features
  3. Store in memory bank → N auto-segmented + M prompted frames
  4. Next frame input → unconditioned embedding
  5. Compare via memory attention → cross-attend to past memories → localize object
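
Putting steps 1-5 together, a simplified version of the per-frame streaming loop could look like the sketch below. The component interfaces and the N/M handling are an illustrative reconstruction of the description above, not the repository code.

```python
from collections import deque

def stream_video(frames, prompts, image_encoder, memory_attention, mask_decoder,
                 memory_encoder, N=6, M=2):
    """frames: iterable of frames; prompts: dict {frame_idx: user_prompt}.
    Returns {frame_idx: predicted_mask}. All components are passed in as callables."""
    recent = deque(maxlen=N)   # memories of the N most recent auto-segmented frames (FIFO)
    prompted = []              # memories of up to M prompted frames, kept separately
    masks = {}

    for t, frame in enumerate(frames):
        embed = image_encoder(frame)                          # unconditioned embedding
        memory = prompted + list(recent)
        cond = memory_attention(embed, memory) if memory else embed
        mask = mask_decoder(cond, prompts.get(t))             # prompt may be None
        mem_feat = memory_encoder(mask, embed)                # summarize this frame

        if t in prompts and len(prompted) < M:
            prompted.append(mem_feat)                         # user-guided memory
        else:
            recent.append(mem_feat)                           # oldest auto memory drops out
        masks[t] = mask
    return masks
```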

๐Ÿ‹๏ธโ€โ™‚๏ธ Model Training

SAM2 is trained jointly on image and video data.
Training simulates interactive user prompting scenarios.

Each training sequence samples 8 frames, with up to 2 prompted frames.
Initial prompts are randomly selected from:

  • 50%: full mask
  • 25%: positive click
  • 25%: bounding box

Additionally, corrective clicks are generated during training to refine predictions.
The model learns to sequentially and interactively predict spatio-temporal masklets based on user guidance.
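
A hedged sketch of this prompt-sampling scheme is shown below. The paper specifies the 50/25/25 split and the use of corrective clicks; the helper functions and the exact corrective-click rule here are illustrative assumptions.

```python
import numpy as np

def random_point_inside(mask: np.ndarray):
    """Pick a random (x, y) coordinate inside a boolean mask (assumed non-empty)."""
    ys, xs = np.nonzero(mask)
    i = np.random.randint(len(ys))
    return int(xs[i]), int(ys[i])

def bounding_box(mask: np.ndarray):
    ys, xs = np.nonzero(mask)
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

def sample_initial_prompt(gt_mask: np.ndarray):
    """50% ground-truth mask, 25% positive click inside the mask, 25% bounding box."""
    r = np.random.rand()
    if r < 0.50:
        return {"type": "mask", "mask": gt_mask}
    if r < 0.75:
        return {"type": "point", "point": random_point_inside(gt_mask), "label": 1}
    return {"type": "box", "box": bounding_box(gt_mask)}

def corrective_click(pred_mask: np.ndarray, gt_mask: np.ndarray):
    """Illustrative rule: click positively on missed object pixels if any exist,
    otherwise click negatively on wrongly included background."""
    false_neg = gt_mask & ~pred_mask
    if false_neg.any():
        return {"type": "point", "point": random_point_inside(false_neg), "label": 1}
    false_pos = pred_mask & ~gt_mask
    return {"type": "point", "point": random_point_inside(false_pos), "label": 0}
```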


🧰 3. Data Engine-based Training

SAM2 uses a Human + Model collaboration approach (data engine), organized in phases:

Phase   | Description
Phase 1 | Humans annotate each frame using SAM1 → high-quality GT data → trains the initial SAM2
Phase 2 | Early SAM2 + human refinements → SAM2 retrained, faster propagation
Phase 3 | Full SAM2 with memory → humans refine via clicks only
+ QA    | Separate validators check quality → unsatisfactory samples corrected or rejected
+ Auto  | SAM2 auto-generates masklets → filtered and added if satisfactory

🚀 SAM2 Performance

As expected, SAM2 achieves strong performance across benchmarks;
no surprise it was accepted to ICLR 2025 😉 (details skipped here).


✅ SAM vs. SAM2 Comparison

Feature          | SAM (2023)                              | SAM2 (2025)
Inference Speed  | Slow (large ViT backbone)               | ✅ Up to 30× faster, real-time capable
Architecture     | Heavy ViT-H model (~632M params)        | ✅ Lightweight design with sparse attention
Accuracy         | Strong but struggles with small objects | ✅ Improved mask precision, especially for small objects
Prompt Types     | Point, Box, Mask                        | ✅ Potential for text & multimodal prompts
Input Modalities | Static images only                      | ✅ Supports video & multi-scale inputs
Deployment       | Cloud & research-focused                | ✅ Runs on mobile & edge devices

🦖 (Korean) 🧠 Understanding SAM2?!!

🔍 A next-generation segmentation model that unifies images and video, with real-time speed and high accuracy!


Paper: SAM 2: SEGMENT ANYTHING IN IMAGES AND VIDEOS
Conference: ICLR 2025 (by META Research)
Code: facebookresearch/SAM2
Comment: SAM2, the successor to SAM. Now going beyond images to video!!


In a previous post, we looked at SAM, released by Facebook Research.
Today, let's take a look at SAM2, the follow-up model released by the same Facebook Research team!

โ— ๊ธฐ์กด SAM ๋ชจ๋ธ์˜ ํ•œ๊ณ„

As the era of AR/VR and other video content arrives, SAM, a model built for image segmentation, showed the following limitations.

  • Static images only:
    SAM was designed to operate on single images and does not consider the temporal dimension.

  • No spatio-temporal tracking:
    It cannot handle changes over time such as object motion, deformation, or occlusion.

  • Vulnerable to low video quality:
    Videos often contain blur, noise, or low resolution, and SAM is not robust to such degradation.

  • Frame-by-frame independent processing:
    Because each frame is processed separately, consistent tracking or coherent mask segmentation is difficult.

  • Inefficient for large-scale video:
    Processing speed and memory efficiency drop when thousands of frames must be handled.

➡️ Therefore, modern video-centric applications need a more unified model that reflects spatio-temporal information.


✅ Key Features of SAM2!

🔍 1. Promptable Visual Segmentation (PVS)

(Figure: Promptable Visual Segmentation workflow)

  • Have you ever tried background removal on an image in PowerPoint?
  • In PowerPoint you can easily mark which parts of the image to keep and which to remove.
  • SAM2's prompting (PVS) likewise lets you mark regions to keep or remove on any frame (image) of a video,
  • and it generates a spatio-temporal mask (masklet) accordingly.
  • Giving additional prompts on further frames progressively refines the segmentation.

🧠 2. The SAM2 Model: A Memory-based Streaming Architecture!

SAM2 is a unified segmentation model applicable to both images and videos; it extends the original SAM with a streaming architecture and a memory module so that it also works flexibly on video.

(Figure: SAM2 model architecture)

🔧 Core Components
🔹 1. Image Encoder
  • Uses a Hiera-based hierarchical encoder (MAE pre-trained)
  • Supports streaming, processing frames sequentially
  • Multiscale features enable high-resolution segmentation
  • Outputs unconditioned token embeddings

🔹 2. Memory Attention
  • Conditions the current frame's features on memories built from previous frames and prompts
  • Uses L transformer blocks
    • self-attention → cross-attention (memory + object pointers) → MLP
  • Can apply modern attention kernel optimizations

🔹 3. Prompt Encoder & Mask Decoder
  • Prompt encoder with the same structure as SAM
    • clicks, boxes, and masks are handled via positional encodings + learned embeddings
  • Two-way transformer blocks mutually update the prompt and frame embeddings
  • Can predict multiple masks (to handle ambiguous prompts)
  • Adds an object-presence head: unlike SAM, the target object may be absent from a frame, so this decision is needed!!
  • Adds high-resolution skip connections (routed to the decoder without passing through memory attention)

🔹 4. Memory Encoder
  • Downsamples the currently predicted mask, then
    sums it element-wise with the image encoder's unconditioned embedding
  • The result is then fused through lightweight CNN layers

🔹 5. Memory Bank
  • Stores memories for the N most recent frames and up to M prompted frames
  • Each memory is a spatial feature map
  • Object pointers:
    • based on the mask decoder's output tokens for each frame
    • store high-level semantic information about the object as vectors
  • Temporal position embeddings:
    • applied only to the N recent frames, enabling short-term motion tracking
  • Summary of how the SAM2 memory encoder & memory bank work (it's like a tracker!)
  1. Segment the current frame: based on the prompts, the mask decoder segments the object
  2. Memory encoding: a summary of the object's features is produced → memory feature
  3. Memory bank storage: memories of the N (auto-segmented) frames are managed as a FIFO queue, and prompted frames are stored separately, up to M of them
  4. Next frame input: the image encoder produces the new frame's unconditioned embedding
  5. Memory attention comparison: the current frame embedding is compared against the past memory features in the memory bank via cross-attention to localize the object
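
As a small illustration of step 3, the memory bank's two kinds of entries can be managed as shown below; the class name and the default N/M values are placeholder assumptions.

```python
from collections import deque

class MemoryBank:
    """Illustrative memory bank: memories of the N most recent auto-segmented frames
    live in a FIFO queue, while up to M prompted-frame memories are kept separately."""

    def __init__(self, n_recent: int = 6, m_prompted: int = 2):
        self.recent = deque(maxlen=n_recent)   # oldest entry is evicted automatically
        self.prompted = []
        self.m_prompted = m_prompted

    def add(self, memory_feature, was_prompted: bool) -> None:
        if was_prompted and len(self.prompted) < self.m_prompted:
            self.prompted.append(memory_feature)
        else:
            self.recent.append(memory_feature)

    def all_memories(self):
        # Prompted memories come first so that user guidance is never evicted.
        return list(self.prompted) + list(self.recent)
```
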
๐Ÿ‹๏ธโ€โ™‚๏ธ ๋ชจ๋ธ์˜ ํ•™์Šต (Training)

SAM2 is trained jointly on image and video data!!
The training procedure is designed to simulate interactive user scenarios:
a video sequence of 8 frames is sampled, and prompts are given on up to 2 of those frames.

The initial prompt is chosen probabilistically:
a ground-truth mask with 50% probability, a click inside the mask with 25% probability, and a bounding box with 25% probability.
In addition, corrective clicks are generated by comparing the model's prediction against the ground-truth mask and fed back into training.

In this way, SAM2 learns to predict the ground-truth masklet (spatio-temporal mask)
sequentially and progressively from the prompts.


🧰 3. Data Engine-based Training

  • SAM2 generates its training data in phases through a human + model collaboration (data engine)!!

Phase        | Description
Phase 1      | Humans create masks manually on every frame using SAM1 → ground-truth data (slow, highly accurate) → used to train the initial SAM2
Phase 2      | Early SAM2 + human corrections → SAM2 retrained; faster mask propagation & refinement
Phase 3      | Prompt-based segmentation with the full SAM2 (including memory) → humans refine with clicks only
+ Validation | Separate validators assess mask quality → poor masks are corrected again; those below the bar are discarded
+ Auto       | Among the masks SAM2 generates automatically, only the 'satisfactory' ones are filtered in → added to the dataset

🚀 SAM2 Performance

  • As always, it shows off strong performance on a variety of benchmarks; with results this good, no wonder it was accepted to ICLR!? Haha, we'll skip the details here!!

✅ SAM vs. SAM2 Comparison!!

Feature          | SAM (2023)                           | SAM2 (2025)
Inference Speed  | Slow (large ViT backbone)            | ✅ Up to 30× faster, real-time capable
Architecture     | Heavy ViT-H model (~632M parameters) | ✅ Lightweight design with sparse attention
Accuracy         | Strong but weak on small objects     | ✅ Improved mask accuracy for small objects
Prompt Types     | Point, box, mask                     | ✅ Potential to extend to text & multimodal prompts
Input Modalities | Static images only                   | ✅ Supports video & multi-scale inputs
Deployment       | Cloud/research-focused               | ✅ Can also run on mobile & edge devices