
🧠 SAM2 Hands-On Practice!! with Python

🧬 SAM2 Practice!!

Today, let's revisit SAM2, the new SAM model that we studied theoretically in a previous post
and also experimented with before!

This time, instead of using ultralytics, we'll fetch the model directly from Hugging Face!!

However… compared to EfficientSAM, the results of this SAM2 aren't really satisfying to me…


🔧 1. Installation & Setup

Clone directly from GitHub and install.
I created a virtual environment called sam2 beforehand!

conda create --name sam2 python=3.12
conda activate sam2
git clone https://github.com/facebookresearch/sam2.git && cd sam2

pip install -e .
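
Once the editable install finishes, a quick import check is enough to confirm the environment is wired up (a minimal sketch, nothing SAM2-specific beyond the import):

import torch
import sam2  # should import cleanly after `pip install -e .`

print("sam2 package loaded from:", sam2.__file__)
print("CUDA available:", torch.cuda.is_available())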

Additionally, although the SAM2 GitHub README suggests downloading the checkpoints via ./download_ckpts.sh,
the code can fetch the weights itself through hf_hub_download, so we can skip that step!!
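
In other words, loading the predictor by its Hugging Face model id is all it takes; the checkpoint is downloaded and cached automatically on first use (a minimal sketch, using the same model id as the code below):

from sam2.sam2_image_predictor import SAM2ImagePredictor

# The first call fetches the checkpoint from the Hugging Face Hub (via hf_hub_download)
# and caches it locally; subsequent calls reuse the cached weights.
predictor = SAM2ImagePredictor.from_pretrained("facebook/sam2-hiera-large")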


๐Ÿ–ผ๏ธ 2. Image Segmentation

Just like in the EfficientSAM practice,
I'll use a dog photo with only two prompt points!!

import torch
import numpy as np
from PIL import Image, ImageDraw
import os
from sam2.sam2_image_predictor import SAM2ImagePredictor

# 1. Device setup
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

# 2. Load SAM2 model
predictor = SAM2ImagePredictor.from_pretrained("facebook/sam2-hiera-large")

# 3. Load image
image_path = "./EfficientSAM_gdino/figs/examples/dogs.jpg"
output_image_path = "output_masked_image.png"

image_pil = Image.open(image_path).convert("RGB")
image_np = np.array(image_pil)

# -----------------------------------------------------
# 4. Prepare prompt 
# -----------------------------------------------------
input_points = torch.tensor([[[580, 350], [650, 350]]], device=device)
input_labels = torch.tensor([[1, 1]], device=device)

# 5. Run prediction
with torch.inference_mode(), torch.autocast(device_type=device.split(":")[0], dtype=torch.bfloat16):
    predictor.set_image(image_np)
    masks, scores, _ = predictor.predict(
        point_coords=input_points,
        point_labels=input_labels,
    )
mask = masks[0]
print(f"mask shape :{mask.shape}")

# Create black canvas
segmented_image_np = np.zeros_like(image_np)

# Convert mask
binary_mask = (mask > 0.5)

# Apply mask
segmented_image_np[binary_mask] = image_np[binary_mask]

# Convert back to image
result_image = Image.fromarray(segmented_image_np)

# Draw prompt points
draw_result = ImageDraw.Draw(result_image)
points_np = input_points[0].cpu().numpy()
labels_np = input_labels[0].cpu().numpy()

for i, (x, y) in enumerate(points_np):
    label = labels_np[i]
    fill_color = "green" if label == 1 else "red"
    outline_color = "white"
    radius = 5

    if label == 1:
        draw_result.ellipse((x - radius, y - radius, x + radius, y + radius),
                            fill=fill_color, outline=outline_color, width=1)
    else:
        draw_result.line((x - radius, y - radius, x + radius, y + radius),
                         fill=fill_color, width=2)
        draw_result.line((x + radius, y - radius, x - radius, y + radius),
                         fill=fill_color, width=2)

result_image.save(output_image_path)
print(f"Result saved to '{output_image_path}'")

Result image is!!!??

Image

Totally disappointing…
So, I tried with 4 prompt points and switched the model to sam2.1_hiera_large.pt!!

Image

Still disappointing…
GPT explained it's due to prompt interpretation differences.
But honestly, I still prefer EfficientSAM!!

EfficientSAM didn't actually "perform better"; rather, it interpreted your imperfect prompts more "forgivingly."

SAM2, on the other hand, is much more powerful and precise, so to fully leverage it, you must provide more accurate prompts.
As suggested before, if you place several points on the dogs' bodies and heads, you'll see that SAM2 produces masks that are far more refined and higher quality than EfficientSAM's.
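
For reference, a denser prompt along those lines, combined with keeping the highest-scoring mask instead of always taking the first one, might look like this (a hedged sketch: the coordinates are illustrative, not the exact points I used, and I didn't verify this particular variant):

# Hypothetical denser prompt: several foreground points on the dogs' bodies and heads.
# Coordinates are illustrative; adjust them to the actual image.
input_points = torch.tensor([[[580, 350], [650, 350], [600, 300], [630, 420]]], device=device)
input_labels = torch.tensor([[1, 1, 1, 1]], device=device)  # all four points are foreground

with torch.inference_mode(), torch.autocast(device_type=device.split(":")[0], dtype=torch.bfloat16):
    predictor.set_image(image_np)
    masks, scores, _ = predictor.predict(
        point_coords=input_points,
        point_labels=input_labels,
    )

# predict() also returns a score per mask candidate, so instead of blindly using masks[0]
# we can flatten the outputs and keep the best-scoring one.
masks_flat = np.asarray(masks).reshape(-1, masks.shape[-2], masks.shape[-1])
scores_flat = np.asarray(scores).reshape(-1)
mask = masks_flat[int(np.argmax(scores_flat))]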

🧪 3. Video Segmentation

Image

Starting with the result!!! This one looks good~~

Code is below:

import torch
import numpy as np
import cv2
import os
from sam2.sam2_video_predictor import SAM2VideoPredictor
from moviepy.editor import VideoFileClip

# 1. Device setup
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

# Paths
output_video_path = "output_segmented_video_50_55s.mp4"
clipped_video_path = "temp_clip.mp4"
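
# 2. (Assumed step, not shown in this post) The 50-55 s clip was cut out of a longer
#    source video beforehand; with moviepy that could look like the commented sketch
#    below. The source path here is hypothetical.
# source_video_path = "input_video.mp4"  # hypothetical source file
# VideoFileClip(source_video_path).subclip(50, 55).write_videofile(clipped_video_path)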

# 3. Load video
cap = cv2.VideoCapture(clipped_video_path)
width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
fps = cap.get(cv2.CAP_PROP_FPS)
total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))

video_frames = []
while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break
    video_frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
cap.release()
print(f"Loaded clip: {width}x{height}, {total_frames} frames, {fps:.2f} FPS")

# 4. Load SAM2 model
print("Loading SAM2 video predictor...")
predictor = SAM2VideoPredictor.from_pretrained("facebook/sam2-hiera-large")

# -----------------------------------------------------
# 5. Prompt setup
# -----------------------------------------------------
prompt_frame_idx = 0  # frame index to add prompt
prompt_obj_id = 1     # unique object ID

# Coordinates and labels
points = np.array([[width // 2, height // 2]], dtype=np.float32)
labels = np.array([1], dtype=np.int32)

# -----------------------------------------------------
# 6. Initialize and predict
# -----------------------------------------------------
with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    print("Initializing predictor state...")
    state = predictor.init_state(clipped_video_path) 
    
    print("Adding prompt on first frame...")
    _, _, masks = predictor.add_new_points_or_box(
        inference_state=state,
        frame_idx=prompt_frame_idx,
        obj_id=prompt_obj_id,
        points=points,
        labels=labels,
    )

    # 7. Propagate in video
    fourcc = cv2.VideoWriter_fourcc(*'mp4v')
    out_writer = cv2.VideoWriter(output_video_path, fourcc, fps, (width, height))
    
    print("Propagating masks across video...")
    for frame_idx, object_ids, masks in predictor.propagate_in_video(state):
        original_frame = video_frames[frame_idx]
        segmented_image_np = np.full_like(original_frame, 255)

        if prompt_obj_id in object_ids:
            mask_logits = masks[0][0].cpu().numpy()
            binary_mask_before_resize = (mask_logits > 0.0).astype(np.uint8)
            resized_mask = cv2.resize(binary_mask_before_resize, (width, height), interpolation=cv2.INTER_NEAREST)
            boolean_mask = (resized_mask == 1)
            segmented_image_np[boolean_mask] = original_frame[boolean_mask]

        # Draw red point
        for x, y in points:
            cv2.circle(segmented_image_np, (int(x), int(y)), radius=5, color=(255, 0, 0), thickness=-1)

        output_frame = cv2.cvtColor(segmented_image_np, cv2.COLOR_RGB2BGR)
        out_writer.write(output_frame)
        
        print(f"\r- Processing: frame {frame_idx + 1}/{total_frames}", end="")

    out_writer.release()
    print(f"\nVideo segmentation complete! Saved to '{output_video_path}'")

SAM2's video tracking is really impressive!!!



This post is licensed under CC BY 4.0 by the author.