
πŸ”Ž VL-SAM Hands-on: Let's Try Out VL-SAM!


Overview

이 ν¬μŠ€νŠΈμ—μ„œλŠ” VL-SAM의 핡심인 β€œκ°μ²΄ 이름 β†’ μœ„μΉ˜ 힌트(Attention Map) 생성 β†’ SAM 포인트 ν”„λ‘¬ν”„νŠΈβ€ 쀑, Attention Map 생성(VLM μΈ‘) μ‹€μŠ΅μ„ λ‹€λ£Ήλ‹ˆλ‹€.
(반볡 iteration 없이, 단일 패슀 μ„±κ²©μ˜ 데λͺ¨)


1) [Object Recognition] – Attention Map Generation (VLM)

Key idea: build the object prompt that goes into SAM from the VLM's attention flow!

  • a. 이미지 기반으둜 VLMμ—κ²Œ β€œμ΄λ―Έμ§€ 속 λͺ¨λ“  객체λ₯Ό λ‚˜μ—΄ν•΄β€λΌκ³  μš”μ²­ β†’ μ‘λ‹΅μ—μ„œ Tag2Text μœ μ‚¬ λ°©μ‹μœΌλ‘œ 객체 리슀트λ₯Ό 확보
  • b. 토큰 생성 κ³Όμ •μ—μ„œ λͺ¨λ“  λ ˆμ΄μ–΄/ν—€λ“œμ˜ Q, Kλ₯Ό μ €μž₯
  • c. μΆ”μΆœν•œ 객체 토큰에 λŒ€ν•΄ Q Γ— Kα΅€ β†’ causal mask 적용 β†’ Softmax ν‘œμ€€ν™” β†’ similarity matrix S
  • d. λ ˆμ΄μ–΄/ν—€λ“œ 별 κ°€μ€‘μΉ˜ W μ‚°μΆœ
    • μ˜ˆμ‹œ: W = Mean(Max(S, dim=1), dim=0)
  • e. κ°€μ€‘μΉ˜ W둜 λ³΄μ •λœ Sβ€² 계산
  • f. λ ˆμ΄μ–΄λ³„ Sβ€²λ₯Ό μ’…ν•©ν•˜μ—¬ attention flow μ‚°μΆœ
  • g. Auto-Regressive VLM의 νŠΉμ„±(μ’Œμƒλ‹¨μœΌλ‘œ collapse κ²½ν–₯)을 쀄이기 μœ„ν•΄ Regularized attention flow column μ‚¬μš©
  • Finally, μ΅œμ’… Attention Map μ™„μ„±!

2) Hands-on Code (Attention Map Generation)

Assumed environment

  • Qwen/Qwen2.5-VL-3B-Instruct
  • 4-bit μ–‘μžν™”(bitsandbytes) μ‚¬μš© κ°€λŠ₯ μ‹œ λ©”λͺ¨λ¦¬ μ ˆμ•½
  • μ•„λž˜ μ½”λ“œλŠ” μ–΄ν…μ…˜ λ§΅ μƒμ„±κΉŒμ§€ (SAM 연동은 별도 λ‹¨κ³„μ—μ„œ μ§„ν–‰ κ°€λŠ₯)

Notes

  • λΈ”λ‘œκ·Έμ— 맞좰 μ½”λ“œ νŽœμŠ€λŠ” $$$ 둜 κ°μŒŒμŠ΅λ‹ˆλ‹€. 볡뢙 ν›„ λ°”λ‘œ μ‹€ν–‰ν•˜μ„Έμš”.
  • 경둜(IMAGE_PATH)λŠ” 둜컬 ν™˜κ²½μ— 맞게 μˆ˜μ •ν•˜μ„Έμš”.

$$$
import os
import re

import cv2
import torch
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image

from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers import BitsAndBytesConfig

# ------------------------------
# Config
# ------------------------------
QWEN_MODEL = "Qwen/Qwen2.5-VL-3B-Instruct"                    # adjust to your VRAM (3B recommended)
IMAGE_PATH = "/home/bongo/porter_notebook/research/dog.jpg"   # demo image (a single dog)
TEXT_QUERY = "dog"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# 4-bit quantization (saves memory when available) -- delete this line if bitsandbytes is not installed
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)

print(f"[Info] Device: {DEVICE}")
image_pil = Image.open(IMAGE_PATH).convert("RGB")

print(f"[Info] Loading {QWEN_MODEL}...")
processor = AutoProcessor.from_pretrained(QWEN_MODEL, trust_remote_code=True)
model = AutoModelForVision2Seq.from_pretrained(
    QWEN_MODEL,
    device_map="auto",
    torch_dtype="auto",
    quantization_config=bnb,   # remove if bitsandbytes is not installed
    trust_remote_code=True,
)
print("[Info] Model loaded.")

# ------------------------------
# Step (a): generate an object list from the image (Tag2Text-style)
# ------------------------------
print("\n[Step a] Generating object list from the image...")

messages_for_obj_list = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Please analyze the image and list all the objects present."},
        ],
    }
]

text_template = processor.apply_chat_template(
    messages_for_obj_list, tokenize=False, add_generation_prompt=True
)
inputs_for_obj_list = processor(
    text=[text_template], images=[image_pil], return_tensors="pt"
).to(DEVICE)

with torch.no_grad():
    output = model.generate(
        **inputs_for_obj_list,
        max_new_tokens=256,   # longer outputs cost more memory
        do_sample=False,      # greedy decoding
    )

generated_text = processor.batch_decode(output, skip_special_tokens=True)[0]

# Simple parser: take everything after "assistant\n" and split on commas/newlines
obj_list_raw = generated_text.split("assistant\n")[-1].strip()
object_list = [obj.strip().lower() for obj in re.split(r"[,\n]", obj_list_raw) if obj.strip()]

print(f"  - Detected objects (raw): {object_list}")

# For the demo, use only a single target object
if len(object_list) == 0:
    object_list = [TEXT_QUERY]
target_object = object_list[0]
print(f"  - Target object: {target_object}")

# ------------------------------
# Final: collect decoder attentions via a manual generation loop -> approximate the map
# (Note) Vision-patch attentions carry clearer spatial meaning, but under
# build/memory constraints we approximate with decoder attentions.
# ------------------------------
print(f"\n[Final Step] Generating attention map for '{target_object}' via a MANUAL generation loop...")

prompt_for_attention = "Please analyze the image and list all the objects present."
messages_for_attention = [
    {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": prompt_for_attention}]}
]
text_template = processor.apply_chat_template(
    messages_for_attention, tokenize=False, add_generation_prompt=True
)
inputs = processor(text=[text_template], images=[image_pil], return_tensors="pt").to(DEVICE)
input_ids = inputs.input_ids

# Target token ID (simply encode " <object>" with a leading space)
token_ids = processor.tokenizer.encode(f" {target_object}", add_special_tokens=False)
target_token_id = token_ids[0] if len(token_ids) > 0 else None

found_target_attention = None
past_key_values = None
cache_position = torch.arange(input_ids.shape[1], device=input_ids.device)
step_inputs = dict(inputs)   # first step: full inputs incl. pixel_values / image_grid_thw

# Limit generation length for memory safety (e.g., at most 30 tokens)
for step in range(30):
    with torch.no_grad():
        outputs = model(
            **step_inputs,
            past_key_values=past_key_values,
            cache_position=cache_position,
            use_cache=True,
            output_attentions=True,   # request decoder attentions
            return_dict=True,
        )

    next_token_logits = outputs.logits[:, -1, :]
    next_token = torch.argmax(next_token_logits, dim=-1).unsqueeze(-1)

    # Save the attentions of the step in which the target token is generated
    if target_token_id is not None and next_token.item() == target_token_id and found_target_attention is None:
        print(f"  - Found '{target_object}' token at generation step {step}.")
        # outputs.attentions: tuple of num_layers x (B, num_heads, tgt_len, src_len)
        found_target_attention = outputs.attentions

    # Prepare the next step: append the new token, reuse the KV cache,
    # and from now on feed only the newly generated token
    input_ids = torch.cat([input_ids, next_token], dim=-1)
    past_key_values = outputs.past_key_values
    cache_position = cache_position[-1:] + 1
    step_inputs = {"input_ids": next_token}

    # Stop at EOS
    if next_token.item() == processor.tokenizer.eos_token_id:
        print("  - Reached EOS.")
        break

print("  - Manual generation complete.")

# ------------------------------
# Aggregate the attention map (approximation) and visualize
# ------------------------------
if found_target_attention is not None:
    print(f"  - Aggregating attention from {len(found_target_attention)} layers...")
    # Rough estimate of the vision patch grid (good enough for this demo)
    image_size = 448
    patch_size = getattr(model.config.vision_config, "patch_size", 14)
    grid_size = max(1, image_size // patch_size)
    num_patches = grid_size * grid_size

    all_layer_attentions = []
    for layer_attention in found_target_attention:
        # layer_attention: (B, num_heads, tgt_len, src_len)
        # Take the src distribution seen by the last query (the token that predicted the target)
        token_row = layer_attention[0, :, -1, :]             # (num_heads, src_len)
        # Assume the leading positions correspond to image patch tokens; keep num_patches of them (approximation)
        image_patch_attention = token_row[:, :num_patches]   # (num_heads, num_patches)
        layer_avg = image_patch_attention.mean(dim=0)        # (num_patches,)
        all_layer_attentions.append(layer_avg)

    final_avg = torch.stack(all_layer_attentions, dim=0).mean(dim=0)       # (num_patches,)
    attention_map = final_avg.float().reshape(grid_size, grid_size).cpu().numpy()

    # Upsample to the original resolution
    resized_map = cv2.resize(attention_map, (image_pil.width, image_pil.height), interpolation=cv2.INTER_CUBIC)

    # Visualize
    plt.figure(figsize=(10, 5))
    plt.subplot(1, 2, 1); plt.title("Original Image"); plt.axis("off"); plt.imshow(image_pil)
    plt.subplot(1, 2, 2); plt.title(f"Aggregated Attention for '{target_object}'"); plt.axis("off")
    plt.imshow(image_pil); plt.imshow(resized_map, cmap="jet", alpha=0.5)
    plt.tight_layout()
    plt.savefig("attention_map_result.png", dpi=200)
    print("[OK] Saved attention_map_result.png")
else:
    print(f"  - Could not find or generate the token for '{target_object}' within the generation limit.")
$$$

3) Run Tips & Troubleshooting

  • λ©”λͺ¨λ¦¬(OOM) νšŒν”Ό
    • κ°€λŠ₯ν•˜λ©΄ 3B 체크포인트 μ‚¬μš©
    • μž…λ ₯ 이미지λ₯Ό λ¨Όμ € κΈ΄ λ³€ 384~448둜 μΆ•μ†Œ ν›„ νˆ¬μž…
    • max_new_tokensλ₯Ό 48~128 λ²”μœ„λ‘œ μ œν•œ
  • bitsandbytes(4bit) λ―Έμ„€μΉ˜/ν˜Έν™˜ 문제
    • quantization_config=bnb 쀄을 μ‚­μ œν•˜κ³  μ‹€ν–‰ (λŒ€μ‹  VRAM μ—¬μœ  ν•„μš”)
    • λ˜λŠ” pip install -U bitsandbytes (NVIDIA CUDA ν™˜κ²½μ—μ„œλ§Œ)
  • μ–΄ν…μ…˜μ΄ None으둜 μ˜€λŠ” 경우
    • λΉŒλ“œμ— 따라 generate()μ—μ„œλŠ” μ–΄ν…μ…˜μ„ λ°˜ν™˜ν•˜μ§€ μ•ŠμŒ
    • μœ„ μ½”λ“œλŠ” μˆ˜λ™ 생성 루프 + output_attentions=True둜 디코더 μ–΄ν…μ…˜μ„ μˆ˜μ§‘ (곡간적 근사)
    • 더 μ •ν™•ν•œ 곡간 히트맡이 ν•„μš”ν•˜λ©΄ λΉ„μ „ νƒ€μ›Œ μ–΄ν…μ…˜ ν›„ν‚Ή λ˜λŠ” νžˆλ“  μœ μ‚¬λ„(cosine) λ°©μ‹μœΌλ‘œ λŒ€μ²΄ κ°€λŠ₯

4) λ‹€μŒ 단계 (SAM 연동)

  • μœ„μ—μ„œ 얻은 Attention Mapμ—μ„œ Positive/Negative 포인트λ₯Ό μƒ˜ν”Œλ§
  • segment-anything의 SamPredictor.predict(point_coords, point_labels)에 전달
  • κ²°κ³Ό segmentation mask μ‹œκ°ν™”(overlay)

The full SAM integration example is covered in a separate post/section (a checkpoint such as sam_vit_b.pth is required); a minimal preview of the point-sampling step is sketched below.
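
The preview below assumes resized_map, image_pil, and DEVICE from the script above and a local sam_vit_b.pth checkpoint path (adjust to your environment). Sampling the single attention maximum/minimum is the simplest possible strategy, used here only for illustration.

$$$
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth").to(DEVICE)
predictor = SamPredictor(sam)
predictor.set_image(np.array(image_pil))   # HxWx3 uint8 RGB

# Positive point at the attention maximum, negative point at the minimum
pos_y, pos_x = np.unravel_index(np.argmax(resized_map), resized_map.shape)
neg_y, neg_x = np.unravel_index(np.argmin(resized_map), resized_map.shape)
point_coords = np.array([[pos_x, pos_y], [neg_x, neg_y]], dtype=np.float32)  # (x, y) order
point_labels = np.array([1, 0])                                              # 1 = positive, 0 = negative

masks, scores, _ = predictor.predict(
    point_coords=point_coords,
    point_labels=point_labels,
    multimask_output=True,
)
best_mask = masks[np.argmax(scores)]   # highest-scoring mask for the overlay
$$$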


Wrapping Up

이 κΈ€μ—μ„œλŠ” VLM 기반 Attention Map을 λ§Œλ“€μ–΄ VL-SAM의 μ „λ°˜μ μΈ 흐름을 κ²€μ¦ν–ˆμŠ΅λ‹ˆλ‹€.
ν™˜κ²½/λΉŒλ“œ 차이둜 μ–΄ν…μ…˜ 제곡 방식이 λ‹€λ₯Ό 수 μžˆμœΌλ‹ˆ, μœ„ μˆ˜λ™ 루프/근사 μ „λž΅μ„ 기반으둜 μ‹œμž‘ν•΄ λ³΄μ„Έμš”.
ν•„μš”ν•˜μ‹œλ©΄ λΉ„μ „ μ–΄ν…μ…˜ 후크 버전 λ˜λŠ” νžˆλ“  μœ μ‚¬λ„ 기반 κ²½λŸ‰ 버전도 첨뢀해 λ“œλ¦΄κ²Œμš”.

This post is licensed under CC BY 4.0 by the author.