
๐Ÿ“ Understanding LISA - LISA ์•Œ์•„๋ณด๊ธฐ?!!

🧠 LISA: A New Frontier in Reasoning-Based Segmentation

๐Ÿ” An innovative model that understands complex linguistic instructions and segments the corresponding regions in an image!

Image

Paper: LISA: Reasoning Segmentation via Large Language Model
Conference: CVPR 2024 (by CUHK, MSRA, SmartMore)
Code: dvlab-research/LISA
Comment: A groundbreaking approach combining the language understanding ability of LLMs with visual segmentation!


โ— Limitations of Existing Visual Recognition Systems

Many high-performance segmentation models exist, but they lack the ability to understand implicit user intent and perform reasoning!

  • Explicit instructions required: Users must directly specify the target object.
  • Dependent on predefined categories: Difficult to handle new objects or scenarios flexibly.
  • Lacks complex reasoning: Cannot understand or process instructions like "foods rich in Vitamin C."

โžก๏ธ To overcome these limitations,
a new task called โ€œreasoning segmentationโ€ was introduced, based on complex and implicit language instructions!

Example from the paper: when someone says "Change the TV channel," a person understands, but a robot does not.
Instead, the robot needs explicit commands like "go to the table, find the remote, and press the channel button." LISA introduces reasoning to address exactly this gap.


✅ Key Features of LISA!

๐Ÿ” 1. Reasoning Segmentation

  • Understands complex language instructions:
    Able to process commands like "Segment the US president in this image and explain why."
  • Utilizes world knowledge:
    Uses real-world knowledge to find regions such as "foods rich in Vitamin C."
  • Provides explanations:
    Can generate explanations for the segmentation output.

🧠 2. Unified Processing! LISA Model Architecture

  • <SEG> Token Introduction:
    Introduces a new <SEG> token and uses the embedding-as-mask paradigm (see the sketch after this list).
  • Multimodal LLM Integration:
    Combines the LLM's language understanding with visual information.
  • End-to-End Training:
    Directly maps language instruction + image to segmentation mask.
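
As a rough illustration of the <SEG> token idea, here is a minimal sketch of how such a token could be registered with a HuggingFace-style tokenizer and LLM; the model path is a placeholder, and the official dvlab-research/LISA code may organize this differently.

# Minimal sketch (assumptions: a HuggingFace-style tokenizer/model; the path is a placeholder)
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("path/to/multimodal-llm-base")
model = AutoModelForCausalLM.from_pretrained("path/to/multimodal-llm-base")

# Register <SEG> as an extra token so the LLM can emit it in its answers
tokenizer.add_tokens(["<SEG>"])
model.resize_token_embeddings(len(tokenizer))   # grow the embedding table to cover the new token

seg_token_id = tokenizer.convert_tokens_to_ids("<SEG>")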

📊 3. Creation of the ReasonSeg Benchmark!

To evaluate LISA's performance, a new benchmark called ReasonSeg was created!

  • 📦 Total samples: 1218
  • 🧪 Data split:
    • Train: 239
    • Validation: 200
    • Test: 779
  • 🖼 Image sources: OpenImages, ScanNetv2
  • 📝 Instruction types: short phrases + complex sentences

ReasonSeg is designed to evaluate the model's reasoning-based segmentation capabilities.


๐Ÿ‹๏ธโ€โ™‚๏ธ Training Methodology

LISA is trained in an end-to-end manner using the following three main data sources:

1. Semantic Segmentation Datasets

Datasets: ADE20K, COCO-Stuff, LVIS-PACO
Learns "what it is" (e.g., chair)

  • Input: image + class name
  • Output: binary mask → learns pixel-level semantic understanding
  • QA Format Example:

USER: <IMAGE> Can you segment the chair in this image?
ASSISTANT: It is <SEG>.

2. Referring Segmentation Datasets

Datasets: refCOCO, refCOCO+, refCOCOg, refCLEF
These ref* datasets are representative benchmarks for this kind of referring-based understanding, which is what makes the later reasoning possible!
Explicit referring expressions are converted into QA format (see the sketch after this list): "the red chair on the right" → "Can you segment the red chair on the right in this image?"
Learns not only "what" but also "which one specifically" (e.g., wooden chair)

  • Input: image + explicit object description
  • Output: binary mask for the target object → learns to localize and segment based on natural language
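
The conversion above is essentially a templating step. Below is a small illustrative sketch (the helper name and template strings are hypothetical, not taken from the paper or repo) of how a referring expression might be wrapped into the QA training format:

import random

# Hypothetical question templates; the actual pipeline samples from a pool of phrasings like these
QUESTION_TEMPLATES = [
    "Can you segment {expr} in this image?",
    "Please segment {expr} in this image.",
]
ANSWER = "It is <SEG>."

def referring_to_qa(expr: str) -> tuple[str, str]:
    """Turn an explicit referring expression into a (question, answer) training pair."""
    question = "<IMAGE> " + random.choice(QUESTION_TEMPLATES).format(expr=expr)
    return question, ANSWER

print(referring_to_qa("the red chair on the right"))
# e.g. ('<IMAGE> Can you segment the red chair on the right in this image?', 'It is <SEG>.')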

3. Visual Question Answering (VQA)

🔎 Important: even though no reasoning-segmentation examples were included in the training data,
LISA performed impressively on ReasonSeg in a zero-shot setting!

  • Input: image + natural language question
  • Output: natural language answer → learns to integrate visual and language understanding
  • Instruction-tuning data used:

    • LLaVA-Instruct-150k (v1)
    • LLaVA-v1.5-mix665k (v1.5)

๐Ÿ LISA Architecture: Embedding-as-Mask Paradigm

Image

Prior polygon-sequence methods are expensive and less generalizable.
LISA introduces a new structure called Embedding-as-Mask.

๐Ÿ“ Key Components

  1. Add <SEG> token to specify segmentation request
  2. Extract <SEG> embedding from the last LLM layer
  3. Pass through MLP to generate mask embedding
  4. Combine with vision encoder features and pass to decoder
  5. Output final binary mask

To better understand how the mask is produced from the <SEG> token, we can follow the pseudocode below:

# Image and text input
x_img = load_image_tensor(...)             # [3, H, W]
x_txt = "Can you segment the red chair in this image? It is <SEG>."

# 1. Tokenize text and find <SEG> token index
input_ids = tokenizer(x_txt, return_tensors='pt')
seg_token_index = input_ids.input_ids[0].tolist().index(tokenizer.convert_tokens_to_ids("<SEG>"))

# 2. Vision encoder extracts image features
f_img = vision_encoder(x_img)             # [B, C, H', W']

# 3. Multimodal LLM encoding (image tokens + text tokens → LLM)
output_hidden_states = multimodal_llm(input_ids, image_features=f_img, output_hidden_states=True)

# 4. Extract the <SEG> embedding from the final hidden state
h_tilde_seg = output_hidden_states.last_hidden_state[0, seg_token_index]  # [hidden_dim]

# 5. Project with MLP
h_seg = mlp_projection(h_tilde_seg)       # [proj_dim]

# 6. Decode to segmentation mask
pred_mask = mask_decoder(h_seg, f_img)    # [1, H, W]

# 7. Loss (during training)
loss = bce_loss(pred_mask, gt_mask) + dice_loss(pred_mask, gt_mask)

🎯 Training Objective Function

𝓛 = λ_txt · 𝓛_txt + λ_mask · 𝓛_mask
𝓛_txt: text generation loss (auto-regressive CE)
𝓛_mask: mask loss = BCE + DICE
λ_txt, λ_mask: weights of the two loss terms (hyperparameters)
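
In code, the overall objective is just a weighted sum of the two terms; a minimal sketch, with the λ values shown as illustrative hyperparameters rather than the paper's exact settings:

# Weighted combination of the text and mask losses (λ values are illustrative)
lambda_txt, lambda_mask = 1.0, 1.0

def total_loss(txt_loss, mask_loss):
    return lambda_txt * txt_loss + lambda_mask * mask_loss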

📉 1. Text Generation Loss 𝓛_txt

Evaluates the accuracy of the natural-language portion generated before <SEG>

  • Uses an autoregressive cross-entropy loss, same as standard language modeling (a minimal sketch follows below)
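
A minimal sketch of this loss, assuming logits of shape [B, T, V] from the LLM and labels of shape [B, T] with -100 marking positions that should not contribute (prompt and padding); this is the standard next-token objective, not code from the LISA repo:

import torch
import torch.nn.functional as F

def text_generation_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # Shift so that position t predicts token t+1 (autoregressive objective)
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100,  # ignore prompt/padding positions
    )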

📉 2. Mask Loss 𝓛_mask

Evaluates the accuracy of the segmentation mask generated from the <SEG> token embedding (see the sketch after the list below)

  • Combines two losses:

    • BCE (pixel-wise accuracy)
    • DICE (overall shape similarity)
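
A minimal sketch of this combination, assuming pred_logits are raw (pre-sigmoid) mask logits of shape [B, H, W] and gt_mask is a binary tensor of the same shape; the per-term weights are illustrative, so check the paper/repo for the exact values:

import torch
import torch.nn.functional as F

def dice_loss(pred_logits: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # DICE measures overlap between predicted and ground-truth masks (overall shape similarity)
    prob = pred_logits.sigmoid().flatten(1)
    target = target.flatten(1).float()
    inter = (prob * target).sum(-1)
    union = prob.sum(-1) + target.sum(-1)
    return (1.0 - (2.0 * inter + eps) / (union + eps)).mean()

def mask_loss(pred_logits: torch.Tensor, gt_mask: torch.Tensor,
              w_bce: float = 2.0, w_dice: float = 0.5) -> torch.Tensor:
    # BCE: per-pixel accuracy; DICE: overall shape similarity
    bce = F.binary_cross_entropy_with_logits(pred_logits, gt_mask.float())
    return w_bce * bce + w_dice * dice_loss(pred_logits, gt_mask)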

🚀 Efficiency and Performance

Model     | GPU Resources      | Training Time
VisionLLM | 4 × 8 × A100 80GB  | 50 epochs (unrealistic)
LISA-7B   | 8 × RTX 3090 24GB  | < 3 days

LISA is a practical segmentation model that excels in both efficiency and performance.


✨ Conclusion

LISA empowers multimodal LLMs with reasoning-based image segmentation,
evolving them into models capable of understanding and executing complex natural language instructions.

🔮 Initially, it seemed like LLMs alone, and then the recent multimodal models, could do everything.
But going forward, we can expect models in many different styles, and perhaps something that unifies them all!


This post is licensed under CC BY 4.0 by the author.