
๐Ÿ“ Understanding LISA - LISA ์•Œ์•„๋ณด๊ธฐ?!!

🧠 LISA: A New Frontier in Reasoning-Based Segmentation

๐Ÿ” An innovative model that understands complex linguistic instructions and segments the corresponding regions in an image!

Image

Paper: LISA: Reasoning Segmentation via Large Language Model
Conference: CVPR 2024 (by CUHK, MSRA, SmartMore)
Code: dvlab-research/LISA
Comment: A groundbreaking approach combining the language understanding ability of LLMs with visual segmentation!


โ— Limitations of Existing Visual Recognition Systems

Many high-performance segmentation models exist, but they lack the ability to understand implicit user intent and perform reasoning!

  • Explicit instructions required: Users must directly specify the target object.
  • Dependent on predefined categories: Difficult to handle new objects or scenarios flexibly.
  • Lacks complex reasoning: Cannot understand or process instructions like "foods rich in Vitamin C."

โžก๏ธ To overcome these limitations,
a new task called โ€œreasoning segmentationโ€ was introduced, based on complex and implicit language instructions!

Example from the paper: when someone says "Change the TV channel," a person understands, but a robot does not.
Instead, the robot needs explicit commands like "go to the table, find the remote, and press the channel button." LISA introduces reasoning to address exactly this gap.


✅ Key Features of LISA!

๐Ÿ” 1. Reasoning Segmentation

  • Understands complex language instructions:
    Able to process commands like "Segment the US president in this image and explain why."
  • Utilizes world knowledge:
    Uses real-world knowledge to find regions such as "foods rich in Vitamin C."
  • Provides explanations:
    Can generate explanations for the segmentation output.

🧠 2. Unified Processing! LISA Model Architecture

  • <SEG> Token Introduction:
    Introduces a new <SEG> token and uses the embedding-as-mask paradigm (see the sketch after this list).
  • Multimodal LLM Integration:
    Combines the LLM's language understanding with visual information.
  • End-to-End Training:
    Directly maps language instruction + image to segmentation mask.
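
As a rough illustration of the <SEG> token idea, here is a minimal sketch of how such a token could be registered with a HuggingFace-style tokenizer and LLM; the model path is a placeholder, and the official dvlab-research/LISA code may organize this differently.

# Minimal sketch (assumptions: a HuggingFace-style tokenizer/model; the path is a placeholder)
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("path/to/multimodal-llm-base")
model = AutoModelForCausalLM.from_pretrained("path/to/multimodal-llm-base")

# Register <SEG> as an extra token so the LLM can emit it in its answers
tokenizer.add_tokens(["<SEG>"])
model.resize_token_embeddings(len(tokenizer))   # grow the embedding table to cover the new token

seg_token_id = tokenizer.convert_tokens_to_ids("<SEG>")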

📊 3. Creation of the ReasonSeg Benchmark!

To evaluate LISA's performance, a new benchmark called ReasonSeg was created!

  • 📦 Total samples: 1218
  • 🧪 Data split:
    • Train: 239
    • Validation: 200
    • Test: 779
  • 🖼 Image sources: OpenImages, ScanNetv2
  • 📝 Instruction types: short phrases + complex sentences

ReasonSeg is designed to evaluate the model's reasoning-based segmentation capabilities.


๐Ÿ‹๏ธโ€โ™‚๏ธ Training Methodology

LISA is trained in an end-to-end manner using the following three main data sources:

1. Semantic Segmentation Datasets

Datasets: ADE20K, COCO-Stuff, LVIS-PACO
Learns "what it is" (e.g., chair)

  • Input: image + class name
  • Output: binary mask → learns pixel-level semantic understanding
  • QA Format Example:

USER: <IMAGE> Can you segment the chair in this image?
ASSISTANT: It is <SEG>.

2. Referring Segmentation Datasets

Datasets: refCOCO, refCOCO+, refCOCOg, refCLEF
These ref* datasets are representative benchmarks for this kind of referring-based understanding, which is what makes the later reasoning possible!
Explicit referring expressions are converted into QA format (see the sketch after this list): "the red chair on the right" → "Can you segment the red chair on the right in this image?"
Learns not only "what" but also "which one specifically" (e.g., wooden chair)

  • Input: image + explicit object description
  • Output: binary mask for the target object → learns to localize and segment based on natural language
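
The conversion above is essentially a templating step. Below is a small illustrative sketch (the helper name and template strings are hypothetical, not taken from the paper or repo) of how a referring expression might be wrapped into the QA training format:

import random

# Hypothetical question templates; the actual pipeline samples from a pool of phrasings like these
QUESTION_TEMPLATES = [
    "Can you segment {expr} in this image?",
    "Please segment {expr} in this image.",
]
ANSWER = "It is <SEG>."

def referring_to_qa(expr: str) -> tuple[str, str]:
    """Turn an explicit referring expression into a (question, answer) training pair."""
    question = "<IMAGE> " + random.choice(QUESTION_TEMPLATES).format(expr=expr)
    return question, ANSWER

print(referring_to_qa("the red chair on the right"))
# e.g. ('<IMAGE> Can you segment the red chair on the right in this image?', 'It is <SEG>.')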

3. Visual Question Answering (VQA)

🔎 Important: even though no reasoning-segmentation examples were included in the training data,
LISA performed impressively on ReasonSeg in a zero-shot setting!

  • Input: image + natural language question
  • Output: natural language answer → learns to integrate visual and language understanding
  • Instruction-tuning data used:

    • LLaVA-Instruct-150k (v1)
    • LLaVA-v1.5-mix665k (v1.5)

๐Ÿ LISA Architecture: Embedding-as-Mask Paradigm

Image

Prior polygon-sequence methods are expensive and less generalizable.
LISA introduces a new structure called Embedding-as-Mask.

๐Ÿ“ Key Components

  1. Add <SEG> token to specify segmentation request
  2. Extract <SEG> embedding from the last LLM layer
  3. Pass through MLP to generate mask embedding
  4. Combine with vision encoder features and pass to decoder
  5. Output final binary mask

To better understand how the mask is produced from the <SEG> token, we can follow the pseudocode below:

# Image and text input
x_img = load_image_tensor(...)             # [3, H, W]
x_txt = "Can you segment the red chair in this image? It is <SEG>."

# 1. Tokenize text and find <SEG> token index
input_ids = tokenizer(x_txt, return_tensors='pt')
seg_token_index = input_ids.input_ids[0].tolist().index(tokenizer.convert_tokens_to_ids("<SEG>"))

# 2. Vision encoder extracts image features
f_img = vision_encoder(x_img)             # [B, C, H', W']

# 3. Multimodal LLM encoding (image tokens + text tokens → LLM)
output_hidden_states = multimodal_llm(input_ids, image_features=f_img, output_hidden_states=True)

# 4. Extract the <SEG> embedding from the final hidden state
h_tilde_seg = output_hidden_states.last_hidden_state[0, seg_token_index]  # [hidden_dim]

# 5. Project with MLP
h_seg = mlp_projection(h_tilde_seg)       # [proj_dim]

# 6. Decode to segmentation mask
pred_mask = mask_decoder(h_seg, f_img)    # [1, H, W]

# 7. Loss (during training)
loss = bce_loss(pred_mask, gt_mask) + dice_loss(pred_mask, gt_mask)

🎯 Training Objective Function

𝓛 = λ_txt · 𝓛_txt + λ_mask · 𝓛_mask
𝓛_txt: text generation loss (auto-regressive CE)
𝓛_mask: mask loss = BCE + DICE
λ_txt, λ_mask: weights of the two loss terms (hyperparameters)
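
In code, the overall objective is just a weighted sum of the two terms; a minimal sketch, with the λ values shown as illustrative hyperparameters rather than the paper's exact settings:

# Weighted combination of the text and mask losses (λ values are illustrative)
lambda_txt, lambda_mask = 1.0, 1.0

def total_loss(txt_loss, mask_loss):
    return lambda_txt * txt_loss + lambda_mask * mask_loss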

📉 1. Text Generation Loss 𝓛_txt

Evaluates the accuracy of the natural-language portion generated before <SEG>

  • Uses an autoregressive cross-entropy loss, same as standard language modeling (a minimal sketch follows below)
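
A minimal sketch of this loss, assuming logits of shape [B, T, V] from the LLM and labels of shape [B, T] with -100 marking positions that should not contribute (prompt and padding); this is the standard next-token objective, not code from the LISA repo:

import torch
import torch.nn.functional as F

def text_generation_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # Shift so that position t predicts token t+1 (autoregressive objective)
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100,  # ignore prompt/padding positions
    )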

📉 2. Mask Loss 𝓛_mask

Evaluates the accuracy of the segmentation mask generated from the <SEG> token embedding (see the sketch after the list below)

  • Combines two losses:

    • BCE (pixel-wise accuracy)
    • DICE (overall shape similarity)
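
A minimal sketch of this combination, assuming pred_logits are raw (pre-sigmoid) mask logits of shape [B, H, W] and gt_mask is a binary tensor of the same shape; the per-term weights are illustrative, so check the paper/repo for the exact values:

import torch
import torch.nn.functional as F

def dice_loss(pred_logits: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # DICE measures overlap between predicted and ground-truth masks (overall shape similarity)
    prob = pred_logits.sigmoid().flatten(1)
    target = target.flatten(1).float()
    inter = (prob * target).sum(-1)
    union = prob.sum(-1) + target.sum(-1)
    return (1.0 - (2.0 * inter + eps) / (union + eps)).mean()

def mask_loss(pred_logits: torch.Tensor, gt_mask: torch.Tensor,
              w_bce: float = 2.0, w_dice: float = 0.5) -> torch.Tensor:
    # BCE: per-pixel accuracy; DICE: overall shape similarity
    bce = F.binary_cross_entropy_with_logits(pred_logits, gt_mask.float())
    return w_bce * bce + w_dice * dice_loss(pred_logits, gt_mask)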

🚀 Efficiency and Performance

Model     | GPU Resources      | Training Time
VisionLLM | 4 × 8 × A100 80GB  | 50 epochs (unrealistic)
LISA-7B   | 8 × RTX 3090 24GB  | < 3 days

LISA is a practical segmentation model that excels in both efficiency and performance.


✨ Conclusion

LISA empowers multimodal LLMs with reasoning-based image segmentation,
evolving them into models capable of understanding and executing complex natural language instructions.

🔮 Initially, it seemed like LLMs alone, and then the recent multimodal models, could do everything.
But going forward, we can expect models in many different styles, and perhaps something that unifies them all!


This post is licensed under CC BY 4.0 by the author.