Understanding EZ-HOI!!
Creating Perfect Prompts for Zero-Shot and Unseen Cases!!
Paper: EZ-HOI: VLM Adaptation via Guided Prompt Learning for Zero-Shot HOI Detection
Conference: NeurIPS 2024 (Lei, Wang, et al.)
Code: ChelsieLei/EZ-HOI
Background: Limitations of HOI and VLM Integration Research!?
Human-Object Interaction (HOI) refers to the task of finding pairs of humans and objects in images or videos and classifying the interactions between them.
Problem A: HOI Research with VLM Integration!
Models are too large and have difficulty capturing fine-grained details!!
Recent HOI research has extensively utilized Vision-Language Models (VLMs), with a representative approach being the alignment of feature vectors between HOI detectors and VLMs so that both models can similarly understand concepts like actions.
Through this alignment, the model could handle previously unseen interactions even in zero-shot situations, but the following drawbacks remained:
- High-cost alignment learning process: VLM alignment is typically based on transformer structures, causing significant computational cost and training time issues!
- Difficulty in zero-shot generalization: VLM alignment is optimized only for trained classes (Seen classes), resulting in poor prediction performance for unseen classes!
- Limitations in knowledge transfer: While VLMs understand broad concepts well, they have weaknesses in tasks like HOI that require distinguishing subtle differences in human actions!
Problem B: Lightweight learning by tuning only prompts!!
However, prompt tuning focuses mainly on Seen classes, so performance on Unseen classes suffers!
Recently, prompt-tuning-based approaches that skip the alignment process and directly use the VLM's representational power have gained attention as an alternative, but they still have not shown sufficient results on zero-shot problems!!
Note: What is the prompt-tuning-based approach that directly uses the VLM's representational power!?
It changes "A photo of a cat" into "[P1] [P2] [P3] cat" and trains P1, P2, P3 (see the sketch below)!
The MaPLe prompt tuning mentioned in the paper tunes both image and text prompts together!!
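Below is a minimal sketch (not the paper's actual code) of what such prompt tuning looks like: the hand-written context "A photo of a" is replaced by learnable vectors [P1]...[Pn] prepended to the class-name token embeddings, while the pretrained text encoder stays frozen. The `text_encoder` interface and all names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TextPromptTuning(nn.Module):
    """Minimal sketch of prompt tuning: learn [P1]...[Pn] context vectors that are
    prepended to the class-name token embeddings ("cat"), with the encoder frozen."""
    def __init__(self, text_encoder, n_ctx=3, dim=512):
        super().__init__()
        self.text_encoder = text_encoder            # frozen CLIP-like text encoder (assumed interface)
        for p in self.text_encoder.parameters():
            p.requires_grad = False
        # [P1] [P2] [P3]: the only trainable parameters
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)

    def forward(self, class_token_embs):            # (num_classes, n_cls_tokens, dim)
        ctx = self.ctx.unsqueeze(0).expand(class_token_embs.size(0), -1, -1)
        prompts = torch.cat([ctx, class_token_embs], dim=1)   # "[P1][P2][P3] cat"
        return self.text_encoder(prompts)           # one text embedding per class
```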
- Consequently, while the combination of HOI and VLM is promising, there were limitations in achieving lightweight models & generalization capabilities!
EZ-HOI Emerges!!!
Inference
Learnable prompts that were fine-tuned in advance are combined with the existing foundation models!!
So the foundation models themselves stay frozen, and zero-shot detection is achieved through prompt tuning!!
[Input] Single image
↓
Stage 1: Human-Object Detection
- Extract bounding boxes for humans and all objects
- Generate all possible (human, object) pairs
Stage 2: HOI Recognition
- Each human-object pair โ CLIP's visual encoder + vision learnable prompt โ image embedding (f_vis)
- All HOI classes (object-action pairs) โ CLIP's text encoder + text learnable prompt โ text embedding (f_txt)
- Select the most similar HOI class based on cosine similarity(f_vis, f_txt)
→ Final HOI prediction
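As a minimal sketch of the Stage 2 scoring step, assuming f_vis and f_txt have already been produced by the prompted CLIP encoders (the function and variable names below are illustrative, not from the released code):

```python
import torch
import torch.nn.functional as F

def classify_hoi(f_vis, f_txt):
    """Pick the most similar HOI class for each human-object pair.
    f_vis: (num_pairs, dim) image embeddings from the prompted visual encoder
    f_txt: (num_hoi_classes, dim) text embeddings from the prompted text encoder
    """
    f_vis = F.normalize(f_vis, dim=-1)
    f_txt = F.normalize(f_txt, dim=-1)
    sim = f_vis @ f_txt.t()                  # cosine similarity (num_pairs, num_hoi_classes)
    scores, pred_class = sim.max(dim=-1)     # best HOI class per pair
    return pred_class, scores
```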
Training
- LLM-based HOI Class Description Generation
- Generate rich sentences using an LLM for all object-interaction (HOI class) pairs
"Swinging a baseball bat describes a person..."
- VLM-based Image Prompts (VLM Guidance)
→ Cross-Attention (Image MHCA, Multi-Head Cross-Attention, initialized and then trained)
- Q: vision learnable prompt (initialized and then trained)
- K/V: vectors encoded by CLIP (VLM); for unseen classes, the descriptions generated by the LLM are encoded
→ Train the MHCA and the learnable prompt so that the attention output becomes similar to the CLIP (VLM) encoding
- The vision MHCA makes the output of vision prompt + vision MHCA similar to the unseen description embeddings created by the LLM!!
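A minimal sketch of this guidance step, under the assumption that a multi-head cross-attention uses the learnable prompts as queries, frozen CLIP features (or LLM-description encodings for unseen classes) as keys/values, and is trained with a cosine-similarity alignment loss; module and tensor names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptGuidanceMHCA(nn.Module):
    """Cross-attention that injects frozen VLM (CLIP) knowledge into learnable prompts.
    Q = learnable prompts, K/V = CLIP-encoded features (or LLM-description encodings for unseen classes)."""
    def __init__(self, dim=512, num_heads=8, n_prompts=2):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(n_prompts, dim) * 0.02)   # initialized, then trained
        self.mhca = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, clip_feats):                         # (batch, seq, dim), produced by frozen CLIP
        q = self.prompts.unsqueeze(0).expand(clip_feats.size(0), -1, -1)
        guided, _ = self.mhca(query=q, key=clip_feats, value=clip_feats)
        return guided                                      # (batch, n_prompts, dim)

def guidance_loss(guided_prompts, clip_target):
    """Align the attention output with the CLIP (VLM) encoding via cosine similarity."""
    g = F.normalize(guided_prompts.mean(dim=1), dim=-1)
    t = F.normalize(clip_target, dim=-1)
    return (1.0 - (g * t).sum(dim=-1)).mean()
```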
- Seen Class Training
At this point, learnable prompts and MHCA weights for Seen Classes are determined!!
→ Cross-Attention (Text MHCA, Multi-Head Cross-Attention, initialized and then trained)
- Q: text learnable prompt (initialized and then trained)
- K/V: token embeddings of the LLM descriptions
→ Train so that the attention output becomes similar to the image embeddings (based on cosine similarity)
- The text MHCA makes the output of text prompt + MHCA similar to the image embeddings (mainly for seen classes)!!
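A sketch of the seen-class alignment objective described above, assuming the text embedding of each pair's ground-truth HOI class is simply pulled toward that pair's image embedding with a cosine loss (an illustrative simplification, not the paper's exact formulation):

```python
import torch.nn.functional as F

def seen_alignment_loss(f_vis, f_txt, labels):
    """Pull the text embedding of each pair's ground-truth HOI class toward its image embedding.
    f_vis: (num_pairs, dim), f_txt: (num_hoi_classes, dim), labels: (num_pairs,) class indices."""
    f_vis = F.normalize(f_vis, dim=-1)
    f_txt = F.normalize(f_txt, dim=-1)
    matched_txt = f_txt[labels]                       # text embedding of each pair's seen class
    return (1.0 - (f_vis * matched_txt).sum(dim=-1)).mean()
```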
- Unseen Class Training: 3 stages!! (UTPL: Unseen Text Prompt Learning)
At this point, learnable prompts for Unseen Classes are determined based on the learnable prompts and MHCA weights of Seen Classes!!
Stage 1: Cross-Attention (MHCA) - MHCA weights determined from Seen classes
- Q: learnable prompt (starts with the final learnable prompt of the most similar Seen Class)
- K/V: Token embeddings of Unseen class LLM descriptions
→ Train so that the attention output becomes similar to the prompt output of the most similar seen class (based on cosine similarity)
Stage 2: Class-relation learning - Train learnable prompts to be similar according to the similarity between Seen and Unseen LLM description embeddings!
Stage 3: Negative learning - Train so that Seen class image encodings and Unseen class learnable prompts become distant
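A minimal sketch of the three UTPL-style objectives, under the assumption that the unseen prompt is (1) kept close to its most similar seen prompt, (2) regularized so prompt similarities follow LLM-description similarities, and (3) pushed away from seen-class image embeddings; all tensors and weightings are illustrative.

```python
import torch
import torch.nn.functional as F

def utpl_losses(unseen_prompt, similar_seen_prompt, desc_sim, seen_prompts, seen_img_embs):
    """Illustrative losses for Unseen Text Prompt Learning (UTPL).
    unseen_prompt:        (dim,) guided prompt output for one unseen class
    similar_seen_prompt:  (dim,) prompt output of its most similar seen class
    desc_sim:             (num_seen,) cosine similarity between the LLM descriptions of this
                          unseen class and each seen class
    seen_prompts:         (num_seen, dim) seen-class prompt outputs
    seen_img_embs:        (num_images, dim) image embeddings of seen-class samples
    """
    u = F.normalize(unseen_prompt, dim=-1)

    # Stage 1: stay close to the most similar seen-class prompt
    l_init = 1.0 - (u * F.normalize(similar_seen_prompt, dim=-1)).sum()

    # Stage 2: class-relation - prompt similarity should follow LLM-description similarity
    prompt_sim = u @ F.normalize(seen_prompts, dim=-1).t()
    l_rel = F.mse_loss(prompt_sim, desc_sim)

    # Stage 3: negative - push the unseen prompt away from seen-class image embeddings
    l_neg = (u @ F.normalize(seen_img_embs, dim=-1).t()).clamp(min=0).mean()

    return l_init + l_rel + l_neg
```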
Note! Learnable prompts are not inserted just once at the beginning, but are divided and inserted by layer!!
Deep Visual-Text Prompt Learning
- While previous approaches simply tuned the input prompts (adding fixed tokens at the front of the encoder input),
- This research inserts individual learnable prompts into every Transformer layer of the text and vision encoders (a minimal sketch follows the comparison table below)!
Basic Prompt Tuning vs. Deep Visual-Text Prompt Learning
Item | Basic Prompt Tuning | Deep Visual-Text Prompt Learning |
---|---|---|
Application Location | Add fixed tokens at the front of encoder input | Insert learnable prompts into all Transformer layers of the encoder |
Learning Target | Usually tune only a few learnable prompt vectors | Learn entire sequences of text/visual prompts layer by layer |
Expressiveness | Limited (shallow), only controls upstream information | Can control representations at deep positions (deep) |
Flexibility | Fast tuning with simple structure | Can reflect context/relationship/complex information (e.g. HOI) |
Example: CLIP | Token insertion only in text → control single sentence meaning | Adjust both text & visual, redesigning vision-language alignment itself |
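A minimal sketch of deep prompt insertion, assuming each prompted Transformer layer receives its own learnable prompt tokens (the previous layer's prompt outputs are discarded and replaced) while the frozen layers are untouched; N = 9 prompted layers and prompt length p = 2 follow the paper's setting, everything else (layer interface, names) is illustrative.

```python
import torch
import torch.nn as nn

class DeepPromptedEncoder(nn.Module):
    """Sketch of deep prompt learning: every prompted Transformer layer gets its own
    learnable prompt tokens, instead of prompts being prepended only once at the input."""
    def __init__(self, layers, n_prompted_layers=9, prompt_len=2, dim=512):
        super().__init__()
        self.layers = nn.ModuleList(layers)       # frozen encoder layers (assumed batch-first)
        for p in self.layers.parameters():
            p.requires_grad = False
        self.deep_prompts = nn.ParameterList(
            [nn.Parameter(torch.randn(prompt_len, dim) * 0.02) for _ in range(n_prompted_layers)]
        )
        self.prompt_len = prompt_len

    def forward(self, x):                         # x: (batch, seq, dim) token/patch embeddings
        for i, layer in enumerate(self.layers):
            if i < len(self.deep_prompts):
                if i > 0:                         # discard the previous layer's prompt outputs
                    x = x[:, self.prompt_len:, :]
                p = self.deep_prompts[i].unsqueeze(0).expand(x.size(0), -1, -1)
                x = torch.cat([p, x], dim=1)      # prepend this layer's own prompts
            x = layer(x)
        return x
```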
Why is Deep Visual-Text Prompt Learning Better??
1. Considering Layer-wise Semantic/Functional Differentiation
Each layer of Transformer handles different levels of meaning:
- Early layers: Low-level (local) features
- Middle layers: Relational (contextual) information
- Final layers: Conceptual abstraction (high-level semantics)
⇒ Simply attaching prompts only at the input makes it difficult to convey or manipulate information across all of these layers.
In contrast, Deep Prompt Learning can finely control the hierarchical semantic flow by inserting an appropriate prompt at each layer.
2. Application to Both Visual and Text → Improved Modal Alignment
Existing Prompt Tuning mainly inserts prompts only on the text side. However:
- HOI (Human-Object Interaction)
- VQA (Visual Question Answering)
In tasks where the combination of text and visual is key,
visual representations must also be simultaneously aligned/controlled to improve performance.
Deep Visual-Text Prompt inserts prompts in parallel into both the text and image encoders,
improving alignment quality between the two modalities.
3. Fine-grained Control & Context Adaptation
Since prompts exist independently at each layer, the following becomes possible:
- Detailed adjustments for specific tasks / classes / contexts
- Learning prompts differently for each HOI class to achieve fine-grained expression control
- Advantageous for complex relational expressions like "a person holding a cat" rather than simply "this is a cat"
EZ-HOI Performance Experiments!!
1. Definition of the Zero-Shot HOI Setting
- Similar to existing zero-shot HOI methods, utilize names of unseen HOI classes during training
- Previous studies:
- VCL, FCL, ATL: Compose new samples by combining unseen HOI class names
- EoID: Distill CLIP with predefined HOI prompts (seen + unseen classes)
- HOICLIP: Introduce verb class representation (including seen/unseen)
2. Implementation Details
- Basic Structure:
- DETR + ResNet-50 backbone
- CLIP-based dual encoder structure (prompt insertion in both text/visual)
- Hyperparameters:
  - Batch size: 16
  - Learning rate: 1e-3
  - Optimizer: AdamW
  - GPU: 4 × Nvidia A5000
- Backbone:
  - Visual encoder: DETR (ResNet-50)
  - Text encoder: description-based prompt generation with LLaVA-v1.5-7b
- Prompt Design:
  - Number of layers: N = 9, Prompt length: p = 2
  - Insert learnable text & visual prompts into each Transformer layer
- Additional Techniques:
- Intra-HOI fusion: Feature fusion of human-object pairs
- Inter-HOI fusion: Context injection between multiple HOI pairs within an image (see the sketch after this list)
- LLM-based fine-grained prompts (including text descriptions)
- Visual Adapter (ref: [27])
- UTPL module (Unseen Text Prompt Learning)
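A hedged sketch of what intra-/inter-HOI fusion could look like, assuming intra-HOI fusion concatenates and projects the human and object features of each pair, and inter-HOI fusion lets all pair features in an image exchange context via self-attention; this is a guess at the general idea, not the paper's implementation.

```python
import torch
import torch.nn as nn

class HOIFusion(nn.Module):
    """Illustrative intra-/inter-HOI fusion.
    Intra: fuse the human and object features of one pair into a single pair feature.
    Inter: let all pair features in an image exchange context via self-attention."""
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.intra = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.inter = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, human_feats, object_feats):          # both: (num_pairs, dim)
        pair = self.intra(torch.cat([human_feats, object_feats], dim=-1))   # intra-HOI fusion
        pair = pair.unsqueeze(0)                                            # (1, num_pairs, dim)
        ctx, _ = self.inter(pair, pair, pair)                               # inter-HOI fusion
        return (pair + ctx).squeeze(0)                                      # context-enriched pair features
```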
3. Experimental Results Analysis and Ablation Study
- Unseen-Verb Setting
  - Up to 87.9% reduction in trainable parameters compared to existing methods
  - Slightly lower performance than CLIP4HOI, but maximizes efficiency
  - 2.77 mAP improvement over UniHOI, with parameter count at the 26.9% level
- Unseen-Composition Setting (RF-UC / NF-UC)
  - Superior performance in all settings compared to CLIP4HOI
  - +5.56 mAP in RF-UC and +7.88 mAP in NF-UC compared to UniHOI
- Unseen-Object Setting
  - +1.49 mAP over CLIP4HOI, with parameter count at the 12.08% level
  - +13.36 mAP in unseen classes compared to UniHOI
- Ablation Study Interpretation
Component | Function Description | Performance Change | Interpretation |
---|---|---|---|
Intra-HOI Fusion | Information combination within a single human-object (H-O) pair in an image | seen +7.41 mAP | Significantly improves recognition precision for learned classes (seen) by more accurately capturing human/object relationships within a pair |
Visual Adapter | Module that inserts external information (e.g., position, class) into each layer of the visual encoder | seen ↑ / unseen ↓ | This information helps with seen classes, but may cause overfitting for unseen classes → hindrance to generalization |
LLM Guidance | Uses sophisticated text descriptions generated by LLaVA-based language model in prompts | unseen +1.52 mAP | Increases understanding of previously unseen classes by utilizing semantic-based descriptions rather than simple class names |
UTPL (Unseen Text Prompt Learning) | Structure that separately trains prompts dedicated to unseen classes | unseen +2.42 mAP | Prevents prompts from being biased toward seen classes and directly learns expressiveness for unseen classes, enhancing performance |
Inter-HOI Fusion | Information enhancement by sharing context between multiple human-object pairs | Both seen/unseen improved | Various relationships within an image provide contextual help to each other, increasing overall recognition and classification accuracy |
VLM Guidance | Strategy to induce (align) characteristics of pre-trained vision-language models like CLIP | unseen +1.33 mAP | Enables semantic inference for previously unseen classes by reflecting the VLM's generalization properties in prompts |
Final Thoughts
Research to prepare for unseen cases!
That is, research to insert pre-trained prompts so that they can adapt well to previously unseen situations in zero-shot scenarios!!
1) For seen cases, describe them with an LLM and train the text prompts to match the image embeddings.
2) For unseen cases, start from the prompts of similar seen cases and train further with the LLM's unseen descriptions and similar seen images!
3) Describe the unseen cases with the LLM and train the seen-case-based prompts to match the unseen VLM embeddings!
4) Through this, create perfect prompts for zero-shot unseen cases!