Understanding EZ-HOI!!
Creating Perfect Prompts for Zero-Shot and Unseen Cases!!
Paper: EZ-HOI: VLM Adaptation via Guided Prompt Learning for Zero-Shot HOI Detection
Conference: NeurIPS 2024 (Lei, Wang, et al.)
Code: ChelsieLei/EZ-HOI
Background: Limitations of HOI and VLM Integration Research!?
Human-Object Interaction (HOI) refers to the task of finding pairs of humans and objects in images or videos and classifying the interactions between them.
Problem A: HOI Research with VLM Integration!
Models are too large and have difficulty capturing fine-grained details!!
Recent HOI research has extensively utilized Vision-Language Models (VLMs), with a representative approach being the alignment of feature vectors between HOI detectors and VLMs so that both models can similarly understand concepts like actions.
Through this alignment, the model could handle previously unseen interactions even in zero-shot situations, but the following drawbacks remained:
- High-cost alignment learning process: VLM alignment is typically based on transformer structures, causing significant computational cost and training time issues!
- Difficulty in zero-shot generalization: VLM alignment is optimized only for trained classes (Seen classes), resulting in poor prediction performance for unseen classes!
- Limitations in knowledge transfer: While VLMs understand broad concepts well, they have weaknesses in tasks like HOI that require distinguishing subtle differences in human actions!
Problem B: Lightweight learning by tuning only prompts!!
However, prompt tuning focuses mainly on Seen classes, so performance on Unseen classes suffers!
Recently, prompt-tuning-based approaches that skip the alignment process and directly use the VLM's representational power have gained attention as an alternative, but they still have not shown sufficient results on zero-shot problems!!
Note: What is the prompt-tuning-based approach that directly uses the VLM's representational power!?
It changes "A photo of a cat" into "[P1] [P2] [P3] cat" and trains P1, P2, P3 (see the sketch below)!
The MaPLe prompt tuning mentioned in the paper tunes both image and text prompts together!!
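Below is a minimal sketch (not the paper's actual code) of what such prompt tuning looks like: the hand-written context "A photo of a" is replaced by learnable vectors [P1]...[Pn] prepended to the class-name token embeddings, while the pretrained text encoder stays frozen. The `text_encoder` interface and all names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TextPromptTuning(nn.Module):
    """Minimal sketch of prompt tuning: learn [P1]...[Pn] context vectors that are
    prepended to the class-name token embeddings ("cat"), with the encoder frozen."""
    def __init__(self, text_encoder, n_ctx=3, dim=512):
        super().__init__()
        self.text_encoder = text_encoder            # frozen CLIP-like text encoder (assumed interface)
        for p in self.text_encoder.parameters():
            p.requires_grad = False
        # [P1] [P2] [P3]: the only trainable parameters
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)

    def forward(self, class_token_embs):            # (num_classes, n_cls_tokens, dim)
        ctx = self.ctx.unsqueeze(0).expand(class_token_embs.size(0), -1, -1)
        prompts = torch.cat([ctx, class_token_embs], dim=1)   # "[P1][P2][P3] cat"
        return self.text_encoder(prompts)           # one text embedding per class
```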
- Consequently, while the combination of HOI and VLM is promising, there were limitations in achieving lightweight models & generalization capabilities!
EZ-HOI Emerges!!!
Inference
Learnable prompts that were fine-tuned in advance are combined with the existing foundation models!!
So the foundation models themselves stay frozen, and zero-shot detection is achieved through prompt tuning!!
[Input] Single image
↓
Stage 1: Human-Object Detection
- Extract bounding boxes for humans and all objects
- Generate all possible (human, object) pairs
Stage 2: HOI Recognition
- Each human-object pair โ CLIP's visual encoder + vision learnable prompt โ image embedding (f_vis)
- All HOI classes (object-action pairs) โ CLIP's text encoder + text learnable prompt โ text embedding (f_txt)
- Select the most similar HOI class based on cosine similarity(f_vis, f_txt)
→ Final HOI prediction
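As a minimal sketch of the Stage 2 scoring step, assuming f_vis and f_txt have already been produced by the prompted CLIP encoders (the function and variable names below are illustrative, not from the released code):

```python
import torch
import torch.nn.functional as F

def classify_hoi(f_vis, f_txt):
    """Pick the most similar HOI class for each human-object pair.
    f_vis: (num_pairs, dim) image embeddings from the prompted visual encoder
    f_txt: (num_hoi_classes, dim) text embeddings from the prompted text encoder
    """
    f_vis = F.normalize(f_vis, dim=-1)
    f_txt = F.normalize(f_txt, dim=-1)
    sim = f_vis @ f_txt.t()                  # cosine similarity (num_pairs, num_hoi_classes)
    scores, pred_class = sim.max(dim=-1)     # best HOI class per pair
    return pred_class, scores
```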
Training
- LLM-based HOI Class Description Generation
- Generate rich sentences using an LLM for all object-interaction (HOI class) pairs
"Swinging a baseball bat describes a person..."
- VLM-based Image Prompts (VLM Guidance)
→ Cross-Attention (Image MHCA, Multi-Head Cross-Attention, initialized and then trained)
- Q: vision learnable prompt (initialized and then trained)
- K/V: vectors encoded by CLIP (VLM); for unseen classes, the descriptions generated by the LLM are encoded
→ Train the MHCA and the learnable prompt so that the attention output becomes similar to the CLIP (VLM) encoding
- The vision MHCA makes the output of vision prompt + vision MHCA similar to the unseen description embeddings created by the LLM!!
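A minimal sketch of this guidance step, under the assumption that a multi-head cross-attention uses the learnable prompts as queries, frozen CLIP features (or LLM-description encodings for unseen classes) as keys/values, and is trained with a cosine-similarity alignment loss; module and tensor names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptGuidanceMHCA(nn.Module):
    """Cross-attention that injects frozen VLM (CLIP) knowledge into learnable prompts.
    Q = learnable prompts, K/V = CLIP-encoded features (or LLM-description encodings for unseen classes)."""
    def __init__(self, dim=512, num_heads=8, n_prompts=2):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(n_prompts, dim) * 0.02)   # initialized, then trained
        self.mhca = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, clip_feats):                         # (batch, seq, dim), produced by frozen CLIP
        q = self.prompts.unsqueeze(0).expand(clip_feats.size(0), -1, -1)
        guided, _ = self.mhca(query=q, key=clip_feats, value=clip_feats)
        return guided                                      # (batch, n_prompts, dim)

def guidance_loss(guided_prompts, clip_target):
    """Align the attention output with the CLIP (VLM) encoding via cosine similarity."""
    g = F.normalize(guided_prompts.mean(dim=1), dim=-1)
    t = F.normalize(clip_target, dim=-1)
    return (1.0 - (g * t).sum(dim=-1)).mean()
```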
- Seen Class Training
At this point, learnable prompts and MHCA weights for Seen Classes are determined!!
→ Cross-Attention (Text MHCA, Multi-Head Cross-Attention, initialized and then trained)
- Q: text learnable prompt (initialized and then trained)
- K/V: token embeddings of the LLM descriptions
→ Train so that the attention output becomes similar to the image embeddings (based on cosine similarity)
- The text MHCA makes the output of text prompt + MHCA similar to the image embeddings (mainly for seen classes)!!
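A sketch of the seen-class alignment objective described above, assuming the text embedding of each pair's ground-truth HOI class is simply pulled toward that pair's image embedding with a cosine loss (an illustrative simplification, not the paper's exact formulation):

```python
import torch.nn.functional as F

def seen_alignment_loss(f_vis, f_txt, labels):
    """Pull the text embedding of each pair's ground-truth HOI class toward its image embedding.
    f_vis: (num_pairs, dim), f_txt: (num_hoi_classes, dim), labels: (num_pairs,) class indices."""
    f_vis = F.normalize(f_vis, dim=-1)
    f_txt = F.normalize(f_txt, dim=-1)
    matched_txt = f_txt[labels]                       # text embedding of each pair's seen class
    return (1.0 - (f_vis * matched_txt).sum(dim=-1)).mean()
```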
- Unseen Class Training: 3 stages!! (UTPL: Unseen Text Prompt Learning)
At this point, learnable prompts for Unseen Classes are determined based on the learnable prompts and MHCA weights of Seen Classes!!
Stage 1: Cross-Attention (MHCA) - MHCA weights determined from Seen classes
- Q: learnable prompt (starts with the final learnable prompt of the most similar Seen Class)
- K/V: Token embeddings of Unseen class LLM descriptions
→ Train so that the attention output becomes similar to the prompt output of the most similar seen class (based on cosine similarity)
Stage 2: Class-relation learning - Train learnable prompts to be similar according to the similarity between Seen and Unseen LLM description embeddings!
Stage 3: Negative learning - Train so that Seen class image encodings and Unseen class learnable prompts become distant
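A minimal sketch of the three UTPL-style objectives, under the assumption that the unseen prompt is (1) kept close to its most similar seen prompt, (2) regularized so prompt similarities follow LLM-description similarities, and (3) pushed away from seen-class image embeddings; all tensors and weightings are illustrative.

```python
import torch
import torch.nn.functional as F

def utpl_losses(unseen_prompt, similar_seen_prompt, desc_sim, seen_prompts, seen_img_embs):
    """Illustrative losses for Unseen Text Prompt Learning (UTPL).
    unseen_prompt:        (dim,) guided prompt output for one unseen class
    similar_seen_prompt:  (dim,) prompt output of its most similar seen class
    desc_sim:             (num_seen,) cosine similarity between the LLM descriptions of this
                          unseen class and each seen class
    seen_prompts:         (num_seen, dim) seen-class prompt outputs
    seen_img_embs:        (num_images, dim) image embeddings of seen-class samples
    """
    u = F.normalize(unseen_prompt, dim=-1)

    # Stage 1: stay close to the most similar seen-class prompt
    l_init = 1.0 - (u * F.normalize(similar_seen_prompt, dim=-1)).sum()

    # Stage 2: class-relation - prompt similarity should follow LLM-description similarity
    prompt_sim = u @ F.normalize(seen_prompts, dim=-1).t()
    l_rel = F.mse_loss(prompt_sim, desc_sim)

    # Stage 3: negative - push the unseen prompt away from seen-class image embeddings
    l_neg = (u @ F.normalize(seen_img_embs, dim=-1).t()).clamp(min=0).mean()

    return l_init + l_rel + l_neg
```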
Note! Learnable prompts are not inserted just once at the beginning, but are divided and inserted by layer!!
Deep Visual-Text Prompt Learning
- While previous approaches simply tuned the input prompts (adding fixed tokens at the front of the encoder input),
- This research inserts individual learnable prompts into every Transformer layer of the text and vision encoders (a minimal sketch follows the comparison table below)!
Basic Prompt Tuning vs. Deep Visual-Text Prompt Learning
Item | Basic Prompt Tuning | Deep Visual-Text Prompt Learning |
---|---|---|
Application Location | Add fixed tokens at the front of encoder input | Insert learnable prompts into all Transformer layers of the encoder |
Learning Target | Usually tune only a few learnable prompt vectors | Learn entire sequences of text/visual prompts layer by layer |
Expressiveness | Limited (shallow), only controls upstream information | Can control representations at deep positions (deep) |
Flexibility | Fast tuning with simple structure | Can reflect context/relationship/complex information (e.g. HOI) |
Example: CLIP | Token insertion only in text → control single sentence meaning | Adjust both text & visual, redesigning vision-language alignment itself |
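A minimal sketch of deep prompt insertion, assuming each prompted Transformer layer receives its own learnable prompt tokens (the previous layer's prompt outputs are discarded and replaced) while the frozen layers are untouched; N = 9 prompted layers and prompt length p = 2 follow the paper's setting, everything else (layer interface, names) is illustrative.

```python
import torch
import torch.nn as nn

class DeepPromptedEncoder(nn.Module):
    """Sketch of deep prompt learning: every prompted Transformer layer gets its own
    learnable prompt tokens, instead of prompts being prepended only once at the input."""
    def __init__(self, layers, n_prompted_layers=9, prompt_len=2, dim=512):
        super().__init__()
        self.layers = nn.ModuleList(layers)       # frozen encoder layers (assumed batch-first)
        for p in self.layers.parameters():
            p.requires_grad = False
        self.deep_prompts = nn.ParameterList(
            [nn.Parameter(torch.randn(prompt_len, dim) * 0.02) for _ in range(n_prompted_layers)]
        )
        self.prompt_len = prompt_len

    def forward(self, x):                         # x: (batch, seq, dim) token/patch embeddings
        for i, layer in enumerate(self.layers):
            if i < len(self.deep_prompts):
                if i > 0:                         # discard the previous layer's prompt outputs
                    x = x[:, self.prompt_len:, :]
                p = self.deep_prompts[i].unsqueeze(0).expand(x.size(0), -1, -1)
                x = torch.cat([p, x], dim=1)      # prepend this layer's own prompts
            x = layer(x)
        return x
```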
Why is Deep Visual-Text Prompt Learning Better??
1. Considering Layer-wise Semantic/Functional Differentiation
Each layer of Transformer handles different levels of meaning:
- Early layers: Low-level (local) features
- Middle layers: Relational (contextual) information
- Final layers: Conceptual abstraction (high-level semantics)
⇒ Simply attaching prompts only at the input makes it difficult to convey or manipulate information across all of these layers.
In contrast, Deep Prompt Learning can finely control the hierarchical semantic flow by inserting an appropriate prompt at each layer.
2. Application to Both Visual and Text → Improved Modal Alignment
Existing Prompt Tuning mainly inserts prompts only on the text side. However:
- HOI (Human-Object Interaction)
- VQA (Visual Question Answering)
In tasks where the combination of text and visual is key,
visual representations must also be simultaneously aligned/controlled to improve performance.
Deep Visual-Text Prompt inserts prompts in parallel into both the text and image encoders,
improving alignment quality between the two modalities.
3. Fine-grained Control & Context Adaptation
Since prompts exist independently at each layer, the following becomes possible:
- Detailed adjustments for specific tasks / classes / contexts
- Learning prompts differently for each HOI class to achieve fine-grained expression control
- Advantageous for complex relational expressions like "a person holding a cat" rather than simply "this is a cat"
EZ-HOI Performance Experiments!!
1. Definition of the Zero-Shot HOI Setting
- Similar to existing zero-shot HOI methods, utilize names of unseen HOI classes during training
- Previous studies:
- VCL, FCL, ATL: Compose new samples by combining unseen HOI class names
- EoID: Distill CLIP with predefined HOI prompts (seen + unseen classes)
- HOICLIP: Introduce verb class representation (including seen/unseen)
2. Implementation Details
- Basic Structure:
- DETR + ResNet-50 backbone
- CLIP-based dual encoder structure (prompt insertion in both text/visual)
- Hyperparameters:
  - Batch size: 16
  - Learning rate: 1e-3
  - Optimizer: AdamW
  - GPU: 4 × Nvidia A5000
- Backbone:
  - Visual encoder: DETR (ResNet-50)
  - Text encoder: description-based prompt generation with LLaVA-v1.5-7b
- Prompt Design:
  - Number of layers: N = 9, Prompt length: p = 2
  - Insert learnable text & visual prompts into each Transformer layer
- Additional Techniques:
- Intra-HOI fusion: Feature fusion of human-object pairs
- Inter-HOI fusion: Context injection between multiple HOI pairs within an image (see the sketch after this list)
- LLM-based fine-grained prompts (including text descriptions)
- Visual Adapter (ref: [27])
- UTPL module (Unseen Text Prompt Learning)
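A hedged sketch of what intra-/inter-HOI fusion could look like, assuming intra-HOI fusion concatenates and projects the human and object features of each pair, and inter-HOI fusion lets all pair features in an image exchange context via self-attention; this is a guess at the general idea, not the paper's implementation.

```python
import torch
import torch.nn as nn

class HOIFusion(nn.Module):
    """Illustrative intra-/inter-HOI fusion.
    Intra: fuse the human and object features of one pair into a single pair feature.
    Inter: let all pair features in an image exchange context via self-attention."""
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.intra = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.inter = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, human_feats, object_feats):          # both: (num_pairs, dim)
        pair = self.intra(torch.cat([human_feats, object_feats], dim=-1))   # intra-HOI fusion
        pair = pair.unsqueeze(0)                                            # (1, num_pairs, dim)
        ctx, _ = self.inter(pair, pair, pair)                               # inter-HOI fusion
        return (pair + ctx).squeeze(0)                                      # context-enriched pair features
```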
3. Experimental Results Analysis and Ablation Study
- Unseen-Verb Setting
  - Up to 87.9% reduction in trainable parameters compared to existing methods
  - Slightly lower performance than CLIP4HOI, but maximizes efficiency
  - 2.77 mAP improvement over UniHOI, with parameter count at the 26.9% level
- Unseen-Composition Setting (RF-UC / NF-UC)
  - Superior performance in all settings compared to CLIP4HOI
  - +5.56 mAP in RF-UC and +7.88 mAP in NF-UC compared to UniHOI
- Unseen-Object Setting
  - +1.49 mAP over CLIP4HOI, with parameter count at the 12.08% level
  - +13.36 mAP in unseen classes compared to UniHOI
- Ablation Study Interpretation
Component | Function Description | Performance Change | Interpretation |
---|---|---|---|
Intra-HOI Fusion | Information combination within a single human-object (H-O) pair in an image | seen +7.41 mAP | Significantly improves recognition precision for learned classes (seen) by more accurately capturing human/object relationships within a pair |
Visual Adapter | Module that inserts external information (e.g., position, class) into each layer of the visual encoder | seen ↑ / unseen ↓ | This information helps with seen classes, but may cause overfitting for unseen classes → hindrance to generalization |
LLM Guidance | Uses sophisticated text descriptions generated by LLaVA-based language model in prompts | unseen +1.52 mAP | Increases understanding of previously unseen classes by utilizing semantic-based descriptions rather than simple class names |
UTPL (Unseen Text Prompt Learning) | Structure that separately trains prompts dedicated to unseen classes | unseen +2.42 mAP | Prevents prompts from being biased toward seen classes and directly learns expressiveness for unseen classes, enhancing performance |
Inter-HOI Fusion | Information enhancement by sharing context between multiple human-object pairs | Both seen/unseen improved | Various relationships within an image provide contextual help to each other, increasing overall recognition and classification accuracy |
VLM Guidance | Strategy to induce (align) characteristics of pre-trained vision-language models like CLIP | unseen +1.33 mAP | Enables semantic inference for previously unseen classes by reflecting the VLM's generalization properties in prompts |
Final Thoughts
Research to prepare for unseen cases!
That is, research to insert pre-trained prompts so that they can adapt well to previously unseen situations in zero-shot scenarios!!
1) For seen cases, describe them with an LLM and train the text prompts to match the image embeddings.
2) For unseen cases, start from the prompts of similar seen cases and train further with the LLM's unseen descriptions and similar seen images!
3) Describe the unseen cases with the LLM and train the seen-case-based prompts to match the unseen VLM embeddings!
4) Through this, create perfect prompts for zero-shot unseen cases!