
๐Ÿ“Understanding EZ-HOI - EZ-HOI ์•Œ์•„๋ณด๊ธฐ!!

๐Ÿ“Understanding EZ-HOI - EZ-HOI ์•Œ์•„๋ณด๊ธฐ!!

🧠 (English) Understanding EZ-HOI?!!

๐Ÿ” Creating Perfect Prompts for Zero-shot and Unseen Cases!!

(Figure: manhwa)

Paper: EZ-HOI: VLM Adaptation via Guided Prompt Learning for Zero-Shot HOI Detection
Conference: NeurIPS 2024 (Lei, Wang, et al.)
Code: ChelsieLei/EZ-HOI


📌 Background: Limitations of HOI and VLM Integration Research!?

Human-Object Interaction (HOI) refers to the task of finding pairs of humans and objects in images or videos and classifying the interactions between them.

(Figure: existings)

โ“ Problem a: HOI Research with VLM Integration!

Models are too large and have difficulty capturing fine-grained details!!

Recent HOI research has extensively used Vision-Language Models (VLMs); a representative approach aligns the feature vectors of the HOI detector and the VLM so that both models understand concepts such as actions in a similar way.
Thanks to this alignment, the features could represent previously unseen interactions even in zero-shot situations, but the approach had the following drawbacks:

  • 💸 High-cost alignment learning process: VLM alignment is typically based on transformer structures, causing significant computational cost and training time issues!
  • 🔒 Difficulty in zero-shot generalization: VLM alignment is optimized only for trained classes (Seen classes), resulting in poor prediction performance for unseen classes!
  • 🧠 Limitations in knowledge transfer: While VLMs understand broad concepts well, they have weaknesses in tasks like HOI that require distinguishing subtle differences in human actions!
โ— Problem b: Lightweight learning by tuning only prompts!!

However, prompt tuning is mainly focused on Seen classes, resulting in poor performance on Unseen classes!

Recently, prompt-tuning-based approaches that skip the alignment process and directly utilize the VLM's representational power have gained attention as alternatives, but they still haven't shown sufficient results on zero-shot problems!!

Note: What is the prompt-tuning-based approach that directly utilizes the VLM's representational power!?
It changes "A photo of a cat" to "[P1] [P2] [P3] cat" and trains P1, P2, and P3!
The MaPLe prompt tuning mentioned in the paper tunes both image and text prompts together!! (A minimal sketch of this idea follows below.)
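To make the idea concrete, here is a minimal CoOp-style prompt-tuning sketch in PyTorch. The shapes, names, and the 512-dim embedding size are my assumptions, not the paper's code: the frozen CLIP text encoder receives [P1] [P2] [P3] followed by the class-name token embeddings, and only the prompt vectors are trained.

```python
# Minimal shallow prompt-tuning sketch (assumed shapes; not EZ-HOI's actual code).
import torch
import torch.nn as nn

class LearnablePrompt(nn.Module):
    def __init__(self, n_prompts: int = 3, dim: int = 512):
        super().__init__()
        # [P1] [P2] [P3]: randomly initialized, trainable prompt embeddings
        self.prompts = nn.Parameter(torch.randn(n_prompts, dim) * 0.02)

    def forward(self, class_token_emb: torch.Tensor) -> torch.Tensor:
        # class_token_emb: (n_class_tokens, dim) embeddings of e.g. the token "cat"
        # The concatenation replaces the hand-written "A photo of a ..." template and
        # is fed to the frozen text encoder; only self.prompts receives gradients.
        return torch.cat([self.prompts, class_token_emb], dim=0)
```

MaPLe extends this by keeping a second set of learnable prompts on the image-encoder side and tuning both jointly.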

  • Consequently, while the combination of HOI and VLM is promising, there were limitations in achieving lightweight models & generalization capabilities!

💡 EZ-HOI Emerges!!!

🧩 Inference

Learnable prompts that were fine-tuned in advance are combined with the existing foundation models!!
The foundation models themselves remain untrained, so zero-shot detection is achieved through prompt tuning alone!!

ezHOI_structure

[Input] Single image
    ↓
Stage 1: Human-Object Detection
    - Extract bounding boxes for humans and all objects
    - Generate all possible (human, object) pairs

Stage 2: HOI Recognition
    - Each human-object pair → CLIP's visual encoder + vision learnable prompt → image embedding (f_vis)
    - All HOI classes (object-action pairs) → CLIP's text encoder + text learnable prompt → text embedding (f_txt)
    - Select the most similar HOI class based on cosine similarity(f_vis, f_txt)
    → Final HOI prediction

๐Ÿ› ๏ธ Training
  1. LLM-based HOI Class Description Generation
    • Generate rich sentences using LLM for all object-interaction (HOI class) pairs
      "Swinging a baseball bat describes a person..."

  2. VLM-based Image Prompts (VLM Guidance)
→ Cross-Attention (Vision MHCA, Multi-Head Cross Attention, initialized and then trained)
  - Q: vision learnable prompt (initialized and then trained)
  - K/V: vectors encoded by CLIP (the VLM); for unseen cases, the descriptions generated by the LLM are encoded
→ Train the MHCA and the learnable prompt so that the attention output becomes similar to the CLIP (VLM) encoding result
  • The Vision MHCA ensures that the result of the vision prompt + Vision MHCA becomes similar to the unseen description embeddings created by the LLM!!
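A minimal sketch of this guidance step, assuming 512-dim embeddings and PyTorch's built-in multi-head attention (the exact loss form and shapes are my assumptions, not the authors' code):

```python
# Vision learnable prompts (Q) attend over frozen CLIP/LLM token embeddings (K/V);
# the attention output is pulled toward the frozen VLM embedding with a cosine loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

dim, n_prompts, n_tokens = 512, 2, 77
vision_prompt = nn.Parameter(torch.randn(1, n_prompts, dim) * 0.02)          # trained
mhca = nn.MultiheadAttention(embed_dim=dim, num_heads=8, batch_first=True)   # trained

clip_tokens = torch.randn(1, n_tokens, dim)  # K/V: frozen CLIP (or LLM-description) encodings
vlm_target = torch.randn(1, dim)             # frozen CLIP embedding the output should imitate

out, _ = mhca(query=vision_prompt, key=clip_tokens, value=clip_tokens)
loss = 1 - F.cosine_similarity(out.mean(dim=1), vlm_target, dim=-1).mean()
loss.backward()  # gradients reach only vision_prompt and the MHCA weights
```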

  3. Seen Class Training

    At this point, learnable prompts and MHCA weights for Seen Classes are determined!!

→ Cross-Attention (Text MHCA, Multi-Head Cross Attention, initialized and then trained)
  - Q: text learnable prompt (initialized and then trained)
  - K/V: token embeddings of the LLM descriptions
→ Train to make the attention output similar to the image embeddings (based on cosine similarity)
  • The Text MHCA makes the result of the text prompt + MHCA similar to the image embeddings (mainly for Seen classes)!!
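For the seen-class objective, one plausible formulation is sketched below: the Text-MHCA outputs act as class embeddings and are matched to the paired image embeddings via cosine similarity. Treating the similarities as logits under a cross-entropy loss is my assumption; the post only states that training is cosine-similarity based.

```python
import torch
import torch.nn.functional as F

def seen_class_loss(text_feats: torch.Tensor,   # (num_seen_classes, dim) Text-MHCA outputs
                    image_feats: torch.Tensor,  # (batch, dim) human-object pair embeddings
                    labels: torch.Tensor,       # (batch,) seen HOI class index per pair
                    temperature: float = 0.07) -> torch.Tensor:
    text_feats = F.normalize(text_feats, dim=-1)
    image_feats = F.normalize(image_feats, dim=-1)
    logits = image_feats @ text_feats.t() / temperature  # cosine similarities as logits
    return F.cross_entropy(logits, labels)
```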

  4. Unseen Class Training: 3 Stages!! (UTPL: Unseen Text Prompt Learning)

    At this point, learnable prompts for Unseen Classes are determined based on the learnable prompts and MHCA weights of Seen Classes!!

Stage 1: Cross-Attention (MHCA) - MHCA weights determined from Seen classes
- Q: learnable prompt (starts with the final learnable prompt of the most similar Seen Class)
- K/V: Token embeddings of Unseen class LLM descriptions
→ Train to make the attention output similar to the similar seen class's prompt output (based on cosine similarity)

Stage 2: Class-relation learning - Train learnable prompts to be similar according to the similarity between Seen and Unseen LLM description embeddings!

Stage 3: Negative learning - Train so that Seen class image encodings and Unseen class learnable prompts become distant
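The three stages above can be summarized as three loss terms; the following is a rough sketch under my own assumptions about shapes and loss forms (the paper's exact formulation may differ):

```python
import torch
import torch.nn.functional as F

def utpl_losses(unseen_out: torch.Tensor,       # (U, dim) MHCA output: unseen prompts (Q) over unseen LLM tokens (K/V)
                seen_prompt_out: torch.Tensor,  # (U, dim) output of each unseen class's most similar seen prompt
                desc_sim: torch.Tensor,         # (U,) LLM-description similarity between each unseen class and its seen neighbor
                seen_image_feats: torch.Tensor  # (B, dim) image encodings of seen-class pairs
                ):
    cos = F.cosine_similarity(unseen_out, seen_prompt_out, dim=-1)   # (U,)
    # Stage 1: pull the unseen prompt output toward its most similar seen prompt output
    l_stage1 = (1 - cos).mean()
    # Stage 2: class-relation -- prompts should be as similar as their LLM descriptions are
    l_stage2 = ((cos - desc_sim) ** 2).mean()
    # Stage 3: negative -- push unseen prompts away from seen-class image encodings
    sim_neg = F.normalize(unseen_out, dim=-1) @ F.normalize(seen_image_feats, dim=-1).t()
    l_stage3 = F.relu(sim_neg).mean()
    return l_stage1, l_stage2, l_stage3
```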

Note! Learnable prompts are not inserted just once at the beginning, but are divided and inserted by layer!!


Deep Visual-Text Prompt Learning
  • While previous approaches simply tuned the input prompt (adding prompt tokens only at the front of the encoder input),
  • this research inserts individual learnable prompts into each Transformer layer of the text and vision encoders, as sketched below!
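A minimal sketch of this layer-wise insertion, where frozen transformer blocks stand in for CLIP's encoder layers (shapes and initialization are assumptions):

```python
# Deep prompt learning: an independent learnable prompt is prepended at every
# layer instead of only once at the input.
import torch
import torch.nn as nn

class DeepPromptedEncoder(nn.Module):
    def __init__(self, layers: nn.ModuleList, n_prompts: int = 2, dim: int = 512):
        super().__init__()
        self.layers = layers  # frozen transformer blocks (e.g. CLIP layers)
        self.prompts = nn.ParameterList(
            [nn.Parameter(torch.randn(n_prompts, dim) * 0.02) for _ in layers]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, seq, dim)
        n = self.prompts[0].shape[0]
        for layer, p in zip(self.layers, self.prompts):
            p = p.unsqueeze(0).expand(x.shape[0], -1, -1)  # this layer's prompt
            x = layer(torch.cat([p, x], dim=1))            # prepend and run the layer
            x = x[:, n:, :]                                # drop the prompt slots before the next layer
        return x

# Example usage with stand-in layers:
# layers = nn.ModuleList(nn.TransformerEncoderLayer(512, 8, batch_first=True) for _ in range(9))
# enc = DeepPromptedEncoder(layers)
```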

✅ Basic Prompt Tuning vs. Deep Visual-Text Prompt Learning

| Item | Basic Prompt Tuning | Deep Visual-Text Prompt Learning |
|---|---|---|
| Application location | Add fixed tokens at the front of the encoder input | Insert learnable prompts into all Transformer layers of the encoder |
| Learning target | Usually tune only a few learnable prompt vectors | Learn full sequences of text/visual prompts layer by layer |
| Expressiveness | Limited (shallow); only controls upstream information | Can control representations at deep positions (deep) |
| Flexibility | Fast tuning with a simple structure | Can reflect context/relationship/complex information (e.g. HOI) |
| Example: CLIP | Token insertion only in text → controls single-sentence meaning | Adjusts both text & visual, redesigning vision-language alignment itself |

🎯 Why is Deep Visual-Text Prompt Learning Better??


1. Considering Layer-wise Semantic/Functional Differentiation

Each layer of a Transformer handles a different level of meaning:

  • Early layers: Low-level (local) features
  • Middle layers: Relational (contextual) information
  • Final layers: Conceptual abstraction (high-level semantics)

โžก๏ธ Simply attaching prompts only at the input makes it difficult to convey or manipulate information to all these layers.

🔹 In contrast, Deep Prompt Learning can finely control the hierarchical semantic flow by inserting an appropriate prompt at each layer.


2. Application to Both Visual/Text โ†’ Improved Modal Alignment

Existing prompt tuning mainly inserts prompts only on the text side. However, in tasks where the combination of text and visual information is key, such as:

  • HOI (Human-Object Interaction)
  • VQA (Visual Question Answering)

the visual representations must also be simultaneously aligned and controlled to improve performance.

🔹 Deep Visual-Text Prompt inserts prompts in parallel into both the text and image encoders,
improving the alignment quality between the two modalities.


3. Fine-grained Control & Context Adaptation

Since prompts exist independently at each layer, the following becomes possible:

  • Detailed adjustments for specific tasks / classes / contexts
  • Learning prompts differently for each HOI class to achieve fine-grained expression control
  • Advantageous for complex relational expressions like "a person holding a cat" rather than simply "this is a cat"

🔬 EZ-HOI Performance Experiments!!


1. 📘 Definition of the Zero-Shot HOI Setting
  • Similar to existing zero-shot HOI methods, utilize names of unseen HOI classes during training
  • Previous studies:
    • VCL, FCL, ATL: Compose new samples by combining unseen HOI class names
    • EoID: Distill CLIP with predefined HOI prompts (seen + unseen classes)
    • HOICLIP: Introduce verb class representation (including seen/unseen)

2. โš™๏ธ Implementation Details
  • Basic Structure:
    • DETR + ResNet-50 backbone
    • CLIP-based dual encoder structure (prompt insertion in both text/visual)
  • Hyperparameters:
    • Batch size: 16
    • Learning rate: 1e-3
    • Optimizer: AdamW
    • GPU: 4 × Nvidia A5000
  • Backbone:
    • Visual encoder: DETR (ResNet-50)
    • Text encoder: Description-based prompt generation with LLaVA-v1.5-7b
  • Prompt Design:
    • Number of layers: N = 9, Prompt length: p = 2
    • Insert learnable text & visual prompts into each Transformer layer
  • Additional Techniques:
    • Intra-HOI fusion: Feature fusion of human-object pairs
    • Inter-HOI fusion: Context injection between multiple HOI pairs within an image
    • LLM-based fine-grained prompts (including text descriptions)
    • Visual Adapter (ref: [27])
    • UTPL module (Unseen Text Prompt Learning)
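For reference, the reported settings gathered into one place (a convenience sketch; the field names are mine, not the authors' config format):

```python
from dataclasses import dataclass

@dataclass
class EZHOIConfig:
    batch_size: int = 16
    lr: float = 1e-3
    optimizer: str = "AdamW"
    num_gpus: int = 4                       # 4 x Nvidia A5000
    prompt_layers: int = 9                  # N = 9
    prompt_length: int = 2                  # p = 2
    detector: str = "DETR (ResNet-50)"
    description_llm: str = "LLaVA-v1.5-7b"  # generates the class descriptions

cfg = EZHOIConfig()
```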

3. Experimental Results Analysis and Ablation Study

results

  • Unseen-Verb Setting
    • Up to 87.9% reduction in trainable parameters compared to existing methods
    • Slightly lower performance than CLIP4HOI, but maximizes efficiency
    • 2.77 mAP improvement over UniHOI, with parameter count at 26.9% level
  • Unseen-Composition Setting (RF-UC / NF-UC)
    • Superior performance in all settings compared to CLIP4HOI
    • +5.56 mAP in RF-UC and +7.88 mAP in NF-UC compared to UniHOI
  • Unseen-Object Setting
    • +1.49 mAP over CLIP4HOI, with parameter count at 12.08%
    • +13.36 mAP in unseen classes compared to UniHOI
  • 🔬 Ablation Study Interpretation

ablation

| Component | Function Description | Performance Change | Interpretation |
|---|---|---|---|
| Intra-HOI Fusion | Combines information within a single human-object (H-O) pair in an image | seen +7.41 mAP | Significantly improves recognition precision for seen classes by more accurately capturing the human/object relationship within a pair |
| Visual Adapter | Module that injects external information (e.g., position, class) into each layer of the visual encoder | seen ↑ / unseen ↓ | Helps seen classes, but may cause overfitting for unseen classes → hinders generalization |
| LLM Guidance | Uses detailed text descriptions generated by a LLaVA-based language model in the prompts | unseen +1.52 mAP | Improves understanding of previously unseen classes by using semantic descriptions rather than bare class names |
| UTPL (Unseen Text Prompt Learning) | Separately trains prompts dedicated to unseen classes | unseen +2.42 mAP | Prevents prompts from being biased toward seen classes and directly learns expressiveness for unseen classes |
| Inter-HOI Fusion | Shares context between multiple human-object pairs to reinforce information | Both seen/unseen improved | The various relationships within an image provide mutual context, increasing overall recognition and classification accuracy |
| VLM Guidance | Strategy to align prompts with the characteristics of pre-trained vision-language models such as CLIP | unseen +1.33 mAP | Reflecting the VLM's generalization properties in the prompts enables semantic inference for previously unseen classes |

🧠 Final Thoughts

Research to prepare for unseen cases!
That is, research to insert pre-trained prompts so that they can adapt well to previously unseen situations in zero-shot scenarios!!

1) For seen classes, describe them with an LLM and train the text prompts to match the image embeddings.
2) For unseen classes, start from the prompts of similar seen classes and train further with the LLM's unseen descriptions and similar seen images!
3) Describe the unseen classes with the LLM and train the seen-class-based prompts to match the unseen VLM embeddings!
4) Through this, create perfect prompts for zero-shot unseen cases!




This post is licensed under CC BY 4.0 by the author.