🧠 Understanding CLIP4HOI
🔍 Combining CLIP and DETR for Zero-Shot HOI Detection!!!
Paper: CLIP4HOI: Towards Adapting CLIP for Practical Zero-Shot HOI Detection
Conference: NeurIPS 2023 (Mao, Yunyao, et al.)
Code: maoyunyao/CLIP4HOI (no code available, unfortunately)
🔎 Key Summary
- 💡 CLIP’s Vision-Language capabilities applied to HOI (Human-Object Interaction) detection!
- 🏗️ DETR + CLIP Integration: Adding 3 HOI-specific components to DETR for enhanced HOI detection performance
- ✅ CLIP’s Rich Representation: Leveraging image-text matching capabilities for HOI classification!
🚨 Limitations of Existing Methods
Existing Zero-shot HOI detection methods have evolved around two main paradigms, but each has its own inherent problems:
1️⃣ Limitations of One-Stage Frameworks 🎯
Representative Studies: GEN-VLKT, HOICLIP, QPIC, HOTR, etc.
- Approach: Simultaneously predicting ⟨human, verb, object⟩ triplets with a single query
- Problems:
- Joint positional distribution overfitting: Decoder overfits to human-object positional distributions for known verb-object combinations
- Position dependency: Reliance on seen-category patterns such as "when a person holds a cup, the cup is usually positioned to the person's right"
- Distribution mismatch vulnerability: Performance drastically drops when there are significant positional distribution differences between seen and unseen categories
Concrete Example: If “person-cup” interactions occurred mainly in specific positions during training, the model fails to recognize the same interaction in completely different positional relationships
2️⃣ Limitations of Knowledge Distillation 🧠
Representative Studies: GEN-VLKT, EoID
- Approach: Transferring CLIP’s knowledge to existing HOI detectors through knowledge distillation
- Problems:
- Data-sensitive: Heavy dependence on the completeness of training data
- Seen categories bias: Distillation process is dominated by seen category samples
- Poor unseen categories generalization: Lack of unseen categories during training damages generalization ability for those categories
- Distribution imbalance: Action probability distribution distillation is sensitive to imbalanced training data
3️⃣ Limitations of Compositional Learning Approaches 📚
Representative Studies: VCL, FCL, ATL, ConsNet, SCL
- Approach: Performing compositional learning by decomposing HOI into verbs and objects
- Problems:
- Dependency on predefined categories: Unseen categories must be predefined during learning
- Limited generalization: Restricted to predictable combinations, not truly “practical” zero-shot
- Complex interaction modeling limitations: Simple decomposition has limitations in representing complex HOI
- Lack of scalability: Limited flexibility for new verb-object combinations
4️⃣ Limitations of Existing CLIP Utilization Methods 🔧
Representative Studies: GEN-VLKT, EoID, HOICLIP
- Problems:
- Indirect utilization: Using CLIP only as an auxiliary tool, underutilizing its core capabilities
- Complex pipeline: Requiring complex combinations of knowledge distillation and fine-tuning
- Training-free limitations: Training-free approaches like HOICLIP have performance constraints
- Lack of adaptation: Failure to directly apply CLIP’s powerful vision-language capabilities to HOI tasks
💡 CLIP4HOI’s Solution: Solve the positional distribution problem with a two-stage paradigm + eliminate distillation dependency through direct CLIP adaptation!
🏗️ CLIP4HOI Pipeline
Input Image
↓
1. Object Detector (DETR): Detect humans + objects
↓
2. HO Interactor: Generate proposals for each pair
- Example of pair proposal generation:
- Input:
- Human: bbox[100, 50, 200, 300] + features[human visual information]
- Cup: bbox[180, 120, 220, 160] + features[cup visual information]
- HO Interactor Output (Q: HOI tokens):
- Spatial relationship: "Cup is near the right side of human"
- Distance: "30 pixel distance"
- Interaction features: Vector encoding "graspable distance, near-hand position" etc.
- Combined representation: HOI proposal combining all this information
↓
3. HOI Decoder (Based on CLIP image encoder)
- Transformer structure, generating output with QKV below
- Query: Pairwise HOI proposals from 2. HO Interactor
- Key/Value: Global image features from CLIP image encoder
- Output: Visual features specialized for each H-O pair
↓
4. HOI Classifier (Compare with CLIP text embeddings)
- Compare results from 3 with CLIP text embeddings to select highest score!
↓
Final HOI predictions
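To make the two-stage flow above concrete, here is a minimal PyTorch-style sketch of the inference loop. This is not the authors’ implementation: the callables (`detector`, `ho_interactor`, `hoi_decoder`, `clip_image_encoder`) and the tensor shapes are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def clip4hoi_inference_sketch(image, detector, ho_interactor, hoi_decoder,
                              clip_image_encoder, text_embeds, tau=0.07):
    """Hypothetical two-stage inference flow (shapes and callables are assumed).

    text_embeds: [Nc, Co] L2-normalized CLIP text embeddings, one per HOI class.
    """
    # Stage 1: an off-the-shelf detector (e.g. DETR) finds humans and objects.
    humans, objects = detector(image)          # each: list of (bbox, feature) tuples

    # Global CLIP image features act as Key/Value for the HOI decoder.
    patch_feats = clip_image_encoder(image)    # [Npatch, C]

    predictions = []
    for h_box, h_feat in humans:
        for o_box, o_feat in objects:
            # Stage 2a: HO Interactor fuses appearance + spatial cues into one HOI token.
            hoi_token = ho_interactor(h_feat, o_feat, h_box, o_box)    # [C]
            # Stage 2b: HOI Decoder cross-attends the token over CLIP patch features.
            pair_feat = hoi_decoder(hoi_token, patch_feats)            # [Co]
            # Stage 2c: HOI Classifier scores the pair against CLIP text embeddings.
            pair_feat = F.normalize(pair_feat, dim=-1)
            s_pair = torch.sigmoid(pair_feat @ text_embeds.T / tau)    # [Nc], multi-label
            predictions.append(((h_box, o_box), s_pair))
    return predictions
```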
📦 3 Key Components (Learnable Parts)
1️⃣ HO Interactor 🤝
- Generate ⟨human, object⟩ pairs from DETR’s object detection results
- Feature interaction + Spatial information injection
Generate HOI tokens (Q) containing human-object relationship information
- Input:
  - Human: bbox[100, 50, 200, 300] + features[human visual information]
  - Cup: bbox[180, 120, 220, 160] + features[cup visual information]
- HO Interactor Output:
  - Spatial relationship: "Cup is near the right side of the human"
  - Distance: "30-pixel distance"
  - Interaction features: vector encoding "graspable distance, near-hand position", etc.
  - Combined representation: an HOI proposal combining all of this information
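As a rough illustration of “feature interaction + spatial information injection”, here is a minimal sketch in PyTorch. The layer sizes and the concatenation-based fusion are my own assumptions, not the paper’s exact design.

```python
import torch
import torch.nn as nn

class HOInteractorSketch(nn.Module):
    """Illustrative HO Interactor: fuse human/object features with a spatial cue."""
    def __init__(self, feat_dim=256, hidden_dim=256):
        super().__init__()
        # Encode the two normalized boxes (x1, y1, x2, y2 each -> 8 values) into a spatial cue.
        self.spatial_mlp = nn.Sequential(
            nn.Linear(8, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, hidden_dim))
        # Fuse human feature, object feature, and spatial cue into a single HOI token (Q).
        self.fuse = nn.Sequential(
            nn.Linear(feat_dim * 2 + hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim))

    def forward(self, h_feat, o_feat, h_box, o_box):
        spatial = self.spatial_mlp(torch.cat([h_box, o_box], dim=-1))
        return self.fuse(torch.cat([h_feat, o_feat, spatial], dim=-1))

# Example: one human-cup pair with 256-d detector features and normalized boxes.
interactor = HOInteractorSketch()
hoi_token = interactor(torch.randn(256), torch.randn(256),
                       torch.tensor([0.16, 0.10, 0.31, 0.63]),
                       torch.tensor([0.28, 0.25, 0.34, 0.33]))
```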
2️⃣ HOI Decoder 🧠
- Utilize CLIP image representation to generate HOI visual features
- Apply attention mechanism using HOI tokens as queries
Enrich visual HOI representations
- Query: pairwise HOI proposals from the HO Interactor
- Key/Value: global image features from the CLIP image encoder
- Output: visual features specialized for each H-O pair
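A minimal sketch of this cross-attention idea, assuming a standard transformer decoder layer; the number of heads and the residual structure are illustrative guesses rather than the paper’s exact configuration.

```python
import torch
import torch.nn as nn

class HOIDecoderSketch(nn.Module):
    """Illustrative HOI Decoder: HOI tokens (Q) attend over CLIP patch features (K/V)."""
    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, hoi_tokens, clip_patch_feats):
        # hoi_tokens: [B, Npair, dim] queries; clip_patch_feats: [B, Npatch, dim] keys/values.
        attended, _ = self.cross_attn(hoi_tokens, clip_patch_feats, clip_patch_feats)
        x = self.norm1(hoi_tokens + attended)
        return self.norm2(x + self.ffn(x))   # pair-specific visual features

# Example: 5 H-O pairs attending over 196 CLIP patch tokens.
decoder = HOIDecoderSketch()
pair_feats = decoder(torch.randn(1, 5, 256), torch.randn(1, 196, 256))
```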
3️⃣ HOI Classifier 🎯
- Utilize CLIP’s shared embedding space for direct adaptation
- Learnable prompt: task-specific text representation in the [PREFIX] [verb] [CONJUN] [object] format
- Global + Pairwise HOI scores are computed for multi-label classification
- 🔍 Detailed Operation Process
- CLIP Direct Adaptation: Use CLIP directly as a fine-grained classifier instead of relying on knowledge distillation
1. Text Embedding Generation
```python
# Using learnable prompt tokens
text = "[PREFIX] verb [CONJUN] object"   # learnable tokens
T = clip_text_encoder(text)              # [Nc x Co] class text embeddings
```
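As a rough sketch of how “[PREFIX] verb [CONJUN] object” could be assembled from learnable tokens before being passed through the frozen CLIP text encoder: the CoOp-style context vectors and the number of prefix/conjunction tokens below are my assumptions, not the paper’s exact design.

```python
import torch
import torch.nn as nn

class LearnablePromptSketch(nn.Module):
    """Illustrative prompt builder: [PREFIX] verb [CONJUN] object as embeddings."""
    def __init__(self, embed_dim=512, n_prefix=4, n_conjun=2):
        super().__init__()
        # Learnable [PREFIX] / [CONJUN] token embeddings shared across all HOI classes.
        self.prefix = nn.Parameter(torch.randn(n_prefix, embed_dim) * 0.02)
        self.conjun = nn.Parameter(torch.randn(n_conjun, embed_dim) * 0.02)

    def forward(self, verb_embeds, object_embeds):
        # verb_embeds: [Nc, Lv, D], object_embeds: [Nc, Lo, D] frozen word embeddings.
        n_cls = verb_embeds.shape[0]
        prefix = self.prefix.unsqueeze(0).expand(n_cls, -1, -1)
        conjun = self.conjun.unsqueeze(0).expand(n_cls, -1, -1)
        # The concatenated sequence is then fed to the (frozen) CLIP text encoder
        # to produce T, the [Nc x Co] class embeddings used above.
        return torch.cat([prefix, verb_embeds, conjun, object_embeds], dim=1)
```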
2. Visual Features Processing
```python
# Linear projection + L2 normalization
I_prime = Linear(I)        # Global image features
P_prime = Linear(P)        # Pairwise H-O features (output of the HOI Decoder)
T_hat = L2_Norm(T)         # Text embeddings
I_hat = L2_Norm(I_prime)   # Global visual
P_hat = L2_Norm(P_prime)   # Pairwise visual
```
3. HOI Scores Calculation
```python
# Temperature scaling + Sigmoid (multi-label)
S_glob = Sigmoid(I_hat @ T_hat.T / τ)   # Global HOI scores
S_pair = Sigmoid(P_hat @ T_hat.T / τ)   # Pairwise HOI scores

# Integrate with object detector priors for the final prediction
final_result = integrate(S_glob, S_pair, object_priors)
```
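`integrate` above is left abstract; one plausible fusion, shown purely as an assumption, is to weight the pairwise scores by the global scores and by the detector’s human/object confidences:

```python
import torch

def integrate_sketch(s_glob, s_pair, human_conf, object_conf, alpha=0.5):
    """Hypothetical score fusion; the blending scheme and alpha are illustrative."""
    # s_glob: [Nc] image-level HOI scores, s_pair: [Npair, Nc] per-pair HOI scores,
    # human_conf / object_conf: [Npair] detection confidences from DETR.
    blended = s_pair * s_glob.unsqueeze(0).pow(alpha)       # mix in global context
    prior = (human_conf * object_conf).unsqueeze(-1)        # object-detector prior per pair
    return blended * prior                                  # final per-pair HOI scores
```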
Ablation Study Results Analysis!! 🔬
- (1) HOI Decoder: the absolutely critical component for zero-shot performance
  - Largest performance drop in the ablation: a 3.77 mAP decrease
  - Especially sharp in unseen categories: a 4.67 mAP decrease
  - → Aggregating CLIP visual features is key to zero-shot performance!
- (2) Global vs. Pairwise scores: seen and unseen categories need different strategies
  - Seen: 0.58 mAP improvement (actually helpful!)
  - Unseen: 2.99 mAP decrease (a significant loss!)
  - → Global context is essential for zero-shot generalization
- (3) Multi-modal integration: demonstrates the complementary roles of CLIP + DETR
📊 Experimental Results & Performance
CLIP4HOI demonstrates significantly superior performance compared to existing methods on major benchmarks:
🏆 Key Benchmark Results
- Best performance on Unseen categories!!! Zero-shot!!
- Also excellent performance in Fully supervised settings!!
⚠️ Limitations
These limitations were not mentioned in the paper but were considered through discussions with GPT!!
1️⃣ Learnable Prompt’s Seen Category Bias 🎯
- Problem: the [PREFIX] and [CONJUN] tokens are trained only on seen HOI categories
- Result: potential degradation of generalization to unseen categories due to overfitting
- Dilemma: Trade-off between zero-shot performance vs learnable prompt optimization
2️⃣ Limitations of Closed-set Zero-shot 📋
- Actual Meaning: Zero-shot only within predefined candidate HOI classes
- Limitation: Still unable to recognize completely new verb-object combinations
- Constraint: Cannot detect interactions not defined in HOI taxonomy
3️⃣ Computational Cost of Two-stage Paradigm ⚡
- Structure: Object Detection → HO Interactor → HOI Decoder → HOI Classifier
- Problem: High computational cost due to 4-stage sequential processing
- Comparison: Increased inference time compared to one-stage methods
4️⃣ Complexity of Multi-modal Alignment 🔄
- Challenge: Difficulty in achieving optimal alignment between visual features and text embeddings
- Temperature scaling: Performance sensitivity to τ value tuning
- Feature space: Complexity in aligning embedding spaces across different modalities
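A tiny numerical check of the τ-sensitivity point (the cosine similarities 0.30 and 0.25 are made-up values for a matching vs. distractor HOI class): with a small τ the sigmoid saturates and both scores land near 1, while with a large τ both hover near 0.5, so score thresholds and calibration depend strongly on the chosen temperature.

```python
import math

def hoi_score(cos_sim, tau):
    # Sigmoid-scored similarity, mirroring S = Sigmoid(similarity / τ) above.
    return 1.0 / (1.0 + math.exp(-cos_sim / tau))

for tau in (0.07, 0.5, 1.0):
    print(f"tau={tau}: match={hoi_score(0.30, tau):.3f}, "
          f"distractor={hoi_score(0.25, tau):.3f}")
```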
5️⃣ Limitations in Fine-grained Interaction Distinction 🔍
- Examples: "holding phone" vs. "talking on phone", "sitting on chair" vs. "standing near chair"
- Cause: Difficulty in distinguishing subtle differences due to limitations in spatial relationship encoding
- Impact: Performance degradation for HOIs that are visually similar but semantically different
✅ Summary
CLIP4HOI is an HOI detection method that combines DETR’s object detection capabilities with CLIP’s Vision-Language understanding.
🎯 Core Ideas
- DETR + CLIP Integration: Combination of proven object detection + powerful semantic understanding
- 3 New Components: Addition of HO Interactor, HOI Decoder, HOI Classifier
- Text Template Utilization: Flexible classification by expressing HOI in natural language
- Zero-shot Generalization: Inference of untrained HOI classes through text descriptions
⚡ Technical Innovations
- Spatial + Semantic Information Integration: Simultaneous utilization of spatial relationships and semantic understanding
- Global + Pairwise Scores: Consideration of both holistic and individual HOI scores
- CLIP Representation Maximization: Application of pre-trained vision-language model capabilities to HOI
📌 Successfully integrating CLIP’s text-image matching capabilities into HOI detection!
🚀 Practical Value: Effective learning and inference of complex HOI relationships through natural language descriptions!