
🧠 Understanding CLIP4HOI

🔍 Combining CLIP and DETR for Zero-Shot HOI Detection!!!

(Figure: intro comic)

Paper: CLIP4HOI: Towards Adapting CLIP for Practical Zero-Shot HOI Detection
Conference: NeurIPS 2023 (Mao, Yunyao, et al.)
Code: maoyunyao/CLIP4HOI (no code has been released yet)



🔎 Key Summary

  • 💡 CLIP’s Vision-Language capabilities applied to HOI (Human-Object Interaction) detection!
  • 🏗️ DETR + CLIP Integration: Adding 3 HOI-specific components to DETR for enhanced HOI detection performance
  • CLIP’s Rich Representation: Leveraging image-text matching capabilities for HOI classification!

🚨 Limitations of Existing Methods

Existing Zero-shot HOI detection methods have evolved around two main paradigms, but each has its own inherent problems:

1️⃣ Limitations of One-Stage Frameworks 🎯

Representative Studies: GEN-VLKT, HOICLIP, QPIC, HOTR, etc.

  • Approach: Simultaneously predicting ⟨human, verb, object⟩ triplets with a single query
  • Problems:
    • Joint positional distribution overfitting: Decoder overfits to human-object positional distributions for known verb-object combinations
    • Position dependency: Reliance on seen-category patterns such as “when a person holds a cup, the cup usually sits to the person’s right”
    • Distribution mismatch vulnerability: Performance drastically drops when there are significant positional distribution differences between seen and unseen categories

Concrete Example: If “person-cup” interactions occurred mainly in specific positions during training, the model fails to recognize the same interaction in completely different positional relationships

2️⃣ Limitations of Knowledge Distillation 🧠

Representative Studies: GEN-VLKT, EoID

  • Approach: Transferring CLIP’s knowledge to existing HOI detectors through knowledge distillation
  • Problems:
    • Data-sensitive: Heavy dependence on the completeness of training data
    • Seen-category bias: the distillation process is dominated by seen-category samples
    • Poor generalization to unseen categories: because unseen categories never appear during training, the distilled knowledge transfers poorly to them
    • Distribution imbalance: Action probability distribution distillation is sensitive to imbalanced training data

3️⃣ Limitations of Compositional Learning Approaches 📚

Representative Studies: VCL, FCL, ATL, ConsNet, SCL

  • Approach: Performing compositional learning by decomposing HOI into verbs and objects
  • Problems:
    • Dependency on predefined categories: Unseen categories must be predefined during learning
    • Limited generalization: Restricted to predictable combinations, not truly “practical” zero-shot
    • Complex interaction modeling limitations: Simple decomposition has limitations in representing complex HOI
    • Lack of scalability: Limited flexibility for new verb-object combinations

4️⃣ Limitations of Existing CLIP Utilization Methods 🔧

Representative Studies: GEN-VLKT, EoID, HOICLIP

  • Problems:
    • Indirect utilization: Using CLIP only as an auxiliary tool, underutilizing its core capabilities
    • Complex pipeline: Requiring complex combinations of knowledge distillation and fine-tuning
    • Training-free limitations: Training-free approaches like HOICLIP have performance constraints
    • Lack of adaptation: Failure to directly apply CLIP’s powerful vision-language capabilities to HOI tasks

💡 CLIP4HOI’s Solution: Solving positional distribution problems with two-stage paradigm + eliminating distillation dependency through direct CLIP adaptation!


🏗️ CLIP4HOI Pipeline

(Figure: overall CLIP4HOI architecture)

Input Image 
    ↓
1. Object Detector (DETR): Detect humans + objects
    ↓
2. HO Interactor: Generate proposals for each pair
- Example of pair proposal generation:  
- Input:  
    - Human: bbox[100, 50, 200, 300] + features[human visual information]
    - Cup: bbox[180, 120, 220, 160] + features[cup visual information]
- HO Interactor Output (Q: HOI tokens):
    - Spatial relationship: "Cup is near the right side of human"
    - Distance: "30 pixel distance"
    - Interaction features: Vector encoding "graspable distance, near-hand position" etc.
    - Combined representation: HOI proposal combining all this information
    ↓
3. HOI Decoder (Based on CLIP image encoder)
 - Transformer structure that produces its output from the Q/K/V below
 - Query: Pairwise HOI proposals from 2. HO Interactor
 - Key/Value: Global image features from CLIP image encoder
 - Output: Visual features specialized for each H-O pair
    ↓
4. HOI Classifier (Compare with CLIP text embeddings)  
 - Compare the outputs of step 3 with CLIP text embeddings and select the highest-scoring HOI class!
    ↓
Final HOI predictions
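
Before diving into each component, here is how the four stages above could be wired together in code. This is a minimal illustrative sketch (the official code is not released), so the interfaces of `detector`, `ho_interactor`, `hoi_decoder`, and `hoi_classifier` are hypothetical stand-ins for the modules described above.

```python
import itertools
import torch

@torch.no_grad()
def detect_hoi(image, detector, ho_interactor, hoi_decoder, clip_model, hoi_classifier):
    """Hypothetical CLIP4HOI-style two-stage inference flow (illustrative, not official code)."""
    # 1. Frozen DETR detects humans and objects (boxes, class labels, region features)
    dets = detector(image)
    humans  = [d for d in dets if d["label"] == "person"]
    objects = [d for d in dets if d["label"] != "person"]

    # 2. HO Interactor: every <human, object> pair becomes one HOI token (query)
    pairs = list(itertools.product(humans, objects))
    hoi_tokens = ho_interactor(pairs)                    # [num_pairs, d_model]

    # 3. HOI Decoder: HOI tokens cross-attend to CLIP image-encoder features
    clip_feats = clip_model.encode_image(image)          # global / patch-level features
    pair_feats = hoi_decoder(hoi_tokens, clip_feats)     # [num_pairs, d_model]

    # 4. HOI Classifier: cosine similarity against CLIP text embeddings of HOI classes
    scores = hoi_classifier(pair_feats, clip_feats)      # [num_pairs, num_hoi_classes]
    return pairs, scores
```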

📦 3 Key Components (Learnable Parts)

1️⃣ HO Interactor 🤝

  • Generate ⟨human, object⟩ pairs from DETR’s object detection results
  • Feature interaction + Spatial information injection
  • Generate HOI tokens (Q) containing human-object relationship information

    
    - Input:  
        - Human: bbox[100, 50, 200, 300] + features[human visual information]
        - Cup: bbox[180, 120, 220, 160] + features[cup visual information]
    - HO Interactor Output:
        - Spatial relationship: "Cup is near the right side of human"
        - Distance: "30 pixel distance"
        - Interaction features: Vector encoding "graspable distance, near-hand position" etc.
        - Combined representation: HOI proposal combining all this information
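
As a rough idea of how such pairwise HOI tokens could be produced, here is a minimal PyTorch sketch that fuses the human/object region features with an encoding of their box pair. The MLP design and the raw box-coordinate spatial encoding are my assumptions, not the paper's exact HO Interactor; in practice the spatial cue would likely be normalized (relative offsets, IoU, etc.) before encoding.

```python
import torch
import torch.nn as nn

class SimpleHOInteractor(nn.Module):
    """Illustrative HO Interactor: fuse human/object features with pairwise spatial cues."""
    def __init__(self, feat_dim=256, d_model=512):
        super().__init__()
        # Encodes the concatenated (human box, object box) coordinates
        self.spatial_mlp = nn.Sequential(nn.Linear(8, 128), nn.ReLU(), nn.Linear(128, d_model))
        # Fuses visual features of both boxes with the spatial encoding into one HOI token
        self.fuse = nn.Sequential(nn.Linear(2 * feat_dim + d_model, d_model), nn.ReLU(),
                                  nn.Linear(d_model, d_model))

    def forward(self, h_feat, o_feat, h_box, o_box):
        # h_feat/o_feat: [N, feat_dim]; h_box/o_box: [N, 4] in (x1, y1, x2, y2)
        spatial = self.spatial_mlp(torch.cat([h_box, o_box], dim=-1))      # spatial cue
        return self.fuse(torch.cat([h_feat, o_feat, spatial], dim=-1))     # HOI tokens [N, d_model]
```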
    

2️⃣ HOI Decoder 🧠

  • Utilize CLIP image representation to generate HOI visual features
  • Apply attention mechanism using HOI tokens as queries
  • Enrich visual HOI representations

    
    - Query: Pairwise HOI proposals from 2. HO Interactor
    - Key/Value: Global image features from CLIP image encoder
    - Output: Visual features specialized for each H-O pair
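
Conceptually this is cross-attention with HOI tokens as queries and CLIP patch features as keys/values. Below is a single-layer PyTorch sketch under that assumption; the actual decoder stacks several layers on top of the CLIP image encoder, and its exact details may differ.

```python
import torch
import torch.nn as nn

class SimpleHOIDecoderLayer(nn.Module):
    """One illustrative cross-attention layer: HOI tokens attend to CLIP image features."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, hoi_tokens, clip_patch_feats):
        # hoi_tokens: [B, num_pairs, d];  clip_patch_feats: [B, num_patches, d]
        attn_out, _ = self.cross_attn(query=hoi_tokens, key=clip_patch_feats,
                                      value=clip_patch_feats)
        x = self.norm1(hoi_tokens + attn_out)        # residual + norm
        return self.norm2(x + self.ffn(x))           # pairwise HOI visual features
```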
    

3️⃣ HOI Classifier 🎯

  • Utilize CLIP’s shared embedding space for direct adaptation
  • Learnable prompt: [PREFIX] [verb] [CONJUN] [object] format for task-specific text representation
  • Global + Pairwise HOI scores calculation for multi-label classification
  • 🔍 Detailed Operation Process
    • CLIP Direct Adaptation: Use CLIP directly as a fine-grained classifier instead of relying on knowledge distillation

    1. Text Embedding Generation

    
    # Build one prompt per HOI class using learnable tokens
    text = "[PREFIX] [verb] [CONJUN] [object]"  # [PREFIX]/[CONJUN] are learnable; verb/object are class words
    T = clip_text_encoder(text)  # text embeddings, one row per HOI class: [Nc x Co]
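
For intuition, here is a CoOp-style sketch of what the learnable [PREFIX]/[CONJUN] tokens could look like: trainable context vectors concatenated around the frozen CLIP word embeddings of the verb and object. The token counts and initialization are assumptions for illustration, not confirmed by released code. One such sequence is built per HOI class and passed through the CLIP text encoder to obtain the class embedding matrix T ([Nc x Co]).

```python
import torch
import torch.nn as nn

class LearnablePrompt(nn.Module):
    """Illustrative learnable prompt: trainable context vectors around verb/object embeddings."""
    def __init__(self, embed_dim=512, n_prefix=4, n_conjun=2):
        super().__init__()
        # [PREFIX] and [CONJUN] are free vectors optimized during training
        self.prefix = nn.Parameter(torch.randn(n_prefix, embed_dim) * 0.02)
        self.conjun = nn.Parameter(torch.randn(n_conjun, embed_dim) * 0.02)

    def forward(self, verb_emb, obj_emb):
        # verb_emb: [n_verb_tok, D], obj_emb: [n_obj_tok, D] (frozen CLIP token embeddings)
        # Token sequence fed to the CLIP text encoder: [PREFIX] verb [CONJUN] object
        return torch.cat([self.prefix, verb_emb, self.conjun, obj_emb], dim=0)
```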
    

    2. Visual Features Processing

    
    # Linear projection + L2 normalization
    I_prime = Linear(I)  # Global image features
    P_prime = Linear(P)  # Pairwise H-O features (output of 2️⃣ HOI Decoder)
      
    T_hat = L2_Norm(T)   # Text embeddings
    I_hat = L2_Norm(I_prime)  # Global visual
    P_hat = L2_Norm(P_prime)  # Pairwise visual
    

    3. HOI Scores Calculation

    
    # Temperature scaling + Sigmoid (multi-label)
    S_glob = Sigmoid(I_hat @ T_hat.T / τ)  # Global HOI scores
    S_pair = Sigmoid(P_hat @ T_hat.T / τ)  # Pairwise HOI scores
      
    # Integrate with object detector priors for final prediction
    final_result = integrate(S_glob, S_pair, object_priors)
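
Putting steps 2-3 together, here is a hedged end-to-end sketch of the classification math in PyTorch. The `integrate` fusion with the detector priors (a weighted sum scaled by object confidence) is my guess at one reasonable choice, not the paper's exact formula.

```python
import torch
import torch.nn.functional as F

def hoi_scores(I, P, T, proj_global, proj_pair, tau=0.01):
    # I: [B, C] global CLIP image feats, P: [B, Np, C] pairwise decoder feats,
    # T: [Nc, Co] CLIP text embeddings of the HOI classes
    I_hat = F.normalize(proj_global(I), dim=-1)          # [B, Co]
    P_hat = F.normalize(proj_pair(P), dim=-1)            # [B, Np, Co]
    T_hat = F.normalize(T, dim=-1)                       # [Nc, Co]

    S_glob = torch.sigmoid(I_hat @ T_hat.T / tau)        # [B, Nc] multi-label global scores
    S_pair = torch.sigmoid(P_hat @ T_hat.T / tau)        # [B, Np, Nc] pairwise scores
    return S_glob, S_pair

# Illustrative fusion with object-detector confidences (exact formula is an assumption)
def integrate(S_glob, S_pair, obj_conf, lam=0.5):
    # obj_conf: [B, Np] detection confidence of the object in each H-O pair
    return (lam * S_glob.unsqueeze(1) + (1 - lam) * S_pair) * obj_conf.unsqueeze(-1)
```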
    

    Ablation Study Results Analysis!! 🔬

    (Figure: ablation study results)

    • HOI Decoder: the single most critical component for zero-shot performance (1)
      • Removing it causes the largest drop: 3.77 mAP overall
      • The decrease is especially sharp in Unseen categories: 4.67 mAP
      • → Aggregating CLIP visual features is the key to zero-shot performance!
    • Global vs. Pairwise scores: Seen and Unseen categories need different strategies (2)
      • Without the global score, Seen improves by 0.58 mAP (dropping it actually helps a little!)
      • But Unseen drops by 2.99 mAP (a significant loss!)
      • → Global context is essential for zero-shot generalization
    • Multi-modal Integration: confirms the complementary roles of CLIP + DETR (3)

📊 Experimental Results & Performance

CLIP4HOI demonstrates significantly superior performance compared to existing methods on major benchmarks:

🏆 Key Benchmark Results

(Figure: zero-shot benchmark results)

  • Best performance on Unseen categories!!! Zero-shot!!

(Figure: fully supervised benchmark results)

  • Also excellent performance in Fully supervised settings!!

⚠️ Limitations

These limitations are not mentioned in the paper; they came out of my own discussions with GPT!!

1️⃣ Learnable Prompt’s Seen Category Bias 🎯

  • Problem: [PREFIX], [CONJUN] tokens are trained only on seen HOI categories
  • Result: the prompts may overfit to seen categories, potentially degrading generalization to unseen ones
  • Dilemma: Trade-off between zero-shot performance vs learnable prompt optimization

2️⃣ Limitations of Closed-set Zero-shot 📋

  • Actual Meaning: Zero-shot only within predefined candidate HOI classes
  • Limitation: Still unable to recognize completely new verb-object combinations
  • Constraint: Cannot detect interactions not defined in HOI taxonomy

3️⃣ Computational Cost of Two-stage Paradigm

  • Structure: Object Detection → HO Interactor → HOI Decoder → HOI Classifier
  • Problem: High computational cost from running the four modules sequentially
  • Comparison: Increased inference time compared to one-stage methods

4️⃣ Complexity of Multi-modal Alignment 🔄

  • Challenge: Difficulty in achieving optimal alignment between visual features and text embeddings
  • Temperature scaling: Performance sensitivity to τ value tuning
  • Feature space: Complexity in aligning embedding spaces across different modalities

5️⃣ Limitations in Fine-grained Interaction Distinction 🔍

  • Examples: "holding phone" vs "talking on phone", "sitting on chair" vs "standing near chair"
  • Cause: Difficulty in distinguishing subtle differences due to limitations in spatial relationship encoding
  • Impact: Performance degradation for HOIs that are visually similar but semantically different

✅ Summary

CLIP4HOI is an HOI detection method that combines DETR’s object detection capabilities with CLIP’s Vision-Language understanding.

🎯 Core Ideas

  • DETR + CLIP Integration: Combination of proven object detection + powerful semantic understanding
  • 3 New Components: Addition of HO Interactor, HOI Decoder, HOI Classifier
  • Text Template Utilization: Flexible classification by expressing HOI in natural language
  • Zero-shot Generalization: Inference of untrained HOI classes through text descriptions

Technical Innovations

  • Spatial + Semantic Information Integration: Simultaneous utilization of spatial relationships and semantic understanding
  • Global + Pairwise Scores: Consideration of both holistic and individual HOI scores
  • CLIP Representation Maximization: Application of pre-trained vision-language model capabilities to HOI

📌 Successfully integrating CLIP’s text-image matching capabilities into HOI detection!

🚀 Practical Value: Effective learning and inference of complex HOI relationships through natural language descriptions!



This post is licensed under CC BY 4.0 by the author.