
🧠 CIR - Composed Image Retrieval on Real-life Images: The Starting Point of Image Retrieval Research!!


🧠 Image + Text Based Composed Image Retrieval!! CIRR



๐Ÿ” Research Background

Traditional image retrieval generally relies on a text query or image query (single modality).
However, in real-world scenarios, users often want to provide compositional information such as:

"I want an image similar to this one, but with a different background."
→ Reference Image + Modification Text = Composed Image Retrieval (CIR)
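To make the query format concrete, here is a minimal, illustrative sketch of composed retrieval: embed the reference image and the modification text, fuse them into a single query vector, and rank a gallery by cosine similarity. The encoders and the simple additive fusion below are placeholders for a CLIP-style model and a learned fusion module, not the method of any specific paper.

```python
# Illustrative sketch only: placeholder encoders + naive additive fusion.
import numpy as np

def embed_image(path: str) -> np.ndarray:
    """Placeholder for a pretrained image encoder; returns an L2-normalized vector."""
    v = np.random.default_rng(abs(hash(path)) % 2**32).standard_normal(512)
    return v / np.linalg.norm(v)

def embed_text(text: str) -> np.ndarray:
    """Placeholder for a pretrained text encoder; returns an L2-normalized vector."""
    v = np.random.default_rng(abs(hash(text)) % 2**32).standard_normal(512)
    return v / np.linalg.norm(v)

def compose(ref_img: np.ndarray, mod_text: np.ndarray) -> np.ndarray:
    """Toy fusion: add the two modalities (real CIR models learn this step)."""
    q = ref_img + mod_text
    return q / np.linalg.norm(q)

# Query = reference image + modification text
query = compose(embed_image("reference.jpg"),
                embed_text("similar to this, but with a different background"))

# Rank candidate images by cosine similarity to the composed query
gallery = ["cand_01.jpg", "cand_02.jpg", "cand_03.jpg"]
ranked = sorted(gallery, key=lambda p: float(query @ embed_image(p)), reverse=True)
print(ranked)
```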


🧠 Main Contributions

  1. Definition of CIR Task

    • Clearly defines Composed Image Retrieval (CIR)
    • Query: Reference Image + Textual Modification
    • Goal: Retrieve the image that matches the composed query
  2. CIRR Dataset Proposal

    • Large-scale benchmark for real-life CIR
    • Over 21,000 query–target pairs
    • Includes natural scenes, object diversity, and complex textual expressions
  3. Evaluation Set Design

    • Fine-grained distractors: semantically similar images increase retrieval difficulty
    • Multiple reference forms: ensures diversity at instance-level and scene-level
  4. Benchmarking Existing Methods

    • Evaluates representative methods such as TIRG, FiLM, and MAAF
    • Demonstrates the difficulty and realism of the CIRR dataset (a minimal Recall@K sketch follows this list)
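These benchmarks are reported with Recall@K: the fraction of queries whose true target appears among the top-K retrieved images. Below is a minimal sketch of the metric; the optional `subset` mask is an illustrative stand-in for CIRR's fine-grained distractor subsets, not the paper's exact evaluation code.

```python
# Minimal Recall@K sketch (illustrative, not the official CIRR evaluation script).
import numpy as np

def recall_at_k(similarities, target_idx, k, subset=None):
    """similarities: (num_queries, num_gallery) score matrix.
    target_idx: index of the true target image for each query.
    subset: optional boolean mask restricting each query to its distractor subset."""
    scores = similarities.astype(float).copy()
    if subset is not None:
        scores[~subset] = -np.inf          # ignore images outside the query's subset
    topk = np.argsort(-scores, axis=1)[:, :k]
    hits = (topk == np.asarray(target_idx)[:, None]).any(axis=1)
    return float(hits.mean())

# Toy usage: 4 queries against a gallery of 10 images
sims = np.random.rand(4, 10)
targets = [3, 7, 0, 5]
print(f"Recall@5 = {recall_at_k(sims, targets, k=5):.2%}")
```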

🧠 CIRR Dataset: Real-life Composed Retrieval

CIRR (Composed Image Retrieval on Real-life Images) is the first major benchmark for composed image retrieval (CIR).
Its goal is to understand user intent expressed as "reference image + modification text."

  • Generality: Includes a wide variety of real-world scenes and objects (not domain-specific like fashion).
  • Scale: ~17,000 images and 21,000 query–target pairs.
  • Difficulty: Contains semantic distractors (very similar images), requiring models to capture precise modification intent.

This dataset formally defines CIR and sets a realistic benchmark, becoming the foundation for subsequent research.
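For intuition, a single query–target pair can be pictured as a small record like the one below. The field names are hypothetical placeholders chosen for readability; consult the official CIRR release for the actual annotation schema.

```python
# Hypothetical illustration of one CIRR-style annotation (field names are NOT the real schema).
example_pair = {
    "reference_image": "dev-0001-img0.png",        # hypothetical file name
    "modification_text": "the same dog, but outdoors on grass",
    "target_image": "dev-0001-img1.png",           # hypothetical file name
    "distractor_subset": ["dev-0002-img0.png",     # semantically similar images that make
                          "dev-0003-img0.png"],    # the ranking task harder
}

# A CIR model must rank `target_image` above every other gallery image,
# including the near-duplicates listed in `distractor_subset`.
print(example_pair["modification_text"])
```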


📷 Example Query in CIRR

(Figure: cirr_example)

  • Reference Image: A woman sitting on a bench
  • Text Modification: "The same woman is standing, wearing different clothes."
  • Target Image: The real-life image that satisfies the condition

🧠 Significance of CIRR Dataset

CIRR focused on defining a new problem and demonstrating its difficulty.
The authors benchmarked existing models (TIRG, FiLM, MAAF) on the dataset,
showing their limitations on complex queries and motivating follow-up solutions.


Existing Models and Their Limitations

| Model | Key Idea | Role in CIRR |
| --- | --- | --- |
| TIRG | Combines image and text with residual gating | Tests text-guided image modification |
| FiLM | Feature-wise linear modulation based on text | Reveals limitations on simple compositional queries |
| MAAF | Modality-aware attention fusion | Explores handling of complex multimodal queries |

  • Results:

| Method | Recall@1 | Recall@5 |
| --- | --- | --- |
| TIRG | 20.1% | 47.6% |
| FiLM | 18.4% | 44.1% |
| MAAF | 22.0% | 49.2% |

  • All models show relatively low performance
  • Limitations exposed in complex queries
  • CIRR task is inherently difficult
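For context on what these baselines compute, below is a simplified, single-vector sketch of TIRG-style residual gating: a gate decides which parts of the reference-image feature to keep, and a residual branch injects the change described by the text. Layer sizes and the non-spatial formulation are simplifications for illustration, not the original implementation.

```python
# Simplified TIRG-style fusion sketch (illustrative; not the authors' code).
import torch
import torch.nn as nn

class TIRGFusion(nn.Module):
    def __init__(self, dim: int = 512):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                  nn.Linear(dim, dim), nn.Sigmoid())
        self.residual = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                      nn.Linear(dim, dim))
        self.w = nn.Parameter(torch.tensor([1.0, 0.1]))  # balance gated vs. residual branch

    def forward(self, img_feat, txt_feat):
        x = torch.cat([img_feat, txt_feat], dim=-1)
        gated = self.gate(x) * img_feat   # keep/suppress parts of the reference image
        res = self.residual(x)            # inject the modification described by the text
        return self.w[0] * gated + self.w[1] * res

fusion = TIRGFusion(dim=512)
composed = fusion(torch.randn(4, 512), torch.randn(4, 512))
print(composed.shape)  # torch.Size([4, 512])
```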

  • Conclusion & Impact
    • Defined a new task + provided a realistic benchmark + proved limitations of existing methods
    • Inspired follow-up datasets and methods (CIRCO, FashionIQ, CIRPL, etc.)
    • A landmark study that initiated CIR research


Timeline of Major Follow-up Research after CIRR (2021–2025)

In the future, we will explore these studies in more detail!!

2021 — CIRR (ICCV 2021)

  • Contribution: First to define the Composed Image Retrieval (CIR) task and release the CIRR dataset
  • Significance: Demonstrated limitations of existing models → Sparked follow-up research

2022 — CIRPLANT

  • Contribution: Proposed a dedicated model architecture for CIR
  • Idea: Gradually fused image features with textual modifications to better represent intended changes
  • Significance: One of the first attempts to tackle CIR challenges through dedicated modeling

2022 — CIRCO (ECCV 2022)

  • Contribution: Expanded dataset for object-centric composed retrieval
  • Idea: Allowed text modifications to apply to individual objects in the reference image
  • Significance: Provided a more fine-grained benchmark

2023 — CIRPL

  • Contribution: Proposed Language-guided Pretraining for CIR
  • Idea: Adapted large-scale multimodal pretraining models to CIR → Improved performance
  • Significance: Connected CIR research with the trend of MLLMs

2024 — CIReVL (Vision-by-Language, ICLR 2024)

  • Contribution: Introduced a training-free CIR model
  • Idea: Used a VLM for image captioning → an LLM for caption rewriting → CLIP-based retrieval (sketched below)
  • Significance: Scalable and interpretable modular design for zero-shot CIR
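A rough sketch of this training-free, modular pipeline is shown below; the captioning, rewriting, and encoder functions are placeholders for whichever VLM, LLM, and CLIP-style model is plugged in, not the paper's actual code.

```python
# Illustrative caption -> rewrite -> retrieve pipeline (all components are placeholders).
import numpy as np

def caption_with_vlm(image_path: str) -> str:
    return "a woman sitting on a bench in a park"            # stand-in VLM output

def rewrite_with_llm(caption: str, modification: str) -> str:
    return f"{caption}, changed so that {modification}"        # a real system would prompt an LLM

def embed_text(text: str) -> np.ndarray:                       # stand-in CLIP text encoder
    v = np.random.default_rng(abs(hash(text)) % 2**32).standard_normal(512)
    return v / np.linalg.norm(v)

def embed_image(path: str) -> np.ndarray:                      # stand-in CLIP image encoder
    v = np.random.default_rng(abs(hash(path)) % 2**32).standard_normal(512)
    return v / np.linalg.norm(v)

target_caption = rewrite_with_llm(
    caption_with_vlm("reference.jpg"),
    "the same woman is standing and wearing different clothes")
query = embed_text(target_caption)                             # retrieval is purely text-to-image
best = max(["a.jpg", "b.jpg", "c.jpg"], key=lambda p: float(query @ embed_image(p)))
print(target_caption, "->", best)
```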

2024 — Contrastive Scaling (arXiv 2024)

  • Contribution: Proposed a contrastive scaling strategy to expand positive/negative samples
  • Idea: Generated triplets via MLLMs → Two-stage fine-tuning for better CIR performance (a minimal contrastive-loss sketch follows)
  • Significance: Improved results even in low-resource settings
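The fine-tuning in this line of work typically optimizes a batch-wise contrastive (InfoNCE-style) objective between composed-query embeddings and target-image embeddings, with MLLM-generated triplets supplying extra positives and negatives. A minimal sketch of the loss (triplet generation omitted):

```python
# Minimal InfoNCE-style contrastive loss over (composed query, target) pairs.
import torch
import torch.nn.functional as F

def contrastive_loss(query_emb, target_emb, temperature=0.07):
    """query_emb, target_emb: (batch, dim) L2-normalized embeddings; row i matches row i."""
    logits = query_emb @ target_emb.t() / temperature   # (batch, batch) similarity matrix
    labels = torch.arange(query_emb.size(0))             # diagonal entries are the positives
    return F.cross_entropy(logits, labels)

q = F.normalize(torch.randn(8, 512), dim=-1)   # composed (image + text) query embeddings
t = F.normalize(torch.randn(8, 512), dim=-1)   # target image embeddings
print(contrastive_loss(q, t).item())
```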

2025 — ConText-CIR (CVPR 2025)

  • Contribution: Proposed a new framework with Text Concept-Consistency Loss
  • Idea: Ensured that noun phrases in the modification text align with the correct parts of the reference image, using a synthetic data pipeline
  • Significance: Achieved state-of-the-art in both supervised and zero-shot settings

2025 — OSrCIR (CVPR 2025 Highlight)

  • Contribution: Introduced a training-free, one-stage reflective CoT-based zero-shot CIR model
  • Idea: Replaced two-stage pipeline with one-stage multimodal Chain-of-Thought reasoning using MLLMs
  • Significance: Preserved visual information better, improved performance by 1.8–6.44%, and enhanced interpretability (an illustrative one-stage prompt follows below)
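To illustrate the one-stage idea, a single MLLM call can be prompted to look at the reference image together with the modification text and reason directly to a target description that is then used for retrieval. The prompt below is only a hypothetical sketch, not the paper's actual prompt.

```python
# Hypothetical one-stage, Chain-of-Thought style prompt for zero-shot CIR (illustrative only).
ONE_STAGE_COT_PROMPT = """You are given a reference image and an edit instruction.
Edit instruction: "{modification}"

Think step by step:
1. Describe the salient content of the reference image.
2. Decide which attributes the instruction keeps, changes, or removes.
3. Write one sentence describing the target image after the edit.

Answer with only the final sentence."""

print(ONE_STAGE_COT_PROMPT.format(
    modification="the same woman is standing, wearing different clothes"))
```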

2025 — COR (Composed Object Retrieval) + CORE

  • Contribution: Proposed a new object-level retrieval task and dataset
  • Idea: Instead of retrieving the whole image, it focuses on retrieving/segmenting a specific object guided by reference object + text
  • Significance: Established a foundation for fine-grained multimodal object retrieval

Overall Summary

  • 2021: CIRR → Task definition + dataset release
  • 2022: CIRPLANT → Model design; CIRCO → Fine-grained dataset
  • 2023: CIRPL → Connection with large-scale pretraining
  • 2024: CIReVL → Training-free CIR; Contrastive Scaling → Expanded contrastive learning
  • 2025: ConText-CIR → Concept consistency loss; OSrCIR → One-stage reasoning;
          COR/CORE → Object-level retrieval and segmentation

This trajectory shows how CIR research has become increasingly sophisticated, enhancing both expressiveness and real-world applicability.


🧩 Conclusion

CIRR (ICCV 2021) was the first work to formally define the CIR task and provide a realistic benchmark!!
Subsequent datasets and methods such as CIRCO, FashionIQ, and CIRPL have all built upon this foundation.



This post is licensed under CC BY 4.0 by the author.