🧠 CIRCO - Zero-Shot Composed Image Retrieval with Textual Inversion (ICCV 2023)

Posted Jul 29, 2025

By DrFirst

11 min read

🧠 (Korean) Zero-Shot Composed Image Retrieval with Textual Inversion!! CIRCO

This work releases a ZS-CIR model on top of prior CIR research, together with the CIRCO dataset!!

Title: Zero-Shot Composed Image Retrieval with Textual Inversion
Conference: ICCV 2023 (Baldrati et al.)
Code: CIRCO (GitHub)
Key Keywords: Composed Image Retrieval, CIRCO, Textual Inversion, Zero-Shot, ICCV 2023, ZS-CIR, SEARLE
Note!!: The terms are easy to confuse. Strictly, ZS-CIR names the task and SEARLE names the method, but in this post both refer to the model released in this paper!!
ZS-CIR is the abbreviation of Zero-Shot Composed Image Retrieval, and SEARLE stands for zero-Shot composEd imAge Retrieval with textuaL invErsion!!

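🔍 Background

CIRR (ICCV 2021) framed Composed Image Retrieval (CIR) on real-life images, but it relied on training-based methods and single ground-truth labels.
Real-world applications, however, ask for the following:

“Even in a new domain, without training (Zero-Shot),
find the desired image from a reference image + a text modification,
and allow multiple correct answers at the same time!”

To that end, this ICCV 2023 paper
releases a ZS-CIR model!! → a zero-shot CIR framework built on Textual Inversion
and also proposes CIRCO, a more realistic and fine-grained dataset.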

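🧠 Key Contributions

Zero-Shot CIR Framework (ZS-CIR)
- Uses Textual Inversion to embed the reference image as a new concept token
- Combines it with the modification text to form a composed query
- Applicable across domains without dataset-specific training
CIRCO Dataset
- Uses real-world images from COCO 2017
- Object-centric queries that involve multiple objects
- Reflects attribute changes + changes in object relationships in real scenes
Benchmark & Zero-Shot Performance
- Zero-shot performance verified on CIRR, FashionIQ, and CIRCO
- Meaningful performance without training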

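🧠 Key Contributions (Detailed!!)

1. Zero-Shot CIR Framework (ZS-CIR)

Problem
- On existing datasets such as CIRR and FashionIQ, most models had to be fine-tuned on the training data
- As a result, performance dropped sharply on new domains or unseen categories
Core Idea
- Bring the Textual Inversion technique into CIR
- Convert the reference image into a new token (embedding) → use it as if it were a “word”
- Combine it with the modification text → form the final “image + sentence” composed query
Advantages
- Retrieval works without additional training (Zero-Shot)
- General-purpose: not tied to one domain (fashion, everyday images, etc.)
- The inference pipeline is simple, which also keeps it efficient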

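2. CIRCO Dataset

Beyond being among the first works to make zero-shot composed retrieval practical,
this work also raised the quality of CIR evaluation on the dataset side by reflecting:
- Multiple ground truths
- Realistic images
- Complex query composition

Realism
- Based on MS-COCO 2017 images
- Reduces bias toward a single domain (e.g., fashion) and covers diverse objects, backgrounds, and relationships
Object-Centric
- Queries reflect attribute changes of specific objects, not just the “whole scene”
- Example: “Change the car in the photo to red, and replace the dog next to it with a cat.”
Multiple Ground Truths (Multi-Ground Truths)
- On average, 4.53 target images per query
- Overcomes the limitation of single-ground-truth structures such as FashionIQ
- Mitigates the false-negative problem → much fairer evaluation of retrieval models
Complex Queries
- Covers multiple objects and relationships between objects, not only attribute edits
- Goes beyond a simple “color change” and includes compound queries such as:
  - “A different person is standing where a person had been sitting”
  - “A cat is where the dog used to be”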

3. Benchmark & Zero-Shot Performance

Evaluation Datasets: zero-shot performance verified on major CIR benchmarks including CIRR, FashionIQ, and CIRCO
FashionIQ (Validation Set)
- SEARLE (B/32): Avg R@10 = 22.89, R@50 = 42.53
- SEARLE-XL (L/14): Avg R@10 = 25.56, R@50 = 46.23
- A clear improvement over prior methods; with the Base backbone, plain SEARLE was often better than OTI, which optimizes the prompt!!
CIRR (Test Set)
- SEARLE (B/32): Recall@1 = 24.27, Recall@5 = 53.22, Recall@10 = 66.82
- SEARLE-XL (L/14): Recall@1 = 24.22, Recall@5 = 52.48, Recall@10 = 66.29
- SEARLE performed better than the prompt-optimized variant!!
- On Subset Recall (a finer-grained evaluation), SEARLE-XL also reached Recall@3 = 88.19, a SOTA-level result
Significance
- Despite being a simple Zero-Shot approach, it is competitive with earlier training-based methods on FashionIQ and CIRR
- On CIRR in particular, Recall@1 exceeds 24%, giving meaningful retrieval quality without training
- This was an early demonstration that Zero-Shot CIR is feasible, and it laid the groundwork for Training-Free follow-ups such as CIReVL (ICLR 2024) and OSrCIR (CVPR 2025)

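🧩 Conclusion

CIRCO (ICCV 2023) proposed Textual Inversion-based zero-shot CIR (ZS-CIR)
and built a new dataset (CIRCO) with stronger realism and finer detail.
It became a starting point for CIR research to evolve from “training-based + single ground truth”
to “zero-shot + multiple ground truths + complex queries.”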

🧠 (English) Zero-Shot Composed Image Retrieval with Textual Inversion!! CIRCO

In this work, the authors released the ZS-CIR model and also introduced the CIRCO dataset!!

Title: Zero-Shot Composed Image Retrieval with Textual Inversion
Conference: ICCV 2023 (Baldrati et al.)
Code: CIRCO (GitHub)
Key Keywords: Composed Image Retrieval, CIRCO, Textual Inversion, Zero-Shot, ICCV 2023, ZS-CIR, SEARLE
Note!!: The terms are easy to mix up. Strictly, ZS-CIR names the task and SEARLE names the method, but in this post both refer to the model released in this paper.
ZS-CIR is the abbreviation of Zero-Shot Composed Image Retrieval, while SEARLE stands for zero-Shot composEd imAge Retrieval with textuaL invErsion.

🔍 Background

CIRR (ICCV 2021) framed Composed Image Retrieval (CIR) on real-life images, but it relied on training-based methods and single ground-truth labels.
Real-world use, however, calls for more:

“We need to retrieve images in unseen domains,
without additional training (Zero-Shot),
using reference images + textual modifications,
while allowing multiple correct answers!”

To meet these demands, the ICCV 2023 paper proposed:

ZS-CIR model → a Textual Inversion-based zero-shot CIR framework
CIRCO dataset → a more realistic and fine-grained benchmark

🧠 Key Contributions

Zero-Shot CIR Framework (ZS-CIR)
- Applied Textual Inversion to embed reference images as new concept tokens
- Combined with modification text to form composed queries
- Applicable across domains without dataset-specific training
CIRCO Dataset
- Based on COCO 2017 real-world images
- Object-centric queries including multiple objects
- Captures attribute changes + object relationships in natural scenes
Benchmark & Zero-Shot Performance
- Evaluated on CIRR, FashionIQ, and CIRCO
- Achieved meaningful zero-shot performance without additional training

🧠 Key Contributions (Detailed)

1. Zero-Shot CIR Framework (ZS-CIR)

Problem
- On earlier datasets such as CIRR and FashionIQ, most models had to be fine-tuned on the training data
- As a result, performance dropped sharply on unseen domains or categories
Core Idea
- Incorporate Textual Inversion into CIR
- Convert reference images into pseudo-word tokens (embeddings), treated like “words”
- Combine with modification text → final image+text composed query
Advantages
- Retrieval needs no additional training on the target dataset (Zero-Shot)
- Domain-agnostic: works across fashion, real-life, and beyond
- Simple inference pipeline with efficient retrieval
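
The composed-query idea above can be sketched in a few lines. This is a toy illustration only, not the authors' SEARLE code: the random word embeddings, the mean-pooling `toy_text_encoder`, the linear `toy_inversion` map, and the prompt template "a photo of $ that ..." are all assumptions standing in for a real vision-language model and a real inversion step.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 16  # toy embedding size (assumption)

# Hypothetical toy vocabulary; "$" is the placeholder for the pseudo-word.
VOCAB = ["a", "photo", "of", "$", "that", "is", "red"]
word_emb = {w: rng.normal(size=DIM) for w in VOCAB}


def l2_normalize(x):
    """Scale vectors to unit length along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def toy_inversion(image_feature, weight):
    """Stand-in for textual inversion: map an image feature to word-embedding space."""
    return weight @ image_feature


def toy_text_encoder(tokens, pseudo_word):
    """Mean-pool token embeddings, substituting the pseudo-word for '$'."""
    vecs = [pseudo_word if t == "$" else word_emb[t] for t in tokens]
    return l2_normalize(np.mean(vecs, axis=0))


def retrieve(query, gallery, top_k=3):
    """Rank gallery features by cosine similarity to the query."""
    scores = l2_normalize(gallery) @ query
    return np.argsort(-scores)[:top_k]


# Random stand-ins for a reference image feature and a 5-image gallery.
reference_feature = rng.normal(size=DIM)
inversion_weight = rng.normal(size=(DIM, DIM))
gallery_features = rng.normal(size=(5, DIM))

pseudo_word = toy_inversion(reference_feature, inversion_weight)
query = toy_text_encoder("a photo of $ that is red".split(), pseudo_word)
top = retrieve(query, gallery_features)
print(top)  # indices of the 3 most similar gallery images
```

The point of the sketch is the data flow: the reference image becomes one "word", the modification text supplies the rest of the sentence, and retrieval is ordinary text-to-image similarity search.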

2. CIRCO Dataset

Beyond being among the first works to make zero-shot composed retrieval practical,
this work also raises the quality of CIR evaluation by providing:
- Multiple ground truths
- Real-world images
- Complex queries

Realism
- Built on MS-COCO 2017
- Avoids domain bias (e.g., fashion-only) and covers diverse scenes, objects, and contexts
Object-Centric
- Queries reflect changes in specific objects, not only the global scene
- Example: “Change the car in the image to red, and replace the dog with a cat.”
Multiple Ground Truths
- On average, 4.53 target images per query
- Overcomes the single-ground-truth limitation of FashionIQ
- Mitigates the false-negative problem → fairer evaluation of retrieval systems
Complex Queries
- Includes not only attribute modifications but also multi-object and relational changes
- Goes beyond a simple “color change” and includes compound queries such as:
  - “A different person is standing where a person had been sitting”
  - “A cat is where the dog used to be”

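Why multiple ground truths matter for evaluation can be shown with a small worked example. The sketch below assumes a mean-average-precision style metric, with AP@K normalized by min(K, number of ground truths); the function names and image ids are hypothetical and chosen only for illustration.

```python
def average_precision_at_k(ranked_ids, ground_truths, k):
    """AP@K for one query with possibly several ground-truth images.

    Assumed normalization: divide by min(K, number of ground truths).
    """
    gts = set(ground_truths)
    if not gts:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked_ids[:k], start=1):
        if item in gts:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / min(k, len(gts))


def recall_at_k_single(ranked_ids, single_gt, k):
    """Recall@K when only one image is annotated as correct."""
    return float(single_gt in ranked_ids[:k])


# Hypothetical ranking returned by a retrieval model.
ranked = ["img7", "img2", "img9", "img4", "img1"]

# Multi-ground-truth annotation: three images satisfy the query.
multi_gt = ["img2", "img4", "img7"]
ap_multi = average_precision_at_k(ranked, multi_gt, k=5)

# Single-ground-truth annotation: only img4 was labeled, so the valid
# matches at ranks 1 and 2 are counted as misses (false negatives).
r1_single = recall_at_k_single(ranked, "img4", k=1)
ap_single = average_precision_at_k(ranked, ["img4"], k=5)

print(round(ap_multi, 4), r1_single, ap_single)
```

With the multi-ground-truth labels the ranking is scored (1/1 + 2/2 + 3/4) / 3, about 0.917, while the single-label view scores the same ranking 0.25 and reports Recall@1 = 0.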
3. Benchmark & Zero-Shot Performance

Evaluation Datasets: CIRR, FashionIQ, CIRCO
FashionIQ (Validation Set)
- SEARLE (B/32): Avg R@10 = 22.89, R@50 = 42.53
- SEARLE-XL (L/14): Avg R@10 = 25.56, R@50 = 46.23
- With the B/32 backbone, plain SEARLE often outperformed the OTI variant, which optimizes the pseudo-word prompt directly
CIRR (Test Set)
- SEARLE (B/32): Recall@1 = 24.27, Recall@5 = 53.22, Recall@10 = 66.82
- SEARLE-XL (L/14): Recall@1 = 24.22, Recall@5 = 52.48, Recall@10 = 66.29
- In some settings, SEARLE scored higher than the OTI variant with optimized prompts
- Subset Recall: SEARLE-XL reached Recall@3 = 88.19, achieving SOTA-level performance
Significance
- Even as a simple Zero-Shot approach, SEARLE is competitive with earlier training-based methods on FashionIQ and CIRR
- On CIRR, Recall@1 exceeded 24%, showing meaningful retrieval quality without training
- This was an early demonstration that Zero-Shot CIR is feasible, and it laid the groundwork for Training-Free follow-ups such as CIReVL (ICLR 2024) and OSrCIR (CVPR 2025)
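
For readers unfamiliar with the metrics quoted above, here is a minimal sketch of Recall@K and of a subset-style Recall@K, where the ranking is first restricted to a small candidate group. The helper names and toy rankings are hypothetical; this is not the official evaluation code of any benchmark.

```python
def recall_at_k(rankings, targets, k):
    """Percentage of queries whose target appears in the top-K results."""
    hits = [target in ranking[:k] for ranking, target in zip(rankings, targets)]
    return 100.0 * sum(hits) / len(hits)


def restrict_to_subset(ranking, subset):
    """Keep only candidates from the subset, preserving the ranking order."""
    allowed = set(subset)
    return [item for item in ranking if item in allowed]


# Toy rankings for four queries that all share the target "a".
rankings = [
    ["a", "b", "c", "d"],
    ["b", "a", "d", "c"],
    ["c", "d", "a", "b"],
    ["d", "c", "b", "a"],
]
targets = ["a", "a", "a", "a"]

r1 = recall_at_k(rankings, targets, k=1)
r2 = recall_at_k(rankings, targets, k=2)
r3 = recall_at_k(rankings, targets, k=3)

# Subset variant: each query is judged only against a small candidate group.
subsets = [["a", "b"]] * 4
subset_rankings = [restrict_to_subset(r, s) for r, s in zip(rankings, subsets)]
subset_r1 = recall_at_k(subset_rankings, targets, k=1)

print(r1, r2, r3, subset_r1)
```

Because the subset variant removes easy distractors from the candidate pool, its numbers are not comparable with full-gallery Recall@K, which is why the two are reported separately.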

🧩 Conclusion

This ICCV 2023 paper introduced Textual Inversion-based Zero-Shot CIR (ZS-CIR / SEARLE)
and established a new, more realistic dataset (CIRCO).
It marked a starting point for the shift in CIR from “training-based + single ground truth”
to “zero-shot + multiple ground truths + complex queries.”

AI, Research

This post is licensed under CC BY 4.0 by the author.