
🧠 [CIReVL] Vision-by-Language for Training-Free Compositional Image Retrieval (ICLR 2024): the first Training-Free approach on CIRR

Following CIRCO (ICCV 2023), CIR research expands from training-based methods to a fully Training-Free approach!!



๐Ÿ” Research Background

From CIRR (ICCV 2021) → CIRCO (ICCV 2023), CIR research has significantly evolved.
However, there were still limitations:

  • Most approaches relied on training-based pipelines adapted to CIRR, FashionIQ, or CIRCO datasets.
  • Zero-Shot approaches (ZS-CIR) emerged, but they still depended on additionally trained components such as Textual Inversion modules.

Therefore, the authors proposed a fully Training-Free CIR framework.
Without any additional training, they perform image retrieval using only off-the-shelf VLMs (Vision-Language Models), an LLM, and language-level composition: this is exactly Vision-by-Language.


🧠 Main Contributions

  1. CIReVL: A Training-Free Zero-Shot CIR Framework

    • Utilizes only off-the-shelf pre-trained models (no Pretraining, Fine-tuning, or Textual Inversion).
    • Combines reference images and modification texts directly in the language space to form queries.
    • Achieves performance comparable to, or better than, training-based methods on the CIRR, FashionIQ, CIRCO, and CIRR-Extended benchmarks.
  2. Modularity & Language-Level Reasoning

    This is the exciting part!! Instead of incomprehensible vector space operations, the reasoning happens in the language domain, making the process easier to follow!

    • Since CIReVL processes queries in natural language, the retrieval process is interpretable.
    • Human users can understand intermediate reasoning steps and even intervene or edit query composition directly.
  3. Additional Studies (Ablation Analysis)

    • Extensive ablation studies on pipeline components.
    • Showed the importance of language-based reasoning for effective CIR.
    • Highlighted how the modular design makes CIReVL easily scalable and extendable.

🧠 CIReVL Framework Details


  • Core Idea
    • Image Captioning: Convert an image into a textual description understandable by a VLM.
    • Combine this caption with the modification instruction to form a final query sentence.
    • Encode this query with a VLM and compare it to image embeddings to retrieve results.
  1. From text embeddings to captions
    • Use BLIP, BLIP-2, or CoCa to caption the reference image.
  2. From templates to reasoning targets
    • An LLM (e.g., GPT-3.5/4, LLaMA) fuses the image caption and user modification instruction.
    • Produces a coherent edited description: essentially a textual specification of the target image.
    "I have an image. Given an instruction to edit the image, carefully generate a description of the edited image.

    I will put my image content beginning with 'Image Content:{image caption}'.
    The instruction I provide will begin with 'Instruction:{modification prompt}'.

    The edited description you generate should begin with 'Edited Description:'.
    Each time generate one instruction and one edited description only."
  3. Compositional image retrieval
    • Encode the edited description from step 2 and compare its embedding against the embeddings of all candidate images!! (See the pipeline sketch right after this list.)
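
To make the three stages concrete, here is a minimal Python sketch of a CIReVL-style pipeline assembled from off-the-shelf parts: BLIP for captioning (stage 1), an OpenAI chat model for language-level composition (stage 2), and CLIP for retrieval (stage 3). The specific model checkpoints, the prompt wording, and the helper names (caption_image, compose_query, retrieve) are illustrative assumptions, not the authors' exact configuration.

```python
# Hedged sketch of a CIReVL-style training-free CIR pipeline.
# Assumes: pip install torch transformers pillow openai, and OPENAI_API_KEY set.
import torch
from PIL import Image
from transformers import (BlipForConditionalGeneration, BlipProcessor,
                          CLIPModel, CLIPProcessor)
from openai import OpenAI

device = "cuda" if torch.cuda.is_available() else "cpu"

# Stage 1 (from text embeddings to captions): describe the reference image.
blip_proc = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
blip = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base").to(device)

def caption_image(image: Image.Image) -> str:
    inputs = blip_proc(images=image, return_tensors="pt").to(device)
    out = blip.generate(**inputs, max_new_tokens=40)
    return blip_proc.decode(out[0], skip_special_tokens=True)

# Stage 2 (from templates to reasoning targets): an LLM rewrites the caption
# according to the modification instruction, using the paper-style prompt.
PROMPT = (
    "I have an image. Given an instruction to edit the image, carefully "
    "generate a description of the edited image. I will put my image content "
    "beginning with 'Image Content:'. The instruction I provide will begin "
    "with 'Instruction:'. The edited description you generate should begin "
    "with 'Edited Description:'. Each time generate one instruction and one "
    "edited description only.\n"
    "Image Content: {caption}\nInstruction: {instruction}"
)
llm = OpenAI()

def compose_query(caption: str, instruction: str) -> str:
    resp = llm.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user",
                   "content": PROMPT.format(caption=caption,
                                            instruction=instruction)}],
    )
    text = resp.choices[0].message.content
    return text.removeprefix("Edited Description:").strip()

# Stage 3 (compositional image retrieval): embed the edited description with
# CLIP's text encoder and rank the gallery by cosine similarity.
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)

@torch.no_grad()
def retrieve(query: str, gallery: list[Image.Image], top_k: int = 5) -> list[int]:
    text_in = clip_proc(text=[query], return_tensors="pt",
                        padding=True, truncation=True).to(device)
    q = clip.get_text_features(**text_in)                # (1, d)
    img_in = clip_proc(images=gallery, return_tensors="pt").to(device)
    g = clip.get_image_features(**img_in)                # (N, d)
    sims = torch.nn.functional.cosine_similarity(q, g)   # (N,)
    return sims.topk(min(top_k, len(gallery))).indices.tolist()

# Usage (hypothetical files): caption the reference, compose, then rank.
# ref = Image.open("reference.jpg")
# gallery = [Image.open(p) for p in ["a.jpg", "b.jpg", "c.jpg"]]
# query = compose_query(caption_image(ref), "make the dog wear a red hat")
# print(retrieve(query, gallery))
```

Because each stage only talks to its neighbors through plain text or standard embeddings, every component can be swapped independently (e.g., BLIP-2 or CoCa for captioning, GPT-4 or LLaMA for composition, a larger CLIP for retrieval), which is exactly the modularity the paper emphasizes.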

🧠 Experimental Results


  • Despite being Training-Free, CIReVL achieves strong performance across benchmarks!!
    • FashionIQ
      • Achieves ~20% R@10 purely with Vision-by-Language composition.
      • Slightly lower than SEARLE, but with the major advantage of requiring no training.
    • CIRR / CIRCO
      • Training-Free approach reaches Recall levels comparable to training-based models.
      • Particularly on TIFA evaluation, CIReVL demonstrates competitive compositional faithfulness compared to training-based methods.


  • Ablation studies further confirm that as the underlying LLMs and VLMs improve, CIReVL's performance improves accordingly!

🧩 Conclusion

CIReVL (ICLR 2024) extends the paradigm of CIR research from "Training-based → Zero-Shot → Training-Free."

  • Proposes a completely Training-Free framework, achieving competitive results on CIRR, FashionIQ, and CIRCO without any training.
  • Enables interpretable retrieval through language-level reasoning, providing transparency and flexibility that allows direct human intervention.
  • Thanks to its modular design, individual components of the retrieval pipeline can easily be swapped or extended, and as stronger VLMs and LLMs emerge, CIReVL has clear potential for further improvement.

Therefore, CIReVL occupies a key position in the evolution from Zero-Shot CIR to Training-Free CIR,
and offers an important foundation for the future of compositional image retrieval research.

This post is licensed under CC BY 4.0 by the author.