
🧠 [CIReVL] Vision-by-Language for Training-Free Compositional Image Retrieval (ICLR 2024): the first Training-Free approach on CIRR

Following CIRCO (ICCV 2023), CIR research expands from training-based methods to a fully Training-Free approach!!



๐Ÿ” Research Background

From CIRR (ICCV 2021) → CIRCO (ICCV 2023), CIR research has significantly evolved.
However, there were still limitations:

  • Most approaches relied on training-based pipelines adapted to CIRR, FashionIQ, or CIRCO datasets.
  • Zero-Shot approaches (ZS-CIR) emerged, but they still depended on additionally trained components such as Textual Inversion modules.

Therefore, the authors proposed a fully Training-Free CIR framework.
Without any additional training, they perform image retrieval using only off-the-shelf VLMs (Vision-Language Models), an LLM, and language-level composition: this is exactly Vision-by-Language.


🧠 Main Contributions

  1. CIReVL: A Training-Free Zero-Shot CIR Framework

    • Utilizes only off-the-shelf pre-trained models (no Pretraining, Fine-tuning, or Textual Inversion).
    • Combines reference images and modification texts directly in the language space to form queries.
    • Achieves performance comparable to, or better than, training-based methods on the CIRR, FashionIQ, CIRCO, and CIRR-Extended benchmarks.
  2. Modularity & Language-Level Reasoning

    This is the exciting part!! Instead of incomprehensible vector space operations, the reasoning happens in the language domain, making the process easier to follow!

    • Since CIReVL processes queries in natural language, the retrieval process is interpretable.
    • Human users can understand intermediate reasoning steps and even intervene or edit query composition directly.
  3. Additional Studies (Ablation Analysis)

    • Extensive ablation studies on pipeline components.
    • Showed the importance of language-based reasoning for effective CIR.
    • Highlighted how the modular design makes CIReVL easily scalable and extendable.

🧠 CIReVL Framework Details


  • Core Idea
    • Image Captioning: Convert an image into a textual description understandable by a VLM.
    • Combine this caption with the modification instruction to form a final query sentence.
    • Encode this query with a VLM and compare it to image embeddings to retrieve results.
  1. From text embeddings to captions
    • Use BLIP, BLIP-2, or CoCa to caption the reference image.
  2. From templates to reasoning targets
    • An LLM (e.g., GPT-3.5/4, LLaMA) fuses the image caption and user modification instruction.
    • Produces a coherent edited description: essentially a textual specification of the target image.
    "I have an image. Given an instruction to edit the image, carefully generate a description of the edited image.

    I will put my image content beginning with 'Image Content:{image caption}'.
    The instruction I provide will begin with 'Instruction:{modification prompt}'.

    The edited description you generate should begin with 'Edited Description:'.
    Each time generate one instruction and one edited description only."
  3. Compositional image retrieval
    • Encode the edited description from step 2 and compare its embedding against the embeddings of all candidate images!! (See the pipeline sketch right after this list.)
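
To make the three stages concrete, here is a minimal Python sketch of a CIReVL-style pipeline assembled from off-the-shelf parts: BLIP for captioning (stage 1), an OpenAI chat model for language-level composition (stage 2), and CLIP for retrieval (stage 3). The specific model checkpoints, the prompt wording, and the helper names (caption_image, compose_query, retrieve) are illustrative assumptions, not the authors' exact configuration.

```python
# Hedged sketch of a CIReVL-style training-free CIR pipeline.
# Assumes: pip install torch transformers pillow openai, and OPENAI_API_KEY set.
import torch
from PIL import Image
from transformers import (BlipForConditionalGeneration, BlipProcessor,
                          CLIPModel, CLIPProcessor)
from openai import OpenAI

device = "cuda" if torch.cuda.is_available() else "cpu"

# Stage 1 (from text embeddings to captions): describe the reference image.
blip_proc = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
blip = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base").to(device)

def caption_image(image: Image.Image) -> str:
    inputs = blip_proc(images=image, return_tensors="pt").to(device)
    out = blip.generate(**inputs, max_new_tokens=40)
    return blip_proc.decode(out[0], skip_special_tokens=True)

# Stage 2 (from templates to reasoning targets): an LLM rewrites the caption
# according to the modification instruction, using the paper-style prompt.
PROMPT = (
    "I have an image. Given an instruction to edit the image, carefully "
    "generate a description of the edited image. I will put my image content "
    "beginning with 'Image Content:'. The instruction I provide will begin "
    "with 'Instruction:'. The edited description you generate should begin "
    "with 'Edited Description:'. Each time generate one instruction and one "
    "edited description only.\n"
    "Image Content: {caption}\nInstruction: {instruction}"
)
llm = OpenAI()

def compose_query(caption: str, instruction: str) -> str:
    resp = llm.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user",
                   "content": PROMPT.format(caption=caption,
                                            instruction=instruction)}],
    )
    text = resp.choices[0].message.content
    return text.removeprefix("Edited Description:").strip()

# Stage 3 (compositional image retrieval): embed the edited description with
# CLIP's text encoder and rank the gallery by cosine similarity.
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)

@torch.no_grad()
def retrieve(query: str, gallery: list[Image.Image], top_k: int = 5) -> list[int]:
    text_in = clip_proc(text=[query], return_tensors="pt",
                        padding=True, truncation=True).to(device)
    q = clip.get_text_features(**text_in)                # (1, d)
    img_in = clip_proc(images=gallery, return_tensors="pt").to(device)
    g = clip.get_image_features(**img_in)                # (N, d)
    sims = torch.nn.functional.cosine_similarity(q, g)   # (N,)
    return sims.topk(min(top_k, len(gallery))).indices.tolist()

# Usage (hypothetical files): caption the reference, compose, then rank.
# ref = Image.open("reference.jpg")
# gallery = [Image.open(p) for p in ["a.jpg", "b.jpg", "c.jpg"]]
# query = compose_query(caption_image(ref), "make the dog wear a red hat")
# print(retrieve(query, gallery))
```

Because each stage only talks to its neighbors through plain text or standard embeddings, every component can be swapped independently (e.g., BLIP-2 or CoCa for captioning, GPT-4 or LLaMA for composition, a larger CLIP for retrieval), which is exactly the modularity the paper emphasizes.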

🧠 Experimental Results


  • Despite being Training-Free, CIReVL achieves strong performance across benchmarks!!
    • FashionIQ
      • Achieves ~20% R@10 purely with Vision-by-Language composition.
      • Slightly lower than SEARLE, but with the major advantage of requiring no training.
    • CIRR / CIRCO
      • Training-Free approach reaches Recall levels comparable to training-based models.
      • Particularly on TIFA evaluation, CIReVL demonstrates competitive compositional faithfulness compared to training-based methods.


  • Ablation studies further confirm that as the underlying LLMs and VLMs improve, CIReVL's performance improves accordingly!

🧩 Conclusion

CIReVL (ICLR 2024) extends the paradigm of CIR research from "Training-based → Zero-Shot → Training-Free."

  • Proposes a completely Training-Free framework, achieving competitive results on CIRR, FashionIQ, and CIRCO without any training.
  • Enables interpretable retrieval through language-level reasoning, providing transparency and flexibility that allows direct human intervention.
  • Thanks to its modular design, individual components of the retrieval pipeline can easily be swapped or extended, and as stronger VLMs and LLMs emerge, CIReVL has clear potential for further improvement.

Therefore, CIReVL occupies a key position in the evolution from Zero-Shot CIR to Training-Free CIR,
and offers an important foundation for the future of compositional image retrieval research.

This post is licensed under CC BY 4.0 by the author.