Training-Free Compositional Image Retrieval!! CIReVL (ICLR 2024): The First Training-Free Approach on CIRR
Following CIRCO (ICCV 2023), CIR research now moves from training-based methods to a completely Training-Free approach!!
- Title: Vision-by-Language for Training-Free Compositional Image Retrieval
- Conference: ICLR 2024 (Karthik et al.)
- Code: Vision-by-Language (GitHub)
- Keywords: Composed Image Retrieval, Training-Free, Vision-by-Language, TIFA, ICLR 2024, ZS-CIR
Research Background
From CIRR (ICCV 2021) to CIRCO (ICCV 2023), CIR research has evolved significantly.
However, there were still limitations:
- Most approaches relied on training-based pipelines adapted to CIRR, FashionIQ, or CIRCO datasets.
- Zero-Shot approaches (ZS-CIR) emerged, but they still depended on separately trained components such as Textual Inversion modules.
Therefore, the authors proposed a fully Training-Free CIR framework.
Without any additional training, the framework performs image retrieval using only off-the-shelf VLMs (Vision-Language Models) and language-level composition; this is exactly what Vision-by-Language means.
Main Contributions
CIReVL: A Training-Free Zero-Shot CIR Framework
- Utilizes only off-the-shelf pre-trained models (no additional pretraining, fine-tuning, or Textual Inversion training).
- Combines reference images and modification texts directly in the language space to form queries.
- Achieves performance comparable to or better than training-based methods on the CIRR, FashionIQ, CIRCO, and CIRR-Extended benchmarks.
Modularity & Language-Level Reasoning
This is the exciting part!! Instead of incomprehensible vector space operations, the reasoning happens in the language domain, making the process easier to follow!
- Since CIReVL processes queries in natural language, the retrieval process is interpretable.
- Human users can understand intermediate reasoning steps and even intervene or edit query composition directly.
Additional Studies (Ablation Analysis)
- Extensive ablation studies on pipeline components.
- Showed the importance of language-based reasoning for effective CIR.
- Highlighted how the modular design makes CIReVL easily scalable and extendable.
CIReVL Framework Details
- Core Idea
- Image Captioning: Convert an image into a textual description understandable by a VLM.
- Combine this caption with the modification instruction to form a final query sentence.
- Encode this query with a VLM and compare it to the candidate image embeddings to retrieve results (an end-to-end code sketch follows at the end of this section).
- From text embeddings to captions
- Use BLIP, BLIP-2, or CoCa to caption the reference image.
- From templates to reasoning targets
- An LLM (e.g., GPT-3.5/4, LLaMA) fuses the image caption and user modification instruction.
- Produces a coherent edited description, essentially a textual specification of the target image.
"I have an image. Given an instruction to edit the image, carefully generate a description of the edited image. I will put my image content beginning with โImage Content:{image caption}โ. The instruction I provide will begin with โInstruction:{modification prompt}โ. The edited description you generate should begin with โEdited Description:โ. Each time generate one instruction and one edited description only."
- Compositional image retrieval.
- Encode the edited description from the previous step with the VLM's text encoder and compare it against the embeddings of all candidate images!!
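Putting the pieces together, here is a rough end-to-end sketch of the pipeline using off-the-shelf Hugging Face checkpoints: BLIP for captioning and CLIP for the final text-to-image matching. The specific checkpoints, the reuse of the hypothetical `recompose_query` helper from the previous snippet, and the ranking code are assumptions for illustration, not the official repository implementation.

```python
import torch
from PIL import Image
from transformers import (
    BlipProcessor, BlipForConditionalGeneration,
    CLIPProcessor, CLIPModel,
)

device = "cuda" if torch.cuda.is_available() else "cpu"

# Step 1: caption the reference image with an off-the-shelf captioner (BLIP here).
blip_proc = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
blip = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base"
).to(device)

def caption_image(image: Image.Image) -> str:
    inputs = blip_proc(images=image, return_tensors="pt").to(device)
    out = blip.generate(**inputs, max_new_tokens=30)
    return blip_proc.decode(out[0], skip_special_tokens=True)

# Step 3: embed the recomposed text query and every candidate image with CLIP,
# then rank candidates by cosine similarity (higher = better match).
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)

@torch.no_grad()
def rank_candidates(query_text: str, candidate_images: list[Image.Image]) -> list[int]:
    text_inputs = clip_proc(
        text=[query_text], return_tensors="pt", padding=True, truncation=True
    ).to(device)
    image_inputs = clip_proc(images=candidate_images, return_tensors="pt").to(device)
    text_emb = clip.get_text_features(**text_inputs)
    image_embs = clip.get_image_features(**image_inputs)
    # Normalize so the dot product equals cosine similarity.
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    image_embs = image_embs / image_embs.norm(dim=-1, keepdim=True)
    sims = (image_embs @ text_emb.T).squeeze(-1)
    return sims.argsort(descending=True).tolist()  # best-matching candidates first

# Example (paths are placeholders):
# reference = Image.open("reference.jpg")
# gallery = [Image.open(p) for p in ["cand1.jpg", "cand2.jpg"]]
# query = recompose_query(caption_image(reference), "make the jacket red")
# ranking = rank_candidates(query, gallery)
```

Because each stage is just a swappable module (captioner, LLM, retrieval encoder), upgrading any one of them requires no retraining, which is exactly the modularity argument made above.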
Experimental Results
- Despite being Training-Free, CIReVL achieves strong performance across benchmarks!!
- FashionIQ
- Achieves R@10 scores in the 20% range purely with Vision-by-Language composition (see the metric sketch after this list).
- Slightly lower than SEARLE, but with the major advantage of requiring no training.
- CIRR / CIRCO
- Training-Free approach reaches Recall levels comparable to training-based models.
- Particularly on TIFA evaluation, CIReVL demonstrates competitive compositional faithfulness compared to training-based methods.
- Ablation studies further confirm that as the underlying LLMs and VLMs improve, CIReVL's performance improves accordingly!
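For reference, the Recall@K numbers quoted above simply measure how often the ground-truth target image appears among the top-K retrieved candidates. Below is a minimal sketch of the metric (not the official evaluation script).

```python
def recall_at_k(rankings: list[list[int]], targets: list[int], k: int = 10) -> float:
    """Fraction of queries whose ground-truth target appears in the top-k results.

    rankings: for each query, candidate indices sorted from best to worst match.
    targets:  for each query, the index of the ground-truth target image.
    """
    hits = sum(target in ranking[:k] for ranking, target in zip(rankings, targets))
    return hits / len(targets)

# Example: 2 of 3 queries rank their target within the top-10 -> R@10 = 0.67
# recall_at_k([[3, 7, 1], [5, 2, 9], [8, 4, 0]], targets=[7, 9, 6], k=10)
```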
Conclusion
CIReVL (ICLR 2024) extends the paradigm of CIR research from "Training-based → Zero-Shot → Training-Free."
- Proposes a completely Training-Free framework, achieving competitive results on CIRR, FashionIQ, and CIRCO without any training.
- Enables interpretable retrieval through language-level reasoning, providing transparency and flexibility that allows direct human intervention.
- Thanks to its modular design, the components of the retrieval pipeline can easily be swapped or extended, and as stronger VLMs/LLMs appear, CIReVL has clear potential for further improvements.
Therefore, CIReVL occupies a key position in the evolution from Zero-Shot CIR to Training-Free CIR,
and offers an important foundation for the future of compositional image retrieval research.