FashionIQ - Fashion Image Retrieval with Natural Language: A New Standard
- Title: Fashion IQ: A New Dataset Towards Retrieving Images by Natural Language Feedback
- Conference: CVPR 2021 (Wu et al.)
- Code: FashionIQ (GitHub)
- Keywords: Image Retrieval, Natural Language, FashionIQ Dataset, Composed Image Retrieval
Research Background
Traditional Image Retrieval has been a research topic for a long time. The core goal is to find the most similar image to a given query.
However, Composed Image Retrieval (CIR), the newer field that the FashionIQ paper addresses, works differently. Instead of a single modality, the query combines two: a reference image and a text modification. The goal isn't just to "find this image," but to "find an image like this, but modified in this way."
"Find a dress similar to this, but with longer sleeves." → Reference Image + Text Modification = Composed Fashion Image Retrieval
This type of natural language-based fashion search was a challenging task for existing methods.
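To make the "reference image + text modification" idea concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the 2-D feature vectors are hand-written, the gallery ids are hypothetical, and `compose` is a simple vector addition standing in for the learned composition function that real CIR models use.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def compose(image_vec, text_vec):
    """Toy composition: add the text 'modification' to the reference image vector.
    Assumption for illustration only; real CIR models learn this function."""
    return [a + b for a, b in zip(image_vec, text_vec)]

def retrieve(query_vec, gallery):
    """Return gallery ids sorted from most to least similar to the query."""
    return sorted(gallery, key=lambda gid: cosine(query_vec, gallery[gid]), reverse=True)

# Hypothetical 2-D features: axis 0 = "dress-ness", axis 1 = "sleeve length".
reference = [1.0, 0.2]        # a short-sleeved dress
modification = [0.0, 0.8]     # "but with longer sleeves"
gallery = {
    "short_sleeve_dress": [1.0, 0.2],
    "long_sleeve_dress": [1.0, 1.0],
    "sneakers": [-1.0, 0.0],
}

ranking = retrieve(compose(reference, modification), gallery)
print(ranking)  # the long-sleeved dress ranks first
```

With the image alone, the short-sleeved dress would be the top match; adding the text vector shifts the query toward the long-sleeved one, which is the behavior CIR is after.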
Key Contributions
A pioneering work in Composed Image Retrieval!
Establishing the Fashion-specific CIR Task
- The paper clearly defined the concept of Fashion Composed Image Retrieval.
- The goal is to capture subtle stylistic changes requested by users.
Introducing the FashionIQ Dataset
- This is a large-scale benchmark for fashion image retrieval, with about 77,000 images and roughly 60,000 relative captions.
- The quality of the queries is high because the natural language text was written by professional fashion designers.
- It covers three clothing categories: dresses, shirts, and tops.
A Custom Query Structure for Fashion
- The text queries include specific modifications related to fashion, such as color, material, pattern, and design.
- Example: "Add a belt," "Change from stripes to polka dots."
Comparing Performance with Existing Methods
- The authors evaluated the performance of various state-of-the-art models like CLIP, TIRG, and FiLM on the FashionIQ dataset.
- They demonstrated that due to the specific nature of the fashion domain, existing models still have much room for improvement, thus establishing a new direction for fashion AI research.
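Comparisons like this need a retrieval metric. The text above does not say which one was used; Recall@K is a common choice for retrieval benchmarks and is assumed here. The sketch below uses hypothetical image ids and hand-written rankings, not results from the paper.

```python
def recall_at_k(rankings, targets, k):
    """Fraction of queries whose target image appears in the top-k results."""
    hits = sum(1 for ranked, target in zip(rankings, targets) if target in ranked[:k])
    return hits / len(targets)

# One ranked result list per query (hypothetical ids, best match first).
rankings = [
    ["img_3", "img_1", "img_7"],
    ["img_2", "img_9", "img_4"],
    ["img_5", "img_6", "img_8"],
    ["img_1", "img_4", "img_2"],
]
# Ground-truth target image for each query.
targets = ["img_1", "img_2", "img_8", "img_9"]

r1 = recall_at_k(rankings, targets, 1)
r3 = recall_at_k(rankings, targets, 3)
print(r1, r3)  # 0.25 0.75
```

Only the second query has its target at rank 1, while three of the four targets appear somewhere in the top 3, hence 0.25 and 0.75.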
The FashionIQ Dataset: "Reference Image + Text" Fashion Search
The FashionIQ dataset is a large-scale fashion dataset specifically for Composed Image Retrieval (CIR). Unlike traditional image retrieval, its goal is to understand a user's complex intent through a combination of a "reference image" and "natural language text" to find the desired fashion item.
Key Features
- Scale: Consists of about 77,000 fashion images and roughly 60,000 relative captions.
- Query Composition: Each query is made up of a reference image and two natural language texts that describe the visual difference between the reference and the target image.
- Professionalism: The text queries were written by professional fashion designers, reflecting granular and realistic fashion attributes like "shorten the sleeves" or "change the color to black."
- Categories: It includes a wide range of clothing categories, such as women's dresses, tops, and men's shirts.
This dataset was a major contribution, as it pushed the development of AI models to go beyond simple image search and understand user intent. This is why it's frequently used as a reference not just in fashion, but in the broader Composed Image Retrieval field.
FashionIQ Query Example
- Reference Image: A long-sleeved red dress
- Text Modification: "Change color to black and make sleeves shorter"
- Target Image: A short-sleeved black dress
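The example above can be written down as a small record type. This is only a sketch: the class name `ComposedQuery`, its field names, and the file names are hypothetical and are not taken from the dataset's actual annotation files. The two captions mirror the "two natural language texts per query" structure described earlier.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class ComposedQuery:
    reference_image: str   # id or path of the reference image (hypothetical naming)
    captions: List[str]    # natural language texts describing the desired change
    target_image: str      # id or path of the ground-truth target image

    def text(self, sep: str = " and ") -> str:
        """Join the captions into a single modification string."""
        return sep.join(c.strip() for c in self.captions)

query = ComposedQuery(
    reference_image="red_long_sleeve_dress.jpg",
    captions=["change color to black", "make sleeves shorter"],
    target_image="black_short_sleeve_dress.jpg",
)
print(query.text())  # change color to black and make sleeves shorter
```

Joining the captions with "and" is one simple way to feed both texts to a model as a single string; other choices (using each caption separately, or a different separator) are equally plausible.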
FashionIQ Model Learning Structure Summary
The FashionIQ paper uses the following learning structure for Composed Image Retrieval:
- Step 1: Encoding with Pre-trained Models
- Images: It uses the EfficientNet-b7 model, pre-trained on the DeepFashion dataset, to extract image features.
- Text: It uses GloVe embeddings, pre-trained on a large external text corpus, to convert words into vectors.
- Step 2: Integrated Learning with a Custom Transformer
- The image features and text embeddings from Step 1 are fed into a newly designed 6-layer transformer that combines the two modalities.
- This multimodal transformer learns the relationship between the image and text to generate the final query vector used for retrieval.
In conclusion, this model is a hybrid structure that leverages pre-trained encoders for each modality (image, text), while its core role of integrating the two is handled by a custom-built transformer.
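The two-step structure can be illustrated with a toy fusion step in plain Python. This is a heavily simplified sketch, not the paper's implementation: the image token and word tokens are hand-written stand-ins for EfficientNet-b7 features and GloVe embeddings, and each "layer" is a bare self-attention pass with identity projections (no learned weights, feed-forward blocks, layer normalization, or positional encodings). The mean pooling at the end is also an assumption.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """One scaled dot-product self-attention pass with identity projections."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, tokens)) for i in range(d)])
    return out

def fuse(image_token, word_tokens, num_layers=6):
    """Concatenate image and word tokens, apply attention layers, then mean-pool."""
    tokens = [image_token] + word_tokens
    for _ in range(num_layers):
        tokens = self_attention(tokens)
    d = len(tokens[0])
    return [sum(t[i] for t in tokens) / len(tokens) for i in range(d)]

# Hypothetical 4-D stand-ins for Step 1 outputs.
image_token = [0.9, 0.1, 0.0, 0.3]                           # image feature
word_tokens = [[0.0, 0.8, 0.1, 0.0], [0.2, 0.0, 0.7, 0.1]]   # word embeddings

query_vec = fuse(image_token, word_tokens)
print(len(query_vec))  # 4
```

The point of the sketch is the data flow: both modalities enter one token sequence, attention lets every token mix with every other, and a single vector comes out that can be compared against gallery image features.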
Conclusion
FashionIQ (CVPR 2021) is a pioneering study that established the CIR task specific to the fashion domain and built a large-scale benchmark with professional natural language queries. The emergence of this dataset was a crucial catalyst, accelerating research in image retrieval that understands human intent within the field of fashion AI.