
🧠 FashionIQ - Fashion Image Retrieval with Natural Language: A New Standard

(Figure: fashionIQ_main)


๐Ÿ” Research Background

Traditional Image Retrieval has been a research topic for a long time. The core goal is to find the most similar image to a given query.

However, the new field of Composed Image Retrieval (CIR), which the FashionIQ paper addresses, has distinct characteristics. Instead of a single modality (a single piece of information), it uses two modalities simultaneously: a reference image and a text modification. The goal isnโ€™t just to โ€œfind this image,โ€ but to โ€œfind an image like this, but modified in this way!โ€

โ€œFind a dress similar to this, but with longer sleeves.โ€ โ†’ Reference Image + Text Modification = Composed Fashion Image Retrieval

This type of natural language-based fashion search was a challenging task for existing methods.
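To make the task concrete, here is a minimal sketch (with purely illustrative names, not the paper's method) of how a composed query could be scored against a gallery: a fusion function maps the pair (reference image, modification text) to a single query vector, and gallery images are ranked by cosine similarity to it.

```python
import numpy as np

def compose_query(ref_image_feat: np.ndarray, text_feat: np.ndarray) -> np.ndarray:
    """Toy fusion: in a real system this is a learned model (e.g., a
    multimodal transformer); here the two modalities are simply averaged."""
    q = (ref_image_feat + text_feat) / 2.0
    return q / np.linalg.norm(q)

def rank_gallery(query_vec: np.ndarray, gallery_feats: np.ndarray) -> np.ndarray:
    """Return gallery indices sorted by cosine similarity to the query vector."""
    gallery = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
    return np.argsort(-(gallery @ query_vec))

# Toy example: 512-d features for one query and a 1,000-image gallery.
rng = np.random.default_rng(0)
ref_feat, txt_feat = rng.normal(size=512), rng.normal(size=512)
gallery = rng.normal(size=(1000, 512))
print(rank_gallery(compose_query(ref_feat, txt_feat), gallery)[:10])
```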


🧠 Key Contributions

A pioneering work in Composed Image Retrieval!

  1. Establishing the Fashion-specific CIR Task

    • The paper clearly defined the concept of Fashion Composed Image Retrieval.
    • The goal is to capture subtle stylistic changes requested by users.
  2. Introducing the FashionIQ Dataset

    • This is a large-scale benchmark for fashion image retrieval, including over 209,000 images and 77,000 queries.
    • The quality of the queries is high because the natural language text was written by professional fashion designers.
    • It includes a variety of clothing categories, such as dresses, shirts, tops, and pants.
  3. A Custom Query Structure for Fashion

    • The text queries include specific modifications related to fashion, such as color, material, pattern, and design.
    • Example: โ€œAdd a belt,โ€ โ€œChange from stripes to polka dots.โ€
  4. Comparing Performance with Existing Methods

    • The authors evaluated the performance of various state-of-the-art models like CLIP, TIRG, and FiLM on the FashionIQ dataset.
    • They demonstrated that, because of the specific nature of the fashion domain, existing models still have much room for improvement, thereby establishing a new direction for fashion AI research (a sketch of the usual retrieval metric follows this list).
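Benchmarks of this kind are usually reported with Recall@K, the fraction of queries whose target image appears among the top-K retrieved results (FashionIQ results are commonly averaged over R@10 and R@50 across categories). A minimal sketch of that metric, with illustrative inputs:

```python
import numpy as np

def recall_at_k(rankings: np.ndarray, targets: np.ndarray, k: int) -> float:
    """rankings: (num_queries, gallery_size) gallery ids sorted best-first.
    targets:  (num_queries,) ground-truth gallery id for each query."""
    hits = [target in ranking[:k] for ranking, target in zip(rankings, targets)]
    return float(np.mean(hits))

# Example: 3 queries over a 5-image gallery.
rankings = np.array([[2, 0, 4, 1, 3],
                     [1, 3, 0, 2, 4],
                     [4, 2, 1, 0, 3]])
targets = np.array([0, 3, 3])
print(recall_at_k(rankings, targets, k=2))  # 2 of the 3 targets are in the top 2 -> ~0.667
```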

The FashionIQ dataset is a large-scale fashion dataset specifically for Composed Image Retrieval (CIR). Unlike traditional image retrieval, its goal is to understand a userโ€™s complex intent through a combination of a โ€˜reference imageโ€™ and โ€˜natural language textโ€™ to find the desired fashion item.


🧩 Key Features

  • Scale: Consists of around 209,000 high-quality fashion images and 77,000 queries.
  • Query Composition: Each query consists of a reference image and two natural language texts describing the visual difference between the reference and the target image (see the loading sketch after this list).
  • Professionalism: The text queries were written by professional fashion designers, reflecting granular and realistic fashion attributes like โ€œshorten the sleevesโ€ or โ€œchange the color to black.โ€
  • Categories: It includes a wide range of clothing categories, such as womenโ€™s dresses, tops, and menโ€™s shirts.
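For reference, a hedged sketch of how such annotations could be loaded. It assumes the JSON layout used in the commonly distributed caption files (each entry holding a reference/"candidate" image id, a "target" image id, and a list of two relative captions); the file path and helper name are illustrative.

```python
import json

def load_fashioniq_queries(caption_file: str) -> list[dict]:
    """Load FashionIQ-style relative-caption annotations into query triplets."""
    with open(caption_file, encoding="utf-8") as f:
        entries = json.load(f)

    queries = []
    for e in entries:
        # The two human-written captions are typically joined into a single
        # modification sentence before being fed to the text encoder.
        modification = " and ".join(c.strip() for c in e["captions"])
        queries.append({
            "reference": e["candidate"],   # id of the reference image
            "target": e["target"],         # id of the desired target image
            "modification": modification,
        })
    return queries

# Hypothetical usage:
# queries = load_fashioniq_queries("captions/cap.dress.train.json")
```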

This dataset was a major contribution, as it pushed the development of AI models to go beyond simple image search and understand user intent. This is why itโ€™s frequently used as a reference not just in fashion, but in the broader Composed Image Retrieval field.


📷 FashionIQ Query Example

  • Reference Image: A long-sleeved red dress
  • Text Modification: โ€œChange color to black and make sleeves shorterโ€
  • Target Image: A short-sleeved black dress
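Expressed in the illustrative query-dictionary format from the loading sketch above (the image ids below are hypothetical), this example would look roughly like:

```python
example_query = {
    "reference": "B00A2RTWLK",   # hypothetical id of the red long-sleeved dress
    "target": "B00B4H7D2M",      # hypothetical id of the black short-sleeved dress
    "modification": "change color to black and make sleeves shorter",
}
```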

🧠 FashionIQ Model Training Structure: Summary

For Composed Image Retrieval, the FashionIQ paper uses the following training structure:

  • Step 1: Encoding with Pre-trained Models
    • Images: It uses the EfficientNet-b7 model, pre-trained on the DeepFashion dataset, to extract image features.
    • Text: It uses GloVe embeddings, pre-trained on a large external text corpus, to convert words into vectors.
  • Step 2: Integrated Learning with a Custom Transformer
    • The image features and text embeddings from Step 1 are fed into a newly designed 6-layer transformer that combines the two modalities.
    • This multimodal transformer learns the relationship between the image and text to generate the final query vector used for retrieval.

In conclusion, the model is a hybrid: it leverages pre-trained encoders for each modality (image and text), while the core task of integrating the two is handled by a custom-built transformer.
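Below is a minimal PyTorch sketch of that hybrid idea, under stated assumptions: the pre-trained encoders are replaced by stand-ins (a linear projection of pre-extracted image features instead of a DeepFashion-pretrained EfficientNet-b7, and a plain embedding layer where GloVe vectors would be loaded), and the fusion module is a generic 6-layer TransformerEncoder rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ComposedQueryModel(nn.Module):
    """Fuse image features and word embeddings into a single retrieval query vector."""

    def __init__(self, img_dim=2560, vocab_size=10000, d_model=512,
                 n_layers=6, n_heads=8):
        super().__init__()
        # Stand-ins for the pre-trained encoders described above: image features
        # (e.g., 2560-d EfficientNet-b7 outputs) are projected to d_model, and
        # word ids are embedded (pre-trained GloVe vectors would be loaded here).
        self.img_proj = nn.Linear(img_dim, d_model)
        self.word_emb = nn.Embedding(vocab_size, d_model, padding_idx=0)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                           batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, img_feat, word_ids):
        # img_feat: (B, img_dim), word_ids: (B, seq_len)
        img_tok = self.img_proj(img_feat).unsqueeze(1)   # (B, 1, d_model)
        txt_tok = self.word_emb(word_ids)                # (B, L, d_model)
        fused = self.fusion(torch.cat([img_tok, txt_tok], dim=1))
        # Take the image-token position as the composed query vector (a design
        # choice of this sketch, not necessarily the paper's).
        return F.normalize(fused[:, 0], dim=-1)

# Toy forward pass with random inputs.
model = ComposedQueryModel()
query_vec = model(torch.randn(2, 2560), torch.randint(1, 10000, (2, 12)))
print(query_vec.shape)  # torch.Size([2, 512])
```

At retrieval time, this query vector would be compared (for example, by cosine similarity) against gallery image features projected into the same space.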


🧩 Conclusion

FashionIQ (CVPR 2021) is a pioneering study that established the CIR task specific to the fashion domain and built a large-scale benchmark with professional natural language queries. The emergence of this dataset was a crucial catalyst, accelerating research in image retrieval that understands human intent within the field of fashion AI.


