Peeking into the Mind of AI: Understanding CAM!
(English) Peeking into the Mind of AI: Understanding CAM!
A groundbreaking study that allowed us to peek into a computer's decision-making process: CAM!
Learn About the Groundbreaking Research of CAM! (CVPR 2016, 13,000+ Citations)
This paper was presented at CVPR 2016 and has been cited 13,046 times!
In image analysis, not knowing CAM might be a crime!
Even if you're unfamiliar with the research, you've probably seen the image below:
In today's post, we'll take a deep dive into this game-changing research!
🤔 Why Was CAM Introduced?
Back in the days when AlphaGo made headlines (around 2015), image classification models like ResNet dramatically improved accuracy.
But regardless of whether predictions were right or wrong,
"Why did the model make that prediction?" was still a tough question.
Why did it predict this image is a dog?
Is the model really looking at the right part?
This curiosity led to the birth of CAM (Class Activation Map) research.
What is CAM (Class Activation Map)?
CAM is a technique that visually shows which part of the image the model focused on to make a prediction.
In other words, it highlights the decisive areas in the form of a heatmap.
As shown in the cartoon thumbnail, CAM allows AI to say:
"I predicted this as a dog because I focused on the eyes and ears!"
🧠 How Does CAM Work? + Example
- Extract the feature map from the last convolutional layer of a CNN.
- Instead of flattening that feature map into a large fully connected layer (as traditional models do), apply Global Average Pooling (GAP) to average each feature map down to one value, which gives a feature vector.
- A single linear layer maps this feature vector to class scores before the softmax. Take the class-specific weights of that layer and use them to form a weighted sum of the feature maps. The result is a CAM that shows which spatial locations contributed most to the prediction.
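The steps above can be written compactly. The notation is introduced here only for illustration: $f_k(x, y)$ is the activation of channel $k$ at location $(x, y)$ in the last conv feature map, $w_k^c$ is the weight connecting channel $k$ to class $c$, and $Z$ is the number of spatial locations (49 for a 7×7 map). Bias terms are left out to keep things simple.

$$
M_c(x, y) = \sum_k w_k^c \, f_k(x, y)
$$

$$
S_c = \sum_k w_k^c \cdot \frac{1}{Z} \sum_{x, y} f_k(x, y) = \frac{1}{Z} \sum_{x, y} M_c(x, y)
$$

Here $M_c$ is the CAM for class $c$ and $S_c$ is its score before softmax. Under these assumptions, the class score is simply the spatial average of the CAM, which is why the same weights can both classify the image and say where the evidence came from.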
If that sounds a bit complicated, let's compare the two structures side by side to understand it better!
Before We Start: Compare Traditional [Image Classifier] vs. [CAM-based Structure]
- Traditional: conv → flatten → FC → softmax
- CAM: conv → GAP → FC → softmax
Traditional Classification Model: Feeding a Single Image (224×224) into a CNN with an FC Layer
Step | Data Shape | Description |
---|---|---|
📷 Input Image | [3, 224, 224] | RGB image |
Last Conv Output | [512, 7, 7] | 512 feature maps of size 7×7 |
Flatten | [512 × 7 × 7] = [25088] | Flattened into a vector |
🧮 Fully Connected Layer | [N_classes] | Generates class scores (weight shape = [N_classes, 25088]) |
⚠️ Each class's weights here are tied to flattened positions, so they cannot be reused as the per-channel class_weight [512] that CAM needs |  |  |
🎯 Softmax | [N_classes] | Converts scores to probabilities |
🚫 CAM Not Available | ❌ Not possible | Spatial information is lost during flattening |
- As shown above, CAM is not possible in this structure.
- Only the final probabilities [N_classes] are available.
- N_classes is the number of classes to distinguish (e.g., dog, cat, etc.).
- In the CAM structure, the weights of the Fully Connected Layer that follows GAP serve as class_weight.
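To make the shapes in the traditional path concrete, here is a minimal NumPy sketch. The random `features` array stands in for a real CNN's last conv output, and `n_classes = 10` and the random FC weights are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes = 10  # hypothetical number of classes

# Stand-in for the last conv output: 512 feature maps of size 7x7.
features = rng.standard_normal((512, 7, 7))

# Flatten: the 7x7 grid is unrolled, so every weight below is tied
# to one specific (channel, row, column) slot.
flat = features.reshape(-1)                       # shape (25088,)

# Fully connected layer on the flattened vector (random weights for illustration).
fc_weight = rng.standard_normal((n_classes, flat.size)) * 0.01
fc_bias = np.zeros(n_classes)
scores = fc_weight @ flat + fc_bias               # shape (n_classes,)

# Softmax converts scores to probabilities.
exp = np.exp(scores - scores.max())
probs = exp / exp.sum()

print(flat.shape, fc_weight.shape, probs.shape)
```

Note that each class owns 25088 weights here rather than 512, so there is no single per-channel weight to combine the feature maps with.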
CAM Flow: Feeding a Single Image (224×224) into ResNet18 to Generate CAM
Step | Data Shape | Description |
---|---|---|
📷 Input Image | [3, 224, 224] | RGB image |
Last Conv Output | [512, 7, 7] | 512 feature maps of size 7×7 |
🔥 CAM Calculation | [7, 7] | Weighted sum of the feature maps, weighted by class_weight |
🖼 Final CAM Image (Upsample) | [224, 224] | Upsampled to overlay on the original image |
GAP (Global Average Pooling) | [512] | Channel-wise average of the [512, 7, 7] feature map |
🧮 FC Layer | [N_classes] | Converts GAP result to class scores |
🎯 Softmax | [N_classes] | Outputs prediction probabilities |
- CAM generates an interpretable heatmap
- The class prediction from GAP → FC → softmax may differ from that of a traditional CNN!
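The flow in the table can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: `features` is random data standing in for ResNet18's last conv output, the FC weights are random rather than trained, `n_classes = 10` is hypothetical, and nearest-neighbor upsampling is just one simple choice of interpolation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes = 10  # hypothetical number of classes

# Stand-ins: last conv output and the FC layer that follows GAP.
features = rng.standard_normal((512, 7, 7))
fc_weight = rng.standard_normal((n_classes, 512)) * 0.01
fc_bias = np.zeros(n_classes)

# Classification branch: GAP -> FC -> predicted class.
gap = features.mean(axis=(1, 2))                  # shape (512,)
scores = fc_weight @ gap + fc_bias                # shape (n_classes,)
pred = int(scores.argmax())

# CAM branch: weighted sum of the 512 feature maps with the
# predicted class's weights (the class_weight from the table).
class_weight = fc_weight[pred]                    # shape (512,)
cam = np.tensordot(class_weight, features, axes=1)  # shape (7, 7)

# Nearest-neighbor upsampling from 7x7 to 224x224 (factor 32).
cam_up = cam.repeat(32, axis=0).repeat(32, axis=1)

print(cam.shape, cam_up.shape)
```

A useful sanity check on this sketch: `cam.mean() + fc_bias[pred]` equals `scores[pred]`, so the heatmap is a spatial breakdown of the class score itself.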
📸 Real Example: How CAM is Actually Used
- If the AI predicts an image as "dog", CAM highlights the face, ears, and tail regions that contributed most.
- This allows us to verify whether the model made a reasonable prediction.
- Below is a CAM result for a golden retriever. It seems the model focused on the ears!
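For readers who want to produce a similar overlay, here is a rough sketch of one way to blend an upsampled CAM onto an image. The helper name `overlay_cam`, the red-only heatmap, and the `alpha` blending factor are all assumptions made for illustration, and the inputs are random arrays rather than a real photo.

```python
import numpy as np

def overlay_cam(image, cam, alpha=0.5):
    """Blend a CAM heatmap over an RGB image (hypothetical helper).

    image: float array of shape (H, W, 3) with values in [0, 1]
    cam:   float array of shape (H, W), any value range
    """
    # Min-max normalize the CAM to [0, 1] so it can be shown as intensity.
    cam = cam - cam.min()
    cam = cam / (cam.max() + 1e-8)
    # Put the heatmap in the red channel only, as a very simple colormap.
    heat = np.zeros_like(image)
    heat[..., 0] = cam
    return (1 - alpha) * image + alpha * heat

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))          # stand-in for a real photo
cam_up = rng.standard_normal((224, 224))   # stand-in for an upsampled CAM
blended = overlay_cam(image, cam_up)
print(blended.shape)
```

In practice a proper colormap (for example a jet-style map from a plotting library) gives the familiar blue-to-red look, but the normalization and blending steps stay the same.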
In the next post, we'll dive into the code behind this and explore its structure in detail!
🧰 CAM's Transformative Impact
- CAM was one of the first methods to visually explain CNN predictions.
- It had a massive influence on weakly supervised object localization (WSOL).
- It inspired follow-up methods like Grad-CAM, which removed CAM's architectural constraints (the need for a GAP → FC head).
💡 Why CAM is Amazing
- ✅ Helps visually confirm why a model made a prediction
- ✅ Makes it easier to find errors, bias, and reasoning mistakes
- ✅ Improves model transparency and trustworthiness
CAM was the beginning of the end for the "black box" era in AI.
Even today, researchers are extending CAM into the broader domain of explainable AI (XAI).
Try implementing CAM yourself to take a peek inside your AI's mind!
(Korean) Peeking into the Mind of AI: Understanding CAM!
CAM: the research that let us read a computer's mind in image classification!
Let's explore the remarkable CAM research! (CVPR 2016, 13,000+ citations)
Presented at CVPR 2016, this paper has been cited a whopping 13,046 times!
In image analysis, not knowing CAM might be a crime!? Even if you don't know the research, you have probably seen the image below many times!
Today's post explores this groundbreaking research!
🤔 Why Did CAM Appear?
Around the time AlphaGo was making headlines (around 2015), strong models such as ResNet appeared in image classification and accuracy improved noticeably.
But whether a prediction was right or wrong,
interpreting "why the model made that prediction" remained a hard problem.
Why did it think this image was a dog?
Is the model really looking at the right thing?
This curiosity gave birth to the CAM (Class Activation Map) research.
What is CAM (Class Activation Map)?
CAM is a technique that visually shows which part of the image an image classification model based its prediction on.
In other words, it marks the decisive regions of the image as a heatmap!
As in the thumbnail cartoon, it lets the AI explain:
"I called this a 'dog' after looking at its eyes and ears!"
🧠 How CAM Works + Example
- Extract the feature map produced by the last convolutional layer of the CNN.
- Instead of flattening it into a Fully Connected Layer (which is what traditional image classifiers did), use Global Average Pooling (GAP) to average each feature map down to a single value and build a feature vector.
- The class-specific weights that connect this feature vector to the softmax are then used as weights over the feature maps to compute the CAM, which visualizes as a heatmap which locations contributed to the predicted class.
That was the summary, but it is a little hard to follow, right? So let's compare the two structures side by side!
Before we start, a quick look at how the [CAM structure] differs:
- Traditional structure: image → conv → flatten → FC → softmax
- CAM structure: image → conv → GAP → FC → softmax
How a traditional classifier works: feeding a single image (224×224) into a typical CNN classifier (with an FC layer)
Step | Data Shape | Description |
---|---|---|
📷 Input Image | [3, 224, 224] | RGB image |
CNN Last Conv Output | [512, 7, 7] | 512 feature maps of size 7×7 |
Flatten | [512 × 7 × 7] = [25088] | Spatial information is unrolled into a 1-D vector |
🧮 Fully Connected Layer | [N_classes] | Produces prediction scores (weight shape = [N_classes, 25088]) |
⚠️ Each class's [25088] weights are tied to flattened positions, so they cannot be used directly as CAM's per-channel class_weight |  |  |
🎯 Softmax | [N_classes] | Prediction converted to probabilities |
🚫 CAM Not Available | ❌ None | Spatial information disappears in the flatten step, so a CAM cannot be generated |
- As shown above, CAM is not possible here, and only the probabilities for the [N_classes] classes come out.
- N_classes is the number of targets to distinguish (e.g., the number of object types such as dog and cat).
- In the CAM structure, the weights used in the Fully Connected Layer after GAP are what CAM uses as class_weight.
Feeding a single image (224×224) into ResNet18 to build a CAM image
Step | Data Shape | Description |
---|---|---|
📷 Input Image | [3, 224, 224] | RGB image |
CNN (ResNet) Last Conv Output | [512, 7, 7] | 512 feature maps of size 7×7 |
🔥 CAM Calculation: weighted sum of the last conv feature maps with class_weight | [7, 7] | 7×7 map |
🖼 Final CAM Image (Upsample) | [224, 224] | Heatmap can be overlaid on the original image |
GAP (Global Average Pooling) | [512] | Channel-wise average vector of the [512, 7, 7] feature map |
🧮 FC Layer | [N_classes] | Converts the GAP result to per-class scores |
🎯 Softmax | [N_classes] | Outputs the predicted class probabilities |
- Besides producing the CAM image, this structure can also make the final class prediction, and the result may differ from that of a traditional classifier.
- The 7×7 map is upsampled by interpolation, for example Nearest Neighbor Interpolation.
📸 A Real Example of CAM in Use
- For example, if the AI classifies a dog image as "dog", CAM brightly highlights the dog's key features such as the face, ears, and tail.
- The user can intuitively judge whether the AI really classified it in a reasonable way.
- Below, a golden retriever was classified with CAM. It appears the model relied on the ears!
In the next post, we will analyze the code and look at the structure in more detail!
🧰 CAM's Remarkable Impact
- CAM was one of the first studies to visually explain CNN predictions.
- It greatly influenced the development of Weakly-Supervised Object Localization,
and it led to later interpretability methods such as Grad-CAM.
💡 Conclusion: Why CAM is Amazing
- ✅ You can visually check the basis for a model's prediction
- ✅ You can easily diagnose the causes of wrong decisions, bias, and errors
- ✅ The model's trustworthiness and transparency improve considerably
CAM was the starting point for tearing down the wall of the "AI black box".
Even now, many researchers are extending it across the field of explainable AI (XAI).
Try CAM hands-on yourself and experience peeking into the mind of AI!