
Studying CAM with Python! - 파이썬으로 CAM 공부하기


Today, we will delve into CAM (Class Activation Map) in detail using Python code!! Before we begin, the packages we need are listed below!!
Don’t worry, everything runs on a CPU, no GPU required~!^^

import torch
from torchvision import models, transforms
from PIL import Image
import requests
from io import BytesIO
import matplotlib.pyplot as plt
import numpy as np
import cv2

CAM basically starts with image classification!
Today, we aim to create a CAM image based on a ResNet classification model trained on ImageNet!!!

What is ImageNet!?

image_net

  • Contains over 14 million images, organized into more than 20,000 noun categories arranged in a hierarchy based on WordNet.
  • Made a major contribution to the development of deep learning in computer vision.
  • ResNet is also trained on this ImageNet data!!

What is ResNet?

resnet

  • An innovative model in the vision field: an architecture that greatly improved image recognition performance!! Introduced in 2015 by Microsoft Research!!
  • Overcame the difficulty of training deep neural networks by using residual connections.
  • Residual connections: add the learned change (the residual) back to the input, which helps prevent vanishing gradients!! A minimal sketch follows this list.
  • Enables truly deep networks (DNNs): effective learning is possible even with many layers!
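To make the residual idea concrete, here is a minimal sketch of a residual block in PyTorch (simplified for illustration only; torchvision’s actual BasicBlock also adds batch normalization and an optional downsample path):

```python
import torch
import torch.nn as nn

class TinyResidualBlock(nn.Module):
    """Minimal residual block: output = ReLU(F(x) + x)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        residual = self.conv2(self.relu(self.conv1(x)))  # the "learned change" F(x)
        return self.relu(residual + x)                   # add it back to the input

x = torch.randn(1, 64, 56, 56)
print(TinyResidualBlock(64)(x).shape)  # torch.Size([1, 64, 56, 56])
```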

Code Start!!

# ✅ Loading ImageNet class index (for dog class identification)
import json
imagenet_url = "https://raw.githubusercontent.com/pytorch/hub/master/imagenet_classes.txt"
imagenet_classes = requests.get(imagenet_url).text.splitlines()

# ✅ Loading test image (e.g., from the internet)
img_url = "https://images.unsplash.com/photo-1558788353-f76d92427f16"  # Dog photo
response = requests.get(img_url)
img = Image.open(BytesIO(response.content)).convert("RGB")

# ✅ Loading pre-trained ResNet18 (inference without training)
model = models.resnet18(pretrained=True)
model.eval()

Through the above process, we load the ImageNet class labels, the dog photo, and finally the pre-trained ResNet18 model!
Since model.eval() returns the model itself, you can also see the model structure printed as shown below~~
We will explore the detailed structure of the model in the next ResNet study!
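As a side note, recent torchvision versions (0.13 and later) warn that `pretrained=True` is deprecated; if you see that warning, the equivalent call is roughly the following sketch:

```python
# Equivalent to models.resnet18(pretrained=True) on torchvision >= 0.13
from torchvision.models import ResNet18_Weights
model = models.resnet18(weights=ResNet18_Weights.IMAGENET1K_V1)
model.eval()
```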

View ResNet Detailed Structure ``` ResNet( (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False) (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False) (layer1): Sequential( (0): BasicBlock( (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) ) (1): BasicBlock( (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) ) ) (layer2): Sequential( (0): BasicBlock( (conv1): Conv2d(64, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False) (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (downsample): Sequential( (0): Conv2d(64, 128, kernel_size=(1, 1), stride=(2, 2), bias=False) (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) ) ) (1): BasicBlock( (conv1): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) ) ) (layer3): Sequential( (0): BasicBlock( (conv1): Conv2d(128, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False) (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (downsample): Sequential( (0): Conv2d(128, 256, kernel_size=(1, 1), stride=(2, 2), bias=False) (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) ) ) (1): BasicBlock( (conv1): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) ) ) (layer4): Sequential( (0): BasicBlock( (conv1): Conv2d(256, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False) (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, 
track_running_stats=True) (downsample): Sequential( (0): Conv2d(256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False) (1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) ) ) (1): BasicBlock( (conv1): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) ) ) (avgpool): AdaptiveAvgPool2d(output_size=(1, 1)) (fc): Linear(in_features=512, out_features=1000, bias=True) ) ```

Preprocessing Start!!

# ✅ Extracting dog classes (simple method: names containing 'golden retriever')
dog_classes = [i for i, name in enumerate(imagenet_classes) if 'golden retriever' in name.lower()]

# ✅ Image preprocessing
transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])
input_tensor = transform(img).unsqueeze(0)  # shape: [1, 3, 224, 224]
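One small caveat: this transform skips the normalization step that torchvision’s pretrained ImageNet models were trained with. The example still predicts correctly for this photo, but if you experiment with other images, a more conventional pipeline would look like the sketch below (the mean/std values are the standard ImageNet statistics):

```python
# Optional: the usual ImageNet preprocessing, including normalization
transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```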

Now all preparations are complete!! Let’s put the data into the model and perform inference!!

Simple Classification Model!! (Fully Connected Layer + Softmax)

# ✅ Load pre-trained ResNet18 (inference without training)
model = models.resnet18(pretrained=True)
model.eval()

# ✅ Inference
with torch.no_grad():
    output = model(input_tensor)
    pred_class = output.argmax(dim=1).item()

# ✅ Result check
pred_label = imagenet_classes[pred_class]
is_dog = pred_class in dog_classes

print(f"Predicted label: {pred_label}")
print("🦴 Is it a dog?", "✅ Yes" if is_dog else "❌ No")

Through the above code, you can check simply ‘whether the image is classified as a dog!’
This was how classification models worked before CAM~~

If you actually examine the output vector output,

output[0][205:210]

You can see that the value at index 207 is indeed the largest, at 13.7348!
(Since argmax(dim=1) returned 207, it has to be the highest value, right!?)

tensor([ 9.8655,  6.4875, 13.7348, 11.1263,  8.8567])

CAM Start!!!!

Let’s refer back to the CAM structure we summarized in the previous post.

| Step | Data Shape | Description |
| --- | --- | --- |
| 📷 Input Image | [3, 224, 224] | RGB image |
| 🔄 CNN (ResNet) last conv output | [512, 7, 7] | 512 feature maps of size 7×7 |
| 🔥 CAM calculation: weighted sum of the last conv output and class_weight | [7, 7] | 7×7 feature map |
| 🔼 Final CAM image creation (upsample) | [224, 224] | Heatmap that can be overlaid on the original image |
| 📉 GAP (Global Average Pooling) | [512] | Channel-wise average vector of the [512, 7, 7] feature map |
| 🧮 FC Layer | [N_classes] | Converts the GAP result to class scores |
| 🎯 Softmax | [N_classes] | Outputs predicted class probabilities |

Load the model again, just like before!

# ✅ Load pretrained ResNet18
model = models.resnet18(pretrained=True)
model.eval()

Feature Map Extraction!

Now, the important part begins! It’s as follows!!

features = []

def hook_fn(module, input, output):
    features.append(output)

model.layer4.register_forward_hook(hook_fn)  # Last conv block

  • hook_fn: a forward-hook function for grabbing data from inside the model. module is the layer object, input is the input tuple, and output is the output tensor; here we simply append the output to features.
  • model.layer4.register_forward_hook(hook_fn): attaches hook_fn to model.layer4, so the output of the last conv block is stored in the features list on every forward pass.

After attaching the hook_fn function to the model’s layer4,
we run the model exactly as we did for the simple classification model, as shown below.
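Concretely, that forward pass looks the same as before (it also appears in the full code at the end of this post); running it after registering the hook is what fills the features list:

```python
# ✅ Predict (the hook fires during this forward pass and stores layer4's output)
with torch.no_grad():
    output = model(input_tensor)
    pred_class = output.argmax(dim=1).item()
```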

Now, let’s proceed with the [CNN(resnet) Last conv Output]!

| Step | Data Shape | Description |
| --- | --- | --- |
| 🔄 CNN (ResNet) last conv output | [512, 7, 7] | 512 feature maps of size 7×7 |

# ✅ Get weights from the final linear layer
params = list(model.parameters())
fc_weights = params[-2]  # shape: [1000, 512]
class_weights = fc_weights[pred_class].detach().cpu().numpy()  # [512]

# ✅ Get feature map from hook
feature_map = features[0].squeeze(0).detach().cpu().numpy()  # [512, 7, 7]

Through this, we have extracted the feature map of size [512, 7, 7], together with the FC weights (class_weights) for the predicted class!

Create the CAM!!!

Now, this is the process of creating a CAM image from this feature map!

| Step | Data Shape | Description |
| --- | --- | --- |
| 🔥 CAM calculation: weighted sum of the last conv output and class_weight | [7, 7] | 7×7 feature map |
| 🔼 Final CAM image creation (upsample) | [224, 224] | Heatmap that can be overlaid on the original image |

# ✅ Compute CAM
cam = np.zeros((7, 7), dtype=np.float32)
for i in range(len(class_weights)):
    cam += class_weights[i] * feature_map[i]

cam = np.maximum(cam, 0)
cam = cam - np.min(cam)
cam = cam / np.max(cam)
cam = cv2.resize(cam, (224, 224))
heatmap = cv2.applyColorMap(np.uint8(255 * cam), cv2.COLORMAP_JET)

In the above process, we obtain a [7, 7] CAM by computing the weighted sum of the class_weights and the [512, 7, 7] feature map from ResNet’s last conv output! Then we normalize it, resize it to 224×224 (i.e., upsample it), and apply a color map to create the final heatmap image!
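As a small aside, the per-channel loop above can also be written as a single vectorized operation; this sketch produces the same [7, 7] map:

```python
# Equivalent vectorized weighted sum: class_weights [512] × feature_map [512, 7, 7] → [7, 7]
cam = np.einsum('c,chw->hw', class_weights, feature_map)
```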

Finally, we overlay the heatmap image on the original image for visualization!

# ✅ Overlay CAM on original image
img_cv = np.array(transforms.Resize((224, 224))(img))[:, :, ::-1]  # PIL → OpenCV BGR
overlay = cv2.addWeighted(img_cv, 0.5, heatmap, 0.5, 0)

# ✅ Show
plt.figure(figsize=(10, 5))
plt.subplot(1, 2, 1)
plt.title("Original Image")
plt.imshow(img)
plt.axis('off')

plt.subplot(1, 2, 2)
plt.title(f"CAM: {imagenet_classes[pred_class]}")
plt.imshow(overlay[:, :, ::-1])  # Back to RGB
plt.axis('off')
plt.tight_layout()
plt.show()

Then!! You will get the image shown directly below, which we saw in the previous post~!

golden

Image classification using CAM!

In addition to this!!
We can also classify directly within the CAM pipeline!!!
By passing the feature map through GAP and the FC layer and then applying softmax, we can also see how confident the prediction is~!

| Step | Data Shape | Description |
| --- | --- | --- |
| 📉 GAP (Global Average Pooling) | [512] | Channel-wise average vector of the [512, 7, 7] feature map |
| 🧮 FC Layer | [N_classes] | Converts the GAP result to class scores |
| 🎯 Softmax | [N_classes] | Outputs predicted class probabilities |

# ✅ GAP operation: [512, 7, 7] → [512]
gap_vector = feature_map.mean(axis=(1, 2))  # shape: [512]

# ✅ FC operation: [512] × [1000, 512]^T → [1000]
logits = np.dot(fc_weights.detach().cpu().numpy(), gap_vector)  # shape: [1000]
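# (Note: this manual FC step skips the bias term, params[-1]; adding it, e.g.
#  logits += params[-1].detach().cpu().numpy(), would match the model's own logits
#  up to floating-point error.)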

# ✅ Softmax
exp_logits = np.exp(logits - np.max(logits))  # numerical stability
probs = exp_logits / exp_logits.sum()

# ✅ Predicted class
gap_pred_class = np.argmax(probs)
gap_pred_label = imagenet_classes[gap_pred_class]

# ✅ Result comparison
print("\n✅ GAP → FC → Softmax based prediction result")
print(f"Predicted label: {gap_pred_label}")
print("🦴 Is it a dog?", "✅ Yes" if gap_pred_class in dog_classes else "❌ No")

After going through the above process!?
You can confirm the result:
Predicted label: golden retriever

Is it real?

probs[205:210]

Looking at this, the value at index 207 is indeed the largest, right!?
However!! You can see it differs from the 13.7348 in the original classification vector!! That earlier value was a raw logit, whereas here we applied softmax after GAP and the FC weights, so what we get instead is a probability of about 0.845.

[1.7553568e-02, 6.1262056e-04, 8.4515899e-01, 6.3063554e-02, 6.3092457e-03]
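To compare on the same scale, a quick sanity check (a sketch reusing the output tensor from earlier) is to apply softmax to the model’s own logits; the values will differ slightly from probs above, because the manual FC step skipped the bias term:

```python
# Softmax over the model's original logits, for comparison with probs[205:210]
torch.softmax(output[0], dim=0)[205:210]
```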

Through today’s process, we were able to understand the detailed operation of CAM well!!

View Full Code

```python
import torch
from torchvision import models, transforms
from PIL import Image
import requests
from io import BytesIO
import numpy as np
import matplotlib.pyplot as plt
import cv2
import json

imagenet_url = "https://raw.githubusercontent.com/pytorch/hub/master/imagenet_classes.txt"
imagenet_classes = requests.get(imagenet_url).text.splitlines()

# ✅ Loading test image (e.g., from the internet)
img_url = "https://images.unsplash.com/photo-1558788353-f76d92427f16"  # Dog photo
response = requests.get(img_url)
img = Image.open(BytesIO(response.content)).convert("RGB")

# ✅ Loading pre-trained ResNet18 (inference without training)
model = models.resnet18(pretrained=True)
model.eval()

# ✅ Extracting dog classes (simple method: names containing 'golden retriever')
dog_classes = [i for i, name in enumerate(imagenet_classes) if 'golden retriever' in name.lower()]

# ✅ Image preprocessing
transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])
input_tensor = transform(img).unsqueeze(0)  # shape: [1, 3, 224, 224]

# ✅ Inference
with torch.no_grad():
    output = model(input_tensor)
    pred_class = output.argmax(dim=1).item()

# ✅ Result check
pred_label = imagenet_classes[pred_class]
is_dog = pred_class in dog_classes

print(f"Predicted label: {pred_label}")
print("🦴 Is it a dog?", "✅ Yes" if is_dog else "❌ No")

# ✅ Load pretrained ResNet18
model = models.resnet18(pretrained=True)
model.eval()

# ✅ Hook to get final conv feature map
features = []

def hook_fn(module, input, output):
    features.append(output)

model.layer4.register_forward_hook(hook_fn)  # Last conv block

# ✅ Predict
with torch.no_grad():
    output = model(input_tensor)
    pred_class = output.argmax(dim=1).item()

# ✅ Get weights from the final linear layer
params = list(model.parameters())
fc_weights = params[-2]  # shape: [1000, 512]
class_weights = fc_weights[pred_class].detach().cpu().numpy()  # [512]

# ✅ Get feature map from hook
feature_map = features[0].squeeze(0).detach().cpu().numpy()  # [512, 7, 7]

# ✅ Compute CAM
cam = np.zeros((7, 7), dtype=np.float32)
for i in range(len(class_weights)):
    cam += class_weights[i] * feature_map[i]

# Normalize & resize
cam = np.maximum(cam, 0)
cam = cam - np.min(cam)
cam = cam / np.max(cam)
cam = cv2.resize(cam, (224, 224))
heatmap = cv2.applyColorMap(np.uint8(255 * cam), cv2.COLORMAP_JET)

# ✅ Overlay CAM on original image
img_cv = np.array(transforms.Resize((224, 224))(img))[:, :, ::-1]  # PIL → OpenCV BGR
overlay = cv2.addWeighted(img_cv, 0.5, heatmap, 0.5, 0)

# ✅ Show
plt.figure(figsize=(10, 5))
plt.subplot(1, 2, 1)
plt.title("Original Image")
plt.imshow(img)
plt.axis('off')

plt.subplot(1, 2, 2)
plt.title(f"CAM: {imagenet_classes[pred_class]}")
plt.imshow(overlay[:, :, ::-1])  # Back to RGB
plt.axis('off')
plt.tight_layout()
plt.show()

# ✅ Result text output
print(f"Predicted label: {imagenet_classes[pred_class]}")
print("🦴 Is it a dog?", "✅ Yes" if pred_class in dog_classes else "❌ No")
```

This post is licensed under CC BY 4.0 by the author.