🖥️ Object Detection with DETR! Python Practice!!

In the previous post we studied DETR!! Today, based on this DETR model, we will directly perform Object Detection!

detr_result

  • Let's start with the conclusion again!!!
  • It finds and shows the multiple objects detected in the image!!
  • It picks out the many people and the frisbee, each with its confidence score!!
  • Let's explore the process together with Python code!!

1. Loading the DETR model from Hugging Face!!

Today's DETR model will be loaded from Hugging Face, using the facebook/detr-resnet-50 model.

import torch
import torchvision.transforms as T
import matplotlib.pyplot as plt
import matplotlib.patches as patches
from PIL import Image
from transformers import DetrImageProcessor, DetrForObjectDetection


# 1๏ธโƒฃ Set device (use CUDA if GPU is available)
device = "cuda" if torch.cuda.is_available() else "cpu"

# 2๏ธโƒฃ Load DETR model and processor (pretrained model)
model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50").to(device)
processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
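
Looking at the code above, we load both the model and the processor of the pretrained facebook/detr-resnet-50. Let's look at the role of each!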

processor : 🖼️ Image Processor (DetrImageProcessor)

Role: To preprocess the input image into a format that the DETR model can effectively understand and process.

Main Tasks:

  1. Image Resizing: Changes the size of the input image to a specific size required by the model.
  2. Image Normalization: Adjusts the pixel values of the image to a specific range to improve the stability of model training and inference.
  3. Tensor Conversion: Converts the image into a tensor format that can be used by deep learning frameworks such as PyTorch.
  4. Handling Model-Specific Requirements: Performs additional preprocessing tasks according to the model architecture (e.g., mask generation).

If we actually check the internal workings of the processor, we can see the preprocessing steps as below:
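
One simple way to see this, with the processor loaded above, is to print it; its printed representation is exactly the configuration dump below:

# Hugging Face image processors print their preprocessing config as JSON
print(processor)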

DetrImageProcessor {
  "do_convert_annotations": true,
  "do_normalize": true,
  "do_pad": true,
  "do_rescale": true,
  "do_resize": true,
  "format": "coco_detection",
  "image_mean": [
    0.485,
    0.456,
    0.406
  ],
  "image_processor_type": "DetrImageProcessor",
  "image_std": [
    0.229,
    0.224,
    0.225
  ],
  "pad_size": null,
  "resample": 2,
  "rescale_factor": 0.00392156862745098,
  "size": {
    "longest_edge": 1333,
    "shortest_edge": 800
  }
}
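
A few of these values are worth decoding: rescale_factor 0.00392156862745098 is simply 1/255, image_mean / image_std are the standard ImageNet normalization statistics, resample 2 is PIL's bilinear interpolation, and size means images are resized so the shortest edge becomes 800 pixels while the longest edge is capped at 1333 pixels.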

model : 🤖 DETR Object Detection Model (DetrForObjectDetection)

Role: To take the preprocessed image, detect the objects in it, and predict each object's location and class. This is the core of the pipeline.

Main Tasks:

  1. Feature Extraction: Extracts important visual features for object detection from the input image.
  2. Transformer Encoder-Decoder: Processes the extracted features through the Transformer structure to understand the relationships between objects in the image and learn information about each object.
  3. Object Prediction: Finally outputs the bounding box coordinates, the corresponding class labels, and the confidence scores of the detected objects in the image.

The DETR model is structured as shown below:

detr_model
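
Before moving on, it helps to see what the model actually returns. Below is a minimal sketch (the blank 640x480 test image is just an assumption for illustration): DETR always predicts a fixed set of 100 object queries, each with class logits and a normalized box.

import torch
from PIL import Image
from transformers import DetrImageProcessor, DetrForObjectDetection

processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")

# A blank RGB image, used only to probe the output shapes
dummy = Image.new("RGB", (640, 480))
inputs = processor(images=dummy, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

print(outputs.logits.shape)      # torch.Size([1, 100, 92]): 100 queries x (91 COCO labels + "no object")
print(outputs.pred_boxes.shape)  # torch.Size([1, 100, 4]): normalized (cx, cy, w, h) per query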

2. Starting Object Detection with DETR!

It's done with just a few lines of simple code!!!

I have prepared an image, catch_frisbee.jpg, in which several people are playing with a frisbee! And then!

import torch
import torchvision.transforms as T
from PIL import Image
from transformers import DetrImageProcessor, DetrForObjectDetection

# 1️⃣ Set device (use CUDA if GPU is available)
device = "cuda" if torch.cuda.is_available() else "cpu"

# 2️⃣ Load DETR model and processor (pretrained model)
model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50").to(device)
processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")

# 3️⃣ Load the catch_frisbee.jpg image from the local directory
image_path = "catch_frisbee.jpg"
image = Image.open(image_path)

# 4️⃣ Preprocess the image (convert to DETR model input format)
inputs = processor(images=image, return_tensors="pt").to(device)

# 5️⃣ Model inference
with torch.no_grad():
    outputs = model(**inputs)

# 6️⃣ Post-process the results (convert Bounding Box & Labels)
target_sizes = torch.tensor([image.size[::-1]])  # image.size is (width, height), reversed to (height, width)
results = processor.post_process_object_detection(outputs, target_sizes=target_sizes)[0]

# 7️⃣ Output detected objects
for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    if score > 0.7:  # Output objects with confidence above 70%
        box = [round(i, 2) for i in box.tolist()]
        print(f"Detected {model.config.id2label[label.item()]} with confidence {round(score.item(), 3)} at {box}")

If we briefly analyze the code above:

  • It loads the model.
  • It loads the catch_frisbee image!
  • It preprocesses it through the processor.
  • It puts it into the model and performs inference!
  • It prints the detected content from results!

Then the output, shown below, tells us each detected object, its confidence, and finally its bounding box coordinates!

Detected person with confidence 0.783 at [12.91, 355.33, 32.23, 383.66]
Detected person with confidence 0.999 at [279.08, 255.76, 365.66, 423.82]
Detected person with confidence 0.995 at [533.57, 280.23, 584.71, 401.82]
Detected umbrella with confidence 0.744 at [459.41, 324.56, 496.24, 340.89]
Detected person with confidence 0.933 at [488.93, 340.06, 510.23, 376.37]
Detected person with confidence 0.835 at [0.01, 355.79, 11.03, 384.31]
Detected person with confidence 0.906 at [261.05, 346.35, 284.02, 378.22]
Detected person with confidence 0.99 at [574.15, 301.1, 605.79, 395.45]
Detected person with confidence 0.713 at [244.5, 349.68, 262.29, 378.9]
Detected person with confidence 0.997 at [132.21, 31.6, 310.32, 329.97]
Detected person with confidence 0.732 at [349.66, 352.63, 365.67, 378.28]
Detected person with confidence 0.796 at [209.17, 326.9, 232.89, 355.65]
Detected person with confidence 0.777 at [149.0, 347.84, 169.28, 381.43]
Detected person with confidence 0.991 at [163.45, 299.99, 206.14, 399.0]
Detected frisbee with confidence 1.0 at [181.55, 139.33, 225.96, 161.49]
Detected person with confidence 0.734 at [200.95, 350.37, 229.14, 380.88]
Detected person with confidence 0.737 at [467.46, 347.11, 483.07, 376.49]
Detected person with confidence 0.978 at [413.58, 253.38, 465.11, 416.57]
Detected person with confidence 0.73 at [597.38, 342.37, 613.34, 380.89]
Detected person with confidence 0.998 at [304.64, 70.92, 538.5, 410.45]
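
By the way, the manual if score > 0.7 filter could also be handled by the post-processing step itself: post_process_object_detection accepts a threshold argument (default 0.5), so the equivalent one-step version is:

# Let the post-processor apply the 70% confidence cut directly
results = processor.post_process_object_detection(
    outputs, target_sizes=target_sizes, threshold=0.7
)[0]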

3. Visualization of Object Detection Results!!

Instead of plain text output, let's draw the bounding boxes on the image!

import torch
import torchvision.transforms as T
import matplotlib.pyplot as plt
import matplotlib.patches as patches
from PIL import Image
from transformers import DetrImageProcessor, DetrForObjectDetection


# 1️⃣ Set device (use CUDA if GPU is available)
device = "cuda" if torch.cuda.is_available() else "cpu"

# 2️⃣ Load DETR model and processor (pretrained model)
model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50").to(device)
processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")

# 3️⃣ Load the catch_frisbee.jpg image from the local directory
image_path = "catch_frisbee.jpg"
image = Image.open(image_path)

# 4️⃣ Preprocess the image (convert to DETR model input format)
inputs = processor(images=image, return_tensors="pt").to(device)

# 5️⃣ Model inference
with torch.no_grad():
    outputs = model(**inputs)

# 6️⃣ Post-process the results (convert Bounding Box & Labels)
target_sizes = torch.tensor([image.size[::-1]])  # image.size is (width, height), reversed to (height, width)
results = processor.post_process_object_detection(outputs, target_sizes=target_sizes)[0]


# 7️⃣ Visualize detected objects with Bounding Boxes on the image
fig, ax = plt.subplots(1, figsize=(10, 6))
ax.imshow(image)

# Draw Bounding Boxes
for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    if score > 0.7:  # 🔹 Visualize objects with confidence above 70%
        xmin, ymin, xmax, ymax = box.tolist()  # boxes are (xmin, ymin, xmax, ymax)
        rect = patches.Rectangle((xmin, ymin), xmax - xmin, ymax - ymin,
                                 linewidth=2, edgecolor='r', facecolor='none')
        ax.add_patch(rect)
        ax.text(xmin, ymin, f"{model.config.id2label[label.item()]}: {round(score.item(), 2)}",
                fontsize=12, bbox=dict(facecolor='yellow', alpha=0.5))

# 8️⃣ Save the result
output_path = "detr_output.jpg"  # 🔹 Filename to save
plt.axis("off")  # 🔹 Remove axes
plt.savefig(output_path, bbox_inches="tight")
plt.show()

print(f"Detection result saved as {output_path}")

Through the code above,
the detected objects are visualized
and saved as detr_output.jpg!!

detr_result

Object detection, it's really easy, right?
However, it takes 8.5 seconds to detect objects in a single image… it's still a bit slow!
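
If you want to measure this yourself, a simple sketch with time.perf_counter around the forward pass (using model, inputs, and device as prepared above) lets you compare CPU and GPU runs:

import time
import torch

# Time only the model's forward pass
start = time.perf_counter()
with torch.no_grad():
    outputs = model(**inputs)
elapsed = time.perf_counter() - start
print(f"Inference took {elapsed:.2f} s on {device}")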


This post is licensed under CC BY 4.0 by the author.