[5-Minute SOTA Paper Contribution Review #12] ECCV 2020, End-to-End Object Detection with Transformers

2024. 11. 19. 16:00 · [5-Minute SOTA Paper Contribution Review]

This post briefly reviews the paper End-to-End Object Detection with Transformers (ECCV 2020).

Figures and explanations are taken from the paper.

Paper link: https://arxiv.org/abs/2005.12872

1. Motivation

Fig. 1: DETR directly predicts (in parallel) the final set of detections by combining a common CNN with a transformer architecture. During training, bipartite matching uniquely assigns predictions with ground truth boxes. Prediction with no match should yield a “no object” (∅) class prediction.

This paper presents a new method that treats object detection as a direct set prediction problem.

The proposed method streamlines the detection pipeline, removing the need for hand-designed components such as non-maximum suppression (NMS) and anchor generation.

The main components of the proposed method, DETR (DEtection TRansformer), are a global set-based loss that forces unique predictions via bipartite matching, and a transformer encoder-decoder architecture.

The new model is conceptually simple and, unlike many other modern detectors, does not require a specialized library.

On the COCO dataset, DETR matches a highly optimized Faster R-CNN baseline in both accuracy and run-time, and it performs better on large objects.

In addition, DETR can be easily generalized to panoptic segmentation in a unified manner.

2. Unique methodology

Fig. 2: DETR uses a conventional CNN backbone to learn a 2D representation of an input image. The model flattens it and supplements it with a positional encoding before passing it into a transformer encoder. A transformer decoder then takes as input a small fixed number of learned positional embeddings, which we call object queries, and additionally attends to the encoder output. We pass each output embedding of the decoder to a shared feed forward network (FFN) that predicts either a detection (class and bounding box) or a “no object” class.

DETR predicts all objects at once and is trained end-to-end with a set loss function that performs bipartite matching between predicted and ground-truth objects.
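To make the idea of bipartite matching concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: it brute-forces all permutations instead of running the Hungarian algorithm, it uses only a class-probability term plus an L1 box term as the matching cost (the generalized IoU term is left out), and the names `match_cost`, `best_assignment`, and the dictionary layout are hypothetical choices for illustration.

```python
from itertools import permutations

NO_OBJECT = None  # stands in for the "no object" (∅) class


def match_cost(pred, target):
    """Cost of assigning one prediction to one ground-truth slot.

    pred:   {"probs": {class_name: probability}, "box": (cx, cy, w, h)}
    target: {"label": class_name, "box": (cx, cy, w, h)} or NO_OBJECT
    """
    if target is NO_OBJECT:
        return 0.0  # padded slots add nothing to the matching cost
    class_cost = -pred["probs"].get(target["label"], 0.0)
    l1_cost = sum(abs(p - t) for p, t in zip(pred["box"], target["box"]))
    return class_cost + l1_cost


def best_assignment(preds, targets):
    """Return (sigma, cost); sigma[i] is the prediction matched to slot i.

    Brute force over all N! permutations: fine for a toy example only.
    The Hungarian algorithm finds the same optimum in polynomial time.
    """
    # Pad the ground truth with "no object" so both sets have size N.
    padded = list(targets) + [NO_OBJECT] * (len(preds) - len(targets))
    best_sigma, best_cost = None, float("inf")
    for sigma in permutations(range(len(preds))):
        cost = sum(match_cost(preds[j], padded[i]) for i, j in enumerate(sigma))
        if cost < best_cost:
            best_sigma, best_cost = sigma, cost
    return best_sigma, best_cost


# Toy example: N = 3 predictions, 2 ground-truth objects (made-up values).
preds = [
    {"probs": {"cat": 0.1, "dog": 0.8}, "box": (0.7, 0.7, 0.2, 0.2)},
    {"probs": {"cat": 0.9, "dog": 0.05}, "box": (0.3, 0.3, 0.2, 0.2)},
    {"probs": {"cat": 0.2, "dog": 0.1}, "box": (0.5, 0.5, 0.9, 0.9)},
]
targets = [
    {"label": "cat", "box": (0.3, 0.3, 0.2, 0.2)},
    {"label": "dog", "box": (0.7, 0.7, 0.2, 0.2)},
]
sigma, cost = best_assignment(preds, targets)
print(sigma)  # prediction index assigned to each (padded) ground-truth slot
```

In this toy case the cat is matched to prediction 1, the dog to prediction 0, and the leftover prediction 2 falls into the padded slot, which is the prediction that should learn to output "no object".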

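object detection set prediction loss : DETR produces a fixed set of N predictions and uses the Hungarian algorithm to find the optimal bipartite matching between predictions and ground-truth objects. (N is set significantly larger than the typical number of objects in an image.) The class prediction loss and the box prediction loss are then summed to give the final loss, and the weight of the "no object" class is reduced to compensate for class imbalance.
bounding box loss : DETR predicts bounding boxes directly. Because an L1 loss alone takes different scales for small and large boxes even when their relative errors are similar, DETR combines the L1 loss with a generalized IoU loss to mitigate this scale problem.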
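The combination of L1 and generalized IoU in the box loss can be sketched as follows. This is an illustrative plain-Python version, assuming boxes in normalized (cx, cy, w, h) format with positive area; the function names and the default weights `w_l1` and `w_giou` are assumptions made for this example, not values confirmed by this post.

```python
def to_corners(box):
    """Convert (cx, cy, w, h) to (x_min, y_min, x_max, y_max)."""
    cx, cy, w, h = box
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


def area(corners):
    """Area of a corner-format box; 0 if the box is empty."""
    x0, y0, x1, y1 = corners
    return max(0.0, x1 - x0) * max(0.0, y1 - y0)


def generalized_iou(box_a, box_b):
    """GIoU = IoU - (hull area - union area) / hull area, in [-1, 1]."""
    a, b = to_corners(box_a), to_corners(box_b)
    inter = area((max(a[0], b[0]), max(a[1], b[1]),
                  min(a[2], b[2]), min(a[3], b[3])))
    union = area(a) + area(b) - inter
    # Smallest axis-aligned box enclosing both inputs.
    hull = area((min(a[0], b[0]), min(a[1], b[1]),
                 max(a[2], b[2]), max(a[3], b[3])))
    return inter / union - (hull - union) / hull


def l1_distance(box_a, box_b):
    return sum(abs(p - t) for p, t in zip(box_a, box_b))


def box_loss(pred, target, w_l1=5.0, w_giou=2.0):
    """Weighted sum of L1 and (1 - GIoU); the weights are illustrative."""
    return (w_l1 * l1_distance(pred, target)
            + w_giou * (1.0 - generalized_iou(pred, target)))


# The same relative shift (half a box width) applied at two scales.
small_t, small_p = (0.5, 0.5, 0.1, 0.1), (0.55, 0.5, 0.1, 0.1)
large_t, large_p = (0.5, 0.5, 0.4, 0.4), (0.7, 0.5, 0.4, 0.4)

l1_small, l1_large = l1_distance(small_p, small_t), l1_distance(large_p, large_t)
giou_small = generalized_iou(small_p, small_t)
giou_large = generalized_iou(large_p, large_t)
print(l1_small, l1_large)      # L1 grows with box size
print(giou_small, giou_large)  # GIoU is (numerically almost) the same
```

The example shifts a small and a large box by half of their own width: the L1 term is four times larger for the large box, while the generalized IoU is essentially identical, which is why the two terms complement each other.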

Furthermore, unlike most existing detection methods, DETR does not require any customized layers, so it can be easily reproduced in any framework that provides standard CNN and transformer classes.

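DETR consists of three main components.

CNN backbone : extracts a compact feature representation from the input image.
encoder-decoder transformer : the main model, which processes and reasons over the features extracted from the image.
feed forward network (FFN) : produces the final detection predictions.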
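As a rough guide to how data flows through these three components, the sketch below only tracks tensor shapes; it does not run a model. The function `detr_shapes` is hypothetical, and its defaults (a backbone stride of 32 with 2048 output channels, a hidden dimension of 256, and 100 object queries) are assumptions chosen to resemble a ResNet-50 based setup, with image sides assumed to be divisible by the stride.

```python
def detr_shapes(height, width, num_classes, num_queries=100,
                hidden_dim=256, backbone_channels=2048, stride=32):
    """Trace the shapes produced by each stage for one input image.

    All defaults are illustrative assumptions, not values from this post.
    """
    if height % stride or width % stride:
        raise ValueError("this sketch assumes sides divisible by the stride")
    fh, fw = height // stride, width // stride
    return {
        # CNN backbone: a low-resolution, high-channel feature map.
        "backbone_features": (backbone_channels, fh, fw),
        # Flattened into a sequence of fh * fw tokens for the encoder.
        "encoder_input": (fh * fw, hidden_dim),
        # Decoder: one output embedding per object query.
        "decoder_output": (num_queries, hidden_dim),
        # Shared FFN heads: class scores (+1 for "no object") and boxes.
        "class_logits": (num_queries, num_classes + 1),
        "boxes": (num_queries, 4),
    }


shapes = detr_shapes(640, 960, num_classes=80)
for name, shape in shapes.items():
    print(f"{name:18s} {shape}")
```

Note that the number of predictions depends only on the number of object queries, not on the image size, which is what makes the output a fixed-size set.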

3. Results

Table 1: Comparison with Faster R-CNN with a ResNet-50 and ResNet-101 backbones on the COCO validation set. The top section shows results for Faster R-CNN models in Detectron2 [50], the middle section shows results for Faster R-CNN models with GIoU [38], random crops train-time augmentation, and the long 9x training schedule. DETR models achieve comparable results to heavily tuned Faster R-CNN baselines, having lower APS but greatly improved APL. We use torchscript Faster R-CNN and DETR models to measure FLOPS and FPS. Results without R101 in the name correspond to ResNet-50.

Table 1 compares DETR and Faster R-CNN on the COCO validation set. DETR achieves results comparable to the heavily tuned Faster R-CNN baselines: its AP on small objects (APS) is lower, but its AP on large objects (APL) is substantially higher.

Table 2: Effect of encoder size. Each row corresponds to a model with varied number of encoder layers and fixed number of decoder layers. Performance gradually improves with more encoder layers.

Table 2 shows how the encoder size affects performance. Without any encoder layers, overall AP drops by 3.9 points, and AP on large objects drops even more, by 6.0 points. This indicates that the encoder plays an important role in processing global image information.

Fig. 3: Encoder self-attention for a set of reference points. The encoder is able to separate individual instances. Predictions are made with baseline DETR model on a validation set image.

Figure 3 visualizes the attention maps of the last encoder layer of a trained model.
