# Object detection with Vision Transformers

**Author:** Karan V. Dave
**Date created:** 2022/03/27
**Last modified:** 2023/11/20
**Description:** A simple Keras implementation of object detection using Vision Transformers.
## Introduction
The Vision Transformer (ViT) architecture, introduced by Alexey Dosovitskiy et al., demonstrates that a pure transformer applied directly to sequences of image patches can perform well on image classification tasks. In this Keras example, we adapt the ViT architecture for object detection and train it on the Caltech 101 dataset to detect an airplane in a given image.
## Imports and setup
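A minimal set of imports for this workflow might look like the following sketch. It assumes a TensorFlow-backed Keras setup, with `scipy` used later to read the `.mat` bounding box annotation files:

```python
import os

import matplotlib.pyplot as plt
import numpy as np
import scipy.io  # reads the .mat bounding box annotation files
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
```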
## Prepare dataset
We use the Caltech 101 dataset, specifically the airplane images together with their bounding box annotations.
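The sketch below assumes the airplane images and their annotation files have already been downloaded and extracted to the hypothetical local directories `image_dir` and `annot_dir`, and that each `.mat` file stores its box as `[top, bottom, left, right]` pixel coordinates. Each image is resized to a fixed square size, and its bounding box is rescaled to relative `[0, 1]` coordinates so the targets are independent of the original image size:

```python
# Hypothetical paths: the dataset must already be downloaded and extracted here.
image_dir = "caltech101/airplanes"
annot_dir = "caltech101/annotations/Airplanes_Side_2"

IMAGE_SIZE = 224  # every image is resized to IMAGE_SIZE x IMAGE_SIZE

images, targets = [], []

for annot_file in sorted(os.listdir(annot_dir)):
    # Assumed annotation layout: box_coord = [top, bottom, left, right].
    box = scipy.io.loadmat(os.path.join(annot_dir, annot_file))["box_coord"][0]
    top, bottom, left, right = box

    # Annotation files are assumed to map to images by their numeric suffix,
    # e.g. "annotation_0001.mat" -> "image_0001.jpg".
    image_file = "image_" + annot_file.split("_")[-1].split(".")[0] + ".jpg"
    image = keras.utils.load_img(os.path.join(image_dir, image_file))
    width, height = image.size

    images.append(keras.utils.img_to_array(image.resize((IMAGE_SIZE, IMAGE_SIZE))))
    # Bounding box expressed as (x_min, y_min, x_max, y_max), scaled to [0, 1].
    targets.append([left / width, top / height, right / width, bottom / height])

x_train = np.asarray(images, dtype="float32")
y_train = np.asarray(targets, dtype="float32")
```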
## Implement the multilayer perceptron (MLP)
We use the code from the Keras example Image classification with Vision Transformer as a reference.
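Following that example, the MLP is a small stack of `Dense` and `Dropout` layers. A minimal sketch:

```python
def mlp(x, hidden_units, dropout_rate):
    """Projects x through a stack of Dense layers with GELU activations."""
    for units in hidden_units:
        x = layers.Dense(units, activation="gelu")(x)
        x = layers.Dropout(dropout_rate)(x)
    return x
```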
## Implement the patch creation layer
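One way to sketch the patch creation layer is with `tf.image.extract_patches`, which slides a window over the image and flattens each non-overlapping square patch into a vector (the patch size is a hyperparameter):

```python
class Patches(layers.Layer):
    """Splits an image into non-overlapping square patches and flattens each one."""

    def __init__(self, patch_size, **kwargs):
        super().__init__(**kwargs)
        self.patch_size = patch_size

    def call(self, images):
        batch_size = tf.shape(images)[0]
        patches = tf.image.extract_patches(
            images=images,
            sizes=[1, self.patch_size, self.patch_size, 1],
            strides=[1, self.patch_size, self.patch_size, 1],
            rates=[1, 1, 1, 1],
            padding="VALID",
        )
        # Flatten the spatial grid of patches into a sequence of shape
        # (batch_size, num_patches, patch_size * patch_size * channels).
        patch_dims = patches.shape[-1]
        return tf.reshape(patches, [batch_size, -1, patch_dims])
```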
## Display patches for an input image
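Reusing the `x_train` array and the `Patches` layer from the sketches above (and an assumed patch size of 32, so a 224 x 224 image yields a 7 x 7 grid of patches), the patches of the first training image can be visualized like this:

```python
patch_size = 32  # assumed patch size

# Show the original (resized) image.
plt.figure(figsize=(4, 4))
plt.imshow(x_train[0].astype("uint8"))
plt.axis("off")

patches = Patches(patch_size)(tf.convert_to_tensor(x_train[:1]))
print(f"Image size: {IMAGE_SIZE} x {IMAGE_SIZE}")
print(f"Patch size: {patch_size} x {patch_size}")
print(f"Patches per image: {patches.shape[1]}")

# Show each flattened patch reshaped back into a small image tile.
n = int(np.sqrt(patches.shape[1]))
plt.figure(figsize=(4, 4))
for i, patch in enumerate(patches[0]):
    ax = plt.subplot(n, n, i + 1)
    patch_image = tf.reshape(patch, (patch_size, patch_size, 3))
    plt.imshow(patch_image.numpy().astype("uint8"))
    plt.axis("off")
plt.show()
```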
## Build the ViT model
The ViT model consists of multiple Transformer blocks. Each block applies a MultiHeadAttention layer as self-attention over the sequence of image patches, adds the result back to the encoded patches through a skip connection, normalizes it, and feeds it into a multilayer perceptron (MLP), with a second skip connection around the MLP. The model's head outputs four values representing the bounding box coordinates of the object.
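A sketch of a model builder along these lines follows. It assumes the flattened patches are first linearly projected and given learned positional embeddings (as in the referenced classification example); the hyperparameter values are illustrative defaults rather than the original example's exact settings:

```python
class PatchEncoder(layers.Layer):
    """Linearly projects flattened patches and adds learned positional embeddings."""

    def __init__(self, num_patches, projection_dim, **kwargs):
        super().__init__(**kwargs)
        self.num_patches = num_patches
        self.projection = layers.Dense(units=projection_dim)
        self.position_embedding = layers.Embedding(
            input_dim=num_patches, output_dim=projection_dim
        )

    def call(self, patches):
        positions = tf.range(start=0, limit=self.num_patches, delta=1)
        return self.projection(patches) + self.position_embedding(positions)


def create_vit_object_detector(
    input_shape=(224, 224, 3),
    patch_size=32,
    num_patches=49,
    projection_dim=64,
    num_heads=4,
    transformer_layers=4,
    mlp_head_units=(2048, 1024),
):
    inputs = keras.Input(shape=input_shape)
    patches = Patches(patch_size)(inputs)
    encoded_patches = PatchEncoder(num_patches, projection_dim)(patches)

    for _ in range(transformer_layers):
        # Layer norm -> multi-head self-attention over the patch sequence,
        # then a skip connection back to the block input.
        x1 = layers.LayerNormalization(epsilon=1e-6)(encoded_patches)
        attention_output = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=projection_dim, dropout=0.1
        )(x1, x1)
        x2 = layers.Add()([attention_output, encoded_patches])

        # Layer norm -> MLP, with a second skip connection.
        x3 = layers.LayerNormalization(epsilon=1e-6)(x2)
        x3 = mlp(x3, hidden_units=[projection_dim * 2, projection_dim], dropout_rate=0.1)
        encoded_patches = layers.Add()([x3, x2])

    # Flatten the token representations and regress the four box coordinates.
    representation = layers.LayerNormalization(epsilon=1e-6)(encoded_patches)
    representation = layers.Flatten()(representation)
    representation = layers.Dropout(0.3)(representation)
    features = mlp(representation, hidden_units=list(mlp_head_units), dropout_rate=0.3)
    bounding_box = layers.Dense(4)(features)  # (x_min, y_min, x_max, y_max) in [0, 1]

    return keras.Model(inputs=inputs, outputs=bounding_box)
```

Such a model can then be compiled with a regression loss on the box coordinates (for example, mean squared error) and trained on the `x_train` / `y_train` arrays from the dataset sketch above.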