Image classification with Vision Transformer
Author: Khalid Salama
Date created: 2021/01/18
Last modified: 2021/01/18
Description: Implementing the Vision Transformer (ViT) model for image classification.
Introduction
This example implements the Vision Transformer (ViT) model by Alexey Dosovitskiy et al. for image classification, and demonstrates it on the CIFAR-100 dataset. The ViT model applies the Transformer architecture with self-attention to sequences of image patches, without using convolution layers.
Setup
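A minimal setup sketch, assuming Keras 3 (with the `keras.ops` namespace), NumPy, and Matplotlib; the exact imports in the original example may differ.

```python
import numpy as np
import matplotlib.pyplot as plt

import keras
from keras import layers
from keras import ops
```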
Prepare the data
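A sketch of the data-preparation step, assuming CIFAR-100 is loaded through `keras.datasets` as described in the introduction; the shape constants simply follow the dataset.

```python
num_classes = 100
input_shape = (32, 32, 3)

# Load CIFAR-100 as (image, label) arrays for training and testing.
(x_train, y_train), (x_test, y_test) = keras.datasets.cifar100.load_data()

print(f"x_train shape: {x_train.shape} - y_train shape: {y_train.shape}")
print(f"x_test shape: {x_test.shape} - y_test shape: {y_test.shape}")
```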
Use data augmentation
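A possible augmentation pipeline built from standard Keras preprocessing layers; the target `image_size` and the specific augmentation factors below are assumed values, not taken from the original text.

```python
image_size = 72  # assumed: images are resized to 72x72 before patching

data_augmentation = keras.Sequential(
    [
        layers.Normalization(),
        layers.Resizing(image_size, image_size),
        layers.RandomFlip("horizontal"),
        layers.RandomRotation(factor=0.02),
        layers.RandomZoom(height_factor=0.2, width_factor=0.2),
    ],
    name="data_augmentation",
)
# Compute the mean and variance of the training data for normalization.
data_augmentation.layers[0].adapt(x_train)
```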
Implement multilayer perceptron (MLP)
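One way to write the MLP as a small helper that is reused inside the Transformer blocks and in the classifier head; the GELU activation and dropout are assumptions consistent with common ViT implementations.

```python
def mlp(x, hidden_units, dropout_rate):
    # Feed-forward block: a stack of Dense + Dropout layers.
    for units in hidden_units:
        x = layers.Dense(units, activation=keras.activations.gelu)(x)
        x = layers.Dropout(dropout_rate)(x)
    return x
```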
Implement patch creation as a layer
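A sketch of the patch-creation layer, assuming `keras.ops.image.extract_patches` is used to split each image into non-overlapping patch_size x patch_size patches that are then flattened into a sequence of vectors.

```python
class Patches(layers.Layer):
    """Split an image into flattened, non-overlapping square patches."""

    def __init__(self, patch_size, **kwargs):
        super().__init__(**kwargs)
        self.patch_size = patch_size

    def call(self, images):
        # images: (batch, height, width, channels); spatial dims assumed static.
        height = images.shape[1]
        width = images.shape[2]
        channels = images.shape[3]
        num_patches = (height // self.patch_size) * (width // self.patch_size)
        patch_dim = self.patch_size * self.patch_size * channels
        patches = ops.image.extract_patches(images, size=self.patch_size)
        # Flatten the patch grid into a sequence of patch vectors.
        return ops.reshape(patches, (-1, num_patches, patch_dim))
```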
Let's display patches for a sample image
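An illustrative snippet for visualizing the patches of a single training image with Matplotlib; `patch_size = 6` is an assumed value (a 72x72 image then yields 12 x 12 = 144 patches).

```python
patch_size = 6  # assumed patch size

plt.figure(figsize=(4, 4))
image = x_train[np.random.choice(range(x_train.shape[0]))]
plt.imshow(image.astype("uint8"))
plt.axis("off")

# Resize the image to the model's input resolution and extract its patches.
resized_image = ops.image.resize(
    ops.convert_to_tensor([image]), size=(image_size, image_size)
)
patches = ops.convert_to_numpy(Patches(patch_size)(resized_image))
print(f"Image size: {image_size} X {image_size}")
print(f"Patch size: {patch_size} X {patch_size}")
print(f"Patches per image: {patches.shape[1]}")
print(f"Elements per patch: {patches.shape[-1]}")

# Plot the patches in their original grid arrangement.
n = int(np.sqrt(patches.shape[1]))
plt.figure(figsize=(4, 4))
for i, patch in enumerate(patches[0]):
    ax = plt.subplot(n, n, i + 1)
    patch_img = patch.reshape(patch_size, patch_size, 3)
    plt.imshow(patch_img.astype("uint8"))
    plt.axis("off")
plt.show()
```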
Build the ViT model
The ViT model consists of multiple Transformer blocks, which use the layers.MultiHeadAttention layer as a self-attention mechanism applied to the sequence of patches. The Transformer blocks produce a [batch_size, num_patches, projection_dim] tensor, which is processed via a classifier head with softmax to produce the final class probabilities output.
Unlike the technique described in the paper, which prepends a learnable embedding to the sequence of encoded patches to serve as the image representation, all the outputs of the final Transformer block are reshaped with layers.Flatten()
and used as the image representation input to the classifier head. Note that the layers.GlobalAveragePooling1D
layer could also be used instead to aggregate the outputs of the Transformer block, especially when the number of patches and the projection dimensions are large.
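The sketch below follows that description: flattened patches are linearly projected and given a learnable position embedding, passed through a stack of Transformer blocks built around layers.MultiHeadAttention, and the final sequence is flattened (rather than read off a learnable [class] token) before the MLP classifier head. All hyperparameter values here are assumptions for illustration, and the head returns logits (the softmax is applied implicitly through a from-logits loss at training time).

```python
# Assumed hyperparameters for illustration.
projection_dim = 64
num_heads = 4
transformer_units = [projection_dim * 2, projection_dim]  # Transformer MLP sizes
transformer_layers = 8
mlp_head_units = [2048, 1024]  # classifier head MLP sizes
num_patches = (image_size // patch_size) ** 2


class PatchEncoder(layers.Layer):
    """Linearly project each patch and add a learnable position embedding."""

    def __init__(self, num_patches, projection_dim, **kwargs):
        super().__init__(**kwargs)
        self.num_patches = num_patches
        self.projection = layers.Dense(units=projection_dim)
        self.position_embedding = layers.Embedding(
            input_dim=num_patches, output_dim=projection_dim
        )

    def call(self, patches):
        positions = ops.arange(start=0, stop=self.num_patches, step=1)
        return self.projection(patches) + self.position_embedding(positions)


def create_vit_classifier():
    inputs = keras.Input(shape=input_shape)
    # Augment, patch, and encode the input images.
    augmented = data_augmentation(inputs)
    patches = Patches(patch_size)(augmented)
    encoded_patches = PatchEncoder(num_patches, projection_dim)(patches)

    # Stack of Transformer blocks with pre-layer normalization and residuals.
    for _ in range(transformer_layers):
        x1 = layers.LayerNormalization(epsilon=1e-6)(encoded_patches)
        attention_output = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=projection_dim, dropout=0.1
        )(x1, x1)
        x2 = layers.Add()([attention_output, encoded_patches])
        x3 = layers.LayerNormalization(epsilon=1e-6)(x2)
        x3 = mlp(x3, hidden_units=transformer_units, dropout_rate=0.1)
        encoded_patches = layers.Add()([x3, x2])

    # Flatten the final sequence of patch representations instead of a [class] token.
    representation = layers.LayerNormalization(epsilon=1e-6)(encoded_patches)
    representation = layers.Flatten()(representation)
    representation = layers.Dropout(0.5)(representation)
    # Classifier head producing one logit per class.
    features = mlp(representation, hidden_units=mlp_head_units, dropout_rate=0.5)
    logits = layers.Dense(num_classes)(features)
    return keras.Model(inputs=inputs, outputs=logits)
```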
Compile, train, and evaluate the model
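A sketch of the training and evaluation loop, assuming an AdamW optimizer and sparse categorical cross-entropy on logits; the learning rate, weight decay, batch size, and number of epochs are assumed values (a short run like this yields the modest accuracies shown below).

```python
# Assumed training hyperparameters.
learning_rate = 0.001
weight_decay = 0.0001
batch_size = 256
num_epochs = 10  # a longer run is needed for competitive accuracy


def run_experiment(model):
    model.compile(
        optimizer=keras.optimizers.AdamW(
            learning_rate=learning_rate, weight_decay=weight_decay
        ),
        loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=[
            keras.metrics.SparseCategoricalAccuracy(name="accuracy"),
            keras.metrics.SparseTopKCategoricalAccuracy(5, name="top-5-accuracy"),
        ],
    )
    history = model.fit(
        x=x_train,
        y=y_train,
        batch_size=batch_size,
        epochs=num_epochs,
        validation_split=0.1,
    )
    # Evaluate on the held-out test set.
    _, accuracy, top_5_accuracy = model.evaluate(x_test, y_test)
    print(f"Test accuracy: {round(accuracy * 100, 2)}%")
    print(f"Test top 5 accuracy: {round(top_5_accuracy * 100, 2)}%")
    return history


vit_classifier = create_vit_classifier()
history = run_experiment(vit_classifier)
```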
313/313 ━━━━━━━━━━━━━━━━━━━━ 66s 198ms/step - accuracy: 0.1001 - loss: 3.8428 - top-5-accuracy: 0.3107
Test accuracy: 10.61%
Test top 5 accuracy: 31.51%