"""
Title: Image classification with Vision Transformer
Author: [Khalid Salama](https://www.linkedin.com/in/khalid-salama-24403144/)
Date created: 2021/01/18
Last modified: 2021/01/18
Description: Implementing the Vision Transformer (ViT) model for image classification.
Accelerator: GPU
"""

"""
## Introduction

This example implements the [Vision Transformer (ViT)](https://arxiv.org/abs/2010.11929)
model by Alexey Dosovitskiy et al. for image classification,
and demonstrates it on the CIFAR-100 dataset.
The ViT model applies the Transformer architecture with self-attention to sequences of
image patches, without using convolution layers.
"""

"""
## Setup
"""

import os

os.environ["KERAS_BACKEND"] = "jax"  # @param ["tensorflow", "jax", "torch"]

import keras
from keras import layers
from keras import ops

import numpy as np
import matplotlib.pyplot as plt

"""
## Prepare the data
"""

num_classes = 100
input_shape = (32, 32, 3)

(x_train, y_train), (x_test, y_test) = keras.datasets.cifar100.load_data()

print(f"x_train shape: {x_train.shape} - y_train shape: {y_train.shape}")
print(f"x_test shape: {x_test.shape} - y_test shape: {y_test.shape}")


"""
## Configure the hyperparameters
"""

learning_rate = 0.001
weight_decay = 0.0001
batch_size = 256
num_epochs = 10  # For real training, use num_epochs=100. 10 is a test value
image_size = 72  # We'll resize input images to this size
patch_size = 6  # Size of the patches to be extracted from the input images
num_patches = (image_size // patch_size) ** 2
projection_dim = 64
num_heads = 4
transformer_units = [
    projection_dim * 2,
    projection_dim,
]  # Size of the transformer layers
transformer_layers = 8
mlp_head_units = [
    2048,
    1024,
]  # Size of the dense layers of the final classifier


"""
## Use data augmentation
"""

data_augmentation = keras.Sequential(
    [
        layers.Normalization(),
        layers.Resizing(image_size, image_size),
        layers.RandomFlip("horizontal"),
        layers.RandomRotation(factor=0.02),
        layers.RandomZoom(height_factor=0.2, width_factor=0.2),
    ],
    name="data_augmentation",
)
# Compute the mean and the variance of the training data for normalization.
data_augmentation.layers[0].adapt(x_train)


"""
## Implement multilayer perceptron (MLP)
"""


def mlp(x, hidden_units, dropout_rate):
    for units in hidden_units:
        x = layers.Dense(units, activation=keras.activations.gelu)(x)
        x = layers.Dropout(dropout_rate)(x)
    return x


"""
## Implement patch creation as a layer
"""


class Patches(layers.Layer):
    def __init__(self, patch_size):
        super().__init__()
        self.patch_size = patch_size

    def call(self, images):
        input_shape = ops.shape(images)
        batch_size = input_shape[0]
        height = input_shape[1]
        width = input_shape[2]
        channels = input_shape[3]
        num_patches_h = height // self.patch_size
        num_patches_w = width // self.patch_size
        patches = keras.ops.image.extract_patches(images, size=self.patch_size)
        patches = ops.reshape(
            patches,
            (
                batch_size,
                num_patches_h * num_patches_w,
                self.patch_size * self.patch_size * channels,
            ),
        )
        return patches

    def get_config(self):
        config = super().get_config()
        config.update({"patch_size": self.patch_size})
        return config


"""
Let's display patches for a sample image.
"""

plt.figure(figsize=(4, 4))
image = x_train[np.random.choice(range(x_train.shape[0]))]
plt.imshow(image.astype("uint8"))
plt.axis("off")

resized_image = ops.image.resize(
    ops.convert_to_tensor([image]), size=(image_size, image_size)
)
patches = Patches(patch_size)(resized_image)
print(f"Image size: {image_size} X {image_size}")
print(f"Patch size: {patch_size} X {patch_size}")
print(f"Patches per image: {patches.shape[1]}")
print(f"Elements per patch: {patches.shape[-1]}")

n = int(np.sqrt(patches.shape[1]))
plt.figure(figsize=(4, 4))
for i, patch in enumerate(patches[0]):
    ax = plt.subplot(n, n, i + 1)
    patch_img = ops.reshape(patch, (patch_size, patch_size, 3))
    plt.imshow(ops.convert_to_numpy(patch_img).astype("uint8"))
    plt.axis("off")

"""
## Implement the patch encoding layer

The `PatchEncoder` layer will linearly transform a patch by projecting it into a
vector of size `projection_dim`. In addition, it adds a learnable position
embedding to the projected vector.
"""


class PatchEncoder(layers.Layer):
    def __init__(self, num_patches, projection_dim):
        super().__init__()
        self.num_patches = num_patches
        self.projection = layers.Dense(units=projection_dim)
        self.position_embedding = layers.Embedding(
            input_dim=num_patches, output_dim=projection_dim
        )

    def call(self, patch):
        positions = ops.expand_dims(
            ops.arange(start=0, stop=self.num_patches, step=1), axis=0
        )
        projected_patches = self.projection(patch)
        encoded = projected_patches + self.position_embedding(positions)
        return encoded

    def get_config(self):
        config = super().get_config()
        config.update({"num_patches": self.num_patches})
        return config

"""
## Build the ViT model

The ViT model consists of multiple Transformer blocks,
which use the `layers.MultiHeadAttention` layer as a self-attention mechanism
applied to the sequence of patches. The Transformer blocks produce a
`[batch_size, num_patches, projection_dim]` tensor, which is processed via a
classifier head to produce the final class logits.

Unlike the technique described in the [paper](https://arxiv.org/abs/2010.11929),
which prepends a learnable embedding to the sequence of encoded patches to serve
as the image representation, all the outputs of the final Transformer block are
reshaped with `layers.Flatten()` and used as the image
representation input to the classifier head.
Note that the `layers.GlobalAveragePooling1D` layer
could also be used instead to aggregate the outputs of the Transformer block,
especially when the number of patches and the projection dimensions are large;
a sketch of this variant follows the model-building function below.
"""


def create_vit_classifier():
    inputs = keras.Input(shape=input_shape)
    # Augment data.
    augmented = data_augmentation(inputs)
    # Create patches.
    patches = Patches(patch_size)(augmented)
    # Encode patches.
    encoded_patches = PatchEncoder(num_patches, projection_dim)(patches)

    # Create multiple layers of the Transformer block.
    for _ in range(transformer_layers):
        # Layer normalization 1.
        x1 = layers.LayerNormalization(epsilon=1e-6)(encoded_patches)
        # Create a multi-head attention layer.
        attention_output = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=projection_dim, dropout=0.1
        )(x1, x1)
        # Skip connection 1.
        x2 = layers.Add()([attention_output, encoded_patches])
        # Layer normalization 2.
        x3 = layers.LayerNormalization(epsilon=1e-6)(x2)
        # MLP.
        x3 = mlp(x3, hidden_units=transformer_units, dropout_rate=0.1)
        # Skip connection 2.
        encoded_patches = layers.Add()([x3, x2])

    # Create a [batch_size, projection_dim] tensor.
    representation = layers.LayerNormalization(epsilon=1e-6)(encoded_patches)
    representation = layers.Flatten()(representation)
    representation = layers.Dropout(0.5)(representation)
    # Add MLP.
    features = mlp(representation, hidden_units=mlp_head_units, dropout_rate=0.5)
    # Classify outputs.
    logits = layers.Dense(num_classes)(features)
    # Create the Keras model.
    model = keras.Model(inputs=inputs, outputs=logits)
    return model
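
"""
As mentioned above, the paper prepends a learnable class embedding to the patch
sequence, while this example flattens the final Transformer outputs. The sketch
below shows the `layers.GlobalAveragePooling1D` alternative: it is identical to
`create_vit_classifier` except for the aggregation step. The function name
`create_vit_classifier_gap` is only illustrative, and the function is defined
here for reference; it is not used in the experiment below.
"""


def create_vit_classifier_gap():
    inputs = keras.Input(shape=input_shape)
    augmented = data_augmentation(inputs)
    patches = Patches(patch_size)(augmented)
    encoded_patches = PatchEncoder(num_patches, projection_dim)(patches)
    # Same stack of Transformer blocks as in `create_vit_classifier`.
    for _ in range(transformer_layers):
        x1 = layers.LayerNormalization(epsilon=1e-6)(encoded_patches)
        attention_output = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=projection_dim, dropout=0.1
        )(x1, x1)
        x2 = layers.Add()([attention_output, encoded_patches])
        x3 = layers.LayerNormalization(epsilon=1e-6)(x2)
        x3 = mlp(x3, hidden_units=transformer_units, dropout_rate=0.1)
        encoded_patches = layers.Add()([x3, x2])
    # Average over the patch dimension instead of flattening, turning the
    # [batch_size, num_patches, projection_dim] tensor into a
    # [batch_size, projection_dim] tensor.
    representation = layers.LayerNormalization(epsilon=1e-6)(encoded_patches)
    representation = layers.GlobalAveragePooling1D()(representation)
    representation = layers.Dropout(0.5)(representation)
    features = mlp(representation, hidden_units=mlp_head_units, dropout_rate=0.5)
    logits = layers.Dense(num_classes)(features)
    return keras.Model(inputs=inputs, outputs=logits)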

"""
## Compile, train, and evaluate the model
"""


def run_experiment(model):
    optimizer = keras.optimizers.AdamW(
        learning_rate=learning_rate, weight_decay=weight_decay
    )

    model.compile(
        optimizer=optimizer,
        loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=[
            keras.metrics.SparseCategoricalAccuracy(name="accuracy"),
            keras.metrics.SparseTopKCategoricalAccuracy(5, name="top-5-accuracy"),
        ],
    )

    checkpoint_filepath = "/tmp/checkpoint.weights.h5"
    checkpoint_callback = keras.callbacks.ModelCheckpoint(
        checkpoint_filepath,
        monitor="val_accuracy",
        save_best_only=True,
        save_weights_only=True,
    )

    history = model.fit(
        x=x_train,
        y=y_train,
        batch_size=batch_size,
        epochs=num_epochs,
        validation_split=0.1,
        callbacks=[checkpoint_callback],
    )

    model.load_weights(checkpoint_filepath)
    _, accuracy, top_5_accuracy = model.evaluate(x_test, y_test)
    print(f"Test accuracy: {round(accuracy * 100, 2)}%")
    print(f"Test top 5 accuracy: {round(top_5_accuracy * 100, 2)}%")

    return history


vit_classifier = create_vit_classifier()
history = run_experiment(vit_classifier)


def plot_history(item):
    plt.plot(history.history[item], label=item)
    plt.plot(history.history["val_" + item], label="val_" + item)
    plt.xlabel("Epochs")
    plt.ylabel(item)
    plt.title("Train and Validation {} Over Epochs".format(item), fontsize=14)
    plt.legend()
    plt.grid()
    plt.show()


plot_history("loss")
plot_history("top-5-accuracy")


"""
After 100 epochs, the ViT model achieves around 55% accuracy and
82% top-5 accuracy on the test data. These are not competitive results on the CIFAR-100 dataset,
as a ResNet50V2 trained from scratch on the same data can achieve 67% accuracy.

Note that the state-of-the-art results reported in the
[paper](https://arxiv.org/abs/2010.11929) are achieved by pre-training the ViT model using
the JFT-300M dataset, then fine-tuning it on the target dataset. To improve the model quality
without pre-training, you can try to train the model for more epochs, use a larger number of
Transformer layers, resize the input images, change the patch size, or increase the projection dimensions.
In addition, as mentioned in the paper, the quality of the model is affected not only by architecture choices,
but also by parameters such as the learning rate schedule, optimizer, weight decay, etc.
In practice, it's recommended to fine-tune a ViT model
that was pre-trained using a large, high-resolution dataset.
"""
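
"""
The closing note above mentions that the learning rate schedule, optimizer, and weight
decay all affect model quality. As one example of such a tweak, the sketch below wires a
cosine-decay schedule into the same `AdamW` optimizer used in `run_experiment`. The
`total_steps` estimate and the variable names are illustrative choices, not settings from
the experiment above; to try it, pass `scheduled_optimizer` to `model.compile()`.
"""

# Roughly one schedule step per training batch over the full run, accounting for
# the 10% validation split used in `run_experiment`.
total_steps = int((len(x_train) * 0.9 / batch_size) * num_epochs)
scheduled_lr = keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=learning_rate, decay_steps=total_steps
)
scheduled_optimizer = keras.optimizers.AdamW(
    learning_rate=scheduled_lr, weight_decay=weight_decay
)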