CoCalc -- attention_mil_classification.py

GitHub Repository: keras-team/keras-io
Path: blob/master/examples/vision/attention_mil_classification.py
³⁵⁰⁷ views
1
"""
2
Title: Classification using Attention-based Deep Multiple Instance Learning (MIL).
3
Author: [Mohamad Jaber](https://www.linkedin.com/in/mohamadjaber1/)
4
Date created: 2021/08/16
5
Last modified: 2021/11/25
6
Description: MIL approach to classify bags of instances and get their individual instance score.
7
Accelerator: GPU
8
"""
9

10
"""
11
## Introduction
12

13
### What is Multiple Instance Learning (MIL)?
14

15
Usually, with supervised learning algorithms, the learner receives labels for a set of
16
instances. In the case of MIL, the learner receives labels for a set of bags, each of which
17
contains a set of instances. The bag is labeled positive if it contains at least
18
one positive instance, and negative if it does not contain any.
19

20
### Motivation
21

22
It is often assumed in image classification tasks that each image clearly represents a
23
class label. In medical imaging (e.g. computational pathology, etc.) an *entire image*
24
is represented by a single class label (cancerous/non-cancerous) or a region of interest
25
could be given. However, one will be interested in knowing which patterns in the image
26
is actually causing it to belong to that class. In this context, the image(s) will be
27
divided and the subimages will form the bag of instances.
28

29
Therefore, the goals are to:
30

31
1. Learn a model to predict a class label for a bag of instances.
32
2. Find out which instances within the bag caused a position class label
33
prediction.
34

35
### Implementation
36

37
The following steps describe how the model works:
38

39
1. The feature extractor layers extract feature embeddings.
40
2. The embeddings are fed into the MIL attention layer to get
41
the attention scores. The layer is designed as permutation-invariant.
42
3. Input features and their corresponding attention scores are multiplied together.
43
4. The resulting output is passed to a softmax function for classification.
44

45
### References
46

47
- [Attention-based Deep Multiple Instance Learning](https://arxiv.org/abs/1802.04712).
48
- Some of the attention operator code implementation was inspired from https://github.com/utayao/Atten_Deep_MIL.
49
- Imbalanced data [tutorial](https://www.tensorflow.org/tutorials/structured_data/imbalanced_data)
50
by TensorFlow.
51

52
"""
53
"""
54
## Setup
55
"""
56

57
import numpy as np
58
import keras
59
from keras import layers
60
from keras import ops
61
from tqdm import tqdm
62
from matplotlib import pyplot as plt
63

64
plt.style.use("ggplot")
65

66
"""
67
## Create dataset
68

69
We will create a set of bags and assign their labels according to their contents.
70
If at least one positive instance
71
is available in a bag, the bag is considered as a positive bag. If it does not contain any
72
positive instance, the bag will be considered as negative.
73

74
### Configuration parameters
75

76
- `POSITIVE_CLASS`: The desired class to be kept in the positive bag.
77
- `BAG_COUNT`: The number of training bags.
78
- `VAL_BAG_COUNT`: The number of validation bags.
79
- `BAG_SIZE`: The number of instances in a bag.
80
- `PLOT_SIZE`: The number of bags to plot.
81
- `ENSEMBLE_AVG_COUNT`: The number of models to create and average together. (Optional:
82
often results in better performance - set to 1 for single model)
83
"""
84

85
POSITIVE_CLASS = 1
86
BAG_COUNT = 1000
87
VAL_BAG_COUNT = 300
88
BAG_SIZE = 3
89
PLOT_SIZE = 3
90
ENSEMBLE_AVG_COUNT = 1
91

92
"""
93
### Prepare bags
94

95
Since the attention operator is a permutation-invariant operator, an instance with a
96
positive class label is randomly placed among the instances in the positive bag.
97
"""
98

99

100
def create_bags(input_data, input_labels, positive_class, bag_count, instance_count):
101
    # Set up bags.
102
    bags = []
103
    bag_labels = []
104

105
    # Normalize input data.
106
    input_data = np.divide(input_data, 255.0)
107

108
    # Count positive samples.
109
    count = 0
110

111
    for _ in range(bag_count):
112
        # Pick a fixed size random subset of samples.
113
        index = np.random.choice(input_data.shape[0], instance_count, replace=False)
114
        instances_data = input_data[index]
115
        instances_labels = input_labels[index]
116

117
        # By default, all bags are labeled as 0.
118
        bag_label = 0
119

120
        # Check if there is at least a positive class in the bag.
121
        if positive_class in instances_labels:
122
            # Positive bag will be labeled as 1.
123
            bag_label = 1
124
            count += 1
125

126
        bags.append(instances_data)
127
        bag_labels.append(np.array([bag_label]))
128

129
    print(f"Positive bags: {count}")
130
    print(f"Negative bags: {bag_count - count}")
131

132
    return (list(np.swapaxes(bags, 0, 1)), np.array(bag_labels))
133

134

135
# Load the MNIST dataset.
136
(x_train, y_train), (x_val, y_val) = keras.datasets.mnist.load_data()
137

138
# Create training data.
139
train_data, train_labels = create_bags(
140
    x_train, y_train, POSITIVE_CLASS, BAG_COUNT, BAG_SIZE
141
)
142

143
# Create validation data.
144
val_data, val_labels = create_bags(
145
    x_val, y_val, POSITIVE_CLASS, VAL_BAG_COUNT, BAG_SIZE
146
)
147

148
"""
149
## Create the model
150

151
We will now build the attention layer, prepare some utilities, then build and train the
152
entire model.
153

154
### Attention operator implementation
155

156
The output size of this layer is decided by the size of a single bag.
157

158
The attention mechanism uses a weighted average of instances in a bag, in which the sum
159
of the weights must equal to 1 (invariant of the bag size).
160

161
The weight matrices (parameters) are **w** and **v**. To include positive and negative
162
values, hyperbolic tangent element-wise non-linearity is utilized.
163

164
A **Gated attention mechanism** can be used to deal with complex relations. Another weight
165
matrix, **u**, is added to the computation.
166
A sigmoid non-linearity is used to overcome approximately linear behavior for *x* ∈ [−1, 1]
167
by hyperbolic tangent non-linearity.
168
"""
169

170

171
class MILAttentionLayer(layers.Layer):
172
    """Implementation of the attention-based Deep MIL layer.
173

174
    Args:
175
      weight_params_dim: Positive Integer. Dimension of the weight matrix.
176
      kernel_initializer: Initializer for the `kernel` matrix.
177
      kernel_regularizer: Regularizer function applied to the `kernel` matrix.
178
      use_gated: Boolean, whether or not to use the gated mechanism.
179

180
    Returns:
181
      List of 2D tensors with BAG_SIZE length.
182
      The tensors are the attention scores after softmax with shape `(batch_size, 1)`.
183
    """
184

185
    def __init__(
186
        self,
187
        weight_params_dim,
188
        kernel_initializer="glorot_uniform",
189
        kernel_regularizer=None,
190
        use_gated=False,
191
        **kwargs,
192
    ):
193
        super().__init__(**kwargs)
194

195
        self.weight_params_dim = weight_params_dim
196
        self.use_gated = use_gated
197

198
        self.kernel_initializer = keras.initializers.get(kernel_initializer)
199
        self.kernel_regularizer = keras.regularizers.get(kernel_regularizer)
200

201
        self.v_init = self.kernel_initializer
202
        self.w_init = self.kernel_initializer
203
        self.u_init = self.kernel_initializer
204

205
        self.v_regularizer = self.kernel_regularizer
206
        self.w_regularizer = self.kernel_regularizer
207
        self.u_regularizer = self.kernel_regularizer
208

209
    def build(self, input_shape):
210
        # Input shape.
211
        # List of 2D tensors with shape: (batch_size, input_dim).
212
        input_dim = input_shape[0][1]
213

214
        self.v_weight_params = self.add_weight(
215
            shape=(input_dim, self.weight_params_dim),
216
            initializer=self.v_init,
217
            name="v",
218
            regularizer=self.v_regularizer,
219
            trainable=True,
220
        )
221

222
        self.w_weight_params = self.add_weight(
223
            shape=(self.weight_params_dim, 1),
224
            initializer=self.w_init,
225
            name="w",
226
            regularizer=self.w_regularizer,
227
            trainable=True,
228
        )
229

230
        if self.use_gated:
231
            self.u_weight_params = self.add_weight(
232
                shape=(input_dim, self.weight_params_dim),
233
                initializer=self.u_init,
234
                name="u",
235
                regularizer=self.u_regularizer,
236
                trainable=True,
237
            )
238
        else:
239
            self.u_weight_params = None
240

241
        self.input_built = True
242

243
    def call(self, inputs):
244
        # Assigning variables from the number of inputs.
245
        instances = [self.compute_attention_scores(instance) for instance in inputs]
246

247
        # Stack instances into a single tensor.
248
        instances = ops.stack(instances)
249

250
        # Apply softmax over instances such that the output summation is equal to 1.
251
        alpha = ops.softmax(instances, axis=0)
252

253
        # Split to recreate the same array of tensors we had as inputs.
254
        return [alpha[i] for i in range(alpha.shape[0])]
255

256
    def compute_attention_scores(self, instance):
257
        # Reserve in-case "gated mechanism" used.
258
        original_instance = instance
259

260
        # tanh(v*h_k^T)
261
        instance = ops.tanh(ops.tensordot(instance, self.v_weight_params, axes=1))
262

263
        # for learning non-linear relations efficiently.
264
        if self.use_gated:
265
            instance = instance * ops.sigmoid(
266
                ops.tensordot(original_instance, self.u_weight_params, axes=1)
267
            )
268

269
        # w^T*(tanh(v*h_k^T)) / w^T*(tanh(v*h_k^T)*sigmoid(u*h_k^T))
270
        return ops.tensordot(instance, self.w_weight_params, axes=1)
271

272

273
"""
274
## Visualizer tool
275

276
Plot the number of bags (given by `PLOT_SIZE`) with respect to the class.
277

278
Moreover, if activated, the class label prediction with its associated instance score
279
for each bag (after the model has been trained) can be seen.
280
"""
281

282

283
def plot(data, labels, bag_class, predictions=None, attention_weights=None):
284
    """ "Utility for plotting bags and attention weights.
285

286
    Args:
287
      data: Input data that contains the bags of instances.
288
      labels: The associated bag labels of the input data.
289
      bag_class: String name of the desired bag class.
290
        The options are: "positive" or "negative".
291
      predictions: Class labels model predictions.
292
      If you don't specify anything, ground truth labels will be used.
293
      attention_weights: Attention weights for each instance within the input data.
294
      If you don't specify anything, the values won't be displayed.
295
    """
296
    return  ## TODO
297
    labels = np.array(labels).reshape(-1)
298

299
    if bag_class == "positive":
300
        if predictions is not None:
301
            labels = np.where(predictions.argmax(1) == 1)[0]
302
            bags = np.array(data)[:, labels[0:PLOT_SIZE]]
303

304
        else:
305
            labels = np.where(labels == 1)[0]
306
            bags = np.array(data)[:, labels[0:PLOT_SIZE]]
307

308
    elif bag_class == "negative":
309
        if predictions is not None:
310
            labels = np.where(predictions.argmax(1) == 0)[0]
311
            bags = np.array(data)[:, labels[0:PLOT_SIZE]]
312
        else:
313
            labels = np.where(labels == 0)[0]
314
            bags = np.array(data)[:, labels[0:PLOT_SIZE]]
315

316
    else:
317
        print(f"There is no class {bag_class}")
318
        return
319

320
    print(f"The bag class label is {bag_class}")
321
    for i in range(PLOT_SIZE):
322
        figure = plt.figure(figsize=(8, 8))
323
        print(f"Bag number: {labels[i]}")
324
        for j in range(BAG_SIZE):
325
            image = bags[j][i]
326
            figure.add_subplot(1, BAG_SIZE, j + 1)
327
            plt.grid(False)
328
            if attention_weights is not None:
329
                plt.title(np.around(attention_weights[labels[i]][j], 2))
330
            plt.imshow(image)
331
        plt.show()
332

333

334
# Plot some of validation data bags per class.
335
plot(val_data, val_labels, "positive")
336
plot(val_data, val_labels, "negative")
337

338
"""
339
## Create model
340

341
First we will create some embeddings per instance, invoke the attention operator and then
342
use the softmax function to output the class probabilities.
343
"""
344

345

346
def create_model(instance_shape):
347
    # Extract features from inputs.
348
    inputs, embeddings = [], []
349
    shared_dense_layer_1 = layers.Dense(128, activation="relu")
350
    shared_dense_layer_2 = layers.Dense(64, activation="relu")
351
    for _ in range(BAG_SIZE):
352
        inp = layers.Input(instance_shape)
353
        flatten = layers.Flatten()(inp)
354
        dense_1 = shared_dense_layer_1(flatten)
355
        dense_2 = shared_dense_layer_2(dense_1)
356
        inputs.append(inp)
357
        embeddings.append(dense_2)
358

359
    # Invoke the attention layer.
360
    alpha = MILAttentionLayer(
361
        weight_params_dim=256,
362
        kernel_regularizer=keras.regularizers.L2(0.01),
363
        use_gated=True,
364
        name="alpha",
365
    )(embeddings)
366

367
    # Multiply attention weights with the input layers.
368
    multiply_layers = [
369
        layers.multiply([alpha[i], embeddings[i]]) for i in range(len(alpha))
370
    ]
371

372
    # Concatenate layers.
373
    concat = layers.concatenate(multiply_layers, axis=1)
374

375
    # Classification output node.
376
    output = layers.Dense(2, activation="softmax")(concat)
377

378
    return keras.Model(inputs, output)
379

380

381
"""
382
## Class weights
383

384
Since this kind of problem could simply turn into imbalanced data classification problem,
385
class weighting should be considered.
386

387
Let's say there are 1000 bags. There often could be cases were ~90 % of the bags do not
388
contain any positive label and ~10 % do.
389
Such data can be referred to as **Imbalanced data**.
390

391
Using class weights, the model will tend to give a higher weight to the rare class.
392
"""
393

394

395
def compute_class_weights(labels):
396
    # Count number of positive and negative bags.
397
    negative_count = len(np.where(labels == 0)[0])
398
    positive_count = len(np.where(labels == 1)[0])
399
    total_count = negative_count + positive_count
400

401
    # Build class weight dictionary.
402
    return {
403
        0: (1 / negative_count) * (total_count / 2),
404
        1: (1 / positive_count) * (total_count / 2),
405
    }
406

407

408
"""
409
## Build and train model
410

411
The model is built and trained in this section.
412
"""
413

414

415
def train(train_data, train_labels, val_data, val_labels, model):
416
    # Train model.
417
    # Prepare callbacks.
418
    # Path where to save best weights.
419

420
    # Take the file name from the wrapper.
421
    file_path = "/tmp/best_model.weights.h5"
422

423
    # Initialize model checkpoint callback.
424
    model_checkpoint = keras.callbacks.ModelCheckpoint(
425
        file_path,
426
        monitor="val_loss",
427
        verbose=0,
428
        mode="min",
429
        save_best_only=True,
430
        save_weights_only=True,
431
    )
432

433
    # Initialize early stopping callback.
434
    # The model performance is monitored across the validation data and stops training
435
    # when the generalization error cease to decrease.
436
    early_stopping = keras.callbacks.EarlyStopping(
437
        monitor="val_loss", patience=10, mode="min"
438
    )
439

440
    # Compile model.
441
    model.compile(
442
        optimizer="adam",
443
        loss="sparse_categorical_crossentropy",
444
        metrics=["accuracy"],
445
    )
446

447
    # Fit model.
448
    model.fit(
449
        train_data,
450
        train_labels,
451
        validation_data=(val_data, val_labels),
452
        epochs=20,
453
        class_weight=compute_class_weights(train_labels),
454
        batch_size=1,
455
        callbacks=[early_stopping, model_checkpoint],
456
        verbose=0,
457
    )
458

459
    # Load best weights.
460
    model.load_weights(file_path)
461

462
    return model
463

464

465
# Building model(s).
466
instance_shape = train_data[0][0].shape
467
models = [create_model(instance_shape) for _ in range(ENSEMBLE_AVG_COUNT)]
468

469
# Show single model architecture.
470
print(models[0].summary())
471

472
# Training model(s).
473
trained_models = [
474
    train(train_data, train_labels, val_data, val_labels, model)
475
    for model in tqdm(models)
476
]
477

478
"""
479
## Model evaluation
480

481
The models are now ready for evaluation.
482
With each model we also create an associated intermediate model to get the
483
weights from the attention layer.
484

485
We will compute a prediction for each of our `ENSEMBLE_AVG_COUNT` models, and
486
average them together for our final prediction.
487
"""
488

489

490
def predict(data, labels, trained_models):
491
    # Collect info per model.
492
    models_predictions = []
493
    models_attention_weights = []
494
    models_losses = []
495
    models_accuracies = []
496

497
    for model in trained_models:
498
        # Predict output classes on data.
499
        predictions = model.predict(data)
500
        models_predictions.append(predictions)
501

502
        # Create intermediate model to get MIL attention layer weights.
503
        intermediate_model = keras.Model(model.input, model.get_layer("alpha").output)
504

505
        # Predict MIL attention layer weights.
506
        intermediate_predictions = intermediate_model.predict(data)
507

508
        attention_weights = np.squeeze(np.swapaxes(intermediate_predictions, 1, 0))
509
        models_attention_weights.append(attention_weights)
510

511
        loss, accuracy = model.evaluate(data, labels, verbose=0)
512
        models_losses.append(loss)
513
        models_accuracies.append(accuracy)
514

515
    print(
516
        f"The average loss and accuracy are {np.sum(models_losses, axis=0) / ENSEMBLE_AVG_COUNT:.2f}"
517
        f" and {100 * np.sum(models_accuracies, axis=0) / ENSEMBLE_AVG_COUNT:.2f} % resp."
518
    )
519

520
    return (
521
        np.sum(models_predictions, axis=0) / ENSEMBLE_AVG_COUNT,
522
        np.sum(models_attention_weights, axis=0) / ENSEMBLE_AVG_COUNT,
523
    )
524

525

526
# Evaluate and predict classes and attention scores on validation data.
527
class_predictions, attention_params = predict(val_data, val_labels, trained_models)
528

529
# Plot some results from our validation data.
530
plot(
531
    val_data,
532
    val_labels,
533
    "positive",
534
    predictions=class_predictions,
535
    attention_weights=attention_params,
536
)
537
plot(
538
    val_data,
539
    val_labels,
540
    "negative",
541
    predictions=class_predictions,
542
    attention_weights=attention_params,
543
)
544

545
"""
546
## Conclusion
547

548
From the above plot, you can notice that the weights always sum to 1. In a
549
positively predict bag, the instance which resulted in the positive labeling will have
550
a substantially higher attention score than the rest of the bag. However, in a negatively
551
predicted bag, there are two cases:
552

553
* All instances will have approximately similar scores.
554
* An instance will have relatively higher score (but not as high as of a positive instance).
555
This is because the feature space of this instance is close to that of the positive instance.
556

557
## Remarks
558

559
- If the model is overfit, the weights will be equally distributed for all bags. Hence,
560
the regularization techniques are necessary.
561
- In the paper, the bag sizes can differ from one bag to another. For simplicity, the
562
bag sizes are fixed here.
563
- In order not to rely on the random initial weights of a single model, averaging ensemble
564
methods should be considered.
565
"""
566

567
Product

Resources

Company