CoCalc -- active_learning_review

GitHub Repository: keras-team/keras-io
Path: blob/master/examples/nlp/active_learning_review_classification.py
³⁵⁰⁷ views
1
"""
2
Title: Review Classification using Active Learning
3
Author: [Darshan Deshpande](https://twitter.com/getdarshan)
4
Date created: 2021/10/29
5
Last modified: 2024/05/08
6
Description: Demonstrating the advantages of active learning through review classification.
7
Accelerator: GPU
8
Converted to Keras 3 by: [Sachin Prasad](https://github.com/sachinprasadhs)
9
"""
10

11
"""
12
## Introduction
13

14
With the growth of data-centric Machine Learning, Active Learning has grown in popularity
15
amongst businesses and researchers. Active Learning seeks to progressively
16
train ML models so that the resultant model requires lesser amount of training data to
17
achieve competitive scores.
18

19
The structure of an Active Learning pipeline involves a classifier and an oracle. The
20
oracle is an annotator that cleans, selects, labels the data, and feeds it to the model
21
when required. The oracle is a trained individual or a group of individuals that
22
ensure consistency in labeling of new data.
23

24
The process starts with annotating a small subset of the full dataset and training an
25
initial model. The best model checkpoint is saved and then tested on a balanced test
26
set. The test set must be carefully sampled because the full training process will be
27
dependent on it. Once we have the initial evaluation scores, the oracle is tasked with
28
labeling more samples; the number of data points to be sampled is usually determined by
29
the business requirements. After that, the newly sampled data is added to the training
30
set, and the training procedure repeats. This cycle continues until either an
31
acceptable score is reached or some other business metric is met.
32

33
This tutorial provides a basic demonstration of how Active Learning works by
34
demonstrating a ratio-based (least confidence) sampling strategy that results in lower
35
overall false positive and negative rates when compared to a model trained on the entire
36
dataset. This sampling falls under the domain of *uncertainty sampling*, in which new
37
datasets are sampled based on the uncertainty that the model outputs for the
38
corresponding label. In our example, we compare our model's false positive and false
39
negative rates and annotate the new data based on their ratio.
40

41
Some other sampling techniques include:
42

43
1. [Committee sampling](https://www.researchgate.net/publication/51909346_Committee-Based_Sample_Selection_for_Probabilistic_Classifiers):
44
Using multiple models to vote for the best data points to be sampled
45
2. [Entropy reduction](https://www.researchgate.net/publication/51909346_Committee-Based_Sample_Selection_for_Probabilistic_Classifiers):
46
Sampling according to an entropy threshold, selecting more of the samples that produce the highest entropy score.
47
3. [Minimum margin based sampling](https://arxiv.org/abs/1906.00025v1):
48
Selects data points closest to the decision boundary
49
"""
50

51
"""
52
## Importing required libraries
53
"""
54

55
import os
56

57
os.environ["KERAS_BACKEND"] = "tensorflow"  # @param ["tensorflow", "jax", "torch"]
58
import keras
59
from keras import ops
60
from keras import layers
61
import tensorflow_datasets as tfds
62
import tensorflow as tf
63
import matplotlib.pyplot as plt
64
import re
65
import string
66

67
tfds.disable_progress_bar()
68

69
"""
70
## Loading and preprocessing the data
71

72
We will be using the IMDB reviews dataset for our experiments. This dataset has 50,000
73
reviews in total, including training and testing splits. We will merge these splits and
74
sample our own, balanced training, validation and testing sets.
75
"""
76

77
dataset = tfds.load(
78
    "imdb_reviews",
79
    split="train + test",
80
    as_supervised=True,
81
    batch_size=-1,
82
    shuffle_files=False,
83
)
84
reviews, labels = tfds.as_numpy(dataset)
85

86
print("Total examples:", reviews.shape[0])
87

88
"""
89
Active learning starts with labeling a subset of data.
90
For the ratio sampling technique that we will be using, we will need well-balanced training,
91
validation and testing splits.
92
"""
93

94
val_split = 2500
95
test_split = 2500
96
train_split = 7500
97

98
# Separating the negative and positive samples for manual stratification
99
x_positives, y_positives = reviews[labels == 1], labels[labels == 1]
100
x_negatives, y_negatives = reviews[labels == 0], labels[labels == 0]
101

102
# Creating training, validation and testing splits
103
x_val, y_val = (
104
    tf.concat((x_positives[:val_split], x_negatives[:val_split]), 0),
105
    tf.concat((y_positives[:val_split], y_negatives[:val_split]), 0),
106
)
107
x_test, y_test = (
108
    tf.concat(
109
        (
110
            x_positives[val_split : val_split + test_split],
111
            x_negatives[val_split : val_split + test_split],
112
        ),
113
        0,
114
    ),
115
    tf.concat(
116
        (
117
            y_positives[val_split : val_split + test_split],
118
            y_negatives[val_split : val_split + test_split],
119
        ),
120
        0,
121
    ),
122
)
123
x_train, y_train = (
124
    tf.concat(
125
        (
126
            x_positives[val_split + test_split : val_split + test_split + train_split],
127
            x_negatives[val_split + test_split : val_split + test_split + train_split],
128
        ),
129
        0,
130
    ),
131
    tf.concat(
132
        (
133
            y_positives[val_split + test_split : val_split + test_split + train_split],
134
            y_negatives[val_split + test_split : val_split + test_split + train_split],
135
        ),
136
        0,
137
    ),
138
)
139

140
# Remaining pool of samples are stored separately. These are only labeled as and when required
141
x_pool_positives, y_pool_positives = (
142
    x_positives[val_split + test_split + train_split :],
143
    y_positives[val_split + test_split + train_split :],
144
)
145
x_pool_negatives, y_pool_negatives = (
146
    x_negatives[val_split + test_split + train_split :],
147
    y_negatives[val_split + test_split + train_split :],
148
)
149

150
# Creating TF Datasets for faster prefetching and parallelization
151
train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train))
152
val_dataset = tf.data.Dataset.from_tensor_slices((x_val, y_val))
153
test_dataset = tf.data.Dataset.from_tensor_slices((x_test, y_test))
154

155
pool_negatives = tf.data.Dataset.from_tensor_slices(
156
    (x_pool_negatives, y_pool_negatives)
157
)
158
pool_positives = tf.data.Dataset.from_tensor_slices(
159
    (x_pool_positives, y_pool_positives)
160
)
161

162
print(f"Initial training set size: {len(train_dataset)}")
163
print(f"Validation set size: {len(val_dataset)}")
164
print(f"Testing set size: {len(test_dataset)}")
165
print(f"Unlabeled negative pool: {len(pool_negatives)}")
166
print(f"Unlabeled positive pool: {len(pool_positives)}")
167

168
"""
169
### Fitting the `TextVectorization` layer
170

171
Since we are working with text data, we will need to encode the text strings as vectors which
172
would then be passed through an `Embedding` layer. To make this tokenization process
173
faster, we use the `map()` function with its parallelization functionality.
174
"""
175

176

177
vectorizer = layers.TextVectorization(
178
    3000, standardize="lower_and_strip_punctuation", output_sequence_length=150
179
)
180
# Adapting the dataset
181
vectorizer.adapt(
182
    train_dataset.map(lambda x, y: x, num_parallel_calls=tf.data.AUTOTUNE).batch(256)
183
)
184

185

186
def vectorize_text(text, label):
187
    text = vectorizer(text)
188
    return text, label
189

190

191
train_dataset = train_dataset.map(
192
    vectorize_text, num_parallel_calls=tf.data.AUTOTUNE
193
).prefetch(tf.data.AUTOTUNE)
194
pool_negatives = pool_negatives.map(vectorize_text, num_parallel_calls=tf.data.AUTOTUNE)
195
pool_positives = pool_positives.map(vectorize_text, num_parallel_calls=tf.data.AUTOTUNE)
196

197
val_dataset = val_dataset.batch(256).map(
198
    vectorize_text, num_parallel_calls=tf.data.AUTOTUNE
199
)
200
test_dataset = test_dataset.batch(256).map(
201
    vectorize_text, num_parallel_calls=tf.data.AUTOTUNE
202
)
203

204
"""
205
## Creating Helper Functions
206
"""
207

208

209
# Helper function for merging new history objects with older ones
210
def append_history(losses, val_losses, accuracy, val_accuracy, history):
211
    losses = losses + history.history["loss"]
212
    val_losses = val_losses + history.history["val_loss"]
213
    accuracy = accuracy + history.history["binary_accuracy"]
214
    val_accuracy = val_accuracy + history.history["val_binary_accuracy"]
215
    return losses, val_losses, accuracy, val_accuracy
216

217

218
# Plotter function
219
def plot_history(losses, val_losses, accuracies, val_accuracies):
220
    plt.plot(losses)
221
    plt.plot(val_losses)
222
    plt.legend(["train_loss", "val_loss"])
223
    plt.xlabel("Epochs")
224
    plt.ylabel("Loss")
225
    plt.show()
226

227
    plt.plot(accuracies)
228
    plt.plot(val_accuracies)
229
    plt.legend(["train_accuracy", "val_accuracy"])
230
    plt.xlabel("Epochs")
231
    plt.ylabel("Accuracy")
232
    plt.show()
233

234

235
"""
236
## Creating the Model
237

238
We create a small bidirectional LSTM model. When using Active Learning, you should make sure
239
that the model architecture is capable of overfitting to the initial data.
240
Overfitting gives a strong hint that the model will have enough capacity for
241
future, unseen data.
242
"""
243

244

245
def create_model():
246
    model = keras.models.Sequential(
247
        [
248
            layers.Input(shape=(150,)),
249
            layers.Embedding(input_dim=3000, output_dim=128),
250
            layers.Bidirectional(layers.LSTM(32, return_sequences=True)),
251
            layers.GlobalMaxPool1D(),
252
            layers.Dense(20, activation="relu"),
253
            layers.Dropout(0.5),
254
            layers.Dense(1, activation="sigmoid"),
255
        ]
256
    )
257
    model.summary()
258
    return model
259

260

261
"""
262
## Training on the entire dataset
263

264
To show the effectiveness of Active Learning, we will first train the model on the entire
265
dataset containing 40,000 labeled samples. This model will be used for comparison later.
266
"""
267

268

269
def train_full_model(full_train_dataset, val_dataset, test_dataset):
270
    model = create_model()
271
    model.compile(
272
        loss="binary_crossentropy",
273
        optimizer="rmsprop",
274
        metrics=[
275
            keras.metrics.BinaryAccuracy(),
276
            keras.metrics.FalseNegatives(),
277
            keras.metrics.FalsePositives(),
278
        ],
279
    )
280

281
    # We will save the best model at every epoch and load the best one for evaluation on the test set
282
    history = model.fit(
283
        full_train_dataset.batch(256),
284
        epochs=20,
285
        validation_data=val_dataset,
286
        callbacks=[
287
            keras.callbacks.EarlyStopping(patience=4, verbose=1),
288
            keras.callbacks.ModelCheckpoint(
289
                "FullModelCheckpoint.keras", verbose=1, save_best_only=True
290
            ),
291
        ],
292
    )
293

294
    # Plot history
295
    plot_history(
296
        history.history["loss"],
297
        history.history["val_loss"],
298
        history.history["binary_accuracy"],
299
        history.history["val_binary_accuracy"],
300
    )
301

302
    # Loading the best checkpoint
303
    model = keras.models.load_model("FullModelCheckpoint.keras")
304

305
    print("-" * 100)
306
    print(
307
        "Test set evaluation: ",
308
        model.evaluate(test_dataset, verbose=0, return_dict=True),
309
    )
310
    print("-" * 100)
311
    return model
312

313

314
# Sampling the full train dataset to train on
315
full_train_dataset = (
316
    train_dataset.concatenate(pool_positives)
317
    .concatenate(pool_negatives)
318
    .cache()
319
    .shuffle(20000)
320
)
321

322
# Training the full model
323
full_dataset_model = train_full_model(full_train_dataset, val_dataset, test_dataset)
324

325
"""
326
## Training via Active Learning
327

328
The general process we follow when performing Active Learning is demonstrated below:
329

330
![Active Learning](https://i.imgur.com/dmNKusp.png)
331

332
The pipeline can be summarized in five parts:
333

334
1. Sample and annotate a small, balanced training dataset
335
2. Train the model on this small subset
336
3. Evaluate the model on a balanced testing set
337
4. If the model satisfies the business criteria, deploy it in a real time setting
338
5. If it doesn't pass the criteria, sample a few more samples according to the ratio of
339
false positives and negatives, add them to the training set and repeat from step 2 till
340
the model passes the tests or till all available data is exhausted.
341

342
For the code below, we will perform sampling using the following formula:<br/>
343

344
![Ratio Sampling](https://i.imgur.com/LyZEiZL.png)
345

346
Active Learning techniques use callbacks extensively for progress tracking. We will be
347
using model checkpointing and early stopping for this example. The `patience` parameter
348
for Early Stopping can help minimize overfitting and the time required. We have set it
349
`patience=4` for now but since the model is robust, we can increase the patience level if
350
desired.
351

352
Note: We are not loading the checkpoint after the first training iteration. In my
353
experience working on Active Learning techniques, this helps the model probe the
354
newly formed loss landscape. Even if the model fails to improve in the second iteration,
355
we will still gain insight about the possible future false positive and negative rates.
356
This will help us sample a better set in the next iteration where the model will have a
357
greater chance to improve.
358
"""
359

360

361
def train_active_learning_models(
362
    train_dataset,
363
    pool_negatives,
364
    pool_positives,
365
    val_dataset,
366
    test_dataset,
367
    num_iterations=3,
368
    sampling_size=5000,
369
):
370

371
    # Creating lists for storing metrics
372
    losses, val_losses, accuracies, val_accuracies = [], [], [], []
373

374
    model = create_model()
375
    # We will monitor the false positives and false negatives predicted by our model
376
    # These will decide the subsequent sampling ratio for every Active Learning loop
377
    model.compile(
378
        loss="binary_crossentropy",
379
        optimizer="rmsprop",
380
        metrics=[
381
            keras.metrics.BinaryAccuracy(),
382
            keras.metrics.FalseNegatives(),
383
            keras.metrics.FalsePositives(),
384
        ],
385
    )
386

387
    # Defining checkpoints.
388
    # The checkpoint callback is reused throughout the training since it only saves the best overall model.
389
    checkpoint = keras.callbacks.ModelCheckpoint(
390
        "AL_Model.keras", save_best_only=True, verbose=1
391
    )
392
    # Here, patience is set to 4. This can be set higher if desired.
393
    early_stopping = keras.callbacks.EarlyStopping(patience=4, verbose=1)
394

395
    print(f"Starting to train with {len(train_dataset)} samples")
396
    # Initial fit with a small subset of the training set
397
    history = model.fit(
398
        train_dataset.cache().shuffle(20000).batch(256),
399
        epochs=20,
400
        validation_data=val_dataset,
401
        callbacks=[checkpoint, early_stopping],
402
    )
403

404
    # Appending history
405
    losses, val_losses, accuracies, val_accuracies = append_history(
406
        losses, val_losses, accuracies, val_accuracies, history
407
    )
408

409
    for iteration in range(num_iterations):
410
        # Getting predictions from previously trained model
411
        predictions = model.predict(test_dataset)
412

413
        # Generating labels from the output probabilities
414
        rounded = ops.where(ops.greater(predictions, 0.5), 1, 0)
415

416
        # Evaluating the number of zeros and ones incorrrectly classified
417
        _, _, false_negatives, false_positives = model.evaluate(test_dataset, verbose=0)
418

419
        print("-" * 100)
420
        print(
421
            f"Number of zeros incorrectly classified: {false_negatives}, Number of ones incorrectly classified: {false_positives}"
422
        )
423

424
        # This technique of Active Learning demonstrates ratio based sampling where
425
        # Number of ones/zeros to sample = Number of ones/zeros incorrectly classified / Total incorrectly classified
426
        if false_negatives != 0 and false_positives != 0:
427
            total = false_negatives + false_positives
428
            sample_ratio_ones, sample_ratio_zeros = (
429
                false_positives / total,
430
                false_negatives / total,
431
            )
432
        # In the case where all samples are correctly predicted, we can sample both classes equally
433
        else:
434
            sample_ratio_ones, sample_ratio_zeros = 0.5, 0.5
435

436
        print(
437
            f"Sample ratio for positives: {sample_ratio_ones}, Sample ratio for negatives:{sample_ratio_zeros}"
438
        )
439

440
        # Sample the required number of ones and zeros
441
        sampled_dataset = pool_negatives.take(
442
            int(sample_ratio_zeros * sampling_size)
443
        ).concatenate(pool_positives.take(int(sample_ratio_ones * sampling_size)))
444

445
        # Skip the sampled data points to avoid repetition of sample
446
        pool_negatives = pool_negatives.skip(int(sample_ratio_zeros * sampling_size))
447
        pool_positives = pool_positives.skip(int(sample_ratio_ones * sampling_size))
448

449
        # Concatenating the train_dataset with the sampled_dataset
450
        train_dataset = train_dataset.concatenate(sampled_dataset).prefetch(
451
            tf.data.AUTOTUNE
452
        )
453

454
        print(f"Starting training with {len(train_dataset)} samples")
455
        print("-" * 100)
456

457
        # We recompile the model to reset the optimizer states and retrain the model
458
        model.compile(
459
            loss="binary_crossentropy",
460
            optimizer="rmsprop",
461
            metrics=[
462
                keras.metrics.BinaryAccuracy(),
463
                keras.metrics.FalseNegatives(),
464
                keras.metrics.FalsePositives(),
465
            ],
466
        )
467
        history = model.fit(
468
            train_dataset.cache().shuffle(20000).batch(256),
469
            validation_data=val_dataset,
470
            epochs=20,
471
            callbacks=[
472
                checkpoint,
473
                keras.callbacks.EarlyStopping(patience=4, verbose=1),
474
            ],
475
        )
476

477
        # Appending the history
478
        losses, val_losses, accuracies, val_accuracies = append_history(
479
            losses, val_losses, accuracies, val_accuracies, history
480
        )
481

482
        # Loading the best model from this training loop
483
        model = keras.models.load_model("AL_Model.keras")
484

485
    # Plotting the overall history and evaluating the final model
486
    plot_history(losses, val_losses, accuracies, val_accuracies)
487
    print("-" * 100)
488
    print(
489
        "Test set evaluation: ",
490
        model.evaluate(test_dataset, verbose=0, return_dict=True),
491
    )
492
    print("-" * 100)
493

494
    return model
495

496

497
active_learning_model = train_active_learning_models(
498
    train_dataset, pool_negatives, pool_positives, val_dataset, test_dataset
499
)
500

501
"""
502
## Conclusion
503

504
Active Learning is a growing area of research. This example demonstrates the cost-efficiency
505
benefits of using Active Learning, as it eliminates the need to annotate large amounts of
506
data, saving resources.
507

508
The following are some noteworthy observations from this example:
509

510
1. We only require 30,000 samples to reach the same (if not better) scores as the model
511
trained on the full dataset. This means that in a real life setting, we save the effort
512
required for annotating 10,000 images!
513
2. The number of false negatives and false positives are well balanced at the end of the
514
training as compared to the skewed ratio obtained from the full training. This makes the
515
model slightly more useful in real life scenarios where both the labels hold equal
516
importance.
517

518
For further reading about the types of sampling ratios, training techniques or available
519
open source libraries/implementations, you can refer to the resources below:
520

521
1. [Active Learning Literature Survey](http://burrsettles.com/pub/settles.activelearning.pdf) (Burr Settles, 2010).
522
2. [modAL](https://github.com/modAL-python/modAL): A Modular Active Learning framework.
523
3. Google's unofficial [Active Learning playground](https://github.com/google/active-learning).
524
"""
525

526
Product

Resources

Company