# GitHub repository: keras-team/keras-io (blob/master/examples/nlp/active_learning_review_classification.py)
"""
Title: Review Classification using Active Learning
Author: [Darshan Deshpande](https://twitter.com/getdarshan)
Date created: 2021/10/29
Last modified: 2024/05/08
Description: Demonstrating the advantages of active learning through review classification.
Accelerator: GPU
Converted to Keras 3 by: [Sachin Prasad](https://github.com/sachinprasadhs)
"""

"""
## Introduction

With the growth of data-centric Machine Learning, Active Learning has grown in popularity
amongst businesses and researchers. Active Learning seeks to progressively
train ML models so that the resultant model requires less training data to
achieve competitive scores.

The structure of an Active Learning pipeline involves a classifier and an oracle. The
oracle is an annotator that cleans, selects, and labels the data, and feeds it to the model
when required. The oracle is a trained individual or a group of individuals who
ensure consistency in the labeling of new data.

The process starts with annotating a small subset of the full dataset and training an
initial model. The best model checkpoint is saved and then tested on a balanced test
set. The test set must be carefully sampled because the full training process will be
dependent on it. Once we have the initial evaluation scores, the oracle is tasked with
labeling more samples; the number of data points to be sampled is usually determined by
the business requirements. After that, the newly sampled data is added to the training
set, and the training procedure repeats. This cycle continues until either an
acceptable score is reached or some other business metric is met.

This tutorial provides a basic demonstration of how Active Learning works by
demonstrating a ratio-based (least confidence) sampling strategy that results in lower
overall false positive and negative rates when compared to a model trained on the entire
dataset. This sampling falls under the domain of *uncertainty sampling*, in which new
data points are sampled based on the uncertainty that the model outputs for the
corresponding label. In our example, we compare our model's false positive and false
negative rates and annotate the new data based on their ratio.

Some other sampling techniques include:

1. [Committee sampling](https://www.researchgate.net/publication/51909346_Committee-Based_Sample_Selection_for_Probabilistic_Classifiers):
Using multiple models to vote for the best data points to be sampled
2. [Entropy reduction](https://www.researchgate.net/publication/51909346_Committee-Based_Sample_Selection_for_Probabilistic_Classifiers):
Sampling according to an entropy threshold, selecting more of the samples that produce the highest entropy score.
3. [Minimum margin based sampling](https://arxiv.org/abs/1906.00025v1):
Selects data points closest to the decision boundary
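
As a rough illustration of the single-model criteria listed above, the sketch below scores
a binary classifier's output probability `p` by least confidence, margin and entropy. This
is not the ratio-based strategy used in this tutorial; the function names and the toy
probabilities are assumptions made only for this example.

```python
import math


def least_confidence(p):
    # 1 minus the probability of the predicted class; higher means less certain
    return 1.0 - max(p, 1.0 - p)


def margin(p):
    # Gap between the two class probabilities; smaller means closer to the boundary
    return abs(p - (1.0 - p))


def entropy(p):
    # Binary entropy in bits; terms with zero probability contribute nothing
    return -sum(q * math.log2(q) for q in (p, 1.0 - p) if q > 0.0)


# Hypothetical sigmoid outputs for four unlabeled reviews
pool_probabilities = [0.97, 0.52, 0.10, 0.45]
indices = range(len(pool_probabilities))

# Rank pool indices from most to least uncertain under each criterion
by_least_confidence = sorted(
    indices, key=lambda i: -least_confidence(pool_probabilities[i])
)
by_margin = sorted(indices, key=lambda i: margin(pool_probabilities[i]))
by_entropy = sorted(indices, key=lambda i: -entropy(pool_probabilities[i]))

print(by_least_confidence, by_margin, by_entropy)
```

For a binary classifier, all three scores depend only on how far `p` is from 0.5, so they
produce the same ranking. They start to differ once there are more than two classes.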
"""

"""
## Importing required libraries
"""

import os

os.environ["KERAS_BACKEND"] = "tensorflow"  # @param ["tensorflow", "jax", "torch"]
import keras
from keras import ops
from keras import layers
import tensorflow_datasets as tfds
import tensorflow as tf
import matplotlib.pyplot as plt
import re
import string

tfds.disable_progress_bar()

"""
## Loading and preprocessing the data

We will be using the IMDB reviews dataset for our experiments. This dataset has 50,000
reviews in total, including training and testing splits. We will merge these splits and
sample our own, balanced training, validation and testing sets.
"""

dataset = tfds.load(
    "imdb_reviews",
    split="train + test",
    as_supervised=True,
    batch_size=-1,
    shuffle_files=False,
)
reviews, labels = tfds.as_numpy(dataset)

print("Total examples:", reviews.shape[0])

"""
Active learning starts with labeling a subset of data.
For the ratio sampling technique that we will be using, we will need well-balanced training,
validation and testing splits.
"""

val_split = 2500
test_split = 2500
train_split = 7500

# Separating the negative and positive samples for manual stratification
x_positives, y_positives = reviews[labels == 1], labels[labels == 1]
x_negatives, y_negatives = reviews[labels == 0], labels[labels == 0]

# Creating training, validation and testing splits
x_val, y_val = (
    tf.concat((x_positives[:val_split], x_negatives[:val_split]), 0),
    tf.concat((y_positives[:val_split], y_negatives[:val_split]), 0),
)
x_test, y_test = (
    tf.concat(
        (
            x_positives[val_split : val_split + test_split],
            x_negatives[val_split : val_split + test_split],
        ),
        0,
    ),
    tf.concat(
        (
            y_positives[val_split : val_split + test_split],
            y_negatives[val_split : val_split + test_split],
        ),
        0,
    ),
)
x_train, y_train = (
    tf.concat(
        (
            x_positives[val_split + test_split : val_split + test_split + train_split],
            x_negatives[val_split + test_split : val_split + test_split + train_split],
        ),
        0,
    ),
    tf.concat(
        (
            y_positives[val_split + test_split : val_split + test_split + train_split],
            y_negatives[val_split + test_split : val_split + test_split + train_split],
        ),
        0,
    ),
)

# The remaining pool of samples is stored separately. These are only labeled as and when required
x_pool_positives, y_pool_positives = (
    x_positives[val_split + test_split + train_split :],
    y_positives[val_split + test_split + train_split :],
)
x_pool_negatives, y_pool_negatives = (
    x_negatives[val_split + test_split + train_split :],
    y_negatives[val_split + test_split + train_split :],
)

# Creating TF Datasets for faster prefetching and parallelization
train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train))
val_dataset = tf.data.Dataset.from_tensor_slices((x_val, y_val))
test_dataset = tf.data.Dataset.from_tensor_slices((x_test, y_test))

pool_negatives = tf.data.Dataset.from_tensor_slices(
    (x_pool_negatives, y_pool_negatives)
)
pool_positives = tf.data.Dataset.from_tensor_slices(
    (x_pool_positives, y_pool_positives)
)

print(f"Initial training set size: {len(train_dataset)}")
print(f"Validation set size: {len(val_dataset)}")
print(f"Testing set size: {len(test_dataset)}")
print(f"Unlabeled negative pool: {len(pool_negatives)}")
print(f"Unlabeled positive pool: {len(pool_positives)}")
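
"""
The sizes printed above can be checked by hand. The sketch below redoes the arithmetic
without loading any data, assuming the merged IMDB dataset contains 25,000 positive and
25,000 negative reviews. The variable names are introduced only for this illustration.

```python
# Assumption: 25,000 reviews per class in the merged IMDB train and test splits
per_class_total = 25_000
val_split, test_split, train_split = 2500, 2500, 7500

# Each split takes the same number of reviews from both classes
val_size = 2 * val_split
test_size = 2 * test_split
train_size = 2 * train_split

# Whatever is left in each class forms the unlabeled pool
pool_per_class = per_class_total - (val_split + test_split + train_split)

# The full-data baseline trains on the initial training set plus both pools
full_train_size = train_size + 2 * pool_per_class

print(f"Initial training set size: {train_size}")
print(f"Validation set size: {val_size}")
print(f"Testing set size: {test_size}")
print(f"Unlabeled pool per class: {pool_per_class}")
print(f"Full training set size: {full_train_size}")
```

Under that assumption, the initial training set holds 15,000 reviews, each pool holds
12,500, and the full training set used for the baseline holds 40,000.
"""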

"""
### Fitting the `TextVectorization` layer

Since we are working with text data, we need to encode the text strings as integer vectors, which
are then passed through an `Embedding` layer. To make this tokenization process
faster, we use the `map()` function with its parallelization functionality.
"""


vectorizer = layers.TextVectorization(
    3000, standardize="lower_and_strip_punctuation", output_sequence_length=150
)
# Adapting the dataset
vectorizer.adapt(
    train_dataset.map(lambda x, y: x, num_parallel_calls=tf.data.AUTOTUNE).batch(256)
)


def vectorize_text(text, label):
    text = vectorizer(text)
    return text, label


train_dataset = train_dataset.map(
    vectorize_text, num_parallel_calls=tf.data.AUTOTUNE
).prefetch(tf.data.AUTOTUNE)
pool_negatives = pool_negatives.map(vectorize_text, num_parallel_calls=tf.data.AUTOTUNE)
pool_positives = pool_positives.map(vectorize_text, num_parallel_calls=tf.data.AUTOTUNE)

val_dataset = val_dataset.batch(256).map(
    vectorize_text, num_parallel_calls=tf.data.AUTOTUNE
)
test_dataset = test_dataset.batch(256).map(
    vectorize_text, num_parallel_calls=tf.data.AUTOTUNE
)
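
"""
To make the vectorization step more concrete, here is a simplified pure-Python sketch of
what an integer-mode text vectorizer does: standardize, split on whitespace, map words to
indices, then pad or truncate. It is only an approximation of `TextVectorization`; the
helper names and toy reviews are assumptions, and the real layer's punctuation handling
and tie-breaking may differ.

```python
import string
from collections import Counter


def standardize(text):
    # Lowercase and drop ASCII punctuation, similar in spirit to
    # standardize="lower_and_strip_punctuation"
    return text.lower().translate(str.maketrans("", "", string.punctuation))


def build_vocabulary(texts, max_tokens):
    # Index 0 is reserved for padding and index 1 for out-of-vocabulary words,
    # so only max_tokens - 2 real words fit in the vocabulary
    counts = Counter(word for text in texts for word in standardize(text).split())
    most_common = [word for word, _ in counts.most_common(max_tokens - 2)]
    return {word: index + 2 for index, word in enumerate(most_common)}


def vectorize(text, vocabulary, sequence_length):
    # Unknown words map to 1; the result is truncated or zero-padded to a fixed length
    ids = [vocabulary.get(word, 1) for word in standardize(text).split()]
    ids = ids[:sequence_length]
    return ids + [0] * (sequence_length - len(ids))


toy_reviews = ["A great, great movie!", "Not a great plot."]
toy_vocabulary = build_vocabulary(toy_reviews, max_tokens=5)
toy_vector = vectorize("A great film", toy_vocabulary, sequence_length=6)
print(toy_vector)
```

In the tutorial's configuration the same idea applies with a vocabulary size of 3000 and
an output length of 150, which matches the `Embedding` layer's `input_dim` and the
model's input shape.
"""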

"""
## Creating Helper Functions
"""


# Helper function for merging new history objects with older ones
def append_history(losses, val_losses, accuracy, val_accuracy, history):
    losses = losses + history.history["loss"]
    val_losses = val_losses + history.history["val_loss"]
    accuracy = accuracy + history.history["binary_accuracy"]
    val_accuracy = val_accuracy + history.history["val_binary_accuracy"]
    return losses, val_losses, accuracy, val_accuracy


# Plotter function
def plot_history(losses, val_losses, accuracies, val_accuracies):
    plt.plot(losses)
    plt.plot(val_losses)
    plt.legend(["train_loss", "val_loss"])
    plt.xlabel("Epochs")
    plt.ylabel("Loss")
    plt.show()

    plt.plot(accuracies)
    plt.plot(val_accuracies)
    plt.legend(["train_accuracy", "val_accuracy"])
    plt.xlabel("Epochs")
    plt.ylabel("Accuracy")
    plt.show()


"""
## Creating the Model

We create a small bidirectional LSTM model. When using Active Learning, you should make sure
that the model architecture is capable of overfitting to the initial data.
Overfitting gives a strong hint that the model will have enough capacity for
future, unseen data.
"""


def create_model():
    model = keras.models.Sequential(
        [
            layers.Input(shape=(150,)),
            layers.Embedding(input_dim=3000, output_dim=128),
            layers.Bidirectional(layers.LSTM(32, return_sequences=True)),
            layers.GlobalMaxPool1D(),
            layers.Dense(20, activation="relu"),
            layers.Dropout(0.5),
            layers.Dense(1, activation="sigmoid"),
        ]
    )
    model.summary()
    return model


"""
## Training on the entire dataset

To show the effectiveness of Active Learning, we will first train the model on the entire
dataset containing 40,000 labeled samples. This model will be used for comparison later.
"""


def train_full_model(full_train_dataset, val_dataset, test_dataset):
    model = create_model()
    model.compile(
        loss="binary_crossentropy",
        optimizer="rmsprop",
        metrics=[
            keras.metrics.BinaryAccuracy(),
            keras.metrics.FalseNegatives(),
            keras.metrics.FalsePositives(),
        ],
    )

    # We will save the best model at every epoch and load the best one for evaluation on the test set
    history = model.fit(
        full_train_dataset.batch(256),
        epochs=20,
        validation_data=val_dataset,
        callbacks=[
            keras.callbacks.EarlyStopping(patience=4, verbose=1),
            keras.callbacks.ModelCheckpoint(
                "FullModelCheckpoint.keras", verbose=1, save_best_only=True
            ),
        ],
    )

    # Plot history
    plot_history(
        history.history["loss"],
        history.history["val_loss"],
        history.history["binary_accuracy"],
        history.history["val_binary_accuracy"],
    )

    # Loading the best checkpoint
    model = keras.models.load_model("FullModelCheckpoint.keras")

    print("-" * 100)
    print(
        "Test set evaluation: ",
        model.evaluate(test_dataset, verbose=0, return_dict=True),
    )
    print("-" * 100)
    return model


# Sampling the full train dataset to train on
full_train_dataset = (
    train_dataset.concatenate(pool_positives)
    .concatenate(pool_negatives)
    .cache()
    .shuffle(20000)
)

# Training the full model
full_dataset_model = train_full_model(full_train_dataset, val_dataset, test_dataset)
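
"""
The training function above pairs `EarlyStopping(patience=4)` with a best-only
`ModelCheckpoint`, both of which monitor the validation loss by default. The sketch below
is a simplified model of that interaction, not the Keras implementation; the function name
and the toy loss values are assumptions made for illustration.

```python
def early_stopping_epoch(val_losses, patience):
    # Returns the zero-based epoch at which training would stop,
    # or None if the patience budget is never exhausted
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            # Improvement: remember it and reset the counter
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Hypothetical validation losses: the best value before stopping occurs at epoch 2
toy_val_losses = [0.60, 0.45, 0.40, 0.42, 0.41, 0.43, 0.44, 0.39]
stop_epoch = early_stopping_epoch(toy_val_losses, patience=4)

# A best-only checkpoint would hold the weights from the lowest loss seen so far
best_epoch = min(range(stop_epoch + 1), key=lambda e: toy_val_losses[e])
print(f"Stopped at epoch {stop_epoch}, best checkpoint from epoch {best_epoch}")
```

In this toy run, training stops before the later, lower loss is ever reached. That is the
trade-off a larger `patience` value addresses, at the cost of more training time.
"""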

"""
## Training via Active Learning

The general process we follow when performing Active Learning is demonstrated below:

![Active Learning](https://i.imgur.com/dmNKusp.png)

The pipeline can be summarized in five parts:

1. Sample and annotate a small, balanced training dataset
2. Train the model on this small subset
3. Evaluate the model on a balanced testing set
4. If the model satisfies the business criteria, deploy it in a real-time setting
5. If it doesn't pass the criteria, draw a few more samples according to the ratio of
false positives and negatives, add them to the training set and repeat from step 2 until
the model passes the tests or until all available data is exhausted.

For the code below, we will perform sampling using the following formula:<br/>

![Ratio Sampling](https://i.imgur.com/LyZEiZL.png)

Active Learning techniques use callbacks extensively for progress tracking. We will be
using model checkpointing and early stopping for this example. The `patience` parameter
for Early Stopping can help minimize overfitting and the time required. We have set
`patience=4` for now, but since the model is robust, we can increase the patience level if
desired.

Note: We are not loading the checkpoint after the first training iteration. In my
experience working on Active Learning techniques, this helps the model probe the
newly formed loss landscape. Even if the model fails to improve in the second iteration,
we will still gain insight about the possible future false positive and negative rates.
This will help us sample a better set in the next iteration where the model will have a
greater chance to improve.
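
As a minimal sketch of the ratio computation, the standalone helper below mirrors the
arithmetic used inside the training loop that follows. The name `ratio_sample_sizes` and
the error counts are hypothetical and exist only for this illustration.

```python
def ratio_sample_sizes(false_negatives, false_positives, sampling_size):
    # Returns (num_positives, num_negatives) to draw from the unlabeled pools
    if false_negatives != 0 and false_positives != 0:
        total = false_negatives + false_positives
        ratio_ones = false_positives / total
        ratio_zeros = false_negatives / total
    else:
        # Fall back to an even split when either error count is zero
        ratio_ones, ratio_zeros = 0.5, 0.5
    # int() truncates, so the two counts can sum to slightly less than sampling_size
    return int(ratio_ones * sampling_size), int(ratio_zeros * sampling_size)


# Hypothetical error counts from one evaluation round
num_positives, num_negatives = ratio_sample_sizes(
    false_negatives=250, false_positives=750, sampling_size=5000
)
print(f"Positives to label: {num_positives}, negatives to label: {num_negatives}")
```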
"""


def train_active_learning_models(
    train_dataset,
    pool_negatives,
    pool_positives,
    val_dataset,
    test_dataset,
    num_iterations=3,
    sampling_size=5000,
):

    # Creating lists for storing metrics
    losses, val_losses, accuracies, val_accuracies = [], [], [], []

    model = create_model()
    # We will monitor the false positives and false negatives predicted by our model
    # These will decide the subsequent sampling ratio for every Active Learning loop
    model.compile(
        loss="binary_crossentropy",
        optimizer="rmsprop",
        metrics=[
            keras.metrics.BinaryAccuracy(),
            keras.metrics.FalseNegatives(),
            keras.metrics.FalsePositives(),
        ],
    )

    # Defining checkpoints.
    # The checkpoint callback is reused throughout the training since it only saves the best overall model.
    checkpoint = keras.callbacks.ModelCheckpoint(
        "AL_Model.keras", save_best_only=True, verbose=1
    )
    # Here, patience is set to 4. This can be set higher if desired.
    early_stopping = keras.callbacks.EarlyStopping(patience=4, verbose=1)

    print(f"Starting to train with {len(train_dataset)} samples")
    # Initial fit with a small subset of the training set
    history = model.fit(
        train_dataset.cache().shuffle(20000).batch(256),
        epochs=20,
        validation_data=val_dataset,
        callbacks=[checkpoint, early_stopping],
    )

    # Appending history
    losses, val_losses, accuracies, val_accuracies = append_history(
        losses, val_losses, accuracies, val_accuracies, history
    )

    for iteration in range(num_iterations):
        # Getting predictions from previously trained model
        predictions = model.predict(test_dataset)

        # Generating labels from the output probabilities
        rounded = ops.where(ops.greater(predictions, 0.5), 1, 0)

        # Evaluating the number of false negatives (ones predicted as zeros)
        # and false positives (zeros predicted as ones)
        _, _, false_negatives, false_positives = model.evaluate(test_dataset, verbose=0)

        print("-" * 100)
        print(
            f"Number of ones incorrectly classified (false negatives): {false_negatives}, Number of zeros incorrectly classified (false positives): {false_positives}"
        )

        # This technique of Active Learning demonstrates ratio based sampling where
        # the share of ones to sample is false_positives / total and
        # the share of zeros to sample is false_negatives / total
        if false_negatives != 0 and false_positives != 0:
            total = false_negatives + false_positives
            sample_ratio_ones, sample_ratio_zeros = (
                false_positives / total,
                false_negatives / total,
            )
        # If either error count is zero (for example, when all samples are correctly
        # predicted), we sample both classes equally
        else:
            sample_ratio_ones, sample_ratio_zeros = 0.5, 0.5

        print(
            f"Sample ratio for positives: {sample_ratio_ones}, Sample ratio for negatives: {sample_ratio_zeros}"
        )

        # Sample the required number of ones and zeros
        sampled_dataset = pool_negatives.take(
            int(sample_ratio_zeros * sampling_size)
        ).concatenate(pool_positives.take(int(sample_ratio_ones * sampling_size)))

        # Skip the sampled data points to avoid repetition of sample
        pool_negatives = pool_negatives.skip(int(sample_ratio_zeros * sampling_size))
        pool_positives = pool_positives.skip(int(sample_ratio_ones * sampling_size))

        # Concatenating the train_dataset with the sampled_dataset
        train_dataset = train_dataset.concatenate(sampled_dataset).prefetch(
            tf.data.AUTOTUNE
        )

        print(f"Starting training with {len(train_dataset)} samples")
        print("-" * 100)

        # We recompile the model to reset the optimizer states and retrain the model
        model.compile(
            loss="binary_crossentropy",
            optimizer="rmsprop",
            metrics=[
                keras.metrics.BinaryAccuracy(),
                keras.metrics.FalseNegatives(),
                keras.metrics.FalsePositives(),
            ],
        )
        history = model.fit(
            train_dataset.cache().shuffle(20000).batch(256),
            validation_data=val_dataset,
            epochs=20,
            callbacks=[
                checkpoint,
                keras.callbacks.EarlyStopping(patience=4, verbose=1),
            ],
        )

        # Appending the history
        losses, val_losses, accuracies, val_accuracies = append_history(
            losses, val_losses, accuracies, val_accuracies, history
        )

        # Loading the best model from this training loop
        model = keras.models.load_model("AL_Model.keras")

    # Plotting the overall history and evaluating the final model
    plot_history(losses, val_losses, accuracies, val_accuracies)
    print("-" * 100)
    print(
        "Test set evaluation: ",
        model.evaluate(test_dataset, verbose=0, return_dict=True),
    )
    print("-" * 100)

    return model


active_learning_model = train_active_learning_models(
    train_dataset, pool_negatives, pool_positives, val_dataset, test_dataset
)

"""
## Conclusion

Active Learning is a growing area of research. This example demonstrates the cost-efficiency
benefits of using Active Learning, as it eliminates the need to annotate large amounts of
data, saving resources.

The following are some noteworthy observations from this example:

1. We only require 30,000 samples to reach the same (if not better) scores as the model
trained on the full dataset. This means that in a real life setting, we save the effort
required for annotating 10,000 reviews!
2. The numbers of false negatives and false positives are well balanced at the end of
training, compared to the skewed ratio obtained from the full training. This makes the
model slightly more useful in real life scenarios where both the labels hold equal
importance.
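
The sample counts in the first observation follow from the defaults used in this example.
A quick check of the arithmetic, with variable names introduced only for this sketch:

```python
initial_labeled = 2 * 7500  # balanced starting training set
num_iterations = 3  # default in train_active_learning_models
sampling_size = 5000  # upper bound on newly labeled samples per iteration
full_dataset = 40_000  # size of the full training set used for comparison

active_learning_labeled = initial_labeled + num_iterations * sampling_size
saved = full_dataset - active_learning_labeled

print(f"Labeled with Active Learning: {active_learning_labeled}")
print(f"Annotations saved: {saved} ({saved / full_dataset:.0%})")
```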

For further reading about the types of sampling ratios, training techniques or available
open source libraries/implementations, you can refer to the resources below:

1. [Active Learning Literature Survey](http://burrsettles.com/pub/settles.activelearning.pdf) (Burr Settles, 2010).
2. [modAL](https://github.com/modAL-python/modAL): A Modular Active Learning framework.
3. Google's unofficial [Active Learning playground](https://github.com/google/active-learning).
"""