GitHub Repository: pytorch/tutorials
Path: blob/main/beginner_source/knowledge_distillation_tutorial.py
# -*- coding: utf-8 -*-
"""
Knowledge Distillation Tutorial
===============================
**Author**: `Alexandros Chariton <https://github.com/AlexandrosChrtn>`_
"""

######################################################################
# Knowledge distillation is a technique that enables knowledge transfer from large, computationally expensive
# models to smaller ones without a significant loss in accuracy. This allows for deployment on less powerful
# hardware, making evaluation faster and more efficient.
#
# In this tutorial, we will run a number of experiments focused on improving the accuracy of a
# lightweight neural network, using a more powerful network as a teacher.
# The computational cost and the speed of the lightweight network will remain unaffected:
# our intervention only changes its weights, not its forward pass.
# Applications of this technology can be found in devices such as drones or mobile phones.
# In this tutorial, we do not use any external packages as everything we need is available in ``torch`` and
# ``torchvision``.
#
# In this tutorial, you will learn:
#
# - How to modify model classes to extract hidden representations and use them for further calculations
# - How to modify regular train loops in PyTorch to include additional losses on top of, for example, cross-entropy for classification
# - How to improve the performance of lightweight models by using more complex models as teachers
#
# Prerequisites
# ~~~~~~~~~~~~~
#
# * 1 GPU, 4GB of memory
# * PyTorch v2.0 or later
# * CIFAR-10 dataset (downloaded by the script and saved in a directory called ``./data``)

import torch
import torch.nn as nn
import torch.optim as optim
import torchvision.transforms as transforms
import torchvision.datasets as datasets

# Check if GPU is available, and if not, use the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

######################################################################
# Loading CIFAR-10
# ----------------
# CIFAR-10 is a popular image dataset with ten classes. Our objective is to predict one of the following classes for each input image.
#
# .. figure:: /../_static/img/cifar10.png
#    :align: center
#
#    Example of CIFAR-10 images
#
# The input images are RGB, so they have 3 channels and are 32x32 pixels. Each image is therefore described by 3 x 32 x 32 = 3072 numbers ranging from 0 to 255,
# which ``transforms.ToTensor()`` rescales to the range [0, 1].
# A common practice in neural networks is to normalize the input, which is done for multiple reasons,
# including avoiding saturation in commonly used activation functions and increasing numerical stability.
# Our normalization process consists of subtracting the mean and dividing by the standard deviation along each channel.
# The values ``mean=[0.485, 0.456, 0.406]`` and ``std=[0.229, 0.224, 0.225]`` were computed ahead of time.
# Note that these particular numbers are the per-channel statistics commonly used for ImageNet;
# they are close to, but not identical to, the statistics of the CIFAR-10 training set.
# Notice how we use the same values for the test set as well, without recomputing the mean and standard deviation from scratch.
# This is because the network is trained on features produced by subtracting and dividing by the numbers above, and we want to maintain consistency.
# Furthermore, in real life, we would not be able to compute the mean and standard deviation of the test set since,
# under our assumptions, this data would not be accessible at that point.
#
# As a closing point, we often refer to this held-out set as the validation set, and we use a separate set,
# called the test set, after optimizing a model's performance on the validation set.
# This is done to avoid selecting a model based on the greedy and biased optimization of a single metric.

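######################################################################
# As a rough worked example of the normalization described above, the sketch below applies the same
# arithmetic to a single pixel in plain Python: rescale from [0, 255] to [0, 1], then subtract the channel
# mean and divide by the channel standard deviation. The helper name ``normalize_pixel`` is hypothetical and
# only serves this illustration; it is not part of ``torchvision``.

```python
# Per-channel statistics used by the tutorial's Normalize transform
mean = [0.485, 0.456, 0.406]
std = [0.229, 0.224, 0.225]

def normalize_pixel(rgb_uint8):
    """Hypothetical helper: mimic ToTensor (divide by 255) followed by Normalize for one RGB pixel."""
    scaled = [value / 255.0 for value in rgb_uint8]
    return [(s - m) / d for s, m, d in zip(scaled, mean, std)]

white = normalize_pixel([255, 255, 255])
black = normalize_pixel([0, 0, 0])

# After normalization, values are no longer confined to [0, 1]
print("white pixel:", [round(v, 3) for v in white])
print("black pixel:", [round(v, 3) for v in black])
```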
# Below we are preprocessing data for CIFAR-10. We use an arbitrary batch size of 128.
transforms_cifar = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# Loading the CIFAR-10 dataset:
train_dataset = datasets.CIFAR10(root='./data', train=True, download=True, transform=transforms_cifar)
test_dataset = datasets.CIFAR10(root='./data', train=False, download=True, transform=transforms_cifar)

########################################################################
# .. note:: This section is only for CPU users who are interested in quick results. Use this option only if you want a small-scale experiment. Keep in mind that the code should run fairly quickly on any GPU. Select only the first ``num_images_to_keep`` images from the train/test dataset:
#
#    .. code-block:: python
#
#       #from torch.utils.data import Subset
#       #num_images_to_keep = 2000
#       #train_dataset = Subset(train_dataset, range(min(num_images_to_keep, 50_000)))
#       #test_dataset = Subset(test_dataset, range(min(num_images_to_keep, 10_000)))

# Dataloaders
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=128, shuffle=True, num_workers=2)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=128, shuffle=False, num_workers=2)

######################################################################
# Defining model classes and utility functions
# --------------------------------------------
# Next, we need to define our model classes. Several user-defined parameters need to be set here. We use two different architectures, keeping the number of filters fixed across our experiments to ensure fair comparisons.
# Both architectures are Convolutional Neural Networks (CNNs) with a different number of convolutional layers that serve as feature extractors, followed by a classifier with 10 classes.
# The number of filters and neurons is smaller for the student.

# Deeper neural network class to be used as teacher:
class DeepNN(nn.Module):
    def __init__(self, num_classes=10):
        super(DeepNN, self).__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 128, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(128, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Conv2d(64, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(64, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
        )
        self.classifier = nn.Sequential(
            nn.Linear(2048, 512),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(512, num_classes)
        )

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, 1)
        x = self.classifier(x)
        return x

# Lightweight neural network class to be used as student:
class LightNN(nn.Module):
    def __init__(self, num_classes=10):
        super(LightNN, self).__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Conv2d(16, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
        )
        self.classifier = nn.Sequential(
            nn.Linear(1024, 256),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(256, num_classes)
        )

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, 1)
        x = self.classifier(x)
        return x

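######################################################################
# The input sizes of the two classifiers (``2048`` and ``1024``) follow from the shapes produced by the
# feature extractors. The sketch below reproduces that arithmetic in plain Python using the standard output-size
# formula for convolution and pooling. The helper names (``conv2d_out``, ``maxpool_out``, ``flattened_size``)
# are hypothetical and exist only for this illustration.

```python
def conv2d_out(size, kernel_size=3, padding=1, stride=1):
    """Spatial size after a square convolution (standard output-size formula)."""
    return (size + 2 * padding - kernel_size) // stride + 1

def maxpool_out(size, kernel_size=2, stride=2):
    """Spatial size after a square max-pooling layer without padding."""
    return (size - kernel_size) // stride + 1

def flattened_size(convs_per_stage, out_channels, image_size=32):
    """Each stage is a run of 3x3 convolutions (padding 1) followed by one 2x2 max-pool."""
    size = image_size
    for num_convs in convs_per_stage:
        for _ in range(num_convs):
            size = conv2d_out(size)
        size = maxpool_out(size)
    return out_channels * size * size

# Teacher: two stages of two convolutions each, ending with 32 channels
teacher_flat = flattened_size([2, 2], out_channels=32)
# Student: two stages of one convolution each, ending with 16 channels
student_flat = flattened_size([1, 1], out_channels=16)

print("Teacher flattened size:", teacher_flat)
print("Student flattened size:", student_flat)
```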
######################################################################
# We use two functions to help us produce and evaluate the results on our original classification task.
# One function is called ``train`` and takes the following arguments:
#
# - ``model``: A model instance to train (update its weights) via this function.
# - ``train_loader``: We defined our ``train_loader`` above, and its job is to feed the data into the model.
# - ``epochs``: How many times we loop over the dataset.
# - ``learning_rate``: The learning rate determines how large our steps towards convergence should be. Too large or too small steps can be detrimental.
# - ``device``: Determines the device to run the workload on. Can be either CPU or GPU depending on availability.
#
# Our ``test`` function is similar, but it will be invoked with ``test_loader`` to load images from the test set.
#
# .. figure:: /../_static/img/knowledge_distillation/ce_only.png
#    :align: center
#
#    Train both networks with Cross-Entropy. The student will be used as a baseline:
#

def train(model, train_loader, epochs, learning_rate, device):
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=learning_rate)

    model.train()

    for epoch in range(epochs):
        running_loss = 0.0
        for inputs, labels in train_loader:
            # inputs: A collection of batch_size images
            # labels: A vector of dimensionality batch_size with integers denoting class of each image
            inputs, labels = inputs.to(device), labels.to(device)

            optimizer.zero_grad()
            outputs = model(inputs)

            # outputs: Output of the network for the collection of images. A tensor of dimensionality batch_size x num_classes
            # labels: The actual labels of the images. Vector of dimensionality batch_size
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

            running_loss += loss.item()

        print(f"Epoch {epoch+1}/{epochs}, Loss: {running_loss / len(train_loader)}")

def test(model, test_loader, device):
    model.to(device)
    model.eval()

    correct = 0
    total = 0

    with torch.no_grad():
        for inputs, labels in test_loader:
            inputs, labels = inputs.to(device), labels.to(device)

            outputs = model(inputs)
            _, predicted = torch.max(outputs.data, 1)

            total += labels.size(0)
            correct += (predicted == labels).sum().item()

    accuracy = 100 * correct / total
    print(f"Test Accuracy: {accuracy:.2f}%")
    return accuracy

######################################################################
# Cross-entropy runs
# ------------------
# For reproducibility, we need to set the torch manual seed. We train networks using different methods, so to compare them fairly,
# it makes sense to initialize the networks with the same weights.
# Start by training the teacher network using cross-entropy:

torch.manual_seed(42)
nn_deep = DeepNN(num_classes=10).to(device)
train(nn_deep, train_loader, epochs=10, learning_rate=0.001, device=device)
test_accuracy_deep = test(nn_deep, test_loader, device)

# Instantiate the lightweight network:
torch.manual_seed(42)
nn_light = LightNN(num_classes=10).to(device)

######################################################################
# We instantiate one more lightweight network so that we can compare the two after training them differently.
# Backpropagation is sensitive to weight initialization,
# so we need to make sure these two networks have the exact same initialization.

torch.manual_seed(42)
new_nn_light = LightNN(num_classes=10).to(device)

######################################################################
# To check that the second network starts as a copy of the first, we inspect the norm of its first layer.
# This is a quick sanity check rather than a proof: if the norms match, the two networks were very likely initialized identically.

# Print the norm of the first layer of the initial lightweight model
print("Norm of 1st layer of nn_light:", torch.norm(nn_light.features[0].weight).item())
# Print the norm of the first layer of the new lightweight model
print("Norm of 1st layer of new_nn_light:", torch.norm(new_nn_light.features[0].weight).item())

######################################################################
# Print the total number of parameters in each model:
total_params_deep = "{:,}".format(sum(p.numel() for p in nn_deep.parameters()))
print(f"DeepNN parameters: {total_params_deep}")
total_params_light = "{:,}".format(sum(p.numel() for p in nn_light.parameters()))
print(f"LightNN parameters: {total_params_light}")

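######################################################################
# The parameter totals can also be derived by hand from the layer definitions of ``DeepNN`` and ``LightNN``.
# The sketch below is a plain-Python cross-check, assuming every convolution and linear layer has a bias term
# (the PyTorch default). The helper names ``conv2d_params`` and ``linear_params`` are hypothetical.

```python
def conv2d_params(in_channels, out_channels, kernel_size=3):
    """Weights (in * out * k * k) plus one bias per output channel."""
    return in_channels * out_channels * kernel_size * kernel_size + out_channels

def linear_params(in_features, out_features):
    """Weights (in * out) plus one bias per output feature."""
    return in_features * out_features + out_features

# Layer sizes copied from the DeepNN definition
deep_params = (
    conv2d_params(3, 128) + conv2d_params(128, 64)
    + conv2d_params(64, 64) + conv2d_params(64, 32)
    + linear_params(2048, 512) + linear_params(512, 10)
)

# Layer sizes copied from the LightNN definition
light_params = (
    conv2d_params(3, 16) + conv2d_params(16, 16)
    + linear_params(1024, 256) + linear_params(256, 10)
)

print(f"DeepNN parameters (by hand): {deep_params:,}")
print(f"LightNN parameters (by hand): {light_params:,}")
print(f"Teacher/student size ratio: {deep_params / light_params:.2f}")
```

# In both networks, most of the parameters sit in the first linear layer of the classifier.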
######################################################################
# Train and test the lightweight network with cross entropy loss:
train(nn_light, train_loader, epochs=10, learning_rate=0.001, device=device)
test_accuracy_light_ce = test(nn_light, test_loader, device)

######################################################################
# Based on test accuracy, we can now compare the deeper network, which will act as the teacher, with the lightweight network, which will be our student. So far, the teacher has played no part in training the student, so this is the performance the student achieves on its own.
# The metrics so far can be seen with the following lines:

print(f"Teacher accuracy: {test_accuracy_deep:.2f}%")
print(f"Student accuracy: {test_accuracy_light_ce:.2f}%")

######################################################################
# Knowledge distillation run
# --------------------------
# Now let's try to improve the test accuracy of the student network by incorporating the teacher.
# Knowledge distillation is a straightforward technique to achieve this,
# based on the fact that both networks output a probability distribution over our classes.
# Therefore, the two networks share the same number of output neurons.
# The method works by adding to the traditional cross-entropy loss an additional loss
# that is based on the softmax output of the teacher network.
# The assumption is that the output activations of a properly trained teacher network carry additional information that can be leveraged by a student network during training.
# The original work suggests that utilizing ratios of smaller probabilities in the soft targets can help achieve the underlying objective of deep neural networks,
# which is to create a similarity structure over the data where similar objects are mapped closer together.
# For example, in CIFAR-10, a truck could be mistaken for an automobile or an airplane
# if its wheels are present, but it is less likely to be mistaken for a dog.
# Therefore, it makes sense to assume that valuable information resides not only in the top prediction of a properly trained model but in the entire output distribution.
# However, cross entropy alone does not sufficiently exploit this information, as the activations for non-predicted classes
# tend to be so small that propagated gradients do not meaningfully change the weights to construct this desirable vector space.
#
# As we continue defining our first helper function that introduces a teacher-student dynamic, we need to include a few extra parameters:
#
# - ``T``: Temperature controls the smoothness of the output distributions. Larger ``T`` leads to smoother distributions, thus smaller probabilities get a larger boost.
# - ``soft_target_loss_weight``: A weight assigned to the extra objective we're about to include.
# - ``ce_loss_weight``: A weight assigned to cross-entropy. Tuning these weights pushes the network towards optimizing for either objective.
#
# .. figure:: /../_static/img/knowledge_distillation/distillation_output_loss.png
#    :align: center
#
#    Distillation loss is calculated from the logits of the networks. It only returns gradients to the student:
#

def train_knowledge_distillation(teacher, student, train_loader, epochs, learning_rate, T, soft_target_loss_weight, ce_loss_weight, device):
    ce_loss = nn.CrossEntropyLoss()
    optimizer = optim.Adam(student.parameters(), lr=learning_rate)

    teacher.eval()  # Teacher set to evaluation mode
    student.train() # Student to train mode

    for epoch in range(epochs):
        running_loss = 0.0
        for inputs, labels in train_loader:
            inputs, labels = inputs.to(device), labels.to(device)

            optimizer.zero_grad()

            # Forward pass with the teacher model - do not save gradients here as we do not change the teacher's weights
            with torch.no_grad():
                teacher_logits = teacher(inputs)

            # Forward pass with the student model
            student_logits = student(inputs)

            # Soften both outputs with temperature T: softmax for the teacher (soft targets), log_softmax for the student
            soft_targets = nn.functional.softmax(teacher_logits / T, dim=-1)
            soft_prob = nn.functional.log_softmax(student_logits / T, dim=-1)

            # Calculate the soft targets loss (KL divergence averaged over the batch). Scaled by T**2 as suggested by the authors of the paper "Distilling the knowledge in a neural network"
            soft_targets_loss = torch.sum(soft_targets * (soft_targets.log() - soft_prob)) / soft_prob.size()[0] * (T**2)

            # Calculate the true label loss
            label_loss = ce_loss(student_logits, labels)

            # Weighted sum of the two losses
            loss = soft_target_loss_weight * soft_targets_loss + ce_loss_weight * label_loss

            loss.backward()
            optimizer.step()

            running_loss += loss.item()

        print(f"Epoch {epoch+1}/{epochs}, Loss: {running_loss / len(train_loader)}")

# Apply ``train_knowledge_distillation`` with a temperature of 2. Arbitrarily set the weights to 0.75 for CE and 0.25 for distillation loss.
train_knowledge_distillation(teacher=nn_deep, student=new_nn_light, train_loader=train_loader, epochs=10, learning_rate=0.001, T=2, soft_target_loss_weight=0.25, ce_loss_weight=0.75, device=device)
test_accuracy_light_ce_and_kd = test(new_nn_light, test_loader, device)

# Compare the student test accuracy with and without the teacher, after distillation
print(f"Teacher accuracy: {test_accuracy_deep:.2f}%")
print(f"Student accuracy without teacher: {test_accuracy_light_ce:.2f}%")
print(f"Student accuracy with CE + KD: {test_accuracy_light_ce_and_kd:.2f}%")

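######################################################################
# To make the role of the temperature more concrete, the sketch below recomputes the soft-target loss for a
# single example in plain Python. The logits are made-up numbers chosen for illustration, and the helpers
# ``softmax`` and ``kl_divergence`` are hypothetical stand-ins for the tensor operations used in
# ``train_knowledge_distillation``; this is an illustration of the formula, not a replacement for it.

```python
import math

def softmax(logits, T=1.0):
    """Softmax of logits divided by temperature T (shifted by the max for numerical stability)."""
    scaled = [z / T for z in logits]
    shift = max(scaled)
    exps = [math.exp(z - shift) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with strictly positive entries."""
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q))

# Made-up logits for one image and three classes (assumption for illustration)
teacher_logits = [6.0, 2.0, -1.0]
student_logits = [4.0, 3.0, 0.0]

for T in (1.0, 2.0, 5.0):
    soft_targets = softmax(teacher_logits, T)   # teacher distribution at temperature T
    soft_prob = softmax(student_logits, T)      # student distribution at temperature T
    soft_targets_loss = kl_divergence(soft_targets, soft_prob) * T ** 2
    smallest = min(soft_targets)
    print(f"T={T}: smallest teacher probability={smallest:.4f}, soft-target loss={soft_targets_loss:.4f}")
```

# As ``T`` grows, the smallest teacher probability increases, which is the "boost" to small probabilities
# mentioned in the description of ``T``.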
######################################################################
# Cosine loss minimization run
# ----------------------------
# Feel free to play around with the temperature parameter that controls the softness of the softmax function and the loss coefficients.
# In neural networks, it is easy to add extra loss functions to the main objective to achieve goals like better generalization.
# Let's try including another objective for the student, this time focusing on the networks' hidden states rather than their output layers.
# Our goal is to convey information from the teacher's representation to the student by including a naive loss function,
# whose minimization implies that the flattened vectors that are subsequently passed to the classifiers have become more *similar* as the loss decreases.
# Of course, the teacher does not update its weights, so the minimization depends only on the student's weights.
# The rationale behind this method is the assumption that the teacher model has a better internal representation that is
# unlikely to be achieved by the student without external intervention, therefore we artificially push the student to mimic the internal representation of the teacher.
# Whether or not this will end up helping the student is not straightforward, though, because pushing the lightweight network
# to reach this point could be a good thing, assuming that we have found an internal representation that leads to better test accuracy,
# but it could also be harmful because the networks have different architectures and the student does not have the same learning capacity as the teacher.
# In other words, there is no reason for these two vectors, the student's and the teacher's, to match per component.
# The student could reach an internal representation that is a permutation of the teacher's and it would be just as effective.
# Nonetheless, we can still run a quick experiment to figure out the impact of this method.
# We will be using the ``CosineEmbeddingLoss`` which is given by the following formula:
#
# .. figure:: /../_static/img/knowledge_distillation/cosine_embedding_loss.png
#    :align: center
#    :width: 450px
#
#    Formula for CosineEmbeddingLoss
#
# There is one thing that we need to resolve first.
# When we applied distillation to the output layer we mentioned that both networks have the same number of neurons, equal to the number of classes.
# However, this is not the case for the layer following our convolutional layers. Here, the teacher has more neurons than the student
# after the flattening of the final convolutional layer (2048 versus 1024). Our loss function accepts two vectors of equal dimensionality as inputs,
# therefore we need to somehow match them. We will solve this by including an average pooling layer after the teacher's convolutional layer to reduce its dimensionality to match that of the student.
#
# To proceed, we will modify our model classes, or create new ones.
# Now, the forward function returns not only the logits of the network but also the flattened hidden representation after the convolutional layer. We include the aforementioned pooling for the modified teacher.

class ModifiedDeepNNCosine(nn.Module):
    def __init__(self, num_classes=10):
        super(ModifiedDeepNNCosine, self).__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 128, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(128, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Conv2d(64, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(64, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
        )
        self.classifier = nn.Sequential(
            nn.Linear(2048, 512),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(512, num_classes)
        )

    def forward(self, x):
        x = self.features(x)
        flattened_conv_output = torch.flatten(x, 1)
        x = self.classifier(flattened_conv_output)
        flattened_conv_output_after_pooling = torch.nn.functional.avg_pool1d(flattened_conv_output, 2)
        return x, flattened_conv_output_after_pooling

# Create a similar student class where we return a tuple. We do not apply pooling after flattening.
class ModifiedLightNNCosine(nn.Module):
    def __init__(self, num_classes=10):
        super(ModifiedLightNNCosine, self).__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Conv2d(16, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
        )
        self.classifier = nn.Sequential(
            nn.Linear(1024, 256),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(256, num_classes)
        )

    def forward(self, x):
        x = self.features(x)
        flattened_conv_output = torch.flatten(x, 1)
        x = self.classifier(flattened_conv_output)
        return x, flattened_conv_output

# We do not have to train the modified deep network from scratch of course, we just load its weights from the trained instance
modified_nn_deep = ModifiedDeepNNCosine(num_classes=10).to(device)
modified_nn_deep.load_state_dict(nn_deep.state_dict())

# Once again ensure the norm of the first layer is the same for both networks
print("Norm of 1st layer for deep_nn:", torch.norm(nn_deep.features[0].weight).item())
print("Norm of 1st layer for modified_deep_nn:", torch.norm(modified_nn_deep.features[0].weight).item())

# Initialize a modified lightweight network with the same seed as our other lightweight instances. This will be trained from scratch to examine the effectiveness of cosine loss minimization.
torch.manual_seed(42)
modified_nn_light = ModifiedLightNNCosine(num_classes=10).to(device)
print("Norm of 1st layer:", torch.norm(modified_nn_light.features[0].weight).item())

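######################################################################
# The dimensionality matching in ``ModifiedDeepNNCosine`` relies on ``avg_pool1d`` with a kernel size of 2,
# which averages each pair of neighbouring entries and therefore halves the length of the vector.
# The sketch below shows this on a tiny hand-made vector and on a random tensor of the teacher's flattened
# size. The tensors are illustrative assumptions, and the 2D call pattern is the same one used in the class
# above (it assumes a PyTorch version that accepts 2D inputs to ``avg_pool1d``, as the tutorial itself does).

```python
import torch
import torch.nn.functional as F

# A hand-made "flattened" vector with 8 entries (batch of 1), for illustration only
flattened = torch.tensor([[1.0, 3.0, 5.0, 7.0, 2.0, 4.0, 6.0, 8.0]])

# Kernel size 2 (stride defaults to the kernel size): average each non-overlapping pair
pooled = F.avg_pool1d(flattened, 2)
print(pooled)

# Same operation on a random tensor shaped like the teacher's flattened output
teacher_like = torch.randn(128, 2048)
pooled_teacher_like = F.avg_pool1d(teacher_like, 2)
print(pooled_teacher_like.shape)
```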
######################################################################
# Naturally, we need to change the train loop because now the model returns a tuple ``(logits, hidden_representation)``. Using a sample input tensor
# we can print their shapes.

# Create a sample input tensor
sample_input = torch.randn(128, 3, 32, 32).to(device) # Batch size: 128, Channels: 3, Image size: 32x32

# Pass the input through the student
logits, hidden_representation = modified_nn_light(sample_input)

# Print the shapes of the tensors
print("Student logits shape:", logits.shape) # batch_size x total_classes
print("Student hidden representation shape:", hidden_representation.shape) # batch_size x hidden_representation_size

# Pass the input through the teacher
logits, hidden_representation = modified_nn_deep(sample_input)

# Print the shapes of the tensors
print("Teacher logits shape:", logits.shape) # batch_size x total_classes
print("Teacher hidden representation shape:", hidden_representation.shape) # batch_size x hidden_representation_size

######################################################################
# In our case, ``hidden_representation_size`` is ``1024``. This is the flattened feature map of the final convolutional layer of the student and as you can see,
# it is the input for its classifier. It is ``1024`` for the teacher too, because we reduced it from ``2048`` with ``avg_pool1d``.
# The loss applied here only affects the student's weights that come before the point where the loss is computed, that is, its convolutional feature extractor. In other words, it does not affect the classifier of the student.
# The modified training loop is the following:
#
# .. figure:: /../_static/img/knowledge_distillation/cosine_loss_distillation.png
#    :align: center
#
#    In Cosine Loss minimization, we want to maximize the cosine similarity of the two representations by returning gradients to the student:
#

def train_cosine_loss(teacher, student, train_loader, epochs, learning_rate, hidden_rep_loss_weight, ce_loss_weight, device):
    ce_loss = nn.CrossEntropyLoss()
    cosine_loss = nn.CosineEmbeddingLoss()
    optimizer = optim.Adam(student.parameters(), lr=learning_rate)

    teacher.to(device)
    student.to(device)
    teacher.eval()  # Teacher set to evaluation mode
    student.train() # Student to train mode

    for epoch in range(epochs):
        running_loss = 0.0
        for inputs, labels in train_loader:
            inputs, labels = inputs.to(device), labels.to(device)

            optimizer.zero_grad()

            # Forward pass with the teacher model and keep only the hidden representation
            with torch.no_grad():
                _, teacher_hidden_representation = teacher(inputs)

            # Forward pass with the student model
            student_logits, student_hidden_representation = student(inputs)

            # Calculate the cosine loss. Target is a vector of ones. From the loss formula above we can see that is the case where loss minimization leads to cosine similarity increase.
            hidden_rep_loss = cosine_loss(student_hidden_representation, teacher_hidden_representation, target=torch.ones(inputs.size(0)).to(device))

            # Calculate the true label loss
            label_loss = ce_loss(student_logits, labels)

            # Weighted sum of the two losses
            loss = hidden_rep_loss_weight * hidden_rep_loss + ce_loss_weight * label_loss

            loss.backward()
            optimizer.step()

            running_loss += loss.item()

        print(f"Epoch {epoch+1}/{epochs}, Loss: {running_loss / len(train_loader)}")

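######################################################################
# With a target of ones, ``CosineEmbeddingLoss`` reduces to ``1 - cosine_similarity`` averaged over the batch,
# which is why minimizing it increases the similarity of the two representations. The sketch below checks this
# on random tensors; the tensors and their batch size of 4 are assumptions made purely for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Random stand-ins for the student and teacher hidden representations
student_repr = torch.randn(4, 1024)
teacher_repr = torch.randn(4, 1024)

cosine_loss = nn.CosineEmbeddingLoss()
target = torch.ones(student_repr.size(0))  # target = 1 means "make these pairs similar"

# Loss as computed in the training loop
loss = cosine_loss(student_repr, teacher_repr, target)

# The same quantity written out by hand
manual = (1 - F.cosine_similarity(student_repr, teacher_repr, dim=1)).mean()
print(loss.item(), manual.item())

# Identical representations give a loss of (numerically) zero
zero_loss = cosine_loss(teacher_repr, teacher_repr, target)
print(zero_loss.item())
```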
######################################################################
# We need to modify our test function for the same reason. Here we ignore the hidden representation returned by the model.

def test_multiple_outputs(model, test_loader, device):
    model.to(device)
    model.eval()

    correct = 0
    total = 0

    with torch.no_grad():
        for inputs, labels in test_loader:
            inputs, labels = inputs.to(device), labels.to(device)

            outputs, _ = model(inputs) # Disregard the second tensor of the tuple
            _, predicted = torch.max(outputs.data, 1)

            total += labels.size(0)
            correct += (predicted == labels).sum().item()

    accuracy = 100 * correct / total
    print(f"Test Accuracy: {accuracy:.2f}%")
    return accuracy

######################################################################
# In this case, we could easily include both knowledge distillation and cosine loss minimization in the same function. It is common to combine methods to achieve better performance in teacher-student paradigms.
# For now, we can run a simple train-test session.

# Train and test the lightweight network with cross entropy plus cosine loss
train_cosine_loss(teacher=modified_nn_deep, student=modified_nn_light, train_loader=train_loader, epochs=10, learning_rate=0.001, hidden_rep_loss_weight=0.25, ce_loss_weight=0.75, device=device)
test_accuracy_light_ce_and_cosine_loss = test_multiple_outputs(modified_nn_light, test_loader, device)

######################################################################
# Intermediate regressor run
# --------------------------
# Our naive minimization does not guarantee better results for several reasons, one being the dimensionality of the vectors.
# Cosine similarity generally works better than Euclidean distance for vectors of higher dimensionality,
# but we were dealing with vectors with 1024 components each, so it is much harder to extract meaningful similarities.
# Furthermore, as we mentioned, pushing towards a match of the hidden representation of the teacher and the student is not supported by theory.
# There are no good reasons why we should be aiming for a 1:1 match of these vectors.
# We will provide a final example of training intervention by including an extra network called a regressor.
# The objective is to first extract the feature map of the teacher after a convolutional layer,
# then extract a feature map of the student after a convolutional layer, and finally try to match these maps.
# However, this time, we will introduce a regressor between the networks to facilitate the matching process.
# The regressor will be trainable and ideally will do a better job than our naive cosine loss minimization scheme.
# Its main job is to match the dimensionality of these feature maps so that we can properly define a loss function between the teacher and the student.
# Defining such a loss function provides a teaching "path," which is basically a flow to back-propagate gradients that will change the student's weights.
# Focusing on the output of the convolutional layers right before each classifier for our original networks, we have the following shapes:
#

# Pass the sample input through the convolutional feature extractor only
convolutional_fe_output_student = nn_light.features(sample_input)
convolutional_fe_output_teacher = nn_deep.features(sample_input)

# Print their shapes
print("Student's feature extractor output shape: ", convolutional_fe_output_student.shape)
print("Teacher's feature extractor output shape: ", convolutional_fe_output_teacher.shape)

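######################################################################
# Given the architectures defined earlier, the two feature maps have the same 8x8 spatial size and differ only
# in the number of channels (16 for the student, 32 for the teacher). The sketch below illustrates, on random
# stand-in tensors, how a trainable 3x3 convolution with padding 1 can map 16 channels to 32 so that a Mean
# Squared Error between the two maps is well defined. The tensors and the batch size of 4 are assumptions for
# illustration only.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Random stand-ins for the feature maps (batch of 4, 8x8 spatial size)
student_map = torch.randn(4, 16, 8, 8)
teacher_map = torch.randn(4, 32, 8, 8)

# Trainable layer that lifts the student's 16 channels to the teacher's 32;
# kernel_size=3 with padding=1 keeps the 8x8 spatial size unchanged
regressor = nn.Conv2d(16, 32, kernel_size=3, padding=1)
regressor_output = regressor(student_map)
print(regressor_output.shape)

# With matching shapes, an element-wise loss such as MSE can be computed
mse = nn.MSELoss()(regressor_output, teacher_map)
mse.backward()  # gradients flow into the regressor's weights
print(mse.item())
```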
583
######################################################################
# We have 32 filters for the teacher and 16 filters for the student.
# We will include a trainable layer that converts the feature map of the student to the shape of the feature map of the teacher.
# In practice, we modify the lightweight class to return the hidden state after an intermediate regressor that matches the sizes of the convolutional
# feature maps, and the teacher class to return the output of its convolutional feature extractor without any additional pooling or flattening.
#
# .. figure:: /../_static/img/knowledge_distillation/fitnets_knowledge_distill.png
#    :align: center
#
# The trainable layer matches the shapes of the intermediate tensors, so Mean Squared Error (MSE) is properly defined:
#

class ModifiedDeepNNRegressor(nn.Module):
    def __init__(self, num_classes=10):
        super(ModifiedDeepNNRegressor, self).__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 128, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(128, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Conv2d(64, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(64, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
        )
        self.classifier = nn.Sequential(
            nn.Linear(2048, 512),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(512, num_classes)
        )

    def forward(self, x):
        x = self.features(x)
        # Keep the convolutional feature map so it can serve as the MSE target
        conv_feature_map = x
        x = torch.flatten(x, 1)
        x = self.classifier(x)
        return x, conv_feature_map

class ModifiedLightNNRegressor(nn.Module):
    def __init__(self, num_classes=10):
        super(ModifiedLightNNRegressor, self).__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Conv2d(16, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
        )
        # Include an extra regressor (in our case a single convolution with no activation, i.e. a linear map)
        self.regressor = nn.Sequential(
            nn.Conv2d(16, 32, kernel_size=3, padding=1)
        )
        self.classifier = nn.Sequential(
            nn.Linear(1024, 256),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(256, num_classes)
        )

    def forward(self, x):
        x = self.features(x)
        # The regressor maps the 16-channel student map to the teacher's 32 channels
        regressor_output = self.regressor(x)
        # The classifier still consumes the original (16-channel) feature map
        x = torch.flatten(x, 1)
        x = self.classifier(x)
        return x, regressor_output

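######################################################################
# The ``2048`` and ``1024`` input sizes of the two classifiers above follow directly from the feature map shapes.
# As a minimal sketch, assuming 32x32 CIFAR-10 inputs, 3x3 convolutions with ``padding=1`` (which keep the spatial size),
# and two 2x2 max-pooling layers, the flattened size can be worked out by hand.
# The helper ``flattened_size`` is hypothetical and not part of the tutorial code:

```python
def flattened_size(out_channels, image_size=32, num_pools=2):
    """Number of features after flattening a (out_channels, H, W) feature map.

    Assumes every convolution preserves the spatial size and every pooling
    layer halves it (kernel_size=2, stride=2).
    """
    spatial = image_size // (2 ** num_pools)
    return out_channels * spatial * spatial


teacher_features = flattened_size(out_channels=32)  # 32 * 8 * 8
student_features = flattened_size(out_channels=16)  # 16 * 8 * 8
print(teacher_features, student_features)
```

# If you change the number of filters in the last convolution, the first ``nn.Linear`` of the classifier has to change with it.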
######################################################################
# After that, we have to update our train loop again. This time, we extract the regressor output of the student and the feature map of the teacher,
# calculate the ``MSE`` on these tensors (they have the exact same shape so it's properly defined), and backpropagate gradients based on that loss,
# in addition to the regular cross-entropy loss of the classification task.

def train_mse_loss(teacher, student, train_loader, epochs, learning_rate, feature_map_weight, ce_loss_weight, device):
    ce_loss = nn.CrossEntropyLoss()
    mse_loss = nn.MSELoss()
    optimizer = optim.Adam(student.parameters(), lr=learning_rate)

    teacher.to(device)
    student.to(device)
    teacher.eval()  # Teacher set to evaluation mode
    student.train() # Student to train mode

    for epoch in range(epochs):
        running_loss = 0.0
        for inputs, labels in train_loader:
            inputs, labels = inputs.to(device), labels.to(device)

            optimizer.zero_grad()

            # Again ignore teacher logits
            with torch.no_grad():
                _, teacher_feature_map = teacher(inputs)

            # Forward pass with the student model
            student_logits, regressor_feature_map = student(inputs)

            # Calculate the MSE loss between the regressor output and the teacher's feature map
            hidden_rep_loss = mse_loss(regressor_feature_map, teacher_feature_map)

            # Calculate the true label loss
            label_loss = ce_loss(student_logits, labels)

            # Weighted sum of the two losses
            loss = feature_map_weight * hidden_rep_loss + ce_loss_weight * label_loss

            loss.backward()
            optimizer.step()

            running_loss += loss.item()

        print(f"Epoch {epoch+1}/{epochs}, Loss: {running_loss / len(train_loader)}")

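######################################################################
# To make the teaching "path" concrete, the sketch below checks which parts of a student receive gradients from each loss term.
# It is only an illustration under stated assumptions: ``features``, ``regressor`` and ``classifier`` are tiny hypothetical
# stand-ins with made-up sizes, and ``teacher_feature_map`` is random data standing in for the frozen teacher's output.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Tiny stand-ins for the student's three parts (hypothetical sizes, not the tutorial's)
features = nn.Conv2d(3, 4, kernel_size=3, padding=1)
regressor = nn.Conv2d(4, 8, kernel_size=3, padding=1)
classifier = nn.Linear(4 * 8 * 8, 10)

inputs = torch.randn(2, 3, 8, 8)
labels = torch.tensor([1, 7])
teacher_feature_map = torch.randn(2, 8, 8, 8)  # random stand-in for the teacher's output

# Same structure as the student's forward pass: the classifier reads the
# feature map directly, the regressor output is only used for the MSE term
x = features(inputs)
regressor_output = regressor(x)
logits = classifier(torch.flatten(x, 1))

hidden_rep_loss = nn.MSELoss()(regressor_output, teacher_feature_map)
label_loss = nn.CrossEntropyLoss()(logits, labels)

# Backpropagate only the MSE term first; keep the graph for the second backward call
hidden_rep_loss.backward(retain_graph=True)
mse_reaches_features = features.weight.grad is not None
mse_reaches_regressor = regressor.weight.grad is not None
mse_reaches_classifier = classifier.weight.grad is not None

# Then backpropagate the cross-entropy term (gradients accumulate)
label_loss.backward()
ce_reaches_classifier = classifier.weight.grad is not None

print("MSE reaches features:  ", mse_reaches_features)
print("MSE reaches regressor: ", mse_reaches_regressor)
print("MSE reaches classifier:", mse_reaches_classifier)
print("CE reaches classifier: ", ce_reaches_classifier)
```

# In this sketch the MSE term updates the feature extractor and the regressor, while the classifier is trained by cross-entropy alone.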
# Notice how our test function is the same as the one we used in the previous case. We only care about the actual outputs because we measure accuracy.

# Initialize a ModifiedLightNNRegressor
torch.manual_seed(42)
modified_nn_light_reg = ModifiedLightNNRegressor(num_classes=10).to(device)

# We do not have to train the modified deep network from scratch of course, we just load its weights from the trained instance
modified_nn_deep_reg = ModifiedDeepNNRegressor(num_classes=10).to(device)
modified_nn_deep_reg.load_state_dict(nn_deep.state_dict())

# Train and test once again
train_mse_loss(teacher=modified_nn_deep_reg, student=modified_nn_light_reg, train_loader=train_loader, epochs=10, learning_rate=0.001, feature_map_weight=0.25, ce_loss_weight=0.75, device=device)
test_accuracy_light_ce_and_mse_loss = test_multiple_outputs(modified_nn_light_reg, test_loader, device)

######################################################################
# It is expected that the final method will work better than ``CosineLoss`` because now we have allowed a trainable layer between the teacher and the student,
# which gives the student some wiggle room when it comes to learning, rather than pushing the student to copy the teacher's representation.
# Including the extra network is the idea behind hint-based distillation.

print(f"Teacher accuracy: {test_accuracy_deep:.2f}%")
print(f"Student accuracy without teacher: {test_accuracy_light_ce:.2f}%")
print(f"Student accuracy with CE + KD: {test_accuracy_light_ce_and_kd:.2f}%")
print(f"Student accuracy with CE + CosineLoss: {test_accuracy_light_ce_and_cosine_loss:.2f}%")
print(f"Student accuracy with CE + RegressorMSE: {test_accuracy_light_ce_and_mse_loss:.2f}%")

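######################################################################
# The regressor only serves the training loss: the classifier never reads its output.
# The sketch below illustrates this on ``TinyStudent``, a hypothetical, scaled-down stand-in with the same forward structure
# as the student (its layer sizes are assumptions made for the example). Swapping the regressor for ``nn.Identity``
# leaves the logits unchanged, which suggests the regressor could be skipped at inference time.

```python
import torch
import torch.nn as nn


class TinyStudent(nn.Module):
    """Hypothetical stand-in with the same forward structure as the student."""

    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 4, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
        )
        self.regressor = nn.Sequential(nn.Conv2d(4, 8, kernel_size=3, padding=1))
        self.classifier = nn.Linear(4 * 4 * 4, num_classes)

    def forward(self, x):
        x = self.features(x)
        regressor_output = self.regressor(x)
        x = torch.flatten(x, 1)
        return self.classifier(x), regressor_output


torch.manual_seed(0)
student = TinyStudent().eval()
sample = torch.randn(2, 3, 8, 8)

with torch.no_grad():
    logits_with_regressor, _ = student(sample)
    # Swap the regressor for a no-op; the classifier never reads its output
    student.regressor = nn.Identity()
    logits_without_regressor, _ = student(sample)

logits_match = torch.equal(logits_with_regressor, logits_without_regressor)
print("Logits unchanged without the regressor:", logits_match)
```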
######################################################################
# Conclusion
# --------------------------------------------
# None of the methods above increases the number of parameters for the network or inference time,
# so the performance increase comes at the small cost of calculating extra gradients during training.
# In ML applications, we mostly care about inference time because training happens before the model deployment.
# If our lightweight model is still too heavy for deployment, we can apply different ideas, such as post-training quantization.
# Additional losses can be applied in many tasks, not just classification, and you can experiment with quantities like coefficients,
# temperature, or number of neurons. Feel free to tune any numbers in the tutorial above,
# but keep in mind that if you change the number of neurons or filters, a shape mismatch is likely to occur.
#
# For more information, see:
#
# * `Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. In: Neural Information Processing Systems Deep Learning Workshop (2015) <https://arxiv.org/abs/1503.02531>`_
#
# * `Romero, A., Ballas, N., Kahou, S.E., Chassang, A., Gatta, C., Bengio, Y.: FitNets: Hints for thin deep nets. In: Proceedings of the International Conference on Learning Representations (2015) <https://arxiv.org/abs/1412.6550>`_