Week 4 Assignment: Custom training with tf.distribute.Strategy
Welcome to the final assignment of this course! For this week, you will implement a distribution strategy to train on the Oxford Flowers 102 dataset. As the name suggests, distribution strategies allow you to set up training across multiple devices. We are only using a single device in this lab, but the syntax you'll apply should also work when you have a multi-device setup. Let's begin!
Imports
Download the dataset
Create a strategy to distribute the variables and the graph
How does the tf.distribute.MirroredStrategy strategy work?
All the variables and the model graph are replicated on the replicas.
Input is evenly distributed across the replicas.
Each replica calculates the loss and gradients for the input it received.
The gradients are synced across all the replicas by summing them.
After the sync, the same update is made to the copies of the variables on each replica.
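For reference, creating such a strategy is a one-liner; a minimal sketch (the variable name strategy matches the later references in this notebook):

```python
import tensorflow as tf

# MirroredStrategy replicates the model onto every visible GPU,
# or falls back to a single device if only one is available.
strategy = tf.distribute.MirroredStrategy()
print('Number of devices: {}'.format(strategy.num_replicas_in_sync))
```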
Set up the input pipeline
Set some constants, including the buffer size, number of epochs, and the image size.
Define a function to format the image (resizes the image and scales the pixel values to the range [0, 1]).
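A minimal sketch of that function, assuming the IMAGE_SIZE constant set above and dataset elements structured as (image, label) pairs:

```python
def format_image(image, label):
    # Resize to the target size and scale pixel values to [0, 1]
    image = tf.image.resize(image, (IMAGE_SIZE, IMAGE_SIZE)) / 255.0
    return image, label
```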
Set the global batch size (please complete this section)
Given the batch size per replica and the strategy, set the global batch size.
The global batch size is the batch size per replica times the number of replicas in the strategy.
Hint: You'll want to use the num_replicas_in_sync attribute stored in the strategy.
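One possible completion, assuming a helper named set_global_batch_size (the graded cell's actual name may differ):

```python
def set_global_batch_size(batch_size_per_replica, strategy):
    # Global batch size = per-replica batch size * number of replicas
    return batch_size_per_replica * strategy.num_replicas_in_sync
```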
Set the GLOBAL_BATCH_SIZE with the function that you just defined
Expected Output:
Create the datasets using the global batch size and distribute the training, validation, and test batches
Define the distributed datasets (please complete this section)
Create the distributed datasets using experimental_distribute_dataset() of the Strategy class, passing in the training batches.
Do the same for the validation batches and test batches.
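A sketch of that function; the dataset variable names here (train_batches, validation_batches, test_batches) are assumptions:

```python
def distribute_datasets(strategy, train_batches, validation_batches, test_batches):
    train_dist_dataset = strategy.experimental_distribute_dataset(train_batches)
    val_dist_dataset = strategy.experimental_distribute_dataset(validation_batches)
    test_dist_dataset = strategy.experimental_distribute_dataset(test_batches)
    return train_dist_dataset, val_dist_dataset, test_dist_dataset
```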
Call the function that you just defined to get the distributed datasets.
Take a look at the type of the train_dist_dataset
Expected Output:
Also get familiar with a single batch from the train_dist_dataset:
Each batch has 64 features and labels
Create the model
Use the Model Subclassing API to create the model ResNetModel as a subclass of tf.keras.Model.
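The architecture itself isn't shown in this extract; a minimal subclassing sketch, assuming a pretrained ResNet50 backbone with a dense softmax head (the assignment's actual layers may differ):

```python
class ResNetModel(tf.keras.Model):
    def __init__(self, classes):
        super(ResNetModel, self).__init__()
        # Assumption: a pretrained ResNet50 feature extractor with average pooling
        self._feature_extractor = tf.keras.applications.ResNet50(
            include_top=False, weights='imagenet', pooling='avg')
        self._classifier = tf.keras.layers.Dense(classes, activation='softmax')

    def call(self, inputs):
        x = self._feature_extractor(inputs)
        return self._classifier(x)
```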
Create a checkpoint directory to store the checkpoints (the model's weights during training).
Define the loss function
You'll define the loss_object and compute_loss within the strategy.scope().
loss_object will be used later to calculate the loss on the test set.
compute_loss will be used later to calculate the average loss on the training data.
You will be using these two loss calculations later.
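A sketch following the standard tf.distribute custom-training pattern: reduction is set to NONE so the per-example losses can be averaged over the global batch by hand (assuming the GLOBAL_BATCH_SIZE set earlier):

```python
with strategy.scope():
    # Per-example losses, so we can average over the GLOBAL batch ourselves
    loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
        reduction=tf.keras.losses.Reduction.NONE)

    def compute_loss(labels, predictions):
        per_example_loss = loss_object(labels, predictions)
        return tf.nn.compute_average_loss(
            per_example_loss, global_batch_size=GLOBAL_BATCH_SIZE)
```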
Define the metrics to track loss and accuracy
These metrics track the test loss and training and test accuracy.
You can use .result() to get the accumulated statistics at any time, for example, train_accuracy.result().
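Roughly, within the strategy's scope (the metric classes here are assumptions; the variable names match later references in this notebook):

```python
with strategy.scope():
    test_loss = tf.keras.metrics.Mean(name='test_loss')
    train_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(name='train_accuracy')
    test_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(name='test_accuracy')
```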
Instantiate the model, optimizer, and checkpoints
This code is given to you. Just remember that they are created within the strategy.scope().
Instantiate the ResNetModel, passing in the number of classes
Create an instance of the Adam optimizer.
Create a checkpoint for this model and its optimizer.
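For orientation, the given code presumably resembles this sketch (num_classes and the exact checkpoint setup are assumptions):

```python
with strategy.scope():
    model = ResNetModel(classes=num_classes)
    optimizer = tf.keras.optimizers.Adam()
    checkpoint = tf.train.Checkpoint(optimizer=optimizer, model=model)
```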
Training loop (please complete this section)
You will define a regular training step and test step, which could work without a distributed strategy. You can then use strategy.run to apply these functions in a distributed manner.
Notice that you'll define train_step and test_step inside another function, train_test_step_fns, which will then return these two functions.
Define train_step
Within the strategy's scope, define train_step(inputs):
inputs will be a tuple containing (images, labels).
Create a gradient tape block. Within the gradient tape block:
Call the model, passing in the images and setting training to True (complete this part).
Call the compute_loss function (defined earlier) to compute the training loss (complete this part).
Use the gradient tape to calculate the gradients.
Use the optimizer to update the weights using the gradients.
A possible completion is sketched after this list.
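A hedged sketch of those steps (here train_step closes over model, optimizer, compute_loss, and train_accuracy; in the notebook they are passed into train_test_step_fns instead):

```python
def train_step(inputs):
    images, labels = inputs
    with tf.GradientTape() as tape:
        predictions = model(images, training=True)   # forward pass in training mode
        loss = compute_loss(labels, predictions)     # scaled by the global batch size
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    train_accuracy.update_state(labels, predictions)
    return loss
```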
Define test_step
Also within the strategy's scope, define test_step(inputs):
inputs is a tuple containing (images, labels).
Call the model, passing in the images and setting training to False, because the model is not going to train on the test data (complete this part).
Use the loss_object, which will compute the test loss. Check compute_loss, defined earlier, to see what parameters to pass into loss_object (complete this part).
Next, update test_loss (the running test loss) with t_loss (the loss for the current batch).
Also update the test_accuracy.
A possible completion is sketched after this list.
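A sketch under the same assumptions as the train_step above:

```python
def test_step(inputs):
    images, labels = inputs
    predictions = model(images, training=False)  # inference mode
    t_loss = loss_object(labels, predictions)    # per-example test losses
    test_loss.update_state(t_loss)               # running mean of the test loss
    test_accuracy.update_state(labels, predictions)
```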
Use the train_test_step_fns function to produce the train_step and test_step functions.
Distributed training and testing (please complete this section)
The train_step and test_step could be used in non-distributed, regular model training. To apply them in a distributed way, you'll use strategy.run.
distributed_train_step
Call the run function of the strategy, passing in the train step function (which you defined earlier), as well as the arguments that go into the train step function.
The run function is defined like this: run(fn, args=()).
args will take in the dataset inputs.
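A minimal sketch (the tf.function decorator and the SUM reduction follow the standard tf.distribute pattern; see also the hint about args below):

```python
@tf.function
def distributed_train_step(dataset_inputs):
    per_replica_losses = strategy.run(train_step, args=(dataset_inputs,))
    # Sum the per-replica losses into a single scalar
    return strategy.reduce(tf.distribute.ReduceOp.SUM,
                           per_replica_losses, axis=None)
```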
distributed_test_step
Similar to training, the distributed test step will use the run function of your strategy, taking in the test step function as well as the dataset inputs that go into the test step function.
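Correspondingly, a sketch:

```python
@tf.function
def distributed_test_step(dataset_inputs):
    # No return value needed: test_step updates the metrics itself
    return strategy.run(test_step, args=(dataset_inputs,))
```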
Hint:
You saw earlier that each batch in train_dist_dataset is a tuple with two values:
a batch of features
a batch of labels
Let's think about how you'll want to pass the dataset inputs into args by running this next cell of code:
Notice that how list_of_inputs is passed to args affects whether fun1 sees one or two positional arguments.
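The cell itself isn't reproduced in this extract; a sketch of the idea it demonstrates (the bodies of fun1 and list_of_inputs here are assumptions):

```python
def fun1(x, y=None):
    # Reports how the inputs arrived
    if y is None:
        print('got ONE positional argument:', x)
    else:
        print('got TWO positional arguments:', x, y)

list_of_inputs = ['images', 'labels']

fun1(list_of_inputs)   # like args=(list_of_inputs,): one argument (the whole tuple)
fun1(*list_of_inputs)  # like args=(features, labels): two separate arguments
```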
If you see an error message about positional arguments when running the training code later, please come back and check how you're passing the inputs to run.
Please complete the following function.
Call the function that you just defined to get the distributed train step function and distributed test step function.
An important note before you continue:
The following sections will guide you through how to train your model and save it to a .zip file. These sections are not required for you to pass this assignment, but you are encouraged to continue anyway. If you're confident that no more work is needed in the previous sections, please submit now and then carry on.
After training your model, you can download it as a .zip file and upload it back to the platform to know how well it performed. However, training your model takes around 20 minutes within the Coursera environment. Because of this, there are two methods to train your model:
Method 1
If 20 minutes is too long for you, we recommend downloading this notebook (after submitting it for grading) and uploading it to Colab to finish the training in a GPU-enabled runtime. If you decide to do this, these are the steps to follow:
Save this notebook.
Click the jupyter logo on the upper left corner of the window. This will take you to the Jupyter workspace.
Select this notebook (C2W4_Assignment.ipynb) and click Shutdown.
Once the notebook is shut down, you can go ahead and download it.
Head over to Colab, select the upload tab, and upload your notebook.
Before running any cell, go into Runtime --> Change Runtime Type and make sure that GPU is enabled.
Run all of the cells in the notebook. After training, follow the rest of the instructions in the notebook to download your model.
Method 2
If you prefer to wait the 20 minutes and not leave Coursera, keep going through this notebook. Once you are done, follow these steps:
Click the jupyter logo on the upper left corner of the window. This will take you to the Jupyter filesystem.
In the filesystem you should see a file named mymodel.zip. Go ahead and download it.
Whichever method you choose, you should end up with a mymodel.zip file, which can be uploaded for evaluation after this assignment. Once again, this is optional, but we strongly encourage you to do it as it is a lot of fun.
With this out of the way, let's continue.
Run the distributed training in a loop
You'll now use a for-loop to go through the desired number of epochs and train the model in a distributed manner. In each epoch:
Loop through each batch of the distributed training set.
For each training batch, call distributed_train_step and get the loss.
After going through all training batches, calculate the training loss as the average of the batch losses.
Loop through each batch of the distributed test set.
For each test batch, run the distributed test step. The test loss and test accuracy are updated within the test step function.
Print the epoch number, training loss, training accuracy, test loss and test accuracy.
Reset the losses and accuracies before continuing to another epoch.
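The training-loop cell isn't reproduced in this extract; a sketch of the pattern described above (EPOCHS and the print template are assumptions):

```python
for epoch in range(EPOCHS):
    # Train: accumulate the per-batch losses returned by the distributed step
    total_loss = 0.0
    num_batches = 0
    for batch in train_dist_dataset:
        total_loss += distributed_train_step(batch)
        num_batches += 1
    train_loss = total_loss / num_batches  # average of the batch losses

    # Test: the metrics are updated inside test_step itself
    for batch in test_dist_dataset:
        distributed_test_step(batch)

    template = ('Epoch {}, Loss: {}, Accuracy: {}, '
                'Test Loss: {}, Test Accuracy: {}')
    print(template.format(epoch + 1, train_loss,
                          train_accuracy.result() * 100,
                          test_loss.result(),
                          test_accuracy.result() * 100))

    # Reset the running metrics before the next epoch
    test_loss.reset_states()
    train_accuracy.reset_states()
    test_accuracy.reset_states()
```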
Things to note in the example above:
We are iterating over the train_dist_dataset and test_dist_dataset using a for x in ... construct.
The scaled loss is the return value of distributed_train_step. This value is aggregated across replicas using the tf.distribute.Strategy.reduce call and then across batches by summing the return values of the tf.distribute.Strategy.reduce calls.
tf.keras.Metrics should be updated inside the train_step and test_step that get executed by tf.distribute.Strategy.experimental_run_v2 (the earlier name for strategy.run).
tf.distribute.Strategy.experimental_run_v2 returns results from each local replica in the strategy, and there are multiple ways to consume this result. You can do tf.distribute.Strategy.reduce to get an aggregated value. You can also do tf.distribute.Strategy.experimental_local_results to get the list of values contained in the result, one per local replica.
Save the Model for submission (Optional)
You'll export the trained model as a SavedModel. You'll then need to zip it so you can upload it to the testing infrastructure. We provide the code to help you with that here:
Step 1: Save the model as a SavedModel
This code will save your model as a SavedModel
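The given code presumably resembles this sketch (the directory name is an assumption; the notebook's actual code may differ):

```python
SAVED_MODEL_DIR = './saved_model'  # hypothetical export path
tf.saved_model.save(model, SAVED_MODEL_DIR)
```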
Step 2: Zip the SavedModel Directory into /mymodel.zip
This code will zip your saved model directory contents into a single file.
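One way to do that, assuming the SAVED_MODEL_DIR used above:

```python
import os
import zipfile

with zipfile.ZipFile('./mymodel.zip', 'w', zipfile.ZIP_DEFLATED) as zf:
    for root, _, files in os.walk(SAVED_MODEL_DIR):
        for name in files:
            full_path = os.path.join(root, name)
            # Store paths relative to the SavedModel directory
            zf.write(full_path, os.path.relpath(full_path, SAVED_MODEL_DIR))
```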
If you are on Colab, you can use the file browser pane on the left to find mymodel.zip. Right click on it and select 'Download'.
If the download fails because you aren't allowed to download multiple files from colab, check out the guidance here: https://ccm.net/faq/32938-google-chrome-allow-websites-to-perform-simultaneous-downloads
If you are in Coursera, follow the instructions previously provided.
It's a large file, so it might take some time to download.