Image Segmentation with U-Net
Welcome to the final assignment of Week 3! You'll be building your own U-Net, a type of CNN designed for quick, precise image segmentation, and using it to predict a label for every single pixel in an image - in this case, an image from a self-driving car dataset.
This type of image classification is called semantic image segmentation. It's similar to object detection in that both ask the question "What objects are in this image, and where are they located?" But where object detection labels objects with bounding boxes that may include pixels that aren't part of the object, semantic image segmentation predicts a precise mask for each object by labeling every pixel in the image with its corresponding class. The word "semantic" here refers to what's being shown; for example, the "Car" class is indicated below by the dark blue mask, and "Person" is indicated with a red mask:

As you might imagine, region-specific labeling is a pretty crucial consideration for self-driving cars, which require a pixel-perfect understanding of their environment so they can change lanes and avoid other cars, or any number of traffic obstacles that can put peoples' lives in danger.
By the time you finish this notebook, you'll be able to:
Build your own U-Net
Explain the difference between a regular CNN and a U-Net
Implement semantic image segmentation on the CARLA self-driving car dataset
Apply sparse categorical crossentropy for pixelwise prediction
Onward, to this grand and glorious quest!
Table of Contents
Check out some of the unmasked and masked images from the dataset:
3 - U-Net
U-Net, named for its U-shape, was originally created in 2015 for tumor detection, but in the years since has become a very popular choice for other semantic segmentation tasks.
U-Net builds on a previous architecture called the Fully Convolutional Network, or FCN, which replaces the dense layers found in a typical CNN with a transposed convolution layer that upsamples the feature map back to the size of the original input image, while preserving the spatial information. This is necessary because the dense layers destroy spatial information (the "where" of the image), which is an essential part of image segmentation tasks. An added bonus of using transpose convolutions is that the input size no longer needs to be fixed, as it does when dense layers are used.
Unfortunately, the final feature layer of the FCN suffers from information loss due to downsampling too much. It then becomes difficult to upsample after so much information has been lost, causing an output that looks rough.
U-Net improves on the FCN, using a somewhat similar design, but differing in some important ways. Instead of one transposed convolution at the end of the network, it uses a matching number of convolutions for downsampling the input image to a feature map, and transposed convolutions for upsampling those maps back up to the original input image size. It also adds skip connections, to retain information that would otherwise become lost during encoding. Skip connections send information to every upsampling layer in the decoder from the corresponding downsampling layer in the encoder, capturing finer information while also keeping computation low. These help prevent information loss, as well as model overfitting.
3.1 - Model Details

Contracting path (Encoder containing downsampling steps):
Images are first fed through several convolutional layers which reduce height and width, while growing the number of channels.
The contracting path follows a regular CNN architecture, with convolutional layers, their activations, and pooling layers to downsample the image and extract its features. In detail, it consists of the repeated application of two 3 x 3 unpadded convolutions, each followed by a rectified linear unit (ReLU) and a 2 x 2 max pooling operation with stride 2 for downsampling. At each downsampling step, the number of feature channels is doubled.
Crop function: This step crops the image from the contracting path and concatenates it to the current image on the expanding path to create a skip connection.
Expanding path (Decoder containing upsampling steps):
The expanding path performs the opposite operation of the contracting path, growing the image back to its original size, while shrinking the channels gradually.
In detail, each step in the expanding path upsamples the feature map, followed by a 2 x 2 convolution (the transposed convolution). This transposed convolution halves the number of feature channels, while growing the height and width of the image.
Next is a concatenation with the correspondingly cropped feature map from the contracting path, and two 3 x 3 convolutions, each followed by a ReLU. You need to perform cropping to handle the loss of border pixels in every convolution.
Final Feature Mapping Block: In the final layer, a 1x1 convolution is used to map each 64-component feature vector to the desired number of classes. The channel dimension from the previous layer corresponds to the number of filters used, so when you use 1x1 convolutions, you can transform that dimension by choosing an appropriate number of 1x1 filters. When this idea is applied to the last layer, you can reduce the channel dimension to have one channel per class.
The U-Net network has 23 convolutional layers in total.
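To make the final mapping concrete, here is a tiny sketch of a 1x1 convolution reducing 64 feature channels to one channel per class, assuming TensorFlow/Keras and illustrative shapes (the 23-class setup used later in this assignment):

```python
import tensorflow as tf
from tensorflow.keras.layers import Conv2D

# A (batch, height, width, channels) feature map with 64 channels
features = tf.zeros((1, 96, 128, 64))

# 1x1 convolution maps the 64-component feature vectors to 23 class channels
logits = Conv2D(23, kernel_size=1)(features)
print(logits.shape)  # (1, 96, 128, 23): one channel per class
```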
3.2 - Encoder (Downsampling Block)

The encoder is a stack of various conv_blocks:
Each conv_block() is composed of 2 Conv2D layers with ReLU activations. We will apply Dropout and MaxPooling2D to some conv_blocks, as you will verify in the following sections, specifically to the last two blocks of the downsampling.
The function will return two tensors:
- next_layer: That will go into the next block.
- skip_connection: That will go into the corresponding decoding block.
Note: If max_pooling=True, the next_layer will be the output of the MaxPooling2D layer, but the skip_connection will be the output of the previously applied layer (Conv2D or Dropout, depending on the case). Otherwise, both results will be identical.
Exercise 1 - conv_block
Implement conv_block(...). Here are the instructions for each step in the conv_block, or contracting block:
- Add 2 Conv2D layers with n_filters filters, with kernel_size set to 3, kernel_initializer set to 'he_normal', padding set to 'same', and 'relu' activation.
- If dropout_prob > 0, then add a Dropout layer with parameter dropout_prob.
- If max_pooling is set to True, then add a MaxPooling2D layer with a 2x2 pool size.
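As a reference, here is a minimal sketch of how such a contracting block could be written with the Keras functional API; the signature and variable names follow the instructions above but are assumptions, not the grader's solution:

```python
from tensorflow.keras.layers import Conv2D, Dropout, MaxPooling2D

def conv_block(inputs, n_filters=32, dropout_prob=0, max_pooling=True):
    # Two 3x3 convolutions with ReLU activation, 'same' padding, He normal init
    conv = Conv2D(n_filters, 3, activation='relu', padding='same',
                  kernel_initializer='he_normal')(inputs)
    conv = Conv2D(n_filters, 3, activation='relu', padding='same',
                  kernel_initializer='he_normal')(conv)
    # Optional dropout
    if dropout_prob > 0:
        conv = Dropout(dropout_prob)(conv)
    # Optional 2x2 max pooling for downsampling
    if max_pooling:
        next_layer = MaxPooling2D(pool_size=(2, 2))(conv)
    else:
        next_layer = conv
    skip_connection = conv  # pre-pooling output feeds the decoder
    return next_layer, skip_connection
```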
Block 1:
['InputLayer', [(None, 96, 128, 3)], 0]
['Conv2D', (None, 96, 128, 32), 896, 'same', 'relu', 'HeNormal']
['Conv2D', (None, 96, 128, 32), 9248, 'same', 'relu', 'HeNormal']
['MaxPooling2D', (None, 48, 64, 32), 0, (2, 2)]
All tests passed!
Block 2:
['InputLayer', [(None, 96, 128, 3)], 0]
['Conv2D', (None, 96, 128, 1024), 28672, 'same', 'relu', 'HeNormal']
['Conv2D', (None, 96, 128, 1024), 9438208, 'same', 'relu', 'HeNormal']
['Dropout', (None, 96, 128, 1024), 0, 0.1]
['MaxPooling2D', (None, 48, 64, 1024), 0, (2, 2)]
All tests passed!
3.3 - Decoder (Upsampling Block)
The decoder, or upsampling block, upsamples the features back to the original image size. At each upsampling level, you'll take the output of the corresponding encoder block and concatenate it before feeding to the next decoder block.

There are two new components in the decoder: up and merge. These are the transposed convolution and the skip connections. In addition, there are two more convolutional layers set to the same parameters as in the encoder.
Here you'll encounter the Conv2DTranspose layer, which performs the inverse of the Conv2D layer. You can read more about it in the TensorFlow documentation.
Exercise 2 - upsampling_block
Implement upsampling_block(...).
For the function upsampling_block:
- It takes the arguments expansive_input (which is the input tensor from the previous layer) and contractive_input (the input tensor from the previous skip layer). The number of filters here is the same as in the downsampling block you completed previously.
- Your Conv2DTranspose layer will take n_filters with shape (3,3) and a stride of (2,2), with padding set to 'same'. It's applied to expansive_input, or the input tensor from the previous layer.
This block is also where you'll concatenate the outputs from the encoder blocks, creating skip connections.
- Concatenate your Conv2DTranspose layer output to the contractive input, with an axis of 3. In general, you can concatenate the tensors in the order that you prefer, but for the grader, it is important that you use [up, contractive_input].
- For the final component, set the parameters for the two Conv2D layers to the same values that you set for the two Conv2D layers in the encoder (ReLU activation, He normal initializer, 'same' padding).
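Here is a minimal sketch of an upsampling block following these instructions, assuming the same Keras functional API conventions as the encoder sketch; names and defaults are illustrative:

```python
from tensorflow.keras.layers import Conv2D, Conv2DTranspose, concatenate

def upsampling_block(expansive_input, contractive_input, n_filters=32):
    # 3x3 transposed convolution with stride 2 doubles height and width
    up = Conv2DTranspose(n_filters, (3, 3), strides=(2, 2),
                         padding='same')(expansive_input)
    # Skip connection: concatenate along the channel axis (axis=3)
    merge = concatenate([up, contractive_input], axis=3)
    # Two 3x3 convolutions, matching the encoder's parameters
    conv = Conv2D(n_filters, 3, activation='relu', padding='same',
                  kernel_initializer='he_normal')(merge)
    conv = Conv2D(n_filters, 3, activation='relu', padding='same',
                  kernel_initializer='he_normal')(conv)
    return conv
```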
Block 1:
['InputLayer', [(None, 12, 16, 256)], 0]
['Conv2DTranspose', (None, 24, 32, 32), 73760]
['InputLayer', [(None, 24, 32, 128)], 0]
['Concatenate', (None, 24, 32, 160), 0]
['Conv2D', (None, 24, 32, 32), 46112, 'same', 'relu', 'HeNormal']
['Conv2D', (None, 24, 32, 32), 9248, 'same', 'relu', 'HeNormal']
All tests passed!
3.4 - Build the Model
This is where you'll put it all together, by chaining the encoder, bottleneck, and decoder! You'll need to specify the number of output channels, which for this particular set would be 23. That's because there are 23 possible labels for each pixel in this self-driving car dataset.
Exercise 3 - unet_model
For the function unet_model, specify the input shape, number of filters, and number of classes (23 in this case).
For the first half of the model:
- Begin with a conv block that takes the inputs of the model and the number of filters.
- Then, chain the first output element of each block to the input of the next convolutional block.
- Next, double the number of filters at each step.
- Beginning with conv_block4, add dropout of 0.3.
- For the final conv_block, set dropout to 0.3 again, and turn off max pooling. (Here n_filters = n_filters * 16.)
For the second half:
- From here, at each step, use half the number of filters of the previous block.
- Use cblock5 as expansive_input and cblock4 as contractive_input, with n_filters * 8. This is your bottleneck layer.
- Chain the output of the previous block as expansive_input and the corresponding contractive block output. Note that you must use the second element of the contractive block, i.e. the output before the max pooling layer.
- conv9 is a Conv2D layer with ReLU activation, He normal initializer, and 'same' padding.
- Finally, conv10 is a Conv2D that takes the number of classes as the filter, a kernel size of 1, and 'same' padding. The output of conv10 is the output of your model.
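Putting the pieces together, here is a hedged sketch of how the blocks could be chained, assuming the conv_block and upsampling_block signatures sketched earlier and an illustrative input size of (96, 128, 3):

```python
from tensorflow.keras.layers import Input, Conv2D
from tensorflow.keras import Model

def unet_model(input_size=(96, 128, 3), n_filters=32, n_classes=23):
    inputs = Input(input_size)

    # Contracting path: double the filters at each step
    cblock1 = conv_block(inputs, n_filters)
    cblock2 = conv_block(cblock1[0], n_filters * 2)
    cblock3 = conv_block(cblock2[0], n_filters * 4)
    cblock4 = conv_block(cblock3[0], n_filters * 8, dropout_prob=0.3)
    # Bottleneck: dropout again, no max pooling
    cblock5 = conv_block(cblock4[0], n_filters * 16, dropout_prob=0.3,
                         max_pooling=False)

    # Expanding path: halve the filters at each step, using the skip
    # connections (second element of each contracting block)
    ublock6 = upsampling_block(cblock5[0], cblock4[1], n_filters * 8)
    ublock7 = upsampling_block(ublock6, cblock3[1], n_filters * 4)
    ublock8 = upsampling_block(ublock7, cblock2[1], n_filters * 2)
    ublock9 = upsampling_block(ublock8, cblock1[1], n_filters)

    conv9 = Conv2D(n_filters, 3, activation='relu', padding='same',
                   kernel_initializer='he_normal')(ublock9)
    # 1x1 convolution maps the features to one channel per class
    conv10 = Conv2D(n_classes, 1, padding='same')(conv9)

    return Model(inputs=inputs, outputs=conv10)
```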
All tests passed!
Check out the model summary below!
3.6 - Loss Function
In semantic segmentation, you need as many masks as you have object classes. In the dataset you're using, each pixel in every mask has been assigned a single integer label, from 0 to num_classes-1, indicating which class it belongs to. The predicted class for each pixel is the output channel with the highest score.
This is different from categorical crossentropy, where the labels should be one-hot encoded (just 0s and 1s). Here, you'll use sparse categorical crossentropy as your loss function, to perform pixel-wise multiclass prediction. Sparse categorical crossentropy is more efficient than other loss functions when you're dealing with lots of classes.
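For example, compiling the model with this loss might look like the following, assuming the unet_model sketch above and a final layer that outputs logits (no softmax activation):

```python
import tensorflow as tf

# Sketch: compile for pixel-wise multiclass prediction with integer-labeled masks.
# from_logits=True because the final 1x1 convolution has no softmax activation.
unet = unet_model((96, 128, 3), n_filters=32, n_classes=23)
unet.compile(optimizer='adam',
             loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
             metrics=['accuracy'])
```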
With 40 epochs you get amazing results!
Conclusion
You've come to the end of this assignment. Awesome work creating a state-of-the art model for semantic image segmentation! This is a very important task for self-driving cars to get right. Elon Musk will surely be knocking down your door at any moment. 😉
What you should remember:
Semantic image segmentation predicts a label for every single pixel in an image
U-Net uses an equal number of convolutional blocks and transposed convolutions for downsampling and upsampling
Skip connections are used to prevent border pixel information loss and overfitting in U-Net