Path: blob/main/transformers_doc/en/tensorflow/semantic_segmentation.ipynb
Image Segmentation
Image segmentation models separate an image into regions corresponding to different areas of interest. These models work by assigning a label to each pixel. There are several types of segmentation: semantic segmentation, instance segmentation, and panoptic segmentation.
In this guide, we will:
See the different types of segmentation.
Finetune SegFormer on the SceneParse150 dataset for semantic segmentation.
Use your fine-tuned model for inference.
Before you begin, make sure you have all the necessary libraries installed:
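The exact packages depend on which parts of the guide you follow; a typical install cell might look like the following (the package list is an assumption based on the libraries used below):

```python
# Install the libraries used in this guide (notebook-style cell)
!pip install -q transformers datasets evaluate torchvision
```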
We encourage you to log in to your Hugging Face account so you can upload and share your model with the community. When prompted, enter your token to log in:
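In a notebook, one way to do this is with notebook_login from huggingface_hub:

```python
from huggingface_hub import notebook_login

# Opens a widget that prompts for your Hugging Face access token
notebook_login()
```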
Types of Segmentation
Semantic segmentation assigns a label or class to every single pixel in an image. Let's take a look at the output of a semantic segmentation model. It will assign the same class to every instance of an object it comes across in an image; for example, all cats will be labeled as "cat" instead of "cat-1" and "cat-2". We can use the 🤗 Transformers image segmentation pipeline to quickly run inference with a semantic segmentation model. Let's take a look at the example image.

We will use nvidia/segformer-b1-finetuned-cityscapes-1024-1024.
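As a minimal sketch, the pipeline can be run like this; the image path is a placeholder for any street-scene photo you have locally:

```python
from PIL import Image
from transformers import pipeline

# Placeholder path: use any street-scene image you like
image = Image.open("street_scene.jpg")

semantic_segmentation = pipeline(
    "image-segmentation", model="nvidia/segformer-b1-finetuned-cityscapes-1024-1024"
)
results = semantic_segmentation(image)

# Each result contains a class "label" and a PIL "mask" covering that class
[result["label"] for result in results]
```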
The segmentation pipeline output includes a mask for every predicted class.
Taking a look at the mask for the car class, we can see every car is classified with the same mask.

In instance segmentation, the goal is not to classify every pixel, but to predict a mask for every instance of an object in a given image. It works very similarly to object detection, except that instead of a bounding box for every instance, there is a segmentation mask. We will use facebook/mask2former-swin-large-cityscapes-instance for this.
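A sketch of running the instance segmentation pipeline, reusing the image loaded in the semantic segmentation example above:

```python
from transformers import pipeline

instance_segmentation = pipeline(
    "image-segmentation", model="facebook/mask2former-swin-large-cityscapes-instance"
)
results = instance_segmentation(image)

# One entry per detected instance, each with a "score", "label", and "mask"
[result["label"] for result in results]
```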
As you can see below, there are multiple cars classified, and there is no classification for pixels other than those belonging to car and person instances.
Let's check out one of the car masks below.

Panoptic segmentation combines semantic segmentation and instance segmentation, where every pixel is classified into a class and an instance of that class, and there can be multiple masks for each class, one per instance. We can use facebook/mask2former-swin-large-cityscapes-panoptic for this.
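Again as a sketch, with the same image as before:

```python
from transformers import pipeline

panoptic_segmentation = pipeline(
    "image-segmentation", model="facebook/mask2former-swin-large-cityscapes-panoptic"
)
results = panoptic_segmentation(image)

# Every pixel belongs to exactly one of these masks
[result["label"] for result in results]
```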
As you can see below, we have more classes. We will later illustrate that every pixel is classified into one of the classes.
Let's have a side-by-side comparison of all types of segmentation.

Now that we have seen all types of segmentation, let's take a deep dive into fine-tuning a model for semantic segmentation.
Common real-world applications of semantic segmentation include training self-driving cars to identify pedestrians and important traffic information, identifying cells and abnormalities in medical imagery, and monitoring environmental changes from satellite imagery.
Fine-tuning a Model for Segmentation
We will now:
Finetune SegFormer on the SceneParse150 dataset.
Use your fine-tuned model for inference.
To see all architectures and checkpoints compatible with this task, we recommend checking the task page.
Load SceneParse150 dataset
Start by loading a smaller subset of the SceneParse150 dataset from the 🤗 Datasets library. This'll give you a chance to experiment and make sure everything works before spending more time training on the full dataset.
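A sketch of loading that subset (recent 🤗 Datasets releases may additionally require trust_remote_code=True for this script-based dataset):

```python
from datasets import load_dataset

# Load only the first 50 training examples to experiment quickly
ds = load_dataset("scene_parse_150", split="train[:50]")
```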
Split the dataset's train split into a train and test set with the train_test_split method:
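For example, holding out 20% of the images for testing:

```python
ds = ds.train_test_split(test_size=0.2)
train_ds = ds["train"]
test_ds = ds["test"]
```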
Then take a look at an example:
image: a PIL image of the scene.
annotation: a PIL image of the segmentation map, which is also the model's target.
scene_category: a category id that describes the image scene like "kitchen" or "office".
In this guide, you'll only need image and annotation, both of which are PIL images.
You'll also want to create a dictionary that maps a label id to a label class, which will be useful when you set up the model later. Download the mappings from the Hub and create the id2label and label2id dictionaries:
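One way to do this is to pull the ADE20K mapping from the huggingface/label-files dataset repo (the repo and file names here follow the label-files convention and are an assumption of this sketch):

```python
import json

from huggingface_hub import hf_hub_download

repo_id = "huggingface/label-files"
filename = "ade20k-id2label.json"

# id2label maps class ids to class names; invert it for label2id
id2label = json.load(open(hf_hub_download(repo_id, filename, repo_type="dataset"), "r"))
id2label = {int(k): v for k, v in id2label.items()}
label2id = {v: k for k, v in id2label.items()}
num_labels = len(id2label)
```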
Custom dataset
You could also create and use your own dataset if you prefer to train with the run_semantic_segmentation.py script instead of a notebook instance. The script requires:
a DatasetDict with two Image columns, "image" and "label"
an id2label dictionary mapping the class integers to their class names
As an example, take a look at this dataset, which was created with the steps shown above.
Preprocess
The next step is to load a SegFormer image processor to prepare the images and annotations for the model. Some datasets, like this one, use the zero-index as the background class. However, the background class isn't actually included in the 150 classes, so you'll need to set do_reduce_labels=True to subtract one from all the labels. The zero-index is replaced by 255 so it's ignored by SegFormer's loss function:
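A sketch of loading the image processor, assuming the nvidia/mit-b0 SegFormer backbone that is fine-tuned later in this guide:

```python
from transformers import AutoImageProcessor

checkpoint = "nvidia/mit-b0"
# do_reduce_labels shifts all labels down by one and maps the background class to 255
image_processor = AutoImageProcessor.from_pretrained(checkpoint, do_reduce_labels=True)
```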
It is common to apply some data augmentations to an image dataset to make a model more robust against overfitting. In this guide, you'll use the ColorJitter function from torchvision to randomly change the color properties of an image, but you can also use any image library you like.
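For instance:

```python
from torchvision.transforms import ColorJitter

# Randomly perturb brightness, contrast, saturation, and hue of the training images
jitter = ColorJitter(brightness=0.25, contrast=0.25, saturation=0.25, hue=0.1)
```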
Now create two preprocessing functions to prepare the images and annotations for the model. These functions convert the images into pixel_values and the annotations into labels. For the training set, jitter is applied before providing the images to the image processor. For the test set, the image processor crops and normalizes the images, and only crops the labels because no data augmentation is applied during testing.
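A minimal sketch of the two functions, assuming the jitter and image_processor objects created above:

```python
def train_transforms(example_batch):
    # Color jitter is applied to the images only, never to the segmentation maps
    images = [jitter(x) for x in example_batch["image"]]
    labels = [x for x in example_batch["annotation"]]
    return image_processor(images, labels)


def val_transforms(example_batch):
    images = [x for x in example_batch["image"]]
    labels = [x for x in example_batch["annotation"]]
    return image_processor(images, labels)
```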
To apply the jitter over the entire dataset, use the 🤗 Datasets set_transform function. The transform is applied on the fly, which is faster and consumes less disk space:
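For example:

```python
train_ds.set_transform(train_transforms)
test_ds.set_transform(val_transforms)
```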
Evaluate
Including a metric during training is often helpful for evaluating your model's performance. You can quickly load an evaluation method with the 🤗 Evaluate library. For this task, load the mean Intersection over Union (IoU) metric (see the 🤗 Evaluate quick tour to learn more about how to load and compute a metric):
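Below is a sketch of loading the metric and wrapping it in a compute_metrics function for the Trainer; it assumes the num_labels value computed from the label mapping earlier and the 255 ignore index set by the image processor:

```python
import evaluate
import numpy as np
import torch
from torch import nn

metric = evaluate.load("mean_iou")


def compute_metrics(eval_pred):
    with torch.no_grad():
        logits, labels = eval_pred
        logits_tensor = torch.from_numpy(logits)
        # Upscale the logits to the label resolution before taking the argmax
        logits_tensor = nn.functional.interpolate(
            logits_tensor,
            size=labels.shape[-2:],
            mode="bilinear",
            align_corners=False,
        ).argmax(dim=1)
        pred_labels = logits_tensor.detach().cpu().numpy()

        metrics = metric.compute(
            predictions=pred_labels,
            references=labels,
            num_labels=num_labels,
            ignore_index=255,
            reduce_labels=False,
        )
        # Convert per-category arrays to lists so they can be logged
        for key, value in metrics.items():
            if isinstance(value, np.ndarray):
                metrics[key] = value.tolist()
        return metrics
```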
Your compute_metrics function is ready to go now, and you'll return to it when you set up your training.
Train
If you aren't familiar with finetuning a model with the Trainer, take a look at the basic tutorial here!
You're ready to start training your model now! Load SegFormer with AutoModelForSemanticSegmentation, and pass the model the mapping between label ids and label classes:
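For example, with the checkpoint and mappings defined earlier:

```python
from transformers import AutoModelForSemanticSegmentation

model = AutoModelForSemanticSegmentation.from_pretrained(
    checkpoint, id2label=id2label, label2id=label2id
)
```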
At this point, only three steps remain:
Define your training hyperparameters in TrainingArguments. It is important you don't remove unused columns because this'll drop the image column. Without the image column, you can't create pixel_values. Set remove_unused_columns=False to prevent this behavior! The only other required parameter is output_dir, which specifies where to save your model. You'll push this model to the Hub by setting push_to_hub=True (you need to be signed in to Hugging Face to upload your model). At the end of each epoch, the Trainer will evaluate the IoU metric and save the training checkpoint.
Pass the training arguments to Trainer along with the model, dataset, tokenizer, data collator, and compute_metrics function.
Call train() to finetune your model. All three steps are sketched below.
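A hedged sketch of the three steps; the hyperparameters are illustrative rather than tuned, and older transformers releases call eval_strategy evaluation_strategy:

```python
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="segformer-b0-scene-parse-150",
    learning_rate=6e-5,
    num_train_epochs=50,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    save_total_limit=3,
    eval_strategy="epoch",  # evaluate the IoU metric at the end of each epoch
    save_strategy="epoch",
    remove_unused_columns=False,  # keep the image column so pixel_values can be created
    push_to_hub=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=test_ds,
    compute_metrics=compute_metrics,
)

trainer.train()
```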
Once training is completed, share your model to the Hub with the push_to_hub() method so everyone can use your model:
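For example:

```python
trainer.push_to_hub()
```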
Inference
Great, now that you've finetuned a model, you can use it for inference!
Reload the dataset and load an image for inference.
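A sketch of reloading the same subset and grabbing a test image:

```python
from datasets import load_dataset

ds = load_dataset("scene_parse_150", split="train[:50]")
ds = ds.train_test_split(test_size=0.2)
image = ds["test"][0]["image"]
image
```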

We will now see how to infer without a pipeline. Process the image with an image processor and place the pixel_values on a GPU:
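A sketch, where "my-segformer-scene-parse-150" is a placeholder for your fine-tuned checkpoint on the Hub:

```python
import torch
from transformers import AutoImageProcessor

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Placeholder checkpoint name: replace with your own fine-tuned model
image_processor = AutoImageProcessor.from_pretrained("my-segformer-scene-parse-150")
encoding = image_processor(image, return_tensors="pt")
pixel_values = encoding.pixel_values.to(device)
```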
Pass your input to the model and return the logits:
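Continuing the sketch with the same placeholder checkpoint:

```python
from transformers import AutoModelForSemanticSegmentation

model = AutoModelForSemanticSegmentation.from_pretrained("my-segformer-scene-parse-150").to(device)

with torch.no_grad():
    outputs = model(pixel_values=pixel_values)

# SegFormer predicts logits at 1/4 of the input resolution
logits = outputs.logits.cpu()
```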
Next, rescale the logits to the original image size:
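One way to do that with torch.nn.functional.interpolate, noting that PIL's image.size is (width, height) while interpolate expects (height, width):

```python
from torch import nn

upsampled_logits = nn.functional.interpolate(
    logits,
    size=image.size[::-1],  # (height, width)
    mode="bilinear",
    align_corners=False,
)
# Take the most likely class per pixel
pred_seg = upsampled_logits.argmax(dim=1)[0]
```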
To visualize the results, load the dataset color palette as ade_palette(), which maps each class to its RGB values.
Then you can combine and plot your image and the predicted segmentation map:
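A sketch of the overlay, assuming ade_palette() returns a list of [R, G, B] triplets indexed by class id:

```python
import matplotlib.pyplot as plt
import numpy as np

# Color every pixel according to its predicted class
palette = np.array(ade_palette())
color_seg = np.zeros((pred_seg.shape[0], pred_seg.shape[1], 3), dtype=np.uint8)
for label, color in enumerate(palette):
    color_seg[pred_seg == label, :] = color

# Blend the original image with the colored segmentation map
img = np.array(image) * 0.5 + color_seg * 0.5
img = img.astype(np.uint8)

plt.figure(figsize=(15, 10))
plt.imshow(img)
plt.show()
```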
