Image segmentation
Image segmentation is one of the key problems in the field of computer vision. It is one of the steps that lead to automatic image understanding. We present how it can be solved using convolutional neural networks. The section ends with a detailed description of the network called U-Net. This architecture represents the state of the art on biomedical images and also inspired our research.
Computer Vision Tasks
Image segmentation is not an isolated task in the field of computer vision. It is one of several tasks that lead to automatic image understanding. Let us describe some of the basic ones.
The most basic one is image classification, which can be described as assigning an image to a class; in this context, classes distinguish images according to their content. The goal of object detection is to recognise whether a given object is present in the picture. Object localisation provides additional information about the spatial location of the recognised object; the location can be defined by a centroid or a bounding box.
Image segmentation is another branch of computer vision tasks. Its aim is to recognise all the given classes in the image at the pixel level, so it can be described as pixel-wise classification. A segmentation mask defines all the pixels of one class. The combination of object detection and image segmentation is called instance segmentation: it distinguishes individual objects of the same class and provides a segmentation mask for each of them. To further refine the result, hierarchical segmentation can be used; it analyses the input image at several scales in parallel.
Let us define the term segmentation. Let X be the set of all pixels of the input image and ℙ = {1, 2, …, k}, k ∈ ℕ, be the set of recognised classes. A segmentation is a mapping s : X → ℙ that assigns to each pixel exactly one value from the set ℙ. The output of the segmentation can be a labelled image where each pixel has a value according to the class it represents. In the case of binary segmentation, the output is a binary mask.
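As a minimal illustration of this definition, the following Python/NumPy sketch (with hypothetical array names, not taken from the original work) turns a per-pixel score volume into a labelled image and derives a binary mask for one class:

```python
import numpy as np

# A minimal sketch of segmentation as pixel-wise classification.
# `scores` is a hypothetical (k, H, W) array of per-class scores
# produced by some model; k is the number of recognised classes.
scores = np.random.rand(3, 4, 5)            # k = 3 classes, 4x5 image

# Labelled image: each pixel gets exactly one value from P = {1, ..., k}.
labels = scores.argmax(axis=0) + 1          # shape (4, 5), values in 1..3

# Binary segmentation: a mask marking all pixels of a single class.
mask = (labels == 2)                        # True where class 2 was assigned
print(labels)
print(mask.astype(np.uint8))
```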
Image segmentation is a useful tool in various fields of study, including medical research. In this field, the precision of the methods used is essential and human resources are very expensive. Modern medical imaging techniques lead to a growing amount of captured image data, and automation can help to process it more effectively. Examples of image segmentation tasks in the biomedical field are mitosis detection or the tracking of neuronal processes in electron microscopy (EM) images.
CNN Architectures for Image Segmentation
There is no direct way to use a CNN for image segmentation; the typical use of CNNs is image classification. There are several approaches to modifying the network so that it is also suitable for segmentation. Among the successful ones are the sliding window, the fully convolutional network and SegNet.
One of the first methods using CNNs for segmentation was the sliding window. Each pixel is classified by evaluating a small cropped area that surrounds it. It is not easy to choose the size of the cropped area correctly: there is a trade-off between the computational requirements and the amount of contextual information. If the cropped area is too large, the computation takes too much time; if it is too small, it does not contain enough information. Compared to current methods, the sliding window is computationally expensive and inefficient.
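A minimal sketch of the idea follows, assuming Python/NumPy and a hypothetical `classify_patch` function standing in for the patch classifier; the nested loop makes the inefficiency visible, since one classifier evaluation is needed per pixel and neighbouring patches overlap heavily:

```python
import numpy as np

def sliding_window_segmentation(image, classify_patch, window=32):
    """Classify every pixel from a small window centred on it.
    `classify_patch` is a hypothetical classifier returning a class label;
    reflective padding keeps windows valid near the image border."""
    half = window // 2
    padded = np.pad(image, half, mode="reflect")
    labels = np.zeros(image.shape, dtype=np.int64)
    for y in range(image.shape[0]):
        for x in range(image.shape[1]):
            patch = padded[y:y + window, x:x + window]
            labels[y, x] = classify_patch(patch)   # one evaluation per pixel
    return labels
```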
The fully convolutional network (FCN) introduces an end-to-end convolutional network architecture for segmentation. There are no pooling layers in this architecture, so the network output has the same dimensions as the input, and the model produces the labelled image directly. The last layer is a 1×1 convolution that assigns to each pixel a vector of class scores for one-hot classification.
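The role of the final 1×1 convolution can be sketched as follows; PyTorch is used here only as an assumed framework (the original work does not prescribe one), and the feature map sizes are illustrative:

```python
import torch
import torch.nn as nn

# Sketch of the classification head: a 1x1 convolution maps the final
# feature maps to one score per class for every pixel.
num_classes = 2
features = torch.randn(1, 64, 128, 128)       # hypothetical feature maps
head = nn.Conv2d(64, num_classes, kernel_size=1)

scores = head(features)                       # (1, num_classes, 128, 128)
labels = scores.argmax(dim=1)                 # pixel-wise classification
print(labels.shape)                           # torch.Size([1, 128, 128])
```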
End-to-end learning is a successful concept for building segmentation networks. The advantage of this approach is the simplicity of the model: one method replaces the entire segmentation pipeline. More computation steps lead to a higher probability of error, which then propagates and amplifies. On the other hand, an end-to-end architecture is suitable only if there is enough data to train it properly.
The team of V. Badrinarayanan introduced the SegNet architecture to reduce the computational cost of the FCN and to improve the network's ability to generalise. It belongs to the class of end-to-end architectures. It uses an up-sampling layer, which provides the reverse operation to pooling: it has no learnable variables and increases the feature map dimensions. The SegNet topology resembles that of an autoencoder.
SegNet consists of two symmetric parts, an encoder and a decoder. The encoder uses a standard CNN layer structure: convolutional layers followed by pooling layers. In the decoder, convolutional layers are interlaced with up-sampling layers that increase the feature map dimensions up to the size of the original image.
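The following toy sketch shows the encoder/decoder symmetry and the parameter-free up-sampling driven by pooling indices; it is not the published SegNet (which is much deeper), and PyTorch is again an assumed framework:

```python
import torch
import torch.nn as nn

class TinySegNet(nn.Module):
    """A minimal SegNet-like sketch: a symmetric encoder/decoder where
    up-sampling reverses pooling and has no learnable parameters."""
    def __init__(self, in_ch=1, num_classes=2):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(in_ch, 16, 3, padding=1), nn.ReLU())
        self.pool = nn.MaxPool2d(2, return_indices=True)   # keep pooling indices
        self.unpool = nn.MaxUnpool2d(2)                    # reverse of pooling
        self.dec = nn.Sequential(nn.Conv2d(16, 16, 3, padding=1), nn.ReLU())
        self.classifier = nn.Conv2d(16, num_classes, 1)

    def forward(self, x):
        x = self.enc(x)
        x, idx = self.pool(x)          # encoder: convolution + pooling
        x = self.unpool(x, idx)        # decoder: up-sample to the pooled-from size
        x = self.dec(x)                # decoder: convolution after up-sampling
        return self.classifier(x)      # per-pixel class scores

out = TinySegNet()(torch.randn(1, 1, 64, 64))
print(out.shape)                       # torch.Size([1, 2, 64, 64])
```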
U-Net
The U-Net network was developed in 2015 by Olaf Ronneberger, Philipp Fischer and Thomas Brox at the University of Freiburg. The authors designed the network for semantic segmentation of biomedical images. At that time, it was ranked the best image segmentation method on the most challenging datasets of the ISBI Cell Tracking Challenge. It performs well in other ISBI segmentation challenges, too.
The primary motivation of Olaf Ronneberger's research group was to improve the sliding window approach by making it simpler and more effective. They developed a robust network that involves the whole context of the image in the computation and directly produces the image segmentation masks.
The network has the SegNet architecture with several improvements. Figure 1 shows the architecture. In the encoder part, there are four down-sampling steps. At each step, there are two convolutional layers with a 3×3 kernel and ReLU activation, followed by a 2×2 max-pooling layer. The number of feature maps in the first layer is 64, and it doubles after every down-sampling step. The decoder topology is symmetrical to the encoder.
There are four up-sampling steps, each containing two convolutional layers. After every up-sampling step, the number of feature maps is halved. There are also skip connections between the corresponding encoder and decoder steps. The output layer uses a 1×1 convolution, which for every pixel reduces the information from all the feature maps into a two-element vector.
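The sketch below condenses the topology just described into code: it keeps the doubling/halving of feature maps, the skip connections and the 1×1 output convolution, but uses only two down-sampling steps, 16 base feature maps and padded convolutions so that sizes match (the published network uses four steps, 64 base feature maps and unpadded convolutions with cropping). PyTorch is an assumption, not the authors' original implementation:

```python
import torch
import torch.nn as nn

def double_conv(in_ch, out_ch):
    """Two 3x3 convolutions with ReLU, as used at every U-Net step."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
    )

class TinyUNet(nn.Module):
    """A reduced sketch of the U-Net topology."""
    def __init__(self, in_ch=1, num_classes=2, base=16):
        super().__init__()
        self.enc1 = double_conv(in_ch, base)           # base feature maps
        self.enc2 = double_conv(base, base * 2)        # doubled after pooling
        self.bottom = double_conv(base * 2, base * 4)
        self.pool = nn.MaxPool2d(2)                    # 2x2 max-pooling
        self.up2 = nn.ConvTranspose2d(base * 4, base * 2, 2, stride=2)
        self.dec2 = double_conv(base * 4, base * 2)    # halved after up-sampling
        self.up1 = nn.ConvTranspose2d(base * 2, base, 2, stride=2)
        self.dec1 = double_conv(base * 2, base)
        self.out = nn.Conv2d(base, num_classes, 1)     # 1x1 output convolution

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        b = self.bottom(self.pool(e2))
        d2 = self.dec2(torch.cat([self.up2(b), e2], dim=1))   # skip connection
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))  # skip connection
        return self.out(d1)                                   # per-pixel scores

print(TinyUNet()(torch.randn(1, 1, 64, 64)).shape)  # torch.Size([1, 2, 64, 64])
```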
It is possible to train the U-Net even on datasets with only a few available training samples. In the original article, the authors describe augmentation methods used to increase the dataset size. New images are generated by shifts and rotations. They also used elastic deformations of the original samples to simulate the shape variability of biomedical objects. The deformation is made by random displacement vectors on a 3×3 grid, sampled from a Gaussian distribution with a standard deviation of ten pixels. They also mention the problem of grey value variations of the input images.
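One possible reading of that augmentation is sketched below with NumPy/SciPy; the function name `elastic_deform` and the interpolation choices are assumptions rather than the authors' exact procedure. Coarse 3×3 displacements are drawn from a Gaussian with a ten-pixel standard deviation, interpolated to a dense per-pixel field and used to resample the image:

```python
import numpy as np
from scipy.ndimage import map_coordinates

def elastic_deform(image, grid=3, sigma=10, seed=None):
    """Sketch of elastic deformation: random displacement vectors on a
    coarse grid, smoothly interpolated to a per-pixel displacement field
    that is used to resample the input image."""
    rng = np.random.default_rng(seed)
    h, w = image.shape
    coarse = rng.normal(0.0, sigma, size=(2, grid, grid))   # coarse field
    # Interpolate the coarse displacements to one vector per pixel.
    gy, gx = np.meshgrid(np.linspace(0, grid - 1, h),
                         np.linspace(0, grid - 1, w), indexing="ij")
    dy = map_coordinates(coarse[0], [gy, gx], order=3)
    dx = map_coordinates(coarse[1], [gy, gx], order=3)
    # Resample the image at the displaced coordinates (bilinear here).
    py, px = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    return map_coordinates(image, [py + dy, px + dx], order=1, mode="reflect")
```

The same displacement field has to be applied to the ground-truth mask, typically with nearest-neighbour interpolation (order=0) so that the labels stay discrete.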
U-Net is successful in the segmentation of various biomedical images, and it still occupies the top places in the ranking of the Cell Tracking Challenge. The method is easy to understand. In recent years, several other approaches inspired by the U-Net topology have been published. For these reasons, we decided to base our research on this network.