Face Detection
This section reviews the related work on face detection, which is the first step of the face recognition pipeline represented in Figure 1. To clarify what the face recognition pipeline is, consider how a person recognizes other people by sight. Since the human face is a very distinctive part of a person, faces are first detected one by one. Then their distinctive properties (e.g. eyes, nose, hair) are analyzed. Finally, these distinctive properties are compared with the faces the person has seen in the past. In face recognition, the distinctive property of a person is the face itself, the distinctive parts of the face are the face features, and comparing them with the features of previously seen faces corresponds to the classification step.
Traditional methods for face detection
Compared to CNN-based methods, traditional methods offer lower accuracy on face detection, but they are superior in terms of computational efficiency. That is why traditional methods are still preferred in some applications. This section reviews some of the most popular traditional face detection methods.
Viola-Jones
Viola-Jones [12] is the first object detection algorithm that is both very accurate and fast enough to run in real time. Similar to the kernels used in CNNs, the Viola-Jones algorithm uses kernels at its core. The kernels used in Viola-Jones have rectangular shapes, and each of them contains negative and positive areas. As seen in Figure 2, each feature has negative (represented as black) and positive (represented as white) areas. The weighted sum of pixel values inside the positive area is subtracted from the weighted sum of pixel values inside the negative area. The resulting value is then used as a feature that is fed to a classifier to decide whether the window contains a face.
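To make the feature computation concrete, the following is a minimal sketch of a two-rectangle Haar-like feature evaluated by brute force; the vertical split into positive and negative halves and the unit weights are illustrative choices, not the exact feature set of [12].

```python
import numpy as np

def haar_two_rect_feature(window: np.ndarray) -> float:
    """Two-rectangle Haar-like feature on a grayscale window.

    The window is split into a left (positive / white) half and a
    right (negative / black) half. The positive-area sum is subtracted
    from the negative-area sum, with unit weights for simplicity.
    """
    h, w = window.shape
    positive = window[:, : w // 2].sum()   # white area
    negative = window[:, w // 2 :].sum()   # black area
    return float(negative - positive)

# Example: a 24x24 window (the canonical Viola-Jones window size).
window = np.random.randint(0, 256, size=(24, 24)).astype(np.float64)
print(haar_two_rect_feature(window))
```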
Training is done using the AdaBoost [13] algorithm, which selects the most discriminative features as weak classifiers and combines them into strong classifiers (stages). Figure 2 shows the first and second features selected by AdaBoost; as clearly seen there, the selected features focus on very distinguishing areas of the human face. After these classifiers are found, sliding windows at varying scales can be used to detect which areas contain a face. With a naive sliding window, every pixel inside the feature areas has to be summed repeatedly; if such a feature window covers, say, 50 pixels, a total of 50 operations is required to calculate a single feature value.
The Viola-Jones method further improves efficiency by introducing the integral image. The integral image has the same number of pixels as the original image. Every pixel value inside the integral image at coordinate (x0, y0) is the sum of every pixel value inside the original image at coordinates (x1, y1) satisfying x1 < x0 and y1 < y0. Once the integral image is computed, it can be reused for every sliding window. Figure 3 shows how the sum of pixels is efficiently calculated using the integral image. Points p1, p2, p3 and p4 have the values A, A+B, A+C and A+B+C+D respectively, so the sum of pixel values inside D can be calculated with only four lookups: p4 + p1 − (p2 + p3).
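A minimal sketch of the integral image and the four-lookup rectangle sum is given below; it uses the inclusive cumulative-sum convention, which differs slightly from the strict-inequality definition above but yields the same rectangle sums.

```python
import numpy as np

def integral_image(img: np.ndarray) -> np.ndarray:
    """Cumulative sum over rows and columns: ii[y, x] holds the sum of
    all original pixels at or above-left of (x, y)."""
    return img.astype(np.int64).cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii: np.ndarray, top: int, left: int, bottom: int, right: int) -> int:
    """Sum of pixels in the rectangle [top, bottom) x [left, right)
    using only four lookups (p4 + p1 - (p2 + p3) in the notation above)."""
    p4 = ii[bottom - 1, right - 1]
    p1 = ii[top - 1, left - 1] if top > 0 and left > 0 else 0
    p2 = ii[top - 1, right - 1] if top > 0 else 0
    p3 = ii[bottom - 1, left - 1] if left > 0 else 0
    return int(p4 + p1 - (p2 + p3))

img = np.random.randint(0, 256, size=(128, 128), dtype=np.uint8)
ii = integral_image(img)
assert rect_sum(ii, 10, 20, 30, 40) == img[10:30, 20:40].sum()
```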
Furthermore, the Viola-Jones method uses cascaded classifiers to improve efficiency even more. Figure 4 is an abstraction of this cascade. Features are calculated in order, with classifiers arranged from least complex to most complex. If any stage rejects the window (predicts non-face), the remaining stages are skipped. Since the vast majority of windows do not contain a face, cascading dramatically improves efficiency without sacrificing accuracy. Note that cascaded classification can be combined with any feature descriptor to improve efficiency (e.g. MTCNN).
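For reference, a trained Viola-Jones-style cascade ships with OpenCV; a typical usage sketch looks as follows. The cascade file name and parameter values are common defaults rather than values taken from this section, and the input path is hypothetical.

```python
import cv2

# Load a pre-trained frontal-face cascade bundled with OpenCV.
cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
face_cascade = cv2.CascadeClassifier(cascade_path)

img = cv2.imread("input.jpg")                  # hypothetical input image
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Sliding window over multiple scales; early cascade stages
# cheaply reject most non-face windows.
faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
for (x, y, w, h) in faces:
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
cv2.imwrite("faces.jpg", img)
```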
Histogram of Oriented Gradients
Histogram of Oriented Gradients (HOG) [14] was originally designed for human detection, but it can also be used for face detection. The feature extraction method is unsupervised and can be used on both color and grayscale images. The underlying idea is to describe local gradients. First, the gradient is calculated for each pixel. Figure 2.7 shows a portion of an image, where each value is an 8-bit pixel value. Consider calculating the gradient at the pixel highlighted in yellow, whose value is 120.
The gradient on the x axis is the difference between the blue values, and the gradient on the y axis is the difference between the red values. After calculating the gradients on the x and y axes, the magnitude of the gradient is found with the Pythagorean theorem and its direction with the arctangent. Then the image is divided into 16×16 blocks with a stride of 8, so that neighbouring blocks overlap by one half. The window size used in the paper is 64×128, so there is a total of ((64−8)÷8)×((128−8)÷8) = 7×15 = 105 blocks of size 16×16. Each block contains four 8×8 cells.
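A minimal sketch of the per-pixel gradient computation and the block-count arithmetic above could look as follows; the gradients are simple neighbour differences, and a random array stands in for a 64×128 detection window.

```python
import numpy as np

def pixel_gradients(img: np.ndarray):
    """Per-pixel gradients from neighbour differences ([-1, 0, 1] kernel)."""
    img = img.astype(np.float64)
    gx = np.zeros_like(img)
    gy = np.zeros_like(img)
    gx[:, 1:-1] = img[:, 2:] - img[:, :-2]        # horizontal difference
    gy[1:-1, :] = img[2:, :] - img[:-2, :]        # vertical difference
    magnitude = np.hypot(gx, gy)                  # Pythagorean theorem
    angle = np.degrees(np.arctan2(gy, gx)) % 180  # unsigned orientation, 0-180
    return magnitude, angle

img = np.random.randint(0, 256, size=(128, 64), dtype=np.uint8)  # 64x128 window
mag, ang = pixel_gradients(img)

# Block layout from the text: 16x16 blocks with stride 8 on a 64x128 window.
blocks_x = (64 - 16) // 8 + 1    # 7
blocks_y = (128 - 16) // 8 + 1   # 15
print(blocks_x * blocks_y)       # 105
```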
A histogram of gradients is calculated for each of those cells using 9 bins. The bin values are equally spaced between 0 and 180 degrees (unsigned gradients). Gradient magnitudes are distributed between the two nearest bins, weighted by distance. Suppose a gradient has magnitude 50 and direction 30 degrees: 25 is assigned to the bin with value 20, and the remaining 25 is assigned to the bin with value 40. The histogram values are then normalized within each 16×16 block using a clipped L2 norm, which makes the extracted features robust to poor contrast or illumination. After normalization, each block has 9 (bins) × 4 (cells) = 36 histogram values. As there are 105 blocks, the total size of the feature vector is 105×36 = 3780.
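The bin interpolation described above can be sketched as follows; the loop-based implementation is written for clarity rather than speed, and the worked example reproduces the magnitude-50, 30-degree case from the text.

```python
import numpy as np

def cell_histogram(mag: np.ndarray, ang: np.ndarray, n_bins: int = 9) -> np.ndarray:
    """Orientation histogram of one cell with linear interpolation between
    the two nearest bins (unsigned gradients, bin centres at 0, 20, ..., 160)."""
    bin_width = 180.0 / n_bins
    hist = np.zeros(n_bins)
    for m, a in zip(mag.ravel(), ang.ravel()):
        pos = a / bin_width                 # fractional bin position
        lo = int(np.floor(pos)) % n_bins
        hi = (lo + 1) % n_bins              # wraps around at 180 degrees
        frac = pos - np.floor(pos)
        hist[lo] += m * (1.0 - frac)        # share proportional to distance
        hist[hi] += m * frac
    return hist

# Worked example from the text: magnitude 50 at 30 degrees.
print(cell_histogram(np.array([[50.0]]), np.array([[30.0]])))  # 25 in bin 20, 25 in bin 40
```

In practice the full descriptor, including the clipped L2 ("L2-Hys") block normalization, is also available off the shelf, for instance in `skimage.feature.hog`.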
The HOG method is not scale invariant, meaning it would not work if the objects in the test images are scaled differently than in the training images, so a simple sliding-window detector is not enough on its own. The solution is to use image pyramids [15]. As shown in Figure 2.8(a), multiple resolutions of the same image are used while the classifier window size in pixels is kept fixed. Furthermore, classifier pyramids can be used to reduce the number of image pyramid levels for faster detection, as shown in Figure 2.8(c).
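A minimal image-pyramid sketch, assuming a fixed 64×128 detection window and an illustrative scale factor of 1.25, is shown below; a placeholder array stands in for a real photograph.

```python
import cv2
import numpy as np

def image_pyramid(img, scale=1.25, min_size=(64, 128)):
    """Yield progressively downscaled copies of the image so that a
    fixed-size 64x128 detection window can match larger objects."""
    while img.shape[1] >= min_size[0] and img.shape[0] >= min_size[1]:
        yield img
        new_w = int(img.shape[1] / scale)
        new_h = int(img.shape[0] / scale)
        img = cv2.resize(img, (new_w, new_h))

img = np.zeros((480, 640, 3), dtype=np.uint8)    # stand-in for a real photo
for level, scaled in enumerate(image_pyramid(img)):
    print(level, scaled.shape[:2])               # each level is scanned with the same window
```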
CNN methods for face detection
This section reviews some recent state-of-the-art (SOTA) CNN methods used for face detection. First the Single Shot Detector (SSD), a widely used object detection technique, and then the Multi-task Cascaded Convolutional Networks (MTCNN), a SOTA face detection technique, are reviewed.
The Single Shot Detector (SSD) [16] is a CNN-based method used for multi-class object detection. As the name suggests, a single feed-forward pass extracts both bounding boxes and class scores for those boxes. Figure 2.9 shows the network architecture of SSD. VGG-16 [17] is used as the base model, and the features it extracts are fed into SSD's classifier layers.
The classifier layers produce class scores and localization outputs. Class scores are the probabilities for each predefined class. Localization outputs are coordinate values relative to a node inside the feature map together with the size (width, height) of the object. As seen in Figure 2.10(b), every node inside the feature map has 4 dashed rectangles. Those rectangles are called anchor boxes; each has a predefined shape, and the network predicts offsets relative to it.
The predicted offsets relative to the anchor boxes can then be decoded into coordinate values. In Figure 2.10(b, c), there are feature maps of different sizes (4×4, 8×8). The reason for using feature maps at multiple scales is to make predictions scale invariant. As seen in Figure 2.10(a), the 8×8 feature map can detect smaller objects (the cat in the figure), while the 4×4 feature map can detect larger objects (the dog in the figure). Note that face detection has only one class (human face), so there will only be one class score per feature map node.
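As an illustration of how the localization outputs relate to anchor boxes, the following sketch decodes one predicted box using the common SSD parameterization (center offsets plus log-scale width and height); the variance values shown are the usual defaults, not values taken from this section.

```python
import numpy as np

def decode_ssd_box(anchor, offsets, variances=(0.1, 0.2)):
    """Decode one predicted box from its anchor box.

    anchor and offsets are (cx, cy, w, h) tuples; the result is returned
    as corner coordinates (x_min, y_min, x_max, y_max).
    """
    a_cx, a_cy, a_w, a_h = anchor
    t_cx, t_cy, t_w, t_h = offsets
    cx = a_cx + t_cx * variances[0] * a_w          # shift the anchor centre
    cy = a_cy + t_cy * variances[0] * a_h
    w = a_w * np.exp(t_w * variances[1])           # rescale the anchor size
    h = a_h * np.exp(t_h * variances[1])
    return np.array([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])

# Anchor centred on one node of a feature map (normalized image coordinates).
anchor = (0.5, 0.5, 0.2, 0.3)
print(decode_ssd_box(anchor, (0.4, -0.2, 0.1, 0.0)))
```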
MTCNN [18] is a CNN-based method that extracts both bounding boxes and facial landmarks. As the name suggests, it uses multiple CNN models cascaded as a pipeline. The first of these models is called the Proposal Network (P-Net), which extracts candidate bounding boxes. As seen in Figure 2.11, the output of P-Net contains many false positives. The CNN used in this stage is very weak and computationally inexpensive; using such a weak model at the beginning of the pipeline dramatically reduces the overall computation cost.
After extracting candidate bounding boxes, non-max suppression (NMS) is used to merge highly overlapping candidates. The second model in the pipeline is called the Refine Network (R-Net), which detects and rejects false candidates; it has many more parameters than P-Net. The final model is called the Output Network (O-Net). O-Net is the most powerful model of the three, and its execution time is drastically higher than the others'. O-Net detects five facial landmarks (eyes, nose tip, mouth corners) and finalizes the bounding box predictions. This cascaded mechanism (P-Net -> R-Net -> O-Net) results in a model that is both very efficient and accurate.
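A minimal NumPy sketch of the greedy non-max suppression step mentioned above is given below; MTCNN additionally applies bounding box regression between stages, which is omitted here.

```python
import numpy as np

def non_max_suppression(boxes: np.ndarray, scores: np.ndarray, iou_thresh: float = 0.5):
    """Greedy NMS: keep the highest-scoring box, drop candidates whose
    IoU with it exceeds the threshold, then repeat on the remainder."""
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]          # candidates sorted by score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # Intersection of the best box with all remaining candidates.
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_thresh]  # keep only weakly overlapping boxes
    return keep

boxes = np.array([[10, 10, 60, 60], [12, 12, 62, 62], [100, 100, 150, 150]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(non_max_suppression(boxes, scores))   # [0, 2]: the near-duplicate box is merged away
```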