Face recognition [translation: deep learning to understand human faces]


This article is a translation of "Deep Learning for Understanding Faces: Machines May Be Just as Good, or Better, than Humans." For convenience, the reference numbering of the original paper is kept unchanged, so citations can be looked up directly in the original text.

In recent years, deep convolutional neural networks have greatly advanced the detection and recognition of many kinds of objects. Together with large annotated datasets and GPUs, they make it possible to understand faces in unconstrained images and videos and to automate tasks such as face detection, pose estimation, keypoint localization, and face recognition. This article mainly introduces deep learning methods applied to face recognition: it discusses the different modules of an automatic face recognition system and the role deep learning plays in each, and then discusses problems in face recognition that deep convolutional neural networks have not yet solved.

1. What can we learn from the human face?

Face analysis is a challenging problem in computer vision (CV) and has been studied for more than 20 years [1]. The goal is to extract as much information as possible from a face, such as its location, pose, gender, identity, age, and expression. These techniques can be applied to video surveillance, active authentication on mobile devices, payment verification, and so on.
This article focuses on recent deep-learning-based systems for automatic face verification and recognition. Such a system consists of three modules:

  • Face detection, which locates the faces in an image or video. To be robust, the detector must handle variations in pose, illumination, and scale; at the same time, the position and size of each face box should be as precise as possible, without including background.
  • Keypoint detection, which locates important facial landmarks such as the eye centers, the tip of the nose, and the two corners of the mouth. These points are used for face alignment, normalizing the face to a canonical coordinate system to mitigate the effects of in-plane rotation and scaling.
  • Feature description, which extracts a sufficiently discriminative representation from the aligned face.

Given a face representation, a similarity score between two faces can be computed with some metric; if the distance falls below a threshold, the two faces are declared to come from the same person. Since the 1990s, many face verification and recognition methods have performed well, but only under constrained settings: their accuracy drops rapidly under variations in pose, illumination, resolution, expression, age, background clutter, and occlusion. Moreover, in video surveillance and similar scenarios, a subject must be verified from hundreds of low-resolution video frames, which places stricter demands on the robustness and speed of the algorithm.
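
To make the verification decision concrete, here is a minimal sketch of the thresholding rule described above, assuming the feature vectors come from some DCNN; the threshold value is illustrative, not taken from the paper.

```python
import numpy as np

def verify(feat1, feat2, threshold=1.1):
    """Decide whether two face feature vectors belong to the same person.

    feat1, feat2: 1-D DCNN feature vectors; the threshold is illustrative
    and would be tuned on a validation set in practice.
    """
    # L2-normalize so the Euclidean distance is bounded in [0, 2]
    f1 = feat1 / np.linalg.norm(feat1)
    f2 = feat2 / np.linalg.norm(feat2)
    distance = np.linalg.norm(f1 - f2)
    return distance < threshold  # below the threshold -> same person
```
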
To address these problems, researchers introduced deep learning for the required feature extraction. Deep convolutional neural networks (DCNNs) have proven very powerful for image analysis tasks [3]. Over the past five years, DCNNs have been used to solve many CV problems, such as object recognition [3]-[5] and object detection [6]-[8]. A typical DCNN stacks multiple convolutional layers with ReLU activations and can learn rich, discriminative representations; DCNNs have recently been applied successfully to face detection [2], [9], [10], keypoint localization [2], [10], [11], and face recognition and verification [12]. A key factor behind this success is the availability of large labeled datasets, such as:

  • Datasets for face recognition: CASIA-WebFace [13], MegaFace [14], [15], MS-Celeb-1M [16], and VGGFace [17]
  • Datasets for face detection: WIDER FACE [18]

These datasets contain rich variability in pose, illumination, expression, occlusion, and so on, which allows DCNNs to become robust to these variations and to extract valuable features.

2. Face detection in unconstrained images

Face detection is the key first step in the face recognition pipeline: given an image, the detector must find all the faces in it and return the bounding-box coordinates of each one. Before deep learning, unconstrained face detection relied on hand-crafted features such as Haar wavelets and HOG, which cannot capture salient facial information under varying resolution, viewpoint, illumination, expression, skin color, occlusion, and makeup; weak features hurt performance more than the choice of classifier does. With the adoption of deep learning and GPUs in recent years, DCNNs have enabled much better feature extraction. As described in [3], a DCNN pre-trained on a large dataset becomes a meaningful feature extractor, and these deep features can then be used widely, both for generic object detection and for face detection. DCNN-based face detectors fall into two categories: region-based and sliding-window-based.

Region-based

The region-based approach generates a pool of candidate object proposals (about 2,000 per image) and then uses a DCNN to classify each proposal as face or non-face. Most methods rely on off-the-shelf proposal generation [2], [10], [19]: for example, Selective Search [20] first creates candidate boxes, a DCNN extracts features from them, and a classifier decides whether each box is a face. HyperFace [10] and All-in-One Face [2] are region-based methods.
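
The following is a minimal sketch of this proposal-classification loop; `proposals` would come from a method such as Selective Search, and `face_classifier` stands in for an assumed DCNN that maps an image crop to a face probability.

```python
def region_based_detect(image, proposals, face_classifier, threshold=0.5):
    """Classify each candidate box as face / non-face.

    `proposals` is a list of (x1, y1, x2, y2) boxes, e.g. from Selective
    Search; `face_classifier` is an assumed DCNN returning P(face).
    """
    detections = []
    for (x1, y1, x2, y2) in proposals:
        crop = image[y1:y2, x1:x2]      # extract the candidate region
        score = face_classifier(crop)   # face probability for this crop
        if score >= threshold:
            detections.append((x1, y1, x2, y2, score))
    return detections
```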

Faster R-CNN
Recently, Faster R-CNN [19] has become the dominant detection framework; it simultaneously regresses the bounding-box coordinates of each face candidate. Li et al. [21] proposed a multi-task face detector based on the Faster R-CNN framework that integrates a DCNN with a 3D mean face model; the 3D mean face model improves the performance of the region proposal network (RPN) in generating face candidates, and greatly improves proposal pruning and refinement after face normalization. Similarly, Chen et al. [22] generate high-quality proposals and reduce redundant face candidates by training a multi-task RPN that performs face detection and keypoint detection jointly, keeping a balance between high recall and precision; the proposals are then normalized using the detected keypoints, and a DCNN face classifier further improves performance.

Sliding-window-based

The sliding-window approach computes a face detection score and candidate box coordinates at every location of a feature map at a given scale. It is faster than the region-based approach and can be implemented with convolution operations alone; detection at different scales is usually handled by building an image pyramid. DP2MFD [9] and DDFD [25] work this way. Faceness [26] adds part-level (half-face) responses on top of the full-face response and combines them according to their spatial configuration to produce the final face score. Li et al. [27] proposed a multi-resolution cascade that quickly rejects background regions in the low-resolution stages, leaving only a handful of hard candidates for the high-resolution stages.
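
A minimal sketch of the image-pyramid idea follows, assuming OpenCV is available; a fixed-size fully convolutional scorer run on each level effectively searches for faces at different scales.

```python
import cv2  # assumes OpenCV is installed

def image_pyramid(image, scale=1.25, min_side=24):
    """Yield progressively downscaled copies of `image`.

    Running a fixed-size detector on every level amounts to searching
    for faces at different scales; `scale` and `min_side` are
    illustrative values.
    """
    current = image
    while min(current.shape[:2]) >= min_side:
        yield current
        h, w = current.shape[:2]
        current = cv2.resize(current, (int(w / scale), int(h / scale)))
```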

Single-shot detector

Liu et al. [8] proposed the single-shot detector (SSD), a sliding-window-style detector that does not build an explicit image pyramid; instead it exploits the pyramid structure inside the network itself, pooling features from different layers and passing them to the final layers to perform face classification and candidate box regression. Because detection takes a single forward pass, the total computation time of SSD is lower than that of Faster R-CNN. Other architectures build on the SSD idea: Yang et al. proposed ScaleFace [28], which extracts scale-specific information from different layers of the network and fuses it at the final layer to perform face detection, and Zhang et al. proposed S3FD [29], which uses a scale-equitable framework and a scale-compensated anchor-matching strategy to improve the detection of small faces. Figure 1 shows the architecture of this method.
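
SSD-style detectors score a set of default boxes (anchors) tied to feature-map cells rather than pyramid levels. Here is a small sketch of how such boxes can be generated; the box sizes are illustrative, not values used in any of the papers above.

```python
import numpy as np

def default_boxes(feature_h, feature_w, stride, box_sizes=(16, 32)):
    """Generate SSD-style default boxes centered on each feature-map cell.

    `stride` maps feature-map coordinates back to image pixels; the box
    sizes are illustrative. Returns an (N, 4) array of (cx, cy, w, h).
    """
    boxes = []
    for y in range(feature_h):
        for x in range(feature_w):
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride
            for s in box_sizes:
                boxes.append((cx, cy, s, s))
    return np.array(boxes, dtype=np.float32)
```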


Many unconstrained face detection datasets are available for training and evaluation. FDDB [30] is the mainstream unconstrained face detection benchmark; it contains 2,845 images with a total of 5,171 faces, all from Yahoo! news reports. The MALF [31] dataset contains 5,250 high-resolution images with 11,931 faces, collected from Flickr and the Baidu search engine. Both datasets exhibit large variations in occlusion, pose, and illumination.
The WIDER FACE dataset [18] contains 32,203 images, 50% of which are used for training and 10% for validation. Its faces also vary widely in pose, illumination, occlusion, and scale, and detectors trained on it achieve strong performance [19], [23], [28], [29], [32], [33]. Evaluations on this dataset show that finding small faces in crowded scenes is still a challenge. Recently, Hu et al. [33] proposed explicitly using contextual information to help detect tiny faces: semantic information is captured from lower-level features and contextual information from higher-level features (see Figure 2).

Owing to space limits, traditional face detection methods are not discussed here; see [34], which covers the classical cascade methods and deformable part-based models (DPMs). For videos containing multiple faces, face tracking can associate the detections of each subject across frames; see [12] for video-based face recognition. Figure 3(a) compares the performance of different face detection methods on the FDDB dataset.

3. Keypoint detection and head-pose estimation

Facial keypoint detection is another important preprocessing step for face recognition and verification. Keypoints such as the eye centers, nose tip, and mouth corners are used to align the face to normalized canonical coordinates, and such normalization helps face recognition [35] and attribute detection. Head-pose estimation is likewise required for pose-based face analysis. Both problems have seen substantial progress in recent years. Most existing facial keypoint localization methods are either:

  • model-based, or
  • cascaded-regression-based.

Wang et al. [36] review the traditional methods, including the active appearance model (AAM), the active shape model (ASM), the constrained local model (CLM), and regression methods such as the supervised descent method (SDM). Chrysos et al. [37] survey facial keypoint tracking in video using traditional face detection methods. Here we summarize only the recent DCNN-based keypoint detection methods.

Model-based

Model-based approaches such as AAM, ASM, and CLM learn a shape model during training and use it to fit new faces at test time. For example, Antonakos et al. [43] extract multiple patches from image regions and model the face shape with multiple pairwise graph-based normal distributions (Gaussian-Markov random fields) between patches. However, such models cannot adapt to complex variations in pose, expression, and illumination, and they are also very sensitive to the initialization of the gradient-descent optimization. For this reason, researchers have also considered face alignment in three-dimensional space. Jourabloo et al. proposed PIFA [44], which uses cascaded regression in 3D to predict the coefficients of the 3D-to-2D projection matrix and the basis shape coefficients. Another work by Jourabloo et al. [45] treats face alignment as a 3D model-fitting problem, estimating the camera projection matrix and the 3D shape parameters with a cascade of DCNN-based regressors. Zhu et al. proposed 3DDFA [46], which fits a dense 3D face model to the image and models the depth data with a Z-buffer.

Cascaded-regression-based

Because face alignment is naturally a regression problem, many regression-based methods have been proposed in recent years. In general, these methods learn a model that maps the image appearance directly to the target landmark locations, but they depend on the robustness of the local descriptors. Sun et al. [47] proposed a carefully designed cascade of DCNNs in which, at each stage, the outputs of multiple networks are fused to refine the landmark estimates, with good results. Zhang et al. [48] proposed a coarse-to-fine autoencoder network that cascades several stacked autoencoder networks (SANs): the first SANs predict the rough position of each keypoint, and each subsequent SAN extracts local features around the current estimates at higher resolution and uses them as input to refine the keypoints. Kumar et al. [11] obtained better results with a single carefully designed DCNN (see Figure 4).
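
A generic cascaded-regression update loop looks like the following minimal sketch; the per-stage linear regressors in `stages` and the `extract_features` descriptor are assumptions standing in for whatever a particular method learns.

```python
import numpy as np

def cascaded_regression(image, init_shape, stages, extract_features):
    """Refine a landmark estimate through a cascade of regressors.

    init_shape: (K, 2) initial landmark coordinates (e.g., a mean shape
    placed in the detected face box). `stages` is a list of learned
    (R, b) linear regressors and `extract_features` computes a 1-D
    descriptor around the current landmarks; both are assumed to exist.
    """
    shape = init_shape.copy()
    for R, b in stages:
        phi = extract_features(image, shape)          # appearance at current estimate
        shape = shape + (R @ phi + b).reshape(-1, 2)  # additive shape update
    return shape
```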


Xiong et al. [49] proposed domain-dependent descent maps. Zhu et al. [38] observed that optimizing the basis shape coefficients and the projection is only indirectly tied to the end goal, since smaller parameter errors do not necessarily mean smaller alignment errors. They therefore proposed CCL [38], based on head-pose-based and domain-selective regressors: the optimization domain is first partitioned into multiple directions according to head pose, and the results of the per-domain regressors are combined with a composition estimator function. Trigeorgis et al. [50] proposed end-to-end learning of a convolutional recurrent neural network regressor used within the cascaded-regression framework, which avoids the problem of training each regressor independently. Bulat et al. [51] proposed a DCNN whose first part roughly localizes each facial keypoint using score maps computed from the features of earlier layers, and whose regression branch then refines the keypoints; the algorithm is therefore insensitive to the quality of the detected face box, and the system can be trained end-to-end. Kumar et al. [52] also proposed an efficient method for keypoint estimation and pose prediction under unconstrained conditions; it solves face alignment mainly by learning a heatmap in which each value is a probability, representing how likely a keypoint is to lie at that particular location.
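
Reading landmark coordinates out of such heatmaps can be as simple as the following sketch, assuming one (H, W) probability map per keypoint.

```python
import numpy as np

def landmarks_from_heatmaps(heatmaps):
    """Read out keypoint locations from per-landmark probability maps.

    heatmaps: (K, H, W) array where heatmaps[k, y, x] is the predicted
    probability that landmark k lies at pixel (x, y).
    Returns a (K, 2) array of (x, y) coordinates.
    """
    K, H, W = heatmaps.shape
    flat_idx = heatmaps.reshape(K, -1).argmax(axis=1)  # most probable pixel
    ys, xs = np.unravel_index(flat_idx, (H, W))
    return np.stack([xs, ys], axis=1)
```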

Different datasets also provide different keypoint annotations. The 300 Faces in the Wild (300-W) database [53] has become the benchmark for comparing keypoint localization methods. It contains more than 12,000 images annotated with 68 keypoints, drawing on Labeled Face Parts in the Wild, Helen, AFW, and iBUG, plus newly collected indoor and outdoor test images.

Besides two-dimensional transformations for face alignment, Hassner et al. [54] proposed an effective method for frontalizing faces with the help of a generic three-dimensional face model. However, the effectiveness of that method depends strongly on the quality of the detected keypoints (i.e., when the keypoints are of poor quality, the method tends to introduce errors). In addition, many methods approach face detection from a multitask learning (MTL) angle, jointly training face detection with facial keypoint estimation. MTL helps train the network more robustly because of the extra supervision: for example, the eye centers and nose tip obtained from the keypoints help the network discriminate the structure of the face. Zhang et al. [32], Chen et al. [22], Li et al. [21], and HyperFace [10] use this idea, and All-in-One Face [2] extends the MTL tasks to face verification and to gender, smile, and age estimation. Figure 3(b) compares the keypoint estimation performance of different algorithms on the AFW dataset [55].
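
As an illustration of the two-dimensional alignment used throughout this pipeline, here is a minimal sketch that warps a face so its detected eye centers land on canonical positions. The canonical coordinates and output size are illustrative assumptions, and OpenCV is assumed to be available.

```python
import numpy as np
import cv2  # assumes OpenCV is installed

def align_by_eyes(image, left_eye, right_eye, out_size=112,
                  canon_left=(38, 51), canon_right=(74, 51)):
    """Warp a face so the detected eye centers land on canonical positions.

    The canonical coordinates are illustrative. A similarity transform
    (rotation + uniform scale + translation) removes in-plane rotation
    and scale, as described above.
    """
    src = np.float32([left_eye, right_eye])
    dst = np.float32([canon_left, canon_right])
    # estimateAffinePartial2D fits a 4-degree-of-freedom similarity transform
    M, _ = cv2.estimateAffinePartial2D(src, dst)
    return cv2.warpAffine(image, M, (out_size, out_size))
```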

4. Face recognition and verification

This section reviews work on face verification and recognition. Figure 5 illustrates the training and testing pipeline for face verification and recognition with DCNNs.


The pipeline has two important components:

  • a robust face representation;
  • a discriminative classification model (for face recognition) or similarity metric (for face verification).

Because this article focuses on deep learning methods, we refer the reader to [56] for traditional representations such as LBP and Fisher vectors, and for metric learning methods such as one-shot similarity (OSS), Mahalanobis metric learning, cosine metric learning, large-margin nearest neighbor, attribute-based classifiers, and joint Bayesian (JB).

4.1 Robust face feature learning using deep learning

Learning feature representations that are both invariant and discriminative is a key step in a face recognition system. Deep learning methods have demonstrated that they can learn compact, discriminative representations from very large datasets. Below we summarize representative deep-learning approaches to feature representation learning.
Huang et al. [57] abandoned traditional hand-crafted features such as LBP and instead learned face representations with a convolutional deep belief network based on local convolutional restricted Boltzmann machines. They first learn useful representations unsupervised from unlabeled natural-scene images, and then apply them to face verification and recognition with a classifier (SVM) and a metric learning method (OSS). Even without training on a massive labeled face dataset, the method achieved satisfactory results on the LFW dataset.

An early application of DCNNs to face recognition, combined with 3D alignment, is Taigman et al.'s DeepFace [58]. It learns the face representation with a nine-layer deep network containing more than 120 million parameters, and uses unshared locally connected layers instead of standard convolutional layers. The training set contains four million faces of more than 4,000 identities.

Because collecting such large-scale datasets is time-consuming, Sun et al. proposed the DeepID family [59]-[61], which applies a joint Bayesian (JB) approach to face verification and leverages an ensemble of networks. Compared with DeepFace, the networks are shallower and smaller (each DCNN has four convolutional layers and a 39x31x1 input), and the training set is 202,599 images of 10,177 subjects. By training on a large number of identities and on different local and global face patches, DeepID learns a discriminative and informative face representation. It was also the first method to surpass human performance on the LFW dataset.

Schroff et al. proposed FaceNet [62], a CNN-based face recognition method that directly optimizes the face embedding itself rather than a bottleneck layer. It trains on triplets of roughly aligned matching/non-matching face patches using an online triplet mining method. Their training set is a large proprietary face dataset of 100 to 200 million face thumbnails covering about eight million identities.
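
A minimal sketch of a FaceNet-style triplet loss, written here in PyTorch; the margin value is illustrative.

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet loss on L2-normalized embeddings.

    `anchor` and `positive` come from the same identity, `negative`
    from a different one; all are (batch, dim) tensors. The margin is
    an illustrative hyperparameter.
    """
    anchor = F.normalize(anchor, dim=1)
    positive = F.normalize(positive, dim=1)
    negative = F.normalize(negative, dim=1)
    d_pos = (anchor - positive).pow(2).sum(dim=1)  # squared distance to positive
    d_neg = (anchor - negative).pow(2).sum(dim=1)  # squared distance to negative
    return F.relu(d_pos - d_neg + margin).mean()   # hinge on the gap
```
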
Yang et al. [13] collected a public large-scale face dataset, CASIA-WebFace, comprising 494,414 face images of 10,575 subjects gathered from IMDb, and trained a network with about five million parameters. Combined with the joint Bayesian method, the model obtains satisfactory results on LFW. CASIA-WebFace remains a mainstream training dataset.

Parkhi et al. [17] also released a public large-scale face dataset, VGGFace, containing 2.6 million faces of 2,600 subjects. Building on the well-known VGGNet [24] for object recognition, they used a triplet embedding for face verification. The DCNN trained on VGGFace performs well on both still faces (LFW) and video faces (YouTube Faces, YTF), requires only a single network, and is open source. The VGGFace dataset is also a major training set.

More recently, AbdAlmageed et al. [63] handle pose variation by training separate DCNN models for frontal, half-profile, and full-profile poses, improving face recognition performance in unconstrained settings. Masi et al. [64] use 3D morphable models to augment the CASIA-WebFace dataset with a large number of synthetic faces, replacing the crowdsourced annotation task with data synthesis. Ding et al. [65] fuse deep features around facial keypoints from different network layers and apply a new triplet loss to achieve state-of-the-art video face recognition. Wen et al. [66] proposed a new loss function that models a center point for each class and uses the distance to it as a regularizer on the softmax loss, learning more discriminative face representations with a residual network. Liu et al. [67] proposed a novel angular loss based on a modified softmax loss; the resulting discriminative angular feature representation matches the commonly used cosine similarity metric well, and the model matches the best results while training on a much smaller training set. Ranjan et al. [68] trained a softmax loss on features constrained by a scaled L2 norm, using a subset of the recently released MS-Celeb-1M face dataset; their work shows that this loss widens the margin between classes. The method obtains the best results on the IARPA Benchmark A (IJB-A) dataset [69]. Instead of simply averaging per-frame face representations, Yang et al. proposed a neural aggregation network [70] that dynamically weights and aggregates the faces in an image set or a face video, yielding a compact and powerful video face representation; it achieves the best results on several image-set and video face datasets. Bodla et al. [71] proposed a fusion network that combines the face representations of two different DCNN models to improve recognition performance.
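
Of these losses, the center loss of Wen et al. [66] is easy to sketch: a learnable center is kept for each class and each feature is pulled toward its own class center, with the term added to the softmax loss as a regularizer. The weighting hyperparameter is left to the caller in this minimal sketch.

```python
import torch
import torch.nn as nn

class CenterLoss(nn.Module):
    """Center-loss regularizer in the spirit of Wen et al. [66].

    Keeps one learnable center per class and penalizes the squared
    distance of each feature to its class center; combine it with the
    softmax loss using a small weight chosen by the caller.
    """
    def __init__(self, num_classes, feat_dim):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))

    def forward(self, features, labels):
        # distance of each sample's feature to its own class center
        diff = features - self.centers[labels]
        return 0.5 * diff.pow(2).sum(dim=1).mean()
```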

4.2 Discriminative metric learning for faces

Learning a classifier or a similarity metric from data is the other key component of a face recognition system. Many methods in the literature essentially use pairs of face images together with label information. Hu et al. [72] learn a discriminative metric with a deep network. Schroff et al. [62] and Parkhi et al. [17] optimize the DCNN parameters with a triplet loss, directly embedding the DCNN features into a discriminative subspace and thereby improving face verification results. In [73], a probabilistic model is used to learn discriminative low-rank embeddings for face verification and clustering. Song et al. [74] presented a method that makes full use of the training data by considering the pairwise distances between all samples in a batch.

Unlike supervised DCNN-based face recognition, Yang et al. [75] jointly learn deep representations and image clusters in a recurrent framework. Each image is initially treated as a separate cluster, and the initial grouping is used to train the deep network; the deep representations and cluster memberships are then refined by alternating iterations until the number of clusters reaches a predetermined value. This unsupervised method performs well on various tasks such as face recognition and image classification. Zhang et al. [76] cluster the faces in a video by alternating between deep-representation adaptation and clustering. Trigeorgis et al. [77] proposed a deep semi-supervised nonnegative matrix factorization that learns hidden representations, allowing them to interpret clusterings of a given face dataset according to different unknown attributes such as pose, emotion, and identity; their approach also handles difficult face datasets. Lin et al. [78] proposed an unsupervised clustering algorithm that exploits the neighborhood structure between samples to perform domain adaptation implicitly and improve clustering performance; they also used this method to curate large-scale noisy face datasets such as MS-Celeb-1M [79].

4.3 Implementation

Face recognition can be divided into two tasks:

  • face verification;
  • face identification.

In face verification, given two face images, the system decides whether the two faces come from the same person. In face identification, given a face image of unknown identity, the system determines the identity by matching the image's features against a database.
For both tasks, obtaining discriminative and robust features is very important. For face verification, the face must first be located by the face detector and then, using the detected keypoints, normalized to canonical coordinates with a similarity transform. Each face image is then passed through the DCNN to obtain its representation, and once the features are produced, a match score can be computed with a similarity metric (see the sketch after the following list). The most commonly used metrics are:

  • the L2 distance between face features;
  • cosine similarity, which measures the distance between features in angular space.
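
Both metrics are one-liners in NumPy; this minimal sketch assumes the features are plain 1-D vectors.

```python
import numpy as np

def l2_distance(f1, f2):
    """Euclidean distance between two feature vectors (lower = more similar)."""
    return np.linalg.norm(f1 - f2)

def cosine_similarity(f1, f2):
    """Cosine of the angle between two feature vectors (higher = more similar)."""
    return f1 @ f2 / (np.linalg.norm(f1) * np.linalg.norm(f2))
```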

Features or similarity scores from multiple DCNNs can also be fused, as in the DeepID architecture [59]-[61] or the fusion network [71]. For the face identification task, the training (gallery) images are passed through the DCNN and the feature of each identity is stored in the database; when a new face image arrives, its representation is computed first and then a similarity score is computed against every feature in the database, as sketched below.
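
A minimal closed-set identification sketch under these assumptions, with the gallery stored as a dict mapping each identity to its enrolled feature vector:

```python
import numpy as np

def identify(probe_feat, gallery):
    """Closed-set identification: return the best-matching gallery ID.

    `gallery` maps identity -> enrolled feature vector (one per ID, as
    described above); cosine similarity scores every enrolled feature.
    """
    best_id, best_score = None, -np.inf
    p = probe_feat / np.linalg.norm(probe_feat)
    for identity, feat in gallery.items():
        score = p @ (feat / np.linalg.norm(feat))
        if score > best_score:
            best_id, best_score = identity, score
    return best_id, best_score
```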

4.4 Training datasets for face recognition

Table 1 summarizes the public datasets used to evaluate algorithm performance and to train DCNN models.

  • MS-Celeb-1M [79] is currently the largest public face recognition dataset, containing more than 10 million labeled face images of the top 100,000 identities from a one-million-celebrity list, with pronounced variations in pose, illumination, occlusion, and so on. Because the dataset also contains a lot of label noise, interested readers may consult [78].
  • The CelebA dataset [80] contains 202,599 face images of 10,000 identities, annotated by a professional labeling company with 40 face attributes and five keypoints.
  • CASIA-WebFace [13] is another mainstream public dataset, containing 494,414 face images of 10,575 subjects, all collected from the IMDb website.
  • VGGFace [17] contains 2.6 million faces of 2,600 subjects.
  • MegaFace [14], [15] is used to test the robustness of face recognition algorithms against one million distractors. The dataset has two parts: the first allows expansion with external training data, and the other provides 4.7 million face images of 672,000 identities for training.
  • LFW [81] contains 13,233 face images of 5,749 subjects collected from the web, of which 1,680 subjects have two or more images. This dataset is primarily used to evaluate still-image face recognition algorithms, and most of its faces are frontal.
  • The IJB-A dataset [69] contains 500 subjects with 5,397 images and 2,042 videos split into 20,412 frames. It is designed to test robustness to large variations in pose, illumination, and image quality.
  • The YTF dataset [82] contains 3,425 videos of 1,595 subjects and is the standard dataset for evaluating video face recognition algorithms.
  • The PaSC dataset [83] contains 2,802 videos of 293 subjects, captured in controlled settings; it is used to test video face algorithms under large variations in pose, illumination, and blur.
  • The Celebrities in Frontal-Profile (CFP) dataset [84] contains 7,000 images of 500 subjects and is used to test face verification under extreme pose variation.
  • The UMDFaces [85] and UMDFaces-Video [35] datasets contain 367,888 still images of 8,277 subjects and 22,075 videos of 3,107 subjects, respectively. They can be used to train both still-image and video face recognition models, and the subjects in UMDFaces-Video also appear in UMDFaces, which helps in transferring models from still-image face recognition to the video domain.

Recently, Bansal et al. [35] studied what characterizes a good large-scale dataset, addressing the following questions:

  • Can we train only on still images and then extend the model to video?
  • Are deeper datasets better than wider ones, where deeper means more images per identity and wider means more identities?
  • How does label noise in the training set affect deep network performance?
  • Is face alignment necessary for face recognition?

The authors experimented with CASIA-WebFace [13], UMDFaces [85] and its video extension [35], YouTube Faces [82], and IJB-A [69]. They found that a DCNN trained on both still images and video frames performs better than one trained on only one of the two. Their experiments also showed that for smaller models, training on a wider dataset gives better results than a deeper one, and for deeper models a wider dataset is also usually better. The work in [35] further shows that label noise usually impairs face recognition performance, and that face alignment helps improve it.

4.5 Performance Summary

This section summarizes the performance of face recognition and verification algorithms on the LFW and IJB-A datasets.

LFW dataset
The standard LFW face verification protocol defines 3,000 positive and 3,000 negative pairs, divided into ten non-overlapping folds, each containing 300 positive and 300 negative pairs. It covers 7,701 images of 4,281 subjects. Table 2 reports the results of DeepFace [58], DeepID2 [61], DeepID3 [86], FaceNet [62], Yi et al. [13], Wang et al. [87], Ding et al. [88], Parkhi et al. [17], Wen et al. [66], Liu et al. [67], Ranjan et al. [68], and human performance.

IJB-A benchmark
The dataset contains both still images and video frames (see Figure 6).


Face verification performance is measured by the ROC curve, and closed-set face identification accuracy by the cumulative match characteristic (CMC) score. IJB-A evaluates face verification (1:1 matching) over ten splits, each containing about 11,748 pairs (1,756 positive and 9,992 negative); face identification (1:N search) likewise uses ten splits, each with about 112 gallery templates and 1,763 probe templates (1,187 genuine probes and 576 impostor probes). Each training set contains 333 subjects and each test set 167 subjects, with no overlap. Unlike the LFW and YTF datasets, which evaluate verification only on a sparse set of negative pairs, IJB-A divides the images/video frames into gallery and probe sets so that all available positive and negative pairs can be used for evaluation; moreover, each gallery and probe set contains multiple templates, where each template (identity) is a collection of samples from multiple images and videos. LFW and YTF contain only faces detectable by the Viola-Jones detector, while IJB-A includes extreme variations in pose, illumination, and expression. These factors make IJB-A a challenging dataset.
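
As an illustration of the CMC metric, here is a minimal sketch that computes rank-k identification rates from a probe-by-gallery similarity matrix (closed set, so every probe identity is assumed to be present in the gallery).

```python
import numpy as np

def cmc_curve(score_matrix, gallery_ids, probe_ids, max_rank=10):
    """Cumulative match characteristic for closed-set identification.

    score_matrix[i, j] is the similarity between probe i and gallery
    template j. The CMC at rank k is the fraction of probes whose true
    identity appears among their k highest-scoring gallery templates.
    """
    gallery_ids = np.asarray(gallery_ids)
    hits = np.zeros(max_rank)
    for i, true_id in enumerate(probe_ids):
        order = np.argsort(-score_matrix[i])           # best match first
        rank = np.where(gallery_ids[order] == true_id)[0][0]
        if rank < max_rank:
            hits[rank:] += 1   # a hit at rank r counts for all k >= r
    return hits / len(probe_ids)
```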

The CMC scores and ROC curves of the different algorithms on face identification and verification are summarized in Table 3.


In addition to averaging the per-frame feature representations, we also use media averaging: features from the same medium (image or video) are averaged first, and the per-medium averages are then averaged again to produce the final representation, which is further processed with triplet probabilistic embedding [73].
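
A minimal sketch of this media-averaging step; `media_ids` marks which medium (a particular image, or all frames of one video) each feature came from.

```python
import numpy as np

def media_average(features, media_ids):
    """Media-averaged face representation for a template.

    Features from the same medium are averaged first; the per-medium
    averages are then averaged and L2-normalized to give the template
    representation.
    """
    per_media = {}
    for feat, medium in zip(features, media_ids):
        per_media.setdefault(medium, []).append(feat)
    medium_means = [np.mean(v, axis=0) for v in per_media.values()]
    template = np.mean(medium_means, axis=0)
    return template / np.linalg.norm(template)
```
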
Table 3 summarizes the scores of the following algorithms:

  • DCNN_casia [87]
  • DCNN_bl (bilinear CNN) [92]
  • DCNN_pose (multi-pose DCNN models) [70]
  • DCNN_3d [64]
  • Template adaptation (TP) [93]
  • DCNN_tpe [73]
  • DCNN_all (All-in-One Face) [2]
  • DCNN_l2+tpe [68]
  • [91]

A detailed comparison of the algorithms is given in Table 4.

5. Facial attributes

From a single face we can infer attributes such as gender, expression, age, and skin color. These attributes are useful for image retrieval, expression detection, and mobile security; in the biometrics literature, facial attributes are called soft biometrics [95]. Kumar et al. [56] introduced attributes as image descriptors for face verification, describing each face with 65 binary attributes. Berg et al. trained a classifier for each pair of faces in the training set and then described each face by its responses to those classifiers, so that each person is characterized by their similarity to other people; this automatically creates an attribute set without relying on a large hand-labeled attribute dataset [56]. In recent years DCNNs have also been used for attribute classification: pose-aligned networks for deep attributes (PANDA) combine part-based models with pose-normalized DCNNs for attribute classification [96]; [97] applies a DCNN to the Adience dataset for age and gender classification; and Liu et al. use two DCNNs, one for face localization and one for attribute recognition, which outperforms PANDA on many attributes of the CelebA and LFWA datasets [80].

Rather than treating each attribute independently, [99] exploits the correlations between attributes to improve image ranking and retrieval: the attribute classifiers are first trained independently, and the correlations between the outputs of classifier pairs are then learned. Hand et al. [100] train a single network to classify 40 attributes, learning the relationships among all 40 attributes rather than only between pairs, so that information is shared across the whole network. Ranjan et al. [2] use MTL to train a single network that performs face detection, keypoint localization, face recognition, 3D head-pose estimation, gender classification, age estimation, and smile detection. Recently, Gunther et al. proposed the alignment-free facial attribute classification technique (AFFACT) [101], which classifies attributes without face alignment: a data-augmentation technique lets the network classify facial attributes from detections alone, and an ensemble of three networks achieves the best results on the CelebA dataset.

In addition, facial attributes can be used to improve authentication on mobile devices [17]. Recently proposed attribute-based continuous authentication methods [102], [103] show good authentication performance on mobile phones using facial attributes. Attributes also have the advantage that they can be learned from only part of a face. Exploiting these two advantages, Samangouei et al. [98] designed efficient DCNN architectures that can be deployed on mobile devices; Figure 7 shows how facial attributes are used for mobile authentication.

6. Multitask learning for facial analysis

This section introduces several MTL methods for face analysis. Caruana [104] first analyzed the MTL framework for machine learning, and MTL has since been used to solve many problems in CV. An early MTL-based facial analysis method was proposed by Zhu et al. [55]; their algorithm jointly addressed face detection, keypoint localization, and head-pose estimation. Another method, JointCascade [105], improved face detection by jointly training a keypoint localization task. Both algorithms were based on hand-crafted features, which made it difficult to extend the MTL approach to more tasks.

Before deep learning, MTL was limited to a few related tasks because different tasks required different representations: face detection typically used HOG, while face recognition used LBP, and keypoint localization, age and gender estimation, and attribute classification each naturally demanded different features. With the advent of deep learning, hand-crafted features can be discarded, and training a single network to perform face detection, keypoint localization, facial attribute prediction, and face recognition becomes possible.

Generally speaking, when a human looks at a face in a picture, they detect where the face is and then infer the gender, approximate pose, age, expression, and so on. When machines perform these tasks, a separate algorithm is usually designed for each. However, we can design a single deep network that accomplishes all these tasks simultaneously and exploits the relationships between them. Goodfellow et al. [106] interpret MTL as a form of regularization for DCNNs: with MTL, the learned parameters must work for all tasks, which reduces overfitting and helps the optimization converge to a robust solution.
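
The following is a minimal PyTorch sketch of this idea: a shared convolutional trunk with one small head per task. The layer sizes and task heads are illustrative, not those of any published model.

```python
import torch
import torch.nn as nn

class MultiTaskFaceNet(nn.Module):
    """Minimal MTL sketch: shared trunk, one head per face-analysis task."""

    def __init__(self, feat_dim=256, num_identities=1000):
        super().__init__()
        self.trunk = nn.Sequential(  # shared representation for all tasks
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim), nn.ReLU(),
        )
        self.identity_head = nn.Linear(feat_dim, num_identities)
        self.gender_head = nn.Linear(feat_dim, 2)
        self.pose_head = nn.Linear(feat_dim, 3)        # yaw, pitch, roll
        self.landmark_head = nn.Linear(feat_dim, 68 * 2)

    def forward(self, x):
        f = self.trunk(x)
        return {
            "identity": self.identity_head(f),
            "gender": self.gender_head(f),
            "pose": self.pose_head(f),
            "landmarks": self.landmark_head(f),
        }
```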

HyperFace [10] and the task-constrained deep convolutional network (TCDCN) [107] are two such methods. HyperFace addresses face detection, keypoint localization, head-pose estimation, and gender classification; it fuses the intermediate layers of a DCNN so that the tasks can exploit rich semantic features, and MTL thereby improves the performance of the individual tasks. The TCDCN algorithm of Zhang et al. [107] also performs gender recognition, smile prediction, glasses detection, and so on, with the predictions for all tasks coming from the same feature space. Their work shows that auxiliary tasks such as glasses detection and smile prediction improve facial keypoint localization.

Ranjan et al.'s recently proposed All-in-One Face [2] uses a single DCNN to simultaneously perform face detection, keypoint localization, face recognition, 3D head-pose estimation, smile detection, age estimation, and gender classification. The architecture (Figure 8(a)) is described below.


It starts from a pre-trained face recognition network [73] consisting of seven convolutional layers and three fully connected layers, which serves as the backbone for the face recognition task; the parameters of the first six convolutional layers are shared with the other face-related tasks. The central idea is that a CNN pre-trained on face recognition provides a better initialization for general face analysis tasks, because the filters of each layer retain discriminative face information.

Because no single dataset contains the annotations required for all face analysis tasks (face boxes, keypoints, pose, gender, age, smile, and identity information), multiple subnetworks are trained on task-specific datasets while sharing parameters. In this way, the shared parameters adapt to the whole domain rather than fitting a specific task's domain. At test time, the subnetworks are fused into a single All-in-One Face network. Table 5 lists the datasets used to train All-in-One Face for the different tasks.


Task-specific loss functions are used to train the network end-to-end. Sample outputs of the All-in-One Face network are shown in Figure 9.

MTL-based DCNNs can also be used to predict multiple facial attributes. Dehghan et al. proposed Deep Age, Gender, and Emotion Recognition (DAGER) [111], which recognizes age, gender, and expression with a DCNN; like All-in-One Face [2], it uses different datasets to train the DCNN for the different tasks. He et al. [112] jointly train face detection and facial attribute analysis in a single network; unlike other MTL methods, they use the entire image as input rather than only the face region, and a Faster R-CNN-style method detects the faces jointly. Table 6 summarizes recent MTL-based facial analysis methods.

7. Open issues

We briefly discussed the design of each component of an automatic face verification and recognition system. Open problems include:

  • Face detection: compared with generic object detection, face detection is a more challenging task because of the wide range of variations a face can undergo, including illumination, expression, viewing angle, and occlusion; other factors such as blur and low resolution increase the difficulty further.
  • Keypoint detection: most existing datasets contain only a few thousand images; a large annotated unconstrained dataset would make face alignment systems more robust to challenges such as extreme pose, low light, and small, blurry face images. Researchers have assumed that deeper CNNs capture more robust information, but it has not yet been studied which layers precisely encode the local features needed to detect facial keypoints.
  • Face verification/recognition: performance can be improved by learning a discriminative distance metric. However, because of GPU memory limits, how to choose informative pairs or triplets and train networks end-to-end with online methods (e.g., stochastic gradient descent) on large datasets remains an open problem. Another challenging problem is incorporating full-motion video processing into deep networks for video-based face analysis.
8. Summary

This article briefly reviewed deep-learning methods for the components of an automatic face recognition system: face detection, keypoint detection and head-pose estimation, face verification and recognition, facial attributes, and multitask learning, together with the open issues that remain. For the full discussion and the complete reference list, see the original paper below.

Reference:
R. Ranjan, S. Sankaranarayanan, A. Bansal, N. Bodla, J.-C. Chen, V. M. Patel, C. Castillo, and R. Chellappa. "Deep learning for understanding faces: Machines may be just as good, or better, than humans." IEEE Signal Processing Magazine, 35(1):66-83, 2018.

