DeepFashion: Powering Robust Clothes Recognition and Retrieval with Rich Annotations


The paper makes two main contributions:

1. It builds a large clothing dataset, DeepFashion, containing 800,000 garment images annotated with categories, attributes, landmarks, and clothing bounding boxes. For more information, see my other blog post, DeepFashion: an overview of the public clothing dataset.

2. It proposes a deep model, FashionNet, which extracts clothing features by jointly learning clothing attributes and landmarks, and thus copes better with clothing deformation and occlusion.

Difficulties

The difficulty of clothes recognition falls into three categories: style/texture variation; deformation and occlusion; and diverse scenarios (cluttered, non-uniform backgrounds).

The authors argue that a unified dataset is needed: a large number of attributes helps partition the feature space (attributes abstract the notion of style), and landmarks aid recognition.

The dataset defines three benchmark tasks (metric sketches follow this list):

1. Category and attribute prediction: classify 50 fine-grained categories and 1,000 attributes over 63,720 images. Category classification is a 1-of-K problem, evaluated with standard top-k classification accuracy. Attribute prediction is a multi-label tagging problem, evaluated with the top-k recall rate: the 1,000 attribute scores are ranked and the number of ground-truth attributes matched within the top k is counted.

2. In-shop clothes retrieval: determine whether two shop images show the same garment. 11,735 garments, 54,642 photos in total. Evaluated with top-k retrieval accuracy: a retrieval is correct if the matching item appears in the top-k search results.

3. Consumer-to-shop clothes retrieval: match a consumer's photo to the corresponding shop garment. 251,361 consumer-to-shop pairs.
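Since these three metrics recur throughout the paper, here is a minimal sketch of how they can be computed. This is not the benchmark's official evaluation code; the array names and shapes are assumptions for illustration.

```python
# Minimal sketches of the three evaluation metrics (not the official code).
# `scores` is an (N, C) array of per-class scores; the label arrays are
# hypothetical ground truth for illustration.
import numpy as np

def top_k_accuracy(scores, labels, k=3):
    """1-of-K category classification: correct if the true class
    appears among the k highest-scoring classes."""
    top_k = np.argsort(-scores, axis=1)[:, :k]          # (N, k) class indices
    return float(np.mean([labels[i] in top_k[i] for i in range(len(labels))]))

def top_k_recall(scores, attr_labels, k=5):
    """Multi-label attribute tagging: fraction of ground-truth attributes
    recovered among the k highest-ranked attributes."""
    top_k = np.argsort(-scores, axis=1)[:, :k]
    recalls = []
    for i in range(len(attr_labels)):
        truth = np.flatnonzero(attr_labels[i])          # indices of true attributes
        if truth.size:
            recalls.append(np.isin(truth, top_k[i]).mean())
    return float(np.mean(recalls))

def top_k_retrieval_accuracy(query_ids, ranked_gallery_ids, k=20):
    """Retrieval: a query counts as correct if any of its top-k ranked
    gallery results shares its garment identity."""
    hits = [q in ranked[:k] for q, ranked in zip(query_ids, ranked_gallery_ids)]
    return float(np.mean(hits))
```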

The network structure of FashionNet is similar to VGG-16; only the last convolutional layer of VGG-16 is modified so that the network can handle clothing landmarks, attributes, and categories.

As the paper's architecture figure shows, the last convolutional layer of VGG-16 is replaced by three branches designed by the authors.

The rightmost (blue) branch predicts landmark visibility and regresses landmark positions.

The middle (green) branch has two inputs: the VGG-16 conv4 feature maps and the landmark predictions output by the blue branch. It pools local features at the predicted landmark locations, which helps the network cope with clothing deformation and occlusion.

The left (orange) branch processes global features.

The outputs of the green and orange branches are fused by fc7_fusion to predict garment categories, attributes, and pairs.

Global feature branch: the overall features of the whole garment.
Local feature branch: local features of the garment obtained by pooling at the clothing landmarks.
Pose branch: predicts landmark positions and their visibility (i.e., the probability that each landmark exists).

Forward Propagation (Forward Pass)

It consists of three stages (a pooling sketch follows):

Stage 1: the clothing image is fed into the network and passes through the pose branch, which predicts the landmark locations and their visibility.

Stage 2: patches around the visible landmarks are cropped from the conv4 feature maps, gated (multiplied) by the predicted visibility, stacked together, and max-pooled to produce pool5_local. This operation makes the local features invariant to clothing deformation and occlusion; stacking the per-landmark features lets later layers exploit the correlations between landmarks.

Stage 3: once pool5_local is obtained, conv4 and the earlier layers are fixed while the network trains on category and attribute classification/prediction. The fc7_fusion layer concatenates the global features from fc6_global with the landmark-pooled local features from fc6_local.
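Here is a minimal PyTorch sketch of the Stage-2 "landmark pooling" idea. It is not the authors' released implementation; the patch size (3x3), the number of landmarks (8), and the normalized coordinate convention are assumptions for illustration.

```python
# A minimal sketch of landmark pooling (Stage 2); sizes are assumptions.
import torch

NUM_LANDMARKS = 8  # DeepFashion annotates up to 8 landmarks per garment

def landmark_pool(conv4, landmarks, visibility, patch=3):
    """Crop a patch of conv4 features around each predicted landmark,
    gate it by the landmark's visibility, and max-pool it to a vector.

    conv4:      (N, C, H, W) feature maps
    landmarks:  (N, L, 2) predicted (x, y), normalized to [0, 1]
    visibility: (N, L) probability that each landmark is visible
    """
    n, c, h, w = conv4.shape
    pooled = []
    for j in range(NUM_LANDMARKS):
        xs = (landmarks[:, j, 0] * (w - patch)).long().clamp(0, w - patch).tolist()
        ys = (landmarks[:, j, 1] * (h - patch)).long().clamp(0, h - patch).tolist()
        # max-pool each local patch down to a C-dim vector
        feats = torch.stack([
            conv4[i, :, ys[i]:ys[i] + patch, xs[i]:xs[i] + patch].amax(dim=(1, 2))
            for i in range(n)
        ])
        # occluded landmarks (visibility near 0) contribute almost nothing
        pooled.append(feats * visibility[:, j:j + 1])
    # stacking all landmarks side by side lets the following fc6_local
    # layer exploit correlations between landmarks
    return torch.cat(pooled, dim=1)  # (N, C * L) -> pool5_local
```

fc7_fusion then concatenates fc6_global (computed from the full feature map) with fc6_local (computed from this pool5_local) before the category, attribute, and pair heads.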

The backward propagation process defines four loss functions:

(1) Regression loss, used to compute the loss on landmark positions.

(2) Softmax loss, used for landmark visibility and garment category estimation.

(3) Cross-entropy loss, used for attribute prediction.

(4) Triplet loss, used for clothing-pair prediction.

FashionNet is optimized with a weighted sum of the above four losses. The training process alternates between two steps (a loss sketch follows the steps):

Step 1: landmark visibility prediction and position estimation are the main tasks and the rest are auxiliary, so the process gives L_visibility and L_landmark larger weights and the remaining losses smaller weights.

Because the tasks are correlated, this kind of joint multi-task training accelerates convergence.

Step 2: the predictions of the landmark branch are used to drive the predictions for categories, attributes, and pairs, which now receive the larger weights.

The two steps alternate iteratively until convergence.
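A hedged sketch of this alternating weighted loss, assuming PyTorch: the dictionary keys are hypothetical names, shapes are simplified, and the weight values are placeholders (the blog does not report the actual weights).

```python
# Sketch of the two-step weighted loss schedule; the weights (10.0 vs 0.1)
# are illustrative placeholders, not the paper's numbers.
import torch.nn.functional as F

def fashionnet_loss(out, tgt, step):
    """`out` and `tgt` are dicts of tensors; shapes simplified for illustration."""
    l_landmark   = F.mse_loss(out["landmarks"], tgt["landmarks"])       # (1) regression loss
    l_visibility = F.cross_entropy(out["vis_logits"], tgt["vis"])       # (2) softmax loss
    l_category   = F.cross_entropy(out["cat_logits"], tgt["cat"])       # (2) softmax loss
    l_attribute  = F.binary_cross_entropy_with_logits(
        out["attr_logits"], tgt["attrs"])                               # (3) cross-entropy loss
    l_triplet    = F.triplet_margin_loss(
        out["anchor"], out["positive"], out["negative"])                # (4) triplet loss

    if step == 1:  # landmark visibility/position are the main tasks
        w_pose, w_rest = 10.0, 0.1
    else:          # step 2: categories, attributes, and pairs dominate
        w_pose, w_rest = 0.1, 10.0
    return (w_pose * (l_landmark + l_visibility)
            + w_rest * (l_category + l_attribute + l_triplet))
```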

Experiments

The comparison methods are Where To Buy It (WTBI) and the Dual Attribute-aware Ranking Network (DARN) [later notes will cover these]. Ablations of the method itself are also tested: +100 and +500 refer to the number of attributes used, while Joints and Poselets replace the landmark predictions in Stage 1. The method improves classification considerably, whether the attribute count or the Stage-1 landmark regression is varied, which also corroborates the necessity of attribute prediction.

On the attribute task, the accuracy gap between FashionNet and the Poselets variant is small, but both far exceed the other settings. The recall numbers, however, expose a shortcoming of FashionNet: intuitively we would link "deep-V" and "V-neck" together, but the network's performance suggests it did not capture that relevance. Attribute processing may still need an embedding, especially since with thousands of attributes a one-hot label representation is relatively weak. In addition, some styles such as distressed (worn-out; I guess the visual cue is torn holes) and heart (a heart pattern in the middle of the garment) are not predicted correctly, probably because details at the center of the garment are drowned out by the features pooled around the landmarks. So perhaps attention or windowing could be applied during feature fusion; a sketch of the embedding idea follows.
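To make the embedding suggestion concrete, here is a hedged sketch (not part of FashionNet) of an attribute head that scores attributes by similarity to learned attribute vectors, so that related attributes such as "deep-V" and "V-neck" can end up close in embedding space. All names and dimensions are illustrative assumptions.

```python
# A hypothetical attribute-embedding head (not from the paper): related
# attributes can share signal through nearby embedding vectors instead of
# independent one-hot classifier columns. Dimensions are assumptions.
import torch
import torch.nn as nn

class AttributeEmbeddingHead(nn.Module):
    def __init__(self, feat_dim=4096, num_attrs=1000, emb_dim=128):
        super().__init__()
        self.attr_emb = nn.Embedding(num_attrs, emb_dim)  # one vector per attribute
        self.project = nn.Linear(feat_dim, emb_dim)       # image features -> same space

    def forward(self, fc7_fusion):
        img = nn.functional.normalize(self.project(fc7_fusion), dim=1)
        attrs = nn.functional.normalize(self.attr_emb.weight, dim=1)
        # score each attribute by cosine similarity to the image feature
        return img @ attrs.t()                            # (N, num_attrs) scores
```

Training these scores with the same binary cross-entropy as before would let correlated attributes reinforce each other through their nearby embeddings.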

