Online Prediction for Deep Learning Based on TensorFlow Serving

I. Preface

As deep learning continues to advance in areas such as image recognition, natural language processing, and ad click-through rate (CTR) estimation, many teams are exploring how to apply deep learning techniques at the business level. For ad CTR prediction in particular, new models keep emerging: Wide & Deep [1], Deep & Cross Network [2], DeepFM [3], xDeepFM [4]; many Meituan technical blog posts have also introduced them in detail. However, when an offline model needs to go online, a variety of new problems appear: can the offline model's performance meet online requirements, how should model prediction be embedded into the existing engineering system, and so on. Only with an accurate understanding of the deep learning framework can deep learning be deployed online in a way that is compatible with the existing engineering system and meets online performance requirements.

In this article, we first introduce the User Growth team's business scenario and offline training process, then focus on the end-to-end process of deploying the WDL (Wide & Deep Learning) model online with TensorFlow Serving and how we optimized online serving performance. We hope this is helpful to you.

II. Business Scenario and Offline Process

2.1 Business Scenario

In the advertising scenario, up to several hundred candidate ads may be recalled for each user. Based on the user's features and each ad's features, the model predicts the click-through rate of every ad so they can be ranked. Because the ad exchange (ADX) imposes a timeout limit on the DSP, the average response time of our ranking module must be kept within 10ms, and the Meituan DSP has to participate in real-time bidding based on the predicted CTR, so the performance requirements on model prediction are quite high.

2.2 Offline Training

For offline data, we use Spark to generate TFRecord, TensorFlow's [5] native data format, to speed up data reading.
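As a rough illustration of how such TFRecord files are typically consumed during training, the sketch below builds a tf.data input pipeline in TF 1.x style; the feature names, shapes, and file pattern are placeholders rather than the features actually used here.

import tensorflow as tf

def input_fn(file_pattern, batch_size=512):
    # Hypothetical feature spec; the real features are defined by the offline Spark job.
    feature_spec = {
        "channel": tf.FixedLenFeature([], tf.string),
        "price": tf.FixedLenFeature([], tf.float32),
        "label": tf.FixedLenFeature([], tf.float32),
    }

    def parse(serialized):
        features = tf.parse_single_example(serialized, feature_spec)
        label = features.pop("label")
        return features, label

    files = tf.data.Dataset.list_files(file_pattern)
    dataset = (tf.data.TFRecordDataset(files)
               .map(parse, num_parallel_calls=4)
               .shuffle(10000)
               .batch(batch_size)
               .prefetch(1))
    return dataset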

For the model, we use the classic Wide & Deep architecture. The features include user-dimension features, scene-dimension features, and item-dimension features. The wide part has more than 80 input features and the deep part more than 60; after embedding, the input layer is about 600 dimensions, followed by three fully connected layers of 256 units each. The model has about 350,000 parameters in total, and the exported model file is about 11MB.
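This shape maps naturally onto TensorFlow's canned Wide & Deep estimator; the sketch below is only illustrative, and the two feature columns are placeholders standing in for the 80+ wide and 60+ deep inputs described above.

import tensorflow as tf

# Placeholder feature columns; the real model has 80+ wide and 60+ deep inputs.
wide_columns = [tf.feature_column.categorical_column_with_hash_bucket("channel", 100)]
deep_columns = [tf.feature_column.embedding_column(
    tf.feature_column.categorical_column_with_hash_bucket("item_id", 100000),
    dimension=8)]

estimator = tf.estimator.DNNLinearCombinedClassifier(
    model_dir="/tmp/wdl_model",            # hypothetical path
    linear_feature_columns=wide_columns,   # wide part
    dnn_feature_columns=deep_columns,      # deep part
    dnn_hidden_units=[256, 256, 256])      # three equal-width fully connected layers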

For offline training, we use TensorFlow's distributed synchronous training with backup workers [6], which addresses both the gradient staleness of asynchronous updates and the slowness of fully synchronous updates.
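In TF 1.x this scheme can be expressed with tf.train.SyncReplicasOptimizer: setting replicas_to_aggregate below the total number of workers effectively turns the extra workers into backups whose slow updates are dropped. A minimal sketch, with illustrative worker counts and a toy loss:

import tensorflow as tf

num_workers = 10        # illustrative worker count
num_backup_workers = 1  # updates from the slowest worker are simply dropped

global_step = tf.train.get_or_create_global_step()
w = tf.get_variable("w", shape=[10])
loss = tf.reduce_sum(tf.square(w))  # toy loss, just to make the sketch complete

opt = tf.train.SyncReplicasOptimizer(
    tf.train.AdagradOptimizer(0.01),
    replicas_to_aggregate=num_workers - num_backup_workers,  # wait for N-1 gradients
    total_num_replicas=num_workers)
train_op = opt.minimize(loss, global_step=global_step)

# The optimizer's hook must be passed to the training session so that the
# synchronization queues are initialized on the chief worker.
sync_hook = opt.make_session_run_hook(is_chief=True)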

For parameter placement on the distributed parameter servers (PS), we use the GreedyLoadBalancing strategy, which assigns parameters according to their estimated sizes instead of round-robin placement, so that the load on each PS is balanced.
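A minimal sketch of this placement strategy in TF 1.x, assuming an illustrative number of parameter servers; the variable created inside the device scope is placed on whichever PS currently has the smallest estimated load:

import tensorflow as tf

num_ps = 4  # illustrative number of parameter servers

ps_strategy = tf.contrib.training.GreedyLoadBalancingStrategy(
    num_tasks=num_ps,
    load_fn=tf.contrib.training.byte_size_load_fn)  # estimate load by variable byte size

with tf.device(tf.train.replica_device_setter(
        ps_tasks=num_ps, ps_strategy=ps_strategy)):
    # Variables created here go to the least-loaded PS instead of round-robin.
    embedding = tf.get_variable("item_embedding", shape=[100000, 16])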

For computing devices, we found that training with CPUs only was faster than using GPUs. The main reason is that although GPU computation itself may be faster, it adds the overhead of transferring data between CPU and GPU; when the model computation is not too complex, using CPUs alone works better.

We also use the high-level Estimator API to encapsulate data reading, distributed training, model validation, and model export for TensorFlow Serving.
The main benefits of using Estimator are listed below (a minimal usage sketch follows the list):

    1. Single-machine training and distributed training can be switched very easily, and little code needs to change when using different devices: CPU, GPU, or TPU.
    2. The Estimator framework is clear and facilitates communication between developers.
    3. Beginners can also directly use pre-built Estimator models: DNN models, XGBoost models, linear models, and so on.
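The sketch below shows what this encapsulation roughly looks like; model_fn and input_fn are assumed to be defined elsewhere (input_fn in the spirit of the TFRecord sketch above), and the feature spec and paths are placeholders.

import tensorflow as tf

estimator = tf.estimator.Estimator(model_fn=model_fn, model_dir="/tmp/wdl_model")

train_spec = tf.estimator.TrainSpec(
    input_fn=lambda: input_fn("hdfs://path/train-*.tfrecord"), max_steps=1000000)
eval_spec = tf.estimator.EvalSpec(
    input_fn=lambda: input_fn("hdfs://path/eval-*.tfrecord"))

# Single-machine vs. distributed behaviour is controlled by the TF_CONFIG
# environment variable; this code does not change.
tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)

# Export a SavedModel for TensorFlow Serving.
feature_spec = {"channel": tf.FixedLenFeature([], tf.string)}  # placeholder
serving_input_fn = tf.estimator.export.build_parsing_serving_input_receiver_fn(feature_spec)
estimator.export_savedmodel("hdfs://path/exported", serving_input_fn)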
III. TensorFlow Serving and Performance Optimization

3.1 Introduction to TensorFlow Serving

TensorFlow Serving is a high-performance, open-source serving library for machine learning models. It deploys trained models for online serving and accepts external calls through a gRPC interface. TensorFlow Serving supports hot model updates and automatic model version management, making it very flexible.

The figure below shows the overall architecture of TensorFlow Serving. The client keeps sending requests to the Manager, which loads and updates models according to the version management policy and returns results computed with the latest model to the client.



TensorFlow Serving architecture (image from the official TensorFlow Serving documentation)

Inside Meituan, the data platform provides a dedicated TensorFlow Serving cluster that runs in a distributed fashion on YARN; it periodically scans an HDFS path to check model versions and updates them automatically. Of course, TensorFlow Serving can also be installed on any local machine for testing.

In our off-site advertising scenario, every time a user arrives, the online request side converts the user's information and the features of the roughly 100 recalled ads into the model's input format and sends them to TensorFlow Serving as a single batch. After receiving the request, TensorFlow Serving computes the CTR estimates and returns them to the request side.
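For reference, the sketch below shows what such a batched Predict call looks like over gRPC from Python (the production request side is C++); the host address, model name, and the dummy Example are placeholders for illustration.

import grpc
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

channel = grpc.insecure_channel("tf-serving-host:8500")   # hypothetical address
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

# A dummy tf.train.Example standing in for one (user, ad) feature row.
example = tf.train.Example(features=tf.train.Features(feature={
    "channel": tf.train.Feature(bytes_list=tf.train.BytesList(value=[b"channel 1"])),
}))
serialized = [example.SerializeToString()] * 100  # ~100 recalled ads in one batch

request = predict_pb2.PredictRequest()
request.model_spec.name = "wdl"                        # hypothetical model name
request.model_spec.signature_name = "serving_default"
request.inputs["examples"].CopyFrom(
    tf.make_tensor_proto(serialized, dtype=tf.string))

response = stub.Predict(request, timeout=0.01)  # stay within the 10ms budget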

When the first version of TensorFlow Serving was deployed, with QPS around 200, packing the request took about 5ms, network overhead was fixed at around 3ms, and model prediction alone took about 10ms; the TP50 latency of the whole pipeline was about 18ms, which did not meet the online requirement. The rest of this section describes our performance optimization process in detail.

3.2 Performance Optimization

3.2.1 Request-Side Optimization

The main optimization on the online request side is to process the 100 ads in parallel. We use OpenMP multi-threading to process the data in parallel, which reduces the request-packing time from about 5ms to about 2ms.

#pragma omp parallel for
for (int i = 0; i < request->ad_feat_size(); ++i) {
    tensorflow::Example example;
    data_processing();
}
3.2.2 Optimizing the Model-Building Ops

Before optimization, the model's input was raw, unprocessed data. For example, the channel feature might arrive as strings such as 'channel 1' or 'channel 2', and the one-hot encoding was done inside the model.

The original model used a large number of high-level tf.feature_column APIs to transform the data into one-hot and embedding formats. The advantage of tf.feature_column is that no preprocessing of the raw data is needed at the input: many common feature transformations can be done inside the model with the feature_column API. For example, tf.feature_column.bucketized_column can bucketize a feature, and tf.feature_column.crossed_column can cross categorical features. But this puts the burden of feature processing inside the model.
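For reference, the pre-optimization style looked roughly like the following; the column names, vocabulary, and bucket boundaries are made up for illustration.

import tensorflow as tf

# Raw strings and numbers are transformed inside the model graph.
channel = tf.feature_column.categorical_column_with_vocabulary_list(
    "channel", ["channel 1", "channel 2"])               # one-hot from raw strings
price_bucket = tf.feature_column.bucketized_column(
    tf.feature_column.numeric_column("price"),
    boundaries=[10.0, 50.0, 100.0])                      # bucketize a numeric feature
channel_x_price = tf.feature_column.crossed_column(
    [channel, price_bucket], hash_bucket_size=1000)      # cross categorical features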

To further analyze the time spent in feature_column, we used the tf.profiler tool to profile the entire offline training process. Using tf.profiler under the Estimator framework is very convenient: only one line of code needs to be added.

with tf.contrib.tfprof.ProfileContext(job_dir + '/tmp/train_dir') as pctx:
    estimator = tf.estimator.Estimator(model_fn=get_model_fn(job_dir),
                                       config=run_config,
                                       params=hparams)

The tf.profiler breakdown of the forward pass shows that feature processing with the feature_column API takes a large share of the time.



Profiler recording before optimization: the forward pass takes 55.78% of total training time, mostly spent in the feature_column ops that preprocess the raw data

To reduce the time spent on feature processing inside the model, when processing the offline data we map all raw string features to one-hot indices in advance, persist the mapping in a local feature_index file, and reuse the same file offline and online. This is equivalent to moving the one-hot computation out of the model and replacing it with an O(1) dictionary lookup. At the same time, when building the model, we use lower-level APIs with guaranteed performance in place of high-level APIs such as feature_column. The figure below shows the share of training time taken by the forward pass after this optimization; it is clearly much lower than before.
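The sketch below illustrates the idea under simple assumptions: the feature_index mapping here is a tiny in-memory dictionary, and the model side receives integer ids directly and only performs an embedding lookup.

import tensorflow as tf

# Offline: raw string -> integer id, persisted to a feature_index file and shared
# by the trainer and the request side (shown here as an in-memory dict for brevity).
feature_index = {"channel 1": 0, "channel 2": 1}  # illustrative

# Model side: inputs already arrive as integer ids, so instead of feature_column
# preprocessing the graph only needs an O(1) embedding lookup.
channel_id = tf.placeholder(tf.int64, shape=[None], name="channel_id")
channel_embedding = tf.get_variable("channel_embedding",
                                    shape=[len(feature_index), 8])
channel_vec = tf.nn.embedding_lookup(channel_embedding, channel_id)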



Profiler recording after optimization: the forward pass takes 39.53% of total training time

3.2.3 XLA and JIT Compilation Optimization

TensorFlow uses a directed dataflow graph to express the entire computation: nodes represent operations (ops), data is represented as tensors, and tensors flow along the directed edges between nodes, so the whole graph is a directed dataflow graph.

XLA (Accelerated Linear Algebra) is a compiler that specifically optimizes the linear algebra operations in TensorFlow; it is used when JIT (just-in-time) compilation mode is turned on. The whole compilation process is as follows:



The TensorFlow graph compilation process

First, TensorFlow optimizes the whole computation graph and prunes redundant computations. The optimized graph is then lowered to HLO (High Level Optimizer) primitive operations; the XLA compiler performs further optimizations on the HLO primitives and finally, through LLVM IR, generates machine code for the different backend devices.

Using JIT helps LLVM IR generate more efficient machine code from the HLO primitives; in particular, several fusible HLO primitives are fused into a single, more efficient computation. However, JIT compilation happens while the code is running, which also means there is some additional compilation overhead at run time.
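In TF 1.x, JIT compilation can be enabled for a whole session through the session config; the sketch below turns it on and runs a small fusible matmul + relu as an example.

import tensorflow as tf

config = tf.ConfigProto()
# Turn on XLA JIT compilation for the whole session.
config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1

x = tf.random_normal([1024, 600])
w = tf.random_normal([600, 256])
y = tf.nn.relu(tf.matmul(x, w))  # matmul + relu can be fused by XLA

with tf.Session(config=config) as sess:
    sess.run(y)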



Effect of network structure and batch size on JIT performance [7]

The figure shows the ratio of run time with JIT compilation to run time without it for different network structures and batch sizes. The performance gain is obvious for larger batch sizes, while the number of layers and the number of neurons have little effect on the JIT speedup.

In practical applications, the actual effect will vary with the network structure, model parameters, hardware, and so on.

3.2.4 Final Performance

After the series of performance optimizations described above, model prediction time dropped from 10ms to 1.1ms and request-packing time dropped from 5ms to 2ms. The whole pipeline, from packing and sending the request to receiving the result, takes about 6ms.



Model prediction latency and related metrics: QPS 1308, TP50 1.1ms, TP999 3.0ms. The four panels show: the latency distribution (most requests take less than 1ms), the number of requests per minute (about 80,000, i.e. a QPS of about 1308), the average latency of 1.1ms, and a success rate of 100%

3.3 Latency Spikes During Model Switching

Monitoring showed that when the model is updated, a large number of requests time out. As shown below, every update causes many request timeouts and has a large impact on the system. Analyzing the TensorFlow Serving logs and source code revealed that the timeouts have two main causes: first, the threads that update and load the model share a thread pool with the threads that handle serving requests, so requests cannot be processed while the model is being switched; second, after the model is loaded, the computation graph is initialized lazily, so the first requests have to wait for graph initialization.



Model switching causes request timeouts

The first problem is mainly caused by the configuration of the model load/unload thread pools in the source code:

uint32 num_load_threads = 0;
uint32 num_unload_threads = 0;

These two parameters default to 0, which means no separate thread pool is used and loading/unloading runs in the same thread as the serving Manager. Setting them to 1 effectively resolves the issue.

The core operation of model loading is RestoreOp, which reads the model file from storage, allocates memory, looks up the corresponding Variables, and so on; it is executed by calling the session's run method. By default, all sessions in a process share one thread pool for their operations, so during model loading the load operations and the serving requests compete for the same pool, which delays serving requests. The workaround is to configure multiple thread pools via the configuration file and specify that load operations use a separate thread pool.
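A minimal sketch of the underlying idea, assuming the field names in tensorflow/core/protobuf/config.proto: a ConfigProto can declare several inter-op thread pools so that load-time Run() calls can be isolated from serving traffic. How this ConfigProto is wired into TensorFlow Serving's session bundle configuration is deployment-specific and not shown here.

import tensorflow as tf

session_config = tf.ConfigProto()

serving_pool = session_config.session_inter_op_thread_pool.add()
serving_pool.num_threads = 8   # pool 0: used for Predict requests

load_pool = session_config.session_inter_op_thread_pool.add()
load_pool.num_threads = 2      # pool 1: reserved for RestoreOp during model loading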

For the second problem, the first run of the model takes a long time, so we warm the model up once it is loaded, which avoids the performance hit on the first requests. The warm-up method is to take the input data types from the signature set when the model was exported, construct dummy input data accordingly, and run it once to initialize the model.
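One way to implement such a warm-up with stock TensorFlow Serving is to ship a dummy request in the SavedModel's assets.extra/tf_serving_warmup_requests file, which Serving replays right after loading a new version; the sketch below assumes that mechanism, and the model name, path, and dummy Example are placeholders.

import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_log_pb2

# Build a dummy request matching the exported signature.
dummy_example = tf.train.Example(features=tf.train.Features(feature={
    "channel": tf.train.Feature(bytes_list=tf.train.BytesList(value=[b"channel 1"])),
}))
request = predict_pb2.PredictRequest()
request.model_spec.name = "wdl"                        # hypothetical model name
request.model_spec.signature_name = "serving_default"
request.inputs["examples"].CopyFrom(
    tf.make_tensor_proto([dummy_example.SerializeToString()], dtype=tf.string))

# Write it where TensorFlow Serving looks for warm-up data inside the SavedModel.
log = prediction_log_pb2.PredictionLog(
    predict_log=prediction_log_pb2.PredictLog(request=request))
warmup_path = "exported/1/assets.extra/tf_serving_warmup_requests"  # hypothetical path
with tf.python_io.TFRecordWriter(warmup_path) as writer:
    writer.write(log.SerializeToString())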

With the above two optimizations, the request-latency problem after model switching is well resolved. As shown below, the latency spike during model switching drops from the original 84ms to about 4ms.



After optimization, the latency spike during model switching is greatly reduced

IV. Summary and Outlook

This article described the User Growth team's exploration of putting deep learning online with TensorFlow Serving, how we located, analyzed, and solved the performance problems, and how we finally achieved an online service with high performance and strong stability that supports a variety of deep learning models.

With a complete offline training and online prediction framework in place, we will speed up strategy iteration. On the model side, we can quickly try new models and explore combining reinforcement learning with bidding; on the performance side, together with engineering requirements, we will further explore TensorFlow's graph optimization, low-level operators, and operation fusion. In addition, TensorFlow Serving's prediction capability can be used for model analysis; Google has also launched the What-If Tool on top of it to help model developers analyze models in depth. Finally, combining model analysis, we will re-examine our data and features.

Reference documents

[1] Cheng, H. T., Koc, L., Harmsen, J., Shaked, T., Chandra, T., Aradhye, H., ... & Anil, R. (2016, September). Wide & Deep Learning for Recommender Systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems (pp. 7-10). ACM.
[2] Wang, R., Fu, B., Fu, G., & Wang, M. (2017, August). Deep & Cross Network for Ad Click Predictions. In Proceedings of the ADKDD'17 (p. 12). ACM.
[3] Guo, H., Tang, R., Ye, Y., Li, Z., & He, X. (2017). DeepFM: A Factorization-Machine based Neural Network for CTR Prediction. arXiv preprint arXiv:1703.04247.
[4] Lian, J., Zhou, X., Zhang, F., Chen, Z., Xie, X., & Sun, G. (2018). xDeepFM: Combining Explicit and Implicit Feature Interactions for Recommender Systems. arXiv preprint arXiv:1803.05170.
[5] Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., ... & Kudlur, M. (2016, November). TensorFlow: A System for Large-Scale Machine Learning. In OSDI (Vol. 16, pp. 265-283).
[6] Goyal, P., Dollár, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., ... & He, K. (2017). Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour. arXiv preprint arXiv:1706.02677.
[7] Neill, R., Drebes, A., & Pop, A. (2018). Performance Analysis of Just-in-Time Compilation for Training TensorFlow Multi-Layer Perceptrons.

About the author

Zhong da Nyingchi graduated with a degree in data science from the University of Rochester in 2017, previously worked at Stentor Technology in the California Bay Area, and joined Meituan in 2018. He is mainly responsible for applying deep learning and reinforcement learning to the User Growth team's business scenarios.

Codex joined Meituan in 2015. He is the algorithm lead of the User Growth team in the Meituan Platform and Hotel & Travel Business Group and previously worked at Alibaba. He focuses on increasing the number of active users on the Meituan platform through machine learning; as technical lead, he has led the algorithm work for projects such as Meituan's DSP advertising and off-site user acquisition, effectively improving marketing efficiency and reducing marketing costs.

Stability joined Meituan-Dianping in 2015 and has been working in the offline computing direction on YARN resource scheduling and building the GPU computing platform.

Recruitment

Meituan's DSP is a core business direction in online digital marketing. By joining us, you can take part in building and optimizing a marketing platform that reaches hundreds of millions of users and guides their lifestyle and entertainment decisions. You will face the challenges of precise, efficient, low-cost marketing, and have the opportunity to work with cutting-edge AI algorithm architectures and big data solutions in computational advertising. Working with the company's marketing technology team, you will help build the traffic operation ecosystem and support the continued rapid growth of businesses such as hotel and travel, food delivery, in-store, ride-hailing, and finance. We welcome people with passion, ideas, experience, and ability to work alongside us! You will participate in building Meituan-Dianping's ad delivery system and, based on large-scale user behavior data, optimize online advertising algorithms to improve DAU and ROI and increase the relevance and effectiveness of online ads. For inquiries, email wuhongjie#meituan.com.


