Deep learning tool: TensorFlow system architecture and high-performance programming

Source: Internet
Author: User
Tags: serialization, python, keras

On November 9, 2015, Google open-sourced its artificial intelligence platform TensorFlow, which became one of the most popular open source projects of 2015. After 12 iterations from v0.1 to v0.12, Google released TensorFlow 1.0 on February 15, 2017, and hosted the first TensorFlow Dev Summit in Mountain View, California, USA.

TensorFlow 1.0 and Dev Summit (2017) Review

Compared with previous versions, the features of TensorFlow 1.0 are mainly reflected in the following aspects:

Faster: TensorFlow 1.0 uses XLA compiler technology to improve performance and memory utilization. In benchmark tests, the single-machine Inception v3 model achieved a 7.3x speedup on 8 GPUs, and the distributed Inception v3 model achieved a 58x speedup on the GPUs of multiple machines.

More flexible: This version is fully compatible with the Keras (high-level neural networks library) API, and additionally supports the high-level tf.layers, tf.metrics, and tf.losses APIs for model building.

More production-ready: The TensorFlow Python API stabilizes in v1.0, laying a solid foundation for product compatibility.
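
As a concrete illustration of the "Faster" point, in TensorFlow 1.x the XLA just-in-time compiler can be enabled per session. The following is a minimal sketch (the configuration values are illustrative and not taken from the benchmarks above):

    import tensorflow as tf

    # Enable XLA JIT compilation for this session (TensorFlow 1.x)
    config = tf.ConfigProto()
    config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1

    a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
    b = tf.matmul(a, a)  # this operation becomes eligible for XLA compilation

    with tf.Session(config=config) as sess:
        print(sess.run(b))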

On the day of the TensorFlow 1.0 release, Google also hosted the TensorFlow 2017 Dev Summit. The agenda consisted mainly of keynote talks on the following topics:


XLA (TensorFlow, Compiled): Introduces XLA compilation technology, which minimizes computation execution time and maximizes the use of computing resources, reducing both training time and model inference time.

Hands-on TensorBoard: Introduces how to use TensorBoard to visualize TensorFlow graph models, training data, and more.

TensorFlow High-Level APIs: Describes using the layers, estimators, and canned estimators APIs to define training models (a short sketch of these APIs follows this list).

Integrating Keras & TensorFlow: Describes how to use the Keras API for model definition and training in TensorFlow.

TensorFlow at DeepMind: Describes typical uses of the TensorFlow platform at DeepMind, including applications such as AlphaGo.

Skin Cancer Image Classification: Stanford Medical School uses TensorFlow to classify photographs of skin cancer to assist medical diagnosis.

Mobile and Embedded TensorFlow: Describes how to run TensorFlow models on mobile and embedded devices, including Android, iOS, and other systems.

Distributed TensorFlow: Systematically introduces the technologies behind distributed TensorFlow and how to apply them to large-scale model training.

TensorFlow Ecosystem: Explains the TensorFlow ecosystem, including producing training data, running TensorFlow in a distributed setting, and the production process of serving models.

Serving Models in Production with TensorFlow Serving: A systematic explanation of how to serve TensorFlow models in a production environment.

ML Toolkit: Introduces TensorFlow's machine learning libraries, such as linear regression, k-means, and other algorithm models.

Sequence Models and the RNN API: Describes how to build high-performance sequence-to-sequence models and related APIs.

Wide & Deep Learning: This paper introduces how to build a comprehensive training model with Wide model and Deep model.

Magenta: Music and Art Generation: Uses deep learning models to generate musical sounds and artistic images.

Case Study: TensorFlow in Medicine - Retinal Imaging: Uses the TensorFlow machine learning platform to classify medical retinal images to assist medical diagnosis.
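
As promised in the High-Level APIs item above, here is a minimal sketch of defining a model with tf.layers and tf.losses under TensorFlow 1.x (the layer sizes are illustrative assumptions):

    import tensorflow as tf

    x = tf.placeholder(tf.float32, [None, 32], name="x")
    y = tf.placeholder(tf.float32, [None, 8], name="y")

    # Two dense layers defined through the high-level tf.layers API
    hidden = tf.layers.dense(x, 16, activation=tf.nn.tanh)
    y_out = tf.layers.dense(hidden, 8, activation=tf.nn.tanh)

    # Loss and training op through the tf.losses and tf.train APIs
    loss = tf.losses.mean_squared_error(labels=y, predictions=y_out)
    train_op = tf.train.AdamOptimizer().minimize(loss)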

TensorFlow System Architecture

As a distributed machine learning platform, TensorFlow has the architecture shown in the following figure. RPC and RDMA form the network layer, mainly responsible for transferring neural network algorithm parameters. CPU and GPU form the device layer, mainly responsible for executing the concrete operations of the neural network algorithms. The kernels are the concrete implementations of algorithm operations in TensorFlow, such as the convolution and activation operations. The distributed master builds the subgraph to be run, cuts the subgraph into multiple fragments, places different subgraph fragments on different devices, and dispatches the fragments to the executor/worker end. The executors/workers schedule the execution of subgraph operations on devices (CPUs, GPUs, etc.) and are responsible for sending and receiving the results of graph operations to and from other workers. The C API splits TensorFlow into a front end and a back end: the front end (the Python/C++/Java clients) triggers the TensorFlow back end via the C API. The training libraries and inference libs are library functions for model training and inference, for users to develop application models.

The following figure shows the internal workings of the client, master, and worker. "/job:worker/task:0" and "/job:ps/task:0" each denote an execution service within a task. "job:ps" denotes a parameter server, used to store and update model parameters. "job:worker" is used to optimize the model parameters and to send the updated parameters to the parameter server concurrently. The distributed master and worker services exist only in distributed TensorFlow; the stand-alone version of TensorFlow implements a local session, communicating within the local process instead.
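
As a concrete illustration (the host addresses below are assumptions, not from the article), such a cluster can be declared with tf.train.ClusterSpec, and each process starts a tf.train.Server for its own job/task pair:

    import tensorflow as tf

    # One parameter-server job and one worker job, matching the figure above
    cluster = tf.train.ClusterSpec({
        "ps": ["localhost:2222"],      # "/job:ps/task:0" stores and updates parameters
        "worker": ["localhost:2223"],  # "/job:worker/task:0" optimizes the parameters
    })

    # Each process starts the server for its own job/task pair, e.g. the worker:
    server = tf.train.Server(cluster, job_name="worker", task_index=0)
    server.join()  # block and serve requests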

The user writes a TensorFlow application that generates a computation graph; the client component creates a session and, through serialization, sends the graph definition to the distributed master component. In the following illustration, the client creates a graph computing the model s += w*x + b.
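
A minimal sketch of this graph in the Python API (the variable initial values are illustrative):

    import tensorflow as tf

    # The client builds the computation graph s = w * x + b
    x = tf.placeholder(tf.float32, name="x")
    w = tf.Variable(2.0, name="w")
    b = tf.Variable(1.0, name="b")
    s = tf.add(tf.multiply(w, x), b, name="s")

    # Creating a session and running it sends the serialized GraphDef to the master
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        print(sess.run(s, feed_dict={x: 3.0}))  # 2.0 * 3.0 + 1.0 = 7.0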

When the client triggers a session run, the master builds the subgraph that will be run and, according to the available devices, cuts the subgraph into multiple fragments. The following is the run subgraph built by the master:

The master then cuts the subgraph: the model parameters are grouped onto the parameter server, and the graph computation operations are grouped onto the worker. The following figure shows a feasible graph-cutting strategy:
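
This placement can also be expressed explicitly with tf.device; a hedged sketch of the partitioning described above:

    import tensorflow as tf

    # Model parameters live on the parameter server ...
    with tf.device("/job:ps/task:0"):
        w = tf.Variable(2.0, name="w")
        b = tf.Variable(1.0, name="b")

    # ... while the computation runs on the worker
    with tf.device("/job:worker/task:0"):
        x = tf.placeholder(tf.float32, name="x")
        s = tf.add(tf.multiply(w, x), b, name="s")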

The distributed master cuts edges according to the partitioning of the model parameters, inserting Send and Recv communication nodes between the tasks to exchange tensor information, as shown in the following illustration:

The distributed master then sends the subgraph fragments to the tasks via the RegisterGraph method, as shown in the following figure:

The master triggers subgraph execution via the RunGraph method, and the workers execute TensorFlow kernels on GPU/CPU devices. Between the CPU and GPU of a node, data is transferred with cudaMemcpyAsync; between GPUs of the same node, data is transferred with peer-to-peer DMA, so it is not copied through the CPU. TensorFlow uses gRPC (over TCP) and RDMA (over converged Ethernet) to transfer data between workers, as shown in the following illustration:
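
From the client's point of view, talking to such a worker over gRPC only changes the session target; a minimal sketch (the address is an assumption):

    import tensorflow as tf

    c = tf.constant(42.0)

    # The session talks to a remote master/worker over gRPC rather than in-process
    with tf.Session("grpc://localhost:2222") as sess:
        print(sess.run(c))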

High Performance Programming

The TensorFlow kernel is developed in C/C++, and client APIs are provided for C++, Python, Java, and Go. The Python API, in particular, is currently the mainstream interface for TensorFlow model development. So why would you train a model with the C++ API? This article does so for two reasons. First, when we train a model with the Python API, the Python side must continually call into the C++ interface, and the repeated interface crossings affect execution performance to some extent. More importantly, training a model on the GPU requires a large amount of memory exchange; training with the C++ API offers better runtime performance and finer control over GPU memory allocation.

The following figure shows the runtime architecture of the Python API: in each iteration of model training, the program reads batch data through the Python API and then passes the data to C++ through the TensorFlow session run interface, triggering neural network training, as shown in the following illustration:
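
A hedged sketch of this loop (the model and batch sizes are illustrative): every iteration produces a batch on the Python side and crosses into the C++ runtime through session.run():

    import numpy as np
    import tensorflow as tf

    x = tf.placeholder(tf.float32, [None, 32], name="x")
    y = tf.placeholder(tf.float32, [None, 8], name="y")
    y_out = tf.layers.dense(x, 8)
    cost = tf.reduce_sum(tf.square(y - y_out), name="cost")
    train = tf.train.AdamOptimizer().minimize(cost, name="train")

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for i in range(1000):
            batch_x = np.random.rand(100, 32)  # batch data read on the Python side
            batch_y = np.random.rand(100, 8)
            # Each run() call hands the batch to the C++ backend and returns the result
            _, c = sess.run([train, cost], feed_dict={x: batch_x, y: batch_y})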

The following figure shows the runtime architecture of the C++ API: in each iteration of model training, reading batch data through the C++ API directly triggers model training, eliminating the repeated API calls and data transfers between the different languages, as shown in the following illustration:

To train a model with the C++ API, we first need a training model, which can be written in Python. We first define the training model with the Python API and convert the graph model into a protobuf serialization file. We then load the model file through the C++ API, create the TensorFlow session, initialize the model variables, load the training data, and run the neural network training. The program architecture is shown in the following illustration:

The following is an example of a training model defined using the Python API:

    import tensorflow as tf

    with tf.Session() as sess:
        # Define placeholder tensors that receive the training data
        x = tf.placeholder(tf.float32, [None, 32], name="x")
        y = tf.placeholder(tf.float32, [None, 8], name="y")
        # Define the training model
        w1 = tf.Variable(tf.truncated_normal([32, 16], stddev=0.1))
        b1 = tf.Variable(tf.constant(0.0, shape=[16]))
        w2 = tf.Variable(tf.truncated_normal([16, 8], stddev=0.1))
        b2 = tf.Variable(tf.constant(0.0, shape=[8]))
        a = tf.nn.tanh(tf.nn.bias_add(tf.matmul(x, w1), b1))
        y_out = tf.nn.tanh(tf.nn.bias_add(tf.matmul(a, w2), b2), name="y_out")
        cost = tf.reduce_sum(tf.square(y - y_out), name="cost")
        optimizer = tf.train.AdamOptimizer().minimize(cost, name="train")
        # Define the variable initialization operation
        init = tf.initialize_variables(tf.all_variables(), name="init_all_vars_op")
        # Serialize the graph model into a protobuf file
        tf.train.write_graph(sess.graph_def, './', 'mlp.pb', as_text=False)

The following is an example of loading the protobuf graph model using the C++ API and running training:

#include "tensorflow/core/public/session.h" #include "tensorflow/core/graph/default_device.h" using namespace

TensorFlow;
    int main (int argc, char* argv[]) {//protobuf model file name std::string graph_definition = "MLP.PB";

    TensorFlow Sesssion session* session;
    Define graph Model Object Graphdef graph_def;

    Sessionoptions opts; 

    Stores the running results of Session sessions std::vector<tensor> outputs;

    #加载Protobuf模型文件到图模型对象中 Tf_check_ok (Readbinaryproto (ENV::D efault (), graph_definition, &graph_def));

    The default is to perform the training operation of the model on GPU 0 graph::setdefaultdevice ("/gpu:0", &graph_def);
    Set GPU video Memory usage parameters opts.config.mutable_gpu_options ()->set_per_process_gpu_memory_fraction (0.5);

    Opts.config.mutable_gpu_options ()->set_allow_growth (true);

    Create TensorFlow session TF_CHECK_OK (NewSession (opts, &session));

    Load Diagram object in Session TF_CHECK_OK (Session->create (graph_def));

    Performs model parameter initialization operations TF_CHECK_OK (Session->run ({}, {}, {"Init_all_vars_op"}, nullptr)); SetSemantic model input data, including data type and dimension information Tensor x (Dt_float, Tensorshape ({100, 32}));

    Tensor y (dt_float, Tensorshape ({100, 8}));
    Converts the tensor to a matrix and initializes the tensor data auto _xtensor = x.matrix<float> ();
    Auto _ytensor = y.matrix<float> ();
    _xtensor.setrandom ();

    _ytensor.setrandom (); for (int i = 0; i < ++i) {//Execute model training operation, {{"X", X}, {"Y", Y}} represents input data tensor name and tensor object; {"Cost}" indicates the name of the operation to get the output value; &A  

        Mp;outputs represents the Tensor object Tf_check_ok (Session->run ({{"X", X}, {"Y", Y}}, {"Cost"}, {}, &outputs) that is returned after the "cost" operation is performed;
        Gets the result of the operation of the "cost" action float = outputs[0].scalar<float> (0);

        Std::cout << "Cost:" << cost << Std::endl; Perform "Train" Operations Tf_check_ok (Session->run ({{"X", X}, {"Y", Y}}, {}, {"Train"}, nullptr));
    Train outputs.clear ();
    //Close Session and Delete session object Session->close ();
    Delete session;
return 0; }

Once the C++ program is written, compilation requires the TensorFlow header files, which the open source distribution has already collected for us under /usr/lib/python2.7/site-packages/tensorflow/include. When compiling and running, you also need to link libtensorflow_cc.so, which can be built as follows: bazel build -c opt //tensorflow:libtensorflow_cc.so --copt=-m64 --linkopt=-m64 --spawn_strategy=standalone --genrule_strategy=standalone --verbose_failures. For details, refer to the official compilation documentation for the TensorFlow source code.

Summary

This article first reviewed the main new features of TensorFlow 1.0 and the main agenda of the TensorFlow 2017 Dev Summit. To date, TensorFlow's GitHub repository has reached 51,000+ stars, 24,000+ forks, and 15,000+ commits. With the continuous release of new TensorFlow versions and the steady addition of new features, TensorFlow has become more flexible to use, faster to run, and more production-ready, and is now a mainstream deep learning platform.

The article then introduced the TensorFlow system architecture, including the concepts and operating modes of the client, master, worker, and kernel components, which make it a machine learning platform suited to large-scale distributed training. As can be seen from the system architecture, the TensorFlow kernel is developed in C/C++; when the Python API is used to train a model, Python must repeatedly call into the C++ interface, and these repeated interface crossings affect execution performance to some extent. If you have high-performance computing needs, you can try the approach recommended in the High Performance Programming section of this article.

Reference: http://www.tensorflow.org

Original address: http://www.infoq.com/cn/articles/tensorflow-architecture-and-programming
