Learning Notes TF062: TensorFlow Linear Algebra Compilation Framework XLA
XLA (Accelerated Linear Algebra) is a domain-specific compiler for linear algebra that optimizes TensorFlow computations. XLA works through just-in-time (JIT) compilation or ahead-of-time (AOT) compilation, which makes hardware acceleration easier. XLA is still experimental. https://www.tensorflow.org/versions/master/experimental/xla.
XLA advantages. As a compiler specialized for linear algebra, XLA improves the execution speed of TensorFlow computations (it shortens the execution time of short-lived subgraphs and fuses pipelined operations to cut overhead), reduces memory usage (it analyzes and plans memory requirements, eliminating many intermediate result buffers), reduces reliance on custom ops (automatic fusion of low-level ops can match the performance of hand-fused custom ops), reduces footprint on mobile (AOT compilation of subgraphs is done in advance, and the generated header files can be linked directly into other programs), and improves portability (writing a new backend for new hardware lets TensorFlow run on that hardware without large code changes).
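As an illustration of the fusion benefit (a minimal sketch, not from the source, using standard TensorFlow 1.x ops), a small graph like the one below normally runs matmul, add, and relu as separate kernels with intermediate buffers written between them; with XLA JIT enabled, the element-wise ops can be fused into a single compiled kernel, reducing memory traffic:
import tensorflow as tf

x = tf.placeholder(tf.float32, shape=[None, 784])
w = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))
# Three logical ops; XLA can fuse the add and relu into the matmul's output
# computation instead of materializing each intermediate tensor separately.
y = tf.nn.relu(tf.matmul(x, w) + b)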
How XLA works. XLA builds on the LLVM compiler framework, written in C++, which optimizes compile time, link time, run time, and idle time for arbitrary programming languages. The front end parses, validates, and diagnoses errors in the input code, then translates the parsed code into the LLVM intermediate representation (IR). The IR goes through analysis and optimization passes that improve the code and is then sent to a code generator, which produces native machine code. This is LLVM's three-phase design, and its key piece is LLVM IR, the form in which the compiler represents code. Front ends: C -> Clang C/C++/ObjC front end, Fortran -> llvm-gcc front end, Haskell -> GHC front end; all emit LLVM IR -> LLVM optimizer -> LLVM IR. Back ends: LLVM X86 backend -> X86, LLVM PowerPC backend -> PowerPC, LLVM ARM backend -> ARM. http://www.aosabook.org/en/llvm.html.
XLA's input language is HLO IR. XLA takes graphs defined in HLO and compiles them into machine instructions for various architectures. Compilation pipeline: XLA HLO -> target-independent optimization and analysis -> XLA HLO -> XLA backend -> target-dependent optimization and analysis -> target-specific code generation. XLA first performs target-independent optimizations and analyses (common subexpression elimination (CSE), target-independent operation fusion, and buffer analysis for allocating runtime memory). XLA then hands the HLO computation to a backend, which performs further HLO-level optimizations, this time using target-specific information. For example, the XLA GPU backend performs fusion that suits the GPU programming model and decides how to partition the computation into streams. Finally, the backend generates target-specific code. The XLA CPU and GPU backends use LLVM for representation, optimization, and code generation; they emit LLVM IR that represents the XLA HLO computation. XLA currently supports JIT compilation on x86-64 and NVIDIA GPUs, and AOT compilation for x86-64 and ARM. AOT is better suited to mobile and embedded deep learning applications.
JIT compilation. Compile and run TensorFlow computation graphs with XLA. XLA fuses multiple operations (kernels) into a small number of compiled kernels, which reduces memory bandwidth and improves performance. There are two ways to run TensorFlow computations through XLA JIT: 1. enable JIT compilation at the session level on CPU and GPU devices; 2. place operators on the XLA_CPU or XLA_GPU devices.
Enabling JIT compilation at the session level compiles all operators that can be compiled into XLA computations.
config = tf.ConfigProto()
config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1
sess = tf.Session(config=config)
JIT compilation can also be enabled manually for one or more operators. The attribute _XlaCompile=true marks an operator for compilation.
import numpy as np
import tensorflow as tf

jit_scope = tf.contrib.compiler.jit.experimental_jit_scope
x = tf.placeholder(np.float32)
with jit_scope():
  y = tf.add(x, x)
Placing operators on XLA devices. Valid devices are XLA_CPU and XLA_GPU:
with tf.device("/job:localhost/replica:0/task:0/device:XLA_GPU:0"):
  output = tf.add(input1, input2)
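To check which XLA devices are available in a given build (a hedged side note, not from the source; device_lib is a standard TensorFlow 1.x utility), the local device list can be inspected:
from tensorflow.python.client import device_lib

# Prints entries such as /device:CPU:0 and /device:XLA_CPU:0, plus
# /device:XLA_GPU:0 when a GPU and the XLA GPU backend are present.
print([d.name for d in device_lib.list_local_devices()])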
MNIST softmax implementation with JIT compilation. https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/tutorials/mnist/mnist_softmax_xla.py.
Run without XLA.
python mnist_softmax_xla.py --xla=false
The run generates the timeline file timeline.ctf.json. Open it with the Chrome trace event viewer at chrome://tracing to display the timeline. The GPU is listed in the panel on the left, where the time spent in each operator can be inspected.
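The timeline file is produced by tracing one session run with full trace metadata and serializing the collected step stats (a minimal sketch of the mechanism, assuming a session sess, a training op train_step, and a feed dict feed already exist; the complete listing below does the same inside its training loop):
from tensorflow.python.client import timeline

run_metadata = tf.RunMetadata()
sess.run(train_step,
         feed_dict=feed,
         options=tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE),
         run_metadata=run_metadata)
# Convert the collected step stats into the Chrome trace format.
trace = timeline.Timeline(step_stats=run_metadata.step_stats)
with open('timeline.ctf.json', 'w') as trace_file:
  trace_file.write(trace.generate_chrome_trace_format())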
Use XLA to train the model.
TF_XLA_FLAGS=--xla_generate_hlo_graph=.* python mnist_softmax_xla.py
The XLA framework is still experimental. The main application scenarios for AOT compilation are memory-constrained environments such as embedded devices, mobile phones, and the Raspberry Pi.
Complete code of mnist_softmax_xla.py:

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import argparse
import sys

import tensorflow as tf

from tensorflow.examples.tutorials.mnist import input_data
from tensorflow.python.client import timeline

FLAGS = None


def main(_):
  # Import data
  mnist = input_data.read_data_sets(FLAGS.data_dir, one_hot=True)

  # Create the model
  x = tf.placeholder(tf.float32, [None, 784])
  w = tf.Variable(tf.zeros([784, 10]))
  b = tf.Variable(tf.zeros([10]))
  y = tf.matmul(x, w) + b

  # Define loss and optimizer
  y_ = tf.placeholder(tf.float32, [None, 10])

  # The raw formulation of cross-entropy,
  #   tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(tf.nn.softmax(y)),
  #                                 reduction_indices=[1]))
  # can be numerically unstable.
  # So here we use tf.nn.softmax_cross_entropy_with_logits on the raw
  # outputs of 'y', and then average across the batch.
  cross_entropy = tf.reduce_mean(
      tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=y))
  train_step = tf.train.GradientDescentOptimizer(0.5).minimize(cross_entropy)

  config = tf.ConfigProto()
  jit_level = 0
  if FLAGS.xla:
    # Turn on XLA JIT compilation.
    jit_level = tf.OptimizerOptions.ON_1

  config.graph_options.optimizer_options.global_jit_level = jit_level
  run_metadata = tf.RunMetadata()
  sess = tf.Session(config=config)
  tf.global_variables_initializer().run(session=sess)

  # Train
  train_loops = 1000
  for i in range(train_loops):
    batch_xs, batch_ys = mnist.train.next_batch(100)

    # Create a timeline for the last loop and export to json to view with
    # chrome://tracing/.
    if i == train_loops - 1:
      sess.run(train_step,
               feed_dict={x: batch_xs, y_: batch_ys},
               options=tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE),
               run_metadata=run_metadata)
      trace = timeline.Timeline(step_stats=run_metadata.step_stats)
      with open('timeline.ctf.json', 'w') as trace_file:
        trace_file.write(trace.generate_chrome_trace_format())
    else:
      sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})

  # Test trained model
  correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1))
  accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
  print(sess.run(accuracy,
                 feed_dict={x: mnist.test.images,
                            y_: mnist.test.labels}))
  sess.close()


if __name__ == '__main__':
  parser = argparse.ArgumentParser()
  parser.add_argument(
      '--data_dir',
      type=str,
      default='/tmp/tensorflow/mnist/input_data',
      help='Directory for storing input data')
  parser.add_argument(
      '--xla', type=bool, default=True, help='Turn xla via JIT on')
  FLAGS, unparsed = parser.parse_known_args()
  tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
References:
Analysis and Practice of TensorFlow Technology
Recommendations for machine learning job opportunities in Shanghai are welcome; contact me: qingxingfengzi