Learning notes TF049: TensorFlow model storage and loading, queue threads, loading data, custom operations


Checkpoint file: the extension is .ckpt, produced by creating a tf.train.Saver object and calling saver.save(). It contains the weights and other variables defined in the program, but not the graph structure. A program that wants to reuse it must re-create the graph structure to tell TensorFlow what to do with the weights.
Graph proto file: binary, extension .pb, saved with tf.train.write_graph(); it contains only the graph structure, not the weights. tf.import_graph_def loads such a graph.

Model storage: create a tf.train.Saver() to save the variables and specify the storage location; the extension is .ckpt.

A neural network with two fully connected layers and one output layer is trained on the MNIST dataset, and the trained model is stored.
Load the data and define the model:

import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data

# Load the data
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)
trX, trY, teX, teY = mnist.train.images, mnist.train.labels, mnist.test.images, mnist.test.labels
X = tf.placeholder("float", [None, 784])
Y = tf.placeholder("float", [None, 10])
# Initialize the weight parameters
w_h = init_weights([784, 625])
w_h2 = init_weights([625, 625])
w_o = init_weights([625, 10])

# Weight initialization function
def init_weights(shape):
    return tf.Variable(tf.random_normal(shape, stddev=0.01))

# Define the model
def model(X, w_h, w_h2, w_o, p_keep_input, p_keep_hidden):
    # First fully connected layer
    X = tf.nn.dropout(X, p_keep_input)
    h = tf.nn.relu(tf.matmul(X, w_h))

    h = tf.nn.dropout(h, p_keep_hidden)
    # Second fully connected layer
    h2 = tf.nn.relu(tf.matmul(h, w_h2))
    h2 = tf.nn.dropout(h2, p_keep_hidden)

    return tf.matmul(h2, w_o)  # output the predicted values

# Define the loss function
p_keep_input = tf.placeholder("float")
p_keep_hidden = tf.placeholder("float")
py_x = model(X, w_h, w_h2, w_o, p_keep_input, p_keep_hidden)
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=py_x, labels=Y))
train_op = tf.train.RMSPropOptimizer(0.001, 0.9).minimize(cost)
predict_op = tf.argmax(py_x, 1)

Train the model and store it:

import os

# Define the storage path
ckpt_dir = "./ckpt_dir"
if not os.path.exists(ckpt_dir):
    os.makedirs(ckpt_dir)

Define a counter to record the number of training epochs:

# Counter variable; trainable=False means it is not trained
global_step = tf.Variable(0, name='global_step', trainable=False)

Call tf.train.Saver() after all variables have been defined; it saves and restores the variables. Variables defined after this point are not stored:

# Call tf.train.Saver after declaring all variables
saver = tf.train.Saver()
# Variables defined from here on are not stored
non_storable_variable = tf.Variable(777)

Train and store the model:

with tf.Session() as sess:
    tf.initialize_all_variables().run()
    start = global_step.eval()  # get the initial value of global_step
    print("Start from:", start)
    for i in range(start, 100):
        # batch_size of 128
        for start, end in zip(range(0, len(trX), 128), range(128, len(trX) + 1, 128)):
            sess.run(train_op, feed_dict={X: trX[start:end], Y: trY[start:end],
                                          p_keep_input: 0.8, p_keep_hidden: 0.5})
        global_step.assign(i).eval()  # update the counter
        saver.save(sess, ckpt_dir + "/model.ckpt", global_step=global_step)  # store the model

During training, 16 files appear in ckpt_dir: five model.ckpt-{N}.data-00000-of-00001 files (the saved models), five model.ckpt-{N}.meta files (the saved metadata), five model.ckpt-{N}.index files, where {N} is the iteration number at which the save happened, and one checkpoint text file that records the current model and the last five models. By default TensorFlow keeps only the last five models and their metadata and deletes the earlier ones.

Saving the training parameters means that after an unexpected interruption, training can resume from where it left off. A model (.ckpt file) is written to the checkpoint directory every fixed number of epochs, and a saved model can be taken out for prediction at any time.

Load the model.
Load the model with saver.restore:

with tf.Session() as sess:
    tf.initialize_all_variables().run()
    ckpt = tf.train.get_checkpoint_state(ckpt_dir)
    if ckpt and ckpt.model_checkpoint_path:
        print(ckpt.model_checkpoint_path)
        saver.restore(sess, ckpt.model_checkpoint_path)  # load all parameters
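
A minimal sketch of using the restored parameters for prediction, assuming predict_op, teX, and the dropout placeholders from the model section above are still in scope:

with tf.Session() as sess:
    tf.initialize_all_variables().run()
    ckpt = tf.train.get_checkpoint_state(ckpt_dir)
    if ckpt and ckpt.model_checkpoint_path:
        saver.restore(sess, ckpt.model_checkpoint_path)
        # Dropout is disabled at inference time: keep probabilities of 1.0
        preds = sess.run(predict_op, feed_dict={X: teX[:10],
                                                p_keep_input: 1.0,
                                                p_keep_hidden: 1.0})
        print(preds)  # predicted digit classes for the first 10 test images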

Graph storage and loading.
To save only the graph model, write the graph into a protocol buffer file:

v = tf.Variable(0, name='my_variable')
sess = tf.Session()
tf.train.write_graph(sess.graph_def, '/tmp/tfmodel', 'train.pbtxt')
Read it back:

from google.protobuf import text_format
from tensorflow.python.platform import gfile

with tf.Session() as _sess:
    # train.pbtxt is a text-format protocol buffer, so parse it as text
    with gfile.FastGFile("/tmp/tfmodel/train.pbtxt", 'r') as f:
        graph_def = tf.GraphDef()
        text_format.Merge(f.read(), graph_def)
        _sess.graph.as_default()
        tf.import_graph_def(graph_def, name='tfgraph')
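
After importing, nodes can be fetched by name; tf.import_graph_def prefixes them with the given name scope. A small usage sketch (the node my_variable comes from the save example above):

# Fetch a node of the imported graph by its prefixed name
op = _sess.graph.get_operation_by_name('tfgraph/my_variable')
print(op.name)  # tfgraph/my_variable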

Queues and threads.
A queue is a graph node, a stateful node. Other nodes (enqueue, dequeue) can modify its content: enqueue inserts a new element at the end of the queue, dequeue removes the element at the front.
FIFOQueue and RandomShuffleQueue; source code in tensorflow-1.0.0/tensorflow/python/ops/data_flow_ops.py.
FIFOQueue creates a first-in, first-out queue, for example for recurrent neural network structures, where training samples must be read in order.
Create a queue graph:

import tensorflow as tf
# Create a first-in, first-out queue and initialize it with the numbers 0.1, 0.2, and 0.3
q = tf.FIFOQueue(3, "float")
init = q.enqueue_many(([0.1, 0.2, 0.3],))
# Define dequeue, +1, and enqueue operations
x = q.dequeue()
y = x + 1
q_inc = q.enqueue([y])

Open a session, execute the q_inc operation twice, and view the queue contents:

with tf.Session() as sess:
    sess.run(init)
    quelen = sess.run(q.size())
    for i in range(2):
        sess.run(q_inc)  # after two runs the queue holds 0.3, 1.1, 1.2
    quelen = sess.run(q.size())
    for i in range(quelen):
        print(sess.run(q.dequeue()))  # print the queue values

RandomShuffleQueue creates a random queue: elements are dequeued in random order. This suits training image samples with a CNN structure, where samples are read out of order and each dequeue yields a random training sample.
Asynchronous computation is very important here. TensorFlow sessions support multithreading: while the main thread runs the training operation, a RandomShuffleQueue serves as the training input and multiple threads prepare training samples and push them into the queue; the main thread pulls a mini-batch of samples from the queue for each training step.
Create a random queue with a capacity of 10 and a min_after_dequeue of 2 (at least 2 elements remain after a dequeue):

q = tf.RandomShuffleQueue(capacity=10, min_after_dequeue=2, dtypes="float")
Open a session and perform 10 enqueues and 8 dequeues:

sess = tf.Session()
for i in range(0, 10):  # enqueue 10 times
    sess.run(q.enqueue(i))
for i in range(0, 8):  # dequeue 8 times
    print(sess.run(q.dequeue()))

Blocking: a dequeue blocks when the queue length would fall below the minimum (min_after_dequeue); an enqueue blocks when the queue length reaches the maximum (capacity).
Set a session wait time to break out of the block:

run_options = tf.RunOptions(timeout_in_ms=10000)  # wait for 10 seconds
try:
    sess.run(q.dequeue(), options=run_options)
except tf.errors.DeadlineExceededError:
    print('out of range')

The enqueue operation runs in the session's main thread. When the data volume is large, the enqueue operation reads data from disk and puts it into memory, and the main thread has to wait for the enqueue to finish before it can train. Sessions can run multiple threads: the thread manager QueueRunner creates a series of new threads for the enqueue operations while the main thread keeps consuming data, so training the network and reading data become asynchronous; the main thread trains the network while another thread reads data from disk into memory.

Queue Manager.
Create a queue graph:

q = tf.FIFOQueue(1000, "float")
counter = tf.Variable(0.0)  # counter
increment_op = tf.assign_add(counter, tf.constant(1.0))  # operation: add 1 to the counter
enqueue_op = q.enqueue(counter)  # operation: enqueue the counter value
Create the queue manager QueueRunner, which adds elements to queue q with these two operations, using only one thread per operation:

qr = tf.train.QueueRunner(q, enqueue_ops=[increment_op, enqueue_op] * 1)
Start the session and create a thread from the queue manager qr:

# Main thread
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    enqueue_threads = qr.create_threads(sess, start=True)  # start the enqueue threads
    # Main thread
    for i in range(10):
        print(sess.run(q.dequeue()))

The output is not a natural number sequence because the add-1 operation and the enqueue operation are not synchronized: the add-1 operation may run several times between enqueues. The main thread's training (dequeue operations) and the data-reading threads (enqueue operations) are asynchronous, so the main thread waits for data to be sent.
Also, the enqueue threads keep running on their own after the main thread finishes, so the program cannot exit cleanly. tf.train.Coordinator implements thread synchronization and terminates the other threads.

Threads and the coordinator.
The Coordinator manages threads:

# Main thread
sess = tf.Session()
sess.run(tf.global_variables_initializer())
# Coordinator: coordinates the relationship between threads; think of it as a semaphore for synchronization
coord = tf.train.Coordinator()
# Start the enqueue threads, passing the coordinator as a thread parameter
enqueue_threads = qr.create_threads(sess, coord=coord, start=True)
# Main thread
for i in range(0, 10):
    print(sess.run(q.dequeue()))
coord.request_stop()  # notify the other threads to stop
coord.join(enqueue_threads)  # join waits for the other threads to end; it returns only after all of them have closed

If the queue threads have been closed and a dequeue is executed afterwards, a tf.errors.OutOfRangeError is thrown. This happens when coord.request_stop() is placed before the main thread's q.dequeue() operations:

coord.request_stop()
# Main thread
for i in range(0, 10):
    print(sess.run(q.dequeue()))
coord.join(enqueue_threads)
Catch the error with tf.errors.OutOfRangeError:
coord.request_stop()
# Main thread
for i in range(0, 10):
    try:
        print(sess.run(q.dequeue()))
    except tf.errors.OutOfRangeError:
        break
coord.join(enqueue_threads)
By default, queue managers are added to the graph's tf.GraphKeys.QUEUE_RUNNERS collection.
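
A minimal sketch of working with this collection, assuming the queue q and manager qr from above: register the manager with tf.train.add_queue_runner, then start the threads of every registered manager in one call with tf.train.start_queue_runners:

# Register the queue manager in the default collection, then start the
# threads of all registered managers with a single call
tf.train.add_queue_runner(qr)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    for i in range(5):
        print(sess.run(q.dequeue()))
    coord.request_stop()
    coord.join(threads)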

Load data.
Preloading data: constants or variables defined in the TensorFlow graph hold all the data. Feeding: Python generates the data and feeds it to the backend. Reading from files (reading from file): queue managers read the data from files.

Preloaded data. The data is embedded directly in the data flow graph. When the training data is large, this consumes a lot of memory.

x1 = tf.constant([2, 3, 4])
x2 = tf.constant([4, 0, 1])
y = tf.add(x1, x2)

Feeding data: the feed_dict parameter of sess.run(). Python generates the data and fills it into the backend. With large data volumes, memory consumption and data type conversion add overhead.

import tensorflow as tf
# Design the graph
a1 = tf.placeholder(tf.int16)
a2 = tf.placeholder(tf.int16)
b = tf.add(a1, a2)
# Generate the data with Python
li1 = [2, 3, 4]
li2 = [4, 0, 1]
# Open the session and feed the data to the backend
with tf.Session() as sess:
    print(sess.run(b, feed_dict={a1: li1, a2: li2}))

Reading data from files. The graph defines how to read the files; TensorFlow reads data from the files and decodes it into a usable sample set.
First write the sample data to a TFRecords binary file, then read it from a queue.
TFRecords binary files make better use of memory and are easy to copy and move, without needing separate label files. See tensorflow-1.1.0/tensorflow/examples/how_tos/reading_data/convert_to_records.py.
Generating a TFRecords file. The main function converts the training, validation, and test datasets. Obtain the data, uint8-encoded. Convert the data to the tf.train.Example type and write it to the TFRecords file: the conversion function convert_to fills the data into a tf.train.Example protocol buffer, the protocol buffer is serialized into a string, and tf.python_io.TFRecordWriter writes it to the TFRecords file. There are 55,000 training examples, 5,000 validation examples, and 10,000 test examples; the images are black and white, single channel. In the protocol buffer, height, width, depth, and label are encoded as int64, image_raw is encoded as bytes, and the whole example is serialized into a string. Running the script generates train.tfrecords, validation.tfrecords, and test.tfrecords in /tmp/data.
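
A minimal sketch of such a conversion function, following the pattern of convert_to_records.py (the _int64_feature and _bytes_feature helpers are local conveniences, not TensorFlow APIs):

import tensorflow as tf

def _int64_feature(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

def _bytes_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def convert_to(images, labels, name):
    """Fill the data into tf.train.Example protocol buffers and write them to <name>.tfrecords."""
    num_examples = images.shape[0]
    rows, cols, depth = images.shape[1], images.shape[2], images.shape[3]
    writer = tf.python_io.TFRecordWriter(name + '.tfrecords')
    for index in range(num_examples):
        image_raw = images[index].tostring()  # uint8 pixels as a byte string
        example = tf.train.Example(features=tf.train.Features(feature={
            'height': _int64_feature(rows),
            'width': _int64_feature(cols),
            'depth': _int64_feature(depth),
            'label': _int64_feature(int(labels[index])),
            'image_raw': _bytes_feature(image_raw)}))
        writer.write(example.SerializeToString())  # serialize the protocol buffer to a string
    writer.close()
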
Reading from the queue. Create a tensor that reads one example from the binary file, then create tensors that read a random mini-batch; each batch tensor is fed into the network as an input node. Code: tensorflow-1.1.0/tensorflow/examples/how_tos/reading_data/fully_connected_reader.py. First define how to read and parse a single example from the file name queue. Parse the example by specifying key names in features: the image is of string type, the label of int64 type. Decoding the BytesList turns the string 0-dimensional tensor into a uint8 one-dimensional tensor; after scaling, the image is Tensor("input/sub:0", shape=(784,), dtype=float32). The label is converted from uint8 to int32: Tensor("input/Cast_1:0", shape=(), dtype=int32).
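
A minimal sketch of such a parsing function, following fully_connected_reader.py (784 is mnist.IMAGE_PIXELS):

def read_and_decode(filename_queue):
    """Read and parse one example from the file name queue."""
    reader = tf.TFRecordReader()
    _, serialized_example = reader.read(filename_queue)
    # Specify the key names in features: image as string, label as int64
    features = tf.parse_single_example(
        serialized_example,
        features={'image_raw': tf.FixedLenFeature([], tf.string),
                  'label': tf.FixedLenFeature([], tf.int64)})
    # Decode the BytesList: string 0-D tensor -> uint8 1-D tensor
    image = tf.decode_raw(features['image_raw'], tf.uint8)
    image.set_shape([784])  # mnist.IMAGE_PIXELS
    # Scale the pixel values from [0, 255] down to [-0.5, 0.5]
    image = tf.cast(image, tf.float32) * (1. / 255) - 0.5
    # Convert the label from uint8 to int32
    label = tf.cast(features['label'], tf.int32)
    return image, label
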
tf.train.shuffle_batch randomizes the samples to produce a mini-batch tensor. Input parameters: train selects training or validation data; batch_size is the number of samples per batch; num_epochs is the number of passes over the data, where 0/None means read forever. Returned results: images, float type, shape [batch_size, mnist.IMAGE_PIXELS], range [-0.5, 0.5]; labels, int32 type, shape [batch_size], range [0, mnist.NUM_CLASSES). A tf.train.QueueRunner is involved, so the threads must be started with tf.train.start_queue_runners(). The file paths are /tmp/data/train.tfrecords and /tmp/data/validation.tfrecords. tf.train.string_input_producer returns a QueueRunner with a FIFOQueue inside; if the sample set is large and split across several files, pass in the list of file names. The examples are shuffled and grouped into batches of batch_size; some margin is kept in the queue to ensure there is enough data to shuffle well each time.
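
A minimal sketch of such an input function, following fully_connected_reader.py:

def inputs(train, batch_size, num_epochs):
    """Read mini-batches of batch_size examples, num_epochs times over the data."""
    if not num_epochs:
        num_epochs = None  # None: keep reading forever
    filename = '/tmp/data/' + ('train.tfrecords' if train else 'validation.tfrecords')
    # string_input_producer returns a file name queue backed by a QueueRunner
    filename_queue = tf.train.string_input_producer([filename], num_epochs=num_epochs)
    image, label = read_and_decode(filename_queue)
    # Shuffle the examples into batches; min_after_dequeue keeps a margin
    # in the queue so there is enough data to shuffle well
    images, sparse_labels = tf.train.shuffle_batch(
        [image, label], batch_size=batch_size, num_threads=2,
        capacity=1000 + 3 * batch_size, min_after_dequeue=1000)
    return images, sparse_labels
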
Generate the batch tensors as the network input and train: read images and labels, build a graph that predicts the data with the inference model, define the loss function, and add the graph operations that train the model. Initialize the parameters; string_input_producer internally creates an epoch-counting variable, which is included in the tf.GraphKeys.LOCAL_VARIABLES collection and must be initialized separately with the local variable initializer. Start the input threads, enter a permanent loop, and output results every 100 training steps. Afterwards, notify the other threads to close. With a dataset size of 55,000 and 2 epochs of training, 110,000 examples are processed; with a batch_size of 100 that is 1,100 training steps, and with results output every 100 steps, 11 results are printed.
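
A minimal sketch of this training loop, assuming the inputs function above; build_model stands in for the model-building step and is a hypothetical helper, not code from the tutorial:

with tf.Graph().as_default():
    images, labels = inputs(train=True, batch_size=100, num_epochs=2)
    loss, train_op = build_model(images, labels)  # hypothetical model-building helper
    # string_input_producer created a local epoch counter, so the local
    # variables must be initialized as well
    init_op = tf.group(tf.global_variables_initializer(),
                       tf.local_variables_initializer())
    sess = tf.Session()
    sess.run(init_op)
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)  # start the input threads
    try:
        step = 0
        while not coord.should_stop():  # permanent loop; ends when the input is exhausted
            _, loss_value = sess.run([train_op, loss])
            if step % 100 == 0:
                print('Step %d: loss = %.2f' % (step, loss_value))
            step += 1
    except tf.errors.OutOfRangeError:
        print('Done training for 2 epochs, %d steps.' % step)
    finally:
        coord.request_stop()  # notify the other threads to close
    coord.join(threads)
    sess.close()
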
Steps for TensorFlow to train samples from TFRecords files: set the number of epochs in the file name queue (set it to loop indefinitely during training); when reading data, terminate once the out-of-range error is caught.

Implement custom operations.
This requires familiarity with C++ and a deep understanding of how tensors flow, forward propagation, and backpropagation.
Steps: register the new operation in a C++ file (*_ops.cc), defining the operation's interface specification, name, inputs, outputs, attributes, and so on. Implement the operation in a C++ file (*_kernels.cc), possibly with multiple CPU and GPU kernels. Test the operation, compile the operation library file (*_ops.so), and use the operation from Python.
Best practice: word embedding. Source: https://github.com/tensorflow/models/blob/master/tutorials/embedding/word2vec_optimized.py.
Step 1: create word2vec_ops.cc and register two operations: SkipgramWord2vec and NegTrainWord2vec.
Step 2: implement the two operations for the CPU device in word2vec_kernels.cc.
Step 3: compile and test the operation files. Compilation needs the specific header file directory, which Python exposes through get_include. The C++ compiler compiles the operations into a dynamic library. The TensorFlow Python API provides tf.load_op_library to load the dynamic library and register the operations with the TensorFlow framework; load_op_library returns a Python module containing the operations and kernels.
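
A minimal sketch of step 3 from the Python side; the compile command in the comment is an illustrative assumption (exact flags vary by platform and TensorFlow version), and the .so path is taken from word2vec_optimized.py:

import os
import tensorflow as tf

# tf.sysconfig.get_include() gives the header directory needed to compile
# the op into a dynamic library, e.g. (illustrative command):
#   g++ -std=c++11 -shared word2vec_ops.cc word2vec_kernels.cc \
#       -o word2vec_ops.so -fPIC -I $(python -c \
#       'import tensorflow as tf; print(tf.sysconfig.get_include())')
print(tf.sysconfig.get_include())

# Load the dynamic library and register its operations with the framework;
# the returned Python module exposes the operations as functions
word2vec = tf.load_op_library(
    os.path.join(os.path.dirname(os.path.realpath(__file__)), 'word2vec_ops.so'))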

References:
Analysis and Practice of TensorFlow Technology

Paid consultation is welcome (150 RMB per hour). Contact: qingxingfengzi
