TensorFlow Dataset Creation and File-Queue Reading

Source: Internet
Author: User
Three ways of reading data

There are three ways to read data in a TensorFlow program:
Feeding: at each step of the TensorFlow program, Python code supplies the data.
Reading from files: an input pipeline at the beginning of the TensorFlow graph reads the data from files.
Preloaded data: constants or variables in the TensorFlow graph hold all the data (only suitable when the dataset is small).

All three methods are introduced on the official site.
1. Feeding: first define a placeholder with input = tf.placeholder(tf.float32), then pass the actual data in when calling sess.run().
3. Preloading: as below, the data is stored in constants. This method is not appropriate for large datasets, because there will not be enough memory.

training_data = ...
training_labels = ...
with tf.Session():
    input_data = tf.constant(training_data)
    input_labels = tf.constant(training_labels)
Reading data from files

Why use this method: in an image-classification scenario, for example, we want to use our own dataset. The dataset is large, so data needs to be loaded dynamically; a placeholder would work, but reading from files is more convenient. The following describes how to read from files. (In particular, queues solve the GPU-idle and memory problems: reading all the images into memory at once would take far more memory than the machine may have, but with a queue, each batch is loaded from disk into the in-memory queue as needed, so much larger datasets can be handled. See http://geek.csdn.net/news/detail/201552 for a detailed description of TensorFlow's reading mechanism.)

The general steps given on the official site.
A typical file-reading pipeline includes the following stages:
the list of file names
optional filename shuffling
an optional maximum number of training iterations (epoch limit)
a filename queue
a reader for the input file format
a record parser
an optional preprocessor
a sample (example) queue

This article uses TensorFlow's binary format for processing. First, the binary files are easy to work with, and there are plenty of common processing examples online. Second, they are convenient to generate: a 200 MB binary file can be produced very quickly, whereas producing the same data as CSV is particularly slow, and a 200 MB spreadsheet cannot even be opened in Excel. So let's use binary files.

If you don't have such a binary file yet, don't worry: below I will show you how to make a dataset of your own.

For a detailed description of the queues themselves, refer to the official site; I don't fully understand it myself.
http://www.tensorfly.cn/tfdoc/how_tos/reading_data.html#AUTOGENERATED-preloaded-data
TensorFlow provides two classes to help with the multithreaded implementation: tf.train.Coordinator and tf.train.QueueRunner. By design, these two classes must be used together. The Coordinator class can stop multiple worker threads at the same time and report exceptions to the program that waits for all worker threads to terminate. The QueueRunner class coordinates multiple worker threads pushing tensors into the same queue at the same time: it creates a set of threads that repeatedly perform enqueue operations, and they use the same Coordinator to handle synchronized thread termination.
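A minimal sketch of the Coordinator side of this pattern, using ordinary Python threads (the worker function here is made up for illustration; QueueRunner packages the same idea specifically for enqueue ops):

```python
import threading
import tensorflow as tf

def worker(coord, results, i):
    # loop until the coordinator tells every thread to stop
    while not coord.should_stop():
        results.append(i)
        coord.request_stop()  # any thread can trigger a coordinated stop

coord = tf.train.Coordinator()
results = []
threads = [threading.Thread(target=worker, args=(coord, results, i))
           for i in range(4)]
for t in threads:
    t.start()
coord.join(threads)  # waits for all threads; re-raises their exceptions
print(len(results))
```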

BasePath = '/home/user/xxxxx'
classes = {'C1', 'C2'}

# generate the dataset
def create_record():
    writer = tf.python_io.TFRecordWriter("train.tfrecords")
    for index, name in enumerate(classes):
        class_path = BasePath + "/" + name + "/"
        for img_name in os.listdir(class_path):
            img_path = class_path + img_name
            img = Image.open(img_path)
            img = img.resize((128, 128))  # target size assumed; the original value was garbled
            img_raw = img.tobytes()  # convert the image to raw bytes
            #print(index, img_raw)
            example = tf.train.Example(features=tf.train.Features(feature={
                "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[index])),
                'img_raw': tf.train.Feature(bytes_list=tf.train.BytesList(value=[img_raw]))
            }))
            writer.write(example.SerializeToString())
    writer.close()

# read the binary data
img, label = read_and_decode("./train.tfrecords")

# Batching: img_batch here can be treated as the network input; each sess.run()
# of the related ops dequeues one batch, which is equivalent to writing your own
# queue operation to feed the x data. (Personal understanding.)
img_batch, label_batch = tf.train.shuffle_batch([img, label],
                                                batch_size=4,
                                                capacity=2000,
                                                min_after_dequeue=1000)

# Template recommended on the official site
# Create the graph, etc.
init_op = tf.initialize_all_variables()

# Create a session for running operations in the graph.
sess = tf.Session()

# Initialize the variables (like the epoch counter).
sess.run(init_op)

# Start input enqueue threads.
coord = tf.train.Coordinator()
threads = tf.train.start_queue_runners(sess=sess, coord=coord)

try:
    while not coord.should_stop():
        # Run training steps or whatever
        sess.run(train_op)
except tf.errors.OutOfRangeError:
    print('Done training -- epoch limit reached')
finally:
    # When done, ask the threads to stop.
    coord.request_stop()

# Wait for threads to finish.
coord.join(threads)
sess.close()
Make your own data set

Idea: use a TFRecords file.
To produce this format, first fill the general data format into an Example protocol buffer, then serialize the protocol buffer into a string, and finally write the string into a tfrecords file with the tf.python_io.TFRecordWriter class.

import os
import tensorflow as tf
from PIL import Image

# create the binary data
def create_record():
    writer = tf.python_io.TFRecordWriter("train.tfrecords")
    for index, name in enumerate(classes):
        class_path = BasePath + "/" + name + "/"
        for img_name in os.listdir(class_path):
            img_path = class_path + img_name
            img = Image.open(img_path)
            img = img.resize((128, 128))  # target size assumed; the original value was garbled
            img_raw = img.tobytes()  # convert the image to raw bytes
            #print(index, img_raw)

            example = tf.train.Example(
                features=tf.train.Features(
                    feature={
                        "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[index])),
                        'img_raw': tf.train.Feature(bytes_list=tf.train.BytesList(value=[img_raw]))
                    }))
            writer.write(example.SerializeToString())
    writer.close()
A small problem

When classifying with a CNN, I ran into resource exhaustion: the network was too big and GPU memory was used up. MNIST uses 28*28 images, while my images are 320*240. The problem comes mainly from the last fully connected layer having too many parameters, and it needs to be handled separately.
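A back-of-the-envelope parameter count shows why the input size blows up the fully connected layer. Assume, purely for illustration, 64 feature channels at the input resolution going into a 1024-unit dense layer:

```python
def fc_params(height, width, channels=64, units=1024):
    # dense layer weights: (flattened input size) x units, plus one bias per unit
    return height * width * channels * units + units

print(fc_params(28, 28))    # MNIST-sized input
print(fc_params(320, 240))  # 320*240 input: roughly 98x more parameters
```

Downsampling with pooling or strided convolutions before the dense layer shrinks this count quadratically.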
