Three methods for reading TensorFlow data (next_batch)


TensorFlow data can be read in three ways:

  1. Preloading: the data is defined as constants directly inside the Graph.
  2. Feeding: Python generates the data and feeds it to the backend at run time.
  3. Reading from file: the Graph reads the data directly from files.

What are the differences between these three reading methods? To answer that, we first need to know how TensorFlow (TF) works.

The core of TF is written in C++. The advantage is that it runs fast; the drawback is that it is inflexible to call. Python is the opposite, so TF combines the strengths of the two languages: the core operators and the runtime framework are written in C++, with APIs exposed to Python. Python calls these APIs to design the training model (the Graph), then hands the designed Graph to the backend to run. In short, Python designs and C++ runs.

I. Preloaded data

    import tensorflow as tf

    # Design the Graph
    x1 = tf.constant([2, 3, 4])
    x2 = tf.constant([4, 0, 1])
    y = tf.add(x1, x2)

    # Open a Session --> compute y
    with tf.Session() as sess:
        print sess.run(y)

II. Feeding: Python generates the data, then feeds it to the backend

    import tensorflow as tf

    # Design the Graph
    x1 = tf.placeholder(tf.int16)
    x2 = tf.placeholder(tf.int16)
    y = tf.add(x1, x2)

    # Use Python to generate the data
    li1 = [2, 3, 4]
    li2 = [4, 0, 1]

    # Open a Session --> feed the data --> compute y
    with tf.Session() as sess:
        print sess.run(y, feed_dict={x1: li1, x2: li2})

Note: here x1 and x2 are only placeholders with no concrete values. So where do the values come from at run time? This is what the feed_dict parameter of sess.run() is for: it feeds the data produced by Python to the backend, which then computes y.

Disadvantages of the two solutions:

1. Preloading: the data is embedded directly into the Graph, and the Graph is then passed into the Session to run. When the data volume is large, transmitting the Graph runs into efficiency problems.

2. Feeding: placeholders stand in for the data, and the actual values are filled in at run time (the data still has to be copied from Python to the backend at every step).

The first two methods are very convenient, but they struggle with large datasets. Even with Feeding, the intermediate steps, such as data type conversion, add non-trivial overhead. The best solution is to define the file-reading ops inside the Graph, so that TF reads the data from files itself and decodes it into a usable sample set.

III. Reading from file: set up a data-reading subgraph inside the Graph

1. Prepare the data. Suppose the data is split across three files: A.csv, B.csv, and C.csv.

    $ echo -e "Alpha1,A1\nAlpha2,A2\nAlpha3,A3" > A.csv
    $ echo -e "Bee1,B1\nBee2,B2\nBee3,B3" > B.csv
    $ echo -e "Sea1,C1\nSea2,C2\nSea3,C3" > C.csv
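The resulting A.csv, for example, contains three comma-separated rows:

    Alpha1,A1
    Alpha2,A2
    Alpha3,A3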

2. Single Reader, single sample

    # -*- coding: utf-8 -*-
    import tensorflow as tf

    # Generate a FIFO queue and a QueueRunner to produce the filename queue
    filenames = ['A.csv', 'B.csv', 'C.csv']
    filename_queue = tf.train.string_input_producer(filenames, shuffle=False)

    # Define the Reader
    reader = tf.TextLineReader()
    key, value = reader.read(filename_queue)

    # Define the Decoder
    example, label = tf.decode_csv(value, record_defaults=[['null'], ['null']])
    # example_batch, label_batch = tf.train.shuffle_batch(
    #     [example, label], batch_size=1, capacity=200,
    #     min_after_dequeue=100, num_threads=2)

    # Run the Graph
    with tf.Session() as sess:
        coord = tf.train.Coordinator()  # create a Coordinator to manage the threads
        threads = tf.train.start_queue_runners(coord=coord)  # start the QueueRunners; the filenames enter the queue
        for i in range(10):
            print example.eval(), label.eval()
        coord.request_stop()
        coord.join(threads)

Note: tf.train.shuffle_batch is not used here, and example.eval() and label.eval() each dequeue a record separately, so the generated samples and labels fall out of step. The result looks like this:

Alpha1 A2
Alpha3 B1
Bee2 B3
Sea1 C2
Sea3 A1
Alpha2 A3
Bee1 B2
Bee3 C1
Sea2 C3
Alpha1 A2

Solution: use tf.train.shuffle_batch, and fetch the batched tensors in a single sess.run() call so the results stay matched.

    # -*- coding: utf-8 -*-
    import tensorflow as tf

    # Generate a FIFO queue and a QueueRunner to produce the filename queue
    filenames = ['A.csv', 'B.csv', 'C.csv']
    filename_queue = tf.train.string_input_producer(filenames, shuffle=False)

    # Define the Reader
    reader = tf.TextLineReader()
    key, value = reader.read(filename_queue)

    # Define the Decoder
    example, label = tf.decode_csv(value, record_defaults=[['null'], ['null']])
    example_batch, label_batch = tf.train.shuffle_batch(
        [example, label], batch_size=1, capacity=200,
        min_after_dequeue=100, num_threads=2)

    # Run the Graph
    with tf.Session() as sess:
        coord = tf.train.Coordinator()  # create a Coordinator to manage the threads
        threads = tf.train.start_queue_runners(coord=coord)  # start the QueueRunners
        for i in range(10):
            e_val, l_val = sess.run([example_batch, label_batch])  # one run fetches both, keeping them in sync
            print e_val, l_val
        coord.request_stop()
        coord.join(threads)

3. Single Reader, multiple samples: implemented mainly through tf.train.batch()

    # -*- coding: utf-8 -*-
    import tensorflow as tf

    filenames = ['A.csv', 'B.csv', 'C.csv']
    filename_queue = tf.train.string_input_producer(filenames, shuffle=False)
    reader = tf.TextLineReader()
    key, value = reader.read(filename_queue)
    example, label = tf.decode_csv(value, record_defaults=[['null'], ['null']])

    # tf.train.batch() adds a sample queue and a QueueRunner.
    # After the Decoder parses a record, it enters the queue and then leaves it in batches.
    # Although there is only one Reader, multiple threads can be used; more threads
    # speed up reading, but more is not always better.
    example_batch, label_batch = tf.train.batch([example, label], batch_size=5)

    with tf.Session() as sess:
        coord = tf.train.Coordinator()
        threads = tf.train.start_queue_runners(coord=coord)
        for i in range(10):
            e_val, l_val = sess.run([example_batch, label_batch])
            print e_val, l_val
        coord.request_stop()
        coord.join(threads)

Note: in the following variant, the features and labels of the fetched batches are not synchronized, because each .eval() call dequeues a separate batch.

    # -*- coding: utf-8 -*-
    import tensorflow as tf

    filenames = ['A.csv', 'B.csv', 'C.csv']
    filename_queue = tf.train.string_input_producer(filenames, shuffle=False)
    reader = tf.TextLineReader()
    key, value = reader.read(filename_queue)
    example, label = tf.decode_csv(value, record_defaults=[['null'], ['null']])

    # tf.train.batch() adds a sample queue and a QueueRunner, as above.
    example_batch, label_batch = tf.train.batch([example, label], batch_size=5)

    with tf.Session() as sess:
        coord = tf.train.Coordinator()
        threads = tf.train.start_queue_runners(coord=coord)
        for i in range(10):
            # Two separate eval() calls dequeue two different batches
            print example_batch.eval(), label_batch.eval()
        coord.request_stop()
        coord.join(threads)

The output below shows that the features and labels no longer correspond:

['Alpha1' 'Alpha2' 'Alpha3' 'Bee1' 'Bee2'] ['B3' 'C1' 'C2' 'C3' 'A1']
['Alpha2' 'Alpha3' 'Bee1' 'Bee2' 'Bee3'] ['C1' 'C2' 'C3' 'A1' 'A2']
['Alpha3' 'Bee1' 'Bee2' 'Bee3' 'Sea1'] ['C2' 'C3' 'A1' 'A2' 'A3']

4. Multiple readers and multiple samples

    # -*- coding: utf-8 -*-
    import tensorflow as tf

    filenames = ['A.csv', 'B.csv', 'C.csv']
    filename_queue = tf.train.string_input_producer(filenames, shuffle=False)
    reader = tf.TextLineReader()
    key, value = reader.read(filename_queue)
    record_defaults = [['null'], ['null']]

    # Define several Decoders, each connected to a Reader: 2 Readers here
    example_list = [tf.decode_csv(value, record_defaults=record_defaults)
                    for _ in range(2)]

    # tf.train.batch_join() reads data in parallel with multiple Readers,
    # one thread per Reader.
    example_batch, label_batch = tf.train.batch_join(example_list, batch_size=5)

    with tf.Session() as sess:
        coord = tf.train.Coordinator()
        threads = tf.train.start_queue_runners(coord=coord)
        for i in range(10):
            e_val, l_val = sess.run([example_batch, label_batch])
            print e_val, l_val
        coord.request_stop()
        coord.join(threads)

tf.train.batch and tf.train.shuffle_batch read with a single Reader, but can use multiple threads. tf.train.batch_join and tf.train.shuffle_batch_join set up multi-Reader reading, with one thread per Reader. As for efficiency: with a single Reader, two threads already reach the speed limit; with multiple Readers, two Readers reach the limit. So more threads is not automatically faster, and too many threads actually lowers efficiency.
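For completeness, here is a minimal sketch of the multi-Reader variant with shuffling, using tf.train.shuffle_batch_join (same files and record_defaults as above; the capacity and min_after_dequeue values are illustrative):

    # -*- coding: utf-8 -*-
    import tensorflow as tf

    filenames = ['A.csv', 'B.csv', 'C.csv']
    filename_queue = tf.train.string_input_producer(filenames, shuffle=False)
    reader = tf.TextLineReader()
    key, value = reader.read(filename_queue)
    example_list = [tf.decode_csv(value, record_defaults=[['null'], ['null']])
                    for _ in range(2)]  # 2 Readers

    # Like batch_join, but the samples pass through a shuffling queue
    example_batch, label_batch = tf.train.shuffle_batch_join(
        example_list, batch_size=5, capacity=200, min_after_dequeue=100)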

5. Iteration control: set the num_epochs parameter to limit how many rounds the samples are used during training

    # -*- coding: utf-8 -*-
    import tensorflow as tf

    filenames = ['A.csv', 'B.csv', 'C.csv']
    # num_epochs: set the number of iteration rounds
    filename_queue = tf.train.string_input_producer(
        filenames, shuffle=False, num_epochs=3)
    reader = tf.TextLineReader()
    key, value = reader.read(filename_queue)
    record_defaults = [['null'], ['null']]

    # Define several Decoders, each connected to a Reader: 2 Readers here
    example_list = [tf.decode_csv(value, record_defaults=record_defaults)
                    for _ in range(2)]

    # tf.train.batch_join() reads data in parallel with multiple Readers
    example_batch, label_batch = tf.train.batch_join(example_list, batch_size=1)

    # Initialize the local variables (num_epochs is counted in a local variable)
    init_local_op = tf.initialize_local_variables()

    with tf.Session() as sess:
        sess.run(init_local_op)
        coord = tf.train.Coordinator()
        threads = tf.train.start_queue_runners(coord=coord)
        try:
            while not coord.should_stop():
                e_val, l_val = sess.run([example_batch, label_batch])
                print e_val, l_val
        except tf.errors.OutOfRangeError:
            print('Epochs Complete!')
        finally:
            coord.request_stop()
        coord.join(threads)

For iteration control, remember to add tf.initialize_local_variables(). The official tutorial does not mention it, but without this initialization an error is raised.
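In later TF 1.x releases the same initializer is spelled tf.local_variables_initializer(); a one-line sketch of the replacement:

    sess.run(tf.local_variables_initializer())  # newer alias for tf.initialize_local_variables()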

For traditional machine learning, take a classification problem: [x1, x2, x3] is the feature vector, and for binary classification the label becomes [0, 1] or [1, 0] after one-hot encoding. Generally, the data is organized in a CSV file, one row per sample, and read using the queue method.

Note: in this data, the first three columns are the features; because it is a classification problem, the last two columns are the one-hot encoded labels.
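The data file itself is not reproduced here; a plausible row layout matching that description (the values are made up for illustration) would be:

    1,2,3,0,1
    4,5,6,1,0
    7,8,9,0,1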

The code for reading the csv file using the queue is as follows:

    # -*- coding: utf-8 -*-
    import tensorflow as tf

    # Generate a FIFO queue and a QueueRunner to produce the filename queue
    filenames = ['A.csv']
    filename_queue = tf.train.string_input_producer(filenames, shuffle=False)

    # Define the Reader
    reader = tf.TextLineReader()
    key, value = reader.read(filename_queue)

    # Define the Decoder
    record_defaults = [[1], [1], [1], [1], [1]]
    col1, col2, col3, col4, col5 = tf.decode_csv(value, record_defaults=record_defaults)
    features = tf.pack([col1, col2, col3])  # tf.stack in later TF versions
    label = tf.pack([col4, col5])
    example_batch, label_batch = tf.train.shuffle_batch(
        [features, label], batch_size=2, capacity=200,
        min_after_dequeue=100, num_threads=2)

    # Run the Graph
    with tf.Session() as sess:
        coord = tf.train.Coordinator()  # create a Coordinator to manage the threads
        threads = tf.train.start_queue_runners(coord=coord)  # start the QueueRunners
        for i in range(10):
            e_val, l_val = sess.run([example_batch, label_batch])
            print e_val, l_val
        coord.request_stop()
        coord.join(threads)


Note: the template

    record_defaults = [[1], [1], [1], [1], [1]]

specifies how each line is parsed. Each sample has five columns, separated by commas by default. The default [1] means the column is parsed as an integer; [1.0] would parse it as a float, and ['null'] as a string.
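For instance, a row with one integer column, one float column, and one string column could be declared like this (a minimal sketch; the variable names are made up):

    record_defaults = [[1], [1.0], ['null']]
    col_int, col_float, col_str = tf.decode_csv(value, record_defaults=record_defaults)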

Finally, several different next_batch implementations are collected here; this article only records the code snippets for future reference.

Method 1 (taken from the MNIST tutorial's mnist.py):

    def next_batch(self, batch_size, fake_data=False):
        """Return the next `batch_size` examples from this data set."""
        if fake_data:
            fake_image = [1] * 784
            if self.one_hot:
                fake_label = [1] + [0] * 9
            else:
                fake_label = 0
            return [fake_image for _ in xrange(batch_size)], [
                fake_label for _ in xrange(batch_size)]
        start = self._index_in_epoch
        self._index_in_epoch += batch_size
        if self._index_in_epoch > self._num_examples:
            # The index has run past the corpus size, i.e. this epoch has
            # finished a full traversal, so start a new round.
            # Finished epoch
            self._epochs_completed += 1
            # Shuffle the data
            perm = numpy.arange(self._num_examples)  # arange builds the index array 0..n-1
            numpy.random.shuffle(perm)  # shuffle the indices
            self._images = self._images[perm]
            self._labels = self._labels[perm]
            # Start next epoch
            start = 0
            self._index_in_epoch = batch_size
            assert batch_size <= self._num_examples
        end = self._index_in_epoch
        return self._images[start:end], self._labels[start:end]

This code is taken from the mnist.py file. Start from the line start = self._index_in_epoch: _index_in_epoch - 1 is the index of the last image of the previous batch, so the current batch starts at index _index_in_epoch and its last image is at _index_in_epoch + batch_size - 1. If _index_in_epoch exceeds the number of images in the corpus, this epoch has completed a full traversal, so the images are shuffled and a new round over the corpus begins.
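As a usage sketch (assuming the standard MNIST tutorial objects mnist, train_op, x, and y_ already exist):

    for step in range(1000):
        batch_xs, batch_ys = mnist.train.next_batch(100)  # fetch the next 100 samples
        sess.run(train_op, feed_dict={x: batch_xs, y_: batch_ys})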

Method 2 (taken from the PTB tutorial's reader):

    import numpy as np

    def ptb_iterator(raw_data, batch_size, num_steps):
        """Iterate on the raw PTB data.

        This generates batch_size pointers into the raw PTB data, and allows
        minibatch iteration along these pointers.

        Args:
          raw_data: one of the raw data outputs from ptb_raw_data.
          batch_size: int, the batch size.
          num_steps: int, the number of unrolls.

        Yields:
          Pairs of the batched data, each a matrix of shape
          [batch_size, num_steps]. The second element of the tuple is the
          same data time-shifted to the right by one.

        Raises:
          ValueError: if batch_size or num_steps are too high.
        """
        raw_data = np.array(raw_data, dtype=np.int32)
        data_len = len(raw_data)
        batch_len = data_len // batch_size  # words per batch row (// is integer division)
        data = np.zeros([batch_size, batch_len], dtype=np.int32)
        for i in range(batch_size):
            data[i] = raw_data[batch_len * i:batch_len * (i + 1)]
        epoch_size = (batch_len - 1) // num_steps  # how many num_steps windows fit in a row
        if epoch_size == 0:
            raise ValueError("epoch_size == 0, decrease batch_size or num_steps")
        for i in range(epoch_size):
            x = data[:, i * num_steps:(i + 1) * num_steps]
            y = data[:, i * num_steps + 1:(i + 1) * num_steps + 1]
            yield (x, y)
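A hypothetical usage with toy data (20 token ids, batch_size=2, num_steps=3):

    raw = list(range(20))
    for x, y in ptb_iterator(raw, batch_size=2, num_steps=3):
        print x.shape, y.shape  # (2, 3) (2, 3); y is x time-shifted right by one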

Method 3:

    def next(self, batch_size):
        """Return a batch of data. When dataset end is reached, start over."""
        if self.batch_id == len(self.data):
            self.batch_id = 0
        batch_data = self.data[self.batch_id:
                               min(self.batch_id + batch_size, len(self.data))]
        batch_labels = self.labels[self.batch_id:
                                   min(self.batch_id + batch_size, len(self.data))]
        batch_seqlen = self.seqlen[self.batch_id:
                                   min(self.batch_id + batch_size, len(self.data))]
        self.batch_id = min(self.batch_id + batch_size, len(self.data))
        return batch_data, batch_labels, batch_seqlen
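A minimal sketch of the class this method assumes (the attribute names data, labels, seqlen, and batch_id come from the snippet above; everything else is hypothetical):

    class SequenceData(object):
        def __init__(self, data, labels, seqlen):
            self.data = data        # list of sequences
            self.labels = labels    # one label per sequence
            self.seqlen = seqlen    # original length of each sequence
            self.batch_id = 0       # cursor that next() advances and wraps

        # def next(self, batch_size): ... as defined above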

Method 4:

    import numpy as np

    def batch_iter(sourceData, batch_size, num_epochs, shuffle=True):
        data = np.array(sourceData)  # convert sourceData to array storage
        data_size = len(sourceData)
        num_batches_per_epoch = int(len(sourceData) / batch_size) + 1
        for epoch in range(num_epochs):
            # Shuffle the data at each epoch
            if shuffle:
                shuffle_indices = np.random.permutation(np.arange(data_size))
                shuffled_data = data[shuffle_indices]
            else:
                shuffled_data = data
            for batch_num in range(num_batches_per_epoch):
                start_index = batch_num * batch_size
                end_index = min((batch_num + 1) * batch_size, data_size)
                yield shuffled_data[start_index:end_index]
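A hypothetical usage with toy data (10 samples, batches of 3, 2 epochs, shuffling disabled so the order is visible):

    for batch in batch_iter(list(range(10)), batch_size=3, num_epochs=2, shuffle=False):
        print batch  # [0 1 2], [3 4 5], [6 7 8], [9], then the same again for epoch 2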

This one is written as a generator; if the yield-based style is unfamiliar, it is worth reading up on Python iterators and generators.

Note that the first three methods traverse the whole corpus one pass at a time, while the last one iterates over the entire corpus num_epochs times.

That is all for this article. I hope it is helpful for your study of TensorFlow.
