TensorFlow Notes: Building Your Own Dataset with the Dataset API

Source: Internet
Author: User
Tags: assert, scalar, shuffle

Updated for TensorFlow 1.4.

I. Reading input data

1. Consuming NumPy arrays

If the dataset fits entirely in memory, the simplest approach is to use NumPy arrays:

1). Load the .npy file into tf.Tensor objects
2). Use Dataset.from_tensor_slices()
Example:

# Load the training data into two NumPy arrays, for example using `np.load()`.
with np.load("/var/data/training_data.npy") as data:
  features = data["features"]
  labels = data["labels"]
# Assume that each row of `features` corresponds to the same row of `labels`.
assert features.shape[0] == labels.shape[0]
dataset = tf.data.Dataset.from_tensor_slices((features, labels))

Note that the code fragment above embeds the features and labels arrays in the TensorFlow graph as tf.constant() operations. This works well for small datasets, but it wastes memory, because the contents of the arrays are copied multiple times, and it can run into the 2GB limit of the tf.GraphDef protocol buffer.

Instead, you can define the dataset in terms of tf.placeholder() tensors and feed the NumPy arrays when you initialize an iterator over the dataset.

# Load the training data into two NumPy arrays, for example using `np.load()`.
with np.load("/var/data/training_data.npy") as data:
  features = data["features"]
  labels = data["labels"]

# Assume that each row of `features` corresponds to the same row of `labels`.
assert features.shape[0] == labels.shape[0]

features_placeholder = tf.placeholder(features.dtype, features.shape)
labels_placeholder = tf.placeholder(labels.dtype, labels.shape)

dataset = tf.data.Dataset.from_tensor_slices((features_placeholder, labels_placeholder))
# [Other transformations on `dataset`...]
dataset = ...
iterator = dataset.make_initializable_iterator()

sess.run(iterator.initializer, feed_dict={features_placeholder: features,
                                          labels_placeholder: labels})
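Once the iterator has been initialized, the dataset can be consumed element by element. The following is only a minimal usage sketch, assuming an existing tf.Session named sess and the iterator built above; catching tf.errors.OutOfRangeError to detect the end of the data is covered in more detail in the training section below.

# Minimal consumption sketch (assumes `sess` is an existing tf.Session and
# `iterator` was built and initialized as above).
next_element = iterator.get_next()
while True:
  try:
    example_features, example_labels = sess.run(next_element)
  except tf.errors.OutOfRangeError:
    break  # The dataset has been exhausted.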
2. Consuming TFRecord data

The Dataset API supports a variety of file formats, so you can process large datasets that do not fit in memory. For example, the TFRecord file format is a simple record-oriented binary format, and many TensorFlow applications use it for training data. The tf.data.TFRecordDataset class lets you stream the contents of one or more TFRecord files as part of an input pipeline.

# Creates a dataset that reads all of the examples from two files.
filenames = ["/var/data/file1.tfrecord", "/var/data/file2.tfrecord"]
dataset = tf.data.TFRecordDataset(filenames)

The filenames argument to the TFRecordDataset initializer can be a string, a list of strings, or a tf.Tensor of strings. Therefore, if you have two sets of files, one for training and one for validation, you can use a tf.placeholder(tf.string) to represent the filenames and initialize the iterator with the appropriate filenames:

filenames = tf.placeholder(tf.string, shape=[None])
dataset = tf.data.TFRecordDataset(filenames)
dataset = dataset.map(...)  # Parse the record into tensors.
dataset = dataset.repeat()  # Repeat the input indefinitely.
dataset = dataset.batch(32)
iterator = dataset.make_initializable_iterator()

# You can feed the initializer with the appropriate filenames for the current
# phase of execution, e.g. training vs. validation.

# Initialize `iterator` with training data.
training_filenames = ["/var/data/file1.tfrecord", "/var/data/file2.tfrecord"]
sess.run(iterator.initializer, feed_dict={filenames: training_filenames})

# Initialize `iterator` with validation data.
validation_filenames = ["/var/data/validation1.tfrecord", ...]
sess.run(iterator.initializer, feed_dict={filenames: validation_filenames})
3. Consuming text data

Many datasets are distributed as one or more text files. tf.data.TextLineDataset provides an easy way to extract lines from one or more text files. Given one or more filenames, a TextLineDataset will produce one string-valued element per line of those files. Like TFRecordDataset, TextLineDataset accepts the filenames as a tf.Tensor, so you can parameterize it by passing a tf.placeholder(tf.string).

filenames = ["/var/data/file1.txt", "/var/data/file2.txt"]
dataset = tf.data.TextLineDataset(filenames)

By default, a TextLineDataset yields every line of each file, which may not be desirable, for example if a file starts with a header line or contains comments. These lines can be removed with the Dataset.skip() and Dataset.filter() transformations. To apply these transformations to each file separately, we use Dataset.flat_map() to create a nested Dataset for each file.

filenames = ["/var/data/file1.txt", "/var/data/file2.txt"]

dataset = tf.data.Dataset.from_tensor_slices(filenames)

# Use `Dataset.flat_map()` to transform each file as a separate nested dataset,
# and then concatenate their contents sequentially into a single "flat" dataset.
# * Skip the first line (header row).
# * Filter out lines beginning with "#" (comments).
dataset = dataset.flat_map(
    lambda filename: (
        tf.data.TextLineDataset(filename)
        .skip(1)
        .filter(lambda line: tf.not_equal(tf.substr(line, 0, 1), "#"))))

For a complete example of using Datasets to parse CSV files, see imports85.py.

II. Using Dataset.map() to preprocess data

The Dataset.map(f) transformation produces a new dataset by applying a given function f to each element of the input dataset. It is based on the map() function that is commonly applied to lists (and other structures) in functional programming languages. The function f takes a tf.Tensor object that represents a single element of the input, and returns a tf.Tensor object that will represent a single element of the new dataset. Its implementation uses standard TensorFlow operations to transform one element into another.
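Before the more realistic examples below, here is a minimal sketch (not from the original guide) of this element-wise behavior, squaring every element of a small range dataset:

# Minimal sketch: square every element of a small dataset.
dataset = tf.data.Dataset.range(5)       # 0, 1, 2, 3, 4
dataset = dataset.map(lambda x: x * x)   # 0, 1, 4, 9, 16
iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()
with tf.Session() as sess:
  for _ in range(5):
    print(sess.run(next_element))  # ==> 0, 1, 4, 9, 16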

This section covers common examples of how to use Dataset.map().

1. Parsing tf.train.Example protocol buffer messages

Many input pipelines extract tf.train.Example protocol buffer messages from a TFRecord-format file (written, for example, with tf.python_io.TFRecordWriter). Each tf.train.Example record contains one or more "features", and the input pipeline typically converts these features into tensors.

# Transforms a scalar string `example_proto` into a pair of a scalar string and
# a scalar integer, representing an image and its label, respectively.
def _parse_function(example_proto):
  features = {"image": tf.FixedLenFeature((), tf.string, default_value=""),
              "label": tf.FixedLenFeature((), tf.int64, default_value=0)}
  parsed_features = tf.parse_single_example(example_proto, features)
  return parsed_features["image"], parsed_features["label"]

# Creates a dataset that reads all of the examples from two files, and extracts
# the image and label features.
filenames = ["/var/data/file1.tfrecord", "/var/data/file2.tfrecord"]
dataset = tf.data.TFRecordDataset(filenames)
dataset = dataset.map(_parse_function)
2. Decoding image data and resizing

When training a neural network on real-world image data, it is often necessary to convert images of different sizes to a common size so that they can be batched to a fixed shape.

# Reads an image from a file, decodes it into a dense tensor, and resizes it
# to a fixed shape.
def _parse_function(filename, label):
  image_string = tf.read_file(filename)
  image_decoded = tf.image.decode_image(image_string)
  image_resized = tf.image.resize_images(image_decoded, [28, 28])
  return image_resized, label

# A vector of filenames.
filenames = tf.constant(["/var/data/image1.jpg", "/var/data/image2.jpg", ...])

# `labels[i]` is the label for the image in `filenames[i]`.
labels = tf.constant([0, 37, ...])

dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))
dataset = dataset.map(_parse_function)
3. Applying arbitrary Python logic with tf.py_func()

For performance reasons, you are encouraged to use TensorFlow operations for preprocessing your data whenever possible. However, it is sometimes useful to call an external Python library when parsing your input data. To do so, invoke the tf.py_func() operation in a Dataset.map() transformation. The example below uses the OpenCV Python library, cv2.

import cv2

# Use a custom OpenCV function to read the image, instead of the standard
# TensorFlow `tf.read_file()` operation.
def _read_py_function(filename, label):
  image_decoded = cv2.imread(filename.decode(), cv2.IMREAD_GRAYSCALE)
  return image_decoded, label

# Use standard TensorFlow operations to resize the image to a fixed shape.
def _resize_function(image_decoded, label):
  image_decoded.set_shape([None, None, None])
  image_resized = tf.image.resize_images(image_decoded, [28, 28])
  return image_resized, label

filenames = ["/var/data/image1.jpg", "/var/data/image2.jpg", ...]
labels = [0, 37, 1, ...]

dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))
dataset = dataset.map(
    lambda filename, label: tuple(tf.py_func(
        _read_py_function, [filename, label], [tf.uint8, label.dtype])))
dataset = dataset.map(_resize_function)
III. Batching dataset elements

1. Simple batching

The simplest form of batching stacks n consecutive elements of a dataset into a single element. The Dataset.batch() transformation does exactly this, with the same constraints as the tf.stack() operator, applied to each component of the elements: i.e. for each component i, all elements must have a tensor of exactly the same shape.

inc_dataset = tf.data.Dataset.range(100)
dec_dataset = tf.data.Dataset.range(0, -100, -1)
dataset = tf.data.Dataset.zip((inc_dataset, dec_dataset))
batched_dataset = dataset.batch(4)

iterator = batched_dataset.make_one_shot_iterator()
next_element = iterator.get_next()

print(sess.run(next_element))  # ==> ([0, 1, 2,  3],  [ 0, -1, -2,  -3])
print(sess.run(next_element))  # ==> ([4, 5, 6,  7],  [-4, -5, -6,  -7])
print(sess.run(next_element))  # ==> ([8, 9, 10, 11], [-8, -9, -10, -11])
2. Batching tensors with padding

The approach above works when all tensors have the same size. However, many models (for example sequence models) work with input data that can have varying size (for example sequences of different lengths). To handle this case, the Dataset.padded_batch() transformation lets you batch tensors of different shapes by specifying one or more dimensions in which they may be padded.

dataset = tf.data.Dataset.range(100)
dataset = dataset.map(lambda x: tf.fill([tf.cast(x, tf.int32)], x))
dataset = dataset.padded_batch(4, padded_shapes=[None])

iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()

print(sess.run(next_element))  # ==> [[0, 0, 0], [1, 0, 0], [2, 2, 0], [3, 3, 3]]
print(sess.run(next_element))  # ==> [[4, 4, 4, 4, 0, 0, 0],
                               #      [5, 5, 5, 5, 5, 0, 0],
                               #      [6, 6, 6, 6, 6, 6, 0],
                               #      [7, 7, 7, 7, 7, 7, 7]]

The Dataset.padded_batch() transformation allows you to set different padding for each dimension of each component, and the padding may be variable length (signified by None in the example above) or constant length. You can also override the padding value, which defaults to 0.
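As a small sketch of that override (reusing the variable-length dataset from above, which is an assumption rather than part of the original snippet), the padding_values argument can be used to pad with -1 instead of the default 0:

# Sketch: same variable-length dataset as above, padded with -1 instead of 0.
dataset = tf.data.Dataset.range(100)
dataset = dataset.map(lambda x: tf.fill([tf.cast(x, tf.int32)], x))
dataset = dataset.padded_batch(4, padded_shapes=[None],
                               padding_values=tf.constant(-1, dtype=tf.int64))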

IV. Training process

1. Processing multiple epochs

The Dataset API offers two main ways to process multiple epochs of the same data.

The simplest way to iterate over a dataset for multiple epochs is to use Dataset.repeat(). For example, to create a dataset that repeats its input for 10 epochs:

filenames = ["/var/data/file1.tfrecord", "/var/data/file2.tfrecord"]
dataset = tf.data.TFRecordDataset(filenames)
dataset = dataset.map(...)
dataset = dataset.repeat(10)
dataset = dataset.batch(32)

Applying the Dataset.repeat() transformation with no arguments repeats the input indefinitely. The Dataset.repeat() transformation concatenates its repetitions without signaling the end of one epoch and the beginning of the next.
If you want to receive a signal at the end of each epoch, you can write a training loop that catches the tf.errors.OutOfRangeError raised at the end of the dataset. At that point you can collect some statistics (for example, the validation error) for the epoch.

filenames = ["/var/data/file1.tfrecord", "/var/data/file2.tfrecord"]
dataset = tf.data.TFRecordDataset(filenames)
dataset = dataset.map(...)
dataset = dataset.batch(32)
iterator = dataset.make_initializable_iterator()
next_element = iterator.get_next()

# Compute for 100 epochs.
for _ in range(100):
  sess.run(iterator.initializer)
  while True:
    try:
      sess.run(next_element)
    except tf.errors.OutOfRangeError:
      break

  # [Perform end-of-epoch calculations here.]
2. Randomly shuffling input data

The Dataset.shuffle() transformation randomly shuffles the input dataset using an algorithm similar to tf.RandomShuffleQueue: it maintains a fixed-size buffer and chooses the next element uniformly at random from that buffer.

filenames = ["/var/data/file1.tfrecord", "/var/data/file2.tfrecord"]
dataset = tf.data.TFRecordDataset(filenames)
dataset = dataset.map(...)
dataset = dataset.shuffle(buffer_size=10000)
dataset = dataset.batch(32)
dataset = dataset.repeat()
3. Using high-level APIs

The tf.train.MonitoredTrainingSession API simplifies many aspects of running TensorFlow in a distributed setting. MonitoredTrainingSession uses tf.errors.OutOfRangeError to signal that training has completed, so to use it with the Dataset API we recommend Dataset.make_one_shot_iterator(). For example:

filenames = ["/var/data/file1.tfrecord", "/var/data/file2.tfrecord"]
dataset = tf.data.TFRecordDataset(filenames)
dataset = dataset.map(...)
dataset = dataset.shuffle(buffer_size=10000)
dataset = dataset.batch(32)
dataset = dataset.repeat(num_epochs)
iterator = dataset.make_one_shot_iterator()

next_example, next_label = iterator.get_next()
loss = model_function(next_example, next_label)

training_op = tf.train.AdagradOptimizer(...).minimize(loss)

with tf.train.MonitoredTrainingSession(...) as sess:
  while not sess.should_stop():
    sess.run(training_op)

Dataset.make_one_shot_iterator() is also recommended when using a Dataset in the input_fn of a tf.estimator.Estimator. For example:

def dataset_input_fn():
  filenames = ["/var/data/file1.tfrecord", "/var/data/file2.tfrecord"]
  dataset = tf.data.TFRecordDataset(filenames)

  # Use `tf.parse_single_example()` to extract data from a `tf.train.Example`
  # protocol buffer, and perform any additional per-record preprocessing.
  def parser(record):
    keys_to_features = {
        "image_data": tf.FixedLenFeature((), tf.string, default_value=""),
        "date_time": tf.FixedLenFeature((), tf.int64, default_value=0),
        "label": tf.FixedLenFeature((), tf.int64,
                                    default_value=tf.zeros([], dtype=tf.int64)),
    }
    parsed = tf.parse_single_example(record, keys_to_features)

    # Perform additional preprocessing on the parsed data.
    image = tf.image.decode_jpeg(parsed["image_data"])
    image = tf.reshape(image, [299, 299, 1])
    label = tf.cast(parsed["label"], tf.int32)

    return {"image_data": image, "date_time": parsed["date_time"]}, label

  # Use `Dataset.map()` to build a pair of a feature dictionary and a label
  # tensor for each example.
  dataset = dataset.map(parser)
  dataset = dataset.shuffle(buffer_size=10000)
  dataset = dataset.batch(32)
  dataset = dataset.repeat(num_epochs)
  iterator = dataset.make_one_shot_iterator()

  # `features` is a dictionary in which each value is a batch of values for
  # that feature; `labels` is a batch of labels.
  features, labels = iterator.get_next()
  return features, labels
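As a usage sketch (the estimator construction below is hypothetical and not part of the original text), the function above can be passed directly to an Estimator's train() method:

# Hypothetical usage: `my_model_fn` is an assumed model function, not defined here.
estimator = tf.estimator.Estimator(model_fn=my_model_fn)
estimator.train(input_fn=dataset_input_fn)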
