Chapter III: New TensorFlow Introduction, Processing Feature Columns

Last Update:2018-08-19 Source: Internet

Author: User

1. Overview

A feature column is a bridge between the raw data and the model. In general, training a machine learning model comes down to computing weights and biases, which together determine the shape of the model.

Before feature columns were available in TensorFlow, data had to be sorted by type and processed separately before a model could use it. Feature columns make this data processing work much easier.

2. The function of feature columns

Feature columns mainly handle the preprocessing and feature processing of user data. The technique exists mainly because raw input data is so diverse.

As shown in the following illustration, you specify the inputs to a model through the feature_columns parameter of an Estimator (DNNClassifier for Iris). Feature columns bridge the input data (as returned by input_fn) and the model.

3. Use feature columns

In the example above we define the feature column in the following way.

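import tensorflow as tf

# train_x holds the training features from the Iris example.
# Create one numeric feature column per input feature.
my_feature_columns = []
for key in train_x.keys():
    my_feature_columns.append(tf.feature_column.numeric_column(key=key))
print(my_feature_columns)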

The key statement is:

tf.feature_column.numeric_column(key=key)

To create feature columns, call functions from the tf.feature_column module. This document describes nine of the functions in that module. As the following illustration shows, all nine functions return either a Categorical-Column or a Dense-Column object, except bucketized_column, which inherits from both classes:

Numeric feature columns

Calling tf.feature_column.numeric_column with only the key argument (as shown below) is a good way to specify a numeric value with the default data type (tf.float32) as a model input:

numeric_feature_column = tf.feature_column.numeric_column(key="SepalLength")

To specify a non-default numeric data type, use the dtype parameter. For example:

numeric_feature_column = tf.feature_column.numeric_column(key="SepalLength",
                                                          dtype=tf.float64)

By default, numeric columns create a single value (scalar). Use the shape argument to specify a different shape. For example:

vector_feature_column = tf.feature_column.numeric_column(key="Bowling", shape=10)

matrix_feature_column = tf.feature_column.numeric_column(key="MyMatrix", shape=[10, 5])

Feature Columns

This document describes feature columns in detail. You can think of a feature column as an intermediary between raw data and Estimators. Feature columns are very rich: they let you convert a wide range of raw data into formats that Estimators can use, which makes experimentation easy.

In the premade Estimators example, we used the premade Estimator DNNClassifier to train a model that predicts different types of iris flowers from four input features. That example created only numeric feature columns (of type tf.feature_column.numeric_column). Although numeric feature columns model the lengths of petals and sepals effectively, real-world datasets contain all kinds of features, many of which are not numeric. Some real-world features (such as longitude) are numeric, but many are not.

Input to a Deep Neural Network

What kind of data can a deep neural network operate on? The answer, of course, is numbers (for example, tf.float32). After all, every neuron in a neural network performs multiplication and addition operations on weights and input data. Real-world input data, however, often contains non-numeric (categorical) data. For example, consider a product_class feature that can contain the following three non-numeric values:

kitchenware
electronics
sports

Machine learning models generally represent categorical values as simple vectors in which 1 indicates the presence of a value and 0 indicates its absence. For example, when product_class is set to sports, a machine learning model would usually represent product_class as [0, 0, 1], meaning:

0: kitchenware is absent
0: electronics is absent
1: sports is present
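The one-hot representation described above can be sketched in plain Python. This is an illustration only, not how TensorFlow implements it: the helper name one_hot is hypothetical, and the category order is assumed to follow the list given earlier.

```python
# Illustrative sketch only; `one_hot` is a hypothetical helper, not a TensorFlow API.
PRODUCT_CLASSES = ["kitchenware", "electronics", "sports"]


def one_hot(value, categories):
    """Return a list with 1 at the position of `value` and 0 everywhere else."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if category == value else 0 for category in categories]


print(one_hot("sports", PRODUCT_CLASSES))  # [0, 0, 1]
```

Exactly one element of the result is 1, so the vector length always equals the number of categories.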

Therefore, although raw data can be numeric or categorical, a machine learning model represents all features as numbers.

Feature Columns

Let's take a look at the tf.feature_column functions in more detail.

Numeric Columns

The Iris classifier calls the tf.feature_column.numeric_column function for all input features:

SepalLength
SepalWidth
PetalLength
PetalWidth

Although tf.feature_column.numeric_column provides optional parameters, calling it without any of the optional arguments (as shown below) is a good way to specify a numeric value with the default data type (tf.float32) as the model input:

# Defaults to a tf.float32 scalar.
numeric_feature_column = tf.feature_column.numeric_column(key="SepalLength")

To specify a non-default numeric data type, use the dtype parameter. For example:

# Represent a tf.float64 scalar.
numeric_feature_column = tf.feature_column.numeric_column(key="SepalLength",
                                                          dtype=tf.float64)

By default, numeric columns create a single value (scalar). Use the shape argument to specify a different shape. For example:

# Represent a 10-element vector in which each cell contains a tf.float32.
vector_feature_column = tf.feature_column.numeric_column(key="Bowling", shape=10)

# Represent a 10x5 matrix in which each cell contains a tf.float32.
matrix_feature_column = tf.feature_column.numeric_column(key="MyMatrix", shape=[10, 5])

Bucketized Column

Often, you do not want to feed a number directly into the model. Instead, you split its value into different categories based on numeric ranges. To do this, create a bucketized column. For example, consider raw data that represents the year in which a house was built. Instead of representing the year as a scalar numeric column, we can split it into four buckets.

With the boundaries 1960, 1980, and 2000 used in the code further down, the model will represent these buckets in the following manner:

Before 1960: [1, 0, 0, 0]
1960 to 1979: [0, 1, 0, 0]
1980 to 1999: [0, 0, 1, 0]
2000 or later: [0, 0, 0, 1]

Why would you want to split a number (a perfectly valid model input) into a categorical value? Notice that the categorization turns a single input number into a four-element vector. The model can therefore learn four separate weights rather than just one, and four weights create a richer model than one. More importantly, bucketizing lets the model clearly distinguish between the different year categories, because only one element is set (1) and the other three elements are cleared (0). When we use only a single number (the year) as input, the model can learn only a linear relationship. Bucketizing therefore gives the model additional flexibility that it can use to learn.

The following code demonstrates how to create a bucketized feature:

# First, create a numeric feature column from the raw input.
numeric_feature_column = tf.feature_column.numeric_column("Year")

# Then, bucketize the numeric column using the years as boundaries.
bucketized_feature_column = tf.feature_column.bucketized_column(
    source_column=numeric_feature_column,
    boundaries=[1960, 1980, 2000])
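
To make the effect of the three boundaries concrete, here is a minimal plain-Python sketch of bucketizing. It is not TensorFlow's implementation; the helper name bucketize is hypothetical, and it assumes each bucket includes its left boundary and excludes its right one, so 1960 falls in the second bucket.

```python
import bisect

# Same boundaries as in the bucketized_column example.
BOUNDARIES = [1960, 1980, 2000]


def bucketize(year, boundaries):
    """Return a one-hot list over len(boundaries) + 1 buckets.

    Hypothetical helper: a value equal to a boundary goes into the
    bucket that starts at that boundary.
    """
    index = bisect.bisect_right(boundaries, year)
    vector = [0] * (len(boundaries) + 1)
    vector[index] = 1
    return vector


for year in (1955, 1960, 1985, 2010):
    print(year, bucketize(year, BOUNDARIES))
```

Three boundaries always yield four buckets, which is why the model ends up with a four-element vector for a single year.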

Categorical Identity Column

Categorical identity columns can be seen as a special case of bucketized columns. In a traditional bucketized column, each bucket represents a range of values (for example, from 1960 to 1979). In a categorical identity column, each bucket represents a single, unique integer. For example, suppose you want to represent the integer range [0, 4). In other words, you want to represent the integers 0, 1, 2, or 3. In this case, the categorical identity mapping looks like this:

Call tf.feature_column.categorical_column_with_identity to implement a categorical identity column. For example:

identity_feature_column = tf.feature_column.categorical_column_with_identity(
    key='my_feature_b',
    num_buckets=4)  # Values in [0, 4)
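
As a rough illustration of the identity mapping, the following plain-Python sketch uses the integer itself as the position of the 1 in the one-hot vector. The helper name identity_one_hot is hypothetical, and rejecting out-of-range values with an error is an assumption made for this sketch.

```python
def identity_one_hot(value, num_buckets):
    """Hypothetical helper: the integer value is its own bucket index."""
    if not 0 <= value < num_buckets:
        raise ValueError(f"{value} is outside the range [0, {num_buckets})")
    vector = [0] * num_buckets
    vector[value] = 1
    return vector


print(identity_one_hot(2, num_buckets=4))  # [0, 0, 1, 0]
```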

Categorical Vocabulary Column

We cannot input strings directly into a model. Instead, we must first map the strings to numeric or categorical values. Categorical vocabulary columns provide a good way to represent a string as a one-hot vector. For example:

As you can see, a categorical vocabulary column is like an enumerated version of a categorical identity column. TensorFlow provides two different functions to create categorical vocabulary columns:

tf.feature_column.categorical_column_with_vocabulary_list
tf.feature_column.categorical_column_with_vocabulary_file

categorical_column_with_vocabulary_list maps each string to an integer based on an explicit vocabulary list. For example:

# The key names a feature returned by input_fn().
vocabulary_feature_column = (
    tf.feature_column.categorical_column_with_vocabulary_list(
        key="a feature returned by input_fn()",
        vocabulary_list=["kitchenware", "electronics", "sports"]))
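
The string-to-integer mapping that a vocabulary list defines can be sketched in plain Python as follows. This is only an illustration: the helper name lookup is hypothetical, and returning -1 for a string that is not in the vocabulary is an assumption of this sketch.

```python
VOCABULARY_LIST = ["kitchenware", "electronics", "sports"]

# Each string is assigned its position in the vocabulary list.
STRING_TO_ID = {word: index for index, word in enumerate(VOCABULARY_LIST)}


def lookup(word, default_id=-1):
    """Hypothetical helper: return the integer ID for `word`, or `default_id`."""
    return STRING_TO_ID.get(word, default_id)


print(lookup("electronics"))  # 1
print(lookup("garden"))       # -1 (not in the vocabulary)
```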

The function above is very simple, but it has an obvious drawback: when the vocabulary is long, there is far too much to type. In that case, call tf.feature_column.categorical_column_with_vocabulary_file instead, which lets you keep the vocabulary words in a separate file. For example:

# The key names a feature returned by input_fn().
vocabulary_feature_column = (
    tf.feature_column.categorical_column_with_vocabulary_file(
        key="a feature returned by input_fn()",
        vocabulary_file="product_class.txt",
        vocabulary_size=3))

product_class.txt should contain one line for each vocabulary element. In our example:

kitchenware
electronics
sports
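
To show how a file with one vocabulary element per line turns into a string-to-integer mapping, here is a plain-Python sketch. It is not TensorFlow's file reader; the helper name load_vocabulary is hypothetical, and the file is written to a temporary directory only so that the example is self-contained.

```python
import os
import tempfile


def load_vocabulary(path, vocabulary_size):
    """Hypothetical helper: map the first `vocabulary_size` lines to integer IDs."""
    with open(path, encoding="utf-8") as handle:
        words = [line.strip() for line in handle if line.strip()]
    return {word: index for index, word in enumerate(words[:vocabulary_size])}


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "product_class.txt")
    # One vocabulary element per line, as described above.
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("kitchenware\nelectronics\nsports\n")
    vocabulary = load_vocabulary(path, vocabulary_size=3)

print(vocabulary)  # {'kitchenware': 0, 'electronics': 1, 'sports': 2}
```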

Hashed Column

So far, the examples we have worked with have had very few categories. For example, our product_class example has only 3 categories. Often, however, the number of categories is so large that it is impossible to set aside a separate category for each vocabulary word or integer, because that would consume too much memory. In such cases, we can turn the question around and ask: "How many categories am I willing to have for my input?" In fact, the tf.feature_column.categorical_column_with_hash_bucket function lets you specify the number of categories. For this type of feature column, the model computes a hash value of the input and then uses the modulo operator to place it in one of the hash_bucket_size categories.
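The hash-then-modulo step can be sketched in plain Python. This is an illustration under stated assumptions: TensorFlow uses its own hash function, so SHA-256 here is only a stand-in, and the helper name hash_bucket and the sample words are hypothetical.

```python
import hashlib


def hash_bucket(value, hash_bucket_size):
    """Hypothetical helper: hash the string, then take it modulo the bucket count."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    hashed = int.from_bytes(digest[:8], "big")  # use the first 8 bytes as an integer
    return hashed % hash_bucket_size


WORDS = ["kitchenware", "electronics", "sports", "garden", "toys"]
for word in WORDS:
    print(word, hash_bucket(word, hash_bucket_size=3))
```

Because there are five inputs and only three buckets, at least two of the words must land in the same bucket. That sharing is the price of capping the number of categories.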

In addition, there are several other, more complex feature columns, which are used in relatively few scenarios.
