Building a Binary Classification Model with Amazon Machine Learning and Amazon Redshift

Tags: SQL client, AWS Management Console


Most decisions in daily life are binary; that is, they can be answered with "yes" or "no". Many questions in business can also be answered in binary form, for example: "Is this transaction fraudulent?", "Will this customer buy this product?" or "Is this user at risk of churning?" In machine learning, we call these binary classification problems. Many business decisions can be improved by accurately predicting the answers to such binary questions. Amazon Machine Learning (Amazon ML) provides a simple, low-cost set of options that help you find answers to these questions quickly and at scale.

In today's article, we will work through an example based on a data set from the Kaggle.com website. It covers click-through rate (CTR) prediction, a common use case in the online advertising industry: predicting how likely a specific user is to click a specific advertisement.

Prepare data for Building Machine Learning Models

It would also be possible to build this model by feeding the data from the Kaggle site directly. However, to make the example more realistic, this time we will use Amazon Redshift as an intermediary. In most cases, the historical event data required to build a machine learning model is already stored in a data warehouse. The combination of Amazon ML and Amazon Redshift lets you query the event data and aggregate, enrich, or otherwise process it to prepare all the data the machine learning model needs. We show several examples of this below.

To complete this tutorial, you need an AWS account, a Kaggle account (to download the data set), an Amazon Redshift cluster, and a SQL client. If you haven't created an Amazon Redshift cluster yet, don't worry: you can apply for a two-month free trial of a single-node dw2.large (now called dc1.large) cluster, which is more than enough for this exercise.

Create an Amazon Redshift cluster

In the AWS Management Console, select US East (N. Virginia) from the region list, and then select Amazon Redshift in the Database section. Select Launch Cluster.

On the Cluster Details page, name the cluster and the database (ml-demo and dev, respectively), and enter the master user name and password.


On the Node Configuration page, define the cluster layout. For the data volume in this example, a single dc1.large node is enough (and keeps you within the Amazon Redshift free trial).


Select Continue, review the settings on the following page, and select Launch Cluster. A few minutes later the cluster becomes available. Select the cluster name to view its configuration.


Make a note of the Endpoint value here; you will need it to connect to the cluster and load the data downloaded from the Kaggle site.

Download and save data

Download the training file from the Kaggle website (the URL is shown in the snippet below) and upload it to Amazon Simple Storage Service (Amazon S3). Because the file is large, we use the AWS command line interface to upload it; the CLI automatically splits large uploads into parts.

# Download the train data from:
# http://www.kaggle.com/c/avazu-ctr-prediction/download/train.csv.gz
# Upload the file to S3
aws s3 cp train.csv.gz s3:///click_thru/input/
You can use a variety of SQL clients to connect to the cluster, such as SQL Workbench or Aginity Workbench. You can also use the psql command from a terminal on a Linux-based EC2 instance.

ssh -i .pem ec2-user@ec2-.eu-west-1.compute.amazonaws.com
psql -h ml-demo.<CLUSTER_ID>.us-east-1.redshift.amazonaws.com -U <USER_NAME> -d dev -p 5439
Create a table in the SQL client to hold all the event data from the Kaggle website. Make sure that each column uses an appropriate data type.

CREATE TABLE click_train (
  id varchar(25) not null,
  click boolean,
  -- the format is YYMMDDHH but defined it as string
  hour char(8),
  C1 varchar(20),
  banner_pos smallint,
  site_id varchar(10),
  site_domain varchar(10),
  site_category varchar(10),
  app_id varchar(10),
  app_domain varchar(10),
  app_category varchar(10),
  device_id varchar(10),
  device_ip varchar(10),
  device_model varchar(10),
  device_type integer,
  device_conn_type integer,
  C14 integer,
  C15 integer,
  C16 integer,
  C17 integer,
  C18 integer,
  C19 integer,
  C20 integer,
  C21 integer
);
In the SQL client, run the COPY command to load the events into the cluster:

COPY click_train FROM 's3:///input/click_thru/train.csv.gz'
CREDENTIALS 'aws_access_key_id=;aws_secret_access_key='
GZIP
DELIMITER ','
IGNOREHEADER 1;
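
If the COPY command fails or loads fewer rows than expected, Amazon Redshift records the details in the stl_load_errors system table. A minimal sketch for inspecting the most recent load errors:

SELECT starttime, filename, line_number, colname, err_reason
FROM stl_load_errors
ORDER BY starttime DESC
LIMIT 10;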

If everything went well, the SELECT query below shows that the table now contains more than 40 million records:

dev=# SELECT count(*) FROM click_train;
  count
----------
 40428967
(1 row)
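
Since this is a binary classification problem, it is also worth checking how the two classes are distributed before training; in click-through data the positive class (clicks) is usually a small minority. A minimal sketch of such a check against the table we just loaded:

SELECT click, COUNT(*) AS events
FROM click_train
GROUP BY click
ORDER BY click;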

Using data from Amazon Redshift to build a machine learning model

In previous articles, we discussed how to build machine learning models from data files stored in S3. Such files are often produced by dumping data from a database with SQL. Because SQL dumps are so common, Amazon ML integrates directly with two popular database sources: Amazon Relational Database Service (Amazon RDS) and Amazon Redshift. This integration speeds up data acquisition and makes it easier to use fresh, "live" data to improve the machine learning model.

To build a machine learning model from data in Amazon Redshift, we must first allow Amazon ML to access Amazon Redshift. Amazon ML then runs an UNLOAD command that exports the results of the query to Amazon S3, and from there starts the next stage of the training process.
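
Amazon ML issues that UNLOAD on your behalf, so you do not need to run it yourself. Purely for illustration, and with hypothetical bucket and credential placeholders, such a statement has roughly this shape:

UNLOAD ('SELECT id, click::int, hour FROM click_train')
TO 's3://<STAGING_BUCKET>/unload/click_train_'
CREDENTIALS 'aws_access_key_id=<ACCESS_KEY>;aws_secret_access_key=<SECRET_KEY>'
DELIMITER ','
GZIP;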

Create a new role named AML-Redshift in the IAM console, and then select Continue.

On the Select Role Type page, select the default role type, Amazon Machine Learning Role for Redshift Data Source.


On the Attach Policy page, select a Policy from the list and click Continue.


Finally, review the setting information of the new Role, copy the Role ARN value for future use, and then select Create.


On the Amazon Machine Learning console, select Create new... Datasource and ML model.


On the Data Input page, select Redshift and enter the relevant information, including the ARN of the role you just created, the cluster name, the database name, the user name, and the password. You also need to specify the SELECT query to use (described below), the name of an S3 bucket, and a folder to serve as the temporary staging location.


In the SQL query, the binary target "click" must be returned as an integer (0 or 1) rather than as false/true, so cast it with ::int. We also recommend shuffling the records with ORDER BY RANDOM() so that the original order of the data does not influence the model.

SELECT
  id,
  -- target field as 0/1 instead of f/t
  click::int,
  hour,
  c1,
  banner_pos,
  site_id,
  site_domain,
  site_category,
  app_id,
  app_domain,
  app_category,
  device_id,
  device_ip,
  device_model,
  device_type,
  device_conn_type,
  c14, c15, c16, c17, c18, c19, c20, c21
FROM click_train
-- Shuffle the records
ORDER BY RANDOM();

On the Schema page of the Amazon ML wizard, you can see that Amazon ML has automatically inferred the schema from the data. At this stage, it is worth reviewing the suggested type for each attribute and changing the numeric fields that really represent category IDs to "Categorical".


On the Target page, select "click" as the Target.


Follow the wizard to the next step and define the row ID (the id field). When you reach the Review page, accept the default settings to create the machine learning model. By default, Amazon ML splits the data: 70% is used for model training and the other 30% for model evaluation.


Because a large number of records must be processed, creating the data sources and the ML model and running the evaluation can take some time. You can monitor the progress in the Amazon ML dashboard.


In the dashboard, you can see that the data source we created earlier is "In progress"; 70% of its content is used as training data and the other 30% for model evaluation. The ML model creation and the evaluation are "Pending", waiting for the data sources to be created. After the entire process has completed, check the model evaluation results.

Evaluate the accuracy of the Machine Learning Model

In previous articles, we discussed how Amazon ML reports the quality of a model through a prediction accuracy metric (a single number) and a graph.

In this binary classification example, the prediction accuracy metric is called AUC (Area Under the Curve). See the Amazon ML documentation for the precise meaning of this important score. In this example, the model scores 0.74:


To understand what this means, look at the visual explanation of the evaluation results that Amazon ML provides. The easiest part to grasp is the overall threshold. Each record receives a prediction score between 0 and 1: the closer the score is to 1, the more likely the answer is "yes"; the closer it is to 0, the more likely it is "no". Depending on the chosen threshold, each evaluated record falls into one of four categories:

· True positive (TP) - correctly classified as "yes"

· True negative (TN) - correctly classified as "no"

· False positive (FP) - incorrectly classified as "yes"

· False negative (FN) - incorrectly classified as "no"


The closer the threshold is to 1, the fewer records are incorrectly classified as "yes", but at the same time more records are incorrectly classified as "no". This is where the business decision comes in. If every record incorrectly classified as "yes" carries a cost, say $1 to show an advertisement, you will want to raise the threshold to avoid those wasted costs. If, on the other hand, every record incorrectly classified as "no" means a missed large sale, say a luxury car with a $1,000 commission, it is wiser to lower the threshold so those opportunities are not lost.
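
To make the trade-off concrete, the expected cost of classification errors at a given threshold can be written as FP × (cost of a wasted ad) + FN × (value of a missed sale). With the purely illustrative figures above ($1 per wasted ad and $1,000 per missed sale), avoiding a single false negative is worth accepting up to a thousand extra false positives, which is why a much lower threshold can make business sense in that scenario.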

You can adjust the threshold by moving the slider left or right, as shown in the screenshot. Moving the slider to the left lowers the threshold: more records are classified as "yes", so fewer records are incorrectly classified as "no", but more are incorrectly classified as "yes". Moving it to the right raises the threshold and has the opposite effect. You can also control the threshold through the four sliders under Advanced metrics below the graph. There is no free lunch, however: changing one of these values changes the other three as well (the four metrics are defined below, and their formulas in terms of TP, TN, FP, and FN are summarized after the list).

· Accuracy - the overall percentage of predictions that are correct. Maximizing accuracy means finding a balance between the two types of errors.

· False positive rate - the percentage of actual negatives ("no") that are incorrectly classified as positive ("yes").

· Precision - the percentage of records predicted as positive that are actually positive. Use it to avoid predicting "yes" for too many records, which wastes money on ads or annoys users with frequent irrelevant pop-ups. In other words, precision measures how well targeted the content you decide to send is, or whether the marketing budget is being spent well. The Wikipedia article on precision and recall provides helpful descriptions and illustrations.

· Recall - the percentage of actual positives that are correctly classified as positive. Use it to avoid predicting "no" for too many records, which makes the company miss sales opportunities. In other words, recall indicates how much of the potentially interested audience the advertising actually reaches. In this example the recall is 0.06, which means that only 6% of the users who would actually click the ads are included in the predicted audience.
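
In terms of the four categories defined earlier, these metrics are computed as follows:

Accuracy            = (TP + TN) / (TP + TN + FP + FN)
False positive rate = FP / (FP + TN)
Precision           = TP / (TP + FP)
Recall              = TP / (TP + FN)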

For example, suppose it is important to reach at least 50% of the users who would actually click the ads, so we set the recall to 0.5. What happens then?


As you can see, the drop in accuracy is modest (from 0.83 to 0.74), but precision takes a sharp dive (from 0.6 to 0.33). This means that only one in three of the users who are shown the ad will actually click it, compared with roughly two in three under the original settings. These changes come purely from adjusting the threshold; the model itself is neither affected nor improved.

You can create additional data sources from Amazon Redshift to improve the machine learning model. For example, you can enrich the data with the day of the week on which each event occurred, or bin the hours of the day into periods such as night, morning, afternoon, and evening based on typical user behaviour (these fields do not exist in the Kaggle data set as separate columns, but they can be derived from the hour field). Let's look at a sample SELECT query to see how to get more out of the Amazon Redshift data source by modifying it:

SELECT
  id,
  click::int,
  -- Calculating the day of the week from the hour string
  date_part(dow, TO_DATE(hour, 'YYMMDDHH')) as dow,
  -- Creating bins of the hours of the day based on common behaviour
  case
    when RIGHT(hour,2) >= '00' and RIGHT(hour,2) <= '05' then 'Night'
    when RIGHT(hour,2) >= '06' and RIGHT(hour,2) <= '11' then 'Morning'
    when RIGHT(hour,2) >= '12' and RIGHT(hour,2) <= '17' then 'Afternoon'
    when RIGHT(hour,2) >= '18' and RIGHT(hour,2) <= '23' then 'Evening'
    else 'Unknown'
  end as day_period
...

To bring other kinds of user information, such as gender or age, into this click-through rate model, you can JOIN the data with other tables in your Amazon Redshift data warehouse.
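
For example, assuming a hypothetical users table in the same data warehouse that stores a gender and an age for each device_id (neither the table nor these columns exist in the Kaggle data set; they are shown only to illustrate the pattern), the training query could be extended with a join like this:

SELECT
  t.id,
  t.click::int,
  t.hour,
  t.banner_pos,
  t.site_category,
  t.app_category,
  t.device_type,
  t.device_conn_type,
  u.gender,      -- hypothetical column
  u.age          -- hypothetical column
FROM click_train t
JOIN users u     -- hypothetical table keyed by device_id
  ON u.device_id = t.device_id
ORDER BY RANDOM();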

Summary

In today's article, we learned when and how to use the binary classification machine learning models provided by Amazon ML. We also discussed how to use Amazon Redshift as the source of the training data, how to select that data, how to cast the target to int to trigger binary classification, and how to shuffle the data with the RANDOM function.

We also looked at the metrics used to score binary classification models, including accuracy, precision, and recall. This knowledge will help you build, evaluate, and tune binary classification models to solve specific problems in your business.

If you have questions or suggestions, please share them in the comments.

Original article:

https://blogs.aws.amazon.com/bigdata/post/TxGVITXN9DT5V6/Building-a-Binary-Classification-Model-with-Amazon-Machine-Learning-and-Amazon-R

Translation: Nuclear Cola


