Data Analysis Practice: Problems Encountered and Lessons Learned

Source: Internet
Author: User

When using machine learning algorithms for classification and prediction, the hardest part is improving the accuracy of the model's predictions. Sometimes we work hard to prepare the data set, go through tedious preprocessing and encoding, submit the job to the cluster, and finish training, only to find that the prediction accuracy is speechlessly low. I once trained a binary (0/1) classifier whose accuracy came out at 51.8%. That is almost no different from classifying at random, which is very frustrating.

This is often the point at which data analysis really begins. The data engineer goes back to examine what the data means, how it was processed, and how the results are being evaluated.

Data engineers don't necessarily understand the meaning of the data. We are conducting a research project on operational data analysis, in which we train a model to classify and predict outages by combining product outage data with production line system data. At the beginning of the project we obtained a large amount of production line data, manually screened it into a training set based on historical outage times and data characteristics, and trained a naive Bayes model on Spark. When we ran classification on the test set, most of the samples were misclassified.

Our first suspicion fell on the manually built training set. None of us are professional operations engineers or systems engineers, so we interpreted the outage data and system data according to our own assumptions and labeled each record accordingly. The labels of this training set were clearly unreliable. The most direct and effective way to solve this problem is to talk to the operations engineers, systems engineers, and product development engineers and have them explain the meaning of the data to the data engineers.

However, most projects are in the development phase and cannot spare the effort to gather the core people from every role to explain the data. In that case, what the data engineer can do is remove from the analysis any data features that are not understood. This strategy runs against the big data idea, but it helps eliminate interference from the data and lets us proceed with the following steps more confidently. Fortunately, the research phase is mainly about proving that the method is workable rather than actually solving the problem.

The data preprocessing stage plays a very important role in model training. Feeding well-processed data into the training algorithm not only improves the accuracy of the model but also improves training efficiency. Algorithms place requirements on their input data. For classification models such as logistic regression, naive Bayes, and decision trees, each algorithm makes some inherent assumptions about the distribution and scale of the input data, and the most common assumption is that the features follow a normal distribution. To make the data more consistent with the model's assumptions, each feature can be standardized by subtracting the column mean from each feature value and dividing by the column standard deviation (note: decision trees and naive Bayes are unaffected by feature standardization).

Besides numerical features, categorical features also need processing. The common method is to assign an index to each value of the category and then represent each record by a vector with a 1 at the position of its category index, so that every record carries the category information (one-hot encoding). For example, if the weather category has three values, sunny, rain, and snow, then sunny is encoded as (1,0,0), rain as (0,1,0), and snow as (0,0,1).
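The two preprocessing steps described above can be sketched in plain Python. This is only an illustrative sketch, not the Spark pipeline used in the project: the function names `standardize` and `one_hot`, the sample temperatures, and the choice of the population standard deviation are all assumptions made for the example.

```python
from statistics import mean, pstdev


def standardize(column):
    """Return (x - mean) / standard deviation for one numeric column."""
    mu = mean(column)
    sigma = pstdev(column)  # population standard deviation (an assumption)
    if sigma == 0:
        # A constant column has no spread; map it to zeros to avoid dividing by 0.
        return [0.0 for _ in column]
    return [(x - mu) / sigma for x in column]


def one_hot(values, categories):
    """Encode each value as a 0/1 vector with a single 1 at its category index."""
    index = {cat: i for i, cat in enumerate(categories)}
    encoded = []
    for value in values:
        vec = [0] * len(categories)
        vec[index[value]] = 1
        encoded.append(vec)
    return encoded


temperatures = [10.0, 20.0, 30.0]    # hypothetical numeric feature
weather = ["sunny", "rain", "snow"]  # categorical feature from the text

scaled = standardize(temperatures)
encoded = one_hot(weather, categories=["sunny", "rain", "snow"])

print(scaled)   # values centered on 0 with unit standard deviation
print(encoded)  # [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
```

After standardization the column has mean 0 and standard deviation 1, which puts features measured in different units on a comparable scale.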

In the process of model training, it is often necessary to adjust the parameters according to the test results and to perform iterative training. This involves the question of how to evaluate the performance of the model.

Precision and recall are commonly used to evaluate the correctness and completeness of the results. In a binary classification problem, precision is defined as the number of true positives divided by the sum of true positives and false positives, where a true positive is a sample correctly predicted as class 1 and a false positive is a sample incorrectly predicted as class 1. If every sample the classifier predicts as class 1 really belongs to class 1, precision is 100%. Recall is defined as the number of true positives divided by the sum of true positives and false negatives, where a false negative is a class 1 sample that was predicted as class 0. If no class 1 sample is incorrectly predicted as class 0 (that is, there are no false negatives), recall reaches 100%.
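The two ratios can be computed directly from the counts of true positives (TP), false positives (FP), and false negatives (FN). In the sketch below, TP / (TP + FP) is the metric usually called precision and TP / (TP + FN) is recall. The function names and the label lists are made up for illustration, and returning 0.0 when a denominator is zero is a convention chosen here, not something the text prescribes.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, FN, TN for binary labels where 1 is the positive class."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == 1 and truth == 1:
            tp += 1
        elif pred == 1 and truth == 0:
            fp += 1
        elif pred == 0 and truth == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall(y_true, y_pred):
    """Return (precision, recall); 0.0 is used when a denominator is zero."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical labels: 3 true positives, 1 false positive, 1 false negative.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 1, 0, 0, 1, 0]

precision, recall = precision_recall(y_true, y_pred)
print(precision, recall)  # 0.75 0.75
```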

In general, precision and recall trade off against each other: high precision often comes with low recall, and vice versa. For example, if the model always predicts 1, there are no false negatives and no class 1 sample is missed, so recall is 1.0. On the other hand, the number of false positives will be very high, which means precision is very low. Precision and recall are less useful when measured separately, so they are often combined in the precision-recall (PR) curve. The area under the PR curve is the average precision, and intuitively an area of 1 corresponds to a perfect model. Besides the PR curve, the ROC curve is also used for evaluation. It shows the trade-off between the true positive rate (TPR) and the false positive rate (FPR) of the classifier across different decision thresholds. The underlying principles and application scenarios are well covered elsewhere on the internet.
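The trade-off can be seen by sweeping the decision threshold over a model's scores and recording one point of the PR curve and one point of the ROC curve per threshold. The following is a minimal sketch with made-up labels, scores, and thresholds; `curve_points` is a hypothetical helper, and setting precision to 1.0 when nothing is predicted positive is a convention assumed here.

```python
def curve_points(y_true, scores, thresholds):
    """For each threshold, return precision, recall (= TPR), and FPR."""
    points = []
    for th in thresholds:
        # Predict class 1 when the score reaches the threshold.
        y_pred = [1 if s >= th else 0 for s in scores]
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == 1 and p == 1)
        fp = sum(1 for t, p in pairs if t == 0 and p == 1)
        fn = sum(1 for t, p in pairs if t == 1 and p == 0)
        tn = sum(1 for t, p in pairs if t == 0 and p == 0)
        points.append({
            "threshold": th,
            # Convention (assumed): precision is 1.0 if nothing is predicted positive.
            "precision": tp / (tp + fp) if (tp + fp) else 1.0,
            "recall": tp / (tp + fn) if (tp + fn) else 0.0,
            "fpr": fp / (fp + tn) if (fp + tn) else 0.0,
        })
    return points


# Hypothetical data: true labels and the scores a model assigned to class 1.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]

points = curve_points(y_true, scores, thresholds=[0.0, 0.5, 0.85])
for point in points:
    print(point)
```

With threshold 0.0 every sample is predicted as 1, so recall is 1.0 while precision drops to 0.5 and the false positive rate is 1.0. Raising the threshold to 0.85 pushes precision to 1.0 but recall falls to one third.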
