First, the barrier to entry for machine learning has dropped even lower.
The first chapter of this book describes the Alibaba Cloud Machine Learning Platform: "The Alibaba Cloud Machine Learning Platform is built on the Alibaba Cloud MaxCompute computing platform and integrates data processing, modeling, offline prediction, and online prediction. The machine learning algorithm platform lets users run experiments by dragging visual components, so that engineers without a machine learning background can easily get started with data mining."
This is only partly true. What is right is that the ease of use of a machine learning platform is indeed very important. What is wrong is that such a platform only solves the tooling problem. In most cases, data mining means doing business analysis, processing data and analyzing data rather than choosing algorithms and running through a process, so the convenience of a visual platform does little to reduce the cost of machine learning. Otherwise, why would we need so many data modelers?
Compared with SAS, SPSS and similar tools, the Alibaba Cloud Machine Learning Platform has its own strengths in ease of use, algorithm coverage and data processing, and even some advantages, because it is backed by the MaxCompute platform. I believe that in any enterprise, as long as the business personnel understand basic data concepts, they can get started with this platform very easily, which reflects the ambition of the Alibaba Cloud Machine Learning Platform in the enterprise market.
How simple is it to use? Looking at the diagram below, any machine learning task can be described as a simple workflow. The steps are clear and concise, and most workflows look very similar.
- Discrete value feature analysis: analyzes the relationship between discrete variables and labels. Alibaba Cloud provides many methods for variable analysis, such as histograms.
- Split: splits the data set into training and test sets.
- Random Forest: the algorithm chosen here. I looked at what is available: logistic regression, naive Bayes, GBDT, text analysis (such as LDA), collaborative filtering and other algorithms. TensorFlow is also supported, but it is merely wrapped as a package.
- Prediction: verifies the model with the test set data.
- Evaluation: traditional evaluation methods such as ROC and AUC.
Almost all operations are done by dragging and configuring, which is convenient. People who don't understand machine learning can learn the whole machine learning process through this platform, so it is quite good for getting started. Even those who do understand machine learning can use it to broaden their horizons and speed up model validation.
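The split, train, predict and evaluate steps listed above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the data is synthetic, the function names are made up for this example, and the model is a deliberately trivial one-feature midpoint rule standing in for Random Forest. It is not the platform's implementation.

```python
import random

def train_test_split(rows, test_ratio=0.3, seed=0):
    """Split step: shuffle the rows and cut them into training and test sets."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]

def fit_midpoint_model(train):
    """Training step (stand-in for Random Forest): learn the midpoint between
    the two class means and which side the positive class lies on."""
    pos = [x for x, y in train if y == 1]
    neg = [x for x, y in train if y == 0]
    mean_pos, mean_neg = sum(pos) / len(pos), sum(neg) / len(neg)
    midpoint = (mean_pos + mean_neg) / 2
    direction = 1.0 if mean_pos >= mean_neg else -1.0
    # A score above 0 means "predict label 1".
    return lambda x: direction * (x - midpoint)

def auc(scores, labels):
    """Evaluation step: AUC as the probability that a random positive example
    gets a higher score than a random negative one (ties count as half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Synthetic data (an assumption for this sketch): one numeric feature per row.
data_rng = random.Random(42)
rows = [(data_rng.gauss(1.0, 1.0), 1) for _ in range(200)]
rows += [(data_rng.gauss(-1.0, 1.0), 0) for _ in range(200)]

train, test = train_test_split(rows)      # Split
model = fit_midpoint_model(train)         # Train
scores = [model(x) for x, _ in test]      # Prediction on the test set
labels = [y for _, y in test]
accuracy = sum((s > 0) == (y == 1) for s, y in zip(scores, labels)) / len(test)
test_auc = auc(scores, labels)            # Evaluation
print(f"accuracy = {accuracy:.3f}, AUC = {test_auc:.3f}")
```

On a visual platform each of these functions corresponds to one dragged component. The code only makes explicit what data flows between them.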
Second, business people welcome new opportunities
The author starts from one premise: to do data mining well, you first need a certain amount of business experience before the data model can be effective. Business understanding and data preparation account for more than 70% of the time in data mining. When an outside expert fails (the proverbial visiting monk who cannot chant the scriptures well), it is often not because the algorithm does not work but because the business and the data are poorly understood. People who are proficient in the business are therefore already at least half a professional data miner.
What business personnel lack is some IT skills. In the past, the remaining 30% was not easy to master. For example, a business person who wanted to run a logistic regression might first have to learn a programming language, which was a fairly big challenge. Now, with the help of this kind of easy-to-use machine learning tool, they are likely to take data analysis to a new stage based on their rich business experience.
At present, business people in some enterprises have begun to extract, analyze and mine data on their own, but most enterprises still go through a request process or a project. This is still quite controversial, but I believe that as big data is applied more deeply, its inherent need for innovation and iteration may lead to a gradual transformation of business personnel, or to large adjustments in the organization, such as data miners reporting directly to the business departments.
IT personnel should focus on developing and improving middle-platform capabilities such as the machine learning platform, improve the platform experience, and do everything possible to get business personnel to use these platforms. This may be the right posture for IT in the future, and it is also a win-win situation.
Nowadays, IT staff in many enterprises are engaged in data mining and data extraction. They sit between IT, data and business. From the perspective of efficiency, moving them to the business department is also a good idea.
Third, the opportunity of the data warehouse modeler
The author believes that easy-to-use machine learning platforms like this will become more and more common, which means that the barrier to using general algorithms will drop and the value of engineers who only know a few algorithms will depreciate in the enterprise.
The lower barrier to machine learning algorithms has increased the value of data warehouse modelers. As machine learning needs increase, data understanding, data cleaning and data preparation in the early stages of machine learning become more important. Whoever can understand the business in depth and design a useful data model for a data mining middle platform (the data model here is similar to data warehouse modeling) will greatly reduce the cost of data mining.
In fact, the author did not approve of a data mining middle platform in the past, but now feels that it is to some extent necessary. One reason is of course the growth of machine learning needs, which gives a shared data middle layer its value. The other is that the current data warehouse model does not support many data mining scenarios very well. The team's data miners each work in isolation, and good variable designs are never accumulated for reuse.
The following is Alibaba's data preparation case for e-commerce purchase prediction. I think such features need to be designed systematically by people with business and data experience. Having each individual prepare them ad hoc is too costly, and an individual is also unlikely to think of everything.
What features affect whether a user purchases a brand?
The first is the user's attention to the brand, such as clicks, purchases, favorites and additions to the shopping cart. Among these, the more recent the behavior, the more likely the user is to buy, so we focus on the last 3 days, the last week, the last month, the last 2 months, the last 3 months, and the entire recorded history. This gives the following features.
Number of clicks, purchases, favorites, and cart additions in the last 3 days
Number of clicks, purchases, favorites, and cart additions in the last week
Number of clicks, purchases, favorites, and cart additions in the last month
Number of clicks, purchases, favorites, and cart additions in the last 2 months
Number of clicks, purchases, favorites, and cart additions in the last 3 months
Total number of clicks, purchases, favorites, and cart additions
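A minimal sketch of how such windowed counts could be computed for one user-brand pair. The event-log layout, the action names, the feature names and the window lengths in days (for example 30 days for a month) are all assumptions made for illustration. The source does not specify them.

```python
from collections import Counter
from datetime import date, timedelta

# Hypothetical event log rows: (user_id, brand_id, action, event_date).
ACTIONS = ("click", "purchase", "favorite", "cart")
# Look-back windows in days; None means the entire recorded history.
WINDOWS = {"3d": 3, "1w": 7, "1m": 30, "2m": 60, "3m": 90, "all": None}

def window_counts(events, user_id, brand_id, today):
    """Count each action of one user on one brand within each look-back window."""
    features = {}
    for window_name, days in WINDOWS.items():
        counts = Counter()
        for user, brand, action, day in events:
            if user != user_id or brand != brand_id:
                continue
            age = (today - day).days      # 0 means the event happened today
            if age < 0:
                continue                  # ignore events dated after `today`
            if days is None or age < days:
                counts[action] += 1
        for action in ACTIONS:
            features[f"{action}_{window_name}"] = counts[action]
    return features

today = date(2024, 6, 30)
events = [
    ("u1", "b1", "click", today - timedelta(days=1)),
    ("u1", "b1", "click", today - timedelta(days=5)),
    ("u1", "b1", "purchase", today - timedelta(days=5)),
    ("u1", "b1", "favorite", today - timedelta(days=40)),
    ("u1", "b2", "click", today - timedelta(days=1)),   # different brand
    ("u2", "b1", "cart", today - timedelta(days=2)),    # different user
]
features = window_counts(events, "u1", "b1", today)
print(features["click_3d"], features["click_1w"], features["favorite_2m"])
```

Looping over the log once per window keeps the sketch short. On real data volumes this would normally be a grouped aggregation in SQL on the data platform.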
Segmenting attention by time window is not enough. We also want to know the rate of change of these counts, to describe how persistent the attention is. We can construct the following features:
Change rate of clicks in the last 3 days (clicks in the last 3 days / clicks in the 4 to 6 days before now), and likewise the change rates of purchases, favorites, and cart additions
Change rate of clicks in the last week (clicks in the last week / clicks in the week before it), and likewise the change rates of purchases, favorites, and cart additions
Change rate of clicks in the last month (clicks in the last month / clicks in the month before it), and likewise the change rates of purchases, favorites, and cart additions
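The change rate described above is a ratio of two adjacent windows. A small sketch, assuming the counts are available as a per-day list with index 0 meaning today. The function name and the choice to return `None` when the earlier window is empty are assumptions of this example.

```python
def change_rate(daily_counts, window):
    """Ratio of the count in the most recent `window` days to the count in the
    `window` days before that. daily_counts[i] is the count i days ago."""
    recent = sum(daily_counts[:window])
    previous = sum(daily_counts[window:2 * window])
    if previous == 0:
        return None   # undefined: there was no activity in the earlier window
    return recent / previous

# Hypothetical clicks per day for one user-brand pair, most recent day first.
daily_clicks = [2, 1, 3, 1, 1, 1, 0, 4, 0, 0, 0, 0, 0, 0]

rate_3d = change_rate(daily_clicks, 3)   # (2+1+3) / (1+1+1)
rate_1w = change_rate(daily_clicks, 7)   # (2+1+3+1+1+1+0) / (4+0+0+0+0+0+0)
print(rate_3d, rate_1w)
```

A value above 1 means attention is growing, a value below 1 means it is fading. How to encode the undefined case (a sentinel value, smoothing, or a separate flag) is a modeling decision left open here.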
If the user has purchased the brand before, we would like to know how many clicks resulted in a purchase and how many favorites were converted into a purchase, which is the purchase conversion rate. The constructed features are as follows:
Click conversion rate, favorite conversion rate, and cart conversion rate in the last 3 days
Click conversion rate, favorite conversion rate, and cart conversion rate in the last week
Click conversion rate, favorite conversion rate, and cart conversion rate in the last month
Overall click conversion rate, favorite conversion rate, and cart conversion rate
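One plausible reading of these conversion rates is the number of purchases divided by the number of clicks, favorites or cart additions in the same window. The sketch below uses that definition. The exact formula, the dictionary keys and the choice of 0.0 for an empty denominator are assumptions, not something the source states.

```python
def conversion_rates(counts):
    """Purchases per click, per favorite and per cart addition for one window.
    `counts` holds the four action counts for that window."""
    purchases = counts["purchase"]

    def per(denominator):
        # Assumed convention: no underlying actions means a rate of 0.0.
        return purchases / denominator if denominator else 0.0

    return {
        "click_conversion": per(counts["click"]),
        "favorite_conversion": per(counts["favorite"]),
        "cart_conversion": per(counts["cart"]),
    }

# Hypothetical counts for the last week of one user-brand pair.
last_week = {"click": 20, "purchase": 2, "favorite": 4, "cart": 0}
rates = conversion_rates(last_week)
print(rates)
```

The same function can be applied to the counts of each time window, and to the user-level and brand-level counts described next.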
Secondly, we look at the user and construct features that describe the user. The focus is on the user's overall behavior across all the brands the user pays attention to. The user's recent attention to all brands gives the following features:
Number of clicks, purchases, favorites, and cart additions in the last 3 days
Number of clicks, purchases, favorites, and cart additions in the last week
Number of clicks, purchases, favorites, and cart additions in the last month
Number of clicks, purchases, favorites, and cart additions in the last 2 months
Number of clicks, purchases, favorites, and cart additions in the last 3 months
Total number of clicks, purchases, favorites, and cart additions
Click conversion rate, favorite conversion rate, and cart conversion rate in the last 3 days
Click conversion rate, favorite conversion rate, and cart conversion rate in the last week
Click conversion rate, favorite conversion rate, and cart conversion rate in the last month
Overall click conversion rate, favorite conversion rate, and cart conversion rate
Finally, we look at the impact of the brand on its own. Some popular brands receive a lot of attention, and we care most about their recent situation, which gives the following features.
Number of clicks, purchases, favorites, and cart additions in the last 3 days
Number of clicks, purchases, favorites, and cart additions in the last week
Number of clicks, purchases, favorites, and cart additions in the last month
Number of clicks, purchases, favorites, and cart additions in the last 3 months
Total number of clicks, purchases, favorites, and cart additions
Click conversion rate, favorite conversion rate, and cart conversion rate in the last 3 days
Click conversion rate, favorite conversion rate, and cart conversion rate in the last week
Click conversion rate, favorite conversion rate, and cart conversion rate in the last month
Overall click conversion rate, favorite conversion rate, and cart conversion rate
In summary, the feature set for predicting whether a user purchases a brand is composed of features that describe the user's attention to the brand, features that describe the user, and features that describe the brand.
Such a complex set of feature variables should not be rebuilt every time you do machine learning. It should be accumulated and reused. In fact, every enterprise has similar scenarios, but when we design features ourselves it is hard to be this comprehensive, and we tend to build whatever comes to mind. This is where a data mining middle platform shows its value.
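To make the reuse idea concrete, here is a small sketch in which the three feature groups are kept as precomputed lookup tables and joined into one row per user-brand pair. The table contents, key names and prefixes are hypothetical. In practice these would be shared tables on the data platform rather than Python dictionaries.

```python
# Hypothetical precomputed feature tables, maintained once and shared.
user_brand_features = {("u1", "b1"): {"click_3d": 1, "purchase_1w": 1}}
user_features = {"u1": {"click_3d": 7, "click_conversion_all": 0.05}}
brand_features = {"b1": {"click_3d": 950, "purchase_1w": 40}}

def build_feature_row(user_id, brand_id):
    """Join the user-brand, user and brand feature groups into one flat row.
    Prefixes keep features with the same name in different groups apart."""
    groups = (
        ("ub", user_brand_features.get((user_id, brand_id), {})),
        ("u", user_features.get(user_id, {})),
        ("b", brand_features.get(brand_id, {})),
    )
    row = {"user_id": user_id, "brand_id": brand_id}
    for prefix, group in groups:
        for name, value in group.items():
            row[f"{prefix}_{name}"] = value
    return row

row = build_feature_row("u1", "b1")
print(row)
```

Each new modeling task then only needs to pick columns from the joined row instead of recomputing the statistics from the raw logs.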
Fourth, the value of machine learning engineers
After reading Alibaba's book, it feels more like reading the manual of a machine learning platform, and professionals may find it rather low-end. But the author can appreciate the effort spent on the ease of use of the platform. My team is also building something similar, and I know there is still a big gap.
The idea of a data mining middle platform also came to me by chance while reading the cases. Practical business cases have this advantage: they are about one thing, but along the way they reveal a lot of practical know-how. There are many similar examples: how to judge the importance of variables in logistic regression, which I had always misunderstood; the use of dummy features; how KNN and random forest perform in certain scenarios; the explanation of LDA. Because the case is right there, these are easy to grasp intuitively. There is also GBDT, which I had not heard of before. When the team said it was going to use this algorithm, I was completely at a loss.
While reviewing a data mining project with team members this week, they mentioned that replacing the matrix algorithm with GBDT had cost a great deal and taken a long time, yet the results improved only slightly. The author could only smile awkwardly, paying the price for his own ignorance.
Data miners often work hard, yet the results are unconvincing. I think the biggest problems are not understanding what the customer ultimately wants, a narrow field of view, and treating the algorithm as the result. Data miners often say that they toiled at their desks for a month and improved the XX algorithm by XX points, which is very good. My reply is: how much revenue and how many users did it bring in the end?
In fact, the situation differs from company to company. Improving the recommendation algorithm at Tencent is certainly impressive, but in my enterprise the same improvement may be worth nothing, because everyone's starting point is completely different.
In fact, as customers, we don't care about the means at all. What we want is the effect. The means should be simplified wherever possible, and something simpler is often better than a new algorithm. The goal is the greatest benefit at the lowest cost. The Alibaba Cloud Machine Learning Platform aims to cut the cost of that remaining 30% of the time, but that is all it does.
The future is the era of artificial intelligence, and artificial intelligence is gradually being turned into platforms. Today, mastering a deep learning technique may seem very advanced, but its value drops sharply once it is integrated into a platform. Only differentiation is valuable. There are not yet many technical articles on TensorFlow, and we are slowly trying out TensorFlow On Spark. At a moment like this you see where the value lies.