Big Data Has Ended the "Human Feature Engineering + Linear Model" Paradigm

In 2011 I had just joined Baidu, working at Phoenix Nest on predicting ad clicks with machine learning. Even then I was struck by the crazy growth of the training data over the previous two years. Everyone was passionate about features: each new feature brought an immediate AUC increase and revenue growth. We firmly believed that features were king, that a steady stream of new features would keep arriving, and that the data size would keep doubling. Deeply caught up in this myself, I firmly believed the data would grow at least tenfold over the next two years, and all my work was built around that assumption. Now two years have passed. Looking back, was the prediction correct?

The rapid growth of data put tremendous pressure on model training. In fact, back in 2011 model training was already the main bottleneck keeping new features from going live. With youthful drive and a little knowledge of distributed systems and numerical optimization, I set out to design the next-generation model training system, with the goal of handling ten times the data on the same resources. The project kicked off on Valentine's Day, so it got a playful name: darlin [Tucao 1]. It should be one of the machine learning training systems with the highest utilization at Baidu. An important question: would it, like its predecessor, become a performance bottleneck within two years?

Today, the answer to both questions is no.

[Tucao 1] darlin stands for "distributed algorithm for linear problems". Even more playfully, the core computation module is called heart, the network communication module is called telesthesia, and the data, laid out in a bigtable-like format, is called cake, because it looks like one. During development we joked that once it went live we would hear people saying "darlin" all the time; wouldn't that be fun? Unfortunately, right after the full rollout I left for CMU and never got to enjoy it :)

Let us first discuss features. Features are the raw material of a machine learning system, and their impact on the final model is beyond doubt. If the data is well expressed as features, a linear model can usually achieve satisfactory accuracy. A typical machine learning workflow is: pose the question and collect data; understand the problem and analyze the data; then propose a feature extraction scheme and use a machine learning model to obtain a prediction model. The feature extraction step is feature engineering, and when it is done mainly by humans we call it human feature engineering. For example, suppose we want to build a spam filtering system. We collect a large number of user e-mails and their labels, and we can reasonably believe that a message whose subject or body contains keywords such as "friends", "invoice", or "free promotion" is likely spam. So we construct bag-of-words features, train a model with linear logistic regression, and finally filter out any mail whose predicted spam probability exceeds a certain threshold.
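To make that pipeline concrete, here is a minimal sketch of such a filter in Python with scikit-learn. The toy e-mails and the 0.9 threshold are illustrative assumptions, not details of any production system.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy labeled corpus: 1 = spam, 0 = ham (made up for illustration).
emails = [
    "free promotion invoice inside, act now",
    "invoice for your free promotion, dear friends",
    "meeting notes for tomorrow's project review",
    "lunch on friday? let me know",
]
labels = [1, 1, 0, 0]

# Bag-of-words features: one column per word in the training vocabulary.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)

# Linear logistic regression on top of the bag-of-words features.
model = LogisticRegression()
model.fit(X, labels)

def is_spam(text, threshold=0.9):
    """Filter mail whose predicted spam probability exceeds a threshold."""
    p = model.predict_proba(vectorizer.transform([text]))[0, 1]
    return p > threshold

print(is_spam("free invoice promotion just for you"))
```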

Is that all? No. Feature engineering is a long-term process. To improve feature quality, we must keep proposing new features. For example, by analyzing bad cases we soon discover that a messy style with lots of colored text and images strongly suggests spam, so we add a style feature. Through brainstorming, we reason that if someone who has long used Chinese receives a Russian e-mail, the mail is probably abnormal and can be filtered out directly, so we add a character-encoding feature. Then, by searching hard, buying data, or asking around, we obtain a database of known-unsafe IPs and sender addresses, which gives us an unsafe-source feature, as sketched below. Through continuous feature optimization, system accuracy and coverage keep rising, which in turn drives us to keep pursuing new features.
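Each of these ideas simply becomes extra columns appended to the feature matrix, after which the same linear model is retrained. A minimal sketch, with hypothetical style, encoding, and blacklist signals standing in for the real ones:

```python
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

UNSAFE_SENDERS = {"promo@spam.example"}  # hypothetical unsafe-source database

def extra_features(email):
    """Handcrafted features for one e-mail (a dict), one column per idea."""
    return [
        float(email["n_images"] > 3),                        # messy style
        float(email["charset"] not in ("utf-8", "gb2312")),  # odd encoding
        float(email["sender"] in UNSAFE_SENDERS),            # unsafe source
    ]

emails = [
    {"text": "free invoice promotion", "n_images": 9,
     "charset": "koi8-r", "sender": "promo@spam.example"},
    {"text": "notes from today's meeting", "n_images": 0,
     "charset": "utf-8", "sender": "alice@work.example"},
]
labels = [1, 0]

texts = [e["text"] for e in emails]
vectorizer = CountVectorizer().fit(texts)
bow = vectorizer.transform(texts)                        # bag-of-words columns
extra = csr_matrix([extra_features(e) for e in emails])  # handcrafted columns

X = hstack([bow, extra])  # old and new features side by side
model = LogisticRegression().fit(X, labels)
```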

From this it can be seen that feature engineering is built on an ever-deepening understanding of the problem and on access to additional data sources. The problem, however, is that the categories of features people can abstract from the data are usually very limited. Take ad click prediction, the problem that ad companies have studied most thoroughly: it can now be fully summarized on a single slide. There are simply not many data sources that are well understood, easy to use, and clean. Advertising amounts to little more than the ad itself (title, text, style), advertiser information (industry, location, reputation), and user information (personal attributes such as gender, age, and income; click data such as cookies and sessions). The ad click prediction data Tencent provided for KDD Cup 2012 covers many of these. So the number of feature classes ultimately obtainable is just a few hundred. Another example: each sample in Google's dataset contains on average no more than 100 features, from which one can infer that their number of feature classes is at most a few hundred as well.

[Figure 1]

Therefore, acquiring new data sources and new features becomes harder and harder. Yet the accuracy of the model does not increase linearly with the number of features; in many cases each further gain requires exponentially more features. As human feature engineering deepens, more and more manpower and time are invested, while the new features deliver smaller and smaller improvements to the system. Eventually, system performance seems to stop growing. Robin once asked my boss: "Can machine learning keep bringing benefits to Baidu?" My first reaction at the time was: what a businessman's question! Thinking about it now, Robin was very far-sighted.

Another example is IBM's Watson. As the figure below shows, although nearly every performance improvement came from introducing new data and new features, the improvements became smaller and harder to obtain.

[Figure 2: Watson's performance gains from new data and features, diminishing over time]

This answers the first question: why the number of features grew far less than originally expected. A feature team of five experienced veterans and ten hands-on junior engineers can explore most of the plausible feature classes in a few months, and then spend one or two years getting all of those features into the system. And after that? The team finds itself running out of steam and settles into a middle-aged plateau.

Now let us discuss model training. Since I do not want to be hunted across borders, let me use Google's Sibyl as the example instead. Sibyl is a linear classifier that supports many common losses, such as logistic loss, square loss, and hinge loss. It can also apply an L2 penalty, or an L1 penalty to obtain a sparse model. Sibyl is run hundreds of times a day and is widely used in Google's Search, Gmail, YouTube, and other applications. Since these applications are directly tied to user experience and revenue, one usually needs a model that converges to high accuracy and to a stable convergence point. For a model with tens of billions of terms, insufficient convergence means that even if only a few feature weights carry large errors, bad cases arise easily. The applications also hope that the model's output stays stable over time.
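Sibyl itself is proprietary, but the model family described here, one linear classifier with swappable losses and an L1 or L2 penalty, is easy to sketch with scikit-learn's SGDClassifier (the loss names below follow recent scikit-learn versions):

```python
from sklearn.linear_model import SGDClassifier

def make_linear_model(loss="log_loss", penalty="l2", alpha=1e-4):
    """One linear classifier, configurable loss and penalty.

    loss: "log_loss" (logistic), "squared_error", or "hinge"
    penalty: "l2", or "l1" for a sparse model
    """
    return SGDClassifier(loss=loss, penalty=penalty, alpha=alpha,
                         max_iter=1000, tol=1e-6)

logistic_l2 = make_linear_model("log_loss", "l2")
hinge_sparse = make_linear_model("hinge", "l1")  # sparse linear SVM
```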

Sibyl uses parallel boosting, while darlin uses a hackier algorithm. Later, after hearing about the algorithms used at LinkedIn, Yahoo, and Facebook, I carefully surveyed some old optimization papers and found that although everyone calls them by different names, they are in fact equivalent [Tucao 2]. With the right algorithm, it typically takes only a few dozen iterations to converge to the required accuracy. As the amount of data increases, the number of iterations drops further; in the online/incremental setting, it drops further still.
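As a small single-machine illustration of that convergence claim (not darlin's or Sibyl's actual distributed algorithm), a quasi-Newton solver such as L-BFGS on a smooth, L2-regularized logistic loss typically reaches tight tolerance in a few dozen iterations; the synthetic data below is made up for the demonstration.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

# Synthetic linearly separable-ish data with labels in {-1, +1}.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
w_true = rng.normal(size=20)
y = np.sign(X @ w_true + 0.1 * rng.normal(size=1000))

def loss_and_grad(w, lam=1e-3):
    """L2-regularized logistic loss and its gradient."""
    margins = y * (X @ w)
    loss = np.mean(np.logaddexp(0.0, -margins)) + 0.5 * lam * (w @ w)
    p = expit(-margins)                          # sigmoid(-margin)
    grad = -(X.T @ (y * p)) / len(y) + lam * w
    return loss, grad

res = minimize(loss_and_grad, np.zeros(20), jac=True, method="L-BFGS-B",
               options={"gtol": 1e-8})
print(res.nit, "iterations to converge")         # typically a few dozen
```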
