Lessons learned developing a practical large scale machine learning system

Original: http://googleresearch.blogspot.jp/2010/04/lessons-learned-developing-practical.html

Tuesday, April; posted by Simon Tong, Google

When faced with a hard prediction problem, one possible approach is to attempt to perform statistical miracles on a small training set. If data is abundant, then often a more fruitful approach is to design a highly scalable learning system and use several orders of magnitude more training data.

This notion recurs in many other fields as well. For example, processing large quantities of data helps immensely for information retrieval and machine translation.

Several years ago we began developing a large scale machine learning system, and we have been refining it over time. We gave it the codename "Seti" because it searches for signals in a large space. It scales to massive data sets and has become one of the most broadly used classification systems at Google.

After building a few initial prototypes, we quickly settled on a system with the following properties:

      • Binary classification (produces a probability estimate of the class label)

      • Parallelized

      • Scales to process hundreds of billions of instances and beyond

      • Scales to billions of features and beyond

      • Automatically identifies useful combinations of features

      • Accuracy is competitive with state-of-the-art classifiers

      • Reacts to new data within minutes
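
The post does not say which algorithm Seti uses, so the following is only a minimal sketch of how a few of the listed properties can fit together, not a description of Seti itself. It assumes online logistic regression over hashed features with pairwise feature crosses: hashing keeps memory fixed however many feature names appear, the crosses stand in for "useful combinations of features", and the per-example update is what lets a model react to new data quickly. Every name here (`NUM_BUCKETS`, `expand`, `HashedLogistic`) is hypothetical.

```python
import math
import zlib
from itertools import combinations

NUM_BUCKETS = 2 ** 18  # assumed size of the hashed weight vector


def bucket(name: str) -> int:
    """Map a feature name to a weight index with a deterministic hash."""
    return zlib.crc32(name.encode("utf-8")) % NUM_BUCKETS


def expand(features: list[str]) -> list[int]:
    """Return weight indices for the raw features plus all pairwise crosses."""
    names = list(features)
    names += [a + "&" + b for a, b in combinations(sorted(features), 2)]
    return [bucket(n) for n in names]


class HashedLogistic:
    """Binary classifier that outputs a probability estimate (hypothetical)."""

    def __init__(self, learning_rate: float = 0.1) -> None:
        self.weights = [0.0] * NUM_BUCKETS
        self.learning_rate = learning_rate

    def predict(self, features: list[str]) -> float:
        score = sum(self.weights[i] for i in expand(features))
        score = max(-30.0, min(30.0, score))  # keep exp() well-behaved
        return 1.0 / (1.0 + math.exp(-score))

    def update(self, features: list[str], label: int) -> None:
        # One stochastic gradient step on the logistic loss.
        error = self.predict(features) - label
        for i in expand(features):
            self.weights[i] -= self.learning_rate * error


# Toy XOR-style data: neither raw feature predicts the label alone,
# but each cross (e.g. "a=1&b=0") does.
data = [
    (["a=1", "b=1"], 0),
    (["a=1", "b=0"], 1),
    (["a=0", "b=1"], 1),
    (["a=0", "b=0"], 0),
]

model = HashedLogistic()
for _ in range(200):
    for features, label in data:
        model.update(features, label)
```

On this toy set the raw features alone carry no signal, so whatever the model learns comes from the crossed features. A system at the scale described in the post would also need the weights and the training stream partitioned across machines, which this sketch does not attempt.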

Seti's accuracy appears to be pretty decent. For example, tests on standard smaller datasets indicate that it is comparable with modern classifiers.

Seti has the flexibility to be used on a broad range of training set sizes and feature sets. These sizes are substantially larger than those typically used in academia (e.g., the largest UCI dataset has 4 million instances). A sample of the data sets used with Seti gives the following statistics:


          Training set size    Unique features
Mean      [...] billion        1 billion
Median    1 billion            [...] million


A good machine learning system is all about accuracy, right?

In the process of designing Seti we made plenty of mistakes. However, we made some good key decisions as well. Here are a few of the practical lessons that we learned. Some are obvious in hindsight, but we didn't necessarily realize their importance at the time.

Lesson: Keep it simple (even at the expense of a little accuracy).

Having good accuracy across a variety of domains is very important, and we were tempted to focus exclusively on this aspect of the algorithm. However, in a practical system there are several other aspects of an algorithm that are equally critical:

      • Ease of use: Teams are more willing to experiment with a machine learning system that is easy to use. Those teams are not necessarily die-hard machine learning experts, and so they do not want to waste much time figuring out how to get a system up and running.

      • System reliability: Teams are much more willing to deploy a reliable machine learning system in a live environment. They want a system that is dependable and unlikely to crash or need constant attention. Early versions of Seti had marginally better accuracy on large data sets, but were complex, stressed the network and GFS architecture considerably, and needed constant babysitting. The number of teams willing to deploy these versions was low.

Seti is typically used in places where a machine learning system will provide a significant improvement in accuracy over the existing system. The gains are usually large enough that most teams do not care about the small differences in accuracy between different flavors of algorithms. And, in practice, the small differences are often washed out by other effects such as better data filtering, adding another useful feature, parameter tuning, etc. Teams much prefer having a stable, scalable and easy-to-use classification system. We found that these other aspects can be the difference between a deployable system and one that gets abandoned.

It's perhaps less academically interesting to design an algorithm that's slightly worse in accuracy but that has greater ease of use and system reliability. However, in our experience, it's very valuable in practice.

Lesson: Start with a few specific applications in mind.

It was tempting to build a learning system without focusing on any particular application. After all, our goal was to create a large scale system that would be useful on a wide variety of present and future classification tasks. Nevertheless, we decided to focus primarily on a small handful of initial applications. We believe this decision was useful in several ways:

      • We could examine what the small number of domains had in common. By building something that would work for a few domains, it was likely the resulting system would be useful for others.

      • More importantly, it helped us quickly decide what aspects were unnecessary. We noticed that it is surprisingly easy to over-generalize or over-engineer a machine learning system. The domains grounded our project in reality and drove our decision making. Without them, even deciding how broad to make the input file format would have been harder (e.g., is it important to permit binary/categorical/real-valued features? Multiple classes? Fractional labels? Weighted instances?).

      • Working with a few different teams as initial guinea pigs allowed us to learn about common teething problems, and helped us smooth the process of deployment for future teams.
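
To make the input-format questions above concrete, here is a small sketch of an instance record that permits binary, categorical and real-valued features, fractional labels, and weighted instances. The post does not specify Seti's actual file format; the line layout `label weight feature[:value] ...`, the `Instance` class and `parse_line` are all invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    label: float          # fractional labels allowed: 0.0 <= label <= 1.0
    weight: float = 1.0   # per-instance weight
    features: dict[str, float] = field(default_factory=dict)


def parse_line(line: str) -> Instance:
    """Parse 'label weight feature[:value] ...' (a made-up text format)."""
    label_text, weight_text, *tokens = line.split()
    label = float(label_text)
    weight = float(weight_text)
    if not 0.0 <= label <= 1.0:
        raise ValueError(f"label out of range: {label}")
    if weight < 0.0:
        raise ValueError(f"negative weight: {weight}")

    features: dict[str, float] = {}
    for token in tokens:
        name, sep, value = token.rpartition(":")
        if sep:
            features[name] = float(value)  # real-valued: 'price:19.5'
        else:
            features[token] = 1.0          # binary or categorical indicator
    return Instance(label, weight, features)


example = parse_line("0.7 2.0 is_new country=jp price:19.5")
```

Each option the format permits (here, the optional `:value` suffix, the label range check and the weight column) is something every downstream tool then has to handle, which is the over-generalization risk the bullet above describes.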

Lesson: Know when to say "no".

We have a hammer, but we don't want to end up with bent screws. Being machine learning practitioners, it is very tempting for us to always recommend using machine learning for a problem. We saw very early on that, despite its many significant benefits, machine learning typically adds complexity, opacity and unpredictability to a system. In reality, simpler techniques are sometimes good enough for the task at hand. And in the long run, the extra effort that would have been spent integrating, maintaining and diagnosing issues with a live machine learning system could be spent on other ways of improving the system instead.

Seti is often used in places where there is a good chance of significantly improving predictive accuracy over the incumbent system. And we usually advise teams against trying the system when we believe there is likely to be only a small improvement.

Large-scale machine learning is an important and exciting area of research. It can be applied to many real world problems. We hope that we have given a flavor of the challenges that we face, and some of the practical lessons that we have learned.
