Handle large-scale data with Mahout and Hadoop
Do scale problems in machine learning algorithms have any practical significance? Consider the size of a few problems where you might deploy Mahout.
According to a rough estimate, Picasa hosted 500 million photos three years ago. That implies millions of new photos to process every day. Analyzing one photo is not a big problem in itself, even repeated millions of times. However, the learning phase may require drawing information from billions of photos at once, a computation that cannot be done on a single machine.
Google News reportedly handles about 3.5 million new news articles every day. Although that number may not seem large in absolute terms, consider that in order to deliver those articles in a timely manner, they must be clustered, along with other recent articles, within minutes.
Netflix released 100 million ratings for the Netflix Prize. Since this was only the data published for a contest, the amount of data Netflix must actually process to produce recommendations is presumably many times larger.
Machine learning techniques must be deployed in scenarios like these, where the amount of input data is so large that it cannot be processed entirely on one computer, even a very powerful one. Without an implementation such as Mahout, these would be impossible tasks. This is why Mahout makes scalability its top priority, and why this book focuses, in a way that other books do not, on handling large data sets effectively.
Until now, applying sophisticated machine learning techniques to large-scale problems has been something only large, advanced technology companies could consider. But today's computing power is much cheaper than before and is more easily accessible thanks to the open source framework Apache Hadoop. Mahout aims to complete the puzzle by providing a high-quality, open source implementation built on Hadoop that solves problems at this scale and is available to the entire technical community.
Some parts of Mahout make use of Hadoop, which includes the popular MapReduce distributed computing framework. MapReduce is widely used inside Google (http://labs.google.com/papers/mapreduce.html), and Hadoop is an open source, Java-based implementation of it. MapReduce is a programming paradigm that at first looks odd, or so simple that it is hard to believe in its power. The paradigm casts a problem as follows: the input is a set of key-value pairs; a map function converts these pairs into another set of intermediate key-value pairs; and a reduce function merges, in some way, all the values associated with each intermediate key to produce the output. In fact, many problems can be framed as MapReduce problems, or as a series of them. The paradigm is also fairly easy to parallelize: all of the processing is independent and can therefore be distributed across many machines. Rather than describe MapReduce further here, we recommend readers consult an introductory tutorial, such as the one provided by Hadoop at http://hadoop.apache.org/mapreduce/docs/current/mapred_tutorial.html.
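To make the map and reduce functions concrete, here is a minimal sketch of the classic word-count example, written against Hadoop's org.apache.hadoop.mapreduce Java API. It is an illustration of the paradigm, not Mahout code: the mapper turns each line of text into (word, 1) pairs, and the reducer sums the counts for each word.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

  // map: (byte offset, line of text) -> (word, 1) for every word in the line
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  // reduce: (word, [1, 1, ...]) -> (word, total count)
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }
}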
Hadoop implements the MapReduce paradigm, which is a significant achievement even though MapReduce sounds so simple. Hadoop manages the storage of the input data, the intermediate key-value pairs, and the output; this data can be very large and must be accessible to many worker nodes, not just stored on a single node. Hadoop also manages the partitioning and transfer of data between worker nodes, as well as detecting and recovering from failures of individual machines. Understanding how much work goes on behind the scenes will help prepare you for how relatively complex using Hadoop can be. Hadoop is more than just a library to add to a project: it has several components, each with many libraries, plus (several) standalone service processes that can run on multiple machines. Operating Hadoop is not simple, but investing in a scalable, distributed implementation can pay off later: your data may quickly grow to very large scales, and a scalable implementation keeps your application from falling behind.
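The pieces Hadoop manages for you become visible in the small driver program that submits a job. The sketch below, assuming the WordCount classes from the previous example and input/output paths passed as command-line arguments, configures and submits a job; Hadoop then handles splitting the input, shipping intermediate key-value pairs to reducers, and restarting failed tasks.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCountDriver.class);
    // Map, combine, and reduce phases; the combiner is an optional local pre-reduce.
    job.setMapperClass(WordCount.TokenizerMapper.class);
    job.setCombinerClass(WordCount.IntSumReducer.class);
    job.setReducerClass(WordCount.IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // Input and output locations are placeholders supplied on the command line.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    // Submit the job and block until it finishes; Hadoop distributes the work.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}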
Given that complex frameworks demanding significant computing power are becoming more common, it is not surprising that cloud providers have begun to offer Hadoop-related services. For example, Amazon offers Elastic MapReduce, a service that manages a Hadoop cluster, supplies the computing power, and provides a friendly interface for operating and monitoring large-scale jobs on Hadoop, a task that would otherwise be very complex.