In 2004, Google published a highly influential paper that introduced the MapReduce framework to the world. MapReduce breaks an application down into many parallel computing tasks that process massive datasets across a large number of computing nodes. Today, MapReduce is a widely used infrastructure and programming model in parallel and distributed computing. It is the foundation of Apache Hadoop, and many well-known vendors use it to provide data services to their customers. However, at the recent Google I/O conference in San Francisco, Google revealed that it has abandoned MapReduce and switched to a new cloud analytics system called Cloud Dataflow. Yevgeniy Sverdlik of Data Center Knowledge published an article covering the announcement. The following is an editor's summary of that article.
Google probably abandoned MapReduce because it struggles with the volume of data the company now analyzes. Urs Hölzle, Google's senior vice president of technical infrastructure, said that MapReduce becomes hard to work with once the data volume reaches petabyte scale. Hölzle gave a keynote speech at the Google I/O conference in San Francisco, in which he mentioned that Google had stopped using MapReduce a few years ago.
Google will offer Cloud Dataflow to developers as a service on its cloud platform, and the service does not have the scaling limitations of MapReduce. Hölzle said, "Cloud Dataflow is the result of more than a decade of analysis experience. It will be faster and more scalable than any other system on the market."
"Cloud Dataflow is a fully managed service that is automatically optimized, deployed, managed, and scaled. It makes it easy for developers to create complex pipelines for batch and streaming services using a unified programming model," Hölzle said.
All of these features address what Google sees as the drawbacks of MapReduce: it is difficult to ingest data quickly, it requires many different technologies, batch processing and stream processing are unrelated to each other, and a MapReduce cluster always has to be deployed and maintained.
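For readers unfamiliar with the model under discussion, the following is a minimal single-process sketch of the map, shuffle, and reduce steps, using word counting as the example. It is an illustration only, not Google's or Hadoop's implementation; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and a real framework would run each step in parallel across many nodes.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run the three steps in sequence (hypothetical driver, single process)."""
    mapped = []
    for doc in documents:
        # In a real cluster, each document would be mapped on a different node.
        mapped.extend(map_phase(doc))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    docs = ["the quick brown fox", "the lazy dog", "The fox"]
    print(word_count(docs))
```

Even in this toy version, the whole input has to be available before the shuffle can finish, which hints at why batch jobs written this way do not carry over directly to streaming data.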
Hölzle also presented some other new services on Google's cloud platform in his keynote speech:
- Cloud Save is an API that allows an application to save an individual user's data in the cloud or elsewhere without any server-side code. Users of Google's PaaS (App Engine) and IaaS (Compute Engine) offerings can use this feature to build apps.
- Cloud Debugging allows developers to easily track down bugs in software code deployed on multiple servers in the cloud.
- Cloud Tracing provides latency statistics across different groups (such as the latency of database service calls) and analysis reports.
- Cloud Monitoring is an intelligent monitoring system that is integrated with Stackdriver (a cloud monitoring startup acquired by Google in May). The system monitors cloud infrastructure resources, such as disks and virtual machines, as well as the service levels of Google services and more than a dozen open source software packages not provided by Google.