The year of "Big Data" has been widely publicized as one of cloud computing's biggest stories, with major announcements from Amazon, Google, Heroku, IBM and Microsoft. What is far less widely known, however, is which public cloud provider offers the most complete Apache Hadoop implementation.
With more and more enterprises adopting the platform-as-a-service (PaaS) cloud computing model for data warehouse applications, Apache Hadoop and its subcomponents (HDFS, MapReduce, Hive, Pig and others) are ever more clearly becoming the main force in big data analysis. The Apache Foundation's release of the landmark Hadoop v1.0 demonstrates that Hadoop is mature and ready for commercial use in production analytics cloud computing environments.
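For readers new to the model the article keeps referring to, MapReduce can be sketched in a few lines of Python: a map phase emits key/value pairs, a shuffle groups values by key, and a reduce phase aggregates each group. This is a toy in-memory illustration of the programming model only, not Hadoop itself.

```python
from collections import defaultdict

def map_phase(records, mapper):
    """Apply the user-supplied mapper to every input record,
    collecting all (key, value) pairs it emits."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs

def shuffle_phase(pairs):
    """Group all values by key, as Hadoop's shuffle/sort stage does."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    """Apply the user-supplied reducer to each key's list of values."""
    return {key: reducer(key, values) for key, values in groups.items()}

# The classic word count expressed in the model:
def mapper(line):
    return [(word, 1) for word in line.split()]

def reducer(word, counts):
    return sum(counts)

lines = ["big data big analysis", "big cloud"]
result = reduce_phase(shuffle_phase(map_phase(lines, mapper)), reducer)
print(result)  # {'big': 3, 'data': 1, 'analysis': 1, 'cloud': 1}
```

Hadoop's value is running exactly this pattern across thousands of machines, with the shuffle spilling to disk and moving across the network.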
The ability to spin up a highly scalable, ready-to-use Hadoop cluster for managed batch MapReduce processing in a vendor's data center lets enterprise IT departments avoid the capital expenditure of internally owned servers that would sit idle between sporadic jobs. As a result, packaging Hadoop, MapReduce or both as pre-built services has become a necessity for the deep-pocketed PaaS suppliers: Amazon, Google, IBM and Microsoft.
Amazon Web Services' Elastic MapReduce
In April 2009, Amazon Web Services (AWS) became the first to market with Elastic MapReduce (EMR). EMR handles Hadoop cluster provisioning, runs and terminates jobs, and moves data between Amazon EC2 and Amazon S3 (Simple Storage Service). EMR also provides Apache Hive, which builds data warehousing services on top of Hadoop.
Amazon Web Services' Elastic MapReduce console, showing sampled CloudWatch metrics for a job flow.
EMR includes a fault-tolerance mechanism for machine failures; Amazon recommends running only the task instance group on Spot Instances, which keeps the cluster available while taking advantage of Spot's lower cost. AWS did not add Spot Instance support to EMR until August 2011, however.
Amazon charges an EMR surcharge of US$0.015 to US$0.05 per hour on top of standard EC2 instance pricing, scaled by instance size from Small up to the largest cluster compute instances. According to AWS's official description, once you start a job flow, Elastic MapReduce handles Amazon EC2 instance provisioning, security settings, Hadoop configuration and setup, log collection, health monitoring and other hardware-related complexities, such as automatically removing faulty instances from your running job flow. AWS recently released free CloudWatch metrics for EMR instances.
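The surcharge model is easy to estimate: every node in a job flow pays its EC2 hourly rate plus the EMR fee. A toy estimator follows; the $0.08/hour Small-instance EC2 rate is an illustrative assumption, and only the $0.015 to $0.05 surcharge range comes from the figures above.

```python
def emr_hourly_cost(nodes, ec2_rate, emr_surcharge):
    """Estimated hourly cost of an EMR job flow: each node pays
    the base EC2 rate plus the per-instance-hour EMR surcharge."""
    return nodes * (ec2_rate + emr_surcharge)

# Hypothetical figures: a 10-node cluster of Small instances at an
# assumed $0.08/hr EC2 rate, with the minimum $0.015 EMR surcharge.
print(round(emr_hourly_cost(10, 0.08, 0.015), 3))  # 0.95
```

At these assumed rates, a 10-node cluster costs under a dollar per hour, which is the economics that makes sporadic batch analysis attractive versus owned hardware.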
According to Google developer Mike Aizatskyi, all Google teams use MapReduce, which Google first described in 2004. Google released its App Engine MapReduce API as an early experimental release to support running Hadoop 0.20 programs on Google App Engine. In March 2011 the development team released the low-level Files API in SDK v1.4.3, providing a file-like interface over blobs that the open-source, user-space shuffler uses for intermediate results.
Google App Engine MapReduce's shuffle phase, as presented in a Google I/O 2012 session.
The Google App Engine MapReduce API orchestrates map, shuffle and reduce operations through the Google Pipeline API. In a Google I/O 2012 video session, the company presented the current status of App Engine MapReduce. As of spring 2012, however, Google had not changed its "early experimental release" designation. App Engine MapReduce's main audience is Java and Python programmers, not big data scientists and analysts. Its shuffler is limited to data sets of about 100MB, which clearly falls short of big data scale. For larger data sets, you can request access to Google's BigShuffler.
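The 100MB ceiling illustrates why shuffle strategy matters: a purely in-memory shuffler is bounded by RAM, while production shufflers sort bounded-size runs and stream-merge them. The sketch below is a generic external-sort-style shuffle in Python, not Google's or Hadoop's actual implementation; `run_size` stands in for the memory budget.

```python
import heapq
import itertools

def external_shuffle(pairs, run_size=3):
    """Group (key, value) pairs by key without holding everything in
    memory at once: sort fixed-size runs, then stream-merge the runs.
    This is why external shuffles scale past an in-memory size limit."""
    it = iter(pairs)
    runs = []
    while True:
        run = sorted(itertools.islice(it, run_size))  # one bounded-memory run
        if not run:
            break
        runs.append(run)                              # a real system spills runs to disk
    merged = heapq.merge(*runs)                       # streaming k-way merge
    for key, group in itertools.groupby(merged, key=lambda kv: kv[0]):
        yield key, [v for _, v in group]

pairs = [("b", 1), ("a", 1), ("b", 2), ("a", 2), ("c", 1)]
print(list(external_shuffle(pairs)))
# [('a', [1, 2]), ('b', [1, 2]), ('c', [1])]
```

Because the merge is streaming, only one run-sized buffer per run needs to be resident at a time, so capacity is limited by disk rather than memory.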
Heroku's Treasure Data Hadoop add-on
Heroku's Treasure Data Hadoop add-on lets developers use Hadoop and Hive to analyze the logs and events of hosted applications, one of the technology's main use cases. Other Heroku big data add-ons include Cloudant's implementation of Apache CouchDB, MongoHQ and MongoLab (both hosted MongoDB), Redis To Go, Neo4j (a public beta of the Java graph database) and RESTful Metrics. AppHarbor, known as the ".NET Heroku", provides a similar add-on selection, including Cloudant, MongoLab, MongoHQ and Redis To Go, plus the RavenHQ NoSQL database add-on. Neither Heroku nor AppHarbor supports general-purpose Hadoop implementations.
Apache Hadoop in IBM SmartCloud
In October 2011, IBM began offering Hadoop-based data analysis in the form of InfoSphere BigInsights Basic on IBM SmartCloud Enterprise. BigInsights Basic, which can manage up to 10TB of data, is also available as a free download for Linux systems; BigInsights Enterprise Edition is a fee-based download. Both downloadable editions offer Apache Hadoop, HDFS, the MapReduce framework and a complete set of Hadoop subprojects. The downloadable Enterprise Edition adds an Eclipse plug-in for writing text analytics, spreadsheet-style data discovery and exploration tools, and JDBC connectivity to Netezza and DB2. Both editions provide integrated installation and administration tools.
My test-drive of IBM's SmartCloud Enterprise infrastructure as a service (the first and second tutorials) covered the administrative features of the SmartCloud Enterprise free trial released in April 2011. It is not clear from IBM's technical publications which features of the downloadable BigInsights editions are available in its public cloud. IBM's Cloud Computing: IT Pro Community resources page lists only a single image, BigInsights Basic 1.1: Hadoop Master and Data Nodes; an IBM representative confirmed that the SmartCloud version does not include MapReduce or the other Hadoop subprojects. An IBM SmartCloud Hadoop tutorial explains how to configure and test a three-node cluster on SmartCloud Enterprise. Thus, IBM's current BigInsights cloud offering is missing key elements for data analysis.
Apache Hadoop on Microsoft Windows Azure
Microsoft has engaged Hortonworks, the Yahoo! spinoff specializing in Hadoop consulting, to help implement Apache Hadoop on Windows Azure, also called Hadoop on Azure (HoA). Since December 14, 2011, HoA has been in an invitation-only Community Technology Preview (CTP, or private beta).
Before embracing Hadoop, Microsoft relied for big data analysis and processing on Dryad, a distributed computing framework developed by Microsoft Research, and its LINQ to HPC (high-performance computing) add-on. The Hadoop on Azure CTP simplifies MapReduce operations by offering a choice of predefined Hadoop clusters, from small (four compute nodes with 4TB of storage) to extra-large (32 nodes with 16TB). During the CTP there is no charge for compute nodes or storage.
Microsoft offers four sample Hadoop/MapReduce projects: calculating the value of π, running the Terasort and WordCount benchmarks, and demonstrating how to write a MapReduce program for streaming data in C#.
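The streaming sample underscores that Hadoop streaming is language-agnostic: any executable that reads records on stdin and writes tab-separated key/value lines to stdout can serve as a mapper or reducer. Here is a sketch of the WordCount benchmark as a streaming mapper/reducer pair, written in Python rather than the C# of Microsoft's sample; in a real job, each function would be a separate script wired to stdin and stdout by the streaming jar.

```python
from itertools import groupby

def wordcount_mapper(lines):
    """Streaming mapper: emit 'word<TAB>1' for every word seen."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def wordcount_reducer(lines):
    """Streaming reducer: Hadoop delivers input sorted by key, so
    consecutive lines with the same word can be summed with groupby."""
    parsed = (line.rstrip("\n").split("\t") for line in lines)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

# Simulate one map task followed by the framework's sort and one reduce task:
mapped = list(wordcount_mapper(["big data big analysis", "big cloud"]))
print(list(wordcount_reducer(sorted(mapped))))
# ['analysis\t1', 'big\t3', 'cloud\t1', 'data\t1']
```

The `sorted()` call stands in for Hadoop's shuffle/sort between the two phases; the scripts themselves never see unsorted reducer input.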
The HoA team plans to implement new and upgraded features in the 2012 wave of Windows Azure updates. The upgrade should let the team attract more testers to the CTP, and it may include Apache Hadoop for Windows Server 2008 R2, aimed at on-premises or private cloud and hybrid cloud implementations. In late 2011 and early 2012, Microsoft moved aggressively to cut the cost of Windows Azure compute instances and storage; pricing for the Azure release of Hadoop is likely to be competitive with Amazon Elastic MapReduce as well.
Big data means more than Hadoop and MapReduce
"In a world of big data, Hadoop/mapreduce will be a key development framework, but not the only one," said James Kobielus, a Forrester analyst, in a blog post. Microsoft also. NET Framework provides a CTP code-named "Cloud Numerics", which allows the development run team to perform digital intensive computing on large distributed datasets in Windows Azure.
Microsoft Research has also published source code for analyzing Excel data in Windows Azure cloud computing and for its "Daytona" project, an iterative MapReduce implementation. All indications are, however, that open-source Apache Hadoop and its associated subprojects will dominate hosted cloud computing applications for the foreseeable future.
The PaaS vendors that provide the most automated Hadoop, MapReduce and Hive implementations will draw the closest attention from big data scientists and data analysis practitioners. Microsoft's plan to provide an Excel front end for business intelligence (BI) applications should make its big data products more approachable for the growing ranks of self-service BI users. At present, Amazon and Microsoft offer the most complete and automated cloud Hadoop big data analysis services.