Hadoop MapReduce: A way for data scientists to explore big data


"The key is not in what methods, but in being able to really solve problems using any available tool or method," said Forrester analyst James Kobielus in a blog about Big data. ”


In recent years, driven by the urgent need to solve big data problems, many organizations' data architects have begun to explore new approaches. In short, the traditional databases and business intelligence tools they typically use to analyze enterprise data are no longer up to large-scale data-processing tasks.


To understand this challenge, go back ten years: terabyte-scale enterprise data warehouses were rare at the time. Forrester's analysis reports that by 2009 two-thirds of enterprise data warehouses (EDWs) were in the 1-10 TB range. By 2015, most large organizations will have EDWs of more than 100 TB, and "PB-class EDWs will appear in telecommunications, financial services, and e-commerce."


It should be noted that these large data stores call for new tools and methods for what Kobielus calls "super-scale analysis." He means that analysis of "super-scale" data has four aspects: capacity (from hundreds of TB to PB), speed (up to real-time, second-level delivery), diversity (multi-structured: unstructured or semi-structured), and volatility (a flood of new data sources arriving with new applications, new services, and new social networks).


The rapid development of Hadoop big data technology


In recent years, one of the most closely watched approaches has been Apache Hadoop, an open-source software framework that supports data-intensive distributed analysis across thousands of nodes and petabytes of data.


Its underlying technology stems from work by Google's in-house developers on their search engine. They used it to build useful index data and other "rich" information, and to return results to users in a variety of ways. They called this technology MapReduce, and today's Hadoop is an open-source implementation that data architects can use for compute-intensive deep analysis.
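To make the programming model concrete, here is a minimal word-count sketch against the standard Hadoop MapReduce Java API. The input and output paths are supplied as command-line arguments, and the class names are illustrative:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // The map step: emit (word, 1) for every word in the input split.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // The reduce step: sum the counts collected for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The framework runs the map step in parallel wherever the input blocks live, shuffles and sorts the intermediate pairs, and the reduce step then aggregates them into the final counts.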


Recently, Hadoop specialist Karmasphere conducted a survey of business users. It found that 94% of Hadoop users said they can now perform large-volume data analysis they could not achieve before; 88% said they had improved their level of data analysis; and 82% said they can now work with more data.


Hadoop is still a young technology ecosystem, but it has made great strides over the last year thanks to commercial licensing and support of Hadoop products by startups such as Cloudera and MapR. Its user base is also growing as a result of broad interest from EDW vendors such as EMC, IBM, and Teradata. In June 2010, EMC acquired data warehouse specialist Greenplum; in March 2011, Teradata announced the acquisition of Aster Data; and in May 2011, IBM released its own Hadoop product.


Mark Sears, an architect at the newly formed EMC Greenplum organization, points out that Greenplum attracted EMC with its x86-based, scale-out MPP, shared-nothing design. "Not everyone should adopt this approach, but more and more customers are choosing to," he said.


In many cases, companies are already proficient at analytics, such as discovering purchase trends by mining customer data. In the past, however, rather than querying the entire pool of 5 million customer records, they might have worked with a random sample covering 500,000 of those customers, risking missing important data. With big data methods, it is entirely feasible to query all 5 million records in a relatively efficient way.


Hadoop gains recognition from business users


Although technologists are paying close attention to Hadoop, it is still not widely understood in the business world. In simple terms, Hadoop is designed to run on large numbers of low-end servers, spreading the data across the cluster and using the Hadoop Distributed File System (HDFS) to track where each piece of it lives. Analysis workloads execute inside the cluster, using tools such as Pig and Hive, in a massively parallel processing (MPP) pattern, and the results come back in a unified format.
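As a minimal illustration of the file-system side, the sketch below reads a file back through the HDFS Java API; the NameNode address and file path are hypothetical placeholders:

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode address; in practice this comes from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/transactions/part-00000"); // hypothetical path

        // HDFS stores the file as blocks spread across the cluster;
        // the client reads them back as one continuous stream.
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(fs.open(file)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        fs.close();
    }
}

Higher-level tools such as Pig and Hive ultimately reach the data through this same file-system layer.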


Earlier this year, Gartner analyst Marcus Collins described some current uses of Hadoop: financial services companies use it to detect fraud patterns in credit card transactions; mobile telecom providers use it to spot customer churn patterns; and researchers use it to identify objects observed by telescopes.


He concludes: "Low-end servers and stored sexual price curves, seems to be in more and more enterprises within the budget, to achieve super large capacity data analysis." This technology will bring significant competitive advantages to early adoption of the organization. ”


Among early adopters, Groupon uses Cloudera's Hadoop distribution to analyze transaction data for 70 million registered users worldwide. NYSE Euronext, which operates the New York Stock Exchange and other stock and derivatives markets in Europe and the United States, uses EMC Greenplum to manage ever-growing transaction data averaging 2 TB per day. Meanwhile, American bookseller Barnes & Noble uses Aster Data nCluster (since acquired by Teradata) to understand customers' preferences and buying habits across three sales channels: retail stores, online stores, and e-reader downloads.


Technical gaps in big data analysis


Collins points out that while Hadoop's cost advantage may help its popularity, skills problems remain a concern. "Big data analysis is an emerging area, and neither most organizations nor the broader talent market yet has a sufficient reserve of technical talent," he said.


Technical hurdles in adopting Hadoop include the demands of the MapReduce framework itself (which requires users to develop Java code) and a lack of knowledge about designing Hadoop infrastructure. Another hurdle is the shortage of analysis tools that can be used to examine data held in Hadoop.


Now, however, things are changing. First, established BI tool vendors are adding Apache Hadoop support: one example is Pentaho, which first added Hadoop support in May 2010 and later added support for the EMC Greenplum distribution.


Another sign that Hadoop is becoming mainstream is support from data integration vendors such as Informatica. The most common use of Hadoop today is as the "transform" engine of an ETL (extract, transform, load) process: MapReduce itself is well suited to data preparation tasks, readying data for deeper analysis whether it lands in a traditional RDBMS data warehouse or stays in HDFS/Hive. Correspondingly, Informatica announced in May 2011 that it would integrate its data integration platform with EMC Greenplum.
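As a hedged sketch of that "transform" role, the map-only Hadoop job below parses raw comma-separated records and emits cleaned, tab-separated output; the field layout is hypothetical, and setting the reducer count to zero skips the shuffle entirely:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LogCleanJob {

    // Map-only "transform" step: parse each raw line, drop malformed
    // records, and emit a normalized tab-separated record.
    public static class CleanMapper extends Mapper<Object, Text, Text, NullWritable> {
        private final Text out = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // Hypothetical raw format: timestamp,customerId,amount
            String[] fields = value.toString().split(",");
            if (fields.length != 3) {
                return; // skip malformed lines
            }
            out.set(fields[0].trim() + "\t" + fields[1].trim() + "\t" + fields[2].trim());
            context.write(out, NullWritable.get());
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "log clean");
        job.setJarByClass(LogCleanJob.class);
        job.setMapperClass(CleanMapper.class);
        job.setNumReduceTasks(0); // map-only: no shuffle, no reduce phase
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

With zero reducers the job is essentially a parallel, fault-tolerant file transformer, which is exactly the ETL role described above.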


In addition, Hadoop appliances have appeared, including EMC Greenplum HD (announced in May 2011, an appliance combining a MapR-based Hadoop distribution, the Greenplum database, and standard x86 servers) and the Dell/Cloudera solution (which bundles Cloudera's Hadoop distribution and Cloudera Enterprise management tools on Dell PowerEdge C2100 servers and PowerConnect switches).


Finally, Hadoop is well suited to cloud deployment, so IT teams are likely to run pilot experiments on cloud infrastructure. For example, Amazon offers Amazon Elastic MapReduce, a hosted Hadoop service that runs on Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3). Apache Hadoop also includes tools specifically designed to simplify EC2 deployment and operation.
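As one hedged example, a Hadoop job driver can point its input at an S3 bucket through an s3n:// URI instead of an HDFS path; the bucket name and credential values below are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class S3InputJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder AWS credentials for the s3n:// filesystem.
        conf.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY");
        conf.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY");

        Job job = Job.getInstance(conf, "s3 input job");
        job.setJarByClass(S3InputJob.class);
        // ... set mapper/reducer classes as in the word-count sketch above ...

        // Read input directly from an S3 bucket; write results to HDFS.
        FileInputFormat.addInputPath(job, new Path("s3n://example-bucket/input/"));
        FileOutputFormat.setOutputPath(job, new Path("hdfs:///results/output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

As Collins notes below, the convenience of reading straight from S3 has to be weighed against AWS network and service charges.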


"If the data is already stored in Amazon S3, running Hadoop on EC2 is a great choice," Collins said. "If the data does not exist in Amazon S3, then you must convert: Amazon Web Services (AWS) billing mode contains network charges, Amazon elastic mapreduce fees are attached to regular Amazon EC3 and S3 prices. "Because of the large amount of data required to run large data analysis, the cost of data conversion and execution analysis should be carefully considered," he warns. ”


The bottom line, Kobielus points out, is that "Hadoop is the future of the EDW, and its use in the core EDW architecture of the enterprise is likely to grow over the next 10 years."

