Today, Apache Hadoop is playing an increasingly important role in managing massive amounts of data. Users including NASA, Twitter, and Netflix rely on the open source distributed computing platform, which has gained broad support as a mechanism for dealing with big data. As the volume of data in enterprise systems grows rapidly, companies are trying to derive value from it, and, recognizing Hadoop's potential, many users are building their own technology to complement the Hadoop stack while continuing to use the existing platform.
Current usage of Hadoop
NASA wants Hadoop to handle huge volumes of data from projects such as the SKA (Square Kilometre Array), a radio telescope whose imagery is expected to generate 700 TB of data per second once it comes online over the next 10 years. NASA senior computer scientist Chris Mattmann says technologies such as Hadoop and Apache OODT (Object Oriented Data Technology) will be used to deal with these massive data loads.
"Twitter is a big client of Hadoop," says Oscar Boykin, a Twitter data expert. All related products that provide customized recommendations to users interact with Hadoop to some extent. "The company has been using Hadoop for four years and has developed scalding." Scalding is a Scala library designed to make it easier to write Hadoop mapreduce. The product is built on top of the cascading Java library to generalize the complexities of Hadoop.
Hadoop's subprojects include MapReduce, HDFS (the Hadoop Distributed File System), and Common. MapReduce is a software framework for processing large datasets on compute clusters, HDFS provides high-throughput access to application data, and Common supplies the utilities that support the other Hadoop subprojects.
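To show the shape of the MapReduce model itself, here is a toy sketch that counts words over in-memory data using plain Scala collections. It mimics the three phases of a Hadoop job (map, shuffle, reduce) but is only an illustration, not the Hadoop API.

    // Toy illustration of the MapReduce model on in-memory data (not the Hadoop API).
    object MiniMapReduce {
      // Map phase: turn each input line into (key, value) pairs.
      def mapPhase(line: String): Seq[(String, Int)] =
        line.split("\\s+").filter(_.nonEmpty).map(word => (word, 1)).toSeq

      // Reduce phase: aggregate all values that share a key.
      def reducePhase(word: String, counts: Seq[Int]): (String, Int) =
        (word, counts.sum)

      def main(args: Array[String]): Unit = {
        val input = Seq("hadoop stores data in hdfs", "mapreduce processes data")
        input
          .flatMap(mapPhase)                 // map: emit (word, 1) pairs
          .groupBy(_._1)                     // shuffle: group pairs by key
          .map { case (w, ps) => reducePhase(w, ps.map(_._2)) }
          .foreach(println)                  // e.g. (data,2), (hadoop,1), ...
      }
    }

In a real Hadoop job, the same three phases run distributed across the cluster, with HDFS holding the input and output.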
Netflix, the film rental service, has started using a Hadoop-related technology, Apache ZooKeeper, for configuration management. "We use it in all kinds of work, such as distributed locks, queues, and leader election, to coordinate service activities," said Jordan Zimmerman, a senior platform engineer at Netflix. "We developed an open source client for ZooKeeper that we call Curator. It is a developer library that wraps the connection to ZooKeeper."
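To make these recipes concrete, here is a minimal sketch of taking a distributed lock through Curator. It uses the package names from Curator's later Apache releases (the library began life under Netflix's namespace), and the connection string and lock path are placeholders.

    import org.apache.curator.framework.CuratorFrameworkFactory
    import org.apache.curator.framework.recipes.locks.InterProcessMutex
    import org.apache.curator.retry.ExponentialBackoffRetry

    object CuratorLockExample {
      def main(args: Array[String]): Unit = {
        // Connect to ZooKeeper, retrying with exponential backoff on failure.
        val client = CuratorFrameworkFactory.newClient(
          "localhost:2181", new ExponentialBackoffRetry(1000, 3))
        client.start()

        // A distributed lock backed by a ZooKeeper znode path (placeholder path).
        val lock = new InterProcessMutex(client, "/locks/my-resource")
        lock.acquire()
        try {
          println("holding the lock; do mutually exclusive work here")
        } finally {
          lock.release()
          client.close()
        }
      }
    }

Curator ships similar high-level recipes for the queues and leader election Zimmerman mentions, so applications do not have to hand-code ZooKeeper's low-level znode and watch logic.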
Rich McKinley, senior data engineer at the social network Tagged, says the company is using Hadoop for data analysis, handling the roughly half a terabyte of new data generated each day. Hadoop is also taking on tasks that exceed the capacity of the Greenplum database, which Tagged still uses. "We want to do more with Hadoop, purely for its scalability," McKinley said.
While everyone praises Hadoop, some users say there are still problems to solve; for example, Hadoop is weak in reliability and job tracking. Tagged's McKinley points to Hadoop's latency. "Getting data in is very fast, but everyone's biggest complaint is that query latency is too high," McKinley said. Tagged currently uses another Hadoop-derived project, Apache Hive, for queries. "Hadoop takes minutes to return results that Greenplum returns in seconds," he said, "but Hadoop is much cheaper than Greenplum."
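Hive accepts SQL-like queries and compiles them into MapReduce jobs, which is where the minutes-long latencies McKinley describes come from. Below is a hedged sketch of running such a query over Hive's JDBC interface; it assumes a reachable HiveServer2 endpoint (available in later Hive releases), and the page_visits table is hypothetical.

    import java.sql.DriverManager

    object HiveQueryExample {
      def main(args: Array[String]): Unit = {
        // Requires the Hive JDBC driver on the classpath and a running HiveServer2.
        Class.forName("org.apache.hive.jdbc.HiveDriver")
        val conn = DriverManager.getConnection(
          "jdbc:hive2://localhost:10000/default", "", "")
        val stmt = conn.createStatement()
        // Hive compiles this query into MapReduce jobs behind the scenes,
        // so expect batch-style latency rather than interactive response times.
        val rs = stmt.executeQuery(
          "SELECT page, COUNT(*) AS views FROM page_visits GROUP BY page")
        while (rs.next()) println(s"${rs.getString(1)}\t${rs.getLong(2)}")
        conn.close()
      }
    }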
Hadoop 2.0 getting ready to ship
Hadoop 1.0, launched in 2011, added strong authentication through Kerberos (a security system developed at MIT) and support for the HBase database. For the upcoming release, Hortonworks CTO Eric Baldeschwieler offered a roadmap for the development of Hadoop technology, including the 2.0 version. (Hortonworks is one of the main backers of Apache Hadoop.)
Hadoop 2.0 entered the beta phase in early 2012. "In this release, the MapReduce layer has been partially rewritten, and all of the storage logic in HDFS has been completely rewritten," Baldeschwieler said. The technology improvements in Hadoop 2.0 center on YARN (next-generation MapReduce) and on features for expansion and innovation. YARN lets users plug in their own computing models rather than being limited to MapReduce. "We hope the community will discover new ways to use Hadoop, including real-time applications and machine learning algorithms," he said, adding that scalability improvements and pluggable storage are also planned. The final release of Hadoop 2.0 is expected to ship later in 2012.