The data warehouse opens its doors to Hadoop


In the big data era, the Hadoop distributed processing architecture brings new life and new challenges to IT, data management, and data analysis teams. As the Hadoop ecosystem develops and expands, enterprises need to be ready for rapid technology upgrades.

Last week, the Apache Software Foundation announced the formal general availability (GA) of Hadoop 2.0, a new version of Hadoop that brings a great deal of change. With HDFS and Java-based MapReduce as its core components, Hadoop's early adopters used it for massive data processing of both structured and unstructured data, from log files to text, from sensor data to social media data.
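
To make that batch model concrete, below is a minimal sketch of the classic word-count job written against the Hadoop 2.x Java MapReduce API. It is illustrative only and not taken from the article; the input and output HDFS paths are placeholders supplied on the command line.

// Classic MapReduce word count against the Hadoop 2.x Java API.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every token in every input line.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts emitted for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // args[0] = HDFS input directory, args[1] = HDFS output directory (placeholders)
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}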

Hadoop 1.0 to 2.0 transition

Hadoop usually runs on clusters of low-cost servers, so it can effectively control the cost of massive data processing and storage. Tony Cosentino, vice president at Ventana Research, said that Hadoop takes a lightweight approach to data processing compared with a traditional relational database architecture, which lets it make the most of new data sources.

But Cosentino notes that the current Hadoop architecture is also limited by its batch mode, which can be likened to a heavy truck with a significant performance bottleneck. Hadoop is not suitable for applications with low-latency requirements; it is better suited to heavy lifting, that is, massive data processing.

Hadoop is suitable for analyzing massive unstructured datasets, usually on the order of terabytes or even petabytes. ScaleOut Software's CEO, William Bain, said that because of Hadoop's batch-oriented nature and large overhead, it is not suitable for real-time analysis of datasets. But by combining Hadoop 2.0 with the new query engines being added by other vendors, the problem can be addressed effectively.

The data warehouse opens its doors to Hadoop

Impetus Technologies' chief architect, Sanjay Sharma, says data warehousing applications also involve massive data processing, which makes them a natural target application for Hadoop. What is the right amount of data? Sharma thinks about 10 TB is the ideal data volume for Hadoop, and if the dataset is complex, that number falls.

Users such as Edmunds.com, a car-shopping site, have deployed Hadoop and related technologies to replace traditional data warehouses. In most enterprises, however, the Hadoop cluster serves as a staging area where data first enters the organization: the data is "filtered" by MapReduce jobs, transformed into traditional relational form, and then imported into a data warehouse or data mart for analysis. This approach also provides flexibility, since raw data can be kept in the Hadoop system and run through ETL processing only when it is needed for analysis.
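
A minimal sketch of that staging pattern, under assumed inputs, might look like the map-only job below: it filters raw log lines and emits tab-delimited records ready for a warehouse bulk loader. The field layout, the "DEBUG" filter, and the class name LogStagingJob are hypothetical, not from the article.

// Map-only MapReduce job: filter and reshape raw logs before warehouse load.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LogStagingJob {

    public static class FilterMapper
            extends Mapper<Object, Text, Text, NullWritable> {
        private final Text out = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // Assumed raw log layout: timestamp, level, user, message (whitespace-separated).
            String[] fields = value.toString().split("\\s+", 4);
            if (fields.length < 4 || "DEBUG".equals(fields[1])) {
                return; // drop malformed or uninteresting records
            }
            // Emit a tab-delimited row suitable for a relational bulk loader.
            out.set(fields[0] + "\t" + fields[1] + "\t" + fields[2] + "\t" + fields[3]);
            context.write(out, NullWritable.get());
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "log staging");
        job.setJarByClass(LogStagingJob.class);
        job.setMapperClass(FilterMapper.class);
        job.setNumReduceTasks(0); // map-only: no aggregation, just filter and reshape
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // raw landing directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // cleansed output for warehouse load
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}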

Sharma describes this deployment as "data downstream processing," while Colin White, of another research firm, sums it up with a more precise term: "business refineries." In a survey published this year, Gartner analysts Mark Beyer and Ted Friedman point out that using Hadoop to collect data and prepare it for data warehouse analysis is currently the most mainstream data analysis practice. More than half of the 272 users surveyed said they planned to do this work in the next 12 months.

From the very beginning, Hadoop has attracted countless software developers to build new tools on its foundation to make up for its many deficiencies, such as HBase (a distributed database), Hive (an SQL-based data warehouse), and Pig (a high-level language for writing MapReduce data analysis programs). Other supporting projects are now Apache projects in their own right, such as Ambari, a Hadoop cluster provisioning, management, and monitoring tool; the NoSQL database Cassandra; and ZooKeeper, a reliable coordination system for large distributed systems.
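
As a brief illustration of the Hive layer mentioned above, the sketch below queries Hive through its standard JDBC driver (HiveServer2). The host, port, table name staged_logs, and columns are assumed placeholders, and credentials are left empty for simplicity.

// Querying Hive over JDBC; SQL is compiled into batch jobs on the cluster.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // Load the HiveServer2 JDBC driver.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "", "");
             Statement stmt = conn.createStatement()) {
            // Aggregate the (hypothetical) staged log table by severity level.
            ResultSet rs = stmt.executeQuery(
                    "SELECT level, COUNT(*) AS cnt FROM staged_logs GROUP BY level");
            while (rs.next()) {
                System.out.println(rs.getString("level") + "\t" + rs.getLong("cnt"));
            }
        }
    }
}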

YARN brings new energy to Hadoop 2.0

Hadoop 2.0, now officially unified under the name Hadoop 2, is coming into view for more and more people. One of its most important additions is YARN (Yet Another Resource Negotiator), an updated resource manager that allows non-MapReduce applications to run on HDFS. In this way, YARN is designed to free Hadoop from its batch-processing limitations while providing backward compatibility with existing application architectures.

Cosentino says YARN is the most important development in Hadoop 2.0, because it allows multiple workloads to run concurrently. Yahoo is a good example: it deployed the Storm complex event processing software on YARN to help filter website user behavior data into its Hadoop clusters.

Hadoop 2 also brings improvements in high availability, with new features that let users create a federated NameNode architecture on HDFS rather than relying on a single node to control the entire cluster. In addition, it adds support for the Windows platform; with the variety of utilities being developed by large vendors, enterprise-level adoption of Hadoop looks promising.
