Consolidating onto the Hadoop Platform in a Smarter Way

Source: Internet
Author: User
Keywords: Hadoop

If you think Hadoop is ready to serve as your comprehensive "single version of the truth" repository, consider the following before you leap.

It is true that Hadoop has rapidly become a core component of the big data strategy at most enterprises. But it is not yet mature enough to completely replace the enterprise data warehouse (EDW). For all of Hadoop's benefits as a consolidation tier for unstructured data, the vast majority of Hadoop environments lack the strong security, availability, and governance that are standard in a mature EDW. These and other typical EDW-grade features are gradually making their way into Hadoop through open source and commercial distributions, but they will still take one to three years to mature.

At this point, it is wiser to use Hadoop as a tactical consolidation platform for specific analytics and as a data source. Most notably, Hadoop has proven itself as a strategic foundation for the big data development "sandbox." This use case is extremely common among early Hadoop adopters: it gives teams of data scientists a petabyte-scalable, consolidated data repository for interactive exploration, statistical correlation, and predictive modeling.

As a major source of valuable unstructured data, such as geospatial, social, and sensor information, Hadoop can play a central role in any big data plan. In this role, Hadoop can effectively complement, rather than replace, the analytics sandboxes that enterprises have implemented to support tools such as IBM SPSS. Those sandboxes tend to focus on managing more traditional structured data from customer relationship management and enterprise resource planning systems. Therefore, Hadoop may not be (and does not have to be) the only comprehensive sandbox for all advanced analytics.

In this sandbox use case, we recommend using Hadoop as a consolidation platform alongside the EDW or operational data store, rather than expecting it to deliver the mature EDW functionality described above. By the same token, integrating the Hadoop platform with a rich library of statistical and mathematical algorithms is imperative in the sandbox use case. You also need to focus on automated sandbox provisioning, fast data loading and integration, job scheduling and coordination, MapReduce modeling and scoring, model management, interactive exploration, and advanced visualization tools.

When you start consolidating more operational analytics onto Hadoop clusters, you may find it much wiser to provision different clusters for different purposes than to load every job onto a one-size-fits-all cluster. For example, the Hadoop Distributed File System (HDFS) may be sufficient for batch MapReduce jobs. Real-time jobs may be better suited to clusters and nodes optimized for HBase, or for other low-latency database technologies with an integrated MapReduce execution engine.
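The idea of sending each job to a purpose-built cluster can be sketched in a few lines. This is only an illustration, not part of any Hadoop API: the cluster names, the `Job` type, and the 5-second threshold are all hypothetical assumptions.

```python
from dataclasses import dataclass

# Hypothetical cluster profiles; the names are illustrative, not real endpoints.
BATCH_CLUSTER = "batch-hdfs-mapreduce"
LOW_LATENCY_CLUSTER = "low-latency-hbase"


@dataclass(frozen=True)
class Job:
    name: str
    max_latency_seconds: float  # how quickly results are needed


def choose_cluster(job: Job, realtime_threshold_seconds: float = 5.0) -> str:
    """Route a job to a purpose-built cluster based on its latency requirement.

    The 5-second default threshold is an arbitrary assumption for illustration.
    """
    if job.max_latency_seconds <= realtime_threshold_seconds:
        return LOW_LATENCY_CLUSTER
    return BATCH_CLUSTER


# Two made-up jobs: one bulk aggregation, one real-time lookup.
jobs = [
    Job("nightly-clickstream-aggregation", max_latency_seconds=6 * 3600),
    Job("fraud-score-lookup", max_latency_seconds=0.2),
]
routing = {job.name: choose_cluster(job) for job in jobs}
```

In practice the routing criteria would likely include data locality, cost, and governance requirements as well as latency; a single threshold is used here only to keep the sketch short.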

Some operational Hadoop deployments may already be part of a larger application consolidation plan and may need an integrated Hadoop/MapReduce runtime to offload analysis tasks from online transaction processing, semantic web, and decision automation environments. In this case, consider integrating production Hadoop clusters with non-Hadoop technologies such as the Resource Description Framework (RDF) triple store in IBM DB2 V10, or with various relational databases, NoSQL databases, and other kinds of databases.

As your enterprise's Hadoop/MapReduce use cases and deployment topologies expand, you may find yourself needing to optimize dedicated clusters or nodes for more granular operational roles. As you consolidate more operational applications onto Hadoop, you can also dedicate specific clusters or nodes to specific data sources and downstream applications. In addition, you can assign dedicated nodes to mission-critical data support functions, such as archiving for e-discovery queries and log correlation for IT root-cause analysis.

Because a growing number of data governance, security, cluster management, and other infrastructure tools optimized for Hadoop are reaching the market, consider thoroughly testing and evaluating these tools on a stand-alone cluster before deploying them in a "single version of the truth" scenario for operational business intelligence. In addition, at a minimum you need to assess the level of integration between the Hadoop platform and the enterprise EDW for bidirectional data exchange. If you have implemented in-database analytics, determine whether each platform can consume the output of model runs executed on the other.
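One simple way to approach the last check is to compare the columns a model run produces on one platform with the columns the other platform expects. The sketch below assumes schemas can be described as plain column-name-to-type mappings; the function name, column names, and type names are hypothetical and not drawn from any particular product.

```python
def find_schema_problems(produced: dict[str, str], expected: dict[str, str]) -> dict[str, str]:
    """Return a problem description for each expected column that the
    produced output lacks or delivers with a different type."""
    problems = {}
    for column, expected_type in expected.items():
        actual_type = produced.get(column)
        if actual_type is None:
            problems[column] = "missing"
        elif actual_type != expected_type:
            problems[column] = f"type {actual_type}, expected {expected_type}"
    return problems


# Hypothetical example: scores produced on Hadoop, consumed by the EDW, and vice versa.
hadoop_model_output = {"customer_id": "string", "churn_score": "double"}
edw_expected_input = {"customer_id": "string", "churn_score": "decimal"}

edw_model_output = {"customer_id": "string", "segment": "string"}
hadoop_expected_input = {"customer_id": "string", "segment": "string"}

hadoop_to_edw = find_schema_problems(hadoop_model_output, edw_expected_input)
edw_to_hadoop = find_schema_problems(edw_model_output, hadoop_expected_input)
```

Running the check in both directions reflects the bidirectional exchange described above. A real assessment would also need to cover data volumes, transfer mechanisms, and type conversion rules, which this sketch ignores.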

Smart consolidation depends on understanding the strengths and limitations of all data analytics platforms, including Hadoop. Consolidating all enterprise data and analytics onto Hadoop may not be the best choice now or in the future, even as Hadoop evolves and encroaches on the EDW and other mature approaches. What matters is deploying each approach to the use cases it suits best in your particular big data environment.
