The fog of integrating SAP with Hadoop

Source: Internet
Author: User

Hadoop is very hot right now, but what exactly is Hadoop? It is not a single piece of software. Hadoop is an Apache Software Foundation project that contains a number of core tools for handling massive data on large compute clusters. Around Hadoop there is a huge ecosystem, including many packaged commercial solutions that we usually call Hadoop distributions, such as those from Cloudera, Hortonworks, IBM, and MapR. Each distribution provides its own combination of tools, and the commercial distributions are generally better suited to enterprise-class big data applications than the bare open-source components.

We want to be clear about one idea: there is no single tool, or fixed set of tools, that can be called "Hadoop," so when a vendor sells you "Hadoop," you need to pay attention. A vendor may provide integration with one or several Hadoop tools, and sometimes with none at all, which is why many users are puzzled by the range of "Hadoop" offerings. SAP, the subject of this article, is one such vendor, and below we will go into the details of how SAP's software integrates with Hadoop.

First, let's define Hadoop. As mentioned above, Hadoop contains a set of core tools, namely:

Hadoop Distributed File System (HDFS) -- a distributed file system that runs on large clusters to store massive amounts of data. The other Hadoop tools fetch the data they process from HDFS, so HDFS is the core component of Hadoop.

YARN (Yet Another Resource Negotiator) -- Hadoop's core cluster resource management framework. It is one of the most important components of Hadoop 2.0, and most (though not all) of the Hadoop ecosystem tools run on a YARN cluster.

MapReduce -- a system for parallel processing of massive data sets, derived from a Google paper. Although it is the most primitive component of Hadoop, it is interesting that many commercial distribution vendors do not use MapReduce directly.
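The programming model behind MapReduce is simple enough to sketch outside Hadoop. The following pure-Python example (an illustration of the model only, not Hadoop's actual Java API; all names are made up) counts words in a tiny set of documents using the same three phases a real job goes through:

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: sum the counts collected for each word."""
    return {word: sum(values) for word, values in grouped.items()}

docs = ["big data on hadoop", "sap meets hadoop"]
counts = reduce_phase(shuffle(map_phase(docs)))
# "hadoop" appears in both documents, so counts["hadoop"] is 2
```

In real Hadoop, the map and reduce phases run in parallel across the cluster, with the shuffle handled by the framework; the logic, however, is exactly this.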

The above are just the core components of Hadoop; the Hadoop ecosystem also includes many utilities, some of which are Apache Software Foundation projects and some of which are other open-source projects. The following tools are hosted by the Apache community:

Hive -- we can think of it as a data warehouse tool for Hadoop. Hive is in effect a distributed database with its own data definition and query language, HQL, which is very similar to standard SQL. Hive tables can be managed entirely by Hive, or they can be defined as "external" tables over data sources such as HDFS and HBase. As a result, Hive is often the data access gateway of the Hadoop ecosystem.
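The "external" table idea is worth a concrete sketch. In the hypothetical HQL below (table name and path invented for illustration), Hive merely layers a schema over files that already sit in HDFS; dropping the table would remove only the metadata, not the files:

```sql
-- Hypothetical example: an external Hive table over existing HDFS files.
CREATE EXTERNAL TABLE web_logs (
  ts      STRING,
  user_id STRING,
  url     STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/logs/web';

-- Queried with HQL just like a managed table:
SELECT url, COUNT(*) AS hits
FROM web_logs
GROUP BY url;
```

It is this SQL-like surface that makes Hive the natural integration point for tools such as SAP's, discussed below.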

Pig -- a platform consisting of a programming language (Pig Latin) and an execution engine for creating data analysis programs.

HBase -- a massively parallel database, also derived from Google's BigTable paper.

The diagram shows which Hadoop tools SAP software products connect to; it depicts only the data access paths and does not go into the technical architecture of each tool.

Other projects include Spark (an in-memory cluster computing and streaming framework), Shark (Hive on Spark), Mahout (a library of analysis algorithms), ZooKeeper (a centralized coordination service), and Cassandra (a database similar to HBase).

So how do SAP's products combine with these Hadoop tools? It differs from product to product. So far SAP has integrated Hadoop functionality into HANA, Sybase IQ, SAP Data Services, and the BusinessObjects Business Intelligence tools, but each takes a different approach.

Both SAP HANA and Sybase IQ support pass-through queries to a remote Apache Hive system, which lets users work with Hive database tables as if they were local. In Sybase IQ this feature is called a "remote database"; in HANA it is implemented through the Smart Data Access mechanism. Sybase IQ also supports a MapReduce API for processing unstructured data, which goes beyond a plain Hive connection between Sybase's database and Hadoop.

SAP BusinessObjects BI can access Apache Hive through its universe concept, in the same way it connects to other databases. Note that, via Hive's external-table concept, such a connection can in theory reach a variety of different storage systems, including HBase, Cassandra, and MongoDB.

The SAP-Hadoop combinations mentioned so far involve only Hive. Integration with Hive is the most common approach: it is how most vendors connect to Hadoop, through HQL. But this is quite different from the deep integration with the Hadoop system that vendors like to depict.

SAP Data Services does more than Hive integration. In addition to importing data from Hive, Data Services can create and read HDFS files directly, and can implement transformations as Pig scripts. That is, data can be combined and filtered directly inside the Hadoop cluster without having to be transferred to a particular server for processing. SAP Data Services can also run text data processing directly on the Hadoop cluster as a MapReduce job. So SAP has a genuine advantage in deep integration with the Hadoop tools.
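To make the "transformations as Pig scripts" idea concrete, here is a hypothetical Pig Latin script of the general kind such a tool might push into the cluster (paths, field names, and the filter condition are all invented for illustration). The filtering and grouping run where the data lives, so only the small result set leaves the cluster:

```pig
-- Hypothetical Pig script: filter and aggregate inside the cluster.
logs   = LOAD '/data/logs/web' USING PigStorage('\t')
         AS (ts:chararray, user_id:chararray, url:chararray);
errors = FILTER logs BY url MATCHES '.*/error.*';
byUser = GROUP errors BY user_id;
counts = FOREACH byUser GENERATE group AS user_id, COUNT(errors) AS n;
STORE counts INTO '/data/out/error_counts';
```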

Finally, the Hadoop ecosystem evolves very quickly, while enterprise software lags behind. Given SAP's product lifecycle, only relatively old versions of Hive, Pig, and HDFS are currently supported, and some newer feature improvements, high availability, and cluster capabilities are not. When choosing commercial software, read the vendor's support documentation carefully to confirm it supports the Hadoop tools you need.

Original link: http://www.searchdatabase.com.cn/showcontent_82265.htm
