A detailed comparison of HPCC and Hadoop


Hardware environment

Clustered systems are typically built from blade servers with commodity Intel or AMD CPUs, chosen to keep hardware costs low. Each node has local memory and disk, and nodes are connected through high-speed switches (usually Gigabit Ethernet); clusters with many nodes can use hierarchical switching. The nodes in a cluster are normally peers (all nodes have the same configuration), but this is not strictly required.

Operating system

Linux or Windows

System Configuration

There are two configurations for implementing HPCC clusters: a Thor cluster, a data processing (MapReduce-style) cluster comparable to Hadoop's; and a Roxie cluster, a rapid data delivery engine that provides separate high-performance online query processing and data warehouse capabilities. Both configurations include a distributed file system, but each is implemented differently to meet its own performance goals. An HPCC environment usually consists of multiple clusters of both configuration types. Although the file system on each cluster is independent, a cluster can access files stored in the file systems of other clusters in the same environment.

The Hadoop system software implements a cluster using the MapReduce processing paradigm. Such a cluster also acts as a distributed file system running HDFS. Additional capabilities, such as HBase and Hive, are layered on top of Hadoop MapReduce and the HDFS file system software.

Licensing and maintenance fees

HPCC: Community Edition is free. Enterprise license fees currently depend on the cluster size and the type of system configuration.

Hadoop: Free, but there are multiple vendors offering different paid maintenance services.

Core software

HPCC: For a Thor configuration, the core software includes the operating system and the services installed on each node of the cluster to provide job execution and distributed file system access. A separate Dali server provides file system name services and manages the workunits for jobs in the HPCC environment. A Thor cluster is configured with a single master node and multiple slave nodes. A Roxie cluster is a peer-coupled cluster in which each node runs server and agent tasks for query execution and for key and file processing. The Roxie cluster's file system uses distributed B+ tree index files to store index and data and to provide keyed access to the data. Additional middleware components are required for the operation of Thor and Roxie clusters.

Hadoop: The core software includes the operating system, the Hadoop MapReduce cluster software, and HDFS. Each slave node runs a TaskTracker service and a DataNode service. The master node runs a JobTracker service, which can be configured on a separate hardware node or on one of the slave nodes. Similarly, HDFS requires a master NameNode service to provide name services, which can also run on a separate node or on one of the slave nodes.

Middleware

HPCC: The middleware includes an ECL code repository implemented on a MySQL server; an ECL server that compiles ECL programs and queries; an ECL agent that acts as the client program managing job execution on a Thor cluster; and an ESP (Enterprise Services Platform) server that provides authentication, logging, security, and other services for job execution and the web services environment. The Dali server acts as the system data store for job workunit information and provides name services for the distributed file systems. The middleware can run flexibly on anywhere from one to several nodes, and multiple instances of these servers can provide redundancy and improve performance.

Hadoop: Essentially no middleware. Client software submits jobs directly to the cluster's master JobTracker. A Hadoop Workflow Scheduler (HWS), which will run as a server to manage jobs that require multiple MapReduce sequences, is currently under development.

System tools

HPCC includes a suite of client and operations tools for managing, maintaining, and monitoring HPCC configurations and environments. The suite includes the ECL IDE (the program development environment), an attribute migration tool, the Distributed File Utility (DFU), an environment configuration utility, and a Roxie configuration utility; command-line versions are also available. ECLWatch is a web-based utility for monitoring the HPCC environment and includes queue management, distributed file system management, job monitoring, and system performance monitoring tools. Additional tools are provided through a web services interface.

Hadoop: The dfsadmin tool reports status information about the file system; fsck is an application that checks the health of files in HDFS; DataNode block scanners periodically verify all of the blocks stored on a DataNode; and the balancer redistributes blocks from over-utilized DataNodes to under-utilized ones. MapReduce's web user interface includes a JobTracker page that displays information about running and completed jobs; drilling down into a specific job shows detailed information about that job, including pages for its Map and Reduce tasks.
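
The same kind of capacity summary that dfsadmin reports can also be retrieved programmatically. The sketch below is an assumed example (not from the original article) that uses the standard Hadoop FileSystem API; it expects the cluster configuration to be available on the classpath.

```java
// Assumed example: print aggregate HDFS capacity, similar in spirit to the
// summary section of a dfsadmin report.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FsStatus;

public class HdfsStatusReport {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath.
        FileSystem fs = FileSystem.get(new Configuration());
        FsStatus status = fs.getStatus();   // aggregate file system capacity
        System.out.println("capacity  = " + status.getCapacity() + " bytes");
        System.out.println("used      = " + status.getUsed() + " bytes");
        System.out.println("remaining = " + status.getRemaining() + " bytes");
        fs.close();
    }
}
```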

Ease of deployment

HPCC: Deployed with the environment configuration tool. A source server with a centralized repository distributes operating-system-level settings, services, and binaries to all nodes in the configuration, which can be network booted.

Hadoop: Requires assistance from third-party application wizards or online tools, or the RPMs must be deployed manually.

Distributed file system

HPCC: Thor's distributed file system is record-oriented and uses the local Linux file system to store part files. Files are loaded (sprayed) across the nodes, and each node holds a single part file, which may be empty, for each distributed file. Files are split on even record/document boundaries specified by the user. A master/slave architecture is used, with name services and file mapping information stored on a separate server. Only a single local file per node is required to represent a distributed file. Read/write access is supported between multiple clusters configured in the same environment. Using special adapters, files in external databases such as MySQL can be accessed, allowing transactional data to be combined with distributed file data and incorporated into batch jobs. The Roxie distributed file system uses distributed B+ tree index files that contain key information and data stored in local files on each node.

Hadoop: Block-oriented; most installations use blocks of 64 MB or 128 MB. Blocks are stored as independent units / local files in the node's native Unix/Linux file system, and the metadata for each block is stored in a separate file. A master/slave architecture is used, with a single NameNode providing name services and block mapping and multiple DataNodes storing the data. Files are divided into blocks and distributed across all nodes in the cluster. Multiple local files per node (one holding the block data and one holding its metadata) are required for each logical block stored on a node to represent a distributed file.
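
To make the block-oriented layout concrete, the following sketch (an assumed example, not part of the original comparison) uses the Hadoop FileSystem API to list the blocks of a file and the DataNodes that hold each replica; the NameNode address and file path are placeholders.

```java
// Assumed example: show how HDFS splits a file into blocks and where the
// replicas of each block live.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockReport {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020");   // placeholder NameNode address

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/example.dat");           // placeholder file

        FileStatus status = fs.getFileStatus(file);
        System.out.println("Block size: " + status.getBlockSize() + " bytes");

        // One BlockLocation per logical block; each block is replicated on several DataNodes.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}
```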

Fault tolerance

HPCC: The distributed file systems of Thor and Roxie (configurably) keep replicas of file parts on other nodes to protect against disk or node failures. The Thor system provides either automatic or manual failover and warm restart after a node failure, and jobs restart or continue from the most recent checkpoint. Replicas are created automatically when data is copied to a replacement node. The Roxie system continues to run with a reduced number of nodes when a node fails.

Hadoop: HDFS (configurably) stores multiple (user-specified) replicas of each block on other nodes to protect against disk or node failures, with automatic recovery. The MapReduce architecture includes speculative execution: when a slow or failed Map task is detected, additional Map tasks are started on other nodes to recover the work from the failed node.

Job execution environment

HPCC: Thor uses a master/slave processing architecture. The processing steps defined in an ECL job can specify local operations (data processed separately on each node) or global operations (data processed across all nodes). To optimize execution, multiple processing steps of a compiled ECL dataflow program can be executed automatically as part of a single job. If each node's CPU and memory resources are sufficient, a single Thor cluster can be configured to run multiple jobs in parallel to reduce latency. The middleware, including the ECL agent, ECL server, and Dali server, provides the client interface and manages the execution of jobs, which are packaged as workunits. Roxie uses a server/agent structure to process ECL programs: server tasks act as managers for each query, and multiple agent tasks retrieve and process the data for that query as needed.

Hadoop: Uses the MapReduce paradigm, in which input data is processed as key-value pairs, with a master/slave architecture. A JobTracker runs on the master node and a TaskTracker runs on each slave node. Input splits of the input file, usually one block per split, are assigned to Map tasks; the number of Reduce tasks is specified by the user. Map processing is performed locally on the node to which a split is assigned. Shuffle and sort operations follow the Map phase; they distribute and sort the key-value pairs to the Reduce tasks so that pairs with the same key are processed by the same Reduce task. For most procedures, multiple MapReduce processing steps are usually required, and these must be sequenced and chained together by the user or by a separate language such as Pig.
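
A minimal word-count sketch using the standard Hadoop MapReduce Java API (an illustrative example, not taken from the article) shows the key-value flow just described: the Map task emits (word, 1) pairs from its local input split, and after shuffle and sort each Reduce task sums the values for one key.

```java
// Assumed example: word-count Mapper and Reducer illustrating the
// Map -> shuffle/sort -> Reduce flow.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map: runs locally on the node holding the input split; emits (word, 1).
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce: after shuffle and sort, all values for the same key arrive at
    // the same Reduce task, which sums them.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}
```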

Programming language

HPCC: ECL is the primary programming language for the HPCC environment. ECL is compiled into optimized C++, which is then compiled into DLLs executable on the Thor and Roxie platforms. ECL can include inline C++ code encapsulated in functions. External services can be written in any language and compiled into shared libraries of functions callable from ECL. A pipeline interface allows external programs written in any language to be incorporated into a job.

Hadoop: MapReduce jobs are usually written in Java. Other languages are supported through streaming or pipes. Other processing environments implemented on top of Hadoop MapReduce, such as HBase and Hive, have their own language interfaces. The Pig Latin language and the Pig execution environment provide a high-level dataflow language that is then mapped onto multiple Java MapReduce jobs.
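
A driver program, sketched below as an assumed companion to the word-count classes above (not from the article), configures the job in Java and submits it to the cluster; the input and output paths are placeholders.

```java
// Assumed example: driver that wires the Mapper and Reducer into a job and
// submits it for execution.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setReducerClass(WordCount.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/data/input"));    // placeholder
        FileOutputFormat.setOutputPath(job, new Path("/data/output")); // placeholder
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```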

Integrated development environment

The HPCC platform provides the ECL IDE, a comprehensive integrated development environment specifically for the ECL language. The ECL IDE provides access to the shared source code repository and a complete development and testing environment for building ECL dataflow programs. Access to the ECLWatch tool is built in, allowing developers to watch jobs as they execute. Access to current and historical job workunits is also provided, making it easy for developers to compare results from one job to the next during the development cycle.

Hadoop MapReduce uses the Java programming language, and there are several excellent Java development environments, including NetBeans and Eclipse, that offer plug-ins for accessing Hadoop clusters. The Pig environment does not have its own integrated development environment, but Eclipse and other editors can be used for syntax checking. The PigPen plug-in for Eclipse provides access to a Hadoop cluster so that Pig programs can be run on the cluster, along with other development capabilities.

Database function

The HPCC platform includes the ability to build multi-key, multi-field (i.e., compound) indexes on distributed file system files. These indexes can be used to improve performance and provide keyed access for batch jobs on Thor systems, or to support the development of queries deployed on Roxie systems. Keyed access to data is supported directly in the ECL language.

Basic Hadoop MapReduce provides no keyed access to indexed databases. The HBase add-on system for Hadoop provides column-oriented database capability with keyed access; a custom scripting language and a Java interface are provided. HBase access is not directly supported by the Pig environment and requires user-defined functions or separate MapReduce procedures.
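
As an illustration of the Java interface mentioned above (an assumed example, not taken from the article), the sketch below writes and reads a single cell through the HBase client API; the table, column family, and row key are placeholder names, and an hbase-site.xml configuration is assumed to be on the classpath.

```java
// Assumed example: keyed write and read of one cell via the HBase Java client.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) { // placeholder table

            // Write one cell: row "user1", column family "info", qualifier "name".
            Put put = new Put(Bytes.toBytes("user1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);

            // Read it back by row key.
            Result result = table.get(new Get(Bytes.toBytes("user1")));
            byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(value));
        }
    }
}
```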

Online query and data warehouse functions

The Roxie system configuration on the HPCC platform is specifically designed to provide the functionality of a data warehouse for structured query and data analytics applications. Roxie is a high-performance platform that supports thousands of users and offers application-dependent sub-second response times.

The basic Hadoop MapReduce system does not provide data warehouse capabilities. An add-on system for Hadoop, Hive, provides data warehouse capabilities: it allows tables to be defined over HDFS files and then accessed with an SQL-like language. The Pig environment does not directly support access to Hive; this requires user-defined functions or separate MapReduce procedures.
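
For illustration, the sketch below (an assumed example, not from the article) queries a Hive table over JDBC via HiveServer2; the connection URL, credentials, and table name are placeholders, and the Hive JDBC driver is assumed to be on the classpath.

```java
// Assumed example: running a SQL-like query against a Hive table over JDBC.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://hiveserver:10000/default", "user", ""); // placeholder host/user
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT page, COUNT(*) FROM weblogs GROUP BY page")) { // placeholder table
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}
```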

Scalability

HPCC: One to thousands of nodes. In practice, an HPCC configuration typically requires fewer nodes to deliver the same processing performance as a Hadoop cluster; cluster size, however, may also be driven by the overall storage requirements of the distributed file system.

Hadoop: one to thousands of nodes.

Performance

The HPCC platform has demonstrated sorting 1 TB of data in 102 seconds on a high-performance 400-node system. In a recent head-to-head benchmark against Hadoop on another 400-node system, HPCC's time was 6 minutes 27 seconds while Hadoop's was 25 minutes 28 seconds; on the same hardware configuration, HPCC was 3.95 times faster than Hadoop in this benchmark.

Hadoop: Currently the only standard performance benchmark available is the sort benchmark sponsored by http://sortbenchmark.org. Yahoo has demonstrated sorting 1 TB of data on 1,460 nodes in 62 seconds, 100 TB on 3,450 nodes in 173 minutes, and 1 PB on 3,658 nodes in 975 minutes.

Original link: http://hpccsystems.com/Why-HPCC/HPCC-vs-Hadoop/HPCC-vs-Hadoop-Detail
