Comparison between Hadoop and RDBMS, HPC, and volunteer computing


Hadoop provides a reliable shared storage and analysis system: storage is provided by HDFS, and analysis by MapReduce. Hadoop has other components, but HDFS and MapReduce form its core.

Comparison with other systems

MapReduce may look like a brute-force approach: every query processes the entire dataset, or at least a large part of it. But that is precisely its strength. MapReduce is a batch query processor, and its ability to run an ad hoc query against the whole dataset and get the results in a reasonable time is transformative. It changes the way we think about data and unlocks data that was previously archived on tape or disk. It gives us the opportunity to innovate with data. Questions that once took too long to answer can now be answered.

1. Relational Database Management System

Why can't we use databases with lots of disks to do large-scale batch analysis? Why do we need MapReduce?

The answer comes from another trend in disk drives: seek time is improving far more slowly than transfer rate. Seeking is the process of moving the disk's head to a particular place to read or write data; it characterizes the latency of a disk operation, whereas the transfer rate corresponds to the disk's bandwidth. If the data access pattern is dominated by seeks, it inevitably takes longer to read or write a large portion of the dataset than it does to stream through it. On the other hand, for updating a small fraction of the records in a database, a traditional B-tree (the data structure used in relational databases, which is limited by the rate at which it can perform seeks) works well. For updating the majority of a database, however, a B-tree is less efficient than MapReduce, because it has to sort and merge to rebuild the database.
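
A rough back-of-the-envelope calculation illustrates the gap. The figures below (a 100 MB/s transfer rate, a 10 ms average seek, a 1 TB dataset of 100-byte records) are illustrative assumptions, not measurements from the text, chosen only to show the order of magnitude involved.

```java
// Rough comparison of streaming through a dataset versus seeking to each record.
// All figures are assumed, illustrative values.
public class SeekVsStream {
    public static void main(String[] args) {
        double transferRateBytesPerSec = 100e6; // assumed: 100 MB/s sequential transfer
        double seekTimeSec = 0.010;             // assumed: 10 ms average seek
        double datasetBytes = 1e12;             // assumed: 1 TB dataset
        double recordBytes = 100;               // assumed: 100-byte records

        // Streaming cost is governed by bandwidth alone.
        double streamSeconds = datasetBytes / transferRateBytesPerSec;

        // Visiting every record with an individual seek is dominated by seek time.
        double records = datasetBytes / recordBytes;
        double seekSeconds = records * seekTimeSec;

        System.out.printf("Streaming the whole dataset: %.1f hours%n", streamSeconds / 3600);
        System.out.printf("Seeking to every record:     %.1f days%n", seekSeconds / 86400);
    }
}
```

With these assumed numbers, streaming through the whole dataset takes under three hours, while seeking to every record would cost more than a thousand days of seek time alone.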

In many cases, MapReduce can be seen as a complement to an RDBMS (relational database management system); the differences between the two systems are shown in Table 1-1. MapReduce is a good fit for problems that need to analyze the whole dataset in a batch fashion, particularly for ad hoc analysis. An RDBMS is good for point queries and updates, where the dataset has been indexed to deliver low-latency retrieval and update of a relatively small amount of data. MapReduce suits applications where the data is written once and read many times, whereas a relational database is better for datasets that are continually updated.

Table 1-1. A traditional RDBMS compared with MapReduce

Another difference between MapReduce and a relational database is the amount of structure in the datasets they operate on. Structured data is organized into entities with a defined format, such as XML documents or database tables that conform to a particular, predefined schema. This is the realm of the RDBMS. Semi-structured data, on the other hand, is looser: although there may be a schema, it is often ignored, so it can only serve as a guide to the structure of the data. For example, a spreadsheet is structured as a grid of cells, yet each cell may hold data of any form. Unstructured data has no particular internal structure at all, such as plain text or image data. MapReduce works well on unstructured or semi-structured data because it is designed to interpret the data at processing time. In other words, the input keys and values for MapReduce are not intrinsic properties of the data; they are chosen by the person analyzing the data.

Relational data is often normalized to retain its integrity and remove redundancy. Normalization poses problems for MapReduce because it makes reading a record a non-local operation (that is, the data you need has to be fetched from other hosts), and one of the central assumptions of MapReduce is that it can perform (high-speed) streaming reads and writes. A web server log is a good example of a set of records that is not normalized (for example, the client hostname is spelled out in full each time, even though the same client may appear many times), and this is one reason why log files of all kinds are particularly well suited to analysis with MapReduce.
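
To illustrate the last two points (semi-structured log data, and keys chosen by the analyst rather than inherent in the data), here is a minimal sketch of a mapper written against Hadoop's Java MapReduce API. The class name and the parsing rule (the first whitespace-separated field is taken to be the client hostname) are assumptions for illustration only.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits (client hostname, 1) for each log line. The key is not an intrinsic
// property of the data; it is whatever field the analyst chooses to extract.
public class HostCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text host = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Assumed log layout: the first whitespace-separated field is the
        // client hostname; adjust the parsing to the actual log format.
        String[] fields = line.toString().split("\\s+");
        if (fields.length > 0 && !fields[0].isEmpty()) {
            host.set(fields[0]);
            context.write(host, ONE);
        }
    }
}
```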

MapReduce is a linearly scalable programming model. The programmer writes two functions, a map function and a reduce function, each of which defines a mapping from one set of key/value pairs to another. These functions are oblivious to the size of the data or of the cluster they are running on, so they can be applied unchanged to a small dataset or a massive one. More importantly, if you double the size of the input data, a job runs twice as slowly; but if you also double the size of the cluster, the job runs just as fast as the original one. This is not generally true of SQL queries.
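
Continuing the hypothetical host-count example above, a matching reducer and driver might look like the following sketch (again using Hadoop's Java MapReduce API; the names are illustrative). Note that neither function mentions the size of the data or of the cluster, which is what lets the same program run on a small sample or a cluster-sized dataset.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class HostCount {

    // Sums the counts emitted by HostCountMapper (from the previous sketch)
    // for each hostname.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text host, Iterable<IntWritable> counts,
                Context context) throws IOException, InterruptedException {
            int total = 0;
            for (IntWritable c : counts) {
                total += c.get();
            }
            context.write(host, new IntWritable(total));
        }
    }

    public static void main(String[] args) throws Exception {
        // The same two functions run unchanged whether the input is a few
        // megabytes on one machine or terabytes across a large cluster.
        Job job = Job.getInstance(new Configuration(), "host count");
        job.setJarByClass(HostCount.class);
        job.setMapperClass(HostCountMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Packaged into a JAR, such a job would typically be submitted with something like hadoop jar hostcount.jar HostCount input output, where the input and output paths are placeholders.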

Over time, the differences between relational databases and MapReduce are likely to blur. Relational databases have started to incorporate some of the ideas of MapReduce (for example, the Aster Data and Greenplum databases), and, from the other direction, higher-level query languages built on MapReduce (such as Pig and Hive) are making MapReduce systems more approachable for traditional database programmers.

2. Grid Computing

The High Performance Computing (HPC) and grid computing communities have been doing large-scale data processing for years, using APIs such as the Message Passing Interface (MPI). Broadly, the HPC approach is to distribute the work across a cluster of machines that access a shared filesystem hosted by a storage area network (SAN). This works well for predominantly compute-intensive jobs, but it becomes a problem when a node needs to access a large volume of data (hundreds of gigabytes, the point at which MapReduce really starts to shine), because the network bandwidth becomes the bottleneck and compute nodes sit idle.

MapReduce tries to co-locate the data with the compute node, so data access is fast because it is local. This feature, known as data locality, is at the core of MapReduce and is one of the reasons for its good performance. Recognizing that network bandwidth is the most precious resource in a data center environment (it is easy to saturate the network by copying data around), MapReduce goes to great lengths to conserve it by explicitly modeling the network topology. Note that this arrangement does not preclude high-CPU analyses in MapReduce.

MPI gives programmers a great deal of control, but it also requires them to explicitly handle the mechanics of the data flow, using low-level constructs such as sockets (traditionally from C), as well as the higher-level algorithms for the analysis. MapReduce, by contrast, operates at a higher level: the programmer thinks in terms of functions over key/value pairs, and the data flow is implicit.

Coordinating the processes in a large-scale distributed computation is a challenge. The hardest part is gracefully handling partial failure, when you do not know whether or not a remote process has failed, while still making progress on the overall computation. MapReduce spares the programmer from having to think about failure: it detects failed map or reduce tasks and reschedules them on healthy machines. MapReduce can do this because it is a shared-nothing architecture, meaning that tasks have no dependence on one another. (This is a slight oversimplification, since the output of mappers is fed to reducers, but this is handled by the MapReduce system; here, more care is needed when rerunning a failed reducer than a failed map, because the reducer must be able to retrieve the necessary map outputs, and if they are not available, the relevant maps must be rerun to regenerate them.) So from the programmer's point of view, the order in which tasks run does not matter. By contrast, MPI programs have to explicitly manage their own checkpointing and recovery, which gives the programmer more control but makes the program harder to write.

MapReduce may sound like quite a restrictive programming model, and in a sense it is: you are limited to key/value pair types (which are tied together in specified ways), and mappers and reducers cooperate only in limited ways, with tasks running one after another (mappers pass key/value pairs to reducers). A natural question is: can you do anything useful or general with it?

The answer is yes. MapReduce was developed by engineers at Google as a system for building production search indexes, because they found themselves solving the same problems over and over again (MapReduce was inspired by older ideas from the functional programming, distributed computing, and database communities), but it has since been applied to many other problems in many other industries. It has been a pleasant surprise to see how many algorithms can be expressed in MapReduce, from image analysis to graph-based problems to machine learning algorithms. It cannot solve every problem, of course, but it is a very general data-processing tool.

3. Volunteer Computing

When people first hear about Hadoop and MapReduce, they often ask, "How is it different from SETI@home?" SETI, short for Search for Extra-Terrestrial Intelligence, runs a project called SETI@home (http://setiathome.berkeley.edu) in which volunteers donate the idle time of their computers' CPUs to analyze radio telescope data in the search for signals of intelligent life beyond Earth. SETI@home is the best known of many volunteer computing projects; others include the Great Internet Mersenne Prime Search (which searches for large prime numbers) and Folding@home (which studies protein folding and its relationship to disease).

Volunteer computing projects work by breaking the problem they are trying to solve into chunks called work units, which are sent to computers around the world to be analyzed. For example, a SETI@home work unit is about 0.35 MB of radio telescope data and takes a typical computer several hours or days to analyze. When the analysis is complete, the results are sent back to the server and the client gets another work unit. As a precaution against cheating, each work unit is sent to three different machines, and at least two results must agree to be accepted.

Although SETI@home may look superficially like MapReduce (breaking a problem into independent chunks that are processed in parallel), the differences are significant. The SETI@home problem is highly CPU-intensive, which makes it suitable for running on thousands of computers around the world, because the time to transfer a work unit is negligible compared with the time spent computing on it. Volunteers are donating CPU cycles, not bandwidth.
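
A quick, hypothetical calculation makes the "CPU cycles, not bandwidth" point concrete. The 0.35 MB work unit size comes from the text above; the connection speed and the compute time per unit are assumptions chosen only for illustration.

```java
// Why volunteer computing tolerates slow, untrusted networks: the transfer time
// for a work unit is tiny compared with the time spent computing on it.
public class WorkUnitCost {
    public static void main(String[] args) {
        double workUnitBytes = 0.35e6;  // from the text: ~0.35 MB per work unit
        double linkBitsPerSec = 1e6;    // assumed: a 1 Mbit/s home connection
        double computeHours = 10;       // assumed: hours of CPU time per unit

        double transferSeconds = workUnitBytes * 8 / linkBitsPerSec;
        double ratio = (computeHours * 3600) / transferSeconds;

        System.out.printf("Transfer: %.1f s, compute: %.0f h (about %.0fx longer)%n",
                transferSeconds, computeHours, ratio);
    }
}
```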

MapReduce, by contrast, is designed to run jobs that last minutes or hours on trusted, dedicated hardware in a high-bandwidth data center. The SETI@home project runs on untrusted machines connected over the Internet, with widely varying connection speeds and no local storage of the data.
