As two of the most popular technologies in information management, database appliances and big data technologies share essentially the same hardware architecture, but their software systems differ fundamentally, and this is what gives the two their different characteristics.
With the rapid growth of enterprise data and rising user expectations for service levels, traditional relational database technology has long shown clear limitations in production practice. How to achieve high availability for massive data at a reasonable cost has become a major challenge in the modern IT field. In response, many new tools have appeared on the IT market in recent years, most notably the database appliances led by mainstream database vendors (Oracle Exadata, IBM Netezza, etc.) and big data technologies driven by the open source community.
However, although database appliances and big data technologies are hot topics today and are already widely used, a considerable number of users still do not understand the essential differences and the relationship between the two, and many are puzzled about how to position them correctly within the enterprise. To that end, this article compares the technical characteristics of database appliances (which can be regarded as the next generation of mainstream relational databases) and big data technologies (such as Hadoop, here mainly MapReduce and NoSQL).
Hardware and software
In essence, database appliances and big data systems use basically the same hardware architecture: clusters of x86 servers working in a distributed, parallel fashion to process large-scale data and computation. Database appliance vendors, however, usually tune the entire system around their product and add proprietary techniques, for example InfiniBand networking and flash cache in Oracle Exadata, or FPGAs (field-programmable gate arrays) in IBM Netezza.
The most important difference between database appliances and big data technologies lies in the software. The core of a database appliance is the SQL system: not just SQL parsing, but a complete and substantial technology stack that includes the SQL optimizer, indexing, locking, transactions, logging, security, and management. This stack is mature and fully productized.
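One piece of that stack, transactions, can be illustrated with a minimal sketch using Python's stdlib sqlite3 module as a stand-in for an enterprise RDBMS (the account schema here is invented for illustration): a failure in the middle of a transfer is rolled back automatically, something a bare storage layer does not give you.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES (1, 100), (2, 0)")
conn.commit()

try:
    with conn:  # the connection context manager wraps a transaction
        conn.execute("UPDATE accounts SET balance = balance - 50 WHERE id = 1")
        conn.execute("UPDATE accounts SET balance = balance + 50 WHERE id = 2")
        raise RuntimeError("simulated failure mid-transfer")
except RuntimeError:
    pass  # the failed transaction was rolled back automatically

balances = dict(conn.execute("SELECT id, balance FROM accounts"))
print(balances)  # both updates undone: {1: 100, 2: 0}
```

This atomicity, together with the optimizer, locking, and logging around it, is what "a complete SQL system" means in practice.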
In the big data software stack, MapReduce provides a distributed programming framework for processing massive data, but users must write the computation logic themselves, and MapReduce reads and writes data sequentially rather than randomly. The other branch of big data technology, NoSQL, mostly provides only distributed storage of massive data plus a fast indexing mechanism, exposed through programming APIs (some systems offer SQL-like languages, but not a complete SQL system).
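The "users write the computation logic themselves" point can be sketched in a single process (function names and the runner are illustrative, not Hadoop's API): the user supplies only map and reduce functions, while the framework handles partitioning, shuffling, and, in a real cluster, parallel execution.

```python
from collections import defaultdict

def map_fn(line):
    # user-supplied logic: emit (key, value) pairs, here one count per word
    for word in line.split():
        yield word.lower(), 1

def reduce_fn(key, values):
    # user-supplied logic: combine all values shuffled to the same key
    return key, sum(values)

def run_mapreduce(records, map_fn, reduce_fn):
    # "shuffle" phase: group intermediate pairs by key
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    # "reduce" phase: sequential here; spread across nodes in Hadoop
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

counts = run_mapreduce(["big data big clusters", "big appliances"],
                       map_fn, reduce_fn)
print(counts)  # {'big': 3, 'data': 1, 'clusters': 1, 'appliances': 1}
```

Note that the framework never needs random access: it streams records through map, then streams grouped values through reduce, which is exactly the sequential read/write pattern described above.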
Because of the complexity of the SQL system and the tight coupling of its processing logic, a database appliance scales out far less than a big data system, even though it greatly relieves the vertical-scaling bottleneck of the traditional relational database. A single MapReduce or NoSQL cluster can be extended to thousands of nodes; extending a database appliance to that scale might be possible in hardware, but it would be meaningless at the software level.
Characteristics and nature
Because their software systems differ, database appliances and big data technologies have different characteristics. A database appliance is suited to storing complex data models (such as an enterprise's core business data), but the data must fit the relational model of two-dimensional tables. It is also suited to computation with strong consistency and transactional requirements, as well as complex BI calculations.
Big data technologies are better suited to simpler data models and can be unconstrained by schema, so the range of data types they can store and manage is richer. They also suit computation without consistency or transactional requirements (mainly NoSQL query operations), as well as large-scale batch distributed parallel computation over massive data (based on MapReduce).
It is worth noting that a NoSQL database can query and insert more efficiently than a database appliance because it is freed from the heavy constraints of the SQL system, and that big data technologies can handle larger data volumes than a database appliance mainly because their clusters can be scaled out further.
In essence, MapReduce is an important innovation in distributed computation over massive data: it excels at large-scale batch problems that parallelize well, but it is not necessarily advantageous for certain complex operations. NoSQL, in turn, can be seen as the result of simplifying the traditional relational database. Its design keeps only the relational database's indexing mechanism, adds distributed storage, and strips out everything in the SQL system that is unnecessary for "certain specific problems", thereby achieving better efficiency, scalability, and flexibility.
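That "index plus distributed storage, nothing else" design idea can be shown with a toy (this is an illustration, not a real NoSQL engine): the whole database collapses to a hash index, and a stable hash decides which node holds each key.

```python
import zlib

class TinyKVStore:
    """A toy key-value store: a hash index plus simple storage, nothing more."""

    def __init__(self):
        self._index = {}  # the index IS the store: key -> value

    def put(self, key, value):
        self._index[key] = value

    def get(self, key, default=None):
        # one hash lookup replaces SQL parsing, planning, and execution
        return self._index.get(key, default)

def node_for(key, n_nodes=4):
    # distributed placement (sketched): a stable hash routes a key to a node
    return zlib.crc32(key.encode()) % n_nodes

store = TinyKVStore()
store.put("user:42", {"name": "alice", "visits": 7})
print(store.get("user:42"))   # {'name': 'alice', 'visits': 7}
print(node_for("user:42"))    # a node id in 0..3
```

Everything a relational engine would add on top of this (joins, transactions, an optimizer) is precisely what NoSQL removed as unnecessary for its target problems.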
From this we can see clearly that, in practice, many problems (especially the popular big data problems) simply do not need much of what relational databases were designed to provide, and this is the fundamental foothold for the rise of NoSQL.
Relationships and collaboration
From the preceding analysis it is not difficult to conclude that big data technologies and database appliances should complement rather than replace each other: they are designed for different application scenarios and collaborate. In particular, big data technologies can:
Directly process the enterprise's large volumes of unstructured and semi-structured data (such as social data, assorted logs, and even pictures and video), with the results used as-is;
Alternatively, feed those processing results into the enterprise data warehouse as new input, in which case the big data platform acts over big data sources as a new means of ETL (extract-transform-load);
Store or compute over data for which SQL is not appropriate.
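The ETL role in the second point can be sketched as follows (the log format and field names are invented for illustration): raw semi-structured log lines are parsed and aggregated, then emitted as structured rows that a warehouse bulk loader could ingest.

```python
import csv
import io
from collections import Counter

# Extract: raw semi-structured access-log lines (format invented here)
raw_logs = [
    "2024-05-01T10:00:03 GET /home 200",
    "2024-05-01T10:00:07 GET /cart 500",
    "2024-05-01T10:00:09 GET /home 200",
]

# Transform: parse each line and aggregate hits per (path, status)
hits = Counter()
for line in raw_logs:
    _ts, _method, path, status = line.split()
    hits[(path, int(status))] += 1

# Load: emit a structured file the warehouse could bulk-load
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["path", "status", "hits"])
for (path, status), n in sorted(hits.items()):
    writer.writerow([path, status, n])
print(out.getvalue())
```

In a real deployment the transform step would be a MapReduce job over terabytes of logs; the pattern of raw data in, warehouse-ready rows out, is the same.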
Database appliances, meanwhile, should remain the mainstream technology of the enterprise data warehouse, at least for a long time to come, responsible for storing and computing over the most important and valuable business-critical data.
A common misunderstanding
Some believe that although big data technology in its original open source state is not fit to serve as the main platform of an enterprise data warehouse, it could become so after further development and supplementation. The view is not wrong in itself. In practice, however, such development means adding back the very things, belonging to the relational database system, that big data technology deliberately removed in its original design.
An enterprise pursuing this kind of supplementary development faces not only a huge and hard-to-estimate workload, but also the difficulty of matching professional database vendors in theory, productization, and systematization. From a purely technical point of view anything can be built; but if an enterprise really set out to do this, would it not simply be developing another commercial relational database? That clearly runs against the design intent of big data technology.