I. Preface
Big data technology has been evolving for more than a decade since its birth. For much of that time, vendors and institutions have been evangelizing to financial practitioners about big data's bright prospects and future trends. As users have gained a deeper understanding of big data concepts and technologies, the focus has shifted from theoretical exploration to the search for practical scenarios, so that big data can land and flourish in the enterprise.
The management and application of big data focus on two areas. The first is big data analytics: mining, complex analysis, and computation over massive data sets. The second is online data operations: traditional transactional workloads plus real-time, highly concurrent queries against massive data. Users choose different big data management approaches based on their business scenarios and their expectations of the processing results.
Analytical big data management is based on Hadoop/Spark technology and suits batch analysis and mining scenarios. Over time, however, the open-source ecosystem around Hadoop has expanded so fast and grown so large that its tooling is hard to master, its complexity high, and its price/performance hard to control. Gartner, a leading market analysis and consulting firm, noted in its 2017 report (Hype Cycle for Data Management, 2017) that big data services can no longer rely on a single Hadoop platform, but must start from the user's scenarios and use cases.
Distributed databases represent online, operational big data management, emphasizing interactive business scenarios under real-time, highly concurrent request loads. Big data applications in this area are being embraced by more and more people, and because distributed databases are simpler, with DevOps practices closer to those of traditional data management systems, the distributed database market has grown rapidly in recent years.
II. Technical System Comparison
Among the big data implementations above, Hadoop appears to be one complete system. Why do Hadoop/Spark and distributed databases differ in design philosophy, and how should their positioning and usage scenarios be distinguished? This requires analyzing the origin and development of the two technologies.
1. Big Data analytics
The big data analytics stack is built on the Hadoop ecosystem, with Spark becoming one of its main components in recent years. Strictly speaking, Hadoop is a distributed file system plus scheduler (HDFS + YARN), not a database.
Hadoop's history traces back roughly a decade, to when Google published MapReduce, the theoretical foundation from which Hadoop was born, designed to process large data sets across tens of thousands of PC servers with high-performance parallel processing.
Given the background of Hadoop's birth, the main problem it solves is batch computation over unstructured data (such as computing PageRank) on very large clusters. In a Hadoop architecture, a distributed task may be a join, a sort, or an aggregation over structured data, much like a traditional database operation, or user-defined program logic over unstructured data.
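The MapReduce programming model that underpins Hadoop can be illustrated with a minimal single-process sketch in Python. This shows only the map/reduce pattern itself (word counting as the classic example), not Hadoop's distributed implementation:

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield word, 1

def reduce_phase(pairs):
    """Reduce: sum the emitted counts for each distinct key (word)."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["big data big cluster", "data batch"]
result = reduce_phase(map_phase(docs))
```

In real Hadoop, the map tasks run on the servers holding each data block and the framework shuffles the intermediate pairs to the reducers; the user supplies only the two functions above.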
Consider Hadoop's development path. Early Hadoop was represented by three development interfaces: Pig, Hive, and MapReduce, targeting script-style batch processing, SQL batch processing, and user-defined logic respectively. Spark developed in much the same way: the earliest Spark RDD API had almost no SQL capability, and Shark was later adapted from Hive to provide partial SQL support. However, as enterprise adoption of Hadoop grew, SQL gradually became one of the main ways traditional industries access big data platforms. Hortonworks' Stinger, Cloudera's Impala, Databricks' Spark SQL, and IBM's Big SQL all began competing for the market a couple of years ago, making Hadoop look like a battleground for SQL.
2. Distributed database
Distributed databases have a long history, from online transactional systems represented by Oracle RAC to statistical and analytical systems such as IBM DB2 DPF, covering nearly all OLTP and OLAP data application scenarios.
Most distributed database features focus on structured computation and online create/read/update/delete operations. With IBM DB2 DPF, for example, users can work with the DPF edition almost transparently, just as with an ordinary single-node DB2 database: the SQL optimizer in DPF automatically decomposes a query and distributes it to multiple nodes for parallel execution.
However, these traditional distributed databases mainly targeted data warehousing and analytical (OLAP) systems. Their limitation is that the underlying relational storage structures are not efficient enough to meet the demands of highly concurrent queries against big data, or of large-scale data processing and analysis.
As a result, distributed databases have undergone a major transformation in recent years, moving from a single data model to multi-model designs that integrate OLTP and online high-concurrency queries and support big data processing and analysis, no longer treating OLAP alone as the design goal. At the same time, distributed databases have branched into key/value, document, wide-table, graph, and other access modes, supporting access methods beyond SQL and greatly enriching what was traditionally a single-purpose system. In general, the main goal of a multi-model database is to meet operational requirements with high performance demands, plus targeted data warehousing capabilities, rather than data mining scenarios such as deep learning over big data.
3. Business Scenarios
In terms of usage, big data technologies can be divided along two axes: by data type, structured versus unstructured; and by business type, statistical analysis versus online operations (Figure 1).
Figure 1: Big data business types
Hadoop was designed to solve statistical analysis at ultra-large data scale, while distributed databases, depending on the sub-field, can serve both statistical analysis over structured data and online operations over massive data.
The biggest difference between Hadoop and distributed databases lies in the granularity at which they control data. Hadoop tends to operate on data in bulk, such as statistical analysis over the full data set, whereas distributed databases emphasize precise control over individual rows, such as querying or updating a single record. It follows that Hadoop suits low-concurrency, high-throughput, offline-oriented data analysis, while distributed databases suit highly concurrent, online, real-time data operations. These differences also hold when processing unstructured data.
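The granularity difference can be sketched with two toy access patterns in Python: a bulk scan that aggregates the whole data set (the Hadoop-style workload) versus a keyed point lookup of a single record (the database-style workload). The in-memory dictionary is a deliberate stand-in, not a real Hadoop or database API:

```python
# Toy data set: account records keyed by account id (stand-in for real storage).
records = {
    "acct-1": {"balance": 120.0, "branch": "north"},
    "acct-2": {"balance": 80.0, "branch": "south"},
    "acct-3": {"balance": 200.0, "branch": "north"},
}

def full_scan_total(rows):
    """Hadoop-style: touch every record to compute one aggregate statistic."""
    return sum(r["balance"] for r in rows.values())

def point_lookup(rows, key):
    """Database-style: fetch exactly one record by its key."""
    return rows.get(key)

total = full_scan_total(records)       # whole-data-set operation
one = point_lookup(records, "acct-2")  # single-record operation
```

The full scan's cost grows with the data set, which is acceptable offline but not under thousands of concurrent online requests; the point lookup touches one record, which is exactly what a distributed database's indexes are built to serve.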
III. Industry Development Trends
Both Hadoop and distributed databases have evolved toward separating the compute and storage tiers. The trend is obvious in Hadoop, where separating HDFS storage from YARN scheduling allows compute and storage to scale out independently on demand. Distributed databases have followed a similar trend in recent years: many have decoupled the underlying storage from the upper-level SQL engine, for example using Spark SQL as the statistical analysis engine and PostgreSQL as the transaction processing engine. This is the technical route taken by many distributed databases in the industry.
Figure 2: Distributed database/Hadoop architecture
Gartner's latest database report, from 2016, shows a new division of database categories in the international industry. Traditional XML databases, object-oriented databases, and pre-RDBMS systems are disappearing. Emerging document databases, graph databases, table-style databases (databases with Cassandra-like table structure definitions but no relationships defined between tables), and multi-model databases are expanding their influence. Traditional relational databases, column-store databases, and in-memory analytical databases (represented by SAP HANA, these use PC servers configured with large amounts of memory as the hardware foundation, caching massive data in memory in exchange for extremely high access efficiency and real-time interactive analysis over massive data, a class of workload also known as the HTAP scenario) are all weighing transformation.
From the perspective of technical completeness and maturity, Hadoop is still in a relatively early form. Even today, some SQL-on-Hadoop projects remain at version 1.x or even in beta, and many enterprise deployments require extensive manual tuning before they can run at all. Hadoop's main application scenarios have always been batch processing and analytical workloads; the online processing handled by traditional databases has never been its main development direction. Moreover, because the Hadoop open-source ecosystem is so large, and so many vendors have modified it, it is difficult for users to become fully familiar with the entire system, which greatly increases development complexity and raises the barrier to use. On top of that, with each vendor maintaining its own versions, products may gradually diverge from the open-source mainline.
Distributed databases, on the other hand, have been honed for decades, and the MPP technology of traditional RDBMSs has long been mature. Among the many categories of distributed databases, the main development directions can be divided into "distributed online databases" and "distributed analytical databases." For example, using multi-model databases for highly concurrent online processing of structured and semi-structured data, and using column-store, table-style, and in-memory analytical databases for batch analysis of structured data, are the most common technical means in these two directions. After another five to ten years of development, the new generation of databases has entered an era of convergence with traditional and other technologies.
The domestic SequoiaDB, as a distributed database, has begun to fully support distributed OLTP and distributed object storage on the basis of its multi-model operational database.
Comparing Hadoop with distributed databases, the development directions of Hadoop products overlap considerably with the analytical side of the distributed database field. For example, Pivotal Greenplum, IBM DB2 BLU, and the domestic NTU GBase 8a all overlap significantly with Hadoop's positioning. For highly concurrent online transactions, however, distributed databases hold an absolute advantage over Hadoop, where only HBase is barely usable.
Figure 3: Application boundaries of distributed databases and Hadoop
At present, judging from the development of the Hadoop industry, vendors such as Cloudera and Hortonworks no longer call themselves Hadoop distributors, repositioning instead as data science and machine learning service providers. The business model of selling Hadoop distributions has thus largely come to an end: users who have experienced the difficulty of maintaining the entire Hadoop platform no longer want to be forced to buy the whole thing. Many users prefer to take individual Hadoop components apart and use them flexibly, paying for scenarios and results rather than for the platform itself.
Another market segment, unstructured small-file storage, has long been a battleground for object storage, block storage, and distributed file systems. Today a new generation of databases is entering this field as well, and it is foreseeable that in the coming years, small unstructured file storage will become one of the battlefields for distributed databases with multi-model data processing capabilities.
IV. Application Scenarios
Different scenarios call for different technologies; no single technology fits all business scenarios.
In the big data era, in a relatively rigorous industry like finance, few enterprises dare to immediately replace their core trading systems with new technology, for historical reasons. In other systems, however, distributed databases can already replace and "slim down" traditional Oracle and IBM databases, and their status in big data applications is also rising.
The extension of the data warehouse is actually a complement to the traditional warehouse model. Warehouse construction has always followed a top-down principle: first establish the data model, then build table structures and SQL according to that model, then perform ETL and data cleansing, and finally produce the corresponding reports. Big data and the new wave of machine learning bring a bottom-up alternative: first build an analytical data lake, ingest the data to be analyzed into the lake for desensitization and standardization, and then use machine learning, deep mining, and other distributed computing techniques to find patterns in the massive data. The biggest difference from the traditional approach is that the analytical model is built on the facts presented by historical data, rather than on a hypothetical data model. Data warehouse extension is the flagship scenario for Hadoop and distributed column stores.
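The desensitization-and-standardization step in that bottom-up pipeline can be sketched with the Python standard library. The field names, the hash-based masking scheme, and z-score scaling are illustrative assumptions, not any specific product's pipeline:

```python
import hashlib
import statistics

raw_rows = [
    {"customer_id": "C1001", "amount": 100.0},
    {"customer_id": "C1002", "amount": 300.0},
    {"customer_id": "C1003", "amount": 200.0},
]

def desensitize(row):
    """Replace the real customer id with an irreversible hash (mask PII)."""
    masked = hashlib.sha256(row["customer_id"].encode()).hexdigest()[:12]
    return {"customer_id": masked, "amount": row["amount"]}

def standardize(rows):
    """Z-score the amount field so downstream models see comparable scales."""
    amounts = [r["amount"] for r in rows]
    mean, stdev = statistics.mean(amounts), statistics.stdev(amounts)
    return [dict(r, amount=(r["amount"] - mean) / stdev) for r in rows]

# Rows entering the analytical data lake: masked ids, normalized amounts.
lake = standardize([desensitize(r) for r in raw_rows])
```

Only after this preparation do the mining and machine-learning jobs run over the lake, so that patterns are discovered from the data itself rather than imposed by an up-front schema.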
For online and real-time data operations, distributed databases are the other major technology type. Using a distributed database as the ODS (operational data store) is one typical application. In larger banks, a traditional ODS usually retains only a short window of historical data as a staging area for data processing, while older history is either archived to a tape library or processed, cleansed, and loaded into the data warehouse. In big data scenarios, however, many businesses are placing clearly higher demands on interactive online access to historical data: for example, whether the branch counter can offer customers queries over their full transaction history, or whether the bank's internal operations team can query business history online to satisfy judicial inquiries. These scenarios demand high concurrency, many index dimensions, and low query latency, for which Hadoop's HBase has many inconveniences; they are the main application scenarios of distributed online databases.
Besides storing historical data, another major direction for ODS extension is to serve as a data mart that stores the analysis and mining results produced in Hadoop for external applications to query. For example, mobile banking can provide real-time product recommendations based on each user's profile labels and current behavior, combining analytical results with real-time behavioral data. This type of application can be further extended to more core business scenarios such as risk control.
Therefore, in the big data era, Hadoop and distributed databases should complement each other within the financial industry's architecture, compensating for each other's shortcomings. Hadoop and distributed analytical databases can both meet the demand for batch analysis of structured data; for unstructured data analysis, Hadoop has advantages that databases cannot match; and distributed online databases can manage and use data more flexibly in highly concurrent online business scenarios.
For example, in recent years many banks have been building "user portrait" capabilities, hoping to tag each user based on historical transaction behavior and then recommend targeted financial products through the counter, online banking, mobile banking, and other channels. A relatively simple and common way to implement this scenario with big data technology is:
(1) Batch-write the user's historical behavior into Hadoop;
(2) Use machine learning in Hadoop to build a user-behavior classification model;
(3) Periodically batch-scan user history in Hadoop and tag users according to the model;
(4) Write the user-tag results into the distributed database;
(5) Each channel application connects to the database through middleware and queries user tags to make product recommendations.
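The five steps above can be sketched end to end in plain Python. The rule-based "model" and the dictionary standing in for the distributed database are deliberate simplifications for illustration, not the APIs of Hadoop or any particular database:

```python
# Step 1: user history batch-loaded (stand-in for the Hadoop landing zone).
history = {
    "u1": [500.0, 700.0, 650.0],  # purchase amounts
    "u2": [20.0, 35.0],
}

# Steps 2-3: a trivial "model" tags users by average spend
# (a real system would train a classifier over far richer features).
def tag_user(purchases):
    avg = sum(purchases) / len(purchases)
    return "high_value" if avg >= 100.0 else "mass_market"

# Step 4: tag results written to the serving store
# (stand-in for the distributed online database).
tag_store = {user: tag_user(p) for user, p in history.items()}

# Step 5: a channel looks up the user's tag and picks a product.
RECOMMENDATIONS = {
    "high_value": "private_banking",
    "mass_market": "money_market_fund",
}

def recommend(user):
    return RECOMMENDATIONS[tag_store[user]]
```

The division of labor matches the article's argument: the heavy, full-scan tagging (steps 1-3) is a batch job suited to Hadoop, while the per-user tag lookup (step 5) is a high-concurrency point query suited to a distributed online database.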
V. Looking to the Future
The future development of big data technology will return to users' real needs: Hadoop/Spark will continue to dominate data analysis, while in real-time online interaction, distributed databases will become another important technical force.
When banks select new technology products, they cannot judge solely by current business requirements; they must also consider the product's development path over the next three to five years and whether it can iterate continuously to meet the enterprise's future needs. It is therefore not enough for users to understand only the current state of each technology; only by recognizing a technology's development strategy and its architectural limitations can one anticipate and gain insight into the future.
Architectural limitations are not the same as missing features. Many new technologies do not initially offer enterprise-grade functionality as complete as Oracle's, but there is no need to wait until every feature is finished before starting to learn and use them. When evaluating a new product or technology, its feature set needs to cover a handful of essential basics, while some advanced features need not be immediately available. For IT decision makers, what matters is assessing the architectural limitations of the product and technology, that is, whether the bank's business needs over the coming period can, in the foreseeable future, be realized and met on top of that architecture.
The core of Hadoop's architecture is HDFS and YARN: every request is first sent to YARN for scheduling. Based on the NameNode's block metadata, YARN determines which servers hold the data a job needs, generates a series of tasks, and dispatches them to the appropriate servers for execution. Unless the entire scheduling algorithm is rewritten from the ground up, the lengthy process imposed by this mechanism keeps Hadoop out of online business.
The core of a database architecture is its data storage structure. Only with a well-defined storage structure can a database provide retrieval, query, update, and other operations on data fields. On the one hand, this mechanism provides effective management of structured and semi-structured data; on the other, it limits the ability to handle unstructured data. In the short term, distributed databases will mainly manage unstructured data in the form of small-file storage and retrieval. For querying the information inside a file, full-text indexing can be used for textual content, but for binary, non-textual unstructured data, distributed databases still have no good way to offer free search and query across all dimensions of the information.
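The full-text indexing approach mentioned for textual file content can be illustrated with a minimal inverted index in Python. This is a sketch of the technique only (whitespace tokenization, no stemming or ranking), not any database's actual index implementation:

```python
from collections import defaultdict

def build_inverted_index(files):
    """Map each word to the set of file names whose content contains it."""
    index = defaultdict(set)
    for name, content in files.items():
        for word in content.lower().split():
            index[word].add(name)
    return index

files = {
    "report.txt": "quarterly risk report",
    "memo.txt": "risk committee memo",
}
index = build_inverted_index(files)
matches = sorted(index["risk"])  # files whose text contains "risk"
```

Note that this only works because the content is text that can be tokenized; for binary data (images, audio, proprietary formats) there is no natural "word" to index, which is precisely the limitation the paragraph above describes.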
From the distributed database perspective, the author believes that over the next three to five years the new generation of databases will gradually evolve toward multi-model designs, providing both SQL and API data access patterns. For example, SequoiaDB supports other data storage formats, including unstructured object storage, while supporting SQL and API access to structured and semi-structured storage. At the same time, distributed relational databases will converge further, offering multi-engine storage schemes (GBase 8a/8t), and some products have even begun to support semi-structured data such as JSON (PostgreSQL).
All in all, within the big data stack, distributed databases and Hadoop are complementary: Hadoop suits unstructured batch analysis scenarios, while distributed databases are better suited to highly concurrent online business scenarios.
Wang Tao, co-founder of SequoiaDB, currently serves as its CTO and chief architect. Under Mr. Wang's leadership, SequoiaDB's technical team built a distributed database from scratch; SequoiaDB now ranks first in domestic market share among comparable products and has been widely recognized by the industry at home and abroad. Since the SequoiaDB database core prototype was successfully developed in North America in 2011, Mr. Wang has devoted himself to research on large-scale distributed databases, promoting the development and application of new-generation databases in the big data industry. He was previously a core developer at IBM's DB2 lab in North America, with more than ten years of experience in database kernel architecture design, database engine development, and enterprise database applications.
Analysis of Distributed Databases under Big Data Requirements