On November 22–23, 2013, the 2013 Hadoop China Technology Summit (Chinese Hadoop Summit 2013), the only large-scale industry event dedicated to sharing Hadoop technology and applications, was held at the Four Points by Sheraton hotel in Beijing. Nearly a thousand CIOs, CTOs, architects, IT managers, consultants, engineers, and Hadoop enthusiasts, along with IT vendors and technologists engaged in Hadoop research and promotion, attended from a range of industries at home and abroad.
At the SQL & NoSQL session, Wang Tao, CTO of SequoiaDB, gave a talk on the application of document databases in Hadoop clusters, describing the role of NoSQL in the big data era in detail from four aspects: a review of big data, the features of document databases, the position of the NoSQL database in Hadoop, and a user case.
Wang Tao began by noting that when it comes to big data, the first thing that comes to mind is the 3Vs (Volume, Velocity, Variety). Volume refers to massive data scale: according to the statistics he cited, more than 50% of organizations own and are processing more than 10 TB of data, and more than 10% of those have surpassed 1 PB; this is the first challenge of big data. Velocity refers to high timeliness: 30% of organizations need to process more than 100 GB of data per day, and extracting the data we want from such volumes is the second challenge. Variety refers to diversity: the data to be processed comes in many forms, such as graphics, video, and call logs, all of which may need to be processed and analyzed; handling this diverse data is the third challenge of big data.
To truly solve big data problems, a Hadoop + NoSQL combination can be used. As the figure below shows, Hadoop addresses massive and diverse data, while NoSQL addresses massive and time-sensitive data. As Wang Tao put it, Hadoop and NoSQL complement each other rather than replace each other.
▲ Hadoop and NoSQL: the "nuclear weapons" for solving big data
Turning to the dilemma of conventional relational databases in a big data environment, Wang Tao identified two problems. First, the data model is rigid and cannot cope with massive data volumes, so performance declines. Second, strong consistency means the relational database's logs and locks become a performance bottleneck. A document database, he argued, solves these problems well. Wang Tao went on to discuss the flexibility of the document data model, the agility and scalability that a schemaless design brings to development, the significant performance gains from eventual consistency, and the low cost of NoSQL, which can scale out horizontally on commodity PC servers.
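The schemaless flexibility described above can be sketched with plain Python dicts standing in for a real document database; the field names here are illustrative assumptions, not taken from the talk:

```python
# A minimal sketch of a schemaless document model, using plain Python
# dicts in place of a real document database. Field names are illustrative.
call_log = {"caller": "13800000001", "duration_sec": 95}
video_meta = {"caller": "13800000002", "codec": "h264", "bitrate_kbps": 1200}

# Unlike rows in a rigid relational table, heterogeneous documents can
# coexist in one collection with no schema migration.
collection = [call_log, video_meta]

def find(coll, field, value):
    """Queries simply skip documents that lack the requested field."""
    return [doc for doc in coll if doc.get(field) == value]

print(find(collection, "codec", "h264"))  # matches only the video document
```

Adding a new field to future documents requires no ALTER TABLE step, which is the agility Wang Tao attributes to the schemaless model.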
Wang Tao then introduced several features of document databases. First, online expansion: once a new node is added to the cluster and the data is repartitioned, the system automatically migrates data from other machines to the new one. Second, a heterogeneous data replication mechanism ensures stability and prevents data loss. Third, support for multiple indexes: compared with many key-value or wide-column databases, a document database can generally create multiple indexes on different fields of a collection.
Turning to the combination of Hadoop and NoSQL, Wang Tao described the position of the NoSQL database in Hadoop (see below): the NoSQL database sits underneath Hadoop, at the same level as HDFS, serving as a data source. The advantage is that data no longer has to be imported into HDFS from above each time before it can be used; instead, it can be accessed directly through the database's native interface.
▲ The position of the NoSQL database in Hadoop
▲ Importing data from Hadoop
Finally, Wang Tao shared a successful application case combining Hadoop and NoSQL:
Customer challenge: more than 100 GB of data must be archived daily, with concurrent, real-time access to over two years of historical data across multiple dimensions; the existing Oracle database could not meet the real-time query requirements.
Solution: use MapReduce and Hive as a complement for ETL, handling data cleaning and conversion; use Hive to load the final results into SequoiaDB; run on a small-scale x86 cluster platform to reduce TCO; and create multiple indexes in SequoiaDB on commonly queried fields to guarantee query performance.
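The clean-then-load-then-index flow above can be sketched in miniature; here plain Python stands in for MapReduce/Hive and SequoiaDB, and the record layout (phone number, call duration, status) is an illustrative assumption:

```python
# Hedged sketch of the ETL-then-load flow: cleaning, conversion to
# documents, and a secondary index on a commonly queried field.
raw_rows = [
    "13800000001,95,OK",
    "13800000002,,ERR",      # malformed record: dropped during cleaning
    "13800000003,210,OK",
]

def clean(rows):
    """Cleaning and conversion, standing in for the MapReduce/Hive ETL step."""
    docs = []
    for line in rows:
        msisdn, duration, status = line.split(",")
        if duration and status == "OK":
            docs.append({"msisdn": msisdn, "duration_sec": int(duration)})
    return docs

docs = clean(raw_rows)   # in the real pipeline, Hive loads these into SequoiaDB

# A secondary index on a commonly queried field, mimicking the indexes
# the document database would maintain server-side for query performance.
index = {}
for pos, doc in enumerate(docs):
    index.setdefault(doc["msisdn"], []).append(pos)

print(len(docs), "clean documents;", len(index), "indexed keys")
```

An index lookup then touches only the matching positions instead of scanning two years of history, which is the point of indexing the common query fields.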
Final results: more than two years of historical data can be retrieved online under multiple conditions; a high data compression ratio saves storage space; and the platform makes it easier to segment the customer base, discover high-value users, reduce customer churn, support the design and innovation of proprietary products and packages, and improve the customer experience for strategic control.