Application of Document Databases in Hadoop Clusters

Source: Internet
Author: User
Keywords: Wang Tao, big data

On November 22-23, 2013, the 2013 Hadoop China Technology Summit (Chinese Hadoop Summit 2013), the only large-scale industry event dedicated to sharing Hadoop technology and applications, was held at the Four Points by Sheraton Beijing. Nearly a thousand CIOs, CTOs, architects, IT managers, consultants, engineers, and Hadoop enthusiasts, along with IT vendors and technologists engaged in Hadoop research and promotion, attended from a range of industries at home and abroad.

In the SQL & NoSQL track, Wang Tao, CTO of SequoiaDB, delivered a speech on the application of document databases in Hadoop clusters, describing in detail the role of NoSQL in the big data era across four aspects: a review of big data, the features of document databases, the position of the NoSQL database in Hadoop, and a user case.

Wang Tao began by noting that when it comes to big data, the first thing that comes to mind is the 3Vs (Volume, Velocity, Variety). Volume refers to massive data scale: according to statistics, more than 50% of organizations own and process more than 10 TB of data, and more than 10% have surpassed 1 PB; this is the first challenge of big data. Velocity refers to timeliness: 30% of organizations need to process more than 100 GB of data per day, and extracting the data we want from such volumes is the second challenge. Variety refers to diversity: the data to be processed is increasingly varied, such as graphics, video, and call logs, all of which may need to be processed and analyzed; handling these diverse data types is the third challenge of big data.

To really solve big data problems, a Hadoop + NoSQL combination can be used. As the figure below shows, Hadoop addresses the problems of massive and diverse data, while NoSQL addresses massive and time-sensitive data. As Wang Tao put it, Hadoop and NoSQL complement each other rather than replace each other.

▲ Hadoop and NoSQL: the nuclear weapons for solving big data

Turning to the dilemma of conventional relational databases in a big data environment, Wang Tao explained that, first, the data model is rigid and cannot cope with massive volumes of data, so performance degrades; second, strong consistency means the relational database's logs and locks become a performance bottleneck. A document database can solve these problems well. Wang Tao went on to describe the flexibility of the document data model, the agility and scalability that a schema-less design brings to development, the significant performance gains from eventual consistency, and the low cost of NoSQL, which can scale out horizontally on commodity PC servers.
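To make the schema-less point concrete, here is a minimal sketch of the flexible document model, using MongoDB's pymongo driver as a generic stand-in (the talk concerned SequoiaDB, whose own API is not shown here); the collection and field names are invented for illustration:

    from pymongo import MongoClient

    # Connect to a local document database (assumed to be running).
    coll = MongoClient("mongodb://localhost:27017")["demo"]["call_logs"]

    # Schema-less: documents in the same collection may carry different
    # fields, so a new attribute needs no ALTER TABLE-style migration.
    coll.insert_one({"user_id": 1, "duration_sec": 75, "callee": "1380000xxxx"})
    coll.insert_one({"user_id": 2, "duration_sec": 12, "media": "video",
                     "resolution": "720p"})  # extra fields, no schema change

    # Both documents remain queryable through the same interface.
    for doc in coll.find({"duration_sec": {"$gt": 10}}):
        print(doc)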

Wang Tao then introduced several features of document databases. First, online expansion: as long as a new node is added to the cluster and the data is re-partitioned, the system can automatically move data from other machines to the new one. Second, a heterogeneous data replication mechanism ensures stability and prevents data loss. Third, support for multiple indexes: compared with many key-value or wide-column databases, a document database can generally create several indexes on different fields of a single collection, as the sketch below shows.
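A minimal sketch of the multi-index feature, again with pymongo as a stand-in for a generic document database; the field names are hypothetical:

    from pymongo import MongoClient, ASCENDING, DESCENDING

    coll = MongoClient("mongodb://localhost:27017")["demo"]["call_logs"]

    # Unlike most key-value stores, which index only the primary key, a
    # document database can maintain several secondary indexes per collection.
    coll.create_index([("user_id", ASCENDING)])
    coll.create_index([("call_time", DESCENDING)])
    coll.create_index([("user_id", ASCENDING), ("call_time", DESCENDING)])

    # Each of these queries can now be served by an index, not a full scan.
    coll.find({"user_id": 42})
    coll.find({"user_id": 42, "call_time": {"$gte": "2013-01-01"}})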

Turning to the combination of Hadoop and NoSQL, Wang Tao described the position of the NoSQL database within Hadoop (see below): the NoSQL database sits underneath Hadoop, at the same level as HDFS, as a data source. The advantage is that we no longer need to import the data into HDFS before every access; instead, the data can be accessed directly through the database's native interface.

▲ The position of the NoSQL database in Hadoop

▲ Importing data from Hadoop
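As a rough sketch of what "direct access through the native interface" means in practice, the following Python snippet queries the database in place, where previously the records would first have to be exported and copied into HDFS before any job could read them; connection details and names are hypothetical:

    from pymongo import MongoClient

    # With the NoSQL database sitting beside HDFS as a data source, an
    # analysis job (or an ad-hoc script like this one) reads records through
    # the native interface, skipping the export-to-HDFS step.
    coll = MongoClient("mongodb://localhost:27017")["demo"]["call_logs"]

    total = 0
    for doc in coll.find({"call_time": {"$gte": "2013-11-01"}}):
        total += doc.get("duration_sec", 0)
    print("total call seconds in November:", total)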

Finally, Wang Tao shared a successful joint application of Hadoop and NoSQL:

Customer challenge: more than 100 GB of data must be archived every day, and over two years of historical data must be accessible concurrently, in real time, and across multiple dimensions; the existing Oracle database could not meet the real-time query requirements.

Solution: use MapReduce and Hive for ETL, handling data cleansing and conversion; use Hive to load the final results into SequoiaDB; run on a small-scale x86 cluster to reduce TCO; and create multiple indexes in SequoiaDB on the commonly queried fields to guarantee query performance, as the sketch below illustrates.
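A hedged sketch of the load-and-index step, with pymongo standing in for the SequoiaDB driver and entirely hypothetical file, collection, and field names; in the actual case, the load was performed by Hive rather than a script like this:

    import csv
    from pymongo import MongoClient, ASCENDING

    # Assume the ETL stage (MapReduce + Hive) has already cleaned and
    # converted the raw records into a CSV export; bulk-load it here.
    coll = MongoClient("mongodb://localhost:27017")["archive"]["cdr_2013"]

    with open("cleaned_records.csv", newline="") as f:
        batch = [row for row in csv.DictReader(f)]
        if batch:
            coll.insert_many(batch)

    # Indexes on the commonly queried fields keep retrieval fast.
    coll.create_index([("user_id", ASCENDING)])
    coll.create_index([("call_time", ASCENDING)])
    coll.create_index([("region", ASCENDING)])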

Final result: two years of historical data can be retrieved online by multiple conditions, and a high data compression ratio saves storage space. This makes it easier to segment the customer base, discover high-value users, reduce customer churn, support the design and innovation of proprietary products and packages, and improve the customer experience.
