Technical Foundation: The Development of Big Data Analysis Technology

Source: Internet
Author: User
Keywords: big data analysis, scalability, hardware

Big data analysis technology originated in the Internet industry. Web page archives, user clicks, product information, user relationships, and similar data form continuously growing massive data sets. These big data contain a wealth of knowledge that can be used to enhance the user experience, improve service quality, and develop new applications, and how efficiently and accurately that knowledge can be discovered largely determines a major Internet company's standing in a competitive environment. Technology-driven Internet companies, led by Google, were the first to propose the MapReduce framework, which uses clusters of low-cost PC servers for large-scale concurrent batch processing.
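
As an illustration of the MapReduce programming model just described, here is a minimal, self-contained sketch in plain Python. It simulates the map, shuffle, and reduce phases of the classic word-count job in a single process; the function names are chosen for this example and do not belong to Hadoop or any other framework.

    # Minimal sketch of the MapReduce model: word count in one process.
    # On a real cluster, map and reduce tasks run in parallel across
    # many low-cost servers; here we only simulate the data flow.
    from collections import defaultdict

    def map_phase(document):
        # Map: emit an intermediate (key, value) pair for each word.
        for word in document.split():
            yield (word.lower(), 1)

    def shuffle(pairs):
        # Shuffle: group all intermediate values by key.
        groups = defaultdict(list)
        for key, value in pairs:
            groups[key].append(value)
        return groups

    def reduce_phase(key, values):
        # Reduce: aggregate the values collected for one key.
        return key, sum(values)

    documents = ["the quick brown fox", "the lazy dog", "the fox"]
    intermediate = [p for doc in documents for p in map_phase(doc)]
    counts = dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())
    print(counts)  # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, ...}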

Using a file system to store unstructured data, together with a sophisticated backup and disaster-recovery strategy, this affordable big data solution lost nothing in performance and gained scalability compared with the previously prevalent, expensive solution of enterprise minicomputer clusters plus commercial databases. Before designing a data center solution, we must take into account how the solution will scale after it is deployed. The usual approach is to estimate the traffic and data volume for some future period and provision redundant computing units (CPUs) and storage up front.
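
A back-of-the-envelope sketch makes the cost of this "provision up front" approach concrete. All figures below are invented for illustration:

    # Hypothetical up-front capacity planning (all numbers are assumed).
    current_tb = 200.0        # data volume today, in TB
    monthly_growth = 0.08     # 8% compound growth per month
    horizon_months = 24       # planning horizon
    redundancy_factor = 1.5   # 50% headroom against estimation error

    projected_tb = current_tb * (1 + monthly_growth) ** horizon_months
    provisioned_tb = projected_tb * redundancy_factor
    print(f"projected: {projected_tb:.0f} TB, buy now: {provisioned_tb:.0f} TB")
    # All provisioned capacity is paid for on day one, even though most
    # of it sits idle until the data actually grows into it.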

This approach leads directly to a huge up-front investment, and even that does not guarantee system performance once computing and storage requirements exceed the original design. And once expansion is needed, problems ensue. First, commercial parallel databases usually require every node to be physically homogeneous, that is, to have roughly the same computing and storage capability. As hardware generations advance, the new hardware we add is usually more powerful than the existing hardware, so the old hardware becomes the bottleneck of the system. To maintain system performance, we have to replace the old hardware piece by piece, at enormous economic cost. Second, even the most powerful commercial parallel databases today can manage only on the order of dozens to hundreds of data nodes, mainly because of architectural design constraints, so their scalability is inherently limited.
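
The bottleneck effect is easy to see with a rough model: when data is partitioned evenly across nodes that the system assumes are identical, a parallel job finishes only when the slowest node does. The node speeds below are invented for illustration:

    # Why mixed hardware hurts a parallel database that assumes
    # homogeneous nodes: with even partitioning, the job finishes
    # only when the slowest node finishes.
    node_speeds_gb_per_s = [1.0, 1.0, 1.0, 4.0]  # three old nodes, one new
    data_gb = 400.0
    per_node_gb = data_gb / len(node_speeds_gb_per_s)  # even split

    finish_times = [per_node_gb / s for s in node_speeds_gb_per_s]
    print(f"job time: {max(finish_times):.0f} s")  # 100 s, set by old nodes
    # The 4x-faster node does not speed up the job at all; its extra
    # capacity is wasted until every node is upgraded to match it.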

The MapReduce+GFS framework is not plagued by these problems. When expansion is needed, you simply add a rack with the appropriate computing units and storage; the cluster system automatically allocates and schedules the new resources without affecting the operation of the existing system. Today we mostly use Hadoop, the open-source implementation of Google's MapReduce. Alongside the evolution of the computational model, people have also been rethinking the data storage model. The traditional relational database long dominated the market thanks to its standardized design, friendly query language, and efficient online transaction processing.

However, its strict schema requirements, and the performance and scalability it gives up to guarantee strong consistency, were gradually exposed as problems in big data analysis. Subsequently, the NoSQL data storage model became popular. NoSQL, also read as "Not Only SQL," is not one specific data storage model but a collective term for non-relational databases. Its characteristics are: no fixed table schema, and the ability to be distributed and scaled out horizontally. NoSQL is not a simple rejection of the relational database, but a supplement to it and an extension beyond its shortcomings. Typical NoSQL data storage models include document stores, key-value stores, graph stores, object databases, column stores, and so on. Among the more popular of these, we have to mention Google's BigTable.
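
To contrast this with the fixed-schema relational model, here is a tiny sketch of schemaless key-value storage in Python. It is an in-memory stand-in, not any particular NoSQL product:

    # Sketch of the schemaless idea behind key-value / document stores:
    # records in one logical table need not share a fixed set of columns.
    store = {}  # in-memory stand-in for a distributed key-value store

    # Two user records with different fields: no schema migration needed.
    store["user:1001"] = {"name": "Alice", "email": "alice@example.com"}
    store["user:1002"] = {"name": "Bob", "age": 30, "friends": ["user:1001"]}

    # Horizontal scaling typically works by mapping keys onto nodes.
    def node_for(key, num_nodes=4):
        return hash(key) % num_nodes  # simplistic placement, for illustration

    print(store["user:1002"]["friends"], "on node", node_for("user:1002"))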

BigTable is a distributed storage system for managing large amounts of structured data. Its data is often spread across thousands of nodes, with total volumes at the petabyte level (10^15 bytes, or about 10^6 GB). HBase is its open-source implementation. In the open-source community, a number of outstanding projects have grown up around the Google MapReduce framework; they support and depend on one another in technology and implementation, and have gradually formed a distinctive ecosystem. The architecture diagram drawn by Cloudera is commonly used to depict this Hadoop ecosystem. This system gives us a solid technical foundation for high-quality, low-cost data analysis.
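
The BigTable paper describes its data model as a sparse, distributed, sorted map from (row key, column, timestamp) to a cell value. The toy below mimics that shape in Python; it is an illustration only, not the HBase client API:

    # Toy model of BigTable's data model: a sparse map from
    # (row key, "family:qualifier" column, timestamp) -> cell value.
    import time

    table = {}  # {(row, column, timestamp): value}

    def put(row, column, value):
        # Every write creates a new timestamped version of the cell.
        table[(row, column, time.time_ns())] = value

    def get_latest(row, column):
        # Cells are versioned; return the most recently written value.
        cells = [(ts, v) for (r, c, ts), v in table.items()
                 if r == row and c == column]
        return max(cells)[1] if cells else None

    # Row keys are reversed domain names in the paper's web-table example.
    put("com.example.www", "contents:html", "<html>v1</html>")
    put("com.example.www", "contents:html", "<html>v2</html>")
    print(get_latest("com.example.www", "contents:html"))  # newest version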
